Voice activity detectors discriminate between features of speech and features of noise. The spectral flux approach produces a decision rule that aims to minimize the number of decision errors, based on the observation that speech is quasi-stationary over short temporal spans. This property can be leveraged to improve the decision making for speech detection.

Suppose the received signal at the microphone is given as:

y(t)= s(t) + \nu(t)

where s(t) is the desired speech signal and \nu(t) is i.i.d. zero-mean Gaussian noise. Taking a short-time transform, with t now indexing the analysis frame, the frequency-domain representation becomes:

y(t, \omega) = s(t, \omega) + \nu(t, \omega)

The spectral flux $latex \psi(t,\omega)$ is given as:

\psi(t,\omega) = \| y(t, \omega) - y(t-1, \omega) \|_1

where \|\cdot\|_1 is the L_1 norm. Because speech signals are quasi-stationary, the spectral flux will on average exhibit nulls during noise intervals while maintaining a flat spectrum during speech intervals. A specific frequency bin can be targeted, or an aggregate over all frequency bins can be used. A sample performance of this approach is depicted in Figure 1 below:

Figure 1: Speech activity detection using spectral flux
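
As an illustration of the flux definition above, here is a minimal Python sketch using NumPy. It rests on several assumptions that are not in the text: the function name `spectral_flux` is hypothetical, the flux is computed on magnitude spectra (a common choice, rather than on the complex spectrum), and the Hann window, `frame_len` and `hop` values are arbitrary. It returns the per-bin flux, so a single bin can be selected or all bins can be summed into an aggregate.

```python
import numpy as np

def spectral_flux(signal, frame_len=256, hop=128):
    """Per-bin L1 spectral flux between consecutive frames.

    Returns an array of shape (n_frames - 1, frame_len // 2 + 1).
    Assumes len(signal) >= frame_len.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)

    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Magnitude spectrum |y(t, w)| of every frame (assumption: magnitudes only).
    mag = np.abs(np.fft.rfft(frames, axis=1))

    # |y(t, w) - y(t-1, w)| for each frequency bin w.
    return np.abs(np.diff(mag, axis=0))
```

For example, `spectral_flux(x)[:, k]` tracks a single bin `k`, while `spectral_flux(x).sum(axis=1)` gives one aggregate flux value per frame transition.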


A simple threshold on the spectral flux could be deployed for detection. The speech detection rate can be improved further if it is used together with other well-known metrics.
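
One way such a threshold might look is sketched below in Python. It takes a sequence of per-frame aggregate flux values and, following the observation that the flux dips during noise-only intervals, marks a frame as speech when its flux exceeds the threshold. The function name `detect_speech`, the assumption that the first `noise_frames` frames contain only noise, and the mean-plus-`k`-standard-deviations rule are all illustrative choices rather than part of the method described here.

```python
import numpy as np

def detect_speech(flux, noise_frames=10, k=3.0):
    """Flag frames whose flux exceeds a noise-derived threshold.

    Returns (mask, threshold), where mask[i] is True for frames
    classified as speech.
    """
    flux = np.asarray(flux, dtype=float)

    # Assumption: the leading frames are noise only and set the baseline.
    noise = flux[:noise_frames]
    threshold = noise.mean() + k * noise.std()

    return flux > threshold, threshold
```

In practice the threshold would need tuning for the noise conditions at hand, and a short hangover period is often added so that brief dips inside an utterance are not cut off.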

VOCAL Technologies offers custom designed solutions for beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!