Voice activity detection (VAD) is used in telecommunications applications to determine whether voice is present in a signal. Standard methods include:
- energy thresholds (non-adaptive and adaptive)
- waveform and spectrum analysis
- pitch and harmonic detection
- periodicity measures
- zero-crossing rates
- high order statistics of the LPC residuals
- statistical models
The simplest VAD schemes are based on an energy detector. If the energy of the signal rises a threshold amount above the noise floor, the increase in energy is assumed to be associated with voice. Since the noise floor in most applications is not known a priori and is time-varying, it has to be estimated throughout the call. An adaptive method for calculating the noise floor is the dual time constant integrator.
With this integrator, if the energy of the signal at instant N is less than the noise floor, the noise floor is lowered quickly, but if the energy is greater than the noise floor, the noise floor rises slowly. This method is useful in situations in which the noise floor varies more slowly than the speech envelope. To further improve the performance of a VAD based on energy thresholds, state machines with varying thresholds can be used. Speech has a strong time correlation, so the states of the previous frames can help determine the probability of speech in the current frame.
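The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the smoothing coefficients `fall` and `rise`, the 6 dB threshold, and the use of a simple hangover counter as the state machine are all hypothetical choices made for the example.

```python
def update_noise_floor(floor, energy, fall=0.5, rise=0.01):
    """One step of a dual time constant integrator.

    fall and rise are assumed smoothing coefficients in (0, 1]:
    the floor follows a drop in energy quickly and a rise slowly.
    """
    alpha = fall if energy < floor else rise
    return floor + alpha * (energy - floor)


def energy_vad(energies, threshold_db=6.0, hangover=3, fall=0.5, rise=0.01):
    """Per-frame speech decisions from a sequence of frame energies.

    A frame is active when its energy exceeds the tracked noise floor by
    threshold_db. The hangover counter (an assumed, very simple state
    machine) keeps the decision active for a few frames after the energy
    drops, exploiting the time correlation of speech.
    """
    ratio = 10.0 ** (threshold_db / 10.0)  # dB margin as a linear power ratio
    floor = energies[0]                    # assumes the first frame is noise only
    hold = 0
    decisions = []
    for energy in energies:
        if energy > ratio * floor:
            hold = hangover
            active = True
        elif hold > 0:
            hold -= 1
            active = True
        else:
            active = False
        decisions.append(active)
        # Update the floor after the decision so the current frame is
        # compared against the estimate from previous frames.
        floor = update_noise_floor(floor, energy, fall, rise)
    return decisions


if __name__ == "__main__":
    frame_energies = [1.0] * 5 + [20.0] * 4 + [1.0] * 6
    print(energy_vad(frame_energies))
```

In this sketch the burst of four high-energy frames is flagged as speech, and the hangover extends the decision over the three frames that follow it. In practice the coefficients and threshold would be tuned to the frame rate and the expected noise conditions.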
In waveform and spectral analysis, voice activity detection makes use of the known characteristics of speech. These methods are more computationally intensive than energy-based solutions, but they are better able to detect speech in non-stationary noise and low-SNR scenarios. For example, voiced speech contains a strong fundamental frequency with its harmonics. Thus, analysis of the cepstrum of a signal can reveal the source of the signal energy.
That is, if the spectral energy has a periodic nature to it, the cepstrum will have a peak related to that periodicity and to voice. If the cepstrum is flat, then the signal energy could be from something like a door slamming or clapping. The kurtosis of the LPC residuals also reveals similar characteristics of speech. Clean speech residuals have a large kurtosis. Thus, if the residuals of the LPC have a low kurtosis (a flatter, more uniform PDF), then the signal is less likely to represent voiced speech.
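Both measures can be illustrated with a short, standard-library-only Python sketch. The assumptions are considerable: the helper names are hypothetical, a naive O(N²) DFT stands in for an FFT, an idealized pulse train stands in for voiced excitation, a single impulse stands in for a click, and the kurtosis helper takes a residual sequence as given (the LPC analysis that would produce it is not shown).

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform, adequate for one short frame."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t, v in enumerate(values):
            # Reduce k*t modulo n to keep the phase argument small.
            acc += v * cmath.exp(sign * 2j * math.pi * ((k * t) % n) / n)
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, eps=1e-10):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.

    eps is an assumed floor that avoids log(0) in empty spectral bins.
    """
    log_mag = [math.log(max(abs(x), eps)) for x in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]


def cepstral_peak(frame, min_lag, max_lag):
    """Return (lag, value) of the largest cepstral coefficient in the lag range."""
    ceps = real_cepstrum(frame)
    lag = max(range(min_lag, max_lag + 1), key=lambda q: ceps[q])
    return lag, ceps[lag]


def kurtosis(samples):
    """Non-excess kurtosis: fourth central moment over squared variance."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 * m2)


if __name__ == "__main__":
    # Idealized voiced excitation: one pulse every 32 samples.
    pulse_train = [1.0 if i % 32 == 0 else 0.0 for i in range(256)]
    # Idealized click: a single impulse, whose magnitude spectrum is flat.
    click = [1.0 if i == 5 else 0.0 for i in range(256)]
    print(cepstral_peak(pulse_train, 20, 60))
    print(cepstral_peak(click, 20, 60))
```

For the pulse train the peak falls at a lag of 32 samples, the pitch period, while the cepstrum of the single click is essentially zero across the search range. The lag range passed to `cepstral_peak` would normally be chosen from the sampling rate and the expected pitch range, and the size of the peak depends on the assumed `eps` floor, so any decision threshold on it would need tuning.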
VAD is used throughout signal processing: in echo cancellers, for control and various estimation routines; in noise reduction, for estimating the noise spectrum and the probability of speech presence; in vocoders, for determining when silence suppression packets can be sent; and in speech recognition, for removing periods of noise that would lower recognition rates.