The application of speech enhancement and noise reduction algorithms in adverse acoustic environments presents many challenges. For example, when the SNR is 0 dB or lower, significant portions of the speech signal energy fall below the noise floor. Most speech enhancement routines will then fail to detect that voice is present and will suppress these regions as if they contained only noise. When this occurs, the “enhanced” speech sounds muffled and unintelligible. There are two approaches to combating this problem.
The first approach would require the speech enhancement algorithm to use more advanced methods for voice activity detection; the second would be to reconstruct the missing speech components from the available ones. The first approach requires analysis of long frames of data. Voiced speech has a well-defined, quasi-stationary structure, while most noise sources exhibit more random behavior. Analyzing the noisy input signal over a long window therefore allows the voice activity detection routine to determine whether the signal contains speech, and this decision can then be used to adjust the gains of the noise reduction filter. The drawback of this approach is that a latency of over 100 ms is required for reliable detection. Latency of this magnitude is destructive to the quality of a full-duplex conversation, but it may be tolerable in a half-duplex, push-to-talk communication system.
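As an illustration, the following is a minimal sketch of such a long-window detector, assuming the quasi-periodicity of voiced speech is measured with a normalized autocorrelation over a frame of roughly 100 ms or more. The pitch range, the 0.3 periodicity threshold, and the gain floor are illustrative assumptions, not values from the text.

```python
import numpy as np

def long_window_vad(frame, fs, f0_min=70.0, f0_max=400.0, threshold=0.3):
    """Judge whether a long (~100 ms or more) frame contains voiced speech.

    Voiced speech is quasi-periodic, so its normalized autocorrelation
    shows a strong peak at the pitch lag; most noise sources do not.
    The pitch range and the 0.3 threshold are illustrative assumptions.
    """
    frame = frame - np.mean(frame)
    if not np.any(frame):
        return False
    # Autocorrelation for non-negative lags, normalized so lag 0 equals 1.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]
    # Search for a pitch peak in the lag range corresponding to f0_max..f0_min.
    lag_lo = int(fs / f0_max)
    lag_hi = min(int(fs / f0_min), len(ac) - 1)
    periodicity = np.max(ac[lag_lo:lag_hi])
    return periodicity > threshold

def noise_filter_gain(is_speech, floor_gain=0.1):
    """Map the VAD decision to a gain for the noise reduction filter:
    pass frames judged to contain speech, attenuate the rest to a floor."""
    return 1.0 if is_speech else floor_gain
```

Note that the latency discussed above is inherent to this structure: the detector cannot decide until it has buffered the full long window, so the analysis frame itself imposes the 100 ms-class delay.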
An alternative approach to improving speech quality in adverse environments is to reconstruct the harmonic components of speech from the components that survive. Although the overall SNR may be below 0 dB, in some frequency bins the speech will carry enough energy to rise above the noise and become visible in the noisy input signal. These bins represent windows of opportunity for reconstructing the harmonic structure of the original voiced speech. A brief sketch of this idea follows, and the figure below then helps illustrate the concept:
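The following is a minimal sketch of this reconstruction under simple assumptions: the per-bin noise power noise_psd is taken from a separate noise estimator, a 3 dB margin marks a bin as visible, the pitch is crudely estimated from the spacing of the visible bins, and each missing harmonic borrows the magnitude of the nearest visible one. None of these choices come from the text; they only make the idea concrete.

```python
import numpy as np

def reconstruct_harmonics(noisy_frame, noise_psd, fs, margin_db=3.0):
    """Rebuild missing harmonics from the bins where speech is visible.

    noise_psd holds one noise-power estimate per rfft bin (assumed to
    come from a separate noise tracker). The 3 dB margin, the pitch
    estimate from bin spacing, and the magnitude-borrowing rule are
    all illustrative simplifications.
    """
    n_fft = len(noisy_frame)
    spec = np.fft.rfft(noisy_frame * np.hanning(n_fft))
    power = np.abs(spec) ** 2

    # Step 1: bins where speech pokes above the noise floor -- the
    # "windows of opportunity" described above.
    visible = power > noise_psd * 10.0 ** (margin_db / 10.0)
    peaks = np.flatnonzero(visible)
    if len(peaks) < 2:
        return np.fft.irfft(spec, n_fft)  # nothing to anchor a pitch on

    # Step 2: crude pitch estimate from the median spacing of visible bins.
    bin_hz = fs / n_fft
    f0 = np.median(np.diff(peaks)) * bin_hz

    # Step 3: fill in every harmonic of f0 that is not visible, borrowing
    # the magnitude of the nearest visible bin and keeping the noisy phase.
    k = 1
    while k * f0 < fs / 2.0:
        b = int(round(k * f0 / bin_hz))
        if b < len(spec) and not visible[b]:
            nearest = peaks[np.argmin(np.abs(peaks - b))]
            spec[b] = np.abs(spec[nearest]) * np.exp(1j * np.angle(spec[b]))
        k += 1
    return np.fft.irfft(spec, n_fft)
```

In a real system the pitch would be tracked across frames and the borrowed magnitudes shaped by a spectral envelope; the point of the sketch is only that the visible bins carry enough information to anchor the reconstruction of the full harmonic structure.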