Using Statistical Models for Speech Enhancement

Using statistical models for speech enhancement is fraught with practical difficulties. These models are often derived under very specific assumptions, such as uncorrelated noise, Gaussian priors on clean speech, or noise and speech that are independent and identically distributed. One such example is the well-known Minimum Mean Square Error Log-Spectral Amplitude (MMSE-LSA) estimator of Ephraim and Malah. These and related issues have been addressed since the publication of their landmark paper, but not necessarily all at once, because relaxing several assumptions together tends to make the mathematics intractable. It therefore remains important to carefully examine every assumption made in the derivation of a so-called optimal estimator.

For instance, it has been shown that Laplace or Gamma random variables may be better suited to modeling the Fourier transform coefficients of noisy speech. Building on this observation, estimators have been derived that minimize various error criteria under these priors. While the resulting expressions are, on the surface, more complicated than those derived from a Gaussian prior, the estimates can often be written as a gain that is itself a function of the a posteriori and a priori SNR. With the a priori SNR estimated by the Decision-Directed or Predicted Estimation method, such gains can be precomputed and stored in a lookup table for easy use. The disadvantage of this approach is the table resolution it requires, since the SNR values of a given signal will rarely fall exactly on the tabulated grid points.
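The lookup-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the Gaussian-prior log-spectral amplitude gain stands in for whichever gain function is being tabulated (a Laplace- or Gamma-prior gain would simply be passed in as `gain_fn`), and the grid limits (-20 to 30 dB in 1 dB steps), the smoothing constant `alpha = 0.98`, the SNR floor, and all function names are hypothetical choices made for the example.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp1(x):
    """Exponential integral E1(x) for x > 0, standard library only."""
    if x <= 0.0:
        raise ValueError("exp1 is only defined here for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 60):
            term *= -x / k            # term is now (-x)^k / k!
            total -= term / k
            if abs(term) < 1e-17:
                break
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction evaluated with the modified Lentz method (x > 1).
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x)

def lsa_gain(xi, gamma):
    """Log-spectral amplitude gain for a priori SNR xi and a posteriori SNR gamma (linear)."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))

# Hypothetical table grid: both SNRs sampled from -20 dB to 30 dB in 1 dB steps.
DB_MIN, DB_MAX, DB_STEP = -20.0, 30.0, 1.0
N_POINTS = int(round((DB_MAX - DB_MIN) / DB_STEP)) + 1

def build_gain_table(gain_fn=lsa_gain):
    """Precompute gain_fn on the dB grid; rows index xi, columns index gamma."""
    grid = [10.0 ** ((DB_MIN + i * DB_STEP) / 10.0) for i in range(N_POINTS)]
    return [[gain_fn(xi, g) for g in grid] for xi in grid]

def _grid_index(snr_linear):
    """Nearest grid index for a linear SNR value, clipped to the table range."""
    db = 10.0 * math.log10(max(snr_linear, 1e-12))
    i = int(round((db - DB_MIN) / DB_STEP))
    return min(max(i, 0), N_POINTS - 1)

def lookup_gain(table, xi, gamma):
    """Nearest-neighbour table lookup; off-grid SNRs incur a quantization error."""
    return table[_grid_index(xi)][_grid_index(gamma)]

def decision_directed_xi(prev_gain, prev_gamma, gamma, alpha=0.98, xi_min=10.0 ** -2.5):
    """Decision-directed a priori SNR estimate, assuming the noise PSD is unchanged
    between frames so that |A_prev|^2 / noise_psd = prev_gain^2 * prev_gamma."""
    xi = alpha * prev_gain ** 2 * prev_gamma + (1.0 - alpha) * max(gamma - 1.0, 0.0)
    return max(xi, xi_min)

def enhance_bin(noisy_power, noise_psd, table):
    """Return the per-frame gains for one frequency bin."""
    gains = []
    prev_gain, prev_gamma = 1.0, 1.0   # first frame: xi = alpha + (1 - alpha) * max(gamma - 1, 0)
    for power in noisy_power:
        gamma = power / noise_psd
        xi = decision_directed_xi(prev_gain, prev_gamma, gamma)
        gain = lookup_gain(table, xi, gamma)
        gains.append(gain)
        prev_gain, prev_gamma = gain, gamma
    return gains

table = build_gain_table()
gains = enhance_bin([1.0, 1.0, 50.0, 50.0], noise_psd=1.0, table=table)
```

Comparing `lookup_gain(table, xi, gamma)` with `lsa_gain(xi, gamma)` at off-grid SNR values gives a direct view of the resolution trade-off: a finer `DB_STEP` reduces the mismatch at the cost of a larger table, and interpolating between neighbouring entries is another possible option.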

Nevertheless, being critical of the assumptions made can lead to improved performance. Other decisions, such as Voice Activity Detection, can also be improved by adopting alternative statistical assumptions, and the proper functioning of such modules is key to applying the enhancement algorithms in practice.
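To make the dependence on the statistical model concrete, here is a minimal sketch of a likelihood-ratio voice activity decision under the usual complex Gaussian assumption, where the per-bin log likelihood ratio is gamma * xi / (1 + xi) - ln(1 + xi). It is shown only as the baseline whose per-bin term would change under a different distributional assumption. The function names and the threshold value are hypothetical, and the a priori SNR values `xi` are assumed to come from a separate estimator.

```python
import math

def frame_log_likelihood_ratio(xi, gamma):
    """Mean per-bin log likelihood ratio (speech present vs. absent), Gaussian model.

    xi and gamma are equal-length sequences of a priori and a posteriori SNRs (linear).
    """
    terms = [g * x / (1.0 + x) - math.log(1.0 + x) for x, g in zip(xi, gamma)]
    return sum(terms) / len(terms)

def is_speech(xi, gamma, threshold=0.5):
    """Declare speech when the averaged log likelihood ratio exceeds a
    hypothetical threshold; in practice the threshold would be tuned."""
    return frame_log_likelihood_ratio(xi, gamma) > threshold

noise_only = is_speech([0.01] * 4, [1.0] * 4)
speech_present = is_speech([10.0] * 4, [11.0] * 4)
```

Swapping in a different model would amount to replacing the expression inside `terms` with the log likelihood ratio derived under that model, while the averaging and thresholding steps stay the same.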

Complete Communications Engineering
