Noise reduction using minimum mean square error (MMSE) estimators treats the enhancement of noisy speech as an estimation problem: the clean signal is estimated from a given sample function of the noisy signal. The goal is to minimize the expected value of some distortion measure between the clean and estimated signals. For this approach to succeed, the distortion measure must be perceptually meaningful and the statistical model for the signal and noise must be reliable. At present, neither the best statistical model nor the most perceptually meaningful distortion measure is known.
A variety of speech enhancement approaches have been proposed. They differ in the statistical model, the distortion measure, and the manner in which the signal estimators are implemented. Perhaps the simplest scenario arises when the signal and noise are assumed to be statistically independent Gaussian processes and the mean squared error (MSE) distortion measure is used. In this case MMSE estimation reduces to linear estimation, and the optimal estimator of the clean signal is the Wiener filter. Since speech signals are not strictly stationary, a sequence of Wiener filters is designed and applied to successive vectors of the noisy signal.
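The Wiener case above can be illustrated per frequency bin. The following is a minimal sketch, assuming the signal and noise power in each bin are already known for the current frame; the function names and the example values are hypothetical and chosen only for illustration.

```python
def wiener_gain(signal_power, noise_power):
    """Per-bin Wiener gain Ps / (Ps + Pn) for independent signal and noise."""
    gains = []
    for ps, pn in zip(signal_power, noise_power):
        total = ps + pn
        # A bin with no power at all carries nothing to keep.
        gains.append(ps / total if total > 0.0 else 0.0)
    return gains


def apply_gain(noisy_spectrum, gains):
    """Scale each noisy spectral coefficient by its gain."""
    return [g * y for g, y in zip(gains, noisy_spectrum)]


# Hypothetical powers for one three-bin frame (assumed known here).
signal_power = [4.0, 1.0, 0.0]
noise_power = [1.0, 1.0, 1.0]
gains = wiener_gain(signal_power, noise_power)
enhanced = apply_gain([2.0 + 1.0j, 1.0 - 1.0j, 0.5j], gains)
```

Because speech is not stationary, a sketch like this would be re-evaluated for every frame with updated power estimates, which corresponds to the sequence of Wiener filters described above.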
An optimal MMSE estimator of the short-time spectral amplitude (STSA) has been proposed. Its structure is the same as that of spectral subtraction but, in contrast to the Wiener filtering motivation of spectral subtraction, it optimizes the estimate of the real rather than the complex spectral amplitudes. Central to the procedure is the estimate of the SNR in each frequency bin, for which two algorithms were proposed: a maximum likelihood approach and a “decision-directed” approach, which its authors found performed better.
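To make the amplitude estimator concrete, the sketch below evaluates the commonly cited closed-form MMSE STSA gain under a Gaussian spectral model, G = (sqrt(pi)/2) (sqrt(v)/gamma) exp(-v/2) [(1 + v) I0(v/2) + v I1(v/2)] with v = xi gamma / (1 + xi). The symbols xi (a priori SNR) and gamma (a posteriori SNR), the series-based Bessel helper, and the cut-over threshold of 200 are assumptions made for this illustration and do not come from the text.

```python
import math


def _bessel_i(order, x, max_terms=500):
    """Modified Bessel function I_order(x), order 0 or 1, by power series."""
    half = x / 2.0
    term = half ** order / math.factorial(order)  # k = 0 term
    total = term
    for k in range(1, max_terms):
        term *= half * half / (k * (k + order))
        total += term
        if term <= 1e-17 * total:
            break
    return total


def mmse_stsa_gain(xi, gamma):
    """Gain applied to the noisy spectral amplitude (gamma must be > 0)."""
    v = xi * gamma / (1.0 + xi)
    if v > 200.0:
        # Assumption of this sketch: at high SNR the gain approaches the
        # Wiener gain, so use that limit to avoid floating-point overflow.
        return xi / (1.0 + xi)
    bracket = (1.0 + v) * _bessel_i(0, v / 2.0) + v * _bessel_i(1, v / 2.0)
    scale = (math.sqrt(math.pi) / 2.0) * (math.sqrt(v) / gamma)
    return scale * math.exp(-v / 2.0) * bracket


high_snr_gain = mmse_stsa_gain(100.0, 101.0)  # close to the Wiener gain
low_snr_gain = mmse_stsa_gain(0.1, 1.0)       # attenuates less than Wiener
```

At high SNR the gain is close to the Wiener gain xi / (1 + xi), while at low SNR it attenuates less than the Wiener gain would. A production implementation would more likely use exponentially scaled Bessel routines from a numerical library than the plain series used here.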
The maximum likelihood (ML) approach estimates the a priori SNR by subtracting unity from the low-pass filtered ratio of noisy-signal power to noise power (the a posteriori or instantaneous SNR) and half-wave rectifying the result so that it is non-negative. The decision-directed approach forms the SNR estimate as a weighted average of this ML estimate and an estimate of the previous frame’s SNR determined from the enhanced speech. Both algorithms assume that the mean noise power spectrum is known in advance.
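The two SNR estimators described above can be sketched for a single frequency bin as follows. This is a minimal illustration, not a reference implementation: the weighting factor alpha = 0.98, the smoothing constant beta = 0.7, the first-order recursive low-pass filter, the use of a Wiener-style gain to produce the enhanced amplitude, and the example powers are all assumptions introduced here.

```python
import math


def ml_snr(smoothed_gamma):
    """ML estimate: smoothed a posteriori SNR minus one, half-wave rectified."""
    return max(smoothed_gamma - 1.0, 0.0)


def decision_directed_snr(prev_clean_amp, noise_power, gamma, alpha=0.98):
    """Weighted average of the previous frame's SNR and the rectified (gamma - 1)."""
    prev_frame_snr = prev_clean_amp ** 2 / noise_power
    return alpha * prev_frame_snr + (1.0 - alpha) * max(gamma - 1.0, 0.0)


noise_power = 1.0               # assumed known in advance, as in the text
noisy_power = [1.0, 5.0, 5.0]   # hypothetical bin power: speech starts at frame 1
beta = 0.7                      # hypothetical low-pass smoothing constant

smoothed_gamma = 1.0            # start from a noise-only state
prev_clean_amp = 0.0
ml_track, dd_track = [], []
for power in noisy_power:
    gamma = power / noise_power                      # a posteriori SNR
    smoothed_gamma = beta * smoothed_gamma + (1.0 - beta) * gamma
    ml_track.append(ml_snr(smoothed_gamma))

    xi = decision_directed_snr(prev_clean_amp, noise_power, gamma)
    dd_track.append(xi)
    gain = xi / (1.0 + xi)                           # stand-in gain (assumption)
    prev_clean_amp = gain * math.sqrt(power)         # enhanced amplitude
```

With alpha close to one, the decision-directed estimate rises only gradually after the speech onset at frame 1, since most of the weight rests on the previous frame's enhanced amplitude.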
Modifications to the decision-directed approach have been proposed that are claimed to improve performance further; in particular, it was shown that a delayed response to speech onsets could be avoided by making the estimator non-causal. Subsequently, an improved version of the procedure was introduced that minimizes the mean square error of the log spectrum rather than that of the power spectrum itself. This is reported to give noticeably lower background noise levels without introducing additional distortion.
Techniques developed for missing-feature estimation, which improve performance in non-stationary noise, can also be used for noise reduction. If the frame SNR is less than 10 dB, it is assumed that only the voiced components of speech will be above the noise floor. Broad sub-bands are therefore classified as speech or noise using one of several spectral flatness measures, and additional attenuation is applied in the “noise” sub-bands. This approach is claimed to work well for noise sources consisting of a stationary component plus impulsive bursts: the MMSE enhancer removes the stationary component, while the postprocessor identifies and attenuates the impulsive bursts.
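The sub-band classification step can be sketched with one common spectral flatness measure, the ratio of the geometric mean to the arithmetic mean of the band's power spectrum. The text only says that one of several flatness measures is used, so this particular measure, the flatness threshold of 0.5, the extra gain of 0.3, the band edges, and the example powers are all assumptions made for illustration; only the 10 dB frame SNR limit comes from the text.

```python
import math


def spectral_flatness(powers, floor=1e-12):
    """Geometric mean over arithmetic mean: near 1 if flat, near 0 if peaky."""
    clipped = [max(p, floor) for p in powers]
    log_mean = sum(math.log(p) for p in clipped) / len(clipped)
    arith_mean = sum(clipped) / len(clipped)
    return math.exp(log_mean) / arith_mean


def postprocess_gains(powers, band_edges, frame_snr_db,
                      flatness_threshold=0.5, extra_gain=0.3,
                      snr_limit_db=10.0):
    """Return per-bin gains that further attenuate noise-like sub-bands."""
    gains = [1.0] * len(powers)
    if frame_snr_db >= snr_limit_db:
        return gains                      # postprocessor only acts at low SNR
    for start, stop in band_edges:
        if spectral_flatness(powers[start:stop]) > flatness_threshold:
            for k in range(start, stop):  # flat band: classified as noise
                gains[k] = extra_gain
    return gains


# Hypothetical frame: a flat band followed by a band with one strong harmonic.
powers = [1.0, 1.0, 1.0, 1.0, 0.01, 10.0, 0.01, 0.01]
bands = [(0, 4), (4, 8)]
low_snr_gains = postprocess_gains(powers, bands, frame_snr_db=5.0)
high_snr_gains = postprocess_gains(powers, bands, frame_snr_db=20.0)
```

In this example the flat first band is treated as noise and attenuated, while the second band, dominated by a single strong component, is treated as speech and left unchanged.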