VOCAL's Phase Reconstruction Digest

Many speech enhancement algorithms reuse the phase of the observed signal when reconstructing the spectrum of the enhanced speech. That phase has been corrupted by noise and reverberation, so it is not the cleanest information available. Studies have shown that using a good estimate of the clean phase spectrum noticeably improves the quality of the speech enhancement output.

In general, phase information should be captured with a low dynamic range window, while magnitude information should be captured with a high dynamic range window. Both are easy to obtain from Dolph-Chebyshev windows or their Kaiser approximations. Because generating these window coefficients is generally computationally demanding, they should be computed offline and stored for each window length used in your system. This split approach delivers superior phase knowledge at the cost of one extra windowing operation and one extra FFT per frame.
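The split analysis can be sketched as follows. This is a minimal illustration, not a tuned implementation: it uses the Kaiser approximation only, the 25 dB and 80 dB attenuation targets and the frame lengths are assumed placeholder values, and the names `kaiser_beta`, `build_window_table`, and `split_analysis` are hypothetical.

```python
import numpy as np

def kaiser_beta(atten_db):
    """Kaiser's empirical formula mapping a target attenuation (dB) to beta."""
    if atten_db > 50.0:
        return 0.1102 * (atten_db - 8.7)
    if atten_db >= 21.0:
        return 0.5842 * (atten_db - 21.0) ** 0.4 + 0.07886 * (atten_db - 21.0)
    return 0.0  # below 21 dB the Kaiser window degenerates to rectangular

def build_window_table(frame_lengths, atten_db):
    """Precompute one window per frame length so nothing is generated at run time."""
    beta = kaiser_beta(atten_db)
    return {n: np.kaiser(n, beta) for n in frame_lengths}

# Computed once, "offline". Both attenuation targets are assumptions.
FRAME_LENGTHS = (256, 512)
PHASE_WINDOWS = build_window_table(FRAME_LENGTHS, atten_db=25.0)  # low dynamic range
MAG_WINDOWS = build_window_table(FRAME_LENGTHS, atten_db=80.0)    # high dynamic range

def split_analysis(frame):
    """Return (magnitude, phase) using a separate window and FFT for each."""
    n = len(frame)
    magnitude = np.abs(np.fft.rfft(frame * MAG_WINDOWS[n]))
    phase = np.angle(np.fft.rfft(frame * PHASE_WINDOWS[n]))
    return magnitude, phase
```

The two results can then be recombined as `magnitude * np.exp(1j * phase)` before further processing.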

The easiest way to estimate the clean phase is the Phase Spectrum Compensation (PSC) method, which works on the complex FFT spectrum. Since an audio signal is real, its spectrum is conjugate symmetric, so each spectral component and its conjugate reinforce each other when converting back to the time domain. Breaking this down further, the magnitude spectrum is symmetric while the phase spectrum is antisymmetric. Using this more detailed knowledge, we can alter the relationship between the complex FFT spectrum and its conjugate by pushing them further out of phase, which greatly reduces the contribution of low-magnitude spectral components, where noise dominates.
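A minimal sketch of one common PSC formulation follows. It assumes a per-bin noise magnitude estimate is available from elsewhere (for example, averaged over non-speech frames). The names `psc_frame`, `noise_mag`, and `lam` are illustrative, and the default value of `lam` is a tunable placeholder rather than a recommended setting.

```python
import numpy as np

def psc_frame(noisy_frame, noise_mag, lam=3.0):
    """Apply phase spectrum compensation to one real-valued frame.

    noise_mag: estimated noise magnitude for all len(noisy_frame) FFT bins.
    lam: compensation strength (assumed, tunable).
    """
    n = len(noisy_frame)
    spectrum = np.fft.fft(noisy_frame)

    # Antisymmetric weighting: +1 on positive frequencies, -1 on negative
    # frequencies, 0 at DC (and at Nyquist when n is even).
    psi = np.zeros(n)
    psi[1:(n + 1) // 2] = 1.0
    psi[n // 2 + 1:] = -1.0

    # Adding a real, antisymmetric offset breaks the conjugate symmetry.
    # Bins whose magnitude is small relative to the offset are rotated
    # the most, so they cancel more when only the real part is kept.
    compensated = spectrum + lam * psi * noise_mag

    # Keep the noisy magnitude, swap in the compensated phase.
    enhanced = np.abs(spectrum) * np.exp(1j * np.angle(compensated))

    # The modified spectrum is no longer conjugate symmetric, so the
    # inverse FFT is complex; the real part is the output signal.
    return np.real(np.fft.ifft(enhanced))
```

With `noise_mag` set to zero the frame passes through unchanged, which is a convenient sanity check when wiring this into an overlap-add loop.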

Another approach to phase reconstruction is to correct for phase distortions directly. A clean speech signal has its energy tightly concentrated in a harmonic comb structure, and interference, in the form of phase distortion, spreads that energy out. This distortion happens across frequency but also across time. If you can reliably estimate the fundamental frequency of the speech, you can correct for energy outside the harmonic bands caused by the distorted phase in the short-time frequency domain representation, and recursively smooth this estimate along time. The phase reconstruction itself is computationally efficient, but it relies on voice activity detection and fundamental frequency estimation, which are more computationally expensive and potentially unreliable.
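The time-recursive part of this idea can be illustrated with a simple sinusoidal model. This sketch assumes each bin in a voiced frame is dominated by the harmonic of the fundamental nearest to it, and that `f0_hz` comes from an external pitch estimator that is not shown. The function name and parameters are hypothetical, and this is one possible model rather than a specific published algorithm.

```python
import numpy as np

def propagate_harmonic_phase(prev_phase, f0_hz, fs, n_fft, hop):
    """Predict the next frame's rfft phase from the previous frame's phase.

    prev_phase: phase of the previous frame, length n_fft // 2 + 1.
    f0_hz: fundamental frequency estimate for a voiced frame (assumed given).
    fs: sample rate in Hz; hop: frame advance in samples.
    """
    bin_freqs = np.arange(n_fft // 2 + 1) * fs / n_fft

    # Assign every bin to its nearest harmonic (at least the first one).
    harmonic_idx = np.maximum(1, np.round(bin_freqs / f0_hz))
    harmonic_freqs = harmonic_idx * f0_hz

    # A stationary sinusoid advances its phase by 2*pi*f*hop/fs per frame.
    advanced = prev_phase + 2.0 * np.pi * harmonic_freqs * hop / fs

    # Wrap back into (-pi, pi].
    return np.angle(np.exp(1j * advanced))
```

In a full system this prediction would only be applied in frames flagged as voiced, with the noisy phase used elsewhere.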

Nevertheless, cleaning up the noisy phase spectrum can greatly improve the results of speech enhancement algorithms.