One method for performing blind signal separation (BSS) via time-frequency masking is based on the Degenerate Unmixing Estimation Technique (DUET). DUET has a major advantage over other BSS approaches, such as independent component analysis (ICA) and inverse channel equalization, in that it can handle the underdetermined case, i.e. when the number of sources exceeds the number of microphones.
In an audio recording setting there are several sources of noise that degrade the quality of the desired audio signal. These include sensor noise, general environmental ambient noise, and other audio sources; in addition, there are reverberations (reflections) and acoustic echoes along the loudspeaker-to-microphone path. The goal of acoustic beamforming and blind signal separation is to filter out signals originating from different locations in space and to provide noise reduction in the acquired signal. Knowledge of the characteristics of the desired signal, the sources of noise, and how they mix assists in the separation process.
In order for the DUET algorithm to handle the underdetermined scenario, it makes two assumptions about the signals in the recording environment: an anechoic mixing environment, and W-disjoint orthogonality of the source signals. The anechoic model does not take into account the reflections and reverberations of the environment; it assumes that only the direct path of each source signal reaches the microphones. The second assumption, W-disjoint orthogonality, also referred to as the sparseness assumption, is crucial to the success of the DUET algorithm. It assumes that at any particular time instant, at most one source signal contains a given frequency component. Therefore, when the Short-Time Fourier Transform (STFT) of the received microphone signals is taken, at most one source will be present in each time-frequency bin. With these assumptions it becomes possible to use multiple microphones to reliably extract features that separate the source signals.
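As a concrete illustration, the following is a minimal sketch of the two-microphone anechoic mixing model that DUET assumes: each source reaches microphone 2 only along a direct path, with a per-source attenuation and delay relative to microphone 1. All names and parameter values here (fs, a, d, the stand-in noise sources) are illustrative, not taken from a specific DUET implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                       # sample rate (Hz)
n = fs                           # one second of audio
s = rng.standard_normal((2, n))  # two stand-in source signals

# Direct-path parameters per source: attenuation and delay (in samples)
# of microphone 2 relative to microphone 1.
a = np.array([0.9, 1.1])
d = np.array([2, -3])

x1 = s.sum(axis=0)  # mic 1: plain sum of the sources
# mic 2: each source attenuated and delayed (np.roll is a circular shift,
# which approximates a pure delay well enough for this sketch)
x2 = sum(a[k] * np.roll(s[k], d[k]) for k in range(2))
```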
There are two spatial signatures in the anechoic mixing model with W-disjoint orthogonality that can be used for feature extraction: the level ratio and the relative phase difference. The importance of the microphone spacing becomes clear in the description of these signatures. The level ratio uses the relative attenuation of a source from one microphone to the next as its signature. This signature is of limited use when the distance between the microphones is short: with small spacing the far field begins relatively close to the microphone array, and for far-field sources there is effectively no attenuation difference between the microphones. The second signature is the frequency-normalized phase difference between the two microphones. Whether the sources are located in the far field or the near field, the relative phase difference provides a unique value for each source signal. The limitation of this feature is that the microphones must be spaced less than one half-wavelength of the highest frequency of interest; otherwise phase wrapping causes phase ambiguity between sources.
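Continuing the sketch above, both signatures can be estimated per time-frequency bin from the ratio of the two microphones' STFTs: the magnitude of the ratio gives the level (attenuation) estimate, and its phase, normalized by frequency, gives the relative delay. The symmetric attenuation form used below follows the DUET literature; the window length and the small eps guard are illustrative choices.

```python
from scipy.signal import stft

f, t, X1 = stft(x1, fs=fs, nperseg=1024)
_, _, X2 = stft(x2, fs=fs, nperseg=1024)

eps = 1e-12
R = (X2 + eps) / (X1 + eps)      # inter-microphone ratio per TF bin
a_hat = np.abs(R)                # level-ratio (attenuation) estimate
alpha = a_hat - 1.0 / a_hat      # symmetric attenuation, as used in DUET
omega = 2 * np.pi * f[:, None]   # angular frequency per bin (rad/s)
omega[0] = eps                   # avoid division by zero at DC
delta = -np.angle(R) / omega     # relative-delay estimate (seconds)
```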
There are several techniques for estimating signal signatures from the time-frequency data. The techniques most discussed in the research literature are maximum-likelihood (ML)-motivated and histogram-based estimators. Histogram-based approaches tend to be more reliable in practice because ML-motivated estimators must rely on a particular model of the interfering sources (e.g. an independent Gaussian distribution). In practice, generating useful histograms becomes more difficult because real-world recording environments are not anechoic and sensor noise can be quite disruptive. The goal is to obtain well-localized (tall and narrow) histogram peaks corresponding to the source signals. It is common practice to use a weighted histogram to emphasize the time-frequency bins with the most energy and/or to emphasize certain frequencies of interest.
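A weighted two-dimensional histogram over the (attenuation, delay) features might be built as follows, continuing the sketch above. Each bin votes with a weight of the form |X1 X2|^p |omega|^q, which emphasizes high-energy bins (p) and, optionally, higher frequencies (q); the exponents, bin counts, and feature ranges here are illustrative tuning choices, not canonical values.

```python
p, q = 1.0, 0.0
w = (np.abs(X1) * np.abs(X2)) ** p * np.abs(omega) ** q

hist, alpha_edges, delta_edges = np.histogram2d(
    alpha.ravel(), delta.ravel(),
    bins=(35, 51),
    range=[[-3, 3], [-5e-4, 5e-4]],  # clip to plausible feature ranges
    weights=w.ravel(),
)
# Well-localized (tall, narrow) peaks in `hist` indicate individual sources.
```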
To determine which time-frequency bins belong to a particular source, standard peak-finding or data-clustering algorithms can be applied; by creating a binary mask for each peak or cluster, signal separation can be achieved. In speech applications, a binary mask may provide the best separation performance, but a softer mask, which gives frequency bins not associated with a source a small amount of energy, can make the separated speech sound more natural.
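A final sketch, continuing from above, of turning peaks into binary masks and separating the sources. For brevity the peak locations are taken from the known mixing parameters rather than from an actual peak-finding step on the histogram, and the masked mic-1 STFT is simply inverted; both are simplifications of what a full DUET implementation would do.

```python
from scipy.signal import istft

peak_alpha = a - 1.0 / a   # symmetric attenuation of each source
peak_delta = d / fs        # relative delay of each source (seconds)

# Distance of every TF bin to every peak in (alpha, delta) feature space;
# delta is rescaled to samples so the two coordinates are comparable.
dist = np.stack([(alpha - pa) ** 2 + ((delta - pd) * fs) ** 2
                 for pa, pd in zip(peak_alpha, peak_delta)])
labels = np.argmin(dist, axis=0)  # nearest peak per TF bin

estimates = []
for k in range(2):
    mask = labels == k                               # binary TF mask for source k
    _, s_hat = istft(mask * X1, fs=fs, nperseg=1024)  # invert masked mic-1 STFT
    estimates.append(s_hat)
```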