Speech separation, or segregation, is the extraction of a desired speech signal from a mixture of environmental signals, which can include ambient room noise, competing talkers, and other non-stationary noise. Most speech separation techniques attempt to reduce noise by replicating the signal processing performed naturally by the human auditory system. Speech segregation techniques fall into two categories. The first is monaural approaches, which include speech enhancement techniques and computational auditory scene analysis (CASA). The second is binaural approaches, which include acoustic beamforming and independent component analysis. This paper focuses on the monaural approaches.
Applications of speech separation include improving automatic speaker recognition systems and developing hearing aids and cochlear implants, where noise reduction is essential for speech intelligibility. Another useful application is enabling private conversations during conference sessions. In a large conference, several distinct conversations typically occur at the same time, and each communication channel is disrupted by the other conversations. Speech separation techniques allow individual users’ voices to be separated into their own data streams.
CASA speech signal processing systems model how the human auditory system groups together sound components that overlap in time and frequency. As Bregman states, “if two nearby bits of signal resemble one another on either fundamental or formant frequency, one ought to judge that they are parts of the same signal.” In other words, the auditory system groups together signals that share a common periodicity and are close in time in order to “focus” on those components.
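To make the harmonicity cue concrete, the following sketch tests whether a set of component frequencies are near-integer multiples of a candidate fundamental, which is the simplest form of this grouping rule. The frequencies, candidate pitch, and tolerance are illustrative assumptions, not values from any particular system.

```python
import numpy as np

def group_by_harmonicity(freqs_hz, f0_hz, tol=0.03):
    """Keep components whose frequency lies within a relative tolerance
    of an integer multiple of the candidate fundamental f0_hz."""
    freqs = np.asarray(freqs_hz, dtype=float)
    harmonic_number = np.round(freqs / f0_hz)          # nearest harmonic index
    nearest_harmonic = harmonic_number * f0_hz         # ideal harmonic frequency
    relative_error = np.abs(freqs - nearest_harmonic) / freqs
    keep = (harmonic_number >= 1) & (relative_error < tol)
    return freqs[keep]

# Components at 150, 300, and 450 Hz fit a 150 Hz harmonic series and are
# grouped together; the 520 Hz component does not fit and is left out.
print(group_by_harmonicity([150.0, 300.0, 450.0, 520.0], f0_hz=150.0))
# -> [150. 300. 450.]
```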
The first step in CASA is to transform data recorded in the time domain into a representation of how the sound is perceived naturally by the human ear. This representation is the cochleagram. The cochleagram is similar to a spectrogram, but it uses the cochlea as the model for how to represent the time-frequency response. An important step for grouping together sound components of human speech is determining the pitch of the speaker. The correlogram, the autocorrelation of each frequency channel of the cochleagram, is useful for determining the fundamental pitch frequency. Once the pitch has been determined, the harmonics of this frequency can be grouped together.
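The sketch below illustrates the correlogram idea on a synthetic harmonic signal. A log-spaced Butterworth bandpass filterbank stands in for the gammatone filterbank normally used to build a cochleagram, and the sample rate, channel count, frame length, and pitch search range are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000
t = np.arange(0, 0.064, 1 / fs)                       # one 64 ms analysis frame
f0 = 200.0                                            # pitch to be recovered
x = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, 5))

# Crude cochlear filterbank: log-spaced bandpass channels standing in
# for the gammatone filterbank of a true cochleagram model.
channels = []
for fc in np.geomspace(100, 3200, num=16):
    sos = butter(2, [fc / 1.3, fc * 1.3], btype="bandpass", fs=fs, output="sos")
    channels.append(sosfilt(sos, x))

# Correlogram: autocorrelation of each frequency channel. Summing across
# channels yields the summary correlogram, whose dominant lag within a
# plausible pitch range reveals the fundamental period.
def autocorr(ch):
    return np.correlate(ch, ch, mode="full")[len(ch) - 1:]

summary = np.sum([autocorr(ch) for ch in channels], axis=0)
min_lag, max_lag = fs // 400, fs // 80                # search 80-400 Hz
best_lag = min_lag + np.argmax(summary[min_lag:max_lag])
print(f"estimated pitch: {fs / best_lag:.1f} Hz")     # ~200 Hz
```

Once the best lag is found, every channel whose autocorrelation also peaks at that lag can be treated as carrying a harmonic of the same speaker and grouped accordingly.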
To complete a primitive CASA system, an ideal binary mask is generated over the time-frequency components. The components grouped together are assigned a 1 and kept, while all other components are eliminated. For enhanced CASA, speech modeling can be added to the mask generation: the coherent fragments of speech produced by the grouping stage are processed by a speech-model search that finds the word sequence best matching those fragments.
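As an illustration, the sketch below builds an ideal binary mask the way it is commonly defined: with separate access to the target and the interference (hence “ideal”), a time-frequency cell is kept when the local SNR exceeds a threshold. The STFT front end (in place of a cochleagram), the 0 dB threshold, and the toy signals are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, noise, fs, snr_db=0.0):
    """Keep time-frequency cells where the target dominates the
    interference by at least snr_db; eliminate everything else."""
    _, _, T = stft(target, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    local_snr = 10 * np.log10(np.abs(T) ** 2 / (np.abs(N) ** 2 + 1e-12))
    mask = (local_snr > snr_db).astype(float)         # 1 = keep, 0 = eliminate
    # Apply the mask to the mixture spectrogram and resynthesize.
    _, _, M = stft(target + noise, fs=fs, nperseg=512)
    _, separated = istft(mask * M, fs=fs, nperseg=512)
    return mask, separated

# Toy example: a 300 Hz tone standing in for voiced speech, mixed with
# broadband noise.
fs = 16000
time = np.arange(0, 1.0, 1 / fs)
voice = np.sin(2 * np.pi * 300 * time)
noise = 0.3 * np.random.default_rng(0).standard_normal(len(time))
mask, estimate = ideal_binary_mask(voice, noise, fs)
```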