Hearing: Cocktail Party Problem
The cocktail party problem can be described in one simple sentence: How do we recognize what one person is saying when others are speaking at the same time? Finding answers to this question has been an important goal of human hearing research for many decades. In a multi-party social environment, the voices that are present often overlap in both frequency and time, and the resulting energetic masking impairs speech perception.
How do humans resolve the “cocktail party” problem? The answers to this question are important for our understanding of the human hearing mechanism and may provide new leads for acoustic signal separation and speech enhancement techniques.
The problem can be investigated from two sides: the transmitter side and the receiver side.
Cocktail Party Problem from the Transmitter Perspective
Effective acoustic communication, such as human speech, depends heavily on the perceptual mechanisms that receivers possess for solving cocktail-party-like problems. The structure of the acoustic signal and the behavior of signalers represent adaptations that have evolved under selection pressures associated with ameliorating cocktail-party-like problems for the receiver.
Cocktail Party Problem from the Receiver Perspective
Acoustic techniques mostly focus on the receiver side. They are generally referred to as acoustic scene analysis, which attempts to form coherent and functional perceptual representations of the distinct sound sources in a multi-party environment.
Sounds that share common properties are more likely to be integrated by the auditory system; when properties differ enough between sound elements, those elements probably arise from different sources. Such properties may include fundamental frequency, harmonic relationships, temporal onsets/offsets, timbre, and patterns of amplitude modulation. Integrating and segregating sound sources can therefore be seen as a process of applying these cues to group some elements together and separate others.
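As a concrete illustration of cue-based grouping, the following minimal sketch clusters hypothetical sound elements whose onsets are near-synchronous and whose fundamental frequencies are similar. The element list, tolerance values, and greedy grouping rule are illustrative assumptions, not a model of the auditory system.

```python
# Illustrative sketch: grouping sound elements by shared cues (onset time, F0).
# Elements, tolerances, and the greedy rule are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Element:
    onset: float  # onset time in seconds
    f0: float     # fundamental frequency in Hz

def same_source(a: Element, b: Element,
                onset_tol: float = 0.03, f0_tol: float = 0.05) -> bool:
    """Treat two elements as one source if their onsets are near-synchronous
    and their fundamental frequencies are close (relative tolerance)."""
    synchronous = abs(a.onset - b.onset) <= onset_tol
    similar_f0 = abs(a.f0 - b.f0) / max(a.f0, b.f0) <= f0_tol
    return synchronous and similar_f0

def group_elements(elements):
    """Greedily assign each element to the first group whose seed element it matches."""
    groups = []
    for e in elements:
        for g in groups:
            if same_source(g[0], e):
                g.append(e)
                break
        else:
            groups.append([e])
    return groups

if __name__ == "__main__":
    # Two simulated talkers: one near 120 Hz starting at t = 0.00 s,
    # another near 220 Hz starting at t = 0.15 s.
    elements = [Element(0.00, 120), Element(0.01, 122),
                Element(0.15, 220), Element(0.16, 218)]
    for i, g in enumerate(group_elements(elements)):
        print(f"source {i}: {[(e.onset, e.f0) for e in g]}")
```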
It would be reasonable to assume that spatial cues, such as the interaural time difference (ITD) or the interaural level difference (ILD), play an important role in perceptual integration and segregation. Surprisingly, recent studies show that spatial cues have a limited impact on human listening ability in cocktail-party situations. This is not to say that the cocktail party problem cannot be solved by exploiting the spatial dimension; however, context-based segregation and integration in the time domain is more likely to be the mechanism human hearing relies on. From a machine point of view, therefore, pattern-recognition-based learning algorithms may be the way forward.
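To make the spatial cues concrete, the sketch below estimates the ITD between a pair of ear signals by cross-correlation, using a simulated pure tone and a fixed sample delay. The sampling rate, signal, and delay are assumptions for illustration; real binaural processing is considerably more involved.

```python
# Illustrative sketch: estimating the interaural time difference (ITD) by
# cross-correlating left- and right-ear signals. The sampling rate, source
# signal, and simulated delay are assumptions for illustration only.

import numpy as np

def estimate_itd(left: np.ndarray, right: np.ndarray, fs: int) -> float:
    """Return the lag (in seconds) that maximizes the cross-correlation of the
    two ear signals; a positive value means the right-ear signal lags the left
    (the sound reached the left ear first)."""
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)
    return lag / fs

if __name__ == "__main__":
    fs = 44100
    t = np.arange(0, 0.05, 1 / fs)
    source = np.sin(2 * np.pi * 500 * t)                 # 500 Hz tone
    delay_samples = 20                                    # ~0.45 ms interaural delay
    left = source
    right = np.concatenate([np.zeros(delay_samples), source[:-delay_samples]])
    print(f"estimated ITD: {estimate_itd(left, right, fs) * 1e3:.2f} ms")
```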