Double-Talk Detection in Echo Cancellation

The double-talk detector (DTD) is one of the critical functional elements of an echo canceller. Without a properly working DTD, the echo canceller does not cancel echo reliably, and speech transmitted upstream may be distorted. Please contact us to discuss your double-talk detection requirements.

Figure 1: Voice channel and echo cancellers; signal levels indicate double-talk at both ends of voice connection

Figure 1 depicts a general system view of the voice communication channel, with emphasis on echo cancellation. Echo cancellers (either LEC/NEC or AEC) are located at both ends of the voice connection. When the near-end and far-end speakers talk at the same time, the condition is called double talk (DT). It is usually identified by a “high-level” condition on the uncorrelated signals shown in Figure 1.

Using the notation adopted in Figure 2, the “high-level” condition is defined as follows:

|s(i)| = |x(i) + r(i)| ≥ c · max{|y(i)|, |y(i-1)|, …, |y(i-N)|}, (1)

where N is the length of the FIR filter in the adaptive filter block, c is a predefined constant, and i is the time index (in sample intervals).

If inequality (1) does not hold, the single-talk condition is declared.

A practical implementation of max{|y(i)|, |y(i-1)|, …, |y(i-N)|} stores the samples in a FIFO buffer that is updated with a predetermined step, i.e., every sample or every K samples. If K > 1, a sliding-window approach with a predefined window overlap can be used.
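Inequality (1) and the FIFO buffer can be sketched in a few lines of Python. This is a minimal illustration rather than a production detector: the function name geigel_dtd and its parameters are hypothetical, the buffer is updated on every sample (K = 1), and formula (1) is applied literally, so no noise-floor guard is included.

```python
from collections import deque


def geigel_dtd(s, y, n_taps, c=0.5):
    """Return one flag per sample: 1 = double talk, 0 = single talk.

    s: near-end (S_in) samples, y: far-end reference (R_out) samples.
    n_taps plays the role of N in inequality (1); c is the Geigel constant.
    """
    history = deque(maxlen=n_taps + 1)   # FIFO holding |y(i)| ... |y(i-N)|
    flags = []
    for s_i, y_i in zip(s, y):
        history.append(abs(y_i))         # the oldest sample drops out automatically
        flags.append(1 if abs(s_i) >= c * max(history) else 0)
    return flags


# Hypothetical data: echo at a quarter of the reference level, plus near-end speech on sample 2.
y = [1.0, -1.0, 1.0, -1.0]
s = [0.25, -0.25, 0.85, -0.25]
flags = geigel_dtd(s, y, n_taps=2)       # [0, 0, 1, 0]
```

Taking the maximum over the whole buffer on every sample costs O(N) per sample. Updating it only every K samples, as described above, reduces that cost at the expense of time resolution.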

Figure 2: Near-end echo canceller – signal names

In the original version of the algorithm, the constant c is set to 0.5, which corresponds to an ERL of 6 dB. This form of the double-talk algorithm was proposed by Geigel. Numerous variants of the Geigel algorithm exist, including:

(a) using short-term norm estimates of the signals s(i) and y(i) instead of their sample values;
(b) adaptively adjusting the value of c based on observed characteristics of the echo path (for example, an ERL estimate);
(c) widening the buffer, i.e., going beyond the order of the FIR filter.
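Variants (a) and (b) can be illustrated together. The sketch below assumes that c is an amplitude ratio, so that c = 10^(-ERL/20), which agrees with c = 0.5 for an ERL of about 6 dB. The exponential smoother, the factor alpha, and all function names are illustrative choices, not part of the Geigel algorithm itself, and the max{} buffer is omitted for brevity.

```python
def erl_db_to_c(erl_db):
    """Geigel constant c for an assumed echo return loss in dB (amplitude ratio)."""
    return 10.0 ** (-erl_db / 20.0)


def short_term_norm(samples, alpha=0.05):
    """Exponentially smoothed |x|, i.e. a short-term L1-norm estimate per sample."""
    level, out = 0.0, []
    for x in samples:
        level += alpha * (abs(x) - level)
        out.append(level)
    return out


def geigel_norm_dtd(s, y, erl_db=6.0, alpha=0.05):
    """Compare smoothed levels of s and y, with c derived from an ERL estimate."""
    c = erl_db_to_c(erl_db)
    levels_s = short_term_norm(s, alpha)
    levels_y = short_term_norm(y, alpha)
    # The ls > 0 guard keeps pure silence from being flagged as double talk.
    return [1 if ls > 0.0 and ls >= c * ly else 0
            for ls, ly in zip(levels_s, levels_y)]
```

In an adaptive version, erl_db would be replaced by a running ERL estimate rather than a fixed value.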

Figure 3: The N^th order FIR filter used in the adaptation process

The main limitation of the Geigel algorithm is its sensitivity to the waveform characteristics of the input data. As a result, it often declares double talk falsely or fails to detect double talk when it is present. These false and missed detections can be minimized by careful tuning of an adaptive version of the algorithm. However, for AEC and other cases in which an ERL of 6 dB (or a value close to it) cannot be assumed to represent true use cases, other approaches have been developed.

Figure 4: Buffer structure used for Geigel algorithm

The Geigel algorithm (and its variants) is often referred to as an energy-based DTD algorithm. It is known to work reasonably well with ERL_THRESH values as low as 3 dB. If the reflected signal energy is comparable to the reference signal energy (R_out), i.e., when ERL_THRESH is approximately 0 dB or negative, DTD operation becomes unreliable.

One of these non-Geigel approaches uses a cross-correlation measure of similarity between R_out and S_in (that is, between y and s). Other signal pairs can be chosen for this purpose, for example s and r^, which are very similar during single talk and much less similar during double talk.

The double-talk condition is typically indicated by a binary DTD “flag” that tells whether adaptation of the adaptive filter (AF) is allowed or should be halted. It is determined by the maximum value of the short-term normalized cross-correlation function between the two signals, y and s. The high-level algorithm functionality can be illustrated by the following pseudo-code:

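if max(norm_xcorr(y, s)) > max_xcorr_thresh
    dtd_flag = 0;   % high correlation: single talk, adaptation allowed
else
    dtd_flag = 1;   % low correlation: double talk, adaptation halted    (2)
end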

Note that max_xcorr_thresh can be a constant or variable (if an adaptive approach is used).
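A runnable counterpart of pseudo-code (2) might look as follows. It is a sketch under stated assumptions: norm_xcorr is implemented here as a frame-based correlation normalized by the frame energies, only non-negative lags are searched (the echo in s is assumed to trail y), and the default values of max_lag and max_xcorr_thresh are hypothetical and would need tuning.

```python
import math


def norm_xcorr(y, s, max_lag):
    """Normalized cross-correlation of equal-length frames y and s, lags 0..max_lag.

    Lag k pairs s[n] with y[n - k]; each value lies between 0 and 1.
    """
    energy = math.sqrt(sum(v * v for v in y) * sum(v * v for v in s))
    if energy == 0.0:
        return [0.0] * (max_lag + 1)     # silent frame: no correlation evidence
    return [
        abs(sum(s[n] * y[n - k] for n in range(k, len(s)))) / energy
        for k in range(max_lag + 1)
    ]


def xcorr_dtd(y, s, max_lag=32, max_xcorr_thresh=0.6):
    """Rule (2): a high correlation peak means single talk (0), otherwise double talk (1)."""
    return 0 if max(norm_xcorr(y, s, max_lag)) > max_xcorr_thresh else 1
```

When s contains only a scaled, delayed copy of y, the correlation peak is close to one at the lag of the echo delay. Uncorrelated near-end speech adds energy to s without adding correlation, which lowers the peak.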

The frequency-domain counterpart of the cross-correlation-based DTD uses the signal coherence function F(f), defined as

where Sss(.), Syy(.), and Ssx(.) are the auto- and cross-spectral densities of the corresponding signals.

As with the normalized cross-correlation version, the high-level DTD algorithm tests the function Fsx(f) against a predetermined threshold. There is one difference, though: for the cross-correlation function we use its maximum value (i.e., max(norm_xcorr(y,s)), as in (2)), whereas the coherence-based method (3) offers several options for selecting the frequency f. One version uses a value of f that is representative of the spectrum of interest (for example, 600 Hz). Another version runs multiple tests at predefined discrete values of f.
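The single-frequency test can be sketched as follows. The code assumes the textbook magnitude-squared coherence between s and y, |S_sy(f)|^2 / (S_ss(f) S_yy(f)), with the spectra estimated by averaging one Hann-windowed DFT bin over consecutive frames. This may differ in notation from (3), and the sample rate, frame length, threshold, and function names are all hypothetical.

```python
import cmath
import math


def dft_bin(frame, freq_hz, fs_hz):
    """Single-frequency DFT of a Hann-windowed frame."""
    n = len(frame)
    return sum(
        v * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
        * cmath.exp(-2j * math.pi * freq_hz * i / fs_hz)
        for i, v in enumerate(frame)
    )


def coherence_at(s, y, freq_hz, fs_hz=8000.0, frame_len=64):
    """Magnitude-squared coherence of s and y at freq_hz, averaged over frames."""
    s_ss = s_yy = 0.0
    s_sy = 0j
    for start in range(0, min(len(s), len(y)) - frame_len + 1, frame_len):
        a = dft_bin(s[start:start + frame_len], freq_hz, fs_hz)
        b = dft_bin(y[start:start + frame_len], freq_hz, fs_hz)
        s_ss += abs(a) ** 2              # auto-spectrum estimate of s
        s_yy += abs(b) ** 2              # auto-spectrum estimate of y
        s_sy += a * b.conjugate()        # cross-spectrum estimate
    if s_ss == 0.0 or s_yy == 0.0:
        return 0.0
    return abs(s_sy) ** 2 / (s_ss * s_yy)


def coherence_dtd(s, y, freq_hz=600.0, thresh=0.8, **kwargs):
    """High coherence means single talk (0); low coherence means double talk (1)."""
    return 0 if coherence_at(s, y, freq_hz, **kwargs) > thresh else 1
```

The multi-frequency version would call coherence_at for each predefined value of f and combine the individual decisions, for example by majority vote.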

More advanced double-talk detectors use statistical methods based on the statistical properties of human speech. The main limitation of such methods is their latency: to declare the DTD flag value (dtd_flag of zero for single talk and dtd_flag of one for double talk), substantial statistics have to be collected and evaluated, which may take a relatively long time (measured in seconds).

Another advanced version of the DTD uses a second adaptive filter that is treated as a reference. The coefficients of the main adaptive filter are compared to those of the reference. A mismatch between the two sets of coefficients suggests that the filter may have diverged because of double talk. Of course, other scenarios, such as an echo path change, can also lead to a mismatch. Therefore, this version of the DTD has to include additional discriminating elements that distinguish among the situations that can cause the coefficient mismatch.
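The coefficient comparison itself can be as simple as a normalized distance between the two tap vectors. The sketch below shows one possible measure. The function names and the threshold are hypothetical, and the result is only a hint that still has to be combined with the additional discriminating logic mentioned above.

```python
def coefficient_mismatch(w_main, w_ref):
    """Squared distance between the tap vectors, relative to the reference energy."""
    ref_energy = sum(b * b for b in w_ref)
    if ref_energy == 0.0:
        return 0.0                       # reference not yet adapted: no evidence
    return sum((a - b) ** 2 for a, b in zip(w_main, w_ref)) / ref_energy


def filters_disagree(w_main, w_ref, mismatch_thresh=0.1):
    """True when the main filter differs from the reference by more than the threshold."""
    return coefficient_mismatch(w_main, w_ref) > mismatch_thresh
```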

Regardless of which DTD algorithm is chosen, all share common functional elements, such as logic that ensures smooth transitions between the DTD and non-DTD states. These transitions are characterized by the hang-over time from one state to another, together with hysteresis that ensures transitions do not happen too often. A hold feature is also used: if double talk is declared for a sample, the detector continues to declare double talk for the next N_hold samples, regardless of the actual DTD status.
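The hold feature can be sketched as a countdown that is restarted on every raw detection. The function name apply_hold is hypothetical, and the sketch covers only the hold behaviour, not hysteresis on the detection threshold.

```python
def apply_hold(raw_flags, n_hold):
    """Extend each double-talk decision for n_hold further samples."""
    held, countdown = [], 0
    for flag in raw_flags:
        if flag:
            countdown = n_hold           # restart the hold timer on every raw detection
            held.append(1)
        elif countdown > 0:
            countdown -= 1               # still inside the hold window
            held.append(1)
        else:
            held.append(0)
    return held


held = apply_hold([0, 1, 0, 0, 0, 0], n_hold=2)   # [0, 1, 1, 1, 0, 0]
```

Here raw_flags stands for the per-sample output of any of the detectors discussed above.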

VOCAL Technologies’ echo cancellation solutions are equipped with robust double-talk detector functionality. If necessary, the DTD can be configured to meet specific customer requirements.
