When voice travels over packet-switched networks, the round-trip delay can reach several hundred milliseconds. If the local talker hears his or her own voice after such a long delay, it sounds unnatural and annoying.

The above diagram demonstrates the general situation. On the left, the talker speaks into a microphone. The voice travels through the network and drives a loudspeaker at the remote site. The loudspeaker output is then picked up by the remote microphone and sent back through the network to the local talker. The incoming signal is therefore a delayed and distorted version of the original outgoing signal.

To reduce this echo, a canceler must be implemented at one or both ends. As shown in the diagram, an acoustic echo canceler (AEC) is implemented at the local end. Two issues have to be addressed before the AEC unit can perform efficiently: 1) the round-trip delay from the outgoing signal to the incoming signal, and 2) the nonlinear distortion introduced by the transmission network and the transducers at both ends. This short article discusses various methods for tracking the round-trip delay.

Three approaches are commonly used to estimate the time delay between the outgoing and incoming signals: cross-correlation, normalized cross-correlation, and frequency domain cross-correlation. We denote the incoming signal as y(t) and the outgoing signal as x(t).

Cross-correlation

The cross-correlation is defined as $R_{xy}(n,m)=\sum_{t=n}^{n+L-1} x(t)\,y(t-m),$

where L is the block length for computing the correlation, n is the local time index, and m is the lag. The delay estimate is taken from the lag m at which $R_{xy}(n,m)$ peaks.
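As an illustration, the block cross-correlation can be sketched in plain Python. The function names, the synthetic white-noise signal, the echo gain of 0.6, and the 37-sample delay are all assumptions made for this example and are not taken from the article. Note that with the definition above, an incoming signal y(t) that lags x(t) by D samples produces its peak at the negative lag m = -D, so the sketch searches over m = -D for D = 0, 1, 2, and so on.

```python
import random

def cross_correlation(x, y, n, m, L):
    """R_xy(n, m) = sum over t = n .. n+L-1 of x(t) * y(t - m).

    Samples of y that fall outside the recorded range count as zero.
    """
    total = 0.0
    for t in range(n, n + L):
        k = t - m
        if 0 <= k < len(y):
            total += x[t] * y[k]
    return total

def estimate_delay(x, y, n, L, max_delay):
    """Return the delay D in 0..max_delay whose lag m = -D maximizes R_xy(n, m)."""
    return max(range(max_delay + 1),
               key=lambda d: cross_correlation(x, y, n, -d, L))

# Synthetic test case (assumed): the echo is an attenuated, delayed copy of x.
random.seed(0)
true_delay = 37
x = [random.gauss(0.0, 1.0) for _ in range(1000)]
y = [0.0] * true_delay + [0.6 * v for v in x[:len(x) - true_delay]]

estimated = estimate_delay(x, y, n=100, L=512, max_delay=200)
print("estimated delay:", estimated)
```

The direct double loop costs on the order of L multiplications per candidate lag, which is fine for a demonstration but would normally be replaced by an FFT-based computation for long blocks or wide search ranges.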

Normalized Cross-correlation

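The normalized cross-correlation is defined as $\rho_{xy}(n,m)=\frac{\sum_{t=n}^{n+L-1} x(t)\,y(t-m)}{\sqrt{\sum_{t=n}^{n+L-1} x^{2}(t)\,\sum_{t=n}^{n+L-1} y^{2}(t-m)}},$

that is, the cross-correlation divided by the square root of the product of the energies of the two blocks. Its value lies between -1 and 1, so the height of the peak does not depend on the signal levels.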
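A minimal sketch of a normalized cross-correlation follows, assuming the common energy-normalized form (the block cross-correlation divided by the square root of the product of the two block energies, without mean removal). The function names, the synthetic signal, and the two echo gains are hypothetical choices for illustration only. The example shows that a loud and a quiet echo give the same peak score, which makes it easier to apply a fixed detection threshold.

```python
import math
import random

def normalized_cross_correlation(x, y, n, m, L):
    """Cross-correlation of x(t) and y(t - m) over t = n .. n+L-1,
    divided by sqrt(energy of the x block * energy of the y block)."""
    num = ex = ey = 0.0
    for t in range(n, n + L):
        k = t - m
        yv = y[k] if 0 <= k < len(y) else 0.0  # zero outside the record
        num += x[t] * yv
        ex += x[t] * x[t]
        ey += yv * yv
    denom = math.sqrt(ex * ey)
    return num / denom if denom > 0.0 else 0.0

random.seed(1)
true_delay = 37
x = [random.gauss(0.0, 1.0) for _ in range(1000)]

def echo(gain):
    """Assumed echo model: scaled copy of x delayed by true_delay samples."""
    return [0.0] * true_delay + [gain * v for v in x[:len(x) - true_delay]]

def best_delay_and_score(y, n=100, L=512, max_delay=200):
    """Search delays D = 0..max_delay (lag m = -D) and return (D, score) at the peak."""
    scores = [normalized_cross_correlation(x, y, n, -d, L)
              for d in range(max_delay + 1)]
    d = max(range(max_delay + 1), key=scores.__getitem__)
    return d, scores[d]

loud = best_delay_and_score(echo(0.6))
quiet = best_delay_and_score(echo(0.05))
print("loud echo :", loud)
print("quiet echo:", quiet)
```

Because the echo here is an exact scaled copy of the outgoing signal, the peak score is essentially 1 in both cases; with real distortion and noise the peak would be lower, but still comparable across signal levels.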

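Frequency Domain Cross-correlation

The frequency domain cross-correlation is defined as $R_{xy}(n,m)=\frac{1}{N}\sum_{f=0}^{N-1} S_{xy}(n,f)\,e^{j2\pi fm/N},$

where N is the transform length and $S_{xy}(n,f)$ is the cross spectrum at the time instance n. It can be easily derived from the Fourier transforms of x(t) and y(t), $S_{xy}(n,f)=X(n,f)\,Y^\ast(n,f)$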

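The advantage of the frequency domain cross-correlation is that it can handle fractional (sub-sample) delays, because a delay appears in the cross spectrum as a linear phase term that is not restricted to whole samples. The other two measures can only produce integer delays.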
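One possible way to obtain a fractional delay is sketched below: form the cross spectrum $S_{xy}=XY^\ast$ and evaluate its inverse transform at non-integer lags on a fine grid. This is only one of several options and is not necessarily the method the article has in mind. The helper names, the naive DFT, the 0.01-sample search grid, and the synthetic multi-tone signal with a 5.3-sample delay are assumptions made for the example. As with the time-domain definition, an incoming signal that lags by D samples peaks at lag m = -D.

```python
import cmath
import math
import random

def dft(signal):
    """Naive O(N^2) discrete Fourier transform; adequate for a short demo block."""
    N = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * f * t / N) for t in range(N))
            for f in range(N)]

def correlation_at_lag(S, m):
    """Evaluate (1/N) * sum_f S(f) * exp(j*2*pi*f*m/N) for a real-valued lag m.

    Bins at or above N/2 are treated as negative frequencies so that the
    interpolation between integer lags stays well behaved.
    """
    N = len(S)
    total = 0.0
    for k in range(N):
        f = k if k < N // 2 else k - N  # signed frequency index
        total += (S[k] * cmath.exp(2j * math.pi * f * m / N)).real
    return total / N

# Synthetic test case (assumed): a sum of tones on DFT bins, so a circular
# delay of a non-integer number of samples can be generated exactly.
random.seed(2)
N = 64
true_delay = 5.3
harmonics = [(k, random.uniform(0.5, 1.0), random.uniform(0.0, 2.0 * math.pi))
             for k in range(1, 21)]

def tone_sum(shift):
    return [sum(a * math.cos(2.0 * math.pi * k * (t - shift) / N + p)
                for k, a, p in harmonics)
            for t in range(N)]

x = tone_sum(0.0)                              # outgoing signal
y = [0.5 * v for v in tone_sum(true_delay)]    # attenuated, delayed echo

X, Y = dft(x), dft(y)
S = [Xf * Yf.conjugate() for Xf, Yf in zip(X, Y)]  # cross spectrum X * conj(Y)

# Search delays from 0 to 20 samples in steps of 0.01 sample (lag m = -D).
candidates = [i * 0.01 for i in range(2001)]
estimated = max(candidates, key=lambda d: correlation_at_lag(S, -d))
print("estimated delay:", round(estimated, 2))
```

In practice the naive DFT would be replaced by a library FFT, and the grid search could be limited to a small neighborhood of the integer-lag peak to keep the cost low.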