As telepresence technology advances, the need for high-definition video and audio becomes more apparent. To cope with the increased bandwidth that these HD systems demand, time-domain waveform coding is replaced by transform coding, which allows video and audio to be represented more compactly. Unfortunately, transform coding, especially of audio and voice, can introduce artifacts called pre-echoes and post-echoes. These artifacts are most noticeable when there is a sharp increase or decrease in signal energy, and they can degrade the perceived sound quality. Modules have been built into ITU standards (e.g., G.729.1) to handle these echoes.
The cause of pre-echo is quantization noise introduced in the frequency domain, which is carried into the time domain when the signal is decoded. This is a classic case of the Gibbs phenomenon. The Gibbs phenomenon is most apparent at a sudden increase in energy (a discontinuity), because representing such a time-domain signal with a Fourier series requires an infinitely long series. In practice, only a finite number of coefficients can be used to model the signal. This lossy model creates quantization noise that appears as an artificial signal just before the original onset. The same problem exists after a sudden decrease in energy, where it is called post-echo: the artificial signal appears just after the offset of the original signal. Both artifacts reduce the sharpness of the signal.
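The mechanism described above can be reproduced on a toy signal. The following is a minimal sketch, not a codec implementation: it assumes a single 64-sample block, uses an orthonormal DCT as a stand-in for the codec's transform, and applies a plain uniform quantizer. The names `onset_block`, `quantize`, and `pre_echo_energy`, along with the block size, onset position, and step sizes, are illustrative choices that do not come from any standard.

```python
import math

def dct(x):
    """Orthonormal DCT-II, written in direct O(N^2) form for clarity."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        coeffs.append(scale * sum(
            x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)))
    return coeffs

def idct(coeffs):
    """Inverse of dct(): a weighted sum of the cosine basis functions."""
    n = len(coeffs)
    samples = []
    for t in range(n):
        samples.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n)))
    return samples

def quantize(coeffs, step):
    """Uniform quantizer: round each coefficient to a multiple of step."""
    return [step * round(c / step) for c in coeffs]

def onset_block(n=64, onset=48, freq=0.2):
    """Hypothetical test block: silence, then a sinusoid starting at onset."""
    return [0.0 if t < onset else math.sin(2.0 * math.pi * freq * (t - onset))
            for t in range(n)]

def pre_echo_energy(block, onset, step):
    """Energy of the decoded block before the onset, where the input is silent."""
    decoded = idct(quantize(dct(block), step))
    return sum(v * v for v in decoded[:onset])

if __name__ == "__main__":
    block = onset_block()
    for step in (1e-6, 0.1, 0.5):
        print(f"step={step:g}  pre-echo energy={pre_echo_energy(block, 48, step):.3e}")
```

Because the input is exactly zero before sample 48, any energy measured there in the decoded block is noise introduced by quantizing the coefficients. With a very fine step the pre-onset region stays essentially silent, while a coarse step spreads noise across it.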
Three main factors contribute to the intensity of pre- and post-echoes: the quantization noise, the window length of the transform, and the strength of the onset or offset. Clearly, the sharper the change in energy, the larger the effect of the echo. Quantization and window length are codec design parameters that can be adjusted to help prevent the artifact from occurring.
Quantization noise is the result of representing an infinite number of basis-function coefficients with a finite number of coefficients of limited precision. The goal of any compression scheme is to use the fewest bits to represent the original data with minimal loss of information. Therefore, psychoacoustic modeling is used to determine the level of quantization: the frequency bands most critical to the perception of sound are given the most bits. The obvious trade-off is that fewer bits for a given frequency means more quantization noise, and thus a stronger echo artifact.
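The trade-off between bits and quantization noise can be illustrated numerically. This is a rough sketch under stated assumptions: a uniform quantizer over a fixed range of [-1, 1) and a made-up set of coefficient values. A real codec would allocate bits per band from its psychoacoustic model, which is not modeled here. The names `quantize_uniform` and `noise_power` are hypothetical.

```python
import math

def quantize_uniform(values, bits, full_scale=1.0):
    """Uniform quantizer with 2**bits levels covering [-full_scale, full_scale)."""
    levels = 2 ** bits
    step = 2.0 * full_scale / levels
    quantized = []
    for v in values:
        index = round(v / step)
        # Clip to the representable index range for this bit depth.
        index = max(-(levels // 2), min(levels // 2 - 1, index))
        quantized.append(index * step)
    return quantized

def noise_power(values, bits):
    """Mean squared error between the values and their quantized versions."""
    quantized = quantize_uniform(values, bits)
    return sum((v - q) ** 2 for v, q in zip(values, quantized)) / len(values)

# Made-up coefficient values for illustration only (not from a real codec).
coefficients = [0.9 * math.sin(0.37 * k + 0.1) for k in range(256)]

if __name__ == "__main__":
    for bits in (4, 6, 8):
        print(f"{bits} bits: noise power = {noise_power(coefficients, bits):.3e}")
```

For a uniform quantizer, each bit removed doubles the step size, so the noise power grows by roughly a factor of four per bit. This is why bands that receive fewer bits contribute more to the echo artifact.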
Another factor that affects the creation of the artifact is the window length of the transform. Since the quantization noise is a linear sum of the basis functions, it spans the entire length of the window. Thus, the shorter the window, the shorter the echo. If the echo can be kept to less than 5 ms, it will be inaudible. The drawback of a window that short is a loss of frequency resolution, which is important for effective transform coding. Some coding artifacts are therefore inevitable, so mechanisms for detecting and reducing them must be in place.
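The window-length trade-off comes down to simple arithmetic, sketched below. The 16 kHz sample rate and the candidate window sizes are assumptions chosen for illustration, and the frequency resolution is approximated as the DFT-style bin spacing (sample rate divided by window length), which may differ for other transforms. Only the 5 ms limit comes from the text above.

```python
def window_duration_ms(window_samples, sample_rate_hz):
    """Upper bound on how long the echo can last: the window duration."""
    return 1000.0 * window_samples / sample_rate_hz

def bin_spacing_hz(window_samples, sample_rate_hz):
    """Approximate frequency resolution, assuming DFT-style bin spacing."""
    return sample_rate_hz / window_samples

def echo_inaudible(window_samples, sample_rate_hz, limit_ms=5.0):
    """True if the window is shorter than the 5 ms limit quoted in the text."""
    return window_duration_ms(window_samples, sample_rate_hz) < limit_ms

if __name__ == "__main__":
    sample_rate_hz = 16000  # assumed wideband speech rate, for illustration
    for window_samples in (64, 320, 640):
        print(window_samples,
              window_duration_ms(window_samples, sample_rate_hz),
              bin_spacing_hz(window_samples, sample_rate_hz),
              echo_inaudible(window_samples, sample_rate_hz))
```

Under these assumptions, a 64-sample window lasts 4 ms and stays below the limit, but its bins are 250 Hz apart. A 640-sample window resolves 25 Hz but lets the noise spread over 40 ms.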