In telecommunication, speech quality is an important contributing factor to the success of a product and of the communication itself. High speech quality means that users need little effort to perceive the communication correctly. Two aspects affect speech quality. The first is the physical acoustic and hardware design of the device. The second is the signal processing involved in the communication between devices, which includes vocoders, voice quality enhancement algorithms (acoustic echo cancellation (AEC), noise reduction, automatic level control (ALC or AGC), etc.), and error concealment methods.
Several methods have been developed to measure speech quality and to evaluate how different algorithms affect it. The voice quality of a communication can only truly be described by the human being perceiving it. Therefore, subjective listening tests are preferred. In practice, however, subjective tests are not always feasible, and objective tests that attempt to predict how users will rate the speech quality are required.
For subjective tests, human listeners use a telecommunication device and are asked to evaluate the quality of the system. Subjective tests include conversation tests, in which the system under evaluation is used for real-time communication. In listening-only tests (LOT), the user evaluates the speech quality by listening to speech signals. In a LOT, users can be presented with comparative test signals (clean versus processed, or two differently processed signals), or they can listen to a single processed signal and rate it against what they consider to be good quality.
In practice, the integer ratings that subjects give on the standardized scale of ITU-T P.800 are averaged to produce a mean opinion score (MOS). Although real-time communication tests show the overall performance of the actual system, they do not allow individual components of the system to be evaluated, which can lead to misleading results. For example, there may be little noise in the speech signals, but the round-trip delay of the system may be very long, which degrades the quality of the communication. Thus, individual quality features (or attributes of speech) should be evaluated as well, both subjectively and objectively.
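The averaging step can be illustrated with a minimal sketch. It assumes the five-point absolute category rating scale of P.800 (5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad). The function names and the normal-approximation confidence interval are illustrative choices, not something prescribed by the text above.

```python
from math import sqrt
from statistics import stdev


def mean_opinion_score(ratings):
    """Average a list of integer listener ratings on the 1..5 scale."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not isinstance(r, int) or not 1 <= r <= 5:
            raise ValueError(f"rating {r!r} is not an integer in 1..5")
    return sum(ratings) / len(ratings)


def mos_with_interval(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumption: the interval uses a normal approximation, z * s / sqrt(n),
    which is only a rough guide for small listener panels.
    """
    ratings = list(ratings)
    mos = mean_opinion_score(ratings)
    n = len(ratings)
    half_width = z * stdev(ratings) / sqrt(n) if n > 1 else 0.0
    return mos, half_width


# Hypothetical ratings from five listeners for one test condition.
example_ratings = [4, 5, 3, 4, 4]
example_mos, example_half_width = mos_with_interval(example_ratings)
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two conditions is larger than the spread among listeners.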
The main attributes of speech in terms of speech quality are continuity (short-time distortions), noisiness, directness and frequency content. Short-time distortions resulting from packet or signal loss adversely affect the perceived quality, so packet loss concealment (PLC) methods may be required. PLC methods can be evaluated by directly comparing the clean and processed signals. Short-time distortions such as musical noise from noise reduction techniques also degrade speech quality, even though the overall signal-to-noise ratio (SNR) has been improved. Thus, tonal-noise detection is important in the objective measurement of short-time distortions.
Directness and frequency content metrics evaluate the frequency response of the processed signal. An improper balance of frequency content can reduce speech intelligibility by giving the processed speech a lisp-like quality. The frequency content of a signal is evaluated by measuring the distance between the frequency response of the signal and an idealized response.
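One way to express such a distance is sketched below. The text does not say which distance measure is used, so the root-mean-square difference in decibels is an assumption, as are the function name, the per-band representation of the responses, and the optional removal of the overall level offset.

```python
from math import sqrt


def response_distance_db(measured_db, ideal_db, level_align=True):
    """RMS difference in dB between two magnitude responses.

    Both inputs are magnitudes in dB sampled at the same frequency bands.
    With level_align=True the mean level difference is removed first, so
    only the spectral balance (shape) contributes to the distance.
    """
    if len(measured_db) != len(ideal_db) or not measured_db:
        raise ValueError("responses must be non-empty and of equal length")
    diffs = [m - i for m, i in zip(measured_db, ideal_db)]
    if level_align:
        offset = sum(diffs) / len(diffs)
        diffs = [d - offset for d in diffs]
    return sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical four-band example: the measured response is tilted
# relative to a flat idealized response.
flat_ideal = [0.0, 0.0, 0.0, 0.0]
tilted = [-6.0, 0.0, 0.0, 6.0]
tilt_distance = response_distance_db(tilted, flat_ideal)
```

A uniformly louder or quieter signal yields a distance of zero when level alignment is on, which keeps loudness differences separate from differences in spectral balance.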
Auditory noise has a negative effect on speech quality. The signal-to-noise ratio correlates only loosely with speech quality (correlation of 0.6) and is therefore not fully effective as an objective metric. Signal-correlated and colored noise can cause short interruptions of less than 5 ms. The ability to detect these short interruptions, and the rate at which they occur, can be used as a measure of speech quality.
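Detection of short interruptions and their rate can be sketched as follows. This is a simplified illustration, not a standardized detector: it treats an interruption as a run of samples whose absolute amplitude stays below a threshold. The threshold, the minimum duration used to skip ordinary zero crossings, and the function names are all assumptions.

```python
def find_short_interruptions(samples, sample_rate, threshold=1e-3,
                             min_ms=0.5, max_ms=5.0):
    """Return (start_index, length) for each quiet run shorter than max_ms.

    A run is a stretch of consecutive samples with |x| < threshold.
    Runs shorter than min_ms are ignored, as are runs that reach the end
    of the signal, because their true length is unknown.
    """
    runs = []
    start = None
    for i, x in enumerate(samples):
        if abs(x) < threshold:
            if start is None:
                start = i
        elif start is not None:
            length = i - start
            duration_ms = length * 1000.0 / sample_rate
            if min_ms <= duration_ms < max_ms:
                runs.append((start, length))
            start = None
    return runs


def interruption_rate(samples, sample_rate, **kwargs):
    """Short interruptions per second of signal."""
    if not samples:
        return 0.0
    duration_s = len(samples) / sample_rate
    return len(find_short_interruptions(samples, sample_rate, **kwargs)) / duration_s


# Hypothetical 1 s signal at 8 kHz with a 2 ms dropout and a 10 ms pause.
signal = [0.5] * 8000
signal[1000:1016] = [0.0] * 16   # 2 ms: counted as a short interruption
signal[3000:3080] = [0.0] * 80   # 10 ms: too long, not counted
rate_per_second = interruption_rate(signal, 8000)
```

On real speech, a fixed amplitude threshold would need to be replaced by something level-dependent, such as a short-term energy measure relative to the surrounding signal.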
ITU-T P.862, Perceptual Evaluation of Speech Quality (PESQ), is considered the standard objective measurement for predicting MOS and the benchmark for the evaluation of vocoders. It attempts to evaluate some of the speech attributes mentioned above to generate an estimated MOS. The disadvantage of PESQ is that it is not designed to evaluate noise reduction and speech enhancement routines. PESQ measures the effects of one-way speech distortion and noise on speech quality; its scores do not take loudness loss, delay, sidetone, or echo into account.