Automatic Speech Recognition (ASR) is the process of converting speech into text by decoding the audio signal that carries the speech, whether it is spoken live or played back into the audio front end of the ASR system (cf. Ref. [1], Chapter 15).
ASR is challenging because many aspects of natural speech vary, even within a single language, and all of them have to be taken into account when designing and implementing an ASR application. These aspects map onto features of a given ASR system: vocabulary size, language and pronunciation confusability, speaker independence, language complexity, phrase and sentence complexity (even within a predefined vocabulary), and, last but not least, input speech quality. Regardless of the complexity and performance of the ASR system, input speech quality has a tremendous influence on recognition accuracy. Speech Enhancement Solutions therefore play an important part in products that combine both groups of functionality, ASR and Speech Enhancement.
Research on ASR has been pursued for more than five decades, and the literature covering this subject and closely related topics is very large. Many different techniques are used in ASR systems, and how well each one works depends on the specific ASR solution. There are several measures of speech recognition accuracy; one of the most popular is Word Error Rate (WER). Measured by WER, ASR products that implement different speech recognition technologies sometimes show substantially different performance.
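To make the WER measure concrete, here is a minimal sketch in Python. It uses the common definition of WER: the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch only: tokenization is a plain whitespace split.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.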
Several well-known ASR solutions are available on the market as products. Some are stand-alone products; others are accessed remotely. Regardless of the form of access, several of them can serve as examples of the current state of ASR technology. One popular ASR engine is Google ASR, which can be accessed from personal computers, tablets, smartphones, etc. Other examples include AT&T Watson℠ (cf. [2]), Microsoft ASR, Nuance ASR (available via the cloud or as a stand-alone product), and Voice Box (cf. [3]). Some of these are available in English only; others operate on speech in other languages. An example of a Chinese (Mandarin) ASR is iFlyTek ASR, and a Japanese ASR is known as Fuetrek ASR. Naturally, the list of ASR products is longer than the examples quoted in this brief note.
Figure 1 depicts a general block diagram (cf. [1]) of the functional blocks that constitute an ASR engine, together with their topology. EPD stands for End-Point Detection, a form of Speech Activity Detection. The interface parts are not shown in detail in this diagram.
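As an illustration of what an end-point detection block does, the following sketch marks the start and end of speech using short-time frame energy. It is only an assumption-laden example, not the detector used in Figure 1 or in any particular product: the function name `detect_endpoints`, the frame length of 160 samples, and the fixed energy threshold are all hypothetical choices, and practical detectors typically add adaptive thresholds and hangover logic.

```python
from typing import Optional, Sequence, Tuple


def detect_endpoints(
    samples: Sequence[float],
    frame_len: int = 160,      # hypothetical: 20 ms at an 8 kHz sampling rate
    threshold: float = 0.01,   # hypothetical fixed mean-energy threshold
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the detected speech region.

    A frame counts as speech when its mean squared amplitude exceeds
    `threshold`. Returns None when no frame does.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")

    active_starts = []
    # Walk over non-overlapping, complete frames only.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            active_starts.append(start)

    if not active_starts:
        return None
    # Start of the first active frame, end (exclusive) of the last one.
    return active_starts[0], active_starts[-1] + frame_len


if __name__ == "__main__":
    # Synthetic signal: silence, a constant-amplitude burst, then silence.
    signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 160
    print(detect_endpoints(signal))
```

Only the samples between the returned indices would be passed on to the later recognition stages, which is why noise that inflates frame energy can directly hurt recognition accuracy.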
As already indicated, Speech Enhancement, although not part of the core ASR functionality, has a tremendous impact on WER. This is why one of the objectives of modern Speech Enhancement solutions is to improve the performance of ASR-based products.
VOCAL Technologies’ Speech Enhancement Solutions (which include AEC, NR, AGC and other specific technologies) are designed to assist communication between human end users as well as between human and machine. Given the current state of the art in ASR and Speech Enhancement, Speech Enhancement used with ASR-based products has to be set to the ASR mode, which differs slightly from the default mode. Contact us to discuss your application with our engineering staff.
References
- Microphone Arrays: Signal Processing Techniques and Applications, M. Brandstein and D. Ward (editors), Springer-Verlag 2001 (Signal and Communication Technology series), Patrick A. Naylor and Nikolay D. Gaubitch (editors), Springer-Verlag London Limited 2010
- AT&T Watson℠ Speech Recognition Technology
- Voice Box Technologies