Far-field direction-of-arrival estimation needs a robust voice activity detector (VAD) to collect speech statistics accurately [1]. Voice picked up in the far field contains more reverberation than voice picked up in the near field, which distorts the distribution of the received speech. The reverberation depends on the transfer function between the talker and the microphones, so a change in talker location also changes the speech distribution. In this environment, a fixed statistical model of the noisy speech distribution with hard thresholds is not appropriate. An alternative approach is to use multiple statistical models [2] and compare them against the empirical distribution.
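As a rough illustration of comparing several models against observed data, the sketch below fits two zero-mean candidates, a Gaussian and a Laplacian, by maximum likelihood and keeps whichever gives the higher mean log-likelihood. The choice of these two candidates, the function names, and the use of real-valued samples (for example the real parts of DFT coefficients) are assumptions made for this example; it is not the specific procedure of [2].

```python
import math

def gaussian_mean_loglik(samples):
    """Mean log-likelihood under a zero-mean Gaussian with ML-fitted variance."""
    var = sum(x * x for x in samples) / len(samples)
    # At the ML variance the quadratic term averages to exactly 1/2.
    return -0.5 * (math.log(2.0 * math.pi * var) + 1.0)

def laplacian_mean_loglik(samples):
    """Mean log-likelihood under a zero-mean Laplacian with ML-fitted scale."""
    scale = sum(abs(x) for x in samples) / len(samples)
    # At the ML scale the |x|/scale term averages to exactly 1.
    return -(math.log(2.0 * scale) + 1.0)

def best_fit_model(samples):
    """Return (name, scores) where name is the candidate with the best fit."""
    scores = {
        "gaussian": gaussian_mean_loglik(samples),
        "laplacian": laplacian_mean_loglik(samples),
    }
    return max(scores, key=scores.get), scores
```

Calling best_fit_model on a buffer of recent samples returns the name of the better-fitting candidate together with both scores, so the selection can be repeated as the talker or the room changes. Further candidates, such as a Gamma model, could be added to the dictionary in the same way.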
Why Multiple Statistical Models?
Empirical distributions of spectral amplitudes are not a good fit for many off-the-shelf statistical models [3]. The exact shape of the distribution depends on factors such as the frame length, the DFT length, and the time-domain window, as well as the spatial considerations mentioned previously (reverberation and talker location). Figure 1 illustrates this.
As the figure shows, the filled-in empirical distribution does not exactly match any of the standard statistical models. The main lobe is too small, while the tails are too long. When the other real-world considerations discussed previously are taken into account, the distribution may change unpredictably. Using multiple models lets the detector pick the best fit available among the candidates while retaining the efficiency of likelihood ratio tests and soft decisions.
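To make the dependence on frame length, DFT length, and window concrete, the following minimal sketch collects spectral amplitudes from a signal. All three appear as explicit parameters, so changing any of them changes the set of amplitudes from which an empirical distribution would be built. The function names, the periodic Hann window, the non-overlapping framing, and the direct (non-FFT) DFT are assumptions chosen to keep the example self-contained.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def spectral_amplitudes(signal, frame_len, dft_len, window):
    """Return |X[k]|, k = 0..dft_len//2, for each non-overlapping frame."""
    if dft_len < frame_len or len(window) != frame_len:
        raise ValueError("need dft_len >= frame_len and len(window) == frame_len")
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        # Apply the time-domain window, then zero-pad up to the DFT length.
        x = [signal[start + i] * window[i] for i in range(frame_len)]
        x += [0.0] * (dft_len - frame_len)
        amps = []
        for k in range(dft_len // 2 + 1):
            # Direct DFT sum; an FFT would be used in practice.
            acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / dft_len)
                      for n in range(dft_len))
            amps.append(abs(acc))
        frames.append(amps)
    return frames
```

Pooling the returned amplitudes per frequency bin over many frames gives the kind of empirical distribution discussed above, which can then be compared with the candidate models.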
VOCAL Improvements
Using VOCAL Technologies' advanced understanding of statistical techniques, we can offer improvements on these basic ideas. In [2], basic statistical tests are used for the comparisons. In [3], the weighting schemes are derived in an ad hoc manner. Our implementations use advanced statistical tests to make the best decision possible, along with optimal weighting schemes based on data-driven and adaptive approaches.
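For readers unfamiliar with weighted combinations of models, the sketch below shows one generic way to form a soft VAD decision: each model's log-likelihood ratio is turned into a speech probability, and the probabilities are mixed with convex weights (non-negative, summing to one). The softmax weighting from per-model fit scores, the function names, and the prior_speech parameter are assumptions made for this illustration. It is neither the weighting scheme of [3] nor VOCAL's implementation.

```python
import math

def softmax_weights(fit_scores):
    """Map per-model fit scores (e.g. mean log-likelihoods) to convex weights."""
    peak = max(fit_scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in fit_scores]
    total = sum(exps)
    return [e / total for e in exps]

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def combined_speech_probability(log_lrs, weights, prior_speech=0.5):
    """Convex combination of per-model speech posteriors (a soft decision)."""
    if len(log_lrs) != len(weights):
        raise ValueError("one weight per model is required")
    prior_log_odds = math.log(prior_speech / (1.0 - prior_speech))
    # Posterior speech probability for each model from its log-likelihood ratio.
    posteriors = [_sigmoid(llr + prior_log_odds) for llr in log_lrs]
    return sum(w * p for w, p in zip(weights, posteriors))
```

Because the weights are recomputed from recent fit scores, a model that describes the current data poorly contributes little to the decision. The result is a probability rather than a hard speech/non-speech flag, so a downstream stage can apply its own threshold or use it directly as a soft weight.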
Product Offerings
VOCAL Technologies offers custom-designed direction-of-arrival estimation solutions with a robust voice activity detector. Our custom implementations are designed to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!
References
[1] H. L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory. New York, NY: Wiley-Interscience, 2002.
[2] J.-H. Chang et al., “Voice activity detection based on multiple statistical models,” IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 1965-1976, June 2006.
[3] T. Petsatodis et al., “Convex combination of multiple statistical models with application to VAD,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2314-2327, Nov. 2011.