Voice activity detection with long term spectral divergence

Voice activity detectors are designed to discriminate between the features of speech and those of noise. The long-term spectral divergence (LTSD) approach produces a decision rule aimed at minimizing the number of decision errors. It is inherently a non-causal procedure, since the decision for a frame depends on features of future frames.

Suppose the received signal at the microphone is given as:

$y(t)= s(t) + \nu(t)$

where $s(t)$ is the desired speech signal and $\nu(t)$ is i.i.d. zero-mean Gaussian noise. The frequency-domain representation then becomes

$y(t, \omega) = s(t,\omega) + \nu(t,\omega)$

The long-term spectral envelope, denoted $\alpha(t,\omega)$, is the largest spectral magnitude observed within $T$ frames on either side of frame $t$:

$\alpha(t,\omega) = \underset{\tau}{\max} ~~|y(t+\tau, \omega)|,~~ \tau \in [-T,T]$
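
As a rough illustration, the envelope can be computed frame by frame from a matrix of spectral magnitudes. The sketch below assumes the magnitudes $|y(t,\omega)|$ have already been obtained (for example from a short-time Fourier transform) and are stored as a list of frames. The function name `long_term_envelope` and the choice to clamp the window at the start and end of the recording are illustrative assumptions, not part of the method as described here.

```python
def long_term_envelope(mag, T):
    """Return alpha[t][k] = max of mag[t + tau][k] over tau in [-T, T].

    mag is a list of frames; each frame is a list of spectral magnitudes.
    Near the edges the window is clamped to the available frames (assumption).
    """
    n_frames = len(mag)
    n_bins = len(mag[0]) if n_frames else 0
    alpha = []
    for t in range(n_frames):
        lo = max(0, t - T)              # first frame inside the window
        hi = min(n_frames, t + T + 1)   # one past the last frame (needs T future frames)
        alpha.append([max(mag[u][k] for u in range(lo, hi)) for k in range(n_bins)])
    return alpha


# Toy magnitudes: 4 frames, 2 frequency bins.
frames = [[1, 5], [3, 2], [2, 4], [0, 1]]
envelope = long_term_envelope(frames, T=1)
```

Because each output frame looks at up to `T` later frames, a streaming version would have to delay its output by `T` frames.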

Because $\alpha(t,\omega)$ depends on the $T$ frames that follow frame $t$, it is non-causal, and a real-time implementation therefore has to buffer frames. The size of $T$ determines the system's overall latency: too large a $T$ means high latency, whilst too small a $T$ provides too little temporal smoothing, which may cut off some speech frames or result in abrupt transitions. The long-term spectral divergence, denoted $\gamma(t)$, is then extracted from $\alpha(t,\omega)$ as

$\gamma(t) = \frac{1}{|\omega|_k} \sum\limits_{\omega} \left( \frac{\alpha(t,\omega) }{\alpha_n(\omega) } \right)^2$

where $\alpha_n(\omega)$ is an estimate of the noise spectrum and $|\cdot|_k$ denotes cardinality, so that $|\omega|_k$ is the number of frequency bins.
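
A minimal sketch of this computation for a single frame follows. It assumes the envelope and the noise estimate are given as equal-length lists with one value per frequency bin. The function name `spectral_divergence` and the small floor `eps` that guards against division by zero are illustrative assumptions.

```python
def spectral_divergence(alpha_t, alpha_n, eps=1e-12):
    """gamma(t): mean over frequency bins of (alpha(t, w) / alpha_n(w)) ** 2.

    alpha_t: long-term spectral envelope of one frame, one value per bin.
    alpha_n: noise spectrum estimate, one value per bin.
    eps:     floor on the noise estimate to avoid dividing by zero (assumption).
    """
    if not alpha_t or len(alpha_t) != len(alpha_n):
        raise ValueError("alpha_t and alpha_n must be non-empty and equal in length")
    total = sum((a / max(n, eps)) ** 2 for a, n in zip(alpha_t, alpha_n))
    return total / len(alpha_t)


# When the envelope equals the noise estimate, every ratio is 1, so gamma is 1.
gamma_noise_only = spectral_divergence([0.5, 0.25], [0.5, 0.25])
```

Values of $\gamma(t)$ well above 1 indicate that the envelope exceeds the noise estimate in at least some bins.
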
A detection threshold, $\beta_T(t)$, is used and is defined as:

$\beta_T(t)= \begin{cases}\beta_0, & \text{if}\ \gamma(t) \le \gamma_0\\\frac{\beta_0 - \beta_1}{\gamma_0 - \gamma_1} \gamma(t) + \beta_0 - \gamma_0 \frac{\beta_0 - \beta_1}{\gamma_0-\gamma_1}, & \text{if}\ \gamma_0 < \gamma(t) < \gamma_1 \\\beta_1, & \text{if}\ \gamma(t) \ge \gamma_1\end{cases}$

Here, $\gamma_0$ is the expected noise floor for clean speech and $\gamma_1$ is the noise floor for high-noise conditions. $\beta_0$ and $\beta_1$ are the constant threshold values at these two extremes; interpolating linearly between them gives a saturating, sigmoid-like threshold function. A smoothing function can be applied to both the threshold and the noise estimates to prevent spurious transitions.
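
The threshold can be sketched as a straight line between the points $(\gamma_0, \beta_0)$ and $(\gamma_1, \beta_1)$ that is held constant outside that range. Several parts of the sketch below are assumptions rather than details given above: the numeric parameter values are hypothetical, the one-pole smoother is only one possible smoothing function, and the decision rule (declare speech when $\gamma(t)$ exceeds the threshold) is the usual convention but is not stated explicitly in the text.

```python
def detection_threshold(gamma, gamma0, gamma1, beta0, beta1):
    """Piecewise-linear threshold: beta0 at or below gamma0, beta1 at or above gamma1.

    Assumes gamma0 < gamma1.
    """
    if gamma <= gamma0:
        return beta0
    if gamma >= gamma1:
        return beta1
    slope = (beta0 - beta1) / (gamma0 - gamma1)
    return beta0 + slope * (gamma - gamma0)


def exp_smooth(previous, current, lam=0.9):
    """One-pole smoother (an assumed choice; the text does not name one)."""
    return lam * previous + (1.0 - lam) * current


def is_speech(gamma, threshold):
    """Assumed decision rule: flag speech when gamma exceeds the threshold."""
    return gamma > threshold


# Hypothetical parameter values, for illustration only.
GAMMA0, GAMMA1, BETA0, BETA1 = 10.0, 30.0, 6.0, 2.0
beta = detection_threshold(20.0, GAMMA0, GAMMA1, BETA0, BETA1)
print(beta, is_speech(20.0, beta))
```

Smoothing the threshold across frames, for example with `exp_smooth(previous_beta, beta)`, trades a slower reaction to changing noise conditions for fewer spurious speech and non-speech transitions.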

VOCAL Technologies offers custom designed solutions for beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!
