Blind source separation and speaker diarization are becoming more relevant due to the explosion of mobile applications and their large consumer base. The goal of blind source separation is to recover independent signal sources using only sensor measurements. We describe the so-called DUET-like algorithms below. Suppose we have $N$ measurement sensors and $M$ independent sources, so that the frequency-domain representation is: $X_i(\omega) = \sum\limits_{m=1}^M a_{m,i} S_m (\omega) e^{-j\omega \tau_{m,i}} + \nu_i(\omega), \quad a_{m,1} = 1 \ \forall m$

where $X_i(\omega)$ is the recording received at sensor $i$, $S_m(\omega)$ denotes the clean speech signal from speaker $m$, $\tau_{m,i}$ is the delay from speaker $m$ to sensor $i$, and $\nu_i(\omega)$ is zero-mean i.i.d. noise.
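The mixing model above can be sketched numerically at a single frequency bin. This is a minimal illustration; the attenuations, delays, and noise level are made up for the example:

```python
import numpy as np

def mix_frequency_domain(S, a, tau, omega, noise_std=0.01, rng=None):
    """Synthesize X_i(w) = sum_m a[m, i] * S[m] * exp(-1j * w * tau[m, i]) + nu_i.

    S     : (M,) complex source spectra at one frequency bin
    a     : (M, N) attenuations, with a[:, 0] == 1 (sensor 1 as reference)
    tau   : (M, N) propagation delays in seconds
    omega : angular frequency of the bin in rad/s
    """
    rng = np.random.default_rng(rng)
    N = a.shape[1]
    # sum the delayed, attenuated contributions of every source at each sensor
    X = (a * S[:, None] * np.exp(-1j * omega * tau)).sum(axis=0)
    # zero-mean i.i.d. complex noise nu_i(w)
    X = X + noise_std * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
    return X
```

In practice the spectra would come from an STFT of the microphone recordings, one such mixture per time-frequency bin.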

The key idea is that, if the sources' frequency representations do not overlap too much, an arbitrary number of sources can be adequately separated using a minimum of two microphones. Under this assumption, each frequency bin can be written, for some dominant source $k \in \{1, \cdots, M\}$, as: $X_i(\omega) = a_{k,i} S_k (\omega) e^{-j\omega \tau_{k,i}} + \underbrace{\sum\limits_{m=1, m \neq k}^M a_{m,i} S_m (\omega) e^{-j\omega \tau_{m,i}} + \nu_i(\omega)}_{n_i(\omega)} \quad \forall i$
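The per-bin dominance assumption (often called W-disjoint orthogonality) can be checked on synthetic spectra by measuring what fraction of each bin's energy comes from its strongest source; the helper below is illustrative, not part of the algorithm:

```python
import numpy as np

def dominance_fraction(S):
    """Fraction of per-bin energy carried by each bin's strongest source.

    S : (M, F) complex spectra of M sources over F frequency bins
    """
    power = np.abs(S) ** 2                 # (M, F) per-source, per-bin power
    total = power.sum(axis=0) + 1e-12      # guard against empty bins
    return power.max(axis=0) / total
```

Values near 1 across the bins indicate that treating each bin as carrying a single source $k$ is a reasonable approximation.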

The blind source separation problem then reduces to estimating, for each discrete frequency bin, the $k^{th}$ signal together with its scaling $a_{k,i}$ and phase rotation $\omega \tau_{k,i}$.

For two microphones, a root-MUSIC algorithm can be deployed efficiently. Taking sensor 1 as the reference ($a_{k,1} = 1$ and delays measured relative to sensor 1, so $\tau_{k,1} = 0$), the sample covariance matrix of the two signals is: $\mathcal{R}(\omega) \approx \begin{bmatrix} |S_k(\omega)|^2 + |n_1(\omega)|^2 & |S_k(\omega)|^2 a_{k,2}^{*} e^{j\omega \tau_{k,2}} + \gamma(\omega)\\ |S_k(\omega)|^2 a_{k,2} e^{-j\omega \tau_{k,2}} + \gamma^{*}(\omega) & |S_k(\omega)|^2 |a_{k,2}|^2 + |n_2(\omega)|^2 \end{bmatrix}$

where $\gamma(\omega) = \sum\limits_{m=1, m \neq k}^M |S_m (\omega)|^2 a_{m,2}^{*} e^{j\omega \tau_{m,2}} \rightarrow 0$ for each frequency bin under the assumptions made. Thus, the sample covariance matrix approximates: $\mathcal{R}(\omega) \approx |S_k(\omega)|^2 \begin{bmatrix} 1 + \frac{1}{\mathrm{SNR}_{k,1}(\omega)} & a_{k,2}^{*} e^{j\omega \tau_{k,2}} \\ a_{k,2} e^{-j\omega \tau_{k,2}} & |a_{k,2}|^2 + \frac{1}{\mathrm{SNR}_{k,2}(\omega)} \end{bmatrix}$ where $\mathrm{SNR}_{k,i}(\omega) = |S_k(\omega)|^2 / |n_i(\omega)|^2$.
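In practice the covariance matrix is not available in closed form; it is estimated per frequency bin by averaging outer products of STFT snapshots. A minimal sketch, assuming the two sensor STFTs are given as (frames × bins) arrays:

```python
import numpy as np

def per_bin_covariance(X1, X2):
    """Estimate R(w) for every bin by averaging snapshot outer products.

    X1, X2 : (T, F) complex STFT frames of the two sensors
    returns  (F, 2, 2) sample covariance matrices, one per frequency bin
    """
    X = np.stack([X1, X2], axis=-1)                      # (T, F, 2) snapshots
    # R[f] = (1/T) * sum_t x_t(f) x_t(f)^H
    return np.einsum('tfi,tfj->fij', X, X.conj()) / X.shape[0]
```

Each resulting $2 \times 2$ matrix is Hermitian by construction and feeds directly into the eigendecomposition step below.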

Eigendecomposition of the sample covariance matrix delineates the signal and noise subspaces. For two microphones and one dominant source per bin, the noise subspace is spanned by a single $2 \times 1$ eigenvector, denoted $\mathbf{E}_N(\omega) = \begin{bmatrix} \mathcal{E}_1(\omega) \\ \mathcal{E}_2(\omega) \end{bmatrix}$. The steering vector can simply be denoted as $a(\theta)= \begin{bmatrix}1 \\ z \end{bmatrix}$

The root-MUSIC polynomial is then defined as: $f(\theta,\omega) = a(\theta)^H \mathbf{E}_N(\omega) \mathbf{E}_N^H(\omega) a(\theta) = z^{-1}\left(\mathcal{E}_2(\omega) \mathcal{E}_1^*(\omega) +\left(\mathcal{E}_1(\omega) \mathcal{E}_1^*(\omega)+\mathcal{E}_2(\omega) \mathcal{E}_2^*(\omega) \right)z+\mathcal{E}_1(\omega) \mathcal{E}_2^*(\omega) z^2\right)$

The roots of the above polynomial are: $z \in \left \{-\frac{\mathcal{E}_2(\omega) }{\mathcal{E}_1(\omega) }, -\frac{\mathcal{E}_1^*(\omega) }{\mathcal{E}_2^*(\omega) }\right \}$
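The eigendecomposition and root extraction for one bin can be sketched as follows, using NumPy's Hermitian eigensolver (which returns eigenvalues in ascending order, so the first eigenvector spans the noise subspace):

```python
import numpy as np

def root_music_two_mic(R_bin):
    """Two-microphone root-MUSIC for a single frequency bin.

    R_bin : (2, 2) Hermitian sample covariance at this bin
    returns the two polynomial roots {-E2/E1, -E1*/E2*}
    """
    # eigh sorts eigenvalues ascending: column 0 (smallest eigenvalue)
    # spans the noise subspace E_N
    _, eigvecs = np.linalg.eigh(R_bin)
    e1, e2 = eigvecs[:, 0]
    return np.array([-e2 / e1, -np.conj(e1) / np.conj(e2)])
```

The two roots are reciprocal conjugates of each other, so they share the same phase; at high SNR both lie near the unit circle.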

The root close to the unit circle is the desired solution, and the angle can be evaluated straight away using $z = e^{-j \omega \frac{d}{c} \sin{\theta}}$, where $d$ is the microphone spacing and $c$ is the speed of sound. With $z$ known, the phase rotation of the source is known, and the ratio of the two sensor readings gives an approximate value of the scaling. Any beamforming algorithm can then be applied to suppress the noise and assign the output to source $k$.
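Inverting $z = e^{-j \omega \frac{d}{c} \sin\theta}$ for the arrival angle is a one-liner; the spacing $d$ below is an assumed value for illustration:

```python
import numpy as np

def doa_from_root(z, omega, d=0.05, c=343.0):
    """Recover theta from a root z ~ exp(-1j * omega * (d / c) * sin(theta)).

    z     : complex root (the one nearest the unit circle)
    omega : angular frequency of the bin in rad/s
    d     : microphone spacing in metres (assumed, 5 cm here)
    c     : speed of sound in m/s
    """
    # project the root onto the unit circle, then invert the phase
    phase = np.angle(z / np.abs(z))
    sin_theta = -phase * c / (omega * d)
    # clip to guard against estimates slightly outside [-1, 1]
    return np.arcsin(np.clip(sin_theta, -1.0, 1.0))
```

Note that for unambiguous estimates the phase $\omega \frac{d}{c} \sin\theta$ must stay within $(-\pi, \pi)$, which bounds the usable spacing at high frequencies (spatial aliasing).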

VOCAL Technologies offers custom designed direction of arrival estimation solutions for beamforming with a robust voice activity detector, acoustic echo cancellation and noise suppression. Our custom implementations of such systems are meant to deliver optimum performance for your specific beamforming task. Contact us today to discuss your solution!