Noise Reduction using SVD and PSO

In many applications, such as hands-free telephony and teleconferencing, captured speech signals include significant background noise. This noise can make processing the speech more difficult. Methods are necessary to remove, or at least reduce, the noise in the signal.

One method for noise reduction is based on the singular value decomposition (SVD). This method assumes that the desired signal, y(n), and the noise signal, n(n), have low cross-correlation, and that the noise is white Gaussian. We begin by taking a portion of the noisy signal of length N, x(n) = y(n) + n(n) with n = 0, 1, …, N – 1, and write it as the N – M + 1 by M Hankel matrix

$X = \begin{bmatrix} x(0) & x(1) & \cdots & x(M - 1) \\ x(1) & x(2) & \cdots & x(M) \\ \vdots & \vdots & \ddots & \vdots \\ x(N-M) & x(N-M+1) & \cdots & x(N - 1) \end{bmatrix}$
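As a small illustration, the Hankel arrangement can be sketched in Python with NumPy. The function name `hankel_matrix` is a hypothetical helper, not something from the original text, and the sketch follows the N – M + 1 by M shape stated above, so that entry (i, k) holds x(i + k).

```python
import numpy as np

def hankel_matrix(x, M):
    """Arrange a length-N frame into an (N - M + 1) x M Hankel matrix."""
    x = np.asarray(x, dtype=float)
    N = x.size
    if not 1 <= M <= N:
        raise ValueError("M must satisfy 1 <= M <= N")
    rows = np.arange(N - M + 1)[:, None]  # row index i
    cols = np.arange(M)[None, :]          # column index k
    return x[rows + cols]                 # entry (i, k) is x(i + k)

# Toy frame x(n) = n with N = 6 and M = 3 gives a 4 x 3 matrix
# whose anti-diagonals are constant.
X = hankel_matrix(np.arange(6), M=3)
print(X)
```

Using the sample index as the sample value makes the constant anti-diagonals easy to see when the matrix is printed.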

From this, we compute the singular value decomposition X = UΦV*, where U is an N – M + 1 by M matrix with orthonormal columns, V is an orthogonal M by M matrix, and Φ is a diagonal M by M matrix whose entries are φ0 ≥ φ1 ≥ ⋯ ≥ φM – 1 ≥ 0. (This form assumes N – M + 1 ≥ M.) We create Φ̃ by keeping only the largest K entries of Φ and setting the rest to zero. This removes the low-power portion of the signal, which is assumed to be mostly noise. We then define Ỹ by Ỹ = UΦ̃V*. This is almost the Hankel matrix of the estimated signal ŷ(n), but in general its anti-diagonals are no longer constant. We arrive at that matrix by averaging the entries of Ỹ over the anti-diagonals, i.e.

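$\hat{y}(j)= \frac{ \sum_{i=i_0}^{i_1} \tilde{Y}_{i, j-i}}{i_1 - i_0 + 1}, \quad i_0 = \max(0,\, j - M + 1), \quad i_1 = \min(j,\, N - M)$

For the first samples, j ≤ min(M – 1, N – M), this reduces to a sum over i = 0, …, j divided by j + 1.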
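The three steps described here (Hankel matrix, truncated SVD, anti-diagonal averaging) can be sketched together in NumPy. This is a minimal sketch, not a reference implementation: the function name `svd_denoise` and the test signal are assumptions made for illustration, and the code assumes K ≤ M ≤ N – M + 1.

```python
import numpy as np

def svd_denoise(x, M, K):
    """Estimate y(n) from a noisy frame x(n) by rank-K Hankel/SVD truncation."""
    x = np.asarray(x, dtype=float)
    N = x.size
    L = N - M + 1                       # number of rows of the Hankel matrix
    if not 1 <= K <= M <= L:
        raise ValueError("need 1 <= K <= M <= N - M + 1")
    idx = np.arange(L)[:, None] + np.arange(M)[None, :]  # idx[i, k] = i + k
    X = x[idx]                          # Hankel matrix, X[i, k] = x(i + k)
    U, s, Vh = np.linalg.svd(X, full_matrices=False)     # s is sorted descending
    s[K:] = 0.0                         # keep only the K largest singular values
    Y = (U * s) @ Vh                    # rank-K approximation of X
    # Average every anti-diagonal i + k = j to recover one sample per j.
    sums = np.bincount(idx.ravel(), weights=Y.ravel(), minlength=N)
    counts = np.bincount(idx.ravel(), minlength=N)
    return sums / counts

# Illustrative test signal (an assumption): a sinusoid in white Gaussian noise.
# The Hankel matrix of a single real sinusoid has rank 2, so K = 2 is used.
rng = np.random.default_rng(0)
n = np.arange(160)
clean = np.sin(2 * np.pi * 0.05 * n)
noisy = clean + 0.3 * rng.standard_normal(n.size)
estimate = svd_denoise(noisy, M=20, K=2)
print("error before:", np.linalg.norm(noisy - clean))
print("error after: ", np.linalg.norm(estimate - clean))
```

Counting the entries on each anti-diagonal with `np.bincount` handles the shorter anti-diagonals at the start and end of the frame without special cases.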

The SVD algorithm requires tuning in order to work well: values must be chosen for N, M, and K. N can be chosen so that the algorithm integrates easily with other processing. For example, if the speech is going to be processed by a vocoder that operates on 20 ms frames and samples at 8 kHz, it makes sense to choose N = 160. The choices of M and K, however, can make drastic differences in the effectiveness of the algorithm. This is where particle swarm optimization (PSO) becomes useful: we can use it to search for the best values of M and K. As the objective function, we will use ƒ(M,K) = |ŷ| + |x – ŷ| – |x|, where ŷ is the estimate obtained with the given M and K. This function measures the gain achieved by separating the noise from the desired signal, so maximizing it will lead to improved noise reduction.

The values found for one portion of the signal can then serve as the starting point for examining the next portion. Since speech signals are processed in frames over which the properties remain mostly stationary, the signal and noise portions of the noisy signal will remain similar from one frame to the next. This allows us to use PSO to track the speech signal amid the noise from frame to frame.
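The search over M and K might look roughly like the following sketch. Several things here are assumptions rather than details from the text: |·| is taken to be the Euclidean norm, the swarm is a plain global-best PSO whose continuous positions are rounded to integers, the constants `w`, `c1`, `c2` and the swarm size are illustrative, and all function names are hypothetical. Frame-to-frame tracking is approximated by seeding one particle with the previous frame's result.

```python
import numpy as np

def svd_denoise(x, M, K):
    """Rank-K Hankel/SVD estimate of a frame, with anti-diagonal averaging."""
    N = x.size
    idx = np.arange(N - M + 1)[:, None] + np.arange(M)[None, :]
    U, s, Vh = np.linalg.svd(x[idx], full_matrices=False)
    s[K:] = 0.0
    Y = (U * s) @ Vh
    sums = np.bincount(idx.ravel(), weights=Y.ravel(), minlength=N)
    return sums / np.bincount(idx.ravel(), minlength=N)

def objective(x, M, K):
    """f(M, K) = |y_hat| + |x - y_hat| - |x|, assuming the Euclidean norm."""
    y_hat = svd_denoise(x, M, K)
    norm = np.linalg.norm
    return norm(y_hat) + norm(x - y_hat) - norm(x)

def decode(position, N):
    """Map a continuous particle position to valid integers (M, K)."""
    M = int(np.clip(np.rint(position[0]), 2, N // 2))
    K = int(np.clip(np.rint(position[1]), 1, M))
    return M, K

def pso_search(x, n_particles=12, n_iters=20, start=None, seed=0,
               w=0.7, c1=1.5, c2=1.5):
    """Maximize f(M, K) with a basic global-best particle swarm."""
    x = np.asarray(x, dtype=float)
    N = x.size
    rng = np.random.default_rng(seed)
    lo = np.array([2.0, 1.0])
    hi = np.array([N // 2, N // 2], dtype=float)
    pos = rng.uniform(lo, hi, size=(n_particles, 2))
    if start is not None:
        pos[0] = start                  # reuse the previous frame's (M, K)
    vel = np.zeros_like(pos)
    score = lambda p: objective(x, *decode(p, N))
    pbest = pos.copy()
    pbest_val = np.array([score(p) for p in pos])
    g = int(np.argmax(pbest_val))
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([score(p) for p in pos])
        improved = vals > pbest_val     # update each particle's personal best
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmax(pbest_val))
        if pbest_val[g] > gbest_val:    # update the swarm's global best
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return decode(gbest, N), gbest_val

# Two consecutive 160-sample frames of an illustrative noisy sinusoid.
rng = np.random.default_rng(1)
n = np.arange(320)
noisy = np.sin(2 * np.pi * 0.05 * n) + 0.3 * rng.standard_normal(n.size)
prev, results = None, []
for frame in noisy.reshape(2, 160):
    (M, K), gain = pso_search(frame, start=prev)
    prev = (M, K)
    results.append((M, K, gain))
    print("M =", M, "K =", K, "f =", gain)
```

The bounds keep M between 2 and N/2 so that the Hankel matrix has at least as many rows as columns, and K is clipped to at most M. Other ways of handling the integer variables, such as a discrete PSO variant, would be equally reasonable.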