Much of the analysis that vocoders perform when converting speech to parameters is based on the frequency domain, even though a Fast Fourier Transform (FFT) is rarely used explicitly. Often autocorrelation, which carries much of the same information, is used instead. Autocorrelation is used to determine, for example, pitch lags and Linear Predictive Coding (LPC) coefficients, both of which are best viewed as analyzing the frequencies of the signal.
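To illustrate how autocorrelation exposes pitch without an FFT, here is a minimal Python sketch. The function names, the 8 kHz sample rate, and the lag search range are assumptions chosen for illustration, not taken from any particular vocoder. It simply picks the lag at which the block correlates best with a shifted copy of itself.

```python
import math

def autocorrelation(block, max_lag):
    """Return r[0..max_lag], where r[k] = sum over n of block[n] * block[n + k]."""
    size = len(block)
    return [sum(block[i] * block[i + k] for i in range(size - k))
            for k in range(max_lag + 1)]

def estimate_pitch_lag(block, min_lag, max_lag):
    """Pick the lag in [min_lag, max_lag] with the largest autocorrelation."""
    r = autocorrelation(block, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda k: r[k])

# Synthetic input (assumed values): a 100 Hz tone sampled at 8 kHz,
# so one pitch period is 8000 / 100 = 80 samples.
sample_rate = 8000
block = [math.sin(2 * math.pi * 100 * i / sample_rate) for i in range(240)]
lag = estimate_pitch_lag(block, 20, 160)
print(lag, sample_rate / lag)  # estimated lag in samples and pitch in Hz
```

A practical pitch estimator would likely add normalization and safeguards against picking multiples of the true lag; the sketch only shows the basic idea.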
This frequency-based analysis creates issues because the signal is processed in blocks, usually 5 to 30 ms long. The abrupt start and end of each block create discontinuities, and representing those discontinuities in a frequency-based framework introduces high-frequency anomalies. For pitch analysis, these anomalies mask any true high-pitch energy. For LPC analysis, the coefficients must predict the first few samples of the block from very little information, which inflates the coefficients associated with the shortest lags. These anomalies can create audio artifacts: sounds that should not be present.
A solution to these problems is windowing. In this technique the signal is multiplied by a smooth function that equals 1 over most of the block, is 0 at the start and end of the block, and transitions smoothly between 0 and 1. This removes the discontinuities that cause the artifacts. A typical choice is a sine window, which uses a sine curve to transition smoothly from 0 to 1 in the first part of the block and from 1 to 0 in the last part. The sine window has the added advantage of being easy to implement, thanks to the recursive formula
sin(nθ) = 2 cos(θ) sin((n − 1)θ) − sin((n − 2)θ)
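which allows successive values of the windowing function to be computed easily: because 2 cos(θ) is a constant, each new value costs one multiplication and one subtraction. Here θ = π/(2 · length), where length is the number of samples used to transition from 0 to 1, so that sin(nθ) rises from 0 at n = 0 to 1 at n = length.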
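The recursion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `sine_ramp` and `sine_window`, the block size of 160 samples, and the ramp length of 20 samples are hypothetical, and θ is taken as π/(2 · length) so that the ramp reaches 1 after exactly `length` samples.

```python
import math

def sine_ramp(length):
    """Return sin(n * theta) for n = 0..length using the two-term recurrence."""
    if length < 1:
        raise ValueError("length must be at least 1")
    theta = math.pi / (2 * length)      # assumption: ramp reaches 1 at n = length
    k = 2.0 * math.cos(theta)           # constant multiplier in the recurrence
    ramp = [0.0, math.sin(theta)]       # sin(0) and sin(theta) seed the recurrence
    for n in range(2, length + 1):
        ramp.append(k * ramp[n - 1] - ramp[n - 2])
    return ramp

def sine_window(block_size, length):
    """Window that is 0 at both ends, 1 in the middle, with sine-shaped ramps."""
    if 2 * length > block_size - 1:
        raise ValueError("ramps do not fit inside the block")
    ramp = sine_ramp(length)
    window = [1.0] * block_size
    for n in range(length + 1):
        window[n] = ramp[n]                   # rising ramp at the start
        window[block_size - 1 - n] = ramp[n]  # mirrored falling ramp at the end
    return window

# Example: a 160-sample block (20 ms at an assumed 8 kHz) with 20-sample ramps.
window = sine_window(160, 20)
block = [1.0] * 160                           # placeholder signal
windowed = [w * x for w, x in zip(window, block)]
```

Only two sine or cosine evaluations are needed per window, which is the appeal of the recurrence on hardware where trigonometric functions are expensive. Rounding error accumulates slowly, so for very long ramps it may be worth comparing the result against direct evaluation.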