Wavelets are an alternative to the Short-Time Fourier Transform (STFT) for representing localized events. The STFT performs frequency analysis on short segments of a signal, so each block of time can be associated with a particular frequency spectrum. Making this association over successive time blocks yields a 2D representation of frequency versus time: the familiar spectrogram.
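As a concrete illustration of the block-by-block analysis described above, the sketch below builds a spectrogram with SciPy's standard `scipy.signal.stft` routine. The sample rate, tone frequency, and 256-sample (32 ms) segment length are illustrative choices of my own, not values from the text.

```python
import numpy as np
from scipy.signal import stft

# Illustrative signal: a 440 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs               # 1 second of samples
x = np.sin(2 * np.pi * 440 * t)

# STFT: frequency analysis on short (here 256-sample, i.e. 32 ms) segments.
f, seg_times, Zxx = stft(x, fs=fs, nperseg=256)

# Each column of |Zxx| is the spectrum of one time block; stacking the
# columns gives the spectrogram (frequency vs. time).
spectrogram = np.abs(Zxx)
peak_freq = f[np.argmax(spectrogram.mean(axis=1))]
```

For a stationary tone like this one, the spectrum of every block peaks at (approximately) the tone frequency, limited by the bin spacing `fs / nperseg`.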

The key assumption is that each short segment is wide-sense stationary, i.e., its mean and autocovariance are time-invariant. For typical 32 ms frames during voiced speech, this assumption is often valid. For 32 ms segments with mixed excitation, such as voice onset periods, the assumption breaks down, and the STFT produces a spectrum with extraneous frequencies and ill-defined peaks. The extraneous frequencies look like grass around the true peaks, as if the analysis window had very high sidelobes, and the true peaks are correspondingly smeared in frequency.
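This smearing is easy to reproduce. The sketch below (my own toy example, not from the text) compares the spectrum of a 32 ms frame containing a steady tone against a frame in which the same tone switches on halfway through, mimicking an onset; counting the bins that carry noticeable energy shows how the non-stationary frame spreads across frequency.

```python
import numpy as np

fs = 8000
n = 256                              # 32 ms frame at 8 kHz
window = np.hanning(n)
freq = 500.0                         # chosen to sit on an FFT bin

# Stationary frame: the tone is present for the whole 32 ms.
t = np.arange(n) / fs
stationary = np.sin(2 * np.pi * freq * t)

# Non-stationary frame: the tone switches on halfway through (an onset).
onset = stationary.copy()
onset[: n // 2] = 0.0

def spectral_spread(frame, thresh=0.01):
    """Count frequency bins whose magnitude exceeds thresh * peak."""
    mag = np.abs(np.fft.rfft(frame * window))
    return int(np.sum(mag > thresh * mag.max()))

spread_stationary = spectral_spread(stationary)
spread_onset = spectral_spread(onset)
```

The onset frame lights up far more bins around the true peak, which is exactly the "grass" described above.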

Wavelets, on the other hand, can address this loss of frequency resolution in non-stationary segments. The Wavelet Transform creates a 2D representation of scale versus translation in much the same way the STFT creates one of frequency versus time. The difference is that no Fourier transform is actually computed, and the segment length can vary. In other words, starting at sample X, instead of multiplying by a fixed-length window and taking the Fourier Transform, we multiply by a set of wavelets of varying length. The length of a wavelet determines its scale: the longer the wavelet, the higher the scale. We then shift our starting sample away from sample X and perform another set of multiplications.
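The multiply-and-shift procedure above can be sketched directly. This is a naive illustration, not an efficient transform: I use the Ricker ("Mexican hat") wavelet as an assumed example wavelet, and for each scale I slide it along the signal, taking one inner product per starting sample.

```python
import numpy as np

def ricker(length, scale):
    """Ricker ("Mexican hat") wavelet sampled at `length` points."""
    t = np.arange(length) - (length - 1) / 2.0
    return (1 - (t / scale) ** 2) * np.exp(-(t ** 2) / (2 * scale ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal(512)         # stand-in for a signal segment

# Instead of one fixed window + FFT, multiply by wavelets of varying
# length (scale) at each shift: one coefficient per (scale, shift) pair.
scales = [4, 8, 16]                  # longer wavelet <=> higher scale
coeffs = []
for a in scales:
    w = ricker(8 * a, a)             # wavelet support grows with scale
    row = [np.dot(x[b : b + len(w)], w)      # shift = starting sample b
           for b in range(0, len(x) - len(w) + 1)]
    coeffs.append(row)
```

Each row of `coeffs` corresponds to one scale, and the position within a row is the translation, giving the scale-versus-translation picture described above.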

The Wavelet Transform is defined for continuous time, and hence is most often called the Continuous Wavelet Transform (CWT). To implement it on a computer, we need to discretize the shifts and the scaling. In practice, we scale only by factors of two: the second set of wavelets is twice the length of the first, the third set is twice the length of the second and four times the length of the first, and the fourth set is twice the length of the third, four times the length of the second, and eight times the length of the first. This gives a uniform sampling of an artificial quantity called scale. Each time we scale by an additional factor of two, we also increase our shift step by a factor of two. Since the shift step is tied to the scale, we likewise obtain a uniform sampling of an artificial quantity called translation.
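The dyadic grid this produces is worth seeing explicitly. The short sketch below (an assumed toy setup with a length-16 signal) enumerates the (scale, translation) pairs: the scale doubles at each level, the shift step doubles with it, and so the number of translations per level halves.

```python
import math

N = 16                               # signal length (a power of two)
levels = int(math.log2(N))

# Scale doubles at each level; the shift step doubles with it, so the
# number of translations per level halves.
grid = {}
for j in range(1, levels + 1):
    scale = 2 ** j
    shifts = list(range(0, N, scale))    # shift step equals the scale
    grid[scale] = shifts

total_detail = sum(len(s) for s in grid.values())
```

Summed over all levels, the grid has N - 1 detail positions (8 + 4 + 2 + 1 = 15 here); together with one coarsest approximation value, that matches the N samples of the input.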

By changing the scale, we are changing the amount of time being considered. Longer wavelets, those with higher scale, can capture more low frequency information as a result. Similarly, shorter wavelets, those with lower scale, are best able to capture high frequency information. By shifting in time by different amounts, we can accurately detect transient events such as the onset of a vowel, or the transition between an unvoiced segment and a voiced segment. The event is found in time by the shift, and then isolated by scaling the wavelet. In continuous time this can be made precise, but on a computer, we can only approximate it.
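The scale-to-frequency relationship stated above can be checked numerically: dilating a wavelet compresses its spectrum, so a longer (higher-scale) wavelet peaks at a lower frequency. The sketch uses the Ricker wavelet as an assumed example; the specific scales are arbitrary.

```python
import numpy as np

def ricker(length, scale):
    """Ricker ("Mexican hat") wavelet sampled at `length` points."""
    t = np.arange(length) - (length - 1) / 2.0
    return (1 - (t / scale) ** 2) * np.exp(-(t ** 2) / (2 * scale ** 2))

n = 512
short_wavelet = ricker(n, 4)         # low scale
long_wavelet = ricker(n, 32)         # high scale

def peak_bin(w):
    """Frequency bin where the wavelet's magnitude spectrum peaks."""
    return int(np.argmax(np.abs(np.fft.rfft(w))))

# Dilation compresses the spectrum: the higher-scale wavelet is tuned
# to lower frequencies, the lower-scale wavelet to higher frequencies.
short_peak = peak_bin(short_wavelet)
long_peak = peak_bin(long_wavelet)
```

This is the trade-off the text describes: high-scale wavelets capture low-frequency information, low-scale wavelets capture high-frequency information and localize transients.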

While these shifting and dilating operations certainly show how wavelets can capture high-frequency information, low-frequency information, and transient events, we don’t actually do anything like this in practice. That is, we do not shift and dilate a wavelet; instead, we pass the signal through a bank of filters to compute what is called the Discrete Wavelet Transform (DWT).
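A minimal sketch of that filterbank view, using the Haar filter pair (the simplest possible choice, assumed here for illustration; practical DWTs use longer filters such as Daubechies): one level splits the signal into lowpass (approximation) and highpass (detail) branches, each downsampled by two.

```python
import numpy as np

# One level of a Haar filterbank: lowpass -> approximation coefficients,
# highpass -> detail coefficients, each followed by downsampling by two.
s = 1 / np.sqrt(2)

def haar_dwt_level(x):
    approx = s * (x[0::2] + x[1::2])     # lowpass + downsample
    detail = s * (x[0::2] - x[1::2])     # highpass + downsample
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
approx, detail = haar_dwt_level(x)
```

Because the Haar pair is orthonormal, the level preserves the signal's energy, and iterating `haar_dwt_level` on successive approximations yields the full multi-level DWT, which is equivalent to the dyadic shift-and-dilate picture developed above.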