Pitch Detection using Cepstral Method

In speech processing, cepstral pitch detection is used for speaker identification (determining who is talking), speaker separation, and phase-based speech reconstruction. Pitch detection is often done in the cepstral domain because the cepstrum captures periodicity in the log magnitude spectrum of a signal, and the harmonics of a voiced sound sit at regular intervals in that spectrum. The cepstrum is formed by taking the FFT (or IFFT) of the log magnitude spectrum of a signal. The two can be used interchangeably because, up to a scale factor, one just gives you a reversed version of the other, so each is equally valid for the processing we wish to do.
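As a rough illustration of the definition above, here is a minimal NumPy sketch that computes the cepstrum of a single frame both ways. The function names and the small eps constant (used only to avoid taking the log of zero) are my own choices for illustration, not part of any standard API.

```python
import numpy as np

def cepstrum_ifft(frame, eps=1e-12):
    """Cepstrum as the inverse FFT of the log magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + eps)  # eps avoids log(0)
    return np.fft.ifft(log_mag).real

def cepstrum_fft(frame, eps=1e-12):
    """Same computation, but with a forward FFT as the second transform."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + eps)
    # The forward FFT lacks the 1/N factor that the inverse FFT applies.
    return np.fft.fft(log_mag).real / len(frame)
```

For a real-valued frame the log magnitude spectrum is symmetric, so the reversed version is identical to the original and the two functions return the same values once the 1/N scale factor is accounted for.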

Once in the cepstral domain, the pitch can be estimated by picking the peak of the cepstrum within a certain range. The cepstrum is given in terms of “quefrency” which, besides being a terrible name, represents pitch lag. Therefore, the lag with the most energy corresponds to the dominant periodicity in the log magnitude spectrum, and its reciprocal gives you the pitch.
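The peak-picking step can be sketched as follows. This is only an illustrative sketch under my own assumptions: the function name, the Hamming window, and the 60 to 400 Hz search range are choices made for the example rather than anything prescribed by the method.

```python
import numpy as np

def cepstral_pitch(frame, sample_rate, fmin=60.0, fmax=400.0, eps=1e-12):
    """Estimate pitch in Hz from the cepstral peak of one voiced frame."""
    # Assumed choice: taper the frame with a Hamming window.
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + eps)
    cepstrum = np.fft.ifft(log_mag).real

    # Convert the pitch search range (Hz) to a quefrency range (samples).
    q_min = int(sample_rate / fmax)
    q_max = min(int(sample_rate / fmin), len(frame) // 2 - 1)

    # The lag of the largest cepstral value in range is the pitch period.
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return sample_rate / peak
```

The frame has to be long enough to contain several pitch periods, and the search range should exclude the low quefrencies, where the slowly varying spectral envelope dominates the cepstrum.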

There are of course some caveats to this approach. First, pitch and fundamental frequency are not actually the same thing, so depending on which peak your algorithm picks, you may be getting F0 (the fundamental) or FI (one of the formants). Second, the cepstrum is time-shift variant, so you cannot just apply this method blindly. Instead, you need to line up your time-domain windows precisely so that they start and stop exactly over a voiced speech segment. This is not a trivial task: most voice activity detectors (VADs) make errors, and as a result your cepstrum will suffer from phase ambiguity.

To get around this problem, we can use the differential cepstrum and its variants such as the mean differential cepstrum. This method is widely used and represents an important step in understanding the usefulness of this second Fourier domain.