
Machine Learning for Optimal and Adaptive Filters
ADWC
Abstract
In this paper, we introduce Wiener filters and optimal filters in general along with an explanation of the traditional difficulties involved in calculating them. We then introduce gradient descent, an iterative machine learning algorithm, and explain how it can be used to calculate optimal filters efficiently. We expand on this by introducing genetic algorithms which allow machine learning methods to be applied to non-convex optimization problems. We conclude by showing how these machine learning tools can be used to implement adaptive filters, which are digital filters that change over time and react to their input. We also discuss various applications of adaptive filters.
1 Introduction to optimal filters

Digital filters are often used to convert an observed signal into a desired signal. A common scenario is removing noise, which is shown in Figure 1. In this situation the desired signal is $s[n]$, the observed signal is $x[n]$, and the output signal is $y[n]$. The digital filter $h[n]$ is optimal when the expected squared error $E\left[(y[n] - s[n])^2\right]$ is minimized. This is achieved when $h[n]$ is equal to the Wiener filter

$$H(\omega) = \frac{S_{sx}(\omega)}{S_{xx}(\omega)} \qquad (1)$$

where $S_{ab}(\omega)$ denotes the cross power spectral density (CPSD) between two discrete sequences $a[n]$ and $b[n]$. Additional information including a derivation of equation 1 is available in [Moon and Stirling, 1999]. Although the Wiener filter is mathematically optimal, it's not always the best filter to use in practice for two main reasons. First, calculating $S_{sx}(\omega)$ requires knowledge of the desired signal $s[n]$. If this knowledge were available, a filter wouldn't be necessary in the first place. Second, calculating CPSDs is computationally expensive, which limits their use in real-time applications and dynamic systems. Because of these practical difficulties, methods have been developed to approximate Wiener filters.

The issue of the optimal filter requiring knowledge of the source can be solved by reformulating the problem as shown in Figure 2. A filter $h[n]$ is used to convert a representative $r[n]$ of the noise $v[n]$ into a close approximation of $v[n]$. This reformulation is only useful if the representative $r[n]$ is highly correlated with the noise $v[n]$ and highly uncorrelated with the source $s[n]$. For the technical definition of correlation, see [Li and Cox, 2019]. There are many methods to obtain noise representatives in real applications. For example, if $s[n]$ is a speech signal, $r[n]$ can be obtained by observing $x[n]$ during gaps in speech.

The approximation of the noise is subtracted from the observed signal $x[n]$ to produce an output signal $y[n]$. In this case, the optimal filter becomes

$$H(\omega) = \frac{S_{vr}(\omega)}{S_{rr}(\omega)} \approx \frac{S_{xr}(\omega)}{S_{rr}(\omega)} \qquad (2)$$

where the approximate equality relies on the correlation assumptions on $r[n]$: since $x[n] = s[n] + v[n]$, we have $S_{xr} = S_{sr} + S_{vr} \approx S_{vr}$ when $s[n]$ and $r[n]$ are uncorrelated. Notice this last formula only depends on signals that can be observed, namely $x[n]$ and $r[n]$, so the issue of the optimal filter requiring unavailable information has been solved. The most popular methods which address the computational difficulty of calculating the optimal filter are machine learning algorithms, which we introduce in Section 2.
2 Machine learning methods for approximating digital filters
The optimal filter problem can be reformulated in the context of machine learning by introducing a loss function. A loss function describes how far the system's output is from the desired output, and the objective of machine learning algorithms is to minimize the loss function. Squared error is a simple, common, and effective loss function for many machine learning applications. In the context of Figure 2, a squared error loss function could be formulated as

$$L = \sum_{n=0}^{N-1} y[n]^2 = \sum_{n=0}^{N-1} \left( x[n] - \sum_k h[k]\, r[n-k] \right)^2 \qquad (3)$$

assuming the sequences are of length $N$. Minimizing the output power works here because $s[n]$ is uncorrelated with $r[n]$, so reducing the energy of $y[n]$ can only remove noise. In real noise cancellation applications, the loss function is often based on frequency domain representations of the signals and is minimized on a frame-by-frame basis.
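As a concrete toy instance of equation 3, the following sketch evaluates the loss for a hypothetical 2-tap filter; all signal values are made up for illustration.

```python
import numpy as np

# Hypothetical observed signal x[n], noise reference r[n], and 2-tap filter h.
x = np.array([1.0, 0.5, -0.25, 0.0])
r = np.array([0.5, 0.25, -0.125, 0.0])
h = np.array([2.0, 0.0])

# Output y[n] = x[n] - sum_k h[k] r[n-k]; loss L = sum_n y[n]^2 (equation 3).
y = x - np.convolve(r, h)[:len(x)]
loss = np.sum(y ** 2)  # here h exactly cancels x, so the loss is 0
```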
2.1 Gradient descent
Gradient descent is an iterative algorithm for minimizing a convex differentiable loss function. Convex functions have no spurious local minima (every local minimum is a global minimum), which makes them easier to minimize, but this also means they can't describe complicated behaviors. One of the reasons squared error is so effective in practice is that it's both convex and differentiable, so gradient descent can be used to minimize it. Being an iterative algorithm means gradient descent produces a sequence of approximations to the optimal filter. The algorithm works by calculating the derivative of the loss function, which provides information about how the system needs to be updated to minimize the loss function. For example, expanding $y[n]$ in equation 3 and calculating the derivative of $L$ with respect to each filter coefficient yields

$$\frac{\partial L}{\partial h[k]} = -2 \sum_{n=0}^{N-1} y[n]\, r[n-k]. \qquad (4)$$

These derivatives can be used to update the filter coefficients $h[k]$ as

$$h[k] \leftarrow h[k] - \mu \frac{\partial L}{\partial h[k]} \qquad (5)$$

where $\mu$ is a small constant called the learning rate.
So long as the loss function is convex and the learning rate is small enough, gradient descent is guaranteed to converge to the minimum of the loss function. Thus, it's important the loss function is designed such that its minimum is the desired solution. A $z$-transform analysis can be performed on equation 5 to show gradient descent is stable as long as

$$\mu < \frac{1}{N \sigma_r^2}$$

where $\sigma_r^2$ is the variance of the signal $r[n]$ and we have approximated $\sum_n r[n-k]^2 \approx N \sigma_r^2$. The approximate equality again relies on the correlation assumptions on $r[n]$, and it also relies on the assumption that the noise representative $r[n]$ has zero mean (so that its power equals its variance). A similar calculation with more details is included in [Li and Cox, 2019]. The rate of convergence is determined by the learning rate, with higher learning rates leading to faster convergence as long as the algorithm remains stable.
2.2 Genetic algorithms

Although gradient descent is immensely useful, it becomes less effective when the loss function isn’t convex. This is because gradient descent can converge to any local minimum, so the more complicated the loss function the more likely gradient descent is to converge to a non-optimal solution. Neural networks and adaptive filters can have highly complex loss functions, so their increase in popularity necessitates more robust optimization algorithms. One interesting class of non-convex optimization techniques is genetic algorithms.
Genetic algorithms work by modeling the theory of natural selection. A population of potential solutions called chromosomes is kept. Each chromosome is defined by its genes, and the efficacy of each chromosome is represented by its fitness. The algorithm proceeds by having the chromosomes breed and die in successive generations. Offspring inherit genes from their parents, but they also have a small chance to develop mutations. Various termination criteria are possible, but a common one is when every chromosome has the same fitness. The flow diagram of a generic genetic algorithm is shown in Figure 3, and a specific example is described in the next paragraph.
We’ll now describe an example of how a genetic algorithm could be used for filter optimization. The initial population would consist of $P$ filters $h_1[n], \ldots, h_P[n]$. These filters are the chromosomes, their coefficients are their genes, and their fitness is determined by a loss function such as equation 3. These filters could be initialized randomly, or they could be guesses of what the optimal filter is likely to be. In each generation, $M$ filters are chosen to breed, with more fit filters, i.e. those with a smaller loss, being more likely to be chosen. For each coefficient of each newborn filter, the value of the coefficient may have a 49% chance of coming from either parent and a 2% chance of being a random value. After breeding is complete, $M$ filters are chosen to be discarded, with less fit filters being more likely to be thrown out. This process is repeated until all filters have similar fitness scores, at which point the most fit filter is chosen as the solution.
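The breeding scheme described above can be sketched as follows. The population size of 30, the 10 births and deaths per generation, the fixed generation count (a simplification of the equal-fitness termination criterion), and the toy loss function are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative fitness: distance of a 3-tap filter from a known target filter,
# standing in for a loss function such as equation 3.
target = np.array([0.6, -0.3, 0.1])

def loss(h):
    return float(np.sum((h - target) ** 2))

pop = [rng.uniform(-1, 1, 3) for _ in range(30)]    # random initial chromosomes

for _ in range(500):
    pop.sort(key=loss)                              # most fit (smallest loss) first
    children = []
    for _ in range(10):
        # Fitter filters are more likely to breed: parents come from the top 10.
        pa, pb = pop[rng.integers(0, 10)], pop[rng.integers(0, 10)]
        child = np.empty(3)
        for k in range(3):
            u = rng.random()
            if u < 0.49:
                child[k] = pa[k]                    # 49%: gene from parent A
            elif u < 0.98:
                child[k] = pb[k]                    # 49%: gene from parent B
            else:
                child[k] = rng.uniform(-1, 1)       # 2%: random mutation
        children.append(child)
    # The 10 least fit filters die to make room for the children.
    pop = pop[:20] + children

best = min(pop, key=loss)
```

Mutation is what keeps new gene values flowing into the population; without it, the search could never improve on the coefficients present in the initial filters.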
Genetic algorithms are one class of non-convex optimization techniques that can be used in situations which require complex loss functions. They avoid local minima by keeping a diverse population of chromosomes rather than a single solution like gradient descent. In the next section, we generalize the setup of Section 1 to more types of adaptive filters which require robust optimization techniques like genetic algorithms.
3 Adaptive filters

In Sections 1 and 2 we introduced machine learning as a tool to speed up computation of digital filters. Computing Wiener filters is expensive, and using machine learning allows cheaper iterative refinements which closely approximate the desired filter. These methods can be repurposed to create adaptive filters which are intended to change over time. Adaptive filters are able to react to their input which gives them much more flexibility than traditional linear time-invariant (LTI) filters. Although adaptive filters are usually implemented using an LTI filter, their adaptation makes them time-variant. Furthermore, their adaptation depends on their input which can also make them nonlinear.
Figure 4 shows how noise cancellation can be represented using an adaptive filter model. Note the similarity to Figure 2. The adaptive filter is implemented using an LTI filter $h[n]$, but it's updated using an algorithm which takes the output $y[n]$ as an input. This update algorithm can be implemented using machine learning techniques like gradient descent or a genetic algorithm.
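As a minimal illustration of a filter that reacts to its input, the sketch below (all signal and parameter choices hypothetical) uses a single-tap filter with a gradient-descent-style per-sample update, and tracks a noise path that changes halfway through the signal — something no fixed LTI filter could do.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative time-varying scenario: the path from the reference r[n] to the
# noise changes sign and gain halfway through, and a single-tap adaptive
# filter tracks it. The source s[n] is zero so the tracking is easy to check.
N = 4000
r = rng.standard_normal(N)
gain = np.where(np.arange(N) < N // 2, 0.8, -0.5)   # noise path, switches at N/2
x = gain * r                                        # observed signal x[n]

h = 0.0                                             # adaptive filter coefficient
mu = 0.05                                           # per-sample learning rate
track = np.empty(N)
for n in range(N):
    y = x[n] - h * r[n]                             # output y[n]
    h = h + 2 * mu * y * r[n]                       # gradient-descent-style update
    track[n] = h
```

The coefficient settles near 0.8, then re-converges to −0.5 after the switch, which is exactly the time-variant behavior described above.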
3.1 System identification

One of the simplest yet most interesting applications of adaptive filters is system identification. One possible model for system identification is shown in Figure 5. The idea is to use an adaptive filter to approximate some other outside system. The outside system could be a room with reverberation, an individual’s head-related transfer function (see [Li and Cox, 2019]), or even another digital filter. The outputs of the outside system and the adaptive filter are subtracted from each other, and this error $e[n]$ is fed into an adaptation algorithm which updates the adaptive filter coefficients.
An important observation about system identification is that the object of interest isn’t the output of the filter. Indeed, system identification relies on being able to obtain the output of the outside system already. The object of interest is instead the filter itself, or rather the filter’s coefficients. This has the obvious application of measuring unknown systems, but it has another important application as well. Even if the outside system is known completely, system identification can be useful for creating a low-order approximation to it. For example, an FIR filter with 256 taps could be approximated by an FIR filter with 32 taps obtained using an adaptive filter. Alternatively, it could be approximated by a low-order IIR filter. Changing the order or type of a filter in this way is a simple alternative to the tedious and mistake-prone mathematical manipulations which would be required to do the conversion by hand.
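A sketch of the system identification loop, assuming a hypothetical 4-tap FIR as the outside system and a gradient-descent-style per-sample update as the adaptation algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative outside system: a 4-tap FIR, "unknown" to the adaptive filter
# (we keep it around only to verify the identified coefficients).
g = np.array([0.5, -0.2, 0.1, 0.05])
N = 5000
u = rng.standard_normal(N)                  # probe input fed to both systems
d = np.convolve(u, g)[:N]                   # outside system's output

h = np.zeros(4)                             # adaptive filter coefficients
mu = 0.01
buf = np.zeros(4)                           # most recent inputs, newest first
for n in range(N):
    buf = np.concatenate(([u[n]], buf[:-1]))
    e = d[n] - h @ buf                      # error between the two outputs
    h = h + 2 * mu * e * buf                # coefficient update driven by e[n]
```

After adaptation, the coefficients `h` — not the filter's output — are the result of interest, matching the outside system `g`.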
3.2 Inverse modeling

Another application of adaptive filters is inverse modeling, a possible model for which is shown in Figure 6. In contrast to system identification, inverse modeling approximates the inverse of an outside system rather than the system itself. This can be accomplished by passing an input through both the outside system and the adaptive filter in sequence, subtracting the result from the original signal, and using the resulting error $e[n]$ as input to an adaptation algorithm which updates the adaptive filter coefficients. Similarly to system identification, the object of interest in inverse modeling is the filter itself.
It’s worth mentioning that although inverse modeling allows approximating a system’s inverse directly, the inverse could also be approximated by first estimating the outside system as in Section 3.1, and then computing the analytic inverse. The advantage of inverse modeling isn’t just removing this last step, however. The form of the adaptive filter can be chosen explicitly when designing the adaptation algorithm. This flexibility in filter design allows the inverse to be approximated using an FIR filter, for example, which is unconditionally stable. This is a major advantage of inverse modeling because it facilitates designing stable approximations to the inverse of a system whose true inverse is unstable.
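A sketch of the inverse modeling loop under similar illustrative assumptions (a hypothetical outside system and gradient-descent-style updates). Here the outside system's exact inverse is IIR, but the adaptive FIR filter converges to a stable truncated approximation of it, illustrating the stability advantage discussed above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative outside system: the FIR [1, 0.5]. Its exact inverse is an
# infinite impulse response, but an 8-tap FIR approximates it well.
g = np.array([1.0, 0.5])
N = 8000
u = rng.standard_normal(N)                  # original input signal
p = np.convolve(u, g)[:N]                   # signal after the outside system

w = np.zeros(8)                             # FIR approximation to the inverse
mu = 0.01
buf = np.zeros(8)
for n in range(N):
    buf = np.concatenate(([p[n]], buf[:-1]))
    e = u[n] - w @ buf                      # cascade output vs. original input
    w = w + 2 * mu * e * buf                # coefficient update driven by e[n]

# The learned taps should approach the truncated true inverse (-0.5)**k.
```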
4 Conclusion and review of applications
In this paper, we introduced machine learning algorithms and described how they can be applied to compute optimal and adaptive filters efficiently. In Section 1 we introduced Wiener filters and the two major problems with computing them directly. We also showed how a reformulation can solve the problem of Wiener filters requiring unavailable information. In Section 2 we addressed the second problem of computing optimal filters by introducing gradient descent as an efficient convex optimization algorithm. We also introduced genetic algorithms as a class of non-convex optimization algorithms which can be used on the complex loss functions required by neural networks and adaptive filters. In Section 3 we introduced three example models of adaptive filters: noise cancellation, system identification, and inverse modeling. Beyond measuring unknown systems, system identification and inverse modeling are also useful for approximating systems with low-order or stable filters.
References
Information is available upon request.