Choice of Loss Function for Neural Networks

Neural networks are often used to model non-linear systems. Even so, it is desirable for the cost (or loss) function to take the same form as in linear or logistic regression, because this simplifies supervised training. The non-linearity is introduced by the activation functions applied within the network. The choice of loss function depends on the type of values produced by the output layer of the network.

Using NN-based speech enhancement as an example, if the output of the network is the time-domain waveform, then regression cost functions based on the L1 or L2 norm of the error (the same norms used as penalties in lasso and ridge regression, respectively) can be used.

$L_1=\sum_{i=1}^{N}\left|y_i-\hat{y}_i\right|$

$L_2=\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2$

Here $y_i$ is the target value, $\hat{y}_i$ is the predicted value, and $N$ is the number of training cases. Generally, the L2 loss is preferred: squaring the difference penalizes large errors more heavily and gives a gradient proportional to the error, which typically results in faster learning. Dividing $L_2$ by $N$ gives the mean square error, which can be used as the loss function for any real-valued output node.
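The two sums above can be written directly in code. The following is a minimal Python sketch, assuming the target and predicted waveforms are plain lists of samples of equal length. The function names and the sample values are hypothetical and chosen only for illustration.

```python
def l1_loss(targets, predictions):
    """Sum of absolute errors over all training cases."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(targets, predictions))


def l2_loss(targets, predictions):
    """Sum of squared errors over all training cases."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))


def mean_squared_error(targets, predictions):
    """L2 loss divided by the number of training cases."""
    return l2_loss(targets, predictions) / len(targets)


# Hypothetical time-domain samples (assumed values, not real speech data).
clean = [0.0, 0.5, -0.5, 1.0]
estimate = [0.0, 0.25, -0.5, 0.5]

print(l1_loss(clean, estimate))             # 0.75
print(l2_loss(clean, estimate))             # 0.3125
print(mean_squared_error(clean, estimate))  # 0.078125
```

In this example the single largest error (0.5) contributes 0.25 of the 0.3125 total L2 loss, which shows how squaring weights large errors more heavily than the L1 loss does.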

When the output layer represents a binary or categorical target, the output node(s) give the predicted probability of each class. The most common cost function in this case is cross entropy, the loss used in logistic regression. Cross entropy measures the difference between the predicted probability distribution and the true probability distribution. This cost function is defined as:

$L_{ce}=-\sum_{n=1}^{N}\sum_{k=1}^{K}y_{nk}\log\left(\hat{y}_{nk}\right)$

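where $K$ is the number of output classes, $y_{nk}$ is 1 if training case $n$ belongs to class $k$ and 0 otherwise, and $\hat{y}_{nk}$ is the predicted probability of that class. $K = 2$ represents binary classification. In NN-based speech enhancement, when the training target is the ideal binary mask, which classifies each time-frequency (T-F) unit as speech or not-speech, a cross entropy loss function can be used.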
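As an illustration of the cross entropy sum for the binary mask case, here is a minimal Python sketch with $K = 2$. It assumes one-hot target rows and predicted probability rows that each sum to 1. The function name, the `eps` clamp that guards against taking the logarithm of zero, and the example values are all assumptions made for this sketch.

```python
import math


def cross_entropy(targets, predictions, eps=1e-12):
    """Cross entropy summed over N training cases and K classes.

    targets:     N rows of K one-hot labels (1 for the true class, else 0)
    predictions: N rows of K predicted class probabilities
    eps:         assumed lower clamp on probabilities to avoid log(0)
    """
    total = 0.0
    for y_row, p_row in zip(targets, predictions):
        for y, p in zip(y_row, p_row):
            total -= y * math.log(max(p, eps))
    return total


# Hypothetical T-F units: column 0 = not-speech, column 1 = speech.
mask_targets = [[0, 1], [1, 0], [0, 1]]
mask_probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]

loss = cross_entropy(mask_targets, mask_probs)
print(loss)
```

Only the probability assigned to the true class of each T-F unit contributes to the sum, so the uncertain third unit (probability 0.5) adds more to the loss than the confident, correct first unit (probability 0.9).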

Complete Communications Engineering
