When we train models like neural networks, we seek the parameters that produce the best possible mapping from input to output for the task that we are considering.
Definition (Loss Function)
For a dataset $\{x_i, y_i\}_{i=1}^{I}$ of input/output pairs, a loss function $L[\phi]$ returns a single number that describes the mismatch between the model predictions $f[x_i, \phi]$ and their corresponding ground-truth outputs $y_i$.
During training, we seek parameter values that minimize the loss and hence map the training inputs to outputs as closely as possible.
Constructing Loss Functions
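The recipe for constructing loss functions for training data $\{x_i, y_i\}$ using the maximum likelihood approach is:
Choose a suitable probability distribution $Pr(y \mid \theta)$ defined over the domain of the predictions $y$, with distribution parameters $\theta$.
Set the machine learning model $f[x, \phi]$ to predict one or more of these parameters, so $\theta = f[x, \phi]$ and $Pr(y \mid \theta) = Pr(y \mid f[x, \phi])$.
To train the model, find the network parameters $\hat{\phi}$ that minimize the negative log-likelihood loss function over the training dataset pairs $\{x_i, y_i\}$:
$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}} \big[ L[\phi] \big] = \underset{\phi}{\operatorname{argmin}} \left[ -\sum_{i=1}^{I} \log \big[ Pr(y_i \mid f[x_i, \phi]) \big] \right]$$
To perform inference for a new test example $x$, return either the full distribution $Pr(y \mid f[x, \hat{\phi}])$ or the value where this distribution is maximized.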
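The four steps of the recipe can be sketched in a few lines of Python. This is a minimal illustration rather than a training procedure: the unit-variance normal distribution, the one-parameter `linear_model`, the toy data, and the grid search that stands in for gradient descent are all assumptions made for the example.

```python
import math

def negative_log_likelihood(model, log_prob, phi, data):
    """Training criterion: L[phi] = -sum_i log Pr(y_i | f[x_i, phi])."""
    return -sum(log_prob(y, model(x, phi)) for x, y in data)

# Choose a distribution over y: a normal with fixed unit variance (illustrative).
def normal_log_prob(y, mean):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - mean) ** 2

# Let a (hypothetical) one-parameter model predict the distribution's mean.
def linear_model(x, phi):
    return phi * x

# Hypothetical training pairs (x_i, y_i).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# Train: pick the phi with the lowest loss from a coarse grid
# (a stand-in for gradient-based optimization).
candidates = [i / 10 for i in range(0, 41)]
phi_hat = min(
    candidates,
    key=lambda phi: negative_log_likelihood(linear_model, normal_log_prob, phi, data),
)

# Inference: for a normal distribution the most probable y is the predicted mean.
y_hat = linear_model(4.0, phi_hat)
```

Swapping `normal_log_prob` for another log-density changes the loss without touching the rest of the code, which is the point of the recipe.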
Univariate Regression
Loss Function
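The goal is to predict a single scalar output $y \in \mathbb{R}$ from input $x$ using a model $f[x, \phi]$ with parameters $\phi$. We first choose a probability distribution over the output domain $y$ as the normal distribution. Second, we set the machine learning model $f[x, \phi]$ to compute one or more of the parameters of the distribution. Here, we just compute the mean, so $\mu = f[x, \phi]$:
$$Pr(y \mid f[x, \phi], \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y - f[x, \phi])^2}{2\sigma^2} \right]$$
We aim to find the parameters $\phi$ that make the training data $\{x_i, y_i\}$ most probable under this distribution. To accomplish this, we choose a loss function $L[\phi]$ based on the negative log-likelihood:
$$L[\phi] = -\sum_{i=1}^{I} \log \big[ Pr(y_i \mid f[x_i, \phi], \sigma^2) \big] = -\sum_{i=1}^{I} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y_i - f[x_i, \phi])^2}{2\sigma^2} \right] \right]$$
When we train the model, we seek the parameters $\hat{\phi}$ that minimize this loss. Dropping the term that does not depend on $\phi$ and the constant positive factor $1/(2\sigma^2)$ leaves the least squares criterion:
$$\hat{\phi} = \underset{\phi}{\operatorname{argmin}} \left[ \sum_{i=1}^{I} (y_i - f[x_i, \phi])^2 \right]$$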
The network no longer directly predicts $y$ but instead predicts the mean $\mu = f[x, \phi]$ of the normal distribution over $y$. When we perform inference, we usually want a single best point estimate $\hat{y}$, so we take the maximum of the predicted distribution:
$$\hat{y} = \underset{y}{\operatorname{argmax}} \big[ Pr(y \mid f[x, \hat{\phi}], \sigma^2) \big]$$
For the univariate normal distribution, the maximum position is determined by the mean parameter $\mu$, which is precisely what the model computed, so $\hat{y} = f[x, \hat{\phi}]$.
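The link between the normal negative log-likelihood and least squares can be checked numerically. The sketch below uses made-up targets and two made-up sets of predictions (all hypothetical) and shows that the difference in negative log-likelihood between them is just the difference in squared error scaled by $1/(2\sigma^2)$, so both criteria rank predictions identically.

```python
import math

def gaussian_nll(preds, targets, sigma2):
    """Sum over the data of -log N(y_i; mean=pred_i, variance=sigma2)."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2)
        for mu, y in zip(preds, targets)
    )

def sum_of_squares(preds, targets):
    """Least squares criterion: sum of squared residuals."""
    return sum((y - mu) ** 2 for mu, y in zip(preds, targets))

# Hypothetical targets and two candidate sets of predictions.
targets = [0.1, 0.9, 2.2, 2.8]
preds_a = [0.0, 1.0, 2.0, 3.0]
preds_b = [0.5, 0.5, 2.5, 2.5]
sigma2 = 0.5  # fixed, assumed variance

nll_gap = gaussian_nll(preds_a, targets, sigma2) - gaussian_nll(preds_b, targets, sigma2)
sse_gap = sum_of_squares(preds_a, targets) - sum_of_squares(preds_b, targets)
# The constant log term cancels, so the gaps differ only by the factor 1 / (2 * sigma2).
```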
Heteroscedastic Regression
The model above assumes that the variance of the data is constant everywhere (it is homoscedastic). When the uncertainty of the model varies as a function of the input data, we refer to this as heteroscedastic. A simple way to model this is to train a neural network that computes both the mean and the variance. The loss function above is then modified: the variance can no longer be factored out as a constant, so the loss depends on both predicted distribution parameters.
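A heteroscedastic loss might look like the sketch below. It assumes the network emits a mean and a log-variance per example; exponentiating the second output is one common way to keep the variance positive and is a choice made here for illustration, not something the notes prescribe.

```python
import math

def heteroscedastic_nll(means, log_vars, targets):
    """Sum of -log N(y_i; mu_i, exp(log_var_i)); each example has its own variance."""
    total = 0.0
    for mu, log_var, y in zip(means, log_vars, targets):
        var = math.exp(log_var)  # assumption: variance parameterized as exp(output)
        # The log-variance term stays in the loss because it now depends on the model.
        total += 0.5 * (math.log(2 * math.pi) + log_var) + (y - mu) ** 2 / (2 * var)
    return total

# The same residual of 2.0 costs less when the model also predicts a larger variance.
confident = heteroscedastic_nll([0.0], [0.0], [2.0])
uncertain = heteroscedastic_nll([0.0], [math.log(4.0)], [2.0])
```

The log-variance term stops the model from driving the loss down by predicting a huge variance everywhere.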
Binary Classification
Loss Function
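In binary classification, the goal is to assign the data $x$ to one of two discrete classes $y \in \{0, 1\}$. In this context, we refer to $y$ as a label. First, we choose a probability distribution over the output space $y \in \{0, 1\}$. A suitable choice is the Bernoulli distribution, which is defined over the domain $\{0, 1\}$. It has a single parameter $\lambda \in [0, 1]$, the probability that $y = 1$:
$$Pr(y \mid \lambda) = (1 - \lambda)^{1 - y} \cdot \lambda^{y}$$
Second, we set the machine learning model $f[x, \phi]$ to predict the single distribution parameter $\lambda$. However, $\lambda$ can only take on values in the range $[0, 1]$, and we cannot guarantee that the network output will lie in this range. Consequently, we pass the network output through a function that maps the real numbers $\mathbb{R}$ to $[0, 1]$. A suitable function is the logistic sigmoid, $\operatorname{sig}[z] = 1 / (1 + \exp[-z])$. Hence, we predict the distribution parameter as $\lambda = \operatorname{sig}[f[x, \phi]]$. The likelihood is now
$$Pr(y \mid x) = \big(1 - \operatorname{sig}[f[x, \phi]]\big)^{1 - y} \cdot \operatorname{sig}[f[x, \phi]]^{y}$$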
Definition (Binary Cross Entropy Loss)
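The loss function is the negative log-likelihood of the training set, which is also known as the binary cross-entropy loss:
$$L[\phi] = \sum_{i=1}^{I} -(1 - y_i) \log \big[ 1 - \operatorname{sig}[f[x_i, \phi]] \big] - y_i \log \big[ \operatorname{sig}[f[x_i, \phi]] \big]$$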
Note
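This assumes a single model output, in which case you would use a binary cross-entropy loss such as PyTorch's BCELoss, which expects the sigmoid-transformed output (BCEWithLogitsLoss applies the sigmoid itself). For two outputs (one per class), you would use a multiclass cross-entropy loss such as CrossEntropyLoss.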
Inference
The transformed model output $\operatorname{sig}[f[x, \phi]]$ predicts the parameter $\lambda$ of the Bernoulli distribution. This represents the probability that $y = 1$, and it follows that $1 - \lambda$ represents the probability that $y = 0$.
When we perform inference, we may want a point estimate of $y$, so we set $y = 1$ if $\lambda > 0.5$, and $y = 0$ otherwise.
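The binary cross-entropy loss and the thresholding rule can be sketched in plain Python as follows. The function names are illustrative. The loss is written in terms of the raw network output (the logit) using the identity $-\log \operatorname{sig}[z] = \operatorname{softplus}[-z]$, a standard rearrangement that avoids taking the log of a number rounded to zero.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_cross_entropy(logits, labels):
    """Sum over examples of -(1 - y) * log(1 - lam) - y * log(lam), lam = sigmoid(z)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # log(1 - sigmoid(z)) = -softplus(z) and log(sigmoid(z)) = z - softplus(z),
        # so the per-example loss simplifies to softplus(z) - y * z.
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += softplus - y * z
    return total

def predict_label(logit):
    """Point estimate: y = 1 if lambda > 0.5, else y = 0."""
    return 1 if sigmoid(logit) > 0.5 else 0

loss = binary_cross_entropy([2.0, -1.0, 0.0], [1, 0, 1])
```

A logit of zero gives $\lambda = 0.5$, which the rule above assigns to class 0 because the inequality is strict.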
Multiclass Classification
Loss Function
The goal of multiclass classification is to assign an input data example $x$ to one of $K > 2$ classes. First, we choose a distribution over the prediction space $y$. In this case, $y \in \{1, 2, \ldots, K\}$, so we choose the categorical distribution. This has $K$ parameters $\lambda_1, \lambda_2, \ldots, \lambda_K$, which determine the probability of each category:
$$Pr(y = k) = \lambda_k$$
The parameters are constrained to take values between zero and one, and they must collectively sum to one to ensure a valid probability distribution.
Then we use a network $f[x, \phi]$ with $K$ outputs to compute these $K$ parameters from the input $x$. Similar to binary classification, we need to constrain the network output to satisfy the constraints of a probability distribution. A suitable choice is the softmax function:
Definition (Softmax)
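The softmax function takes an arbitrary vector $z$ of length $K$ and returns a vector of the same length but where the elements are now in the range $[0, 1]$ and sum to one. The $k$th output of the softmax function is
$$\operatorname{softmax}_k[z] = \frac{\exp[z_k]}{\sum_{k'=1}^{K} \exp[z_{k'}]}$$
where the exponential function ensures positivity, and the sum in the denominator ensures the $K$ numbers sum to one.
The likelihood that input $x$ has label $y = k$ is hence
$$Pr(y = k \mid x) = \operatorname{softmax}_k \big[ f[x, \phi] \big]$$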
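A direct implementation of the softmax definition is sketched below. Subtracting the largest input before exponentiating is a common numerical safeguard added here; it cancels between numerator and denominator, so the result is unchanged.

```python
import math

def softmax(z):
    """Map K real numbers to K positive numbers that sum to one."""
    m = max(z)  # shift by the max to avoid overflow in exp; the output is unaffected
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical network outputs for K = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
```

Larger inputs receive larger probabilities, and adding the same constant to every input leaves the output unchanged.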
Definition (Multiclass Cross Entropy Loss)
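The multiclass cross-entropy loss function is the negative log-likelihood of the training data:
$$L[\phi] = -\sum_{i=1}^{I} \log \Big[ \operatorname{softmax}_{y_i} \big[ f[x_i, \phi] \big] \Big] = -\sum_{i=1}^{I} \left( f_{y_i}[x_i, \phi] - \log \left[ \sum_{k'=1}^{K} \exp \big[ f_{k'}[x_i, \phi] \big] \right] \right)$$
where $f_{y_i}[x_i, \phi]$ and $f_{k'}[x_i, \phi]$ denote the $y_i$th and $k'$th outputs of the network respectively.
The transformed model output $\operatorname{softmax}[f[x, \phi]]$ represents a categorical distribution over possible classes $y \in \{1, 2, \ldots, K\}$. For a point estimate, we take the most probable category $\hat{y} = \operatorname{argmax}_k \big[ Pr(y = k \mid f[x, \hat{\phi}]) \big]$.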
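The multiclass loss and the point estimate can be sketched as below, using the second form of the loss (the correct-class output minus the log of the summed exponentials). The helper names are illustrative, and class labels are 0-based here to match Python indexing, whereas the text numbers classes from 1 to $K$.

```python
import math

def log_sum_exp(z):
    """log(sum_k exp(z_k)), shifted by the max for numerical stability."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def multiclass_cross_entropy(outputs, labels):
    """Sum over examples of -(f_y - log sum_k exp f_k); labels are 0-based."""
    return sum(log_sum_exp(z) - z[y] for z, y in zip(outputs, labels))

def predict_class(z):
    """Most probable class. Softmax preserves ordering, so argmax of raw outputs suffices."""
    return max(range(len(z)), key=lambda k: z[k])

# Hypothetical network outputs for two examples with K = 3 classes.
outputs = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
labels = [1, 0]
loss = multiclass_cross_entropy(outputs, labels)
```

With $K = 2$ and the first output fixed at zero, this loss reduces to the binary cross-entropy applied to the second output.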
Cross-Entropy as Distribution Matching
The negative log-likelihood objective used in maximum likelihood can also be interpreted as a distribution-matching objective. Instead of thinking only in terms of individual examples, we can think of learning as trying to make the model distribution match the empirical distribution induced by the observed dataset.
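For a dataset $\{x_i\}_{i=1}^{N}$ of $N$ observations, the negative log-likelihood is
$$-\sum_{i=1}^{N} \log p_{\phi}(x_i)$$
Dividing by $N$ gives the average loss per example,
$$-\frac{1}{N} \sum_{i=1}^{N} \log p_{\phi}(x_i) = \mathbb{E}_{x \sim \hat{p}_{\text{data}}} \big[ -\log p_{\phi}(x) \big]$$
which is the empirical expectation of $-\log p_{\phi}(x)$ under the data distribution $\hat{p}_{\text{data}}$. This is precisely the cross-entropy from $\hat{p}_{\text{data}}$ to $p_{\phi}$:
$$H(\hat{p}_{\text{data}}, p_{\phi}) = -\mathbb{E}_{x \sim \hat{p}_{\text{data}}} \big[ \log p_{\phi}(x) \big]$$
Cross-entropy decomposes as
$$H(\hat{p}_{\text{data}}, p_{\phi}) = H(\hat{p}_{\text{data}}) + D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\phi})$$
where $H(\hat{p}_{\text{data}})$ is the entropy of the data distribution and $D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\phi})$ is the Kullback-Leibler divergence. Since $H(\hat{p}_{\text{data}})$ does not depend on the model parameters $\phi$, minimizing cross-entropy is equivalent to minimizing the KL divergence:
$$\underset{\phi}{\operatorname{argmin}} \, H(\hat{p}_{\text{data}}, p_{\phi}) = \underset{\phi}{\operatorname{argmin}} \, D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\phi})$$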
So minimizing negative log-likelihood, minimizing cross-entropy, and making the model distribution as close as possible to the empirical data distribution in the KL sense are all the same optimization problem.
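The equivalence can be verified on a small discrete example. The observations and the model distribution below are made up for illustration. The sketch checks that the average negative log-likelihood equals the cross-entropy from the empirical distribution to the model, and that this in turn equals the entropy plus the KL divergence.

```python
import math
from collections import Counter

def empirical_distribution(samples):
    """Relative frequency of each observed value."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}

def cross_entropy(p, q):
    """-sum_x p(x) log q(x), summed over the support of p."""
    return -sum(p[v] * math.log(q[v]) for v in p)

def entropy(p):
    """-sum_x p(x) log p(x)."""
    return -sum(pv * math.log(pv) for pv in p.values())

def kl_divergence(p, q):
    """sum_x p(x) log(p(x) / q(x))."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)

samples = ["a", "a", "b", "c"]             # hypothetical observations
p_data = empirical_distribution(samples)   # {"a": 0.5, "b": 0.25, "c": 0.25}
q_model = {"a": 0.4, "b": 0.4, "c": 0.2}   # hypothetical model distribution

avg_nll = -sum(math.log(q_model[s]) for s in samples) / len(samples)
```

Only the KL term changes when `q_model` changes, so it is the only part of the cross-entropy that training can reduce.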
Sources
Prince, S. (2023). Understanding Deep Learning. Chapter 5.