Regularization

Regularization techniques are a family of methods that reduce the generalization gap between training and testing performance. Strictly speaking, regularization involves adding explicit terms to the loss function that favor certain parameter choices.

Explicit Regularization

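Consider fitting a model $f[x, \phi]$ with parameters $\phi$ using a training set $\{x_i, y_i\}_{i=1}^{I}$ of input/output pairs. We seek the parameters $\hat{\phi}$ that minimize the loss function $L[\phi]$:

$$\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \Big[ L[\phi] \Big] = \underset{\phi}{\mathrm{argmin}} \left[ \sum_{i=1}^{I} \ell_i[x_i, y_i] \right],$$

where the individual loss terms $\ell_i[x_i, y_i]$ measure the mismatch between the network predictions $f[x_i, \phi]$ and the output targets $y_i$ for each training pair. To bias the minimization towards certain solutions, we include an additional term:

$$\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \left[ \sum_{i=1}^{I} \ell_i[x_i, y_i] + \lambda \cdot g[\phi] \right],$$

where $g[\phi]$ is a function that returns a scalar that takes larger values when the parameters are less preferred. The term $\lambda$ is a positive scalar that controls the relative contribution of the original loss function and the regularization term. The minima of the regularized loss function usually differ from those of the original, so the training procedure converges to different parameter values.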

Probabilistic Interpretation

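The regularization term can be considered as a prior $Pr(\phi)$ that represents knowledge about the parameters before we observe the data. We now have the maximum a posteriori (MAP) criterion:

$$\hat{\phi} = \underset{\phi}{\mathrm{argmax}} \left[ \prod_{i=1}^{I} Pr(y_i \mid x_i, \phi) \, Pr(\phi) \right].$$

Moving back to the negative log-likelihood loss function by taking the log and multiplying by minus one, we see that $\lambda \cdot g[\phi] = -\log[Pr(\phi)]$.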

L2 Regularization

Definition (L2 Norm)

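The most commonly used regularization term is the (squared) L2 norm, which penalizes the sum of squares of the parameter values:

$$\hat{\phi} = \underset{\phi}{\mathrm{argmin}} \left[ L[\phi] + \lambda \sum_{j} \phi_j^2 \right],$$

where $j$ indexes the parameters. This is also referred to as Tikhonov regularization or, in the context of linear regression, ridge regression.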
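
As a worked example of the probabilistic interpretation, suppose the prior over each parameter is an independent zero-mean Gaussian with variance $\sigma^2$. This prior and the symbol $\sigma^2$ are assumptions introduced here for illustration. The negative log prior then recovers the L2 penalty up to an additive constant:

$$\begin{aligned}
Pr(\phi) &= \prod_{j} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{\phi_j^2}{2\sigma^2} \right] \\
-\log[Pr(\phi)] &= \frac{1}{2\sigma^2} \sum_{j} \phi_j^2 + \text{const}
\end{aligned}$$

Under this assumption, the regularization weight corresponds to $\lambda = 1/(2\sigma^2)$, so a narrower prior gives a stronger penalty.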

For neural networks, L2 regularization is usually applied to the weights but not the biases and is hence referred to as weight decay. The effect is to encourage smaller weights, so the output function is smoother.

This might improve test performance for two reasons:

  • If the network is overfitting, then adding the regularization term means that the network must trade off slavish adherence to the data against the desire to be smooth. One way to think about this is that the error due to variance reduces (the model no longer needs to pass through every data point) at the cost of increased bias (the model can only describe smooth functions).
  • When the network is over-parameterized, some of the extra model capacity describes areas with no training data. Hence, the regularization term will favor functions that smoothly interpolate between the nearby points. This is reasonable behavior in the absence of knowledge about the true function.
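
The shrinking effect of the L2 term can be sketched on a toy problem. The example below is a minimal illustration, not a reference implementation: it fits a line by gradient descent with and without the penalty. The data values, step size, and function names are all hypothetical, and the penalty is applied to the weight but not the bias.

```python
def gradient(phi, data, lam):
    """Gradient of sum_i (bias + weight * x_i - y_i)**2 + lam * weight**2."""
    bias, weight = phi
    d_bias = sum(2 * (bias + weight * x - y) for x, y in data)
    # The L2 term contributes 2 * lam * weight; the bias is left unpenalized.
    d_weight = sum(2 * (bias + weight * x - y) * x for x, y in data) + 2 * lam * weight
    return d_bias, d_weight


def fit(data, lam, step=0.01, n_steps=5000):
    """Plain gradient descent on the (regularized) loss, starting from zero."""
    phi = (0.0, 0.0)
    for _ in range(n_steps):
        g = gradient(phi, data, lam)
        phi = (phi[0] - step * g[0], phi[1] - step * g[1])
    return phi


data = [(-1.0, -2.1), (0.0, 0.2), (1.0, 1.9), (2.0, 4.1)]  # made-up training pairs
unregularized = fit(data, lam=0.0)
regularized = fit(data, lam=5.0)
print("weight without penalty:", unregularized[1])
print("weight with penalty:   ", regularized[1])
```

With the penalty, the fitted weight is pulled towards zero, which is the "smaller weights, smoother function" effect described above.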

Implicit Regularization

An intriguing recent finding is that neither gradient descent nor stochastic gradient descent moves neutrally to the minimum of the loss function; each exhibits a preference for some solutions over others. This is known as implicit regularization.

Implicit Regularization in Gradient Descent

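Consider a continuous version of gradient descent where the step size is infinitesimal. The change in parameters $\phi$ will be governed by the differential equation:

$$\frac{d\phi}{dt} = -\frac{\partial L}{\partial \phi}.$$

Gradient descent approximates this process with a series of discrete steps of size $\alpha$:

$$\phi_{t+1} = \phi_t - \alpha \frac{\partial L[\phi_t]}{\partial \phi}.$$

The discretization causes a deviation from the continuous path.

This deviation can be understood by deriving a modified loss term $\tilde{L}_{GD}[\phi]$ for the continuous case that arrives at the same place as the discretized version on the original loss $L[\phi]$. It can be shown that this modified loss is:

$$\tilde{L}_{GD}[\phi] = L[\phi] + \frac{\alpha}{4} \left\lVert \frac{\partial L}{\partial \phi} \right\rVert^2.$$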

Intuition

The extra term penalizes places where the slope is large. So gradient descent is trying to get to low loss, but also tends to avoid regions where the loss changes too sharply.

This doesn’t change the position of the minima where the gradients are zero anyway. However, it changes the effective loss function elsewhere and modifies the optimization trajectory, which potentially converges to a different minimum.
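
A small numerical check can make this concrete. The sketch below assumes a toy one-dimensional loss $L[\phi] = \phi^2$ (an illustrative choice) and the modified loss $L[\phi] + \frac{\alpha}{4} (\partial L / \partial \phi)^2$. It compares one discrete gradient descent step of size $\alpha$ with continuous gradient flow, approximated by many tiny Euler steps, run for time $\alpha$ on the original and on the modified loss. All function names are hypothetical.

```python
def grad_loss(phi):
    """Gradient of the toy loss L[phi] = phi**2."""
    return 2.0 * phi


def hess_loss(phi):
    """Second derivative of the toy loss."""
    return 2.0


def grad_modified(phi, alpha):
    """Gradient of L[phi] + (alpha / 4) * (dL/dphi)**2 in one dimension."""
    return grad_loss(phi) + (alpha / 2.0) * grad_loss(phi) * hess_loss(phi)


def flow(grad_fn, phi, duration, n_substeps=10000):
    """Approximate continuous gradient flow with many tiny Euler steps."""
    dt = duration / n_substeps
    for _ in range(n_substeps):
        phi = phi - dt * grad_fn(phi)
    return phi


alpha = 0.1
phi0 = 1.0
discrete = phi0 - alpha * grad_loss(phi0)        # one gradient descent step
original_flow = flow(grad_loss, phi0, alpha)     # flow on L for time alpha
modified_flow = flow(lambda p: grad_modified(p, alpha), phi0, alpha)

print("discrete step:        ", discrete)
print("flow on original loss:", original_flow)
print("flow on modified loss:", modified_flow)
```

For this toy loss, the flow on the modified loss lands closer to the discrete step than the flow on the original loss does, which is the sense in which the discrete algorithm behaves as if it were minimizing the modified loss.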

Tip

Implicit regularization due to gradient descent may be responsible for the observation that full-batch gradient descent generalizes better with larger step sizes.

Implicit Regularization in Stochastic Gradient Descent

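A similar analysis can be applied to stochastic gradient descent. Now we seek a modified loss function such that the continuous version reaches the same place as the average of the possible random SGD updates. This can be shown to be:

$$\tilde{L}_{SGD}[\phi] = \tilde{L}_{GD}[\phi] + \frac{\alpha}{4B} \sum_{b=1}^{B} \left\lVert \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\rVert^2 = L[\phi] + \frac{\alpha}{4} \left\lVert \frac{\partial L}{\partial \phi} \right\rVert^2 + \frac{\alpha}{4B} \sum_{b=1}^{B} \left\lVert \frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi} \right\rVert^2.$$

Here, $L_b$ is the loss for the $b$th of the $B$ batches in an epoch, and both $L$ and $L_b$ now represent the means of the $I$ individual losses in the full dataset and the $|\mathcal{B}|$ individual losses in the batch, respectively:

$$L = \frac{1}{I} \sum_{i=1}^{I} \ell_i[x_i, y_i], \qquad L_b = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_b} \ell_i[x_i, y_i].$$

The equation for $\tilde{L}_{SGD}$ reveals an extra regularization term, which corresponds to the variance of the gradients of the batch losses $L_b$ about the full gradient $\partial L / \partial \phi$.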

Intuition

In other words, SGD implicitly favors places where the gradients are stable (where all the batches agree on the slope).

Once more, this modifies the trajectory of the optimization process but does not necessarily change the position of the global minimum. If the model is over-parameterized, then it may fit all the training data exactly, so each of these gradient terms will be zero at the global minimum.
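
The extra SGD term, $\frac{\alpha}{4B} \sum_b \lVert \partial L_b / \partial \phi - \partial L / \partial \phi \rVert^2$, can be evaluated directly on a toy problem. The sketch below assumes a one-parameter model $y \approx \phi x$ with a mean squared error loss and two equal-sized batches of noise-free data. The data and function names are hypothetical.

```python
def batch_gradient(phi, batch):
    """Gradient of the mean squared error of y ≈ phi * x over one batch."""
    return sum(2.0 * (phi * x - y) * x for x, y in batch) / len(batch)


def sgd_extra_term(phi, batches, alpha):
    """(alpha / (4 B)) * sum_b (dL_b/dphi - dL/dphi)**2 for a scalar parameter."""
    all_points = [pair for batch in batches for pair in batch]
    full_grad = batch_gradient(phi, all_points)
    n_batches = len(batches)
    spread = sum((batch_gradient(phi, b) - full_grad) ** 2 for b in batches)
    return alpha / (4.0 * n_batches) * spread


# Noise-free data generated by y = 2 x, split into B = 2 batches of equal size.
batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
away_from_fit = sgd_extra_term(0.0, batches, alpha=0.1)  # batches disagree on the slope
at_exact_fit = sgd_extra_term(2.0, batches, alpha=0.1)   # every batch gradient is zero
print(away_from_fit, at_exact_fit)
```

Away from the solution the batches disagree on the gradient and the term is positive; at the parameter value that fits every point exactly, all batch gradients vanish and so does the term.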

Tip

SGD appears to generalize better than gradient descent, and smaller batch sizes generally perform better than larger ones. One possible explanation is that the inherent randomness allows the algorithm to reach different parts of the loss function. However, it's also possible that some or all of this performance increase is due to implicit regularization; this encourages solutions where all the data fits well (so the batch variance is small) rather than solutions where some of the data fit extremely well and other data less well. The former solutions are likely to generalize better.

Heuristics to Improve Performance

Early Stopping

Definition (Early Stopping)

Early stopping refers to stopping the training procedure before it has fully converged. A single hyperparameter represents the number of steps after which learning is terminated, and it can be selected without the need to train multiple models: the model is trained once, the performance on the validation set is monitored every $T$ iterations, and the associated parameters are stored. The stored parameters for which the validation performance was best are then selected.

This can reduce overfitting if the model has already captured the coarse shape of the underlying function but has not yet had time to overfit to the noise.

Intuition

Since weights are initialized to small values, they simply don’t have time to become large, so early stopping has a similar effect to explicit L2 regularization. This reduces the model complexity, and we move back down the bias/variance trade-off curve from the critical region, and performance improves.
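
The procedure in the definition (train once, check validation performance periodically, keep the best stored parameters) can be sketched as a small generic loop. This is a minimal illustration: `step_fn` and `val_fn` are hypothetical callables standing in for one training update and a validation-loss evaluation, and the toy dynamics below are made up so that training overshoots the validation optimum.

```python
import copy


def train_with_early_stopping(params, step_fn, val_fn, n_steps, check_every):
    """Train once, check validation loss every `check_every` steps, keep the best snapshot."""
    best_params = copy.deepcopy(params)
    best_val = val_fn(params)
    best_step = 0
    for step in range(1, n_steps + 1):
        params = step_fn(params)
        if step % check_every == 0:
            val = val_fn(params)
            if val < best_val:
                best_params, best_val, best_step = copy.deepcopy(params), val, step
    return best_params, best_step, params


# Toy stand-ins: training pulls the parameter toward 3.0,
# but the validation loss is lowest at 2.0.
step_fn = lambda p: p + 0.1 * (3.0 - p)
val_fn = lambda p: (p - 2.0) ** 2

best_p, best_step, final_p = train_with_early_stopping(
    0.0, step_fn, val_fn, n_steps=100, check_every=2
)
print("best step:", best_step, "best parameter:", best_p, "final parameter:", final_p)
```

The returned snapshot is the one with the lowest monitored validation loss, not the final parameters, so the number of training steps is effectively chosen after the fact from a single run.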

Ensembling

Definition (Ensembling)

Another approach to reducing the generalization gap between training and testing data is to build several models and average their predictions. A group of such models is known as an ensemble. One way to train different models is just to use different random initializations. A second approach is to generate several different datasets by re-sampling the training data with replacement. This is known as bootstrapping.

This technique reliably improves test performance at the cost of training and storing multiple models and performing inference multiple times. The model outputs can be combined by taking their mean (of the pre-softmax activations for classification).

Intuition

The assumption is that the model errors are independent and will cancel out.
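
A bootstrap ensemble can be sketched in a few lines. The example below is an illustration under simple assumptions: each model is a least-squares line, each bootstrap dataset is drawn from the training set with replacement, and predictions are combined by taking the mean. The synthetic data and function names are hypothetical.

```python
import random


def fit_line(data):
    """Closed-form least-squares fit of y ≈ a + b * x."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx if sxx > 0 else 0.0  # guard against a degenerate resample
    return mean_y - slope * mean_x, slope


def bootstrap_ensemble(data, n_models, rng):
    """Fit one model per dataset drawn from `data` with replacement."""
    models = []
    for _ in range(n_models):
        resampled = [rng.choice(data) for _ in data]
        models.append(fit_line(resampled))
    return models


def ensemble_predict(models, x):
    """Combine the models by averaging their predictions."""
    return sum(a + b * x for a, b in models) / len(models)


rng = random.Random(0)
# Synthetic data from y = 2 x + 1 plus Gaussian noise.
data = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0.0, 0.3)) for i in range(20)]
models = bootstrap_ensemble(data, n_models=25, rng=rng)
prediction = ensemble_predict(models, 1.0)
print(prediction)
```

Using different random initializations instead of resampling would follow the same pattern, with only the way each member is trained changing.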

Dropout

Definition (Dropout)

Dropout clamps a random subset of hidden units to zero at each iteration of SGD. This makes the network less dependent on any given hidden unit and encourages the weights to have smaller magnitudes so that the change in the function due to the presence or absence of any specific hidden unit is reduced.

Tip

This technique has the benefit that it can eliminate undesirable kinks in the function that are far from the training data and don’t affect the loss.

Caution

At test time, we can run the network as usual with all the hidden units active; however, the network now has more active hidden units than it was trained with at any given iteration, so we multiply the weights by one minus the dropout probability to compensate.
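
The test-time rescaling can be checked numerically. The sketch below assumes a single linear unit fed by four hidden activations. It compares the average training-time pre-activation under random dropout with the test-time pre-activation computed using weights scaled by one minus the dropout probability. The weights, activations, and function names are hypothetical.

```python
import random


def preactivation_train(weights, hidden, drop_prob, rng):
    """Training-time pass: each hidden unit is clamped to zero with probability drop_prob."""
    kept = [h if rng.random() >= drop_prob else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(weights, kept))


def preactivation_test(weights, hidden, drop_prob):
    """Test-time pass: all units active, weights scaled by (1 - drop_prob)."""
    return sum((1.0 - drop_prob) * w * h for w, h in zip(weights, hidden))


weights = [0.5, -1.0, 2.0, 1.0]
hidden = [1.0, 2.0, 3.0, 4.0]
drop_prob = 0.5
rng = random.Random(0)

n_trials = 20000
train_mean = sum(
    preactivation_train(weights, hidden, drop_prob, rng) for _ in range(n_trials)
) / n_trials
test_value = preactivation_test(weights, hidden, drop_prob)
print("mean over dropout masks:", train_mean)
print("scaled test-time value: ", test_value)
```

The two values agree up to sampling noise, which is why the scaling compensates for the extra active units in expectation.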

Applying Noise

Dropout can be interpreted as applying multiplicative Bernoulli noise to the network activations. This leads to the idea of applying noise to other parts of the network during training to make the final model more robust.

  • One option is to add noise to the input data; this smooths out the learned function. An extreme variant is adversarial training, in which the optimization algorithm actively searches for small perturbations of the input that cause large changes to the output.
  • A second possibility is to add noise to the weights. This encourages the network to make sensible predictions even for small perturbations of the weights. The result is that the training converges to local minima in the middle of wide, flat regions, where changing the individual weights does not matter much.
  • Finally, we can perturb the labels. The maximum likelihood criterion for multiclass classification aims to predict the correct class with absolute certainty. This means the final network activations before the softmax are pushed to very large values for the correct class and very small values for the wrong classes. We could discourage this overconfident behavior by assuming that a portion of the training labels are incorrect and belong with equal probability to the other classes. This could be done by randomly changing the labels at each training iteration, or by changing the loss function to minimize the cross-entropy between the predicted distribution and a distribution where the true label has a probability slightly below one (for example, 0.9) and the other classes share the remainder equally. This is known as label smoothing.
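
The label-smoothing variant of the loss can be sketched as follows. This is a minimal illustration: the default true-class probability of 0.9 is an assumed value, and the function names and logits are hypothetical. The target distribution puts `true_prob` on the true class and splits the remainder equally among the other classes.

```python
import math


def smoothed_target(true_class, n_classes, true_prob=0.9):
    """Target distribution: true_prob on the true class, the rest shared equally."""
    other = (1.0 - true_prob) / (n_classes - 1)
    return [true_prob if k == true_class else other for k in range(n_classes)]


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))


target = smoothed_target(true_class=1, n_classes=3)
confident = cross_entropy(target, [0.0, 20.0, 0.0])  # extremely large logit for the true class
moderate = cross_entropy(target, [0.0, 3.0, 0.0])    # more modest logit
print(target, confident, moderate)
```

With the smoothed target, the extremely confident prediction incurs a higher loss than the moderate one, so the pre-softmax activations are no longer pushed towards infinity.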

Bayesian Inference

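The maximum likelihood approach is generally overconfident; it selects the most likely parameters during training and uses these to make predictions. However, many parameter values may be broadly compatible with the data and only slightly less likely. The Bayesian approach treats the parameters $\phi$ as unknown variables and computes a distribution $Pr(\phi \mid \{x_i, y_i\})$ over these parameters conditioned on the training set $\{x_i, y_i\}$ using Bayes' rule:

$$Pr(\phi \mid \{x_i, y_i\}) = \frac{\prod_{i=1}^{I} Pr(y_i \mid x_i, \phi) \, Pr(\phi)}{\int \prod_{i=1}^{I} Pr(y_i \mid x_i, \phi) \, Pr(\phi) \, d\phi},$$

where $Pr(\phi)$ is the prior probability of the parameters, and the denominator is a normalizing term. The prediction $y$ for a new input $x$ is an infinite weighted sum (i.e., an integral) of the predictions for each parameter set, where the weights are the associated probabilities:

$$Pr(y \mid x, \{x_i, y_i\}) = \int Pr(y \mid x, \phi) \, Pr(\phi \mid \{x_i, y_i\}) \, d\phi.$$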

This is effectively a weighted ensemble, where the weight depends on:

  • the prior probability of the parameters, and
  • their agreement with the data
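
For a model with a single parameter, the weighted ensemble can be approximated on a grid. The sketch below is an illustration under strong assumptions: a one-parameter model $y \approx \phi x$, Gaussian noise, a zero-mean Gaussian prior, and a finite grid of slopes standing in for the integral. The data, standard deviations, and function names are hypothetical.

```python
import math


def grid_posterior(data, grid, noise_std=1.0, prior_std=2.0):
    """Posterior over slopes phi on a discrete grid: likelihood times prior, normalized."""
    log_weights = []
    for phi in grid:
        log_lik = sum(-0.5 * ((y - phi * x) / noise_std) ** 2 for x, y in data)
        log_prior = -0.5 * (phi / prior_std) ** 2
        log_weights.append(log_lik + log_prior)
    shift = max(log_weights)  # avoid underflow when exponentiating
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)  # plays the role of the normalizing denominator
    return [w / total for w in weights]


def predict(x_new, grid, posterior):
    """Weighted sum of the predictions phi * x_new, weighted by posterior probability."""
    return sum(p * phi * x_new for phi, p in zip(grid, posterior))


data = [(1.0, 2.2), (2.0, 3.9), (3.0, 6.1)]
grid = [i / 100 for i in range(-500, 501)]
posterior = grid_posterior(data, grid)
prediction = predict(2.0, grid, posterior)
print(prediction)
```

Each grid point acts as one ensemble member, and its weight combines the prior with its agreement with the data. This brute-force approach does not scale beyond a handful of parameters.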

Caution

For complex models like neural networks, there is no practical way to represent the full probability distribution over the parameters or to integrate over it during the inference phase. Consequently, all current methods of this type make approximations of some kind, and typically these add considerable complexity to learning and inference.

Transfer Learning and Multi-Task Learning

When training data are limited, other datasets can be exploited to improve performance:

Transfer Learning

Definition (Transfer Learning)

In transfer learning, the network is pre-trained to perform a related secondary task for which data are more plentiful. The resulting model is then adapted to the original task. This is typically done by removing the last layer and adding one or more layers that produce a suitable output. The main model may be fixed, and the new layers trained for the original task, or we may fine-tune the entire model.

The principle is that the network will build a good internal representation of the data from the secondary task, which can subsequently be exploited for the original task. Equivalently, transfer learning can be viewed as initializing most of the parameters of the final network in a sensible part of the space that is likely to produce a good solution.
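
The "fixed main model, new trained layers" variant can be sketched with a toy stand-in. In the example below, the pre-trained body is just a fixed set of tanh features with made-up weights (a hypothetical placeholder for a real pre-trained network), and only a new linear head is trained on the target task.

```python
import math


def body(x, body_weights):
    """Stand-in for a pre-trained network body: fixed tanh features of a scalar input."""
    return [math.tanh(w * x + b) for w, b in body_weights]


def head_output(head, feats):
    return sum(h * f for h, f in zip(head, feats))


def train_head(data, body_weights, step=0.05, n_steps=2000):
    """Train only a new linear head on top of the frozen body features."""
    head = [0.0] * len(body_weights)
    for _ in range(n_steps):
        grads = [0.0] * len(head)
        for x, y in data:
            feats = body(x, body_weights)
            error = head_output(head, feats) - y
            for k, f in enumerate(feats):
                grads[k] += 2.0 * error * f / len(data)
        # Only the head is updated; body_weights are never modified.
        head = [h - step * g for h, g in zip(head, grads)]
    return head


def mean_squared_error(data, body_weights, head):
    return sum((head_output(head, body(x, body_weights)) - y) ** 2 for x, y in data) / len(data)


pretrained_body = [(1.0, 0.0), (0.5, 1.0), (-1.5, 0.5)]  # hypothetical pre-trained values
target_data = [(x / 4, math.sin(x / 4)) for x in range(-8, 9)]
initial_error = mean_squared_error(target_data, pretrained_body, [0.0, 0.0, 0.0])
new_head = train_head(target_data, pretrained_body)
final_error = mean_squared_error(target_data, pretrained_body, new_head)
print(initial_error, final_error)
```

Fine-tuning the entire model would differ only in that the body weights would also receive gradient updates.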

Multi-Task Learning

Definition (Multi-task Learning)

Multi-task learning is a related technique in which the network is trained to solve several problems concurrently.

For example, the network might take an image and simultaneously learn to segment the scene, estimate the pixel-wise depth, and predict a caption describing the image. All of these tasks require some understanding of the image and, when learned simultaneously, the model performance for each may improve.

Self-Supervised Learning

The above discussion assumes that we have plentiful data for a secondary task or data for multiple tasks to be learned concurrently. If not, we can create large amounts of free labeled data using self-supervised learning and use this for transfer learning. There are two families of methods for self-supervised learning:

  • Generative self-supervised learning: Part of each data example is masked, and the secondary task is to predict the missing part. For example, we might use a corpus of unlabeled images and a secondary task that aims to inpaint (fill in) missing parts of the image. Similarly, we might use a large corpus of text and mask some of the words. We train the network to predict the missing words and then fine-tune it for the actual language task we are interested in.
  • Contrastive self-supervised learning: Pairs of examples with commonalities are compared to unrelated pairs. For images, the secondary task might be to identify whether a pair of images are transformed versions of one another or are unconnected. For text, the secondary task might be to determine whether two sentences follow one another in the original document. Sometimes, the precise relationship between a connected pair must be identified.
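
The way generative self-supervision produces labels for free can be sketched for text. The example below is a minimal illustration: it masks one randomly chosen word per sentence and uses the removed word as the training target. The mask token, corpus, and function name are hypothetical.

```python
import random


def make_masked_examples(sentences, rng, mask_token="[MASK]"):
    """Turn unlabeled sentences into (masked input, missing word) training pairs."""
    examples = []
    for sentence in sentences:
        words = sentence.split()
        position = rng.randrange(len(words))
        masked = words[:position] + [mask_token] + words[position + 1:]
        examples.append((" ".join(masked), words[position]))
    return examples


corpus = ["the cat sat on the mat", "gradient descent minimizes the loss"]
examples = make_masked_examples(corpus, random.Random(0))
for masked_input, target in examples:
    print(masked_input, "->", target)
```

No human annotation is needed: the targets come from the data itself, so the amount of training data is limited only by the size of the unlabeled corpus.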

Augmentation

Transfer learning improves performance by exploiting a different dataset. Multi-task learning improves performance using additional labels. A third option is to expand the dataset by transforming each input data example in such a way that the label remains the same. Examples include flipping or rotating images.
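
Label-preserving augmentation can be sketched on images stored as nested lists. This is a toy illustration with hypothetical function names. Whether a given transformation really leaves the label unchanged depends on the task, which is assumed here.

```python
def flip_horizontal(image):
    """Mirror an image stored as a list of rows."""
    return [list(reversed(row)) for row in image]


def rotate_90(image):
    """Rotate an image stored as a list of rows by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(dataset):
    """Add a flipped and a rotated copy of every image, keeping the label unchanged."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((flip_horizontal(image), label))
        augmented.append((rotate_90(image), label))
    return augmented


dataset = [([[1, 2], [3, 4]], "cat")]
augmented = augment(dataset)
for image, label in augmented:
    print(image, label)
```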

Additional Readings

TODO

  • Link applying noise to adversarial learning
  • Link Bayesian inference to BNNs

Sources

  • Prince, S. (2023). Understanding Deep Learning. Chapter 9.