Given a neural network and a loss function, the goal of model fitting is to find the parameter values of the neural network that minimize this loss. This is known as learning the network’s parameters, or simply training or fitting the model. The process is to choose initial parameter values and then iterate the following two steps:
Compute the derivative (gradients) of the loss with respect to the parameters, and
Adjust the parameters based on the gradients to decrease the loss.
Gradient Descent
To fit a model, we need a training set of input/output pairs. We seek parameters for the mode that map the inputs to outputs as closely as possible. We define a loss function that returns a single number that quantifies the mismatch in this mapping.
Definition (Optimization Algorithm)
The goal of an optimization algorithm is to find parameters that minimize the loss
There are many families of optimization algorithms, but the standard methods for training neural networks are iterative. These algorithms initialize the parameters heuristically and then adjust them repeatedly in such a way that the loss decreases.
The simplest method in this class is gradient descent.
Gradient Descent
Start with initial parameters and iterate between the two steps:
Compute the derivates of the loss with respect to the parameters:
Update the parameters according to the rule
where the positive scalar determines the magnitude of the change.
The first step computes the gradient of the loss at the current position, which determines the uphill direction of the loss function. The second step moves a small distance downhill. The parameter , called the learning rate, can be held fixed or can change each step. At the minimum of the loss function, the gradient will be zero, and the parameters will stop changing.
Local Minima and Saddle Points
If we start in a random position and use gradient descent to go downhill, there is no guarantee that we will end up at the global minimum and find the best parameters. The only guarantee of convergence to the global minimum is on convex functions. As it turns out, complex non-convex problems (like deep neural networks) are non-convex, and so gradient descent often settles on a local minimum or saddle point.
Stochastic Gradient Descent
Using gradient descent to find the global optimum of a high-dimension a loss function is challenging. We can find a minimum, but there is no way to tell whether this is the global minimum or even a good one. One of the main problems is that the final destination of a gradient descent algorithm is entirely determined by the starting point.
Stochastic gradient descent (SGD) attempts to remedy this problem by adding some noise to the gradient at each step. The solution still moves downhill on average, but any any given iteration the direction chosen is not necessarily in the steepest downhill direction (it may not be downhill at all). The SGD algorithm ahs the possibility of moving temporarily uphill and hence jumping from one “valley” of the loss function to another.
Batches and Epochs
The mechanism for introducing randomness is simple: at each iteration, the algorithm chooses a random subset of the training data and computes the gradient from these examples alone. This subset is known as a minibatch or batch for short. The update rule for the model parameters at iteration is hence
where is a set containing the indices of the input/output pairs in the current batch, and is the loss doe to the th pair. The batches are usually drawn without replacement. The algorithm works through the training examples until it has used all the data, at which point it starts sampling from the full training dataset again. A single pass through the entire training dataset is referred to as an epoch.
Properties of SGD
Although it adds noise to the trajectory, it still improves the fir to a subset of the data at each iteration. Hence, the updates tend to be sensible even if they are not optimal
Because it draws training examples without replacement and iterates through the dataset, the training examples all still contribute equally.
It is less computationally expensive to compute the gradient from just a subset of the training data
Example
As a rough memory model, suppose the network has parameters and each parameter uses bytes. Then storing the parameter vector itself costs bytes. If we very crudely assume that processing one training example requires another parameter-sized block of working memory, then computing a full-dataset gradient for examples needs about
bytes, whereas using a minibatch of size needs about
bytes. For example, if parameters and bytes (single-precision floats), then the model itself occupies about MB. A full batch over training examples would require roughly GB of working memory under this simplified model, while a minibatch with would require only about MB. The exact constants depend on the implementation, but the key point is that the memory requirement scales linearly with batch size, so using is much more practical.
It can (in principle) escape local minima
It reduces the chances of getting stuck near saddle points
There is some evidence that SGD finds parameters for neural networks that cause them to generalize well to new data
Extensions and Variants
Momentum
A common modification to stochastic gradient descent is to add a momentum term. We update the parameters with a weighted combination of the gradient computed from the current batch and the direction moved from the previous step:
where is the momentum (which drives the update at iteration ), controls the degree to which the gradient is smoothed over time, and is the learning rate. The recursive formulation of the momentum calculation means that the gradient step is an infinite weighted sum of all the previous gradients, where the weights get smaller as we move back in time.
Interpretation
The effective learning rage increases if all these gradients are aligned over multiple iterations, but decreases if the gradient direction repeatedly changes as the terms in the sum cancel out. The overall effect is a smoother trajectory and reduced oscillatory behaviour in the valleys.
Caution
Momentum can keep pushing in a direction that was useful a few steps ago but is no longer ideal now. Since the gradient is evaluated only at the current point, the update may react a little late to sharp curvature or to a nearby minimum, which can lead to overshooting.
Nesterov Accelerated Momentum
The momentum term can be considered a coarse prediction of where the SGD algorithm will move next. Nesterov accelerated momentum computes the gradients at this predicted point, rather than the current point
where now the gradients are evaluated at . One way to think about this is that the gradient term now corrects the path provided by momentum alone.
Intuition
Vanilla momentum uses the gradient at the current point, even though the momentum term is already carrying the parameters forward. Nesterov first looks ahead to where momentum alone would move the parameters, then measures the gradient there. This gives an earlier correction, so the update can slow down or turn before momentum overshoots.
Adam
Gradient descent with a fixed step size has the following undesirable property: it makes large adjustments to parameters associated with large gradients (where perhaps we should be more cautions) and small adjustments to parameters associated with small gradients (where we should perhaps explore further).
When the gradient of the loss surface is much steeper in one direction than another, it is difficult to choose a learning rate that
Makes good progress in both directions, and
Is stable
A straightforward approach is to normalize the gradients so that we move a fixed distance (governed by the learning rate) in each direction. To do this, we first measure the gradient and the pointwise squared gradient :
Then we apply the update rule
where the square root and division are both pointwise, is the learning rate, an is a small constant that prevents division by zero. The result is that the algorithm moves a fixed distance along each coordinate, where the direction is determined by whichever way is downhill.
Why square and then take the root?
In this simplified motivating example, , so pointwise. Therefore, the ratio
keeps only the sign of each component of and discards its magnitude. Writing it this way makes the normalization idea explicit, and later Adam will replace with a running average of squared gradients, at which point this form is no longer redundant.
Adaptive moment estimation, or Adam, takes this idea and adds momentum to bot the estimate of the gradient and the squared gradient:
where and are the momentum coefficients for the two statistics.
Using momentum is equivalent to taking an exponentially weighted average over the history of each of these statistics. Since both averages are initialized at zero, they are biased toward zero at the start of training: after only a few steps, the moving averages have not yet had time to accumulate their full mass. Consequently, we apply the bias-correction rule
For example, if the gradient were constant, then the raw moving average for would be times the true value, so dividing by exactly removes this initial shrinkage. The same logic applies to . Since and lie in , the powers and decay over time, the denominators approach one, and this correction quickly becomes negligible.
Why is this emphasized in Adam?
Ordinary SGD with momentum has the same zero-initialization bias in its momentum term, and in principle one could correct it there too. In practice, the effect is usually mild because the momentum term appears only in the numerator of the update, so the early steps are simply somewhat smaller. In Adam, however, the biased estimate appears in the denominator. If is underestimated, then we divide by a denominator that is too small, which can distort the step size much more strongly. So the correction is not unique to Adam, but it is much more important there.
Finally, we update the parameters as before, but with the modified terms
The result is an algorithm that can converge to the overall minimum and makes good progress in every direction in the parameter space.
The gradient magnitude of neural network parameters can depend on their depth in the network. Adam helps compensate for this tendency and balances out changes across different layers. In practice, Adam also has the advantage of being less sensitive to the initial learning.
Computing Gradients
To perform model fitting, we need to calculate gradients efficiently as neural networks can have a large number of parameters, and the gradient needs to be computed for every parameter at each iteration of the training algorithm. The derivatives of the loss tell us how the loss changes when we make a small change to the parameters. There are two observations that will provide some intuition:
Observation 1: Each weight (element of ) multiplies the activation at a source hidden unit and adds the result to a destination hidden unit in the next layer. It follows that the effect of any small change to the weight is amplified by the activation at the source hidden unit. Hence, we run the network for each data example in the batch and store the activations of all the hidden units. This is known as the forward pass. The stored activations will be used to compute the gradients.
Observation 2: A small change in a bias or weight causes a ripple effect of changes through the subsequent network. The change modifies the value of its destination hidden unit. This changes the values of the hidden units in the subsequent layers, which will change the hidden units in the layer after that, and so on, until a change is made to the model output, and finally the loss.
Hence, to know how changing a parameter modified the loss, we also need to know how changes to every subsequent hidden layer will, in turn, modify the successor. These same quantities are required when considering other parameters in the same or earlier layers. It follows that we can calculate them once and reuse them. As we move backward through the network, we see that most of the terms that we need were already calculated in the previous step, so we do not need to re-compute them. Proceeding backward through the network in this way is to compute the derivatives is known as the backward pass.
Backpropagation Algorithm
Backpropagation
Consider a deep neural network that takes input , has hidden layers with ReLU activations, and individual loss term . The goal of backpropagation is to compute the derivatives and with respect to the biases weights .
Forward Pass: We compute and store the following quantities
for .
Backward Pass: We start with the derivative of the loss function with respect to the network output and work backward through the network
for , where denotes pointwise multiplication and is a vector containing ones where is greater than zero and zeros elsewhere (gradient of ReLU). Finally, we compute the derivatives with respect to the first set of biases and weights:
We calculate the derivatives for every training example in the batch and sum them together to retrieve the gradient for the SGD update.
Caution
The backpropagation algorithm is efficient; the most demanding computational step in both the forward and backward pass is matrix multiplication. However, it is not memory efficient; the intermediate values in the forward pass must be all stored, and this can limit the size of the model we can train.
Sources
Prince, S. (2023). Understanding Deep Learning. Chapter 6.
Prince, S. (2023). Understanding Deep Learning. Chapter 7.