Measuring Performance
With sufficient capacity (i.e. number of hidden units), a neural network model will often perform perfectly on the training data. However, this does not necessarily mean it will generalize well to new test data.
Sources of Error
There are three possible sources of error: noise, bias, and variance.
Noise
The data generation process includes the addition of noise, so there are multiple possible valid outputs for each input.
Noise can arise because there is a genuine stochastic element to the data generation process. In rare cases, noise may be absent; however, noise is usually a fundamental limitation on the possible test performance.
Example: the same house could sell for slightly different prices on different days because of bargaining or luck; that extra unpredictability is noise.
Bias
A second potential source of error may occur because the model is not flexible enough to fit the true function perfectly. For example, a three-region neural network model cannot exactly describe a quasi-sinusoidal function, even when the parameters are chosen optimally. This is known as bias.
Variance
We have limited training examples, and there is no way to distinguish systematic changes in the underlying function from noise in the underlying data. When we fit a model, we do not get the closest possible approximation to the true underlying function. For different training datasets, the result will be slightly different each time. This additional source of variability in the fitted function is termed variance.
Example: if you retrain the model on a different sample of houses, the learned price rule shifts a bit; that sample-to-sample change is variance.
Mathematical Formulation of Test Error
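Consider a 1D regression problem where the data generation process has additive noise with variance $\sigma^2$. The same input $x$ can produce different outputs $y$, so for each $x$ there is a distribution $Pr(y|x)$ with mean

$$\mu[x] = \mathbb{E}_y\big[y[x]\big] = \int y[x]\, Pr(y|x)\, dy$$

and fixed noise $\sigma^2 = \mathbb{E}_y\big[(\mu[x] - y[x])^2\big]$.

Now consider a least squares loss between the model prediction $f[x, \phi]$ at position $x$ and the observed value $y[x]$ at that position:

$$
\begin{aligned}
L[x] &= (f[x, \phi] - y[x])^2 \\
&= \big((f[x, \phi] - \mu[x]) + (\mu[x] - y[x])\big)^2 \\
&= (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x])(\mu[x] - y[x]) + (\mu[x] - y[x])^2
\end{aligned}
$$

The underlying function is stochastic, so this loss depends on the particular $y[x]$ we observe. Taking the expectation over $y$, the cross term vanishes because $\mathbb{E}_y\big[\mu[x] - y[x]\big] = 0$, which leaves

$$\mathbb{E}_y\big[L[x]\big] = (f[x, \phi] - \mu[x])^2 + \sigma^2$$

The expected loss has been broken down into two terms: the squared deviation between the model and the true function mean, and the noise.

The first term can be further partitioned into bias and variance. The parameters $\phi$ of the model $f[x, \phi]$ depend on the training dataset $\mathcal{D} = \{x_i, y_i\}$, so more properly we should write $f[x, \phi[\mathcal{D}]]$. The expected model output over all possible training datasets is

$$f_\mu[x] = \mathbb{E}_{\mathcal{D}}\big[f[x, \phi[\mathcal{D}]]\big]$$

Returning to the first term from above, we add and subtract $f_\mu[x]$ and expand:

$$
\begin{aligned}
(f[x, \phi[\mathcal{D}]] - \mu[x])^2 &= \big((f[x, \phi[\mathcal{D}]] - f_\mu[x]) + (f_\mu[x] - \mu[x])\big)^2 \\
&= (f[x, \phi[\mathcal{D}]] - f_\mu[x])^2 + 2(f[x, \phi[\mathcal{D}]] - f_\mu[x])(f_\mu[x] - \mu[x]) + (f_\mu[x] - \mu[x])^2
\end{aligned}
$$

We then take expectation with respect to the dataset $\mathcal{D}$. The cross term vanishes because $\mathbb{E}_{\mathcal{D}}\big[f[x, \phi[\mathcal{D}]] - f_\mu[x]\big] = 0$ and $f_\mu[x] - \mu[x]$ does not depend on $\mathcal{D}$:

$$\mathbb{E}_{\mathcal{D}}\big[(f[x, \phi[\mathcal{D}]] - \mu[x])^2\big] = \mathbb{E}_{\mathcal{D}}\big[(f[x, \phi[\mathcal{D}]] - f_\mu[x])^2\big] + (f_\mu[x] - \mu[x])^2$$

Finally, we substitute this back into our original result:

$$\mathbb{E}_{\mathcal{D}}\Big[\mathbb{E}_y\big[L[x]\big]\Big] = \mathbb{E}_{\mathcal{D}}\big[(f[x, \phi[\mathcal{D}]] - f_\mu[x])^2\big] + (f_\mu[x] - \mu[x])^2 + \sigma^2$$

where the first term is the variance, the second term is the (squared) bias, and the third term is the noise. This equation says that the expected loss after considering uncertainty in the training data $\mathcal{D}$ and the test data $y$ consists of three additive components: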
- The variance is uncertainty in the fitted model due to the particular training dataset we sampled
- The bias is the systematic deviation of the model from the mean of the function we are modeling
- The noise is the inherent uncertainty in the true mapping from input to output
While these sources of error combine additively for regression tasks with a least squares loss, their interaction can be more complex for other types of problems.
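The additive decomposition can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes a toy true function `sin(2*pi*x)`, Gaussian noise with standard deviation 0.3, and a straight-line model fitted with `np.polyfit`. All of these choices, and every name in the code, are illustrative assumptions rather than part of the original text. It estimates variance, squared bias, and noise by refitting on many resampled training sets, then compares their sum against the test loss measured directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_mean(x):
    # Assumed true function mean mu[x] for this toy problem.
    return np.sin(2 * np.pi * x)

sigma = 0.3          # noise standard deviation (the noise term is sigma**2)
n_train = 15         # size of each training dataset
n_datasets = 2000    # number of independently sampled training datasets
x_test = np.linspace(0.0, 1.0, 50)

preds = np.empty((n_datasets, x_test.size))
direct_loss = np.empty(n_datasets)
for d in range(n_datasets):
    # Draw a fresh training dataset and fit a straight line to it.
    x = rng.uniform(0.0, 1.0, n_train)
    y = true_mean(x) + sigma * rng.normal(size=n_train)
    preds[d] = np.polyval(np.polyfit(x, y, deg=1), x_test)
    # Squared error against freshly drawn noisy test outputs.
    y_test = true_mean(x_test) + sigma * rng.normal(size=x_test.size)
    direct_loss[d] = np.mean((preds[d] - y_test) ** 2)

f_mean = preds.mean(axis=0)                            # estimate of the mean fitted model
variance = np.mean((preds - f_mean) ** 2)              # spread of fits around their mean
bias_sq = np.mean((f_mean - true_mean(x_test)) ** 2)   # squared gap to the true mean
noise = sigma ** 2
total = variance + bias_sq + noise

print(f"variance={variance:.4f}  bias^2={bias_sq:.4f}  noise={noise:.4f}")
print(f"sum of terms={total:.4f}  directly measured loss={direct_loss.mean():.4f}")
```

The two printed totals should agree up to Monte Carlo error. Because a straight line is too rigid for a sinusoid, the squared bias is expected to dominate in this particular setup.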
Reducing Error
Reducing Variance
Variance results from limited noisy training data. Fitting the model to two different training sets results in slightly different parameters. It follows that we can reduce the variance by increasing the quantity of training data. This averages out the inherent noise and ensures that the input space is well sampled.
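A small simulation can illustrate this effect. The sketch below assumes the same kind of toy setup (a sinusoidal true function, Gaussian noise, and a cubic polynomial model of fixed capacity); the function name `fitted_variance` and all parameter values are hypothetical choices made for illustration. It estimates the variance term for several training set sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitted_variance(n_train, n_datasets=500, sigma=0.3, degree=3):
    """Monte Carlo estimate of the variance term for a fixed-capacity model."""
    x_test = np.linspace(0.0, 1.0, 40)
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # Each iteration is a different training dataset of the same size.
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n_train)
        preds[d] = np.polyval(np.polyfit(x, y, deg=degree), x_test)
    # Variance of the prediction across datasets, averaged over test inputs.
    return float(np.mean(preds.var(axis=0)))

sizes = [10, 40, 160]
variances = [fitted_variance(n) for n in sizes]
for n, v in zip(sizes, variances):
    print(f"n_train={n:4d}  estimated variance={v:.5f}")
```

In this setup the estimated variance should shrink as the training set grows, while the capacity of the model (and hence its bias) stays the same.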
Reducing Bias
The bias term results from the inability of the model to describe the true underlying function. This suggests that we can reduce this error by making the model more flexible. This is usually done by increasing the model capacity. For neural networks, this means adding more hidden units and/or hidden layers.
Bias-Variance Tradeoff
There is an unexpected side effect of increasing the model capacity. For a fixed-size training dataset, the variance term typically increases as the model capacity increases. Consequently, increasing the model capacity does not necessarily reduce the test error. This is known as the bias-variance tradeoff.
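The tradeoff can be sketched with polynomial regression, using the polynomial degree as a stand-in for model capacity. This is an illustrative assumption, not a neural network experiment. To keep the estimates stable, the sketch also assumes a fixed set of training inputs and resamples only the noise, which is a simplification of "different training datasets". The helper name `bias_and_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma = 0.3
x_train = np.linspace(0.0, 1.0, 15)    # fixed inputs; only the noise is resampled
x_test = np.linspace(0.0, 1.0, 100)
mu_train = np.sin(2 * np.pi * x_train)  # assumed true function at training inputs
mu_test = np.sin(2 * np.pi * x_test)    # assumed true function at test inputs

def bias_and_variance(degree, n_datasets=1000):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        y = mu_train + sigma * rng.normal(size=x_train.size)
        preds[d] = np.polyval(np.polyfit(x_train, y, deg=degree), x_test)
    f_mean = preds.mean(axis=0)
    bias_sq = float(np.mean((f_mean - mu_test) ** 2))
    variance = float(np.mean((preds - f_mean) ** 2))
    return bias_sq, variance

results = {deg: bias_and_variance(deg) for deg in (1, 3, 9)}
for deg, (b, v) in results.items():
    print(f"degree {deg}: bias^2={b:.5f}  variance={v:.5f}  sum={b + v:.5f}")
```

With the training set size held fixed, the squared bias should fall and the variance should rise as the degree increases, so their sum need not keep improving with capacity.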
Double Descent
The classical bias-variance tradeoff suggests that increasing model capacity should eventually hurt test performance, because the reduction in bias is outweighed by an increase in variance. In many modern learning systems, however, this is not what is observed.
As model capacity increases, the test error often first decreases, then rises sharply near the point where the model can just interpolate the training data, and then decreases again as the model becomes even more overparameterized. This phenomenon is known as double descent.
Intuitively, the first descent is the familiar regime where increasing capacity reduces bias. The peak occurs near the interpolation threshold, where the model is flexible enough to fit the training data exactly, but may do so in an unstable way. Beyond this point, further overparameterization can sometimes lead learning algorithms to find smoother or simpler interpolating solutions, which improves generalization again.
Caution
Double descent also has an unintuitive consequence: adding training data can sometimes worsen test performance. A model that was in the overparameterized regime can be pushed back toward the interpolation threshold, where its capacity only just matches the size of the training set.
So, double descent suggests that the classical bias-variance tradeoff is often only part of the story for highly overparameterized models such as deep neural networks.
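Both effects can be reproduced in a toy setting. The sketch below is not a neural network: it assumes a linear regression problem with 60 random Gaussian features whose true weights decay in importance, and it fits minimum-norm least squares (via `np.linalg.pinv`) using only the first `p` features, so `p` plays the role of capacity. All names and parameter values are illustrative assumptions. It sweeps capacity at a fixed training set size, then sweeps training set size at a fixed capacity.

```python
import numpy as np

rng = np.random.default_rng(3)

D = 60                                  # total number of available features
beta = 1.0 / np.arange(1, D + 1)        # assumed true weights, decaying in importance
beta /= np.linalg.norm(beta)
sigma = 0.5                             # noise standard deviation

def avg_test_error(n_train, p, n_trials=30, n_test=500):
    """Average test MSE of min-norm least squares using only the first p features."""
    errs = []
    for _ in range(n_trials):
        X = rng.normal(size=(n_train, D))
        y = X @ beta + sigma * rng.normal(size=n_train)
        X_test = rng.normal(size=(n_test, D))
        y_test = X_test @ beta + sigma * rng.normal(size=n_test)
        # pinv gives ordinary least squares when p <= n_train and the
        # minimum-norm interpolating solution when p > n_train.
        w = np.linalg.pinv(X[:, :p]) @ y
        errs.append(np.mean((X_test[:, :p] @ w - y_test) ** 2))
    return float(np.mean(errs))

# Model-wise sweep: fixed training set size, growing capacity p.
n = 20
capacity_curve = {p: avg_test_error(n, p) for p in (1, 5, 10, 20, 40, 60)}
for p, err in capacity_curve.items():
    print(f"n_train={n}, p={p:2d}: test MSE={err:.3f}")

# Sample-wise sweep: fixed capacity, growing training set size.
p_fixed = 20
data_curve = {m: avg_test_error(m, p_fixed) for m in (10, 20, 40)}
for m, err in data_curve.items():
    print(f"p={p_fixed}, n_train={m:2d}: test MSE={err:.3f}")
```

In this setup the test error should spike where the number of features equals the number of training examples, in both sweeps. The height of the spike varies considerably between runs, and how low the error falls beyond the threshold depends on the assumed weights and noise level, so the printed numbers are only indicative.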
Curse of Dimensionality
As dimensionality increases, the volume of space grows so fast that the amount of data needed to densely sample it increases exponentially. This phenomenon is known as the curse of dimensionality. High-dimensional space has many unexpected properties, and caution should be used when trying to reason about it based on low-dimensional examples.
Surprising properties of high-dimensional spaces include:
- Two randomly sampled data points from a standard normal distribution are very close to orthogonal to one another (relative to the origin) with high likelihood.
- The distance from the origin of samples from a standard normal distribution is roughly constant.
- Most of the volume of a high-dimensional sphere (hypersphere) is adjacent to its surface (a common metaphor is that most of the volume of a high-dimensional orange is in the peel, not in the pulp).
- If we place a unit-diameter hypersphere inside a hypercube with unit-length sides, then the hypersphere takes up a decreasing proportion of the volume of the cube as the dimension increases. Since the volume of the cube is fixed at size one, this implies that the volume of a high-dimensional hypersphere becomes close to zero.
- For random points drawn from a uniform distribution in a high-dimensional hypercube, the ratio of the Euclidean distances from a given point to its nearest and furthest neighbors becomes close to one.
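The properties listed above can be probed numerically. The following sketch uses NumPy sampling and the standard closed-form hypersphere volume $\pi^{d/2} r^d / \Gamma(d/2 + 1)$. The function names, sample sizes, and the 1% shell thickness are assumptions chosen for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def mean_abs_cosine(dim, n_pairs=200):
    """Mean |cos(angle)| between pairs of standard normal vectors."""
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(np.abs(cos)))

def norm_spread(dim, n_samples=200):
    """Relative spread (std / mean) of the distance from the origin."""
    r = np.linalg.norm(rng.normal(size=(n_samples, dim)), axis=1)
    return float(r.std() / r.mean())

def shell_fraction(dim, thickness=0.01):
    """Fraction of a hypersphere's volume in the outer shell (thickness relative to radius)."""
    return 1.0 - (1.0 - thickness) ** dim

def unit_diameter_sphere_volume(dim):
    """Volume of a hypersphere with diameter 1, computed in log space for stability."""
    log_vol = 0.5 * dim * math.log(math.pi) + dim * math.log(0.5) - math.lgamma(0.5 * dim + 1)
    return math.exp(log_vol)

def nearest_to_furthest_ratio(dim, n_points=200):
    """Nearest / furthest Euclidean distance from one uniform point to all others."""
    pts = rng.uniform(size=(n_points, dim))
    dists = np.linalg.norm(pts[1:] - pts[0], axis=1)
    return float(dists.min() / dists.max())

for dim in (2, 100, 1000):
    print(
        f"dim={dim:5d}  |cos|={mean_abs_cosine(dim):.3f}  "
        f"norm spread={norm_spread(dim):.3f}  "
        f"shell fraction={shell_fraction(dim):.3f}  "
        f"sphere volume={unit_diameter_sphere_volume(dim):.3e}  "
        f"near/far={nearest_to_furthest_ratio(dim):.3f}"
    )
```

As the dimension grows, the cosine and the norm spread should shrink toward zero, the shell fraction and the nearest-to-furthest ratio should approach one, and the volume of the unit-diameter hypersphere should collapse toward zero.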
Sources
- Prince, S. (2023). Understanding Deep Learning. Chapter 8.