Measuring Performance

With sufficient capacity (i.e. number of hidden units), a neural network model will often perform perfectly on the training data. However, this does not necessarily mean it will generalize well to new test data.

Sources of Error

There are three possible sources of error, known as noise, bias, and variance.

Noise

The data generation process includes the addition of noise, so there are multiple possible valid outputs $y$ for each input $x$. This source of error is insurmountable for the test data. Note that it does not necessarily limit the training performance: we are unlikely to see the same input multiple times in the training data, so it is still possible to fit the training data perfectly.

Noise can arise because there is a genuine stochastic element to the data generation process. In rare cases, noise may be absent; however, noise is usually a fundamental limitation on the possible test performance.

Example: the same house could sell for slightly different prices on different days because of bargaining or luck; that extra unpredictability is noise.

Bias

A second potential source of error may occur because the model is not flexible enough to fit the true function perfectly. For example, a three-region neural network model cannot exactly describe a quasi-sinusoidal function, even when the parameters are chosen optimally. This is known as bias.

Variance

We have limited training examples, and there is no way to distinguish systematic changes in the underlying function from noise in the underlying data. When we fit a model, we do not get the closest possible approximation to the true underlying function. For different training datasets, the result will be slightly different each time. This additional source of variability in the fitted function is termed variance.

Example: if you retrain the model on a different sample of houses, the learned price rule shifts a bit; that sample-to-sample change is variance.

Mathematical Formulation of Test Error

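Consider a 1D regression problem where the data generation process has additive noise with variance $\sigma^2$: we can observe different outputs $y$ for the same input $x$, so for each $x$, there is a distribution $Pr(y|x)$ with expected value $\mu[x]$:

$$\mu[x] = \mathbb{E}_y\big[y[x]\big] = \int y[x]\, Pr(y|x)\, dy,$$

and fixed noise $\sigma^2 = \mathbb{E}_y\big[(\mu[x] - y[x])^2\big]$. Here, we have used the notation $y[x]$ to specify that we are considering the output $y$ at a given input position $x$.

Now consider a least squares loss between the model prediction $f[x, \phi]$ at position $x$ and the observed value $y[x]$ at that position:

$$
\begin{aligned}
L[x] &= \big(f[x, \phi] - y[x]\big)^2 \\
&= \big((f[x, \phi] - \mu[x]) + (\mu[x] - y[x])\big)^2 \\
&= (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x])(\mu[x] - y[x]) + (\mu[x] - y[x])^2
\end{aligned}
$$

The underlying function is stochastic, so this loss depends on the particular $y[x]$ we observe. The expected loss is

$$
\begin{aligned}
\mathbb{E}_y\big[L[x]\big] &= (f[x, \phi] - \mu[x])^2 + 2(f[x, \phi] - \mu[x])\big(\mu[x] - \mathbb{E}_y[y[x]]\big) + \mathbb{E}_y\big[(\mu[x] - y[x])^2\big] \\
&= (f[x, \phi] - \mu[x])^2 + \sigma^2
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}_y[y[x]] = \mu[x]$ by definition. The expected loss has been broken down into two terms: the squared deviation between the model and the true function mean, and the noise.

The first term can be further partitioned into bias and variance. The parameters $\phi$ of the model depend on the training dataset $\mathcal{D} = \{x_i, y_i\}$, so more properly, we should write $f[x, \phi[\mathcal{D}]]$. The training dataset is a random sample from the data generation process; with a different sample of training data, we would learn different parameter values. The expected model output $f_\mu[x]$ with respect to all possible datasets $\mathcal{D}$ is hence:

$$f_\mu[x] = \mathbb{E}_{\mathcal{D}}\big[f[x, \phi[\mathcal{D}]]\big]$$

Returning to the first term from above, we add and subtract $f_\mu[x]$ and expand:

$$
\begin{aligned}
(f[x, \phi[\mathcal{D}]] - \mu[x])^2 &= \big((f[x, \phi[\mathcal{D}]] - f_\mu[x]) + (f_\mu[x] - \mu[x])\big)^2 \\
&= (f[x, \phi[\mathcal{D}]] - f_\mu[x])^2 + 2(f[x, \phi[\mathcal{D}]] - f_\mu[x])(f_\mu[x] - \mu[x]) + (f_\mu[x] - \mu[x])^2
\end{aligned}
$$

We then take the expectation with respect to the dataset $\mathcal{D}$. The cross term vanishes because $\mathbb{E}_{\mathcal{D}}\big[f[x, \phi[\mathcal{D}]]\big] = f_\mu[x]$:

$$\mathbb{E}_{\mathcal{D}}\big[(f[x, \phi[\mathcal{D}]] - \mu[x])^2\big] = \mathbb{E}_{\mathcal{D}}\big[(f[x, \phi[\mathcal{D}]] - f_\mu[x])^2\big] + (f_\mu[x] - \mu[x])^2$$

Finally, we substitute this back into our original result:

$$\mathbb{E}_{\mathcal{D}}\Big[\mathbb{E}_y\big[L[x]\big]\Big] = \mathbb{E}_{\mathcal{D}}\big[(f[x, \phi[\mathcal{D}]] - f_\mu[x])^2\big] + (f_\mu[x] - \mu[x])^2 + \sigma^2$$

where the first term is the variance, the second term is the (squared) bias, and the third term is the noise. This equation says that the expected loss after considering uncertainty in the training data and the test data consists of three additive components.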

The variance is uncertainty in the fitted model due to the particular training dataset we sampled
The bias is the systematic deviation of the model from the mean of the function we are modeling
The noise is the inherent uncertainty in the true mapping from input to output

While these sources of error combine additively for regression tasks with a least squares loss, their interaction can be more complex for other types of problems.
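The additive decomposition can be checked numerically. The following is a minimal sketch, not taken from the source: it assumes a hypothetical true function (one period of a sine), Gaussian noise, and a cubic polynomial fitted by least squares as a stand-in for the model. The function names and all parameter values are illustrative choices. It refits the model on many simulated training sets and compares variance + squared bias + noise against a direct estimate of the expected test loss.

```python
import numpy as np

def true_mean(x):
    # Hypothetical true function; chosen only for illustration.
    return np.sin(2 * np.pi * x)

def decompose(degree=3, n_train=20, sigma=0.3, n_datasets=2000, seed=0):
    """Monte Carlo estimate of variance, squared bias, noise, and expected loss."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # Draw a fresh training set from the data generation process.
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_mean(x) + rng.normal(0.0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)        # least squares polynomial fit
        preds[d] = np.polyval(coeffs, x_test)    # prediction of this fitted model
    mean_pred = preds.mean(axis=0)               # average model over datasets
    variance = np.mean((preds - mean_pred) ** 2)
    bias_sq = np.mean((mean_pred - true_mean(x_test)) ** 2)
    noise = sigma ** 2
    # Direct estimate of the expected loss against fresh noisy test outputs.
    y_test = true_mean(x_test) + rng.normal(0.0, sigma, preds.shape)
    expected_loss = np.mean((preds - y_test) ** 2)
    return variance, bias_sq, noise, expected_loss

variance, bias_sq, noise, expected_loss = decompose()
print("variance + bias^2 + noise:", variance + bias_sq + noise)
print("direct expected loss:     ", expected_loss)
```

The two printed numbers should agree up to Monte Carlo error. Averaging over the test inputs as well as over datasets is a convenience here; the decomposition itself holds at each input position.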

Reducing Error

Reducing Variance

Variance results from limited noisy training data. Fitting the model to two different training sets results in slightly different parameters. It follows that we can reduce the variance by increasing the quantity of training data. This averages out the inherent noise and ensures that the input space is well sampled.
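A small simulation can illustrate this effect. The sketch below is an assumption-laden example rather than anything from the source: the sine target, the noise level, and the fixed cubic polynomial model are hypothetical choices, and `prediction_variance` is an illustrative helper name. It measures how much the fitted function changes across resampled training sets of different sizes.

```python
import numpy as np

def prediction_variance(n_train, degree=3, sigma=0.3, n_datasets=500, seed=0):
    """Average variance of the fitted function across resampled training sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.1, 0.9, 25)
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # Each iteration simulates drawing a new training set of size n_train.
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n_train)
        preds[d] = np.polyval(np.polyfit(x, y, degree), x_test)
    # Variance over datasets at each test input, then averaged over inputs.
    return preds.var(axis=0).mean()

for n in (20, 200, 2000):
    print(n, prediction_variance(n))
```

With the model capacity held fixed, the printed variance should shrink as the training set grows.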

Reducing Bias

The bias term results from the inability of the model to describe the true underlying function. This suggests that we can reduce this error by making the model more flexible. This is usually done by increasing the model capacity. For neural networks, this means adding more hidden units and/or hidden layers.

Bias-Variance Tradeoff

There is an unexpected side-effect of increasing the model capacity. For a fixed size training dataset, the variance term typically increases as the model capacity increases. Consequently, increasing the model capacity does not necessarily reduce the test error. This is known as the bias-variance tradeoff.
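The tradeoff can be sketched with a capacity sweep. This example is illustrative only and makes several assumptions not found in the source: polynomial degree stands in for model capacity, the true function is a sine, and the training inputs are held on a fixed grid so that only the noise is resampled between datasets (a simplification that keeps the estimates stable). The helper name `bias_variance_by_degree` is hypothetical.

```python
import numpy as np

def bias_variance_by_degree(degrees=(1, 3, 7), n_train=20, sigma=0.3,
                            n_datasets=500, seed=0):
    """Estimate squared bias and variance for several polynomial degrees."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)   # fixed inputs (simplifying assumption)
    x_test = np.linspace(0.1, 0.9, 25)
    mu_train = np.sin(2 * np.pi * x_train)
    mu_test = np.sin(2 * np.pi * x_test)
    # One noisy copy of the training outputs per simulated dataset.
    ys = mu_train + rng.normal(0.0, sigma, (n_datasets, n_train))
    results = {}
    for degree in degrees:
        # polyfit accepts a 2D y (one column per dataset) and fits all of them.
        coeffs = np.polyfit(x_train, ys.T, degree)        # (degree + 1, n_datasets)
        preds = np.vander(x_test, degree + 1) @ coeffs    # (n_test, n_datasets)
        mean_pred = preds.mean(axis=1)
        bias_sq = np.mean((mean_pred - mu_test) ** 2)
        variance = np.mean(preds.var(axis=1))
        results[degree] = (bias_sq, variance)
    return results

for degree, (bias_sq, variance) in bias_variance_by_degree().items():
    print(f"degree={degree}  bias^2={bias_sq:.5f}  variance={variance:.5f}")
```

In this setup the squared bias should fall and the variance should rise as the degree increases, so their sum need not keep decreasing.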

Double Descent

The classical bias-variance tradeoff suggests that increasing model capacity should eventually hurt test performance, because the reduction in bias is outweighed by an increase in variance. In many modern learning systems, however, this is not what is observed.

As model capacity increases, the test error often first decreases, then rises sharply near the point where the model can just interpolate the training data, and then decreases again as the model becomes even more overparameterized. This phenomenon is known as double descent.

Intuitively, the first descent is the familiar regime where increasing capacity reduces bias. The peak occurs near the interpolation threshold, where the model is flexible enough to fit the training data exactly, but may do so in an unstable way. Beyond this point, further overparameterization can sometimes lead learning algorithms to find smoother or simpler interpolating solutions, which improves generalization again.

Caution

Double descent also has an unintuitive consequence: adding training data can sometimes worsen test performance. A model that was in the overparameterized regime can be pushed back toward the interpolation threshold, where the model capacity roughly matches the amount of training data.

So, double descent suggests that the classical bias-variance tradeoff is often only part of the story for highly overparameterized models such as deep neural networks.
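A commonly used toy setting for observing this curve is a random-features model, sketched below. This is a stand-in and not the source's experiment: it assumes a hypothetical linear target in five dimensions, a random untrained ReLU hidden layer, and output weights fitted with the pseudoinverse, which gives the ordinary least squares solution below the interpolation threshold and the minimum-norm interpolating solution above it. All names and parameter values are illustrative.

```python
import numpy as np

def double_descent_curve(n_train=40, dim=5, sigma=0.1,
                         widths=(5, 10, 20, 40, 80, 200, 1000),
                         n_trials=30, n_test=500, seed=0):
    """Median train/test error of a random-features model as width grows."""
    rng = np.random.default_rng(seed)
    a = np.ones(dim) / np.sqrt(dim)              # hypothetical true linear function
    train = np.zeros((n_trials, len(widths)))
    test = np.zeros((n_trials, len(widths)))
    for t in range(n_trials):
        X = rng.normal(size=(n_train, dim))
        X_te = rng.normal(size=(n_test, dim))
        y = X @ a + rng.normal(0.0, sigma, n_train)
        y_te = X_te @ a + rng.normal(0.0, sigma, n_test)
        # Random hidden layer that is never trained; only output weights are fit.
        W = rng.normal(size=(dim, max(widths)))
        b = rng.normal(size=max(widths))
        H = np.maximum(X @ W + b, 0.0)
        H_te = np.maximum(X_te @ W + b, 0.0)
        for j, p in enumerate(widths):
            # Use the first p hidden units; pinv yields least squares for
            # p < n_train and the minimum-norm interpolant for p >= n_train.
            w = np.linalg.pinv(H[:, :p]) @ y
            train[t, j] = np.mean((H[:, :p] @ w - y) ** 2)
            test[t, j] = np.mean((H_te[:, :p] @ w - y_te) ** 2)
    return np.array(widths), np.median(train, axis=0), np.median(test, axis=0)

widths, train_mse, test_mse = double_descent_curve()
for p, tr, te in zip(widths, train_mse, test_mse):
    print(f"width={p:5d}  train={tr:.2e}  test={te:.3f}")
```

In this sketch the test error is expected to peak near width 40, where the number of hidden units equals the number of training points, and to fall again at larger widths. The median over trials is reported because the error near the threshold is heavy-tailed.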

Curse of Dimensionality

As dimensionality increases, the volume of space grows so fast that the amount of data needed to densely sample it increases exponentially. This phenomenon is known as the curse of dimensionality. High-dimensional space has many unexpected properties, and caution should be used when trying to reason about it based on low-dimensional examples.

Surprising properties of high-dimensional spaces include:

Two randomly sampled data points from a standard normal distribution are very close to orthogonal to one another (relative to the origin) with high likelihood
The distance from the origin of samples from a standard normal distribution is roughly constant
Most of the volume of a high-dimensional sphere (hypersphere) is adjacent to its surface (a common metaphor is that most of the volume of a high-dimensional orange is in the peel, not in the pulp)
If we place a unit-diameter hypersphere inside a hypercube with unit-length sides, then the hypersphere takes up a decreasing proportion of the volume of the cube as the dimension increases. Since the volume of the cube is fixed at size one, this implies that the volume of a high-dimensional hypersphere becomes close to zero.
For random points drawn from a uniform distribution in a high-dimensional hypercube, the ratio of the Euclidean distance from a given point to its nearest neighbor to the distance to its furthest neighbor becomes close to one.
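The properties listed above can be probed empirically. The following is a rough sketch under stated assumptions: the sample size, the 1% shell thickness, and the helper name `high_dim_stats` are arbitrary illustrative choices. The hypersphere volume uses the standard closed form $\pi^{d/2} r^d / \Gamma(d/2 + 1)$, evaluated in log space to avoid overflow.

```python
import math
import numpy as np

def high_dim_stats(dim, n_points=200, seed=0):
    """Summary statistics illustrating the listed properties in dimension `dim`."""
    rng = np.random.default_rng(seed)
    g = rng.normal(size=(n_points, dim))
    norms = np.linalg.norm(g, axis=1)
    # Cosine of the angle between disjoint pairs of Gaussian samples.
    a, b = g[0::2], g[1::2]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    # Distances from one uniform point to all the others in the unit hypercube.
    u = rng.uniform(size=(n_points, dim))
    dists = np.linalg.norm(u[1:] - u[0], axis=1)
    # Log-volume of a unit-diameter (radius 0.5) hypersphere.
    log_vol = ((dim / 2) * math.log(math.pi) + dim * math.log(0.5)
               - math.lgamma(dim / 2 + 1))
    return {
        "norm_rel_spread": norms.std() / norms.mean(),
        "mean_abs_cosine": float(np.mean(np.abs(cos))),
        "nearest_over_furthest": dists.min() / dists.max(),
        "unit_diameter_ball_volume": math.exp(log_vol),
        # Fraction of a hypersphere's volume in the outer 1% of its radius.
        "shell_fraction": 1 - 0.99 ** dim,
    }

for d in (2, 10, 100, 1000):
    print(d, high_dim_stats(d))
```

As the dimension grows, the relative spread of the norms and the mean absolute cosine should shrink toward zero, the nearest-to-furthest distance ratio and the shell fraction should approach one, and the hypersphere volume should collapse toward zero.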

Sources

Prince, S. (2023). Understanding Deep Learning. Chapter 8.

Additional Readings

https://openai.com/index/deep-double-descent/

Jake Tuero
