Neural Networks

Shallow Neural Networks

Shallow neural networks describe piecewise linear functions that are expressive enough to approximate arbitrarily complex relationships between multi-dimensional inputs and outputs. The figure below illustrates this: the dashed blue line is approximated by piecewise linear functions with an increasing number of linear regions.

[Figure: piecewise linear approximations of a function (dashed blue line) using 5, 10, and 20 linear regions. Axes: input x, output y.]

Definition (Shallow Neural Network)

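A shallow neural network maps a multi-dimensional input $\mathbf{x} \in \mathbb{R}^{D_i}$ to a multi-dimensional output $\mathbf{y} \in \mathbb{R}^{D_o}$ using $D$ hidden units. Each hidden unit is computed as

$$h_d = a\left[\theta_{d0} + \sum_{i=1}^{D_i} \theta_{di} x_i\right],$$

and these are combined linearly to create the output

$$y_j = \phi_{j0} + \sum_{d=1}^{D} \phi_{jd} h_d,$$

where $a[\cdot]$ is a nonlinear activation function. The model has parameters $\boldsymbol{\phi} = \{\theta_{\cdot\cdot}, \phi_{\cdot\cdot}\}$.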

The number of hidden units in a shallow network is a measure of network capacity.

The figure below shows the flow of computation for a network with a single input $x$, a single output $y$, and three hidden units $h_1$, $h_2$, and $h_3$. The top row shows the linear functions of the input, the middle row shows the hidden units obtained by passing these linear functions through a ReLU activation function, and the bottom row shows each hidden unit scaled by its output weight. The scaled hidden units are then summed with an offset to produce the output.

[Figure: computation in a shallow network with three hidden units. Top row: linear functions $\theta_{d0} + \theta_{d1}x$. Middle row: hidden units $h_d = a[\theta_{d0} + \theta_{d1}x]$. Bottom row: weighted hidden units $\phi_d h_d$. Final panel: output $\phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3$. Axes: input x, output.]

Intuition

Each hidden unit contributes one joint to the function, so three hidden units can create four linear regions. However, the four slopes are not all independent: each joint only adds or removes one hidden unit’s slope contribution. If all hidden units are inactive in one region, that region is flat; otherwise each region’s slope is determined by the sum of the active hidden-unit slopes.
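The relationship between joints and slopes can be checked with a small sketch. It assumes a one-input, one-output ReLU network with three hidden units. The parameter values in `THETA` and `PHI` are hypothetical and chosen only for illustration, and the helper names are not from the source.

```python
def relu(z):
    return max(0.0, z)

def shallow_net(x, theta, phi):
    """theta: list of (theta_d0, theta_d1); phi: [phi_0, phi_1, ..., phi_D]."""
    hidden = [relu(t0 + t1 * x) for t0, t1 in theta]
    return phi[0] + sum(p * h for p, h in zip(phi[1:], hidden))

def joints(theta):
    """Input positions where a hidden unit switches on or off: x = -theta_d0 / theta_d1."""
    return sorted(-t0 / t1 for t0, t1 in theta if t1 != 0)

def region_slopes(theta, phi):
    """Slope of the output in each linear region, from left to right."""
    cuts = joints(theta)
    # One probe point strictly inside each region.
    probes = [cuts[0] - 1.0]
    probes += [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
    probes += [cuts[-1] + 1.0]
    slopes = []
    for x in probes:
        # Only the active hidden units contribute phi_d * theta_d1 to the slope.
        slopes.append(sum(p * t1 for (t0, t1), p in zip(theta, phi[1:])
                          if t0 + t1 * x > 0))
    return slopes

# Hypothetical parameters (illustration only).
THETA = [(-0.2, 0.4), (-0.9, 0.9), (1.1, -0.7)]
PHI = [-0.23, -1.3, 1.3, 0.66]

slopes = region_slopes(THETA, PHI)
print(joints(THETA))   # three joints
print(slopes)          # four slopes, one per linear region
```

Adjacent slopes differ by exactly one term $\phi_d \theta_{d1}$, which is the sense in which the four slopes are not independent.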

Deep Neural Networks

With ReLU activation functions, both shallow and deep networks describe piecewise linear mappings from input to output. As the number of hidden units increases, shallow neural networks improve their descriptive power. With enough hidden units, shallow networks can describe arbitrarily complex functions in high dimensions. However, for some functions, the required number of hidden units is impractically large. Deep networks can produce many more linear regions than shallow networks for a given number of parameters. Hence, from a practical standpoint, they can be used to describe a broader family of functions.

Intuition

Consider the example below, which composes two single-layer networks with three hidden units each. The output $y$ of the first network (left) is used as the input to the second network (right), and the resulting mapping from $x$ to $y'$ is shown in the bottom plot. The first network has three linear regions whose slopes alternate in sign. This means that three different ranges of $x$ are mapped to the same output range $y \in [-1, 1]$, and the subsequent mapping from this range of $y$ to $y'$ is applied three times. The overall effect is that the function defined by the second network is duplicated three times to create nine linear regions, whereas a shallow network with the same total of six hidden units could create at most seven regions.

One way to think about the first network is that it folds the input space back on top of itself, so that several inputs produce the same output. The second network then applies its function to the folded space. The final output is revealed by unfolding again.

[Figure: composing two networks. Left: first network, input x to output y. Right: second network, input y to output y'. Bottom: the composition, input x to output y'.]
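The folding argument can be reproduced numerically. The sketch below uses a hypothetical three-unit "zigzag" network (parameters `THETA` and `PHI` are assumptions, not values from the source) that maps $[-1, 1]$ onto $[-1, 1]$ three times, and applies it twice. Linear regions are counted as changes in the on/off pattern of the hidden units along a fine grid, which assumes every switch actually changes the slope.

```python
def relu(z):
    return max(0.0, z)

# Hypothetical zigzag network: joints at x = -1, -1/3, 1/3; slopes +3, -3, +3.
THETA = [(1.0, 1.0), (1.0 / 3.0, 1.0), (-1.0 / 3.0, 1.0)]  # (bias, slope) per hidden unit
PHI = [-1.0, 3.0, -6.0, 6.0]                                # output bias, then output weights

def pre_activations(x):
    return [t0 + t1 * x for t0, t1 in THETA]

def network(x):
    return PHI[0] + sum(p * relu(z) for p, z in zip(PHI[1:], pre_activations(x)))

def single_pattern(x):
    """On/off state of the three hidden units of one network."""
    return tuple(z > 0 for z in pre_activations(x))

def composed_pattern(x):
    """On/off state of all six hidden units when the network is applied twice."""
    y = network(x)
    return tuple(z > 0 for z in pre_activations(x) + pre_activations(y))

def count_regions(pattern, lo=-1.0, hi=1.0, n=1000):
    """Number of activation-pattern changes on (lo, hi), plus one."""
    step = (hi - lo) / n
    xs = [lo + (i + 0.5) * step for i in range(n)]  # midpoints avoid sitting on a joint
    patterns = [pattern(x) for x in xs]
    return 1 + sum(p != q for p, q in zip(patterns, patterns[1:]))

single_regions = count_regions(single_pattern)      # regions of one network on (-1, 1)
composed_regions = count_regions(composed_pattern)  # regions of the composition on (-1, 1)
print(single_regions, composed_regions)
```

Under these assumptions the single network shows three regions on $(-1, 1)$ and the composition shows nine, matching the count in the text.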

Definition (Deep Neural Networks)

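A general deep network with $K$ hidden layers is defined as

$$\begin{aligned}
\mathbf{h}_1 &= \mathbf{a}[\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}] \\
\mathbf{h}_2 &= \mathbf{a}[\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1] \\
&\;\;\vdots \\
\mathbf{h}_K &= \mathbf{a}[\boldsymbol{\beta}_{K-1} + \boldsymbol{\Omega}_{K-1} \mathbf{h}_{K-1}] \\
\mathbf{y} &= \boldsymbol{\beta}_K + \boldsymbol{\Omega}_K \mathbf{h}_K,
\end{aligned}$$

where $\mathbf{h}_k$ is the vector of hidden units at layer $k$, $\boldsymbol{\beta}_k$ is the vector of biases (intercepts) that contribute to the hidden layer $k+1$, and $\boldsymbol{\Omega}_k$ are the weights (slopes) that are applied to the $k$th layer and contribute to the $(k+1)$th layer. The parameters of this model comprise all of these weight matrices and bias vectors, $\boldsymbol{\phi} = \{\boldsymbol{\beta}_k, \boldsymbol{\Omega}_k\}_{k=0}^{K}$.

  • If the $k$th layer has $D_k$ hidden units, then the bias vector $\boldsymbol{\beta}_{k-1}$ will be of size $D_k$.
  • The last bias vector $\boldsymbol{\beta}_K$ has the size $D_o$ of the output.
  • The first weight matrix $\boldsymbol{\Omega}_0$ has size $D_1 \times D_i$, where $D_i$ is the size of the input.
  • The last weight matrix $\boldsymbol{\Omega}_K$ is $D_o \times D_K$.
  • The remaining matrices $\boldsymbol{\Omega}_k$ are of size $D_{k+1} \times D_k$.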
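The size bookkeeping can be made concrete with a minimal forward pass. This sketch assumes the usual layer-wise form $\mathbf{h}_{k+1} = \mathbf{a}[\boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{h}_k]$ with a ReLU activation and a linear output layer. The layer sizes, the standard-normal weights, and all function names are illustrative assumptions.

```python
import random

def relu_vec(v):
    return [max(0.0, z) for z in v]

def affine(beta, omega, h):
    """Return beta + omega @ h, where omega is a list of rows."""
    return [b + sum(w * x for w, x in zip(row, h)) for b, row in zip(beta, omega)]

def init_params(layer_sizes, seed=0):
    """layer_sizes = [D_i, D_1, ..., D_K, D_o]; returns (biases, weight matrices)."""
    rng = random.Random(seed)
    betas, omegas = [], []
    for d_in, d_out in zip(layer_sizes, layer_sizes[1:]):
        betas.append([0.0] * d_out)                              # bias has the size of the next layer
        omegas.append([[rng.gauss(0.0, 1.0) for _ in range(d_in)]
                       for _ in range(d_out)])                   # matrix is (next size) x (current size)
    return betas, omegas

def forward(x, betas, omegas):
    h = x
    for beta, omega in zip(betas[:-1], omegas[:-1]):
        h = relu_vec(affine(beta, omega, h))     # hidden layers apply the activation
    return affine(betas[-1], omegas[-1], h)      # the output layer is linear

sizes = [3, 5, 4, 2]   # D_i = 3, D_1 = 5, D_2 = 4, D_o = 2, so K = 2 hidden layers
betas, omegas = init_params(sizes)
shapes = [(len(om), len(om[0])) for om in omegas]   # (rows, columns) of each weight matrix
y = forward([1.0, -2.0, 0.5], betas, omegas)
print(shapes, len(y))
```

Each weight matrix has as many rows as the layer it feeds and as many columns as the layer it reads from, which is the pattern listed in the bullets above.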

Shallow vs Deep Neural Networks

Number of Linear Regions per Parameter

  • A shallow network with one input, one output, and $D > 2$ hidden units can create up to $D + 1$ linear regions and is defined by $3D + 1$ parameters.
  • A deep network with one input, one output, and $K$ layers of $D > 2$ hidden units can create a function with up to $(D + 1)^K$ linear regions using $3D + 1 + (K - 1)D(D + 1)$ parameters.

For the shallow network, there is one hidden layer with $D$ units:

  • Input to hidden: each hidden unit has one weight and one bias, giving $2D$ parameters.
  • Hidden to output: the output unit has $D$ incoming weights and one bias, giving $D + 1$ parameters.

Hence the total number of parameters is

$$2D + (D + 1) = 3D + 1.$$

For the deep network, there are $K$ hidden layers, each with $D$ units:

  • First hidden layer: input to hidden gives $D$ weights and $D$ biases, so $2D$ parameters.
  • Remaining hidden layers: each of the $K - 1$ layers maps $D$ activations to $D$ activations, so each contributes $D^2$ weights and $D$ biases, for a total of $D^2 + D = D(D + 1)$ parameters per layer.
  • Output layer: the final hidden layer connects to one output, giving $D$ weights and one bias, so $D + 1$ parameters.

Hence the total number of parameters is

$$2D + (K - 1)D(D + 1) + (D + 1) = 3D + 1 + (K - 1)D(D + 1).$$
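The counts above are easy to tabulate. The sketch below assumes one input, one output, and $D$ units per hidden layer. The region bounds $D + 1$ and $(D + 1)^K$ are taken as the upper bounds for this setting, and the function names are illustrative.

```python
def shallow_params(d):
    """Input-to-hidden (2D) plus hidden-to-output (D + 1)."""
    return 2 * d + (d + 1)

def shallow_max_regions(d):
    return d + 1

def deep_params(d, k):
    """First layer (2D), K - 1 hidden-to-hidden layers (D(D + 1) each), output layer (D + 1)."""
    return 2 * d + (k - 1) * d * (d + 1) + (d + 1)

def deep_max_regions(d, k):
    return (d + 1) ** k

# A deep network with K = 5 layers of D = 10 units ...
budget = deep_params(10, 5)
deep_regions = deep_max_regions(10, 5)

# ... compared with the widest shallow network that fits in the same parameter budget.
d_shallow = (budget - 1) // 3
shallow_regions = shallow_max_regions(d_shallow)

print(budget, deep_regions)        # 471 161051
print(d_shallow, shallow_regions)  # 156 157
```

With $K = 1$ the deep formula reduces to the shallow one, which is a quick consistency check on the derivation.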

Training and Generalization

It is usually easier to train moderately deep networks than to train shallow ones. It may be that over-parameterized deep models (those with more parameters than training examples) have a large family of roughly equivalent solutions that are easy to find. However, as we add more hidden layers, training becomes more difficult again.

Parameter Initialization

The backpropagation algorithm computes the derivatives that are used by stochastic gradient descent. During the forward pass, each set of pre-activations is computed as

$$\mathbf{f}_k = \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{h}_k = \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{a}[\mathbf{f}_{k-1}],$$

where $\mathbf{a}[\cdot]$ applies the ReLU activation function elementwise, and $\boldsymbol{\Omega}_k$ and $\boldsymbol{\beta}_k$ are the weights and biases, respectively. Imagine we initialize the biases to zero and the elements of $\boldsymbol{\Omega}_k$ according to a normal distribution with mean zero and variance $\sigma^2$. Consider two scenarios:

  • If the variance $\sigma^2$ is very small, then each element of $\mathbf{f}_k$ will likely have a smaller magnitude than the input $\mathbf{h}_k$. In addition, the ReLU function clips values less than zero, so the range of $\mathbf{h}_k$ will be half that of $\mathbf{f}_{k-1}$. Consequently, the magnitudes of the pre-activations at the hidden layers will get smaller and smaller as we progress through the network.
  • If the variance $\sigma^2$ is very large, then each element of $\mathbf{f}_k$ will likely have a larger magnitude than the input $\mathbf{h}_k$. The ReLU function halves the range of the inputs, but if $\sigma^2$ is large enough, the magnitudes of the pre-activations will still get larger as we progress through the network.
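Both scenarios can be observed in a small simulation. The sketch below assumes a fully connected ReLU network with zero biases, equal width in every layer, and fresh normally distributed weights at each layer. The width, depth, variances, and the function name `forward_rms` are illustrative choices. The third run uses a variance of 2 divided by the width, the value at which the per-layer scaling is expected to be roughly neutral.

```python
import math
import random

def forward_rms(sigma_sq, width=50, depth=20, seed=0):
    """Root-mean-square pre-activation at each layer for weights ~ Normal(0, sigma_sq)."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma_sq)
    f = [rng.gauss(0.0, 1.0) for _ in range(width)]   # pre-activations entering the stack
    history = []
    for _ in range(depth):
        h = [max(0.0, z) for z in f]                   # ReLU
        # Each new pre-activation is a weighted sum of h with freshly drawn weights.
        f = [sum(rng.gauss(0.0, sigma) * hj for hj in h) for _ in range(width)]
        history.append(math.sqrt(sum(z * z for z in f) / width))
    return history

small = forward_rms(0.001)       # variance far too small: magnitudes collapse
large = forward_rms(1.0)         # variance far too large: magnitudes blow up
he = forward_rms(2.0 / 50)       # variance 2 / width: magnitudes stay on a similar scale

for name, hist in [("small", small), ("large", large), ("2/width", he)]:
    print(name, hist[0], hist[-1])
```

The exact numbers depend on the seed, but the first run should shrink by many orders of magnitude over 20 layers and the second should grow by many orders of magnitude.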

Vanishing Gradient and Exploding Gradient Problem

In these two situations, the values of the pre-activations can become so small or so large that they cannot be represented with finite-precision floating-point arithmetic. Even if the forward pass is tractable, the same logic applies to the backward pass: the gradients can shrink or grow uncontrollably as they are propagated back through the network. These are known as the vanishing gradient problem and the exploding gradient problem, respectively.

Initialization for Forward Pass (He Initialization)

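To see this concretely, consider the computation between adjacent pre-activations $\mathbf{f}$ and $\mathbf{f}'$ with dimensions $D_h$ and $D_{h'}$, respectively:

$$\begin{aligned}
\mathbf{h} &= \mathbf{a}[\mathbf{f}] \\
\mathbf{f}' &= \boldsymbol{\beta} + \boldsymbol{\Omega} \mathbf{h},
\end{aligned}$$

where $\mathbf{h}$ represents the activations, $\boldsymbol{\Omega}$ and $\boldsymbol{\beta}$ represent the weights and biases, and $\mathbf{a}[\cdot]$ is the activation function. Assume the pre-activations $f_j$ in the input layer $\mathbf{f}$ have variance $\sigma_f^2$. Consider initializing the biases $\beta_i$ to zero and the weights $\Omega_{ij}$ as normally distributed with mean zero and variance $\sigma_\Omega^2$. The expectation of the intermediate values $f_i'$ is

$$\mathbb{E}[f_i'] = \mathbb{E}\left[\beta_i + \sum_{j=1}^{D_h} \Omega_{ij} h_j\right] = \mathbb{E}[\beta_i] + \sum_{j=1}^{D_h} \mathbb{E}[\Omega_{ij}]\,\mathbb{E}[h_j] = 0 + \sum_{j=1}^{D_h} 0 \cdot \mathbb{E}[h_j] = 0,$$

where $D_h$ is the dimensionality of the input layer $\mathbf{h}$, and the second step uses the independence of the weights and the activations. The variance of the pre-activations $f_i'$ is

$$\sigma_{f'}^2 = \mathbb{E}[f_i'^2] - \mathbb{E}[f_i']^2 = \mathbb{E}\left[\left(\beta_i + \sum_{j=1}^{D_h} \Omega_{ij} h_j\right)^2\right] - 0 = \sum_{j=1}^{D_h} \mathbb{E}[\Omega_{ij}^2]\,\mathbb{E}[h_j^2] = \sigma_\Omega^2 \sum_{j=1}^{D_h} \mathbb{E}[h_j^2],$$

where the cross terms vanish because the weights are independent and have zero mean. Assuming that the distribution of the pre-activations $f_j$ at the previous layer is symmetric about zero, half of these pre-activations will be clipped by the ReLU function, and the second moment $\mathbb{E}[h_j^2]$ will be half the variance $\sigma_f^2$ of $f_j$:

$$\sigma_{f'}^2 = \sigma_\Omega^2 \sum_{j=1}^{D_h} \frac{\sigma_f^2}{2} = \frac{1}{2} D_h \sigma_\Omega^2 \sigma_f^2.$$

This, in turn, implies that if we want the variance $\sigma_{f'}^2$ of the subsequent pre-activations to be the same as the variance $\sigma_f^2$ of the original pre-activations during the forward pass, we should set

$$\sigma_\Omega^2 = \frac{2}{D_h},$$

where $D_h$ is the dimension of the original layer to which the weights were applied. This is known as He initialization.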
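The variance relation for a single layer can be checked by sampling. The sketch below assumes zero biases, normally distributed pre-activations with variance `sigma_f_sq`, and normally distributed weights with variance `sigma_omega_sq`, and compares the empirical variance of one output pre-activation with the prediction $\tfrac{1}{2} D_h \sigma_\Omega^2 \sigma_f^2$. The sizes, trial count, and function name are illustrative.

```python
import random

def next_layer_variance(d_h, sigma_f_sq, sigma_omega_sq, trials=2000, seed=0):
    """Monte Carlo estimate of the variance of one pre-activation in the next layer."""
    rng = random.Random(seed)
    sigma_f = sigma_f_sq ** 0.5
    sigma_omega = sigma_omega_sq ** 0.5
    samples = []
    for _ in range(trials):
        # Activations of the current layer: ReLU of zero-mean normal pre-activations.
        h = [max(0.0, rng.gauss(0.0, sigma_f)) for _ in range(d_h)]
        # One pre-activation of the next layer, with zero bias and fresh weights.
        samples.append(sum(rng.gauss(0.0, sigma_omega) * hj for hj in h))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

d_h = 100
sigma_f_sq = 1.5
he_variance = 2.0 / d_h                            # weight variance 2 / D_h

predicted = 0.5 * d_h * he_variance * sigma_f_sq   # equals sigma_f_sq for this choice
empirical = next_layer_variance(d_h, sigma_f_sq, he_variance)
print(predicted, empirical)
```

With a weight variance of $2 / D_h$ the prediction equals the input variance, so the empirical estimate should land close to 1.5 here, up to sampling noise.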

Initialization for the Backward Pass

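A similar argument establishes how the variance of the gradients changes during the backward pass. During the backward pass, we multiply by the transpose of the weight matrix, so the equivalent expression becomes

$$\sigma_\Omega^2 = \frac{2}{D_{h'}},$$

where $D_{h'}$ is the dimension of the layer that the weights feed into.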

Initialization for both Forward and Backward Pass

If the weight matrix $\boldsymbol{\Omega}$ is not square (i.e., the layer dimensions $D_h$ and $D_{h'}$ differ), then it is not possible to choose the variance to satisfy both of the above equations for the forward and backward pass. One possible compromise is to use the mean $(D_h + D_{h'})/2$ as a proxy for the number of terms, which gives

$$\sigma_\Omega^2 = \frac{4}{D_h + D_{h'}}.$$
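The three choices of weight variance can be compared side by side. The sketch below assumes the forward-pass rule $2 / D_h$, the backward-pass rule $2 / D_{h'}$, and the mean-based compromise $4 / (D_h + D_{h'})$. The function names and the layer sizes are hypothetical.

```python
import random

def forward_variance(d_h, d_h_next):
    """Keeps the variance of the pre-activations stable in the forward pass."""
    return 2.0 / d_h

def backward_variance(d_h, d_h_next):
    """Keeps the variance of the gradients stable in the backward pass."""
    return 2.0 / d_h_next

def compromise_variance(d_h, d_h_next):
    """Uses the mean of the two layer sizes as a proxy for the number of terms."""
    return 4.0 / (d_h + d_h_next)

def sample_weights(d_h, d_h_next, variance, seed=0):
    """Draw a (d_h_next x d_h) weight matrix with zero mean and the given variance."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(d_h)] for _ in range(d_h_next)]

d_h, d_h_next = 100, 300   # hypothetical layer sizes
fwd = forward_variance(d_h, d_h_next)
bwd = backward_variance(d_h, d_h_next)
mid = compromise_variance(d_h, d_h_next)
omega = sample_weights(d_h, d_h_next, mid)
print(fwd, bwd, mid)
```

The compromise always lies between the other two values, and all three coincide when the weight matrix is square.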

Sources

  • Prince, S. (2023). Understanding Deep Learning. Chapter 3.
  • Prince, S. (2023). Understanding Deep Learning. Chapter 4.