Normalizing Layers

Normalizing layers are an important class of operations that facilitate the training of deep architectures by controlling the empirical mean and variance of groups of activations.

Batch Normalization

Definition (Batch Normalization)

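Batch normalization (BatchNorm) shifts and rescales each activation $h$ so that its mean and variance across the batch $\mathcal{B}$ become values that are learned during training. First, the empirical mean $m_h$ and standard deviation $s_h$ are computed:

$$
m_h = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} h_i, \qquad s_h = \sqrt{\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} (h_i - m_h)^2},
$$

where all quantities are scalars. Then we use these statistics to standardize the batch activations to have mean zero and unit variance:

$$
h_i \leftarrow \frac{h_i - m_h}{s_h + \epsilon} \quad \forall i \in \mathcal{B},
$$

where $\epsilon$ is a small number that prevents division by zero if $h_i$ is the same for every member of the batch and $s_h = 0$. Finally, the normalized variable is scaled by $\gamma$ and shifted by $\delta$:

$$
h_i \leftarrow \gamma h_i + \delta \quad \forall i \in \mathcal{B}.
$$

After this operation, the activations have mean $\delta$ and standard deviation $\gamma$ across all members of the batch.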
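The three steps described above (compute batch statistics, standardize, then scale and shift) can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `batch_norm`, the array shapes, and the example values for the scale and offset are assumptions made for the example. It divides by the standard deviation plus a small constant, as in the text; some libraries instead place the constant inside the square root.

```python
import numpy as np

def batch_norm(h, gamma, delta, eps=1e-5):
    """Normalize activations h of shape (batch, units) across the batch axis."""
    m = h.mean(axis=0)            # empirical mean of each hidden unit
    s = h.std(axis=0)             # empirical standard deviation of each hidden unit
    h_hat = (h - m) / (s + eps)   # standardize: mean zero, unit variance
    return gamma * h_hat + delta  # apply the learned scale and offset

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=5.0, size=(64, 4))  # 64 examples, 4 hidden units
gamma = np.array([1.0, 2.0, 0.5, 1.5])            # illustrative scales
delta = np.array([0.0, -1.0, 4.0, 2.0])           # illustrative offsets
out = batch_norm(h, gamma, delta)
```

After the call, the per-unit mean of `out` should match `delta` and the per-unit standard deviation should match `gamma`, up to the small effect of `eps`.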

Batch normalization is applied independently to each hidden unit.

In a standard neural network with $K$ layers, each containing $D$ hidden units, there would be $KD$ learned offsets $\delta$ and $KD$ learned scales $\gamma$.
In a convolutional network, the normalizing statistics are computed over both the batch and spatial position. If there were $K$ layers, each containing $C$ channels, there would be $KC$ offsets and $KC$ scales.
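For the convolutional case, a rough sketch might pool the statistics over the batch and both spatial axes, leaving one mean, one standard deviation, one scale, and one offset per channel. The `(N, C, H, W)` layout, the function name `batch_norm_conv`, and the layer and channel counts below are assumptions chosen for illustration.

```python
import numpy as np

def batch_norm_conv(x, gamma, delta, eps=1e-5):
    """x has shape (N, C, H, W); statistics are pooled over N, H, and W."""
    m = x.mean(axis=(0, 2, 3), keepdims=True)  # one mean per channel
    s = x.std(axis=(0, 2, 3), keepdims=True)   # one standard deviation per channel
    x_hat = (x - m) / (s + eps)
    # gamma and delta hold one value per channel and broadcast over N, H, W
    return gamma.reshape(1, -1, 1, 1) * x_hat + delta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 5, 5))
gamma, delta = np.ones(3), np.zeros(3)
y = batch_norm_conv(x, gamma, delta)

# Parameter count for a hypothetical network with 10 layers of 3 channels each
num_layers, num_channels = 10, 3
num_offsets = num_layers * num_channels
num_scales = num_layers * num_channels
```

With these hypothetical sizes, the network would carry 30 offsets and 30 scales, regardless of the spatial resolution of the feature maps.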

At test time, we do not have a batch from which to gather statistics. To resolve this, the mean $m_h$ and standard deviation $s_h$ of each activation are calculated across the whole training dataset (rather than a single batch) and frozen in the final network.
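One way to picture the frozen statistics is a small helper that computes them once over all training activations and then reuses them for every input. The class name `FrozenBatchNorm` and its interface are hypothetical. Many frameworks approximate these statistics with running averages collected during training instead of a separate pass over the dataset; the sketch below follows the description in the text.

```python
import numpy as np

class FrozenBatchNorm:
    """Batch norm whose statistics are fixed after training (hypothetical helper)."""

    def __init__(self, train_activations, gamma, delta, eps=1e-5):
        # Statistics over the whole training set, not a single batch
        self.m = train_activations.mean(axis=0)
        self.s = train_activations.std(axis=0)
        self.gamma, self.delta, self.eps = gamma, delta, eps

    def __call__(self, h):
        # No batch statistics are computed here, so a single example works
        return self.gamma * (h - self.m) / (self.s + self.eps) + self.delta

rng = np.random.default_rng(0)
train_h = rng.normal(loc=2.0, scale=3.0, size=(1000, 4))
bn = FrozenBatchNorm(train_h, gamma=np.ones(4), delta=np.zeros(4))

alone = bn(train_h[:1])          # one example processed on its own
in_batch = bn(train_h[:32])[:1]  # the same example processed inside a batch
```

Because the statistics are frozen, `alone` and `in_batch` agree: the output for an example no longer depends on what else is in the batch.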

Cost and Benefits of Batch Normalization

Batch normalization makes the network invariant to rescaling the weights and biases that feed each activation: if they are doubled, the activations and their estimated standard deviation also double, and the normalization cancels the change. Consequently, there is a large family of weights and biases that all produce the same effect. Batch normalization also adds two parameters, a scale $\gamma$ and an offset $\delta$, at every hidden unit, which makes the model somewhat larger. Hence, it both creates redundancy in the weights and biases and adds extra parameters to compensate for that redundancy. This is inefficient, but batch normalization also provides several benefits:

Stable forward propagation: If we initialize the offsets to zero and the scales to one, then each output activation will have unit variance. In a regular network, this ensures the variance is stable during forward propagation at initialization. In a residual network, the variance must still increase as we add a new source of variation to the input at each layer, but it increases only linearly with each residual block, since each block adds one unit of variance. At initialization, later blocks therefore change the overall variation proportionally less than earlier ones, so the network is effectively less deep at the start of training. As training proceeds, the network can increase the scales in later layers and control its own effective depth.
Higher learning rates: Empirical studies and theory show that batch normalization makes the loss surface and its gradient change more smoothly. This means that we can use higher learning rates as the surface is more predictable.
Regularization: BatchNorm injects noise because the normalization depends on the batch statistics. The activations for a given training example are normalized by an amount that depends on the other members of the batch and is different at each training iteration.
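Two of the claims in this section can be checked numerically: that batch normalization is invariant to rescaling the weights and biases, and that the normalized value of an example depends on the other members of its batch. The sketch below uses a single linear layer with made-up sizes; all names and shapes are assumptions for illustration, and the scale and offset are fixed at one and zero.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Standardize each hidden unit across the batch (scale 1, offset 0)."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))   # batch of 32 inputs with 5 features
W = rng.normal(size=(5, 3))    # weights feeding 3 hidden units
b = rng.normal(size=3)         # biases

# 1) Rescaling invariance: doubling weights and biases leaves the output unchanged
out = batch_norm(x @ W + b)
out_doubled = batch_norm(x @ (2.0 * W) + 2.0 * b)
max_diff = np.abs(out - out_doubled).max()  # tiny; nonzero only because of eps

# 2) Batch dependence: the same example is normalized differently in another batch
other_batch = np.vstack([x[:1], rng.normal(size=(31, 5))])
out_other = batch_norm(other_batch @ W + b)
same_example_shift = np.abs(out[0] - out_other[0]).max()
```

Here `max_diff` should be negligible, while `same_example_shift` should be clearly nonzero. The second effect is the source of the noise that acts as a regularizer.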

Intuition

Batch normalization does not remove the representational benefit of depth. Instead, it reduces the optimization problems caused by depth, such as unstable activation magnitudes and poorly behaved gradients. In that sense, batch normalization can make a deep network as easy to optimize as a shallower one, while still preserving the expressive power that comes from having many layers.

Layer Normalization

Definition (Layer Normalization)

Layer normalization shifts and rescales the activations for a single training example using statistics computed across that example’s features. Instead of using the other members of the minibatch, it normalizes each example independently and then applies learned scale and shift parameters.

In a fully connected network, layer normalization typically computes the mean and variance across all hidden units in the layer for one example. In sequence models, it is usually computed across the hidden dimension at each token position. Because it does not depend on batch statistics, its behavior is the same at training and test time.
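As a rough sketch of the sequence-model case, the function below normalizes over the hidden dimension at each token position. The function name `layer_norm`, the `(batch, positions, hidden)` layout, and the sizes are assumptions for the example.

```python
import numpy as np

def layer_norm(h, gamma, delta, eps=1e-5):
    """h has shape (..., features); statistics are taken over the last axis only."""
    m = h.mean(axis=-1, keepdims=True)  # one mean per example (and per position)
    s = h.std(axis=-1, keepdims=True)   # one standard deviation per example
    return gamma * (h - m) / (s + eps) + delta

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 7, 16))    # (batch, sequence positions, hidden dim)
gamma, delta = np.ones(16), np.zeros(16)

full = layer_norm(tokens, gamma, delta)
alone = layer_norm(tokens[:1], gamma, delta)  # first example processed by itself
```

The output for the first example is the same whether or not the second example is present, which is why no separate test-time procedure is needed.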

Instance Normalization

Definition (Instance Normalization)

Instance normalization shifts and rescales each channel for each training example independently, usually using statistics computed across the spatial dimensions of that channel. As with batch normalization, learned scale and shift parameters are applied after normalization.

In a convolutional network, this means that each image in the batch is normalized separately, and each channel is normalized using only that image’s own spatial activations. Instance normalization is therefore insensitive to the composition of the minibatch and tends to remove per-instance contrast information more aggressively than batch normalization.
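The loss of per-instance contrast information can be illustrated with a small sketch: an image and a rescaled, brightened copy of it map to nearly the same normalized output. The function name `instance_norm`, the `(N, C, H, W)` layout, and the contrast and brightness factors are assumptions made for the example.

```python
import numpy as np

def instance_norm(x, gamma, delta, eps=1e-5):
    """x has shape (N, C, H, W); statistics are per example and per channel."""
    m = x.mean(axis=(2, 3), keepdims=True)  # mean over spatial positions only
    s = x.std(axis=(2, 3), keepdims=True)
    x_hat = (x - m) / (s + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + delta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
image = rng.normal(size=(1, 3, 8, 8))
high_contrast = 4.0 * image + 10.0  # same content, different contrast and brightness

gamma, delta = np.ones(3), np.zeros(3)
out = instance_norm(image, gamma, delta)
out_hc = instance_norm(high_contrast, gamma, delta)
```

The two outputs agree up to the effect of `eps`, so anything downstream cannot tell the two versions of the image apart.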

Comparing Batch, Layer, and Instance Normalization

Figure: three activation tensors with axes N (batch), C (channels), and H, W (spatial positions), with the normalized region shaded. Batch Norm: fixed channel, normalize over N, H, W. Layer Norm: fixed sample, normalize over C, H, W. Instance Norm: fixed sample and channel, normalize over H, W.

All three methods try to solve a similar optimization problem: they keep activations in a controlled range so that deeper models are easier to train. The main difference is the set of values over which the normalization statistics are computed.

Batch Normalization uses statistics from the minibatch. In convolutional networks, these statistics are also pooled across spatial positions within a channel.
Layer Normalization uses statistics from a single example across its features.
Instance Normalization uses statistics from a single example and a single channel, usually across spatial positions.
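Since the three methods differ only in the axes over which statistics are computed, they can be sketched with one helper and three axis choices. The helper name `normalize` and the `(N, C, H, W)` layout are assumptions, and the learned scale and shift are omitted to keep the comparison short.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x over the given axes; also return the shape of the statistics."""
    m = x.mean(axis=axes, keepdims=True)
    s = x.std(axis=axes, keepdims=True)
    return (x - m) / (s + eps), m.shape

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 5, 5))  # (N, C, H, W)

_, batch_stats = normalize(x, axes=(0, 2, 3))   # batch norm: one statistic per channel
_, layer_stats = normalize(x, axes=(1, 2, 3))   # layer norm: one per example
_, instance_stats = normalize(x, axes=(2, 3))   # instance norm: one per example and channel
```

The shapes of the statistics, `(1, 4, 1, 1)`, `(6, 1, 1, 1)`, and `(6, 4, 1, 1)`, show which axes each method averages away and which it keeps separate.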

These differences affect when each method is most useful.

Batch normalization is the conventional choice in classical convolutional residual networks such as ResNet. It works well when batch sizes are reasonably large and provides both stabilization and a mild regularizing effect because the normalization depends on the other examples in the batch.
Layer normalization is useful when batch statistics are unreliable or inconvenient, such as in recurrent networks and Transformers, where examples may have variable length and minibatches may be small. It stabilizes each example independently, but does not provide the same batch-dependent noise as batch normalization.
Instance normalization is often used in image generation and style transfer, where it is desirable to normalize each image independently and reduce sensitivity to instance-specific contrast or style. It is less common as the default choice in standard image-classification ResNets.

In short, these methods are not trying to do completely different things, but they are not interchangeable either. They all stabilize optimization, yet they preserve and discard different kinds of information because they normalize over different axes.

Sources

Prince, S. (2023). Understanding Deep Learning. Chapter 11.

Jake Tuero
