Activation Functions

If a network combined only linear components, it would itself be a linear operator, so it is essential to include non-linear operations. These are implemented in particular by activation functions, which are layers that transform each component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
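As a minimal sketch of the first claim, the plain-Python example below uses made-up 2×2 weights (the matrices `W1`, `W2`, the input `x`, and the helper names are illustrative assumptions). It shows that two stacked linear layers give the same output as the single matrix `W2 W1`, while placing a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu_vec(x):
    """Apply ReLU to each component of x."""
    return [max(0.0, xi) for xi in x]


# Hypothetical weights and input, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# Two linear layers applied in sequence...
stacked = matvec(W2, matvec(W1, x))
# ...equal one linear layer whose weight matrix is the product W2 W1.
collapsed = matvec(matmul(W2, W1), x)

# With a nonlinearity in between, the single matrix W2 W1 no longer
# reproduces the output.
nonlinear = matvec(W2, relu_vec(matvec(W1, x)))

print(stacked, collapsed, nonlinear)
```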

The activation function of choice has changed over time. For the majority of modern activation functions (in particular among the variants of ReLU), the choice is generally driven by empirical performance.

Sigmoid (Logistic) Function

https://en.wikipedia.org/wiki/Logistic_function

Definition (Sigmoid)

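The Sigmoid activation function is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

The derivative of the Sigmoid activation function is

$$\frac{d}{dx}\sigma(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr).$$

[Figure: sigmoid(x) and its derivative d/dx sigmoid(x), plotted against x.]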
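The logistic sigmoid $1 / (1 + e^{-x})$ and its derivative can be sketched in plain Python as follows. The function names are illustrative, and the two-branch evaluation is one common way to avoid overflow in `exp`, not the only one.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, 1 / (1 + exp(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(-x) can overflow, so use the algebraically
    # equivalent form exp(x) / (1 + exp(x)).
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Because the derivative reuses the forward output `s`, a backward pass that has cached `s` needs only one subtraction and one multiplication per element.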

History

The sigmoid was a common early activation function because it is a smooth, differentiable nonlinearity that maps real-valued inputs into $(0, 1)$. This made it convenient for gradient-based learning, and the output could be interpreted as a probability or as a soft “on/off” firing rate for a neuron.

It was also mathematically convenient for backpropagation: its derivative can be written directly in terms of its own output, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. This made the computations simple and efficient, especially in early neural network implementations.

Properties

  • The logistic function has the symmetry property that $\sigma(-x) = 1 - \sigma(x)$.
  • The logistic function is bounded within $(0, 1)$.
  • The logistic function is smooth and differentiable everywhere.
  • The derivative is largest at $x = 0$, where $\sigma(0) = \tfrac{1}{2}$ and $\sigma'(0) = \tfrac{1}{4}$.
  • The output can be useful when a value should be interpreted as a probability, especially in the output layer of a binary classifier.
  • The sigmoid saturates for large positive and large negative inputs, meaning its derivative becomes very small when $|x|$ is large. In deep networks, this can contribute to the vanishing gradient problem.
  • The sigmoid is not zero-centered, since its outputs are always positive. This can make optimization less efficient because activations passed to the next layer tend to have a positive mean.
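To make the saturation point concrete: the sigmoid's derivative never exceeds 1/4, so each sigmoid in a chain scales the backpropagated gradient by at most 1/4. The sketch below assumes a toy scalar chain in which a sigmoid is applied repeatedly with all weights equal to 1. This is a deliberate simplification, not a real network, and `chain_gradient` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Plain logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def chain_gradient(x, depth):
    """Derivative of sigmoid applied `depth` times to x, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        s = sigmoid(x)
        grad *= s * (1.0 - s)  # local derivative at this layer, at most 1/4
        x = s                  # this layer's output feeds the next layer
    return grad


# The gradient shrinks at least geometrically with depth.
for depth in (1, 5, 10):
    print(depth, chain_gradient(0.0, depth), 0.25 ** depth)
```

In a real network the weights also enter the product, so the bound is not this clean, but the shrinking factor from the activation is the same.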

Hyperbolic Tangent (tanh)

Definition (tanh)

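The tanh activation function is defined as

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

The derivative of the tanh activation function is

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x).$$

[Figure: tanh(x) and its derivative d/dx tanh(x), plotted against x.]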
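A short sketch of tanh and its derivative $1 - \tanh^2(x)$, using Python's built-in `math.tanh`. It also checks numerically the standard identity $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$, which is one way to see that tanh is a rescaled, zero-centered version of the logistic sigmoid. The helper names are illustrative.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh, written in terms of its own output."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x: float) -> float:
    """Plain logistic sigmoid; adequate for moderate x."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh expressed as a shifted and rescaled sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


for x in (-2.0, 0.0, 0.7):
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```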

History

The hyperbolic tangent was used as an early activation function for many of the same reasons as the sigmoid: it is smooth, differentiable, bounded, and introduces a useful nonlinearity into the network. It also has a simple derivative, $1 - \tanh^2(x)$, which made it convenient for backpropagation.

Compared with the logistic sigmoid, $\tanh$ was often preferred for hidden layers because its output is zero-centered. This usually makes optimization easier than using a hidden activation whose outputs are always positive.

Properties

  • The hyperbolic tangent is bounded within $(-1, 1)$.
  • The hyperbolic tangent is smooth and differentiable everywhere.
  • The hyperbolic tangent is zero-centered and odd: $\tanh(-x) = -\tanh(x)$.
  • The derivative is largest at $x = 0$, where $\tanh(0) = 0$ and $\tanh'(0) = 1$.
  • Because $\tanh$ is zero-centered, it is generally better behaved than the logistic sigmoid as a hidden-layer activation.
  • Like the sigmoid, $\tanh$ saturates for large positive and large negative inputs. When $|x|$ is large, $\tanh(x)$ is close to either $-1$ or $1$, and the derivative is close to $0$. In deep networks, this can contribute to the vanishing gradient problem.

Rectified Linear Unit (ReLU)

https://en.wikipedia.org/wiki/Rectified_linear_unit

Definition (ReLU)

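The ReLU activation function is defined as

$$\operatorname{ReLU}(x) = \max(0, x).$$

The derivative of the ReLU activation function is

$$\frac{d}{dx}\operatorname{ReLU}(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 & \text{if } x > 0, \end{cases}$$

and is undefined at $x = 0$.

[Figure: ReLU(x) and its derivative d/dx ReLU(x), plotted against x.]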
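A minimal scalar sketch of ReLU, $\max(0, x)$, and its derivative. The derivative does not exist at $x = 0$, so the value returned there is a convention. This sketch assumes 0, and the function names are illustrative.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
```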

History

ReLU is one of the most popular activation functions. Prior to 2010, the most commonly used activation functions were the logistic sigmoid (which is inspired by probability theory) and its more numerically efficient counterpart, the hyperbolic tangent. Jarrett et al. (2009) noted that ReLU was critical for object recognition in convolutional neural networks, specifically because it allows average pooling without neighbouring filter outputs cancelling each other out.

To see why, suppose a small pooling window has two neighbouring filter outputs of equal magnitude and opposite sign, say $a$ and $-a$ for some large $a > 0$. Average pooling gives $0$, which signals that there is “almost nothing here”, even though there were strong filter responses (the strong positive and negative responses cancelled out). With ReLU applied before the pooling, we have $a$ and $0$, with an average pooling of $a/2$, which says that there is a strong activation somewhere in this neighbourhood. For $\tanh$, we have $\tanh(a) \approx 1$ and $\tanh(-a) \approx -1$, and so strong opposite-signed responses can still average to zero.
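The cancellation argument can be checked numerically. The window values below are made up for illustration and are not the values used in the text. `average_pool` is a hypothetical helper that averages one pooling window.

```python
import math


def average_pool(window):
    """Average of the values in a single pooling window."""
    return sum(window) / len(window)


# Hypothetical filter responses: two strong responses of opposite sign.
window = [4.0, -4.0]

# Without an activation, the strong responses cancel.
raw_avg = average_pool(window)

# ReLU zeroes the negative response, so the positive one survives pooling.
relu_avg = average_pool([max(0.0, v) for v in window])

# tanh is odd, so opposite-signed responses cancel again after squashing.
tanh_avg = average_pool([math.tanh(v) for v in window])

print(raw_avg, relu_avg, tanh_avg)
```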

Properties

Advantages of ReLU include:

  • Sparse activation: for example, in a randomly initialized network, only about 50% of the hidden units are activated.
  • Better gradient propagation: fewer vanishing gradient problems compared to sigmoid activation functions.
  • Efficiency: only requires comparison and addition.
  • Scale-invariant: $\max(0, ax) = a \max(0, x)$ for $a \geq 0$.

Potential downsides of ReLU include:

  • Non-differentiability at zero, but this is not really a big deal because we can just arbitrarily choose the derivative to be 0 or 1 here.
  • Not zero-centered: ReLU outputs are always non-negative, which can make it harder for networks to learn during backpropagation because gradient updates tend to push weights in one direction. Batch normalization can help address this.
  • ReLU is unbounded.
  • Dying ReLU: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state (it “dies”). This is a form of the vanishing gradient problem. If enough neurons become stuck in a dead state, it can decrease the model capacity and potentially even halt the learning process. This typically arises when the learning rate is set too high. It can be mitigated using variants such as Leaky ReLU instead, where a small positive slope is assigned for $x < 0$.
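The dying ReLU effect can be illustrated with a toy single neuron $y = \operatorname{ReLU}(wx + b)$. All values below (the inputs `xs`, the weight, the bias, and the slope `alpha`) are hypothetical, and the "loss" is simply the sum of the neuron's outputs. If the pre-activation is negative for every input, the gradients with respect to `w` and `b` are exactly zero, so gradient descent cannot move the neuron. A leaky variant still receives a small gradient.

```python
def relu_neuron_grads(w, b, xs):
    """Gradients of sum_i ReLU(w * x_i + b) with respect to w and b."""
    grad_w = grad_b = 0.0
    for x in xs:
        active = 1.0 if w * x + b > 0 else 0.0  # ReLU derivative
        grad_w += active * x
        grad_b += active
    return grad_w, grad_b


def leaky_neuron_grads(w, b, xs, alpha=0.01):
    """Same gradients, but for a Leaky ReLU with negative-side slope alpha."""
    grad_w = grad_b = 0.0
    for x in xs:
        slope = 1.0 if w * x + b > 0 else alpha
        grad_w += slope * x
        grad_b += slope
    return grad_w, grad_b


# Hypothetical inputs in [0, 1] and a bias that has been pushed far negative,
# so that w * x + b < 0 for every input.
xs = [0.1, 0.5, 0.9]
dead = relu_neuron_grads(w=1.0, b=-5.0, xs=xs)
leaky = leaky_neuron_grads(w=1.0, b=-5.0, xs=xs)

print(dead, leaky)
```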


Leaky ReLU

Leaky ReLU allows a small, positive gradient when the unit is inactive, helping to mitigate the vanishing gradient problem.

Definition (Leaky ReLU)

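The Leaky ReLU activation function with negative-side slope $\alpha$ is defined as

$$\operatorname{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0, \\ \alpha x & \text{if } x \leq 0. \end{cases}$$

The derivative of the Leaky ReLU activation function is

$$\frac{d}{dx}\operatorname{LeakyReLU}(x) = \begin{cases} 1 & \text{if } x > 0, \\ \alpha & \text{if } x < 0. \end{cases}$$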

Typically, $\alpha$ is set to somewhere within $[0.01, 0.3]$.

[Figure: LeakyReLU(x) and its derivative d/dx LeakyReLU(x), plotted against x.]
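A scalar sketch of Leaky ReLU, which returns $x$ for positive inputs and $\alpha x$ otherwise. The default `alpha=0.01` is an assumed, commonly used value, and the function names are illustrative.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: identity for x > 0, small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative of Leaky ReLU; at x == 0 this sketch returns alpha."""
    return 1.0 if x > 0 else alpha


for x in (-2.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Unlike plain ReLU, the gradient is never exactly zero, so an inactive unit can still be updated.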


Softplus

The Softplus activation function is a smooth approximation to the ReLU activation function.

Definition (Softplus)

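The Softplus activation function is defined as

$$\operatorname{Softplus}(x) = \ln\left(1 + e^{x}\right).$$

The derivative of the Softplus activation function is

$$\frac{d}{dx}\operatorname{Softplus}(x) = \frac{1}{1 + e^{-x}}.$$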

For large negative $x$, it is approximately $\ln(1) = 0$, and so just above $0$, while for large positive $x$ it is roughly $\ln(e^{x}) = x$, and so just above $x$. The derivative of Softplus is the logistic function.

[Figure: Softplus(x) and its derivative d/dx Softplus(x), plotted against x.]
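Evaluating $\ln(1 + e^{x})$ directly overflows for large $x$. The sketch below uses the equivalent form $\max(x, 0) + \ln(1 + e^{-|x|})$, which is one common rearrangement and an implementation choice of this example. The derivative is computed as the logistic sigmoid. Function names are illustrative.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + exp(x)), rearranged so that exp never receives a large argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    """Derivative of Softplus, which is the logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Just above 0 for large negative x, just above x for large positive x.
for x in (-30.0, 0.0, 30.0):
    print(x, softplus(x), softplus_grad(x))
```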

Exponential Linear Units (ELU)

Exponential linear units smoothly allow negative values. This is an attempt to make the mean activations closer to zero, which speeds up learning.

Definition (ELU)

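The ELU activation function is defined as

$$\operatorname{ELU}(x) = \begin{cases} x & \text{if } x > 0, \\ \alpha\left(e^{x} - 1\right) & \text{if } x \leq 0. \end{cases}$$

The derivative of the ELU activation function is

$$\frac{d}{dx}\operatorname{ELU}(x) = \begin{cases} 1 & \text{if } x > 0, \\ \alpha e^{x} & \text{if } x < 0. \end{cases}$$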

Here $\alpha$ is a hyperparameter to be tuned with the constraint $\alpha \geq 0$.

[Figure: ELU(x) and its derivative d/dx ELU(x), plotted against x.]
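A scalar sketch of ELU, which returns $x$ for positive inputs and $\alpha(e^{x} - 1)$ otherwise. The default `alpha=1.0` is an assumption made for this example, and `math.expm1` is used because it computes $e^{x} - 1$ accurately for small $|x|$.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU; at x == 0 this sketch returns the left-hand value alpha."""
    return 1.0 if x > 0 else alpha * math.exp(x)


# Negative inputs saturate smoothly towards -alpha instead of being cut to 0.
for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_grad(x))
```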

Sigmoid Linear Unit (SiLU)

https://en.wikipedia.org/wiki/Swish_function

The Sigmoid Linear Unit (SiLU), or swish function, is a smooth interpolation between a linear function and the rectified linear unit.

Definition (SiLU)

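The SiLU activation function is defined as

$$\operatorname{SiLU}(x) = x\,\sigma(x) = \frac{x}{1 + e^{-x}},$$

where $\sigma$ is the logistic sigmoid.

The derivative of the SiLU activation function is

$$\frac{d}{dx}\operatorname{SiLU}(x) = \sigma(x) + x\,\sigma(x)\,\bigl(1 - \sigma(x)\bigr).$$

[Figure: SiLU(x) and its derivative d/dx SiLU(x), plotted against x.]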
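A scalar sketch of SiLU, taken here as $x$ times the logistic sigmoid of $x$, with its derivative obtained from the product rule. The function names are illustrative, and the sigmoid helper uses a two-branch form to avoid overflow.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, evaluated without overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def silu(x: float) -> float:
    """SiLU (swish): the input scaled by its own sigmoid."""
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Product rule: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


# Close to 0 for large negative x and close to x for large positive x.
for x in (-20.0, -1.0, 0.0, 20.0):
    print(x, silu(x), silu_grad(x))
```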