If a network combined only linear components, it would itself be a linear operator, so it is essential to include non-linear operations. These are implemented in particular with activation functions, which are layers that transform each component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
The activation function of choice has changed over time. For the majority of modern activation functions (in particular the variants of ReLU), the choice is generally driven by empirical performance.
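The sigmoid (logistic) activation function is defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The derivative of the sigmoid activation function is
$$\frac{d}{dx}\sigma(x) = \sigma(x)\,(1 - \sigma(x))$$
[Figure: sigmoid(x) and its derivative d/dx sigmoid(x), plotted as f(x) against x.]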
History
The sigmoid was a common early activation function because it is a smooth, differentiable nonlinearity that maps real-valued inputs into $(0, 1)$. This made it convenient for gradient-based learning, and the output could be interpreted as a probability or as a soft “on/off” firing rate for a neuron.
It was also mathematically convenient for backpropagation: its derivative can be written directly in terms of its own output, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, where $\sigma$ denotes the sigmoid. This made the computations simple and efficient, especially in early neural network implementations.
Properties
The logistic function $\sigma$ has the symmetry property that $1 - \sigma(x) = \sigma(-x)$.
The logistic function is bounded within $(0, 1)$.
The logistic function is smooth and differentiable everywhere.
The derivative is largest at $x = 0$, where $\sigma(0) = 1/2$ and $\sigma'(0) = 1/4$.
The output can be useful when a value should be interpreted as a probability, especially in the output layer of a binary classifier.
The sigmoid saturates for large positive and large negative inputs, meaning its derivative becomes very small when $|x|$ is large. In deep networks, this can contribute to the vanishing gradient problem.
The sigmoid is not zero-centered, since its outputs are always non-negative. This can make optimization less efficient because activations passed to the next layer tend to have a positive mean.
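The properties listed above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `sigmoid` and `sigmoid_grad` are chosen here for illustration, and the branch on the sign of `x` is one common way to avoid overflow, not the only one.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), arranged to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative written in terms of the output: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Illustrative inputs: the gradient peaks at 0 and shrinks as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.6f}  grad={sigmoid_grad(x):.6f}")
```

At `x = 0` the output is 0.5 and the gradient is 0.25; at `x = 10` the gradient is already below 1e-4, which is the saturation behaviour described above.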
Hyperbolic Tangent (tanh)
Definition (tanh)
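The tanh activation function is defined as
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The derivative of the tanh activation function is
$$\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)$$
[Figure: tanh(x) and its derivative d/dx tanh(x), plotted as f(x) against x.]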
History
The hyperbolic tangent was used as an early activation function for many of the same reasons as the sigmoid: it is smooth, differentiable, bounded, and introduces a useful nonlinearity into the network. It also has a simple derivative, $1 - \tanh^{2}(x)$, which made it convenient for backpropagation.
Compared with the logistic sigmoid, $\tanh$ was often preferred for hidden layers because its output is zero-centered. This usually makes optimization easier than using a hidden activation whose outputs are always positive.
Properties
The hyperbolic tangent is bounded within $(-1, 1)$.
The hyperbolic tangent is smooth and differentiable everywhere.
The hyperbolic tangent is zero-centered and odd: $\tanh(-x) = -\tanh(x)$.
The derivative is largest at $x = 0$, where $\tanh(0) = 0$ and $\tanh'(0) = 1$.
Because $\tanh$ is zero-centered, it is generally better behaved than the logistic sigmoid as a hidden-layer activation.
Like the sigmoid, $\tanh$ saturates for large positive and large negative inputs. When $|x|$ is large, $\tanh(x)$ is close to either $-1$ or $1$, and the derivative is close to $0$. In deep networks, this can contribute to the vanishing gradient problem.
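The zero-centering and saturation properties can be illustrated with a short sketch. It assumes a symmetric set of sample inputs (chosen here purely for illustration) and compares the mean output of tanh with that of the logistic sigmoid; `tanh_grad`, `sigmoid` and `mean` are helper names introduced for this example.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh written in terms of its output: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x: float) -> float:
    """Plain logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Illustrative inputs, symmetric around zero.
inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]

tanh_mean = mean([math.tanh(x) for x in inputs])
sigmoid_mean = mean([sigmoid(x) for x in inputs])

print(f"mean tanh output:    {tanh_mean:.6f}")
print(f"mean sigmoid output: {sigmoid_mean:.6f}")
print(f"tanh gradient at 0:  {tanh_grad(0.0):.6f}")
print(f"tanh gradient at 10: {tanh_grad(10.0):.3e}")
```

On symmetric inputs the tanh outputs average to roughly zero, while the sigmoid outputs average to roughly one half, which is the sense in which tanh is zero-centered and the sigmoid is not.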
The rectified linear unit (ReLU), defined as $\mathrm{ReLU}(x) = \max(0, x)$, is one of the most popular activation functions. Prior to 2010, the most widely used activation functions were the logistic sigmoid (which is inspired by probability theory) and its more numerically efficient counterpart, the hyperbolic tangent. Jarrett et al. (2009) noted that ReLU was critical for object recognition in convolutional neural networks, specifically because it allows average pooling without neighbouring filter outputs cancelling each other out.
To see why, suppose a small pooling window has neighbouring filter outputs $(+a, -a)$ for some large $a > 0$. Average pooling gives $0$, which signals that there is “nothing here”, even though there were strong filter responses (the strong positive and negative results cancelled out). With ReLU applied before the pooling, we have $(a, 0)$ with an average pooling of $a/2$, which says that there is a strong activation somewhere in this neighbourhood. For $\tanh$, we have $\tanh(a) \approx 1$ and $\tanh(-a) \approx -1$, and so strong opposite-signed responses can still average to zero.
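The cancellation argument can be reproduced in a few lines. This is a minimal sketch: the window values are hypothetical, and `average_pool` is a helper defined here for a single one-dimensional window rather than a real pooling layer.

```python
import math


def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def average_pool(window: list[float]) -> float:
    """Average pooling over a single window."""
    return sum(window) / len(window)


# Hypothetical window: two strong filter responses of opposite sign.
window = [5.0, -5.0]

raw_avg = average_pool(window)
relu_avg = average_pool([relu(v) for v in window])
tanh_avg = average_pool([math.tanh(v) for v in window])

print(f"no activation: {raw_avg}")
print(f"ReLU first:    {relu_avg}")
print(f"tanh first:    {tanh_avg}")
```

For this window the raw average is 0 and the ReLU average is 2.5, so only the rectified version preserves evidence of the strong response. The tanh average stays at approximately zero because tanh is odd.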
Properties
Advantages of ReLU include:
Sparse activation: for example, in a randomly initialized network, only about 50% of the hidden units are activated.
Efficiency: only requires comparison and addition.
Scale-invariant: $\max(0, ax) = a \max(0, x)$ for $a \geq 0$.
Potential downsides of ReLU include:
Non-differentiability at zero, although this is rarely a problem in practice because the derivative at zero can be arbitrarily set to 0 or 1.
Not zero-centered: ReLU outputs are always non-negative, which can make it harder for networks to learn during backpropagation because gradient updates tend to push weights in one direction. Batch normalization can help address this.
ReLU is unbounded.
Dying ReLU: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state (it “dies”). This is a form of the vanishing gradient problem. If enough neurons become stuck in a dead state, it can decrease the model capacity and potentially even halt the learning process. This typically arises when the learning rate is set too high. It can be mitigated using variants such as Leaky ReLU instead, where a small positive slope is assigned for $x < 0$.
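The dying ReLU effect and the Leaky ReLU mitigation can be sketched as follows. The weight, bias and inputs are hypothetical values picked so that the pre-activation is negative for every input, and `negative_slope=0.01` is an assumed default for illustration rather than a prescribed value. The derivative at zero is set to 0 by convention.

```python
def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for x > 0, small positive slope otherwise."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU; never exactly zero."""
    return 1.0 if x > 0 else negative_slope


# Hypothetical "dead" neuron: a large negative bias keeps the
# pre-activation w * x + b below zero for every input shown.
w, b = 0.5, -10.0
inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
pre_activations = [w * x + b for x in inputs]

relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

print("ReLU gradients:      ", relu_grads)
print("Leaky ReLU gradients:", leaky_grads)
```

Every ReLU gradient is zero, so no update can move the neuron out of this state, whereas Leaky ReLU still passes a small gradient for each input.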
The Softplus activation function is a smooth approximation to the ReLU activation function.
Definition (Softplus)
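The Softplus activation function is defined as
$$\mathrm{Softplus}(x) = \ln(1 + e^{x})$$
The derivative of the Softplus activation function is
$$\frac{d}{dx}\mathrm{Softplus}(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$$
For large negative $x$, Softplus is approximately $\ln 1$, and so just above $0$, while for large positive $x$ it is roughly $\ln(e^{x})$, and so just above $x$. The derivative of Softplus is the logistic function.
[Figure: Softplus(x) and its derivative d/dx Softplus(x), plotted as f(x) against x.]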
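The claim that the derivative of Softplus is the logistic function can be checked with a central finite difference. This is a sketch under stated assumptions: the rewritten form `max(x, 0) + log1p(exp(-|x|))` is one common overflow-safe way to evaluate $\ln(1 + e^{x})$, and `numerical_derivative` with step `h=1e-5` is a helper introduced only for this check.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x), evaluated as max(x, 0) + log1p(exp(-|x|)) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Plain logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def numerical_derivative(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Illustrative inputs: compare the numerical slope with the sigmoid.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    approx = numerical_derivative(softplus, x)
    print(f"x={x:5.1f}  softplus={softplus(x):.6f}  "
          f"slope~{approx:.6f}  sigmoid={sigmoid(x):.6f}")
```

The last two columns should agree to several decimal places, and the Softplus values sit just above 0 for negative inputs and just above `x` for positive inputs.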
Exponential Linear Units (ELU)
Exponential linear units smoothly allow negative values. This is an attempt to make the mean activations closer to zero, which speeds up learning.
Definition (ELU)
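The ELU activation function is defined as
$$\mathrm{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha\,(e^{x} - 1) & \text{if } x \leq 0 \end{cases}$$
The derivative of the ELU activation function is
$$\frac{d}{dx}\mathrm{ELU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha\, e^{x} & \text{if } x < 0 \end{cases}$$
$\alpha$ is a hyperparameter to be tuned with the constraint $\alpha \geq 0$. At $x = 0$ the two one-sided derivatives agree only when $\alpha = 1$.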
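The effect of ELU on the mean activation can be illustrated with a small sketch. The default `alpha=1.0` and the sample inputs are assumptions made for this example, and the derivative at zero follows the convention of using the left-hand branch.

```python
import math


def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU; the left-hand branch is used at x == 0."""
    return 1.0 if x > 0 else alpha * math.exp(x)


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Illustrative inputs, symmetric around zero.
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

relu_mean = mean([relu(x) for x in inputs])
elu_mean = mean([elu(x) for x in inputs])

print(f"mean ReLU output: {relu_mean:.4f}")
print(f"mean ELU output:  {elu_mean:.4f}")
print(f"ELU gradient at -1: {elu_grad(-1.0):.4f}")
```

Because ELU produces negative outputs for negative inputs, its mean output on these inputs is closer to zero than the ReLU mean, and its gradient for negative inputs is small but non-zero.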