Popular Activation Functions in Neural Networks

--

The most frequently used Activation functions in Deep Neural Networks

What is an activation function, and why do we need one? Which activation functions are most commonly used in neural networks? Why is ReLU so popular? How do you choose an activation function, and which one is best? What are the types of activation functions? These are some frequently asked questions about activation functions, and this blog answers them and gives a clear view of each.

Let’s Start

Some frequently used activation functions in neural networks are the sigmoid function, ReLU, the softmax function, and the tanh function. In every neural network, activation functions play an important role: they introduce the non-linearity that lets the network discover patterns in the data, which underpins applications such as image pattern recognition with convolutional neural networks and speech recognition.

In this blog, we are going to look at the 7 most popular activation functions. We will cover both the theoretical explanation and the practical implementation. So without any delay, let’s jump straight into it and get started.

• What is an Activation Function?
• Why Is an Activation Function Required?
• 7 Popular Activation Functions

What is an Activation Function?

An activation function is a function applied to the output of each neuron in an artificial neural network, which helps the network learn complex patterns in the data.

Why Is an Activation Function Required?

In a neural network, the activation function is responsible for transforming the summed weighted input from the node into the activation of the node or output for that input.

Let’s see the 7 most popular activation functions in deep neural networks.

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network and determines whether the neuron should be activated (“fired”) or not, based on whether that neuron’s input is relevant for the model’s prediction. The popular activation functions are:

1. Step
2. Linear
3. Sigmoid
4. Tanh
5. ReLU
6. Leaky ReLU
7. Softmax

1. Step Function

• Step Function is one of the simplest kinds of activation functions.
• The binary step function is a threshold-based activation function, which means the neuron is activated if the given input x is at or above the threshold (0 in the formula below); otherwise the neuron is not triggered.
• Algebraically, the binary step function is represented as
`f(x) = 1 if x >= 0, else 0`
• Given below is the graphical representation of the step function.
• Why is the binary step function not used frequently in neural networks? One of the most efficient ways to train a multi-layer neural network is gradient descent with backpropagation, and backpropagation requires a differentiable activation function. The derivative of the step function is zero for every x except 0, where it is undefined, so no useful gradient flows back through it. This makes the step function a problem in hidden and output layers.
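The behaviour described above can be sketched in plain Python. This is only an illustration with the threshold fixed at 0; the names `binary_step` and `numerical_derivative` are made up for this example and do not come from any library. The finite-difference estimate shows that, away from the threshold, the slope is zero, so gradient descent receives no signal.

```python
def binary_step(x):
    """Return 1 when x is at or above the threshold 0, else 0."""
    return 1 if x >= 0 else 0


def numerical_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of the slope of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Away from x = 0 the output is flat, so the estimated slope is zero.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, binary_step(x), numerical_derivative(binary_step, x))
```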

2. Linear Function

• A neural network with only linear activation functions is essentially a linear regression model. It has limited power and limited ability to handle complex, varying input data.
• With linear activation functions, no matter how many layers in the neural network, the last layer will be a linear function of the first layer, because a linear combination of linear functions is still a linear function. So a linear activation function turns the neural network into just one layer.
• Algebraically, the linear function is represented as
`f(x) = a*x`
• Given below is the graphical representation of a linear function.
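To make the "collapses into one layer" argument concrete, here is a small sketch using single-input neurons. The weights and biases are hypothetical values chosen only for illustration; the point is that two stacked linear layers give exactly the same outputs as one linear layer with combined parameters.

```python
def linear_layer(x, w, b):
    """One neuron with a linear (identity) activation: w * x + b."""
    return w * x + b


def two_layers(x):
    # Hypothetical weights and biases, chosen only for illustration.
    h = linear_layer(x, w=2.0, b=1.0)
    return linear_layer(h, w=-3.0, b=0.5)


def collapsed(x):
    # The same mapping as a single layer:
    # w = -3.0 * 2.0 = -6.0, b = -3.0 * 1.0 + 0.5 = -2.5
    return linear_layer(x, w=-6.0, b=-2.5)


for x in (-1.0, 0.0, 2.5):
    print(x, two_layers(x), collapsed(x))
```

However many linear layers are stacked, the same algebra applies, which is why a non-linear activation is needed to gain expressive power from depth.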

3. Sigmoid Function

• The sigmoid activation function, also called the logistic function, is traditionally a very popular activation function for neural networks.
• The input to the function is transformed into a value between 0.0 and 1.0.
• The Sigmoid Function curve looks like an S-shape.
• The main reason we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially useful for models that have to predict a probability as an output.
• Since a probability always lies in the range 0 to 1, sigmoid is a natural choice. The function is also differentiable.
• Algebraically, the sigmoid function is represented as
`f(x) = 1 / (1 + e^(-x))`
• Given below is the graphical representation of a sigmoid function.
• A wide variety of sigmoid functions including the logistic and hyperbolic tangent functions have been used as the activation function of artificial neurons.
• Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic density, the normal density, and Student’s T probability density functions.
• The logistic sigmoid function is invertible, and its inverse is the logit function.
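A minimal sketch of the logistic sigmoid, its derivative, and its inverse (the logit), using only the standard library. The function names are illustrative, and the code is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x)); the output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid: s * (1 - s), largest at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def logit(p):
    """Inverse of the sigmoid, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


for x in (-4.0, 0.0, 4.0):
    p = sigmoid(x)
    # logit(p) should recover the original x.
    print(x, round(p, 4), round(sigmoid_derivative(x), 4), round(logit(p), 4))
```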

Applications of Sigmoid Function

• In artificial neural networks, non-smooth functions are sometimes used instead for efficiency; these are known as hard sigmoids.
• In audio signal processing, sigmoid functions are used as wave-shaper transfer functions to emulate the sound of analog circuitry clipping.

4. Hyperbolic Tangent Function ( tanh )

• The tanh function is very similar to the sigmoid function.
• The only difference is that it is symmetric around the origin.
• The range of values, in this case, is from -1 to 1. Thus the inputs to the next layers will not always be of the same sign.
• Algebraically, the tanh function can be represented as
`f(x) = (e^x - e^(-x)) / (e^x + e^(-x))`
• Given below is the graphical representation of a hyperbolic tangent function.
• The gradient of the tanh function is steeper as compared to the sigmoid function.
• Usually, tanh is preferred over the sigmoid function since it is zero centered and the gradients are not restricted to move in a certain direction.
• The function is differentiable. The function is monotonic while its derivative is not monotonic.
• The tanh function is mainly used for classification between two classes. Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
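The similarity to the sigmoid and the steeper gradient can be checked numerically. The sketch below relies on the identity tanh(x) = 2 * sigmoid(2x) - 1 and on the derivative 1 - tanh(x)^2; the helper names are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, used here only for comparison with tanh."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, 0.0, 2.0):
    # tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0
    print(x, round(math.tanh(x), 4), round(rescaled, 4),
          round(tanh_derivative(x), 4))
```

At x = 0 the tanh derivative is 1, compared with 0.25 for the sigmoid, which is what "steeper gradient" refers to.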

Applications

• The tanh function has been mostly used in recurrent neural networks for natural language processing and speech recognition tasks.
• However, the tanh function has the same limitation as the sigmoid function: it also suffers from the vanishing gradient problem.

5. Rectified Linear Unit (ReLU)

• The rectified linear activation function or ReLU for short is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero.
• The rectified linear activation function helps mitigate the vanishing gradient problem, because its gradient does not shrink for large positive inputs. This allows models to learn faster and perform better.
• A key property of the ReLU function is that it does not activate all the neurons at the same time: any neuron whose input is negative outputs zero, which makes the activations sparse and cheap to compute.
• The downside is that the gradient is also zero for negative inputs, so during the backpropagation process the weights and biases of those neurons are not updated. This can create dead neurons that never get activated.
• The range of ReLU is [0, inf).
• Algebraically, the ReLU function can be represented as
`f(x) = max ( 0, x )`
• We prefer to use them when the features of the input aren’t independent.
• ReLU is simple to compute and has a predictable gradient for the backpropagation of the error, which is a large part of why it is so widely used.
• As a rule of thumb, ReLU should be used only in hidden layers. For classification outputs, sigmoid-type functions (logistic, tanh, softmax) and their combinations work well, but they may suffer from the vanishing gradient problem. For RNNs, tanh is commonly preferred as the standard activation function.
• It is used in almost all convolutional neural networks and in many other deep learning models.
• Given below is the graphical representation of the ReLU function.
• The advantage of ReLU over the sigmoid function is this: with sigmoid activation, the gradient goes to zero if the input is very large or very small. In contrast, with ReLU activation, the gradient goes to zero if the input is negative but not if the input is large, so it might have only “half” of the problems of the sigmoid.
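The comparison between the two gradients can be sketched as follows. The function names are illustrative, and using 0 as the ReLU gradient at exactly x = 0 is a convention assumed here, since the function is not differentiable at that point.

```python
import math


def relu(x):
    """ReLU: pass positive inputs through, output 0 otherwise."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU; 0 is used at x = 0 by convention in this sketch."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x):
    """Gradient of the logistic sigmoid: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


# The sigmoid gradient shrinks at both extremes; ReLU's only on the negative side.
for x in (-10.0, -1.0, 1.0, 10.0):
    print(x, relu(x), relu_grad(x), sigmoid_grad(x))
```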

6. Leaky ReLU

• Leaky ReLU is a method to solve the dying ReLU problem.
• Leaky ReLU is a common and effective way to alleviate a dying ReLU. It adds a slight slope in the negative range to prevent the dying ReLU issue.
• Leaky ReLU has a small slope for negative values, instead of altogether zero.
• In Keras, one practical difference is that ReLU is typically passed as an activation function, whereas Leaky ReLU is provided as a layer under Keras layers.
• Activation functions are usually passed to, or wrapped inside, layers such as Activation, while the Leaky ReLU layer gives you a shortcut to that function together with an alpha value.
• Algebraically, the Leaky ReLU function can be represented as
`f(x) = x if x >= 0, else 0.01*x`
• The leaky rectifier allows for a small, non-zero gradient when the unit is saturated and not active.
• Similar to ReLU, Leaky ReLU is continuous everywhere but it is not differentiable at 0. The derivative of the function is 1 for x>0, and α for x<0, where α is the negative-side slope (0.01 in the formula above).
• Given below is the graphical representation of the Leaky ReLU function.
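A minimal plain-Python sketch of Leaky ReLU and its gradient, assuming the slope alpha = 0.01 from the formula above. The function names are illustrative, and the gradient value returned at exactly x = 0 is a convention chosen for this sketch.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha * x for x < 0."""
    return x if x >= 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for x > 0, alpha for x < 0 (1 is used at x = 0 here)."""
    return 1.0 if x >= 0 else alpha


# Unlike plain ReLU, negative inputs keep a small, non-zero gradient.
for x in (-5.0, -1.0, 0.0, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```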

7. Softmax

• The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution.
• By definition, the softmax activation outputs one value for each node in the output layer, and these values sum to 1.
• The softmax function, also known as softargmax or normalized exponential function, is a generalization of the logistic function to multiple dimensions.
• Algebraically, the softmax function can be represented as
`softmax(x)_i = e^(x_i) / (e^(x_1) + ... + e^(x_n))`
• It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.
• The function is not a smooth maximum (a smooth approximation to the maximum function) but is rather a smooth approximation to the arg max function.
• For example, let x = [-0.114, 2.388, 0.936, 0.853, 0.195]. Then softmax(x) is approximately [0.050, 0.609, 0.142, 0.131, 0.068], and the argmax is index 1 (counting from 0).
• The softmax activation function is very much useful in multi-class classification.
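The example vector above can be run through a short softmax sketch. Subtracting the maximum before exponentiating is a common numerical-stability step that is assumed here; it does not change the result. The function name and variable names are illustrative.

```python
import math


def softmax(values):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


x = [-0.114, 2.388, 0.936, 0.853, 0.195]
probs = softmax(x)
predicted = probs.index(max(probs))  # argmax, counting from 0

print([round(p, 3) for p in probs])
print(sum(probs), predicted)
```

The outputs are all positive and sum to 1, so they can be read as a probability distribution over the classes, with the largest input receiving the largest probability.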

--

Machine Learning Engineer proficient in Python | Machine Learning | Web Scraping | Tableau | Flask | Bootstrap | Heroku Deployment | Blog writer | Youtuber