# Popular Activation Functions in Neural Networks

---

The most frequently used Activation functions in Deep Neural Networks

What are the commonly used activation functions? What is an activation function? Why is ReLU so popular? Which activation function is used most often in neural networks? How do you choose an activation function? Why do we use activation functions at all? Which activation function is best? What are the types of activation functions, and why do we need them? These are some frequently asked questions about activation functions, and this blog will answer them and give you a clear view of the topic.

Let’s Start

There are several frequently used activation functions in neural networks, such as the sigmoid, ReLU, softmax, and tanh functions. In every neural network, activation functions play an important role: they introduce non-linearity, which lets the network discover patterns in the data and powers applications such as image pattern recognition with convolutional neural networks, speech recognition, and more.

In this blog, we are going to look at the 7 most popular activation functions. We will see both the **theoretical explanation** and the **practical implementation**. So, without further delay, let's jump straight in and get started.

# Table of Contents

- What is an Activation Function?
- Why Is an Activation Function Required?
- 7 Popular Activation Functions

# What is an Activation Function?

An **activation function** is a **function** applied to each neuron of an artificial neural network; it helps the network learn complex patterns in the data.

# Why Is an Activation Function Required?

In a neural network, the **activation function** is responsible for transforming the summed weighted input from the node into the **activation** of the node or output for that input.

Let’s see the 7 most popular activation functions with Deep Neural Networks.

**Activation functions are** mathematical equations that determine the output of a neural network. The **function** is attached to each neuron in the network, and determines whether it **should** be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. The popular activation functions are,

- Step
- Linear
- Sigmoid
- Tanh
- ReLU
- Leaky ReLU
- Softmax

# 1. Step Function

- The **step function** is one of the simplest kinds of **activation functions**.
- The binary step function is a threshold-based activation function: the neuron is activated if the given input x meets the threshold (here 0); otherwise it is not triggered.
- Algebraically, the binary step function is represented as

`f(x) = 1, x >= 0`

`f(x) = 0, x < 0`

- Given below is the graphical representation of the **step function**.

- Why is the binary step function not used frequently in neural networks? Because one of the most efficient ways to train a multi-layer neural network is gradient descent with backpropagation, and backpropagation requires a differentiable activation function. The derivative of the step function is zero for every x (and undefined at x = 0), so no gradient can flow, which is a problem when using the step function in hidden and output layers.
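As a minimal NumPy sketch (an illustration, not a definitive implementation), the binary step function from the formula above can be written as:

```python
import numpy as np

def binary_step(x):
    """Binary step activation: 1 where x >= 0, else 0."""
    return np.where(x >= 0, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0 0 1 1]
```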

# 2. Linear Function

- A neural network with a linear activation function is simply a linear regression model. It has limited power and a limited ability to handle complex, varying input data.
- With linear activation functions, no matter how many layers in the neural network, the last layer will be a linear function of the first layer, because a linear combination of linear functions is still a linear function. So a linear activation function turns the neural network into just one layer.
- Algebraically, the linear function is represented as

`f(x) = a*x`

- Given below is the graphical representation of a **linear function**.
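A quick NumPy sketch of the linear activation `f(x) = a*x` (the slope `a` is a free parameter; `a = 1` gives the identity function):

```python
import numpy as np

def linear(x, a=1.0):
    """Linear activation f(x) = a*x; with a = 1 it is the identity."""
    return a * x

# Doubles each input: [-4., 0., 6.]
print(linear(np.array([-2.0, 0.0, 3.0]), a=2.0))
```

Note that composing two linear activations `linear(linear(x, a), b)` is just `linear(x, a*b)`, which is the collapse-to-one-layer behavior described above.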

# 3. Sigmoid Function

- The **sigmoid activation function**, also called the **logistic** function, is traditionally a very popular **activation function** for neural networks.
- The input to the **function** is transformed into a value between 0.0 and 1.0.
- The sigmoid function curve looks like an S-shape.
- The main reason we **use the sigmoid function** is that its output lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as the output.
- Since the probability of anything exists only in the range 0 to 1, **sigmoid** is the right choice. The **function** is differentiable.
- Algebraically, the sigmoid function is represented as

`f(x) = 1 / (1 + e^(-x))`

- Given below is the graphical representation of a **sigmoid function**.

- A wide variety of sigmoid functions including the logistic and hyperbolic tangent functions have been used as the activation function of artificial neurons.
- Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic density, the normal density, and Student’s T probability density functions.
- The logistic sigmoid function is *invertible*, and its inverse is the *logit* function.

## Applications of Sigmoid Function

- In artificial neural networks, non-smooth approximations are sometimes used instead for efficiency; these are known as hard sigmoids.
- In audio signal processing, sigmoid functions are used as wave-shaper transfer functions to emulate the sound of analog circuitry clipping.

# 4. Hyperbolic Tangent Function ( tanh )

- The tanh function is very similar to the sigmoid function.
- The only difference is that it is symmetric around the origin.
- The range of values, in this case, is from -1 to 1. Thus the inputs to the next layers will not always be of the same sign.
- Algebraically, the tanh function can be represented as

`f(x) = (e^x - e^(-x)) / (e^x + e^(-x))`

- Given below is the graphical representation of a **hyperbolic tangent function**.

- The gradient of the tanh function is steeper as compared to the sigmoid function.
- Usually, tanh is preferred over the sigmoid function since it is zero centered and the gradients are not restricted to move in a certain direction.
- The **function** is differentiable. The **function** is monotonic while its derivative is not monotonic.
- The **tanh function** is mainly used for classification between two classes. Both **tanh** and logistic **sigmoid activation functions** are used in feed-forward nets.

## Applications

- The **tanh** function has mostly been **used in** recurrent neural networks for natural language processing and speech recognition tasks.
- However, the **tanh** function, too, has a limitation just like the sigmoid function: it does not solve the vanishing gradient problem.
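The formula above can be sketched directly in NumPy and checked against the built-in `np.tanh`:

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent: zero-centered, output in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))     # symmetric around the origin: tanh(-x) == -tanh(x)
print(np.tanh(x))  # NumPy's built-in gives the same result
```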

# 5. Rectified Linear Unit (ReLU)

- The rectified linear **activation function**, or **ReLU** for short, is a piecewise linear **function** that will output the input directly if it is positive; otherwise, it will output zero.
- The rectified linear **activation function** overcomes the vanishing gradient problem, allowing models to learn faster and perform better.
- The main advantage of using the **ReLU function** over other **activation functions** is that it does not activate all the neurons at the same time.
- Because of this, during backpropagation the weights and biases of some neurons are never updated, which can create dead neurons that never get activated.
- The **range of ReLU** is [0, inf).
- Algebraically, the ReLU function can be represented as

`f(x) = max ( 0, x )`

- **We** prefer to **use** ReLU when the features of the input aren't independent. **ReLU** is simple to compute and has a predictable gradient for the backpropagation of the error.
- Always remember that **ReLU** should only be used in hidden layers. For classification in the output layer, sigmoid-family functions (logistic, **tanh**, **softmax**) and their combinations work well, though they may suffer from the vanishing gradient problem. For RNNs, **tanh** is the preferred standard activation function.
- ReLU is **used** in almost all convolutional neural networks and deep learning models.
- Given below is the graphical representation of the **ReLU function**.

- The advantage of ReLU over the sigmoid function is that with **sigmoid activation** the gradient goes to zero if the input is very large or very small, whereas with **ReLU activation** the gradient goes to zero only if the input is negative, not if it is large, so ReLU has only "half" of the sigmoid's problems.
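A minimal NumPy sketch of ReLU and its gradient (the gradient at exactly 0 is undefined; taking it as 0 there is a common convention, assumed here):

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for x > 0, 0 for x < 0 (taken as 0 at x = 0)."""
    return (x > 0).astype(float)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(relu(x))       # [0. 0. 0. 2.]
print(relu_grad(x))  # [0. 0. 0. 1.]
```

The zero gradient for all negative inputs is exactly what produces the dead-neuron behavior discussed above.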

# 6. Leaky ReLU

- **Leaky ReLU** is a method to solve the dying **ReLU** problem.
- It is the most common and effective way to alleviate a **dying ReLU**: it adds a slight slope in the negative range to prevent the issue.
- **Leaky ReLU** has a small slope for negative values, instead of altogether zero.
- In Keras, the **difference** is that **ReLU** is an activation function whereas **LeakyReLU** is a layer defined under Keras layers.
- Ordinary activation functions need to be wrapped in or used inside layers such as `Activation`, but the **LeakyReLU** layer gives you a shortcut to that function with an alpha value.
- Algebraically, the Leaky ReLU function can be represented as

`f(x) = 0.01*x, x < 0`

`f(x) = x, x >= 0`

- The **leaky** rectifier allows a small, non-zero gradient when the unit is saturated and not active.
- Similar to **ReLU**, **Leaky ReLU** is continuous everywhere but not **differentiable** at 0. The derivative of the function is 1 for x > 0 and α for x < 0.
- Given below is the graphical representation of the **Leaky ReLU function**.
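A short NumPy sketch of Leaky ReLU with the conventional default slope α = 0.01 used in the formula above:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x >= 0, alpha*x for x < 0."""
    return np.where(x >= 0, x, alpha * x)

x = np.array([-10.0, -1.0, 0.0, 5.0])
# Negative inputs are scaled by alpha: [-0.1, -0.01, 0.0, 5.0]
print(leaky_relu(x))
```

In Keras the same behavior is available as the `LeakyReLU` layer mentioned above, with the slope set through its alpha argument.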

# 7. Softmax

- The **softmax function** is used as the **activation function** in the output layer of neural network models that predict a multinomial probability distribution.
- By definition, the **softmax activation** will output one value for each node in the output layer.
- The **softmax function**, also known as **softargmax** or the **normalized exponential function**, is a generalization of the logistic function to multiple dimensions.
- Algebraically, the softmax function can be represented as

`softmax(x_i) = e^(x_i) / Σ_j e^(x_j)`

- It is used in multinomial logistic regression and is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.
- The function is not a smooth maximum (a smooth approximation to the maximum function) but is rather a smooth approximation to the arg max function.
- For example, Let x = [-0.114, 2.388, 0.936, 0.853, 0.195], then softmax(x) = [0.049, 0.608, 0.142, 0.131, 0.067] and finally the argmax is 1.
- The softmax activation function is very useful in multi-class classification.
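The worked example above can be reproduced with a short NumPy sketch (subtracting the maximum before exponentiating is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))  # shift by max to avoid overflow
    return e / e.sum()

x = np.array([-0.114, 2.388, 0.936, 0.853, 0.195])
p = softmax(x)
print(p)             # largest probability at index 1, as in the example
print(np.argmax(p))  # 1
print(p.sum())       # probabilities sum to 1
```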

For the Python Implementation, you can use the below Google Colab Notebook.

*Hope you all loved it.*

Kindly give a clap if I deserve it.

For more stuff, do follow me.