At x.ai, we have an organizational concept called “practices.” A practice is a group of people who come up with the best practices for a given area. Some of these practices are in the areas of data science, JavaScript, data engineering, and user testing. Some members of a practice are highly skilled in that area, and some members are just interested in learning.

I’m an engineer, but I’ve been learning immensely from the data science practice. I was particularly curious about activation functions available in Keras, a Python library for building neural networks. When preparing to present to the group, I found that information about activation functions is spread out all over the internet. My hope is that this blog post can help other engineers interested in data science understand what activation functions are, why they are used, and why so many different types of activation functions are used.

You may be asking, what’s the role of an activation function in a neural network? Each connection between neurons learns its own weight. A neuron is connected to multiple neurons in the previous layer; the input to that neuron is the sum of the previous layer’s outputs, each multiplied by the learned weight on its connection. Here’s a diagram and equation:

[Figure: diagram of a neuron’s weighted inputs, with the equation defining its input h]

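Neuron 3’s input (defined above as h) is then fed into an activation function. Activation functions are used to introduce nonlinearity to models, which allows deep learning models to learn nonlinear prediction boundaries. Without an activation function, a stack of layers would collapse into a single linear function of the input, no matter how many layers it had.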
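As a rough sketch of the computation described above, the snippet below computes a neuron’s input h as a weighted sum and then passes it through an activation function. The function names, the example weights, and the choice of tanh as the activation are assumptions made for illustration; they are not taken from the diagram.

```python
import math

def neuron_input(prev_outputs, weights):
    """Weighted sum of the previous layer's outputs (the value called h)."""
    return sum(o * w for o, w in zip(prev_outputs, weights))

def neuron_output(prev_outputs, weights, activation):
    """Apply an activation function to the neuron's input h."""
    return activation(neuron_input(prev_outputs, weights))

# Hypothetical values: two upstream neurons feeding one neuron.
prev_outputs = [1.0, 2.0]
weights = [0.5, -0.25]

h = neuron_input(prev_outputs, weights)   # 1.0*0.5 + 2.0*(-0.25)
out = neuron_output(prev_outputs, weights, math.tanh)
```

Any function of one number can be passed in as the activation, which is what makes it easy to swap in the functions discussed in the rest of this post.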

There are two main classes of activation functions: those with a logistic shape and those with a rectifier shape.

 

Logistic Shape

Logistic functions have an “S” shape, with two horizontal asymptotes that bound the function. They’re useful because they can force any input to be bounded between two outputs. The function increases at an increasing rate up to its midpoint and then increases at a decreasing rate, so the gradient is largest in the middle and shrinks toward 0 at both extremes.

Sigmoid: The sigmoid function is commonly used for binary classification problems; a classic example is Twitter sentiment analysis, where you’re trying to predict if the sentiment of a tweet is positive or negative, with a bunch of tweets as your data set. The sigmoid outputs values in the range (0, 1), which can be read as the probability that the input is of a specific class. There are asymptotes at y = 0 and y = 1. The sigmoid function is defined as: f(x) = 1 / (1 + e^(-x))
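Here is a minimal sketch of the sigmoid in plain Python, using the standard definition 1 / (1 + e^(-x)). The two-branch form is only there to avoid overflow for very negative inputs, and predict_positive is a hypothetical helper with an assumed 0.5 threshold.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)

def predict_positive(score, threshold=0.5):
    """Hypothetical helper: call a tweet positive if its probability exceeds the threshold."""
    return sigmoid(score) > threshold
```

For example, a score of 0 maps to a probability of exactly 0.5, and scores far from 0 in either direction are squashed toward 0 or 1.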

One potential challenge of using sigmoid activation functions is that these neural networks can have “vanishing gradients” when the network gets sufficiently deep and the neurons get saturated. Sigmoids squash the entire input space into outputs between 0 and 1. Therefore, once a neuron is saturated, large changes in input will result in small changes in output, which equates to small gradients.

As we add more layers, large changes in the input will produce even smaller changes in the output, because the small gradients of each layer are multiplied together. Each connection between two neurons has one weight. At a very basic level, a method called backpropagation computes the gradient for each weight, and the weight is then moved in the opposite direction of that gradient. If the gradient is small, your network will learn very slowly, or possibly not at all. ReLU, an activation function that will be discussed later, is widely used to mitigate this.
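To make the effect concrete, here is a small sketch that multiplies the sigmoid’s slope along a single path through 10 layers. It deliberately ignores the weights (treating them as 1), which is a simplifying assumption to isolate what the activation function does to the gradient; the layer count and input values are also made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)); at most 0.25, at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_grad(pre_activations):
    """Product of the sigmoid slopes along one path through the layers.

    Simplification: weights are treated as 1 so only the activation's
    contribution to the backpropagated gradient is shown.
    """
    grad = 1.0
    for h in pre_activations:
        grad *= sigmoid_grad(h)
    return grad

best_case = chained_grad([0.0] * 10)  # every neuron at its steepest point
saturated = chained_grad([5.0] * 10)  # every neuron saturated
```

Even in the best case the sigmoid’s slope is only 0.25 per layer, so the product shrinks quickly with depth, and it shrinks far faster when the neurons are saturated.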

Softmax: The softmax function generalizes the sigmoid function to more than two classes. When trying to predict two classes, the softmax function is equivalent to using the sigmoid activation function. The softmax outputs of a layer of a neural network will always sum to 1, so they can be read as a probability distribution over the classes.
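The sketch below uses the standard softmax definition, where each output is e^(x_i) divided by the sum of e^(x_j) over the layer. Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The example scores are hypothetical.

```python
import math

def softmax(xs):
    """Softmax over a list of scores; subtracting the max keeps exp() stable."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1])  # three hypothetical class scores
two_class = softmax([1.5, 0.0])   # two classes, second score fixed at 0
```

Here probs sums to 1, and the first entry of two_class matches sigmoid(1.5), which is the two-class equivalence mentioned above.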


Hard Sigmoid: Instead of calculating the sigmoid as defined above, you can use an approximation of the sigmoid. When trained, the hard sigmoid can generate very similar outputs to the sigmoid, but is much cheaper to compute. It’s defined as a piecewise linear function: the output is 0 when x < -2.5, 1 when x > 2.5, and 0.2x + 0.5 in between, so the slope is 0 outside that range and 0.2 inside it. Because the hard sigmoid is easier for a computer to calculate, each iteration of learning is faster than with a traditional sigmoid. If time to train the model is a constraint, this activation function lets you fit more training iterations into the same time, which could lead to a better model.
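A minimal sketch of the hard sigmoid, assuming the line 0.2x + 0.5 clipped to [0, 1], which is what a slope of 0.2 with breakpoints at -2.5 and 2.5 implies. Library implementations may use different constants, so treat these values as an illustration of the text rather than any particular library’s definition.

```python
def hard_sigmoid(x):
    """Piecewise linear approximation of the sigmoid: 0.2*x + 0.5, clipped to [0, 1]."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))
```

There is no exponential to evaluate, only a multiply, an add, and two comparisons, which is where the speedup comes from.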


Rectifier Shape

Rectifier functions are characterized by having slopes of 0 or approaching 0 when the input is less than 0 and slopes of 1 or close to 1 when the input is greater than or equal to 0. One of their benefits over logistic functions is that the output is not bounded between two asymptotes. A rectifier such as ReLU can generate output values in the range of [0, ∞), and its slope does not shrink toward 0 for large positive inputs, which is why rectifiers have become popular in solving the “vanishing gradient problem” described earlier.

ReLU: ReLUs are very commonly used as the activation functions in the hidden layers of a neural network. Hidden layers are middle layers; they’re not the first or last layers in the network. The ReLU itself is a very simple function: max(0, x). It has a slope of 0 for x < 0 and a slope of 1 for x > 0.

If a neuron’s input is less than 0, the neuron’s activation function will return 0 to the next layer in the network. Because the slope is also 0 there, that neuron’s weights receive no gradient for that input. When there are too many neurons returning 0 to the next layer, the network will have trouble propagating information through its layers.
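Here is a short sketch of ReLU and its slope in plain Python, applied to a hypothetical layer to show how many neurons end up returning 0. The slope at exactly x = 0 is undefined mathematically; returning 0 there is a convention assumed for this example.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU; the value at exactly x = 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0

# Hypothetical inputs to five neurons in one hidden layer.
layer_inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
layer_outputs = [relu(h) for h in layer_inputs]
zero_fraction = layer_outputs.count(0.0) / len(layer_outputs)
```

In this made-up layer, three of the five neurons output 0, so only two of them pass any information forward.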

[Figure: graph of the ReLU function]

Leaky ReLU: In a standard ReLU, the slope of the activation function for input values less than 0 is 0. In a Leaky ReLU, however, we give the function a small positive slope for x < 0 (let’s call it the constant C), so the neuron still passes along a small gradient and the network can learn to move away from the negative x values.

f(x) = (Cx)(x < 0) + (x)(x >= 0)

The Leaky ReLU will have a slope of 1 when x > 0 and have a slope of C when x < 0. This allows the network to continue learning and avoids the trap of the network becoming very sparse. [1]
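The Leaky ReLU formula translates almost directly into code. In this sketch the default value of 0.01 for C is an assumption chosen for illustration; the text does not specify a value, and libraries differ in their defaults.

```python
def leaky_relu(x, c=0.01):
    """f(x) = C*x for x < 0 and x for x >= 0."""
    return x if x >= 0 else c * x

def leaky_relu_grad(x, c=0.01):
    """Slope is 1 for x >= 0 and C for x < 0, so it is never exactly 0."""
    return 1.0 if x >= 0 else c
```

Because the slope for negative inputs is C rather than 0, a neuron with a negative input still receives a small weight update.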

[Figure: graph of the Leaky ReLU function]

PReLU: PReLU is very similar to the Leaky ReLU function above because the model multiplies x by a value when x < 0. But instead of just guessing that value, you make it a parameter of your model and let training find the optimal value to multiply by x. [2]
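To illustrate what “training the parameter” means, here is a toy sketch of one gradient-descent step on the PReLU slope, called a here. The squared-error loss, the learning rate, the starting value of 0.25, and the helper names are all assumptions for this example; a real network would update a alongside its weights using whatever loss and optimizer the model is configured with.

```python
def prelu(x, a):
    """PReLU: x for x >= 0, a*x for x < 0, where a is learned."""
    return x if x >= 0 else a * x

def prelu_grad_a(x):
    """Derivative of prelu(x, a) with respect to a: x for x < 0, else 0."""
    return x if x < 0 else 0.0

def sgd_step_on_a(a, x, target, lr=0.1):
    """One gradient-descent step on a, using squared error on a single input."""
    error = prelu(x, a) - target
    grad = 2.0 * error * prelu_grad_a(x)
    return a - lr * grad

a = 0.25
a_next = sgd_step_on_a(a, x=-2.0, target=-1.0)  # moves a from 0.25 toward 0.5
```

Note that only negative inputs influence a: for x >= 0 the derivative with respect to a is 0, so the step leaves a unchanged.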

 

Further Learning

If you’re an aspiring data scientist or just curious, it’s useful to gain an understanding of the common activation functions. These functions add nonlinearity to neural networks. Without them, a neural network would not be able to learn nonlinear decision boundaries; these nonlinear boundaries are what allow deep learning techniques to master such varied and complex domains of expertise. If you’re as fascinated by this as I am, you’ll enjoy checking out more of Keras’ advanced activation functions with the source code here!

Special thanks to Adam Kleczewski and the whole x.ai Data Science team for their help with this piece!