Simply put, the rich expressiveness of neural networks comes from nesting simple activation functions into deep hierarchies.
In this post, I summarize what I learned about the activation functions used in neural networks.
# Sigmoid function

```python
import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_d(x):
    return (1 / (1 + np.exp(-x))) * (1 - (1 / (1 + np.exp(-x))))

# Graph display
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, sigmoid_d(x), label='sigmoid_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()
```
**Sigmoid function:**
sigmoid(x) = \frac{1}{1+e^{-x}}
**Derivative of the sigmoid function:**
sigmoid'(x) = \frac{1}{1+e^{-x}} * (1 - \frac{1}{1+e^{-x}})
The sigmoid function has appeared in neural network textbooks for a long time. It has an elegant shape that hardly changes when differentiated, but these days it is rarely used as an activation function.
The reason is visible in the graph: as x grows, y saturates at 1 and stops moving. A neural network optimizes its weight parameters from the gradient obtained by differentiating y, so once the derivative is nearly 0 it can no longer move toward the optimum (the vanishing gradient problem).
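To make the vanishing gradient concrete, here is a minimal sketch of my own (not part of the original article): by the chain rule, the gradient flowing through a stack of sigmoid layers is a product of local derivatives, each at most 0.25, so it shrinks rapidly with depth. Weights are fixed to 1 here purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_d(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Chain rule through a stack of sigmoid layers (all weights fixed to 1):
# each local derivative is at most 0.25, so the product decays quickly.
a = 2.0      # pre-activation fed into the first layer
grad = 1.0   # gradient arriving from the output side
for layer in range(10):
    grad *= sigmoid_d(a)   # multiply in this layer's local derivative
    a = sigmoid(a)         # forward the activation to the next layer
    print(f"after layer {layer + 1}: gradient = {grad:.2e}")
```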
**Derivation of the derivative of the sigmoid function**
sigmoid'(x) = ((1 + e^{-x})^{-1})'\\
By the chain rule for composite functions, setting u = 1+e^{-x} and using \frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}:\\
= -(1 + e^{-x})^{-2} * (1+e^{-x})'\\
= -\frac{1}{(1+e^{-x})^2} * (-e^{-x})\\
= \frac{1}{1+e^{-x}} * \frac{e^{-x}}{1+e^{-x}} \\
= \frac{1}{1+e^{-x}} * (\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}})\\
= \frac{1}{1+e^{-x}} * (1 - \frac{1}{1+e^{-x}})
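As a quick sanity check of this result (my own addition, not in the original article), the closed-form derivative can be compared against a numerical central difference:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_d(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Central difference approximation of the derivative
x = np.linspace(-5, 5, 11)
h = 1e-5
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.max(np.abs(numerical - sigmoid_d(x))))  # on the order of 1e-11
```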
# Tanh function

```python
# Tanh function
def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# Derivative of the Tanh function
def tanh_d(x):
    return 1 - ((np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x)))**2

# Graph display
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, tanh(x), label='tanh')
plt.plot(x, tanh_d(x), label='tanh_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()
```
**Tanh function:**
tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}
**Derivative of the Tanh function:**
tanh'(x) = \frac{4}{(e^x + e^{-x})^2} = 1 - tanh^2(x)
The Tanh function was used as an improved version of the sigmoid (its derivative peaks at 1, versus 0.25 for the sigmoid), but the fundamental problem that y saturates at 1 as x increases remains unsolved.
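To put numbers on that parenthetical remark (a small addition of mine), the peaks of the two derivatives can be checked directly:

```python
import numpy as np

x = np.linspace(-5, 5, 1001)
sigmoid = 1 / (1 + np.exp(-x))
print((sigmoid * (1 - sigmoid)).max())  # 0.25 -- peak of the sigmoid derivative
print((1 - np.tanh(x)**2).max())        # 1.0  -- peak of the tanh derivative
```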
**Derivation of the derivative of the Tanh function**
Using the quotient rule (\frac{f(x)}{g(x)})' = \frac{f'(x)*g(x) - f(x)*g'(x)}{g(x)^2}:\\
tanh'(x) = \frac{(e^x+e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
= \frac{e^{2x}+2+e^{-2x} - (e^{2x} -2 + e^{-2x})}{(e^x + e^{-x})^2}\\
= \frac{4}{(e^x + e^{-x})^2}\\
or, equivalently,\\
tanh'(x) = \frac{(e^x+e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
= 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
= 1 - (\frac{e^x - e^{-x}}{e^x + e^{-x}})^2 = 1 - tanh^2(x)
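As a quick numerical check (my own addition, using NumPy's built-in np.tanh), the two equivalent forms of the derivative do agree:

```python
import numpy as np

x = np.linspace(-5, 5, 101)
form1 = 4 / (np.exp(x) + np.exp(-x))**2  # 4 / (e^x + e^{-x})^2
form2 = 1 - np.tanh(x)**2                # 1 - tanh^2(x)
print(np.allclose(form1, form2))         # True
```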
# ReLU function

```python
# ReLU function
def relu(x):
    return np.maximum(0, x)

# Derivative of the ReLU function
def relu_d(x):
    return np.array(x > 0, dtype=int)

# Graph display
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, relu(x), label='relu')
plt.plot(x, relu_d(x), label='relu_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()
```
**ReLU function:**
relu(x) = max(0, x)
**Derivative of the ReLU function:**
relu'(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}
The ReLU function was created to eliminate the sigmoid's fundamental problem: as x increases, y increases proportionally with it, and the derivative stays at a constant 1 no matter how large x becomes. Hearing this now it seems obvious, but ReLU was hardly used until around 2012.
Professor Yutaka Matsuo of the University of Tokyo has said that the sigmoid, being simple and keeping almost the same shape when differentiated, was a beautiful function in the eyes of science and engineering people, whereas ReLU looked inelegant and is not even differentiable at (0, 0), so nobody wanted to use it.
Back when deep learning did not work well anyway, everyone stuck with the beautiful sigmoid. Once networks could actually be trained, people started experimenting with all sorts of alternatives, and that, it is said, is how ReLU came into use.
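To see the contrast with the sigmoid directly, here is a small sketch of my own (not from the original article) that prints the gradient of each function for increasingly large inputs: the sigmoid gradient vanishes, while the ReLU gradient stays at a constant 1.

```python
import numpy as np

def sigmoid_d(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def relu_d(x):
    return np.array(x > 0, dtype=int)

# The sigmoid gradient decays toward 0 for large inputs;
# the ReLU gradient is exactly 1 for any positive input.
for x in [1, 5, 10, 20]:
    print(f"x={x:2d}  sigmoid'={sigmoid_d(x):.2e}  relu'={relu_d(x)}")
```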
# Leaky ReLU function

```python
# Leaky ReLU function
def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)

# Derivative of the Leaky ReLU function
def leaky_relu_d(x):
    return np.where(x > 0, 1, 0.01)

# Graph display
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, leaky_relu(x), label='leaky_relu')
plt.plot(x, leaky_relu_d(x), label='leaky_relu_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()
```
**Leaky ReLU function:**
leaky\_relu(x) = \begin{cases} x & (x > 0) \\ 0.01x & (x \le 0) \end{cases}
**Derivative of the Leaky ReLU function:**
leaky\_relu'(x) = \begin{cases} 1 & (x > 0) \\ 0.01 & (x \le 0) \end{cases}
The Leaky ReLU function is a variant of ReLU with a small slope of 0.01x even when x is 0 or less. It was expected to optimize better than ReLU, but the situations where it actually outperforms ReLU appear to be fairly limited.
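As a small illustration of that difference (my own sketch, not from the original article): for negative inputs ReLU passes no gradient at all, so a unit stuck in the negative region stops learning, while Leaky ReLU still passes a small 0.01 gradient.

```python
import numpy as np

def relu_d(x):
    return np.where(x > 0, 1, 0)

def leaky_relu_d(x):
    return np.where(x > 0, 1, 0.01)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_d(x))        # gradient is 0 for every negative input
print(leaky_relu_d(x))  # a small 0.01 gradient is kept for negative inputs
```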
Finally, let's get a feel for how much optimization performance differs depending on the activation function.
[TensorFlow Playground](http://playground.tensorflow.org/)
Set up a neural network with two hidden layers of three neurons each on the circle dataset, and compare the convergence time while switching the Activation (activation function) setting.
There is some randomness involved, but roughly speaking Tanh converges about 10 times faster than Sigmoid, and ReLU about 2 times faster than Tanh.
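If you would rather reproduce a similar comparison locally than in the browser, here is a rough sketch of my own using scikit-learn's MLPClassifier instead of TensorFlow Playground. The architecture mirrors the two-hidden-layer, three-neuron setup above, but the dataset parameters, solver, and iteration counts are my assumptions and will differ from the Playground results.

```python
import warnings
from sklearn.datasets import make_circles
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

warnings.simplefilter("ignore", ConvergenceWarning)

# A two-class "circle" dataset, similar in spirit to Playground's circle problem
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# Same architecture for every run: two hidden layers of three neurons each.
# "logistic" is scikit-learn's name for the sigmoid activation.
for act in ["logistic", "tanh", "relu"]:
    clf = MLPClassifier(hidden_layer_sizes=(3, 3), activation=act,
                        solver="sgd", learning_rate_init=0.03,
                        max_iter=5000, random_state=0)
    clf.fit(X, y)
    print(f"{act:8s} iterations: {clf.n_iter_:4d}  train accuracy: {clf.score(X, y):.2f}")
```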