Deep learning / activation functions

1. Introduction

Simply put, the rich expressiveness of neural networks comes from nesting simple activation functions into deep hierarchies.

In this post, I will summarize what I have learned about the activation functions used in neural networks.

2. Sigmoid function

import numpy as np
import matplotlib.pyplot as plt

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid function
def sigmoid_d(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Plot the function and its derivative
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, sigmoid_d(x), label='sigmoid_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()

[Figure: plot of the sigmoid function and its derivative]

**Sigmoid function:**

sigmoid(x) = \frac{1}{1+e^{-x}}

**Derivative of the sigmoid function:**

sigmoid'(x) = \frac{1}{1+e^{-x}} \left( 1 - \frac{1}{1+e^{-x}} \right)

The sigmoid function has long been a staple of neural network textbooks. It has an elegant shape that barely changes when differentiated, but these days it is rarely used as an activation function.

The reason is that, as the graph shows, y saturates at 1 as x grows (and at 0 as x decreases). A neural network optimizes its weight parameters using the gradient, so once the derivative becomes almost 0, learning can no longer move toward the optimum (the vanishing gradient problem).
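To see the saturation numerically, here is a quick check using the sigmoid_d defined above; the derivative peaks at 0.25 at x = 0 and is practically zero a few units away:

# The derivative of sigmoid peaks at 0.25 and quickly approaches 0
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_d(x):.6f}")
# x =   0.0  sigmoid'(x) = 0.250000
# x =   2.0  sigmoid'(x) = 0.104994
# x =   5.0  sigmoid'(x) = 0.006648
# x =  10.0  sigmoid'(x) = 0.000045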

**Derivation of the derivative of the sigmoid function**

sigmoid'(x) = \left( (1 + e^{-x})^{-1} \right)'\\

Using the chain rule with u = 1+e^{-x} and \frac{dy}{dx}=\frac{dy}{du}\frac{du}{dx}:

= -(1 + e^{-x})^{-2} \cdot (1+e^{-x})'\\
= -\frac{1}{(1+e^{-x})^2} \cdot (-e^{-x})\\
= \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} \\
= \frac{1}{1+e^{-x}} \cdot \left( \frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}} \right)\\
= \frac{1}{1+e^{-x}} \cdot \left( 1 - \frac{1}{1+e^{-x}} \right)
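As a sanity check, a minimal sketch (using the sigmoid and sigmoid_d defined above) comparing the analytical derivative against a central-difference numerical derivative:

# Compare the analytical derivative with a central-difference approximation
h = 1e-5
x = np.linspace(-5.0, 5.0, 11)
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(numerical, sigmoid_d(x)))  # True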

3. Tanh function

# Tanh function
def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# Derivative of the tanh function
def tanh_d(x):
    return 1 - tanh(x)**2

# Plot the function and its derivative
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, tanh(x), label='tanh')
plt.plot(x, tanh_d(x), label='tanh_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()

[Figure: plot of the tanh function and its derivative]

**Tanh function:**

tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}

**Derivative of the tanh function:**

tanh'(x) = \frac{4}{(e^x + e^{-x})^2} = 1 - tanh^2(x)

The tanh function was used as an improved version of the sigmoid function (its derivative has a larger maximum, 1 instead of 0.25), but the fundamental problem that y saturates at ±1 as |x| grows has not gone away.
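A quick numerical check using the tanh_d defined above shows the larger peak but the same saturation:

# tanh'(0) = 1 (vs. 0.25 for sigmoid), but it still vanishes for large |x|
for x in [0.0, 2.0, 5.0]:
    print(f"x = {x:4.1f}  tanh'(x) = {tanh_d(x):.6f}")
# x =  0.0  tanh'(x) = 1.000000
# x =  2.0  tanh'(x) = 0.070651
# x =  5.0  tanh'(x) = 0.000182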

**Derivation of the derivative of the tanh function**

Using the quotient rule \left( \frac{f(x)}{g(x)} \right)' = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}:
tanh'(x) = \frac{(e^x+e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
= \frac{e^{2x}+2+e^{-2x} - (e^{2x} -2 + e^{-2x})}{(e^x + e^{-x})^2}\\
= \frac{4}{(e^x + e^{-x})^2}\\

Or

tanh'(x) = \frac{(e^x+e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
= 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2}\\
= 1 - (\frac{e^x - e^{-x}}{e^x + e^{-x}})^2\\
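The two forms are equivalent; here is a small sketch (using the tanh and tanh_d defined above) confirming that they agree numerically and that the hand-written tanh matches NumPy's built-in:

# The two expressions for tanh'(x) agree, and tanh matches np.tanh
x = np.linspace(-5.0, 5.0, 101)
form1 = 4 / (np.exp(x) + np.exp(-x))**2
form2 = 1 - tanh(x)**2
print(np.allclose(form1, form2))          # True
print(np.allclose(tanh(x), np.tanh(x)))   # True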

4. ReLU function

# ReLU function
def relu(x):
    return np.maximum(0, x)

# Derivative of the ReLU function (1 for x > 0, 0 otherwise)
def relu_d(x):
    return np.array(x > 0, dtype=int)

# Plot the function and its derivative
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, relu(x), label='relu')
plt.plot(x, relu_d(x), label='relu_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()

[Figure: plot of the ReLU function and its derivative]

**ReLU function:**

relu(x) = \max(0, x)

**Derivative of the ReLU function:**

relu'(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x \leq 0) \end{cases}

The ReLU function was created to eliminate this fundamental problem of the sigmoid function. As x increases, y increases proportionally, and the derivative stays at a constant 1 for positive x. That sounds obvious today, but ReLU was hardly used until around 2012.
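One way to see why the constant derivative matters: in backpropagation the activation derivatives of each layer are multiplied together, so repeated factors below 1 shrink the gradient quickly. A minimal illustration using the sigmoid_d and relu_d defined above (weights ignored; this only shows how the activation factor behaves):

# Product of activation derivatives over 10 layers for an input of x = 2.0
layers = 10
print(sigmoid_d(2.0) ** layers)  # ~1.6e-10 -- the gradient all but vanishes
print(relu_d(2.0) ** layers)     # 1 -- the activation does not shrink it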

Professor Yutaka Matsuo of the University of Tokyo has said that the sigmoid function, being simple and keeping almost the same shape when differentiated, was a beautiful function in the eyes of science and engineering people. ReLU, on the other hand, looked inelegant and is not even differentiable at (0, 0), so nobody wanted to use it.

In the early days deep learning did not work well anyway, so everyone stuck with the beautiful sigmoid function. Once networks actually started to train, people began experimenting more freely, and that is reportedly how ReLU came into use.

5. Leaky ReLU function

# Leaky ReLU function
def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)

# Derivative of the Leaky ReLU function (1 for x > 0, 0.01 otherwise)
def leaky_relu_d(x):
    return np.where(x > 0, 1, 0.01)

# Plot the function and its derivative
x = np.arange(-5.0, 5.0, 0.01)
plt.plot(x, leaky_relu(x), label='leaky_relu')
plt.plot(x, leaky_relu_d(x), label='leaky_relu_d')
plt.ylim(-1.1, 1.1)
plt.legend()
plt.grid()
plt.show()

[Figure: plot of the Leaky ReLU function and its derivative]

**Leaky ReLU function:**

leaky\_relu(x) = \begin{cases} x & (x > 0) \\ 0.01x & (x \leq 0) \end{cases}

**Derivative of the Leaky ReLU function:**

leaky\_relu'(x) = \begin{cases} 1 & (x > 0) \\ 0.01 & (x \leq 0) \end{cases}

The Leaky ReLU function is derived from the ReLU function and keeps a small slope of 0.01 even for x ≤ 0, so the gradient never becomes exactly zero there. It was expected to optimize better than ReLU, but the cases in which it actually does seem to be quite limited.
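The difference shows up for negative inputs: ReLU's gradient is exactly 0 there (a "dead" unit passes no gradient back), while Leaky ReLU still passes a small one. A small check using the relu_d and leaky_relu_d defined above:

# Gradients for a negative pre-activation value
x = np.array([-3.0])
print(relu_d(x))        # [0]    -- no gradient flows back through this unit
print(leaky_relu_d(x))  # [0.01] -- a small gradient still flows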

6. Experience the performance difference between activation functions

Finally, let's get a feel for how much the optimization performance differs depending on the activation function.

[TensorFlow PlayGround](http://playground.tensorflow.org/#activation=relu&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=3,3&seed=0.34380&showTestData=false&discretize=false&percrain True & xTimesY = false & xSquared = false & ySquared = false & cosX = false & sinX = false & cosY = false & sinY = false & collectStats = false & problem = classification & initZero = false & hideText = false)

[Figure: TensorFlow Playground setup with the activation selector highlighted]

Set up a two-layer neural network with three neurons per layer, switch the activation function (the "Activation" setting in the red frame), and compare how long each one takes to converge.

Results vary somewhat from run to run, but roughly speaking tanh converges about 10 times faster than sigmoid, and ReLU about 2 times faster than tanh.
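If you prefer to reproduce this in code rather than in the browser, here is a minimal sketch (my own addition, not part of the Playground experiment) using scikit-learn's MLPClassifier on a comparable circles dataset; hidden_layer_sizes=(3, 3) mirrors the 3-3 network above, and the exact iteration counts will depend on the run:

from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

# Two concentric circles, similar to the Playground "circle" dataset
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

for activation in ['logistic', 'tanh', 'relu']:  # 'logistic' is sklearn's sigmoid
    clf = MLPClassifier(hidden_layer_sizes=(3, 3), activation=activation,
                        learning_rate_init=0.03, max_iter=5000, random_state=0)
    clf.fit(X, y)
    # n_iter_ is the number of iterations until the solver stopped
    print(f"{activation:8s}  iterations: {clf.n_iter_:4d}  final loss: {clf.loss_:.4f}")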
