I tried to summarize four neural network optimization methods

What kind of article is it?

- I would like to summarize four neural network optimization methods that I recently learned by reading the book "Deep Learning from scratch".
- It is basically based on the book, but it is an article that summarizes my personal interpretation, with an emphasis on **how to build an intuition for each method**. Therefore, it does not include exact mathematical formulas or applications.
- If you search the net, you will find more detailed explanations than this article, but on the other hand there are articles on Qiita that repeat almost the same explanation as the book, so I thought there was some value in summarizing my own interpretation, which is why I wrote this article.
- The explanation is quite biased toward the method called Momentum, but please forgive me.
- I am a complete beginner at neural networks, so I would be happy if you could comment on any mistakes or give advice.

Deep Learning from scratch

"Neural networks" and "deep learning" that even ordinary people who are not specialized in machine learning now know only their names. It's one of machine learning, and it's often used for image recognition (like Google image search). Probably the most read book in Japan when studying this neural network is "** Deep Learning from scratch **". As it is read well, I think it is a very good introductory book. (What is it?)

"Learning" in machine learning

Not limited to neural networks, the purpose of machine learning is to learn from training data (teacher data) and predict unknown data. In terms of image recognition, for example, the model learns from a large number of cat images, and when shown a new image it answers whether or not that image is a cat. In ordinary machine learning there is the restriction that features must be specified by hand before learning, but deep learning has the advantage that it can find features in the images automatically during learning.

What is optimization?

So what exactly is learning? In short, it is the task of finding the optimal weight parameters. Once the optimal parameters are found, the correct answer can be predicted for unknown data based on them. This is called **optimization**. For optimization, we define a function called a loss function (cost function) and look for the parameters that minimize it. Least squares error or LogLoss (cross-entropy error) is used as the loss function, and the smaller its value, the better the classification (or regression). So how do we find the point where the loss function is smallest? This book introduces the following four optimization methods.

- SGD (stochastic gradient descent)
- Momentum
- AdaGrad
- Adam
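As a side note, the two loss functions mentioned above can be sketched as follows. This is a minimal illustration with made-up prediction vectors, not code from the book.

```python
import numpy as np

def mean_squared_error(y_pred, y_true):
    """Least squares error (with the conventional 1/2 factor)."""
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def cross_entropy_error(y_pred, y_true, eps=1e-7):
    """Cross-entropy error (LogLoss) for one-hot labels."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0, 0, 1, 0])           # one-hot: the correct class is index 2
good = np.array([0.1, 0.1, 0.7, 0.1])     # mostly correct prediction
bad = np.array([0.6, 0.1, 0.2, 0.1])      # mostly wrong prediction

print(cross_entropy_error(good, y_true))  # ≈ 0.357 (small loss)
print(cross_entropy_error(bad, y_true))   # ≈ 1.609 (large loss)
```

The better the prediction, the smaller the loss; optimization is the search for the weights that drive this value down.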

SGD

SGD is an abbreviation for Stochastic Gradient Descent. It is the most standard optimization method and updates the weights with the following formula.

\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}

Calculate the gradient of the loss function and update the weights so that the loss function becomes smaller. The idea is, "if you keep going down the slope, you will eventually reach the lowest point." $\eta$ is called the learning rate: if it is too small, it takes forever to arrive, and if it is too large, the update moves too far at once and flies off in a ridiculous direction, so you need to set an appropriate value first. Personally, I think of it like golf: if you hit the ball with only a little force it will never reach the hole, but if you hit it too hard it will fly past the hole in a ridiculous direction. That is SGD, but it does not always work well. For example, consider a function such as $f(x, y) = \frac{1}{20}x^{2} + y^{2}$. It looks like a bowl stretched out in the x-axis direction. The result of running SGD on this function is as follows.

(Figure: trajectory of SGD on $f(x, y)$)

Since the slope in the vertical (y) direction is steep, you can see that the trajectory zigzags up and down while moving toward the origin (the point where the loss function is smallest). It is a pretty inefficient movement. The following three methods improve on this.
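As a minimal sketch of the SGD update above, here it is applied to the same $f(x, y) = \frac{1}{20}x^{2} + y^{2}$; the learning rate and starting point are my own choices, not values from the book.

```python
import numpy as np

def f_grad(x, y):
    """Gradient of f(x, y) = x**2 / 20 + y**2 from the text."""
    return np.array([x / 10.0, 2.0 * y])

def sgd(lr=0.9, steps=100, start=(-7.0, 2.0)):
    """Plain gradient descent: W <- W - eta * dL/dW."""
    w = np.array(start)
    path = [w.copy()]
    for _ in range(steps):
        w = w - lr * f_grad(*w)
        path.append(w.copy())
    return np.array(path)

path = sgd()
# The y coordinate flips sign every step (the zigzag in the figure)
# while the x coordinate creeps toward the origin slowly but smoothly.
```

Plotting `path` with matplotlib reproduces the kind of jagged trajectory described above.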

Momentum

Momentum is the momentum that appears in physics. First, let's look at the weight update formulas.

 \boldsymbol{v} \leftarrow \alpha\boldsymbol{v} - \eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}\\
\boldsymbol{W} \leftarrow \boldsymbol{W} +\boldsymbol{v}

If you just look at the formulas, they may not mean much to you, so let's first look at the result of applying the method to the previous function.

(Figure: trajectory of Momentum on $f(x, y)$)

It moves smoothly. Somehow it draws a natural trajectory, like a ball rolling in a bowl. Why does it move smoothly? I drew a simple conceptual diagram below.

(Figure: conceptual diagram of two consecutive weight updates)

Consider one example. Suppose ① represents the first weight update and ② the second. Suppose the result of first calculating $\frac{\partial{L}}{\partial{\boldsymbol{W}}}$ is an upward vector, as shown in ①, and we move in that direction by a step scaled by $\eta$. Now consider the update at ②. Suppose that, when the derivative of $L$ is calculated again, the downhill direction has changed from upward to rightward. In the case of SGD, just as in ①, we simply go right at ②; as a result, the trajectory turns at a right angle. This is the cause of SGD's zigzagging, unnatural trajectory. Now let's check the Momentum formula. There is an $\alpha\boldsymbol{v}$ term in the update formula, and at the time of ② this $\boldsymbol{v}$ holds the step from ①, $-\eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}$. Let's say $\alpha = 1$ for simplicity. Then you can see that $\alpha\boldsymbol{v} - \eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}$ is the sum of the vectors of ① and ②, so at ② we move toward the upper right at 45 degrees instead of straight right. In this way, **the movement becomes smooth because the gradient from the previous weight update is reflected in the next weight update.** $\alpha$ is a parameter that controls how strongly the previous update's gradient carries over: if $\alpha$ is 0, the previous gradient is not reflected at all, and the update goes straight right, exactly as in SGD.
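The update formulas above can be sketched as follows, under the same assumptions as the earlier SGD sketch (the test function $f(x, y) = \frac{1}{20}x^{2} + y^{2}$ and hyperparameter values of my own choosing).

```python
import numpy as np

def f_grad(x, y):
    """Gradient of f(x, y) = x**2 / 20 + y**2 from the text."""
    return np.array([x / 10.0, 2.0 * y])

def momentum(lr=0.1, alpha=0.9, steps=100, start=(-7.0, 2.0)):
    """Momentum update: v <- alpha*v - eta*dL/dW, then W <- W + v."""
    w = np.array(start)
    v = np.zeros_like(w)       # "velocity": remembers previous gradients
    path = [w.copy()]
    for _ in range(steps):
        v = alpha * v - lr * f_grad(*w)
        w = w + v
        path.append(w.copy())
    return np.array(path)

path = momentum()
# The velocity term averages out the up/down zigzag, so the path
# curves smoothly toward the origin like a ball rolling in a bowl.
```

Setting `alpha=0` makes `v` just the current gradient step, recovering plain SGD, which matches the explanation above.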

Next, let's consider why this method is called Momentum. In physics (Newtonian mechanics), momentum is the product of mass ($m$) and the velocity vector ($\boldsymbol{v}$). The following relation between momentum and impulse ($F\Delta{t}$) holds.

 m\boldsymbol{v_{before}}+F\Delta{t} = m\boldsymbol{v_{after}}

Let's think about this with a diagram as well.

(Figure: an object's momentum being changed by an impulse)

It shows how the motion of an object moving upward with momentum $m\boldsymbol{v_{before}}$ changes when it receives an impulse to the right. If the magnitude of the impulse is the same as the initial momentum, the object is deflected to the upper right at 45 degrees; if the impulse is much larger than that, the object is deflected almost straight to the right. You can see that this is very similar to the conceptual diagram above. Earlier I described the trajectory drawn by Momentum as being **like a ball rolling in a bowl**; since the update formula resembles the laws of actual mechanics, it is natural that it draws such a trajectory. The image is: **even if a force is suddenly applied to the right while you are moving upward, you cannot suddenly turn at a right angle**, and that feeling is applied to the weight update.

AdaGrad

Momentum's explanation has become very long, so I will explain the remaining methods briefly. When explaining SGD, I likened the weight update to golf. SGD updates with the same learning rate at every step; in golf terms, you hit the ball with the same strength every time. In real golf, however, you hit the first shot hard to gain distance, and as the hole approaches you hit more softly to fine-tune. The same idea can be applied to neural networks: update the weights by a large amount early in training, and gradually shrink the size of the updates. This technique is called AdaGrad. Let's look at the following formulas.

 \boldsymbol{h} \leftarrow \boldsymbol{h} + \left(\frac{\partial{L}}{\partial{\boldsymbol{W}}}\right)^{2}\\
\boldsymbol{W} \leftarrow \boldsymbol{W} -\frac{1}{\sqrt{\boldsymbol{h}}}\eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}

In the second formula, you divide by $\sqrt{\boldsymbol{h}}$. Since $\boldsymbol{h}$ only accumulates, each time you update the weights the size of the update shrinks. In the first formula, $\left(\frac{\partial{L}}{\partial{\boldsymbol{W}}}\right)^{2}$ means squaring every element of the matrix (it is not an inner product, so it does not become a scalar). In other words, elements that have seen steep gradients so far have their updates shrunk more strongly, while elements with gentle gradients keep relatively large updates. It makes a lot of sense. The following is the result of actually applying this to the previous function.

(Figure: trajectory of AdaGrad on $f(x, y)$)

In this example, it looks like it's working very well.
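The AdaGrad update above can be sketched as follows, on the same assumed test function. The learning rate is my own choice, and a small epsilon is added to the denominator to avoid division by zero on the first step, as is common in implementations.

```python
import numpy as np

def f_grad(x, y):
    """Gradient of f(x, y) = x**2 / 20 + y**2 from the text."""
    return np.array([x / 10.0, 2.0 * y])

def adagrad(lr=1.5, eps=1e-7, steps=100, start=(-7.0, 2.0)):
    """AdaGrad: h accumulates squared gradients; updates shrink by 1/sqrt(h)."""
    w = np.array(start)
    h = np.zeros_like(w)
    path = [w.copy()]
    for _ in range(steps):
        g = f_grad(*w)
        h = h + g * g                     # element-wise square, accumulated
        w = w - lr * g / (np.sqrt(h) + eps)
        path.append(w.copy())
    return np.array(path)

path = adagrad()
# The steep y direction accumulates a large h quickly, so its steps
# shrink fast; the flat x direction keeps larger steps for longer.
```

Because $\boldsymbol{h}$ is per-element, each coordinate effectively gets its own decaying learning rate.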

Adam

Finally, I would like to introduce Adam. Simply put, Adam combines the good points of Momentum and AdaGrad. I won't go into detail here, including the formulas. As before, applying it to the same function gives the figure below.

(Figure: trajectory of Adam on $f(x, y)$)

This also looks good compared to SGD and has reached the minimum.
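Since the article skips Adam's formulas, here is a minimal sketch of the standard Adam update for reference (exponential moving averages of the gradient and of its square, with bias correction), applied to the same assumed test function; the hyperparameter values are my own choices, not from the book.

```python
import numpy as np

def f_grad(x, y):
    """Gradient of f(x, y) = x**2 / 20 + y**2 from the text."""
    return np.array([x / 10.0, 2.0 * y])

def adam(lr=0.3, beta1=0.9, beta2=0.999, eps=1e-8, steps=100, start=(-7.0, 2.0)):
    """Standard Adam: Momentum-like average m plus AdaGrad-like scaling by sqrt(v)."""
    w = np.array(start)
    m = np.zeros_like(w)   # moving average of gradients (the Momentum part)
    v = np.zeros_like(w)   # moving average of squared gradients (the AdaGrad part)
    path = [w.copy()]
    for t in range(1, steps + 1):
        g = f_grad(*w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(w.copy())
    return np.array(path)

path = adam()
```

You can see both parents in the formulas: `m` smooths the direction like Momentum's velocity, and dividing by `sqrt(v_hat)` adapts the step size per element like AdaGrad.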

Summary

- Introduced (quite roughly) the four optimization methods used in neural networks.
- The natural question is, "**Which one should I use after all?**", but there is no method that wins in every situation, so it seems necessary to choose flexibly.
- SGD is still widely used, but Adam seems to be popular these days.

Remarks

- The code used to plot the figures in this article is published in the following git repository.
