- In this article, I'd like to summarize the four optimization methods for neural networks that I recently learned by reading the book "Deep Learning from Scratch".
- It is largely based on the book, but the emphasis is on my personal interpretation: **how can you picture each method intuitively?** It therefore does not include rigorous mathematical formulas or applications.
- A web search will turn up explanations more detailed than this article, but on the other hand there are articles on Qiita that repeat almost the same explanation as the book, so I thought there was some value in summarizing my own interpretation, which is why I wrote this article.
- The explanation is heavily weighted toward the method called Momentum; please forgive me for that.
- I am a complete beginner with neural networks, so I would be happy to receive comments pointing out mistakes or offering advice.
"Neural networks" and "deep learning" that even ordinary people who are not specialized in machine learning now know only their names. It's one of machine learning, and it's often used for image recognition (like Google image search). Probably the most read book in Japan when studying this neural network is "** Deep Learning from scratch **". As it is read well, I think it is a very good introductory book. (What is it?)
The purpose of machine learning, not just of neural networks, is to learn from training data (teacher data) and make predictions on unknown data. In image recognition terms, for example, the model learns from a large number of cat images, and when shown a new image it answers whether that image is a cat or not. Ordinary machine learning has the restriction that features must be specified by hand before learning, whereas deep learning has the advantage that it can discover features from the images on its own during learning.
So what exactly is learning? In short, it is the task of finding the optimal weight parameters. Once the optimal parameters are found, the correct answer can be predicted from unknown data based on them. This is called **optimization**. For optimization, we define a function called the loss function (cost function) and look for the parameters that minimize it. Mean squared error and LogLoss (cross-entropy error) are typical loss functions, and the smaller the value, the better the classification (or regression). So how do we find the point where the loss function is smallest? This book introduces the following four optimization methods (a tiny loss-function example follows the list).
- SGD (stochastic gradient descent)
- Momentum
- AdaGrad
- Adam
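As a quick illustration of "smaller loss = better prediction", here is a minimal sketch of cross-entropy error in NumPy. The function name and the small epsilon guarding against `log(0)` are my own illustrative choices, not taken from the book:

```python
import numpy as np

def cross_entropy_error(y, t):
    """Cross-entropy loss for a one-hot target t and predicted probabilities y."""
    eps = 1e-7  # avoids log(0)
    return -np.sum(t * np.log(y + eps))

t = np.array([0, 1, 0])  # the correct class is index 1
# A prediction that puts high probability on the correct class has low loss...
print(cross_entropy_error(np.array([0.1, 0.8, 0.1]), t))  # ~0.22 (good)
# ...and a prediction that spreads probability elsewhere has high loss.
print(cross_entropy_error(np.array([0.7, 0.2, 0.1]), t))  # ~1.61 (bad)
```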
## SGD

SGD is an abbreviation for Stochastic Gradient Descent. It is the most standard optimization method, and it updates the weights with the following formula.
\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}
Calculate the gradient of the loss function and update the weights so that the loss becomes smaller. The idea is: "if you keep going down the slope, you will eventually reach the lowest point." $\eta$ is called the learning rate; if it is too small, it takes forever to get anywhere, and if it is too large, the weights move too far at once and fly off in a ridiculous direction, so you need to set an appropriate value first. Personally, I think of it like golf: hit the ball too softly and you will never reach the hole, but hit it too hard and it will fly past the hole in a ridiculous direction. That is SGD, but it does not always work well. For example, consider a function such as $f(x, y) = \frac{1}{20}x^{2} + y^{2}$. It looks like a bowl stretched out in the x-axis direction. The result of running SGD on this function is shown below.
Since the slope is steep in the vertical (y) direction, you can see that the path zigzags up and down while heading toward the origin (the point where the loss function is smallest). It is a rather inefficient movement. There are three methods that improve on this.
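Before moving on to those improvements, here is a minimal sketch of an SGD update step in Python. The `params`/`grads` dictionary interface and the toy demo on the function above are my own illustrative choices, not necessarily the book's exact code:

```python
import numpy as np

class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dL/dW."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params:
            params[key] -= self.lr * grads[key]

# Toy demo on f(x, y) = x**2 / 20 + y**2, whose gradient is (x/10, 2y).
params = {"x": np.array(-7.0), "y": np.array(2.0)}
optimizer = SGD(lr=0.9)
for _ in range(30):
    grads = {"x": params["x"] / 10.0, "y": 2.0 * params["y"]}
    optimizer.update(params, grads)
# x shrinks slowly while y overshoots back and forth -- the zigzag seen above.
```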
## Momentum

Momentum is the "momentum" that appears in physics. First, let's look at the weight update formulas.
\boldsymbol{v} \leftarrow \alpha\boldsymbol{v} - \eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}\\
\boldsymbol{W} \leftarrow \boldsymbol{W} +\boldsymbol{v}
Looking at the formulas alone may not make it click, so let's first look at the result of applying Momentum to the previous function.
The movement is smooth. Somehow it draws a natural trajectory, like a ball rolling around in a bowl. Why does it move so smoothly? I drew a simple conceptual diagram below.
Consider one example. Suppose ① represents the first update and ② represents the second update. Suppose the result of first computing $\frac{\partial{L}}{\partial{\boldsymbol{W}}}$ is an upward vector, as in ①, and we move in that direction scaled by $\eta$. Now consider the weight update in ②. Suppose that when we compute the derivative of $L$ again, the downhill direction has changed from up to right. With SGD, just as in ①, we simply move to the right in ②, so the trajectory turns at a right angle. This is the cause of SGD's jagged, unnatural trajectory.

Now let's check the Momentum formula. The update formula has an $\alpha\boldsymbol{v}$ term, and $\boldsymbol{v}$ holds the step taken at the time of ①, i.e. the $-\eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}$ from the first update. For simplicity, let $\alpha = 1$. Then the $\alpha\boldsymbol{v} - \eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}$ part is the sum of the vectors ① and ②, so the direction of ② is no longer straight right but diagonally up-right at 45 degrees (for example, if ① is $(0, 1)$ and ②'s raw step is $(1, 0)$, their sum $(1, 1)$ points 45 degrees up-right). In this way, **the movement becomes smooth because the step from the previous weight update is reflected in the next weight update.** $\alpha$ is a parameter controlling how strongly the previous update's direction carries over; if $\alpha = 0$, the previous step is not reflected at all, and ② points straight right, exactly as in SGD.

Next, let's consider why this method is called Momentum. In physics (Newtonian mechanics), momentum is the product of mass $m$ and the velocity vector $\boldsymbol{v}$. When an object receives an impulse $F\Delta{t}$, its momentum changes as follows:
m\boldsymbol{v_{before}}+F\Delta{t} = m\boldsymbol{v_{after}}
Let's think about this with a diagram as well.
It shows how the motion of an object moving upward with momentum $m\boldsymbol{v_{before}}$ changes when it receives an impulse to the right. If the magnitude of the impulse equals the initial momentum, the object is deflected diagonally up-right at 45 degrees; if the impulse is larger than that, the object is deflected closer to the right. You can see that this is very similar to the conceptual diagram above. Earlier I described the trajectory drawn by Momentum as being **like a ball rolling in a bowl**, and you can see that it draws such a trajectory precisely because its update formula resembles the actual laws of mechanics. The intuition applied to the weight update is: **even if a force suddenly pushes you to the right while you are moving upward, you cannot suddenly turn at a right angle!**
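To make the two-formula update concrete, here is a minimal sketch of Momentum in the same `params`/`grads` style as the SGD sketch above (the interface and the default hyperparameters are my own illustrative choices):

```python
import numpy as np

class Momentum:
    """Momentum update: v <- alpha * v - lr * dL/dW;  W <- W + v."""
    def __init__(self, lr=0.01, alpha=0.9):
        self.lr = lr
        self.alpha = alpha
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            # Start with zero velocity for every parameter.
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.v[key] = self.alpha * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```

Because each step blends the previous velocity with the new gradient, sudden right-angle turns get averaged out into the smooth, ball-like trajectory seen above.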
## AdaGrad

Momentum's explanation got quite long, so I would like to explain the remaining methods briefly. When explaining SGD, I likened weight updates to golf. SGD updates with the same learning rate at every step; in golf terms, you are hitting the ball with the same strength every time. In real golf, however, you hit the first shot hard to cover distance, and as the hole approaches you hit more softly to fine-tune, right? The same idea can be applied to neural networks: update the weights by a large amount early in training, and gradually shrink the size of the updates. This technique is called AdaGrad. Let's look at the following formulas.
\boldsymbol{h} \leftarrow \boldsymbol{h} + \left(\frac{\partial{L}}{\partial{\boldsymbol{W}}}\right)^{2}\\
\boldsymbol{W} \leftarrow \boldsymbol{W} -\frac{1}{\sqrt{\boldsymbol{h}}}\eta\frac{\partial{L}}{\partial{\boldsymbol{W}}}
In the second (lower) formula, you divide by $\sqrt{\boldsymbol{h}}$. Since $\boldsymbol{h}$ only accumulates, this means that each time you update the weights, the size of the update shrinks. In the first (upper) formula, $\left(\frac{\partial{L}}{\partial{\boldsymbol{W}}}\right)^{2}$ means squaring every element of the matrix (it is not an inner product, so it does not become a scalar). In other words, elements whose slope has been steep so far have their update size reduced more aggressively, element by element. It makes a lot of sense. The following shows the result of actually applying this to the previous function.
In this example, it looks like it's working very well.
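Here is a minimal sketch of AdaGrad in the same style. The interface is again my own illustrative choice, and the small constant added before dividing (to avoid division by zero on the first step) is a standard trick that I added myself:

```python
import numpy as np

class AdaGrad:
    """AdaGrad: h accumulates squared gradients, and each element's
    step size is scaled down by 1 / sqrt(h)."""
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.h[key] += grads[key] * grads[key]  # elementwise square
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```

Note that the division is elementwise: on $f(x, y) = \frac{1}{20}x^{2} + y^{2}$, the steep $y$ direction accumulates a large $h$ quickly and gets damped, while the shallow $x$ direction keeps a comparatively large effective step, which is exactly why the zigzag disappears.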
## Adam

Finally, I would like to introduce Adam. Put simply, Adam combines the good parts of Momentum and AdaGrad. I won't go into detail here, including the formulas. As before, applying it to the same function gives the figure below. This also looks good compared to SGD, and it reaches the minimum.
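Since the formulas are omitted above, the sketch below is not from the book: it follows the standard Adam update from the original paper (Kingma & Ba, 2015), so treat it as a reference under that assumption. It keeps a Momentum-like moving average of gradients and an AdaGrad-like moving average of squared gradients, with bias correction for both:

```python
import numpy as np

class Adam:
    """Standard Adam (Kingma & Ba, 2015): a momentum-like average m and an
    AdaGrad-like squared-gradient average v, each with bias correction."""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = None, None, 0

    def update(self, params, grads):
        if self.m is None:
            self.m = {k: np.zeros_like(p) for k, p in params.items()}
            self.v = {k: np.zeros_like(p) for k, p in params.items()}
        self.t += 1
        for key in params:
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)  # bias correction
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```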
- I introduced (quite roughly) the four optimization methods used in neural networks.
- The obvious question is "**which one should I use in the end?**", but there is no one-size-fits-all method, so it seems necessary to choose flexibly depending on the problem.
- SGD is still widely used, but Adam seems to be popular these days.
- The code used to plot the figures in this article is published in the following git: