An introduction to the optimizers used in machine learning.
Introduction Let's start with the familiar example of linear regression. Most of the time you casually feed data into a machine learning package and you're done. You may not even notice the optimizer working away inside the package, but it is there.
Now, somewhat abruptly, let the target variable be $y$, the explanatory variables be $X$, and the regression coefficients be $w$. In linear regression, we want to approximate the target variable by a linear combination of the explanatory variables. That is,
y \sim Xw \tag{1.1}
We aim to determine the coefficients $w$ so that this approximation is as "good" as possible. Here $y$ and $w$ are column vectors and $X$ is a matrix. If the number of data samples is $N$ and the number of explanatory variables is $M$, then $y$ is $N$-dimensional, $w$ is $M$-dimensional, and $X$ is an $N \times M$ matrix. Note that, for simplicity, this article omits the constant (intercept) term. To decide what a "good" $w$ means, you need a loss function. The squared error
L(w) := \|y-Xw\|^2 \tag{1.2}
is the typical choice, so I will use it. To find the value of $w$ that minimizes the loss function $L$, we can differentiate and look for the point where the derivative is zero, so let's differentiate:
\frac{\partial L }{\partial w} = -2(X^Ty-X^TXw)=0 \tag{1.3}
Solving this for $w$ gives
w=\big(X^TX\big)^{-1}X^Ty \tag{1.4}
Here $X^T$ denotes the transpose of $X$. Does this settle everything? It would be a big mistake to think you can now mechanically throw $X$ and $y$ into (1.4). To begin with, (1.4) makes no sense if the inverse of $X^TX$ does not exist. Let's look at a concrete example. Set up the following:
X=\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix},\;
y=\begin{pmatrix}
0\\1
\end{pmatrix},\;
w=\begin{pmatrix}
a\\b\\c
\end{pmatrix} \tag{1.5}
In this case, computing $X^TX$ explicitly gives
X^TX=
\begin{pmatrix}
1 & 1 \\
1 & 2 \\
1 & 3 \\
\end{pmatrix}
\begin{pmatrix}
1 & 1 & 1 \\
1 & 2 & 3
\end{pmatrix}
=
\begin{pmatrix}
2 & 3 & 4 \\
3 & 5 & 7 \\
4 & 7 & 10
\end{pmatrix} \tag{1.6}
Computing the determinant as well, we find
\text{det}\big(X^TX\big) = 0 \tag{1.7}
so the inverse matrix does not exist. Does that mean (1.3) has no solution? Not at all; on the contrary, there are infinitely many. Under the setting (1.5), (1.1) is equivalent to the system of equations
a+b+c=0, \;\; a+2b+3c=1 \tag{1.8}
Since there are only two equations for three unknowns, the solution cannot be determined uniquely. In fact, solving for $a$ and $b$ in terms of $c$ gives a family of solutions parameterized by $c$:
a=c-1, \;\; b=1-2c \tag{1.9}
Every point in this family fits the data exactly, so each one minimizes the loss.
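As a quick numerical check (a minimal sketch assuming only NumPy; it is not part of the original derivation), we can confirm that $X^TX$ is singular and that the family (1.9) reproduces $y$ exactly:
import numpy as np

X = np.array([[1., 1., 1.],
              [1., 2., 3.]])
y = np.array([[0.],
              [1.]])

# X^T X is singular, so (1.4) cannot be applied
print(np.linalg.det(X.T @ X))    # ~0 (up to floating-point error)

# any coefficients of the form (1.9) reproduce y exactly
for c in [0.0, 0.5, 1.0]:
    w = np.array([[c - 1.0], [1.0 - 2.0 * c], [c]])
    print(c, (X @ w).ravel())    # prints [0. 1.] for every c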
Now, doesn't a question arise here? What will a machine learning package return if you feed it the setting (1.5)? Let's try it, using Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
sns.set_context('poster')
Create the data corresponding to (1.5).
X = np.array([[1, 1, 1],[1, 2, 3]])
y = np.array([[0],[1]])
I'll throw it into scikit-learn.
from sklearn.linear_model import LinearRegression
np.random.seed(0)
model = LinearRegression(fit_intercept=False)
model.fit(X,y)
Let's display the regression coefficient after learning.
model.coef_
The result is $a = -0.5$, $b = 0$, $c = 0.5$. It satisfies (1.8), so it is indeed one of the solutions (1.9). However, it certainly was not computed from (1.4), because the inverse of $X^TX$ does not exist. So what happened?
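Incidentally, these coefficients coincide with the minimum-norm least-squares solution, the one given by the Moore-Penrose pseudo-inverse. The following check (a sketch using NumPy; it says nothing about scikit-learn's internals) reproduces the same numbers:
import numpy as np

X = np.array([[1., 1., 1.],
              [1., 2., 3.]])
y = np.array([[0.],
              [1.]])

# minimum-norm solution via the pseudo-inverse
print(np.linalg.pinv(X) @ y)               # a=-0.5, b=0, c=0.5

# np.linalg.lstsq also returns the minimum-norm solution for rank-deficient problems
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                                   # a=-0.5, b=0, c=0.5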
Gradient Descent Let's now talk a little more generally, not just about linear regression. Suppose you want to find the minimum (or maximum) of a function $f(x)$, where the argument $x$ is a vector. Fix a point $x_0$ and consider a small displacement $x_0 \to x_0 + \Delta x$ around it. Expanding to first order in $\Delta x$, we can write
f(x_0+\Delta x)\simeq f(x_0)+\Delta x \cdot \frac{\partial f}{\partial x} (x_0) \tag{1.10}
Here "$\cdot$" denotes the dot product of vectors. Now, if among all possible directions of $\Delta x$ we pick the one along which $f(x)$ decreases (or increases) the most, we should be able to search efficiently for the minimum (or maximum). For that, the inner product should be minimized (or maximized), which means
\Delta x \propto \mp \frac{\partial f}{\partial x} (x_0) \tag{1.11}
In other words, if you want to find the minimum, you choose an initial position and then, writing the time step as the subscript $t$, use
x_{t+1} = x_{t} -\eta \frac{\partial f}{\partial x}(x_t) \tag{1.12}
and repeat this search sequentially. Here $\eta$ is a constant called the learning rate, a "small" positive value that you set by hand. This technique is called Gradient Descent, and it is the most basic optimizer. For linear regression (1.1), writing out the Gradient Descent update gives
\begin{align}
w_{t+1} &= w_t - \eta \frac{\partial L}{\partial w}(w_t) \\
&= w_t - \eta \big[-2(X^Ty-X^TXw_t) \big] \tag{1.13}
\end{align}
and the inverse of $X^TX$ is no longer needed. The infinitely many solutions in (1.9) can then be thought of as corresponding to different initial values of the search.
Let's implement linear regression based on (1.13).
class LR_GradientDescent(object):
    def __init__(self, lr=0.01, n_iter=100, seed=0):
        self.lr = lr          # learning rate
        self.n_iter = n_iter  # number of iterations
        self.seed = seed      # seed for initializing coefficients

    def fit(self, X, y):
        np.random.seed(self.seed)
        # initialize the coefficients
        self.w = np.random.uniform(low=-1.0, high=1.0, size=(X.shape[1], 1))
        self.loss_arr = np.array([])
        for _ in range(self.n_iter):
            # record the current value of the loss function
            self.loss_arr = np.append(self.loss_arr,
                                      (y - X.dot(self.w)).T.dot(y - X.dot(self.w)).flatten(),
                                      axis=0)
            dw = -2 * (X.T.dot(y) - X.T.dot(X).dot(self.w))  # gradient of the loss function
            self.w -= self.lr * dw  # update the coefficients

    def pred(self, X):
        return X.dot(self.w)
I will try using it.
model = LR_GradientDescent(n_iter=1000)
model.fit(X,y)
Let's plot the change in the value of the loss function during training.
plt.plot(model.loss_arr)
plt.xlabel('Number of Iterations')
plt.ylabel('Error')
Let's also look at the regression coefficient.
model.w.flatten()
Within a small numerical tolerance, (1.9) is satisfied.
Next, let's experiment by varying the initial values.
coefs = []
err = []
for seed in range(200):
    model = LR_GradientDescent(n_iter=1000, seed=seed)
    model.fit(X, y)
    coefs.append(model.w.flatten())
    err.append(model.loss_arr[-1])
df_sol = pd.DataFrame(coefs)
df_sol.columns = ['a', 'b', 'c']
df_sol['error'] = err
df_sol
You can see that the converged regression coefficients differ completely depending on the initial values: the so-called initial-value dependence. Next, let's plot the obtained regression coefficients $(a, b, c)$ in 3D.
fig = plt.figure()
ax = Axes3D(fig)
# solutions found by gradient descent
ax.plot(df_sol['a'], df_sol['b'], df_sol['c'], 'o', ms=4, mew=0.5)
# exact solutions
c = np.linspace(0, 1, 100)
a = np.array([i - 1 for i in c])
b = np.array([1-2*i for i in c])
ax.plot(a, b, c, color='yellow', ms=4, mew=0.5, alpha=0.4)
#ax.set_xlim(-1,0)
#ax.set_ylim(-1,1)
#ax.set_zlim(0,1)
#ax.set_xlabel('a')
#ax.set_ylabel('b')
#ax.set_zlabel('c')
ax.view_init(elev=30, azim=45)
plt.show()
In the figure above, the blue circles show the regression coefficients obtained for different initial values, and the light yellow line shows the exact solutions (1.9). You can see how, as the initial value varies, the blue circles scatter along the yellow line.
One last thing. In fact, scikit-learn's LinearRegression does not use the Gradient Descent update (1.13). I'm not intimately familiar with scikit-learn's internals, but a look at the source code shows that it uses LSQR, an algorithm specialized for solving linear least-squares problems. LSQR also searches for the minimum iteratively, but the regression coefficients did not change when I touched the random seed, presumably because the initial value is fixed. Thinking about it this way, when you put data into a linear regression package and results come back, you may feel the job is done, but it is worth asking whether the answer is really meaningful. In this article, I would like to introduce the successors of Gradient Descent from a somewhat more general perspective, rather than as algorithms tied to a specific model.
Stochastic Gradient Descent (SGD) The Gradient Descent introduced in the previous chapter is a batch method: at each step, all of the data is processed at once. In the example above there were only two data samples, so this was no problem, but as the amount of data grows, batch processing becomes difficult. In some cases, such as online data streams, it may not even be possible to use all of the data at once. We therefore need a way to run Gradient Descent on data processed sequentially, and such methods are called Stochastic Gradient Descent (SGD). Depending on the size of the chunk processed at a time, the update may be applied one sample at a time or in units of a certain size (mini-batches). In general, the loss function $L$ has the form of a sum over data samples,
L=\sum_{n=1}^NL_n \tag{2.1}
so when processing one sample at a time, the update (1.12) is applied sequentially to each $L_n$. In the mini-batch case, the update is applied to the sum $\sum_{n \in \text{mini-batch}} L_n$ over the samples in the mini-batch. Incidentally, SGD also has benefits that Gradient Descent does not have. With Gradient Descent there is a good chance of getting stuck in a local minimum with no way out, whereas with SGD the changing data perturbs the search path, making it easier to escape local minima. That said, updating one sample at a time tends to fluctuate too much and hurts convergence, so mini-batch processing is usually preferred.
What follows does not depend on whether we process all of the data at once, one sample at a time, or in mini-batches, so I will simply write the loss function as $L$.
SGD is a very simple algorithm, but its performance suffers as the loss function becomes more complicated. Various improved algorithms have therefore been proposed, and I will introduce the typical ones below.
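Before moving on, here is a minimal sketch of mini-batch SGD for the squared-error loss above (the function name, batch size, and number of epochs are illustrative assumptions, not values taken from the text):
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_epochs=100, batch_size=1, seed=0):
    """Mini-batch SGD for L(w) = ||y - Xw||^2 (a sketch, not a library routine)."""
    rng = np.random.RandomState(seed)
    w = rng.uniform(-1.0, 1.0, size=(X.shape[1], 1))
    n_samples = X.shape[0]
    for _ in range(n_epochs):
        idx = rng.permutation(n_samples)              # shuffle the samples each epoch
        for start in range(0, n_samples, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = -2 * Xb.T.dot(yb - Xb.dot(w))      # gradient of the mini-batch loss
            w -= lr * grad                            # update (1.12) on the mini-batch
    return w

# usage with the toy data from (1.5); here each "mini-batch" is a single sample
X = np.array([[1., 1., 1.], [1., 2., 3.]])
y = np.array([[0.], [1.]])
print(sgd_linear_regression(X, y, n_epochs=1000).ravel())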
Momentum SGD For example, consider a loss function like the one shown in the figure below. Look first at the left panel.
[Figure: search trajectories on the loss surface, SGD (left panel) vs. Momentum SGD (right panel)]
Suppose the black curves are contour lines of the loss function and the purple star is the minimum. With the Gradient Descent update (1.12), the iterate is pushed along the local slope and starts to zigzag, making it hard to reach the star. But if you look closely, averaging over the back-and-forth motion would cancel the zigzag and leave a net forward propulsion (the blue arrow in the right panel). This is the idea behind Momentum SGD. Written as formulas,
\begin{align}
& m_t = \gamma m_{t-1} + \eta \frac{\partial L}{\partial w}(w_t) \tag{3.1}\\
& w_{t+1} = w_t - m_t \tag{3.2}
\end{align}
The initial value of $m_t$ is set to $m_0 = 0$. The quantity $m$ in (3.1), called the momentum term, is what cancels out the zigzag motion. If we define $g_t^{\eta} := \eta \frac{\partial L}{\partial w}(w_t)$, then (3.1) can be rewritten as
m_t=g_t^{\eta} + \gamma g_{t-1}^{\eta} + \gamma^2 g_{t-2}^{\eta} + \gamma^3 g_{t-3}^{\eta} + \cdots \tag{3.3}
so we can see that past gradient information is accumulated with a decay factor $\gamma$. Replacing $m_t$ with $g_t^\eta$ in (3.2) recovers plain SGD. $\gamma$ is typically set to around $0.9$.
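As a minimal sketch of the update (3.1)-(3.2) (the gradient function and hyperparameter defaults are illustrative assumptions):
import numpy as np

def momentum_sgd_step(w, m, grad_fn, lr=0.01, gamma=0.9):
    """One Momentum SGD update, following (3.1)-(3.2)."""
    m_new = gamma * m + lr * grad_fn(w)   # (3.1): accumulate gradients with decay gamma
    w_new = w - m_new                     # (3.2)
    return w_new, m_new

# usage with the linear-regression gradient from (1.13)
X = np.array([[1., 1., 1.], [1., 2., 3.]])
y = np.array([[0.], [1.]])
grad = lambda w: -2 * X.T.dot(y - X.dot(w))
w, m = np.zeros((3, 1)), np.zeros((3, 1))
for _ in range(1000):
    w, m = momentum_sgd_step(w, m, grad)
print(w.ravel())   # satisfies (1.8) up to numerical error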
Nesterov accelerated gradient (NAG) Nesterov accelerated gradient (NAG) is a slight modification of Momentum SGD. In the previous figure, the purple star was the destination. Even if the zigzag motion is suppressed, there is a worry about whether the iterate will stop properly at the destination. It would be nice if, just before arriving, we could anticipate the destination and apply the brakes. So the update formulas become
\begin{align}
& m_t = \gamma m_{t-1} + \eta \frac{\partial L}{\partial w}(w_t - m_{t-1}) \tag{3.4}\\
& w_{t+1} = w_t - m_t \tag{3.5}
\end{align}
Note that the gradient on the right-hand side of (3.4) is evaluated not at the current point $w_t$ but at the anticipated point $w_t - m_{t-1}$. The time step of $m$ there is $t-1$ because $m_t$ cannot be used while constructing $m_t$ itself; $w_t - m_{t-1}$ is only a "predicted" destination.
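A corresponding sketch of (3.4)-(3.5) (the function name and defaults are illustrative assumptions):
def nag_step(w, m, grad_fn, lr=0.01, gamma=0.9):
    """One Nesterov accelerated gradient update, following (3.4)-(3.5)."""
    lookahead = w - m                            # predicted next position
    m_new = gamma * m + lr * grad_fn(lookahead)  # (3.4): gradient at the look-ahead point
    w_new = w - m_new                            # (3.5)
    return w_new, m_new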
AdaGrad AdaGrad is based on SGD, but it assigns a different learning rate to each component of the parameter vector $w$, with a higher learning rate for components that have been updated less. Below, to simplify the notation, I define
g^t_i:=\frac{\partial L}{\partial w_i}(w^t) \tag{3.6}
and from here on I will write the time-step index $t$ as a superscript. The index $i$ labels the components of the vector $w$. AdaGrad's update algorithm is given by
\begin{align}
& v^t_i = v^{t-1}_i + (g^t_i)^2 \tag{3.7} \\
& w^{t+1}_i = w^t_i - \frac{\eta}{\sqrt{v^t_i} + \epsilon} g^t_i \tag{3.8}
\end{align}
The initial value of $v$ is $v_i^0 = 0$. $\eta$ is a manually chosen constant, again called the learning rate. $\epsilon$ is a small value that prevents division by $0$. From (3.7), $v^t_i$ can be written as
v^t_i = (g^t_i)^2 + (g^{t-1}_i)^2 + (g^{t-2}_i)^2 + \cdots \tag{3.9}
so $v^t_i$ stays small for parameters that are rarely updated, and for those parameters the effective learning rate $\eta / (\sqrt{v^t_i} + \epsilon)$ becomes large. Dropping the square root "$\sqrt{\cdot}$" on $v^t_i$ is known to hurt performance.
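A sketch of the AdaGrad step (3.7)-(3.8), with all operations applied element-wise (the defaults are illustrative assumptions):
import numpy as np

def adagrad_step(w, v, grad_fn, lr=0.01, eps=1e-8):
    """One AdaGrad update, following (3.7)-(3.8)."""
    g = grad_fn(w)
    v_new = v + g**2                              # (3.7): accumulate squared gradients
    w_new = w - lr / (np.sqrt(v_new) + eps) * g   # (3.8): per-component learning rate
    return w_new, v_new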
RMSprop In AdaGrad, as (3.9) shows, the effective learning rate keeps shrinking as updates are repeated, until the parameters stop being updated at all. RMSprop is designed to prevent this. Instead of accumulating all past information, it mixes the past accumulation and the latest information in the ratio $\gamma : 1-\gamma$. The update algorithm is:
\begin{align}
& v^t_i = \gamma v^{t-1}_i + (1-\gamma)(g^t_i)^2 \tag{3.10} \\
& w^{t+1}_i = w^t_i - \frac{\eta}{\sqrt{v^t_i}+\epsilon} g^t_i \tag{3.11}
\end{align}
The initial value of $v$ is $v_i^0 = 0$. Rewriting (3.10) gives
v^t_i=\gamma^t v^0_i + (1-\gamma)\sum_{l=1}^t \gamma^{t-l}(g^l_i)^2 \tag{3.12}
which confirms that past information is accumulated while decaying exponentially; this kind of averaging is called an exponential moving average.
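A sketch of the RMSprop step (3.10)-(3.11) (the defaults are illustrative assumptions):
import numpy as np

def rmsprop_step(w, v, grad_fn, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSprop update, following (3.10)-(3.11)."""
    g = grad_fn(w)
    v_new = gamma * v + (1 - gamma) * g**2        # (3.10): exponential moving average of g^2
    w_new = w - lr / (np.sqrt(v_new) + eps) * g   # (3.11)
    return w_new, v_new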
AdaDelta AdaDelta seems to have been proposed independently of RMSprop, but functionally it is a further refinement of it. The point is to lengthen the search step when the parameters move into an unexplored region. The update algorithm is:
\begin{align}
& v^t_i = \gamma v^{t-1}_i + (1-\gamma)(g^t_i)^2 \tag{3.13} \\
& s^t_i = \gamma s^{t-1}_i + (1-\gamma)(\Delta w^{t-1}_i)^2 \tag{3.14} \\
& \Delta w^t_i = -\frac{\sqrt{s^t_i + \epsilon}}{\sqrt{v^t_i + \epsilon}} g^t_i \tag{3.15} \\
& w^{t+1}_i = w^t_i + \Delta w^t_i \tag{3.16}
\end{align}
The initial values are $v_i^0 = 0$ and $s_i^0 = 0$. Notice that in (3.15), $s^t$ appears in place of $\eta$. When $w^t_i$ receives a large update it moves into unexplored territory, but then (3.14) also makes $s^t_i$ large, which in turn lengthens the search step. Occasionally you also see formulations that keep an explicit learning-rate parameter.
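A sketch of the AdaDelta step, following (3.13)-(3.16) as written above; note that the previous step $\Delta w^{t-1}$ has to be carried along as extra state (the defaults are illustrative assumptions):
import numpy as np

def adadelta_step(w, v, s, dw_prev, grad_fn, gamma=0.95, eps=1e-6):
    """One AdaDelta update, following (3.13)-(3.16); no explicit learning rate."""
    g = grad_fn(w)
    v_new = gamma * v + (1 - gamma) * g**2                  # (3.13)
    s_new = gamma * s + (1 - gamma) * dw_prev**2            # (3.14): uses the previous step
    dw = -np.sqrt(s_new + eps) / np.sqrt(v_new + eps) * g   # (3.15)
    w_new = w + dw                                          # (3.16)
    return w_new, v_new, s_new, dw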
Adam Adam looks like a combination of RMSprop and the momentum method. The update algorithm is given as follows:
\begin{align}
& m^t_i = \beta_1 m^{t-1}_i + (1-\beta_1)g^t_i \tag{3.17} \\
& v^t_i = \beta_2 v^{t-1}_i + (1-\beta_2)(g^t_i)^2 \tag{3.18} \\
& \hat{m}^t_i = \frac{m^t_i}{1-\beta_1^t} \tag{3.19} \\
& \hat{v}^t_i = \frac{v^t_i}{1-\beta_2^t} \tag{3.20} \\
& w^{t+1}_i = w^t_i - \frac{\eta}{\sqrt{\hat{v}^t_i}+\epsilon} \hat{m}^t_i \tag{3.21}
\end{align}
The initial values are $v_i^0 = 0$ and $m_i^0 = 0$. Rewriting (3.17) as
m^t_i=\beta_1^t m^0_i + (1-\beta_1)\sum_{k=1}^t \beta_1^{t-k}g^k_i \tag{3.22}
and substituting the initial value $m^0_i = 0$, we can compare the magnitudes of the two sides, writing the order of $g^k_i$ as $g^k_i \sim O[g]$:
O[m] \sim O[g](1-\beta_1) \sum_{k=1}^t \beta_1^{t-k} = O[g](1-\beta_1^t) \tag{3.23}
Since $m^t$ was introduced in the hope that it would behave as an "average" of $g^t$, it is awkward that a factor of $1-\beta_1^t$ appears. In particular, when $\beta_1^t \sim 1$ (that is, early in training), we get $O[m] \sim 0$. Equation (3.19) is the correction for exactly this, and the correction (3.20) exists for the same reason.
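A sketch of the Adam step (3.17)-(3.21), where t is the 1-based step count used in the bias corrections (the defaults are illustrative assumptions, close to commonly quoted values):
import numpy as np

def adam_step(w, m, v, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, following (3.17)-(3.21)."""
    g = grad_fn(w)
    m_new = beta1 * m + (1 - beta1) * g              # (3.17)
    v_new = beta2 * v + (1 - beta2) * g**2           # (3.18)
    m_hat = m_new / (1 - beta1**t)                   # (3.19): bias correction
    v_hat = v_new / (1 - beta2**t)                   # (3.20): bias correction
    w_new = w - lr / (np.sqrt(v_hat) + eps) * m_hat  # (3.21)
    return w_new, m_new, v_new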
Eve Eve is a method proposed around November 2016 as an extension of Adam. It has been pointed out that as loss functions become more complex, especially in deep learning, learning is more often hindered by flat regions such as saddle points and plateaus than by getting stuck in local minima. However, since SGD uses only first-derivative information about the loss function, there are situations it cannot handle. Eve therefore directly uses the change in the loss value that accompanies each parameter update: when that change is small, the iterate is likely trapped at a saddle point or on a plateau, so the search step is lengthened. The other steps are the same as Adam, and the update algorithm is as follows.
\begin{align}
& m^t_i = \beta_1 m^{t-1}_i + (1-\beta_1)g^t_i \tag{3.24} \\
& v^t_i = \beta_2 v^{t-1}_i + (1-\beta_2)(g^t_i)^2 \tag{3.25} \\
& \hat{m}^t_i = \frac{m^t_i}{1-\beta_1^t} \tag{3.26} \\
& \hat{v}^t_i = \frac{v^t_i}{1-\beta_2^t} \tag{3.27} \\
& r^t = (\text{feed-back from the loss function}) \tag{3.28} \\
& d^t = \beta_3 d^{t-1} + (1-\beta_3) r^t \tag{3.29} \\
& w^{t+1}_i = w^t_i - \frac{\eta}{d^t \sqrt{\hat{v}^t_i}+\epsilon} \hat{m}^t_i \tag{3.30}
\end{align}
The point is how to construct (3.28). Equation (3.29) takes an exponential moving average of the past feedback and the latest feedback, and the result enters the denominator of (3.30). The initial value of $d$ is $d^0 = 1$. Now, let's move on to (3.28). First, define $L^t := L(w^t)$ for the loss function $L(w)$. The basic idea is to use as $r^t$ the non-negative relative change of $L$, namely $\frac{L^{t}-L^{t-1}}{L^{t-1}}$ (when $L^t > L^{t-1}$) or $\frac{L^{t-1}-L^{t}}{L^{t}}$ (when $L^{t-1} > L^t$). However, the computation can become unstable when this value jumps, so a pair of constants $0 < k < K$ is fixed and $r^t$ is rounded so that $k < r^t < K$. According to the paper, though, this simple rounding does not work well, and Eve instead takes the following steps. First, relative to $L^0 = L(w^0)$, prepare the intervals $[\frac{L^0}{K+1}, \frac{L^0}{k+1}]$ (② in the figure) and $[(k+1)L^0, (K+1)L^0]$ (⑤ in the figure), and divide the axis into the regions ① to ⑥, with $L^0$ as one of the boundaries, as in the figure.
Next, suppose we obtain $L^1 = L(w^1)$. We clip $L^1$ around $L^0$ to get $\hat{L}^1$: if $L^1 \in$ ② or ⑤ then $\hat{L}^1 = L^1$; if $L^1 \in$ ① then $\hat{L}^1 = \frac{L^0}{K+1}$; if $L^1 \in$ ③ then $\hat{L}^1 = \frac{L^0}{k+1}$; if $L^1 \in$ ④ then $\hat{L}^1 = (k+1)L^0$; and if $L^1 \in$ ⑥ then $\hat{L}^1 = (K+1)L^0$. Using this $\hat{L}^1$, we set $r^1 = \frac{\hat{L}^{1}-L^{0}}{L^{0}}$ (when $\hat{L}^1 > L^0$) or $r^1 = \frac{L^{0}-\hat{L}^{1}}{\hat{L}^{1}}$ (when $L^{0} > \hat{L}^1$). By construction, this guarantees $k \le r^1 \le K$.
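A rough sketch of this feedback computation, based on my reading of the description above (the function name, the defaults k=0.1 and K=10, and the handling of later steps are assumptions; see the paper for the exact procedure):
def eve_feedback(L_curr, L_prev, k=0.1, K=10.0):
    """Clip the new loss value relative to the previous one and return r (a sketch)."""
    if L_curr >= L_prev:
        # clip into [(k+1)*L_prev, (K+1)*L_prev], i.e. region (5) in the text
        L_hat = min(max(L_curr, (k + 1.0) * L_prev), (K + 1.0) * L_prev)
        r = (L_hat - L_prev) / L_prev
    else:
        # clip into [L_prev/(K+1), L_prev/(k+1)], i.e. region (2) in the text
        L_hat = min(max(L_curr, L_prev / (K + 1.0)), L_prev / (k + 1.0))
        r = (L_prev - L_hat) / L_hat
    return r, L_hat   # k <= r <= K by construction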
Newton's Method Gradient Descent uses only up to the first derivative of the loss function, but of course there are also methods that use the second derivative. A typical one is Newton's method. Replacing $f \to \partial f / \partial x$ in (1.10) gives
\frac{\partial f}{\partial x}(x_0+\Delta x)\simeq \frac{\partial f}{\partial x}(x_0)+ \frac{\partial f}{\partial x \partial x}(x_0) \Delta x \tag{4.1}
Here $\partial f / \partial x \partial x$ is the matrix of second derivatives (the Hessian). What we want is the point where $\partial f / \partial x = 0$, so setting the right-hand side of (4.1) to $0$ and solving for $\Delta x$ gives
\Delta x = - \Big[\frac{\partial f}{\partial x \partial x}(x_0)\Big]^{-1}\frac{\partial f}{\partial x}(x_0) \tag{4.2}
Based on this formula, the update algorithm is
x_{t+1} = x_t -\Big[\frac{\partial f}{\partial x \partial x}(x_t)\Big]^{-1}\frac{\partial f}{\partial x}(x_t) \tag{4.3}
This is called Newton's method. Comparing it with the Gradient Descent formula (1.12), we see that the learning rate $\eta$ has been replaced by the inverse of the Hessian. Geometrically, the Hessian represents "curvature": it is nearly $0$ where the surface is flat, and large where the contour lines are crowded. Since (4.3) uses the inverse of the Hessian, it has the effect of lengthening the search step in flat regions. Keep in mind, however, that Newton's method gets trapped at saddle points. At a saddle point the Hessian has both positive and negative eigenvalues; along a direction with a positive eigenvalue the saddle point looks like a minimum, and along one with a negative eigenvalue it looks like a maximum. In the positive-eigenvalue directions, the iterate is pulled toward the saddle point just as with Gradient Descent. In the negative-eigenvalue directions, the sign of the gradient term in (1.12) is effectively flipped, so the algorithm ends up seeking a maximum rather than a minimum, and again it is drawn into the saddle point. Methods have been proposed to improve on this, but since computing the Hessian is hard for large-scale models, Newton's method is not used much in machine learning. Various techniques have also been developed for computing or approximating the Hessian.
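A minimal sketch of the Newton update (4.3) (the test function, its derivatives, and the starting point are illustrative assumptions):
import numpy as np

def newton_step(x, grad_fn, hess_fn):
    """One Newton update, following (4.3): x <- x - H(x)^{-1} grad(x)."""
    return x - np.linalg.solve(hess_fn(x), grad_fn(x))

# usage on the quadratic bowl f(x) = x0^2 + 10*x1^2: one step reaches the minimum [0, 0]
grad = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
x0 = np.array([3.0, -1.0])
print(newton_step(x0, grad, hess))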
References
・Gradient descent methods: "An overview of gradient descent optimization algorithms"; a Theano implementation example: "Try various optimization algorithms with a neural network"
・LSQR: the original paper; the conjugate gradient method
・Eve: the paper "Improving Stochastic Gradient Descent with Feedback"