In this article I summarize stochastic gradient descent methods, the optimization methods used in deep learning. The formulas follow the original papers; for initial values and constants that the papers do not specify, I adopted the choices made by Pytorch, Keras, Chainer, and other libraries. The formulas are written with implementation in mind, so please take a look if you want to implement them yourself.
Please feel free to point out any mistakes. Also, if you have information on new optimization methods, or are developing one yourself, please let me know!
- Introduction
- [What is the steepest descent method](#what-is-the-steepest-descent-method)
- [What is stochastic gradient descent](#what-is-stochastic-gradient-descent)
First, as background, I will explain the difference between the steepest descent method and the stochastic gradient descent method. It's not a big deal.
The steepest descent method descends in the direction that **reduces the loss function the most (the direction in which the gradient is largest)**. It is also called gradient descent.
The chain rule is briefly explained in here.
The **update rule** is the heart of gradient descent methods; performance is determined by this part. Unfortunately, there is no single best gradient descent method for all problems (at least for now). At a certain time $t$, the parameter update formula is
w_{t+1} = w_t - \eta \nabla_{w_t} \mathcal{L}(w_t)
When this is written out for each parameter element, it becomes
w_{t+1}^{(i)} = w_t^{(i)} - \eta \nabla_{w_{t}^{(i)}} \mathcal{L}(w_t)
and so on (of course, the parameters are usually higher-dimensional). The symbols are:

- $w_t$: the parameters at time $t$
- $\eta$: the learning rate
- $\mathcal{L}(w_t)$: the loss function

Parameter updates according to this formula are repeated until the loss function hardly changes, at which point learning is considered complete. Note that this is **not** "until the value of the loss function is small enough". For example, learning may fail to converge because the model has too few parameters, so training runs **"until the amount of change is sufficiently small"**.
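As a minimal sketch of this stopping criterion (the toy loss, learning rate, and tolerance here are my own illustrative choices, not from any paper):

```python
def f(x):
    return x**2  # a toy loss function

def df(x):
    return 2*x   # its gradient

eta = 0.1  # learning rate
w = 3.0    # initial parameter value
for t in range(10000):
    dw = -eta*df(w)     # steepest-descent update
    w += dw
    if abs(dw) < 1e-8:  # stop when the *change* is small enough,
        break           # not when the loss itself is small
print(t, w)  # converges near the minimum at x = 0
```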
Before discussing stochastic gradient descent, let's first introduce the drawbacks of steepest descent.
The drawback of the steepest descent method is that it **updates the parameters using all of the input data at once**. Because of this, **once it falls into a local minimum it cannot get out**. Details are available at here, so I will omit them, but in short, learning on all the data at once makes it easy to fall into such a situation. Moreover, when the parameter dimension is 2 or more, an even more troublesome point called a **saddle point** appears: "a point that is a maximum with respect to one parameter but a minimum with respect to another", which is very annoying. It is known that the number of saddle points grows exponentially as the parameter dimension increases. This is one face of the **curse of dimensionality**: the gradient at a saddle point is still $0$, so learning does not proceed there. Worse, as the dimension grows, saddle points become far more likely than minima and maxima, so the landscape is no different from one with "pitfalls" everywhere. And since the parameter dimension is basically the total number of weights $w$ and biases $b$, it quickly becomes a ridiculous number.
In any case, learning rules are a delicate and complicated problem. Let's get back to the topic. Since the drawback of the steepest descent method is that it uses all of the input data, it seems it could be solved by using data points one at a time. The method of updating the parameters using one input data point at a time is called **stochastic gradient descent**. A learning method that updates the parameters sequentially for each input data point like this is called **online learning**. Since the parameters are updated for each input, a more accurate model can be expected. By the way, the "stochastic" part refers to using random numbers to draw each data point from the full input data.
However, stochastic gradient descent has the disadvantage that it **cannot be parallelized**. It would be a waste not to benefit from modern parallel hardware such as GPUs. As a result, a method called **mini-batch learning** was devised. The idea is simple: update the parameters using a fixed number of input data points at a time (typically $16$ or $32$). This can also be parallelized.
However, mini-batch learning has the drawback of **"oscillating in sharp valleys"**. It seems the only way to overcome this is to leave it to the learning rules themselves.
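To make the variants concrete, here is a rough sketch of how mini-batches are typically drawn (the array `X`, the batch size, and the placeholder gradient step are assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))  # dummy input data
batch_size = 32                     # batch_size=1 would be online learning,
                                    # batch_size=len(X) the steepest descent

indices = rng.permutation(len(X))   # random draws are the "stochastic" part
for start in range(0, len(X), batch_size):
    batch = X[indices[start:start+batch_size]]
    # ...compute the gradient on `batch` only and update the parameters...
```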
SGD First up is the most basic gradient descent method. SGD stands for Stochastic Gradient Descent. In other words, it is exactly the learning rule described earlier, and as a formula it is the same as the steepest descent method.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= - \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
In the implementation, $g_t$ is the error (gradient) that flows in via the chain rule (TensorFlow would obtain it by automatic differentiation), so you don't need to compute $g_t$ yourself. $\Delta w_t$ is the update amount for this time step: simply the gradient $g_t$ multiplied by the learning rate $\eta$. The minus sign in $\Delta w_t = -\eta g_t$ is only there so that the last equation can be an addition; there is no deeper meaning, and if you instead write $w_{t+1} = w_t - \Delta w_t$, then $\Delta w_t = \eta g_t$.
The hyperparameter and its default value (taken from the implementation below) are:

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
optimizers.py
class SGD(Optimizer):
def __init__(self, eta=1e-2, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
def update(self, grad_w, grad_b, *args, **kwds):
dw = -self.eta*grad_w
db = -self.eta*grad_b
return dw, db
MomentumSGD Plain SGD converges slowly and is prone to problems such as oscillation and saddle points, so it needs improvement. One way is to add **momentum**. Momentum is, in short, inertia. Oscillation is suppressed by using information from the previous gradient.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= \mu \Delta w_{t-1} - (1 - \mu) \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
The notation varies quite a bit between sources, depending on where $\eta$ is multiplied and whether the second term on the right-hand side of the second equation carries the factor $1-\mu$. A quick look also turned up
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= \mu \Delta w_{t-1} - \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Or
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= \mu \Delta w_{t-1} - g_t \\
w_{t+1} &= w_t + \eta \Delta w_t
\end{align}
and so on. This article implements the first form. Don't worry about these notational differences: they can all be turned into the same expression by rescaling the hyperparameters $\mu$ and $\eta$ appropriately.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\mu$ | mu | 0.9 |
$\Delta w_0, \Delta b_0$ | dw, db | 0 |
optimizers.py
class MSGD(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Hold the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, *args, **kwds):
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
NAG NAG (Nesterov's Accelerated Gradient) modifies MSGD to speed up convergence. By computing the gradient using the previous **update amount**, it roughly estimates the position one time step ahead.
**The formula in [this article](https://qiita.com/ZoneTsuyoshi/items/8ef6fa1e154d176e25b8#%E3%83%8D%E3%82%B9%E3%83%86%E3%83%AD%E3%83%95%E3%81%AE%E5%8A%A0%E9%80%9F%E6%B3%95nag-1983) may be wrong.** The correct one is probably the following.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t+\mu \Delta w_{t-1}) \\
\Delta w_t &= \mu \Delta w_{t-1} - (1-\mu)\eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Besides this, [here](https://qiita.com/omiita/items/1735c1d048fe5f611f80#54-%E3%83%A2%E3%83%BC%E3%83%A1%E3%83%B3%E3%82%BF%E3%83%A0%E3%81%AE%E6%94%B9%E8%89%AF%E7%89%88-nag) gives
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t - \mu \Delta w_{t-1}) \\
\Delta w_t &= \mu \Delta w_{t-1} + (1-\mu)g_t \\
w_t &= w_{t-1} - \eta \Delta w_t
\end{align}
(the notation has been matched to this article's). They do the same thing; I actually implemented and tried both, and they behaved the same, so either should be fine.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\mu$ | mu | 0.9 |
$\Delta w_0, \Delta b_0$ | dw, db | 0 |
optimizers.py
class NAG(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Holds the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, w=0, b=0, df=None, *args, **kwds):
grad_w = df(w + self.mu*self.dw)
grad_b = df(b + self.mu*self.db)
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
Since the gradient that has already flowed in through the chain rule would have to be recomputed at the shifted point, NAG seems difficult to use in deep learning as-is... I wonder how people usually do it...
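For what it's worth, on a simple problem where the gradient function is available directly, the NAG class above can be driven like this (the toy gradient and all constants are mine, not from any paper):

```python
def df(x):
    return 2*x  # gradient of the toy loss x**2

opt = NAG(eta=1e-2, mu=0.9)  # the class defined above
w = b = 3.0
for t in range(300):
    # NAG re-evaluates df at the look-ahead point w + mu*dw, which is
    # exactly the part that is awkward with backpropagated gradients.
    dw, db = opt.update(df(w), df(b), w=w, b=b, df=df)
    w += dw
    b += db
print(w, b)  # both approach the minimum at 0
```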
AdaGrad Since MSGD does not use information about the direction of convergence, it has the problem that convergence is slow in directions where the gradient is gentle. AdaGrad is the pioneer of the approach that solves this by adjusting the learning rate for each dimension.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
\Delta w_t &= - \cfrac{\eta}{\displaystyle\sqrt{\sum_{\tau=1}^{t}{g_{\tau}^2}}} g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, the sum of squares in the second equation is kept as a running total,
\begin{align}
gw_t = gw_{t-1} + g_t^2
\end{align}
which saves memory and computation.
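A quick check of this identity (the dummy gradient history is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.standard_normal(100)  # a dummy history of gradients g_1..g_100

gw = 0.0
for g in grads:
    gw += g*g  # incremental accumulation, as in the implementation below
assert np.isclose(gw, np.sum(grads**2))  # equals the full sum of squares
```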
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-3 |
$gw_0, gb_0$ | gw, gb | 0 |
optimizers.py
class AdaGrad(Optimizer):
def __init__(self, eta=1e-3, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
#Hold the value of the previous step
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.gw += grad_w*grad_w
self.gb += grad_b*grad_b
dw = -self.eta*grad_w/np.sqrt(self.gw)
db = -self.eta*grad_b/np.sqrt(self.gb)
return dw, db
RMSprop AdaGrad has the problem that after a steep slope followed by a gentle one, it barely descends into any valley that comes after. The reason is that the squared gradients keep accumulating, so the learning rate keeps shrinking. RMSprop improves on this: by "forgetting" the influence of old gradients through the coefficient $\rho$, it resolves the ever-decreasing learning rate.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Delta w_t &= - \cfrac{\eta}{\sqrt{v_t + \varepsilon}}g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, the second equation is transformed as follows:
\begin{align}
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= (v_{t-1} - v_{t-1}) + \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} - (1-\rho)v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\rho$ | rho | 0.99 |
$\varepsilon$ | eps | 1e-8 |
$v_0$ | vw, vb | 0 |
optimizers.py
class RMSprop(Optimizer):
def __init__(self, eta=1e-2, rho=0.99, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw+self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb+self.eps)
return dw, db
AdaDelta RMSprop is actually physically inconsistent: simply put, the dimensions (units) of the two sides of the update do not match. I'd like to say that this doesn't matter since we're not solving a physics problem... but it does cause real trouble. In the end, the hyperparameter $\eta$ has to be fitted to each problem. For example,
\begin{align}
f(x) &= x^2 \\
g(x) &= 10x^2
\end{align}
Consider fitting each of these two functions. With a learning rule whose dimensions do not match, the learning rate must be adjusted separately for each function; in other words, the hyperparameters must be re-tuned. This is a tedious and difficult task, and I can't imagine how much time would be wasted doing it for a large-scale experiment that takes several days per run. So the dimensional mismatch is quite a problem, and AdaDelta is what solves it.
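To see the problem numerically, here is one gradient step on both functions with the same learning rate (the values are illustrative choices of mine):

```python
eta, x0 = 0.15, 1.0
x_f = x0 - eta*2*x0   # f'(x) = 2x  : x goes 1.0 -> 0.7, converging
x_g = x0 - eta*20*x0  # g'(x) = 20x : x goes 1.0 -> -2.0, overshooting
print(x_f, x_g)       # the same eta must be re-tuned for g
```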
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
v_t &= \rho v_{t-1} + (1-\rho) g_t^2 \\
\Delta w_t &= - \cfrac{\sqrt{u_{t-1} + \varepsilon}}{\sqrt{v_t + \varepsilon}} g_t \\
u_t &= \rho u_{t-1} + (1-\rho) (\Delta w_t)^2 \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 4 are transformed as follows:
\begin{align}
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho)(g_t^2 - v_{t-1}) \\
u_t &= \rho u_{t-1} + (1-\rho) (\Delta w_t)^2 \\
\Leftrightarrow
u_t &= u_{t-1} + (1-\rho) \left\{ ( \Delta w_t)^2 - u_{t-1} \right\}
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\rho$ | rho | 0.95 |
$\varepsilon$ | eps | 1e-6 |
$v_0$ | vw, vb | 0 |
$u_0$ | uw, ub | 0 |
optimizers.py
class AdaDelta(Optimizer):
def __init__(self, rho=0.95, eps=1e-6, *args, **kwds):
super().__init__(*args, **kwds)
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -grad_w*np.sqrt(self.uw+self.eps)/np.sqrt(self.vw+self.eps)
db = -grad_b*np.sqrt(self.ub+self.eps)/np.sqrt(self.vb+self.eps)
self.uw += (1-self.rho)*(dw**2 - self.uw)
self.ub += (1-self.rho)*(db**2 - self.ub)
return dw, db
Adam Needless to say, this is the famous Adam. Various improvements have been proposed since, but it still seems to be going strong; its strength is, after all, its high versatility. That said, it is still an improved version of RMSprop, so strictly speaking its dimensions do not match either. Even so, compared to other learning rules, the dimensional problem is thought to be alleviated by its characteristic of **forgetting gradient information**.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
m_t &= \beta_1 m_{t-1} + (1- \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1- \beta_2) g_t^2 \\
\alpha_t &= \alpha_0 \cfrac{\sqrt{1-\beta_2^t}}{1- \beta_1^t} \\
\Delta w_t &= - \alpha_t \cfrac{m_t}{\sqrt{v_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
The above formula uses the improved update found in the text of the Original Paper. Written as plain formulas without that rearrangement, it is
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
m_t &= \beta_1 m_{t-1} + (1- \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1- \beta_2) g_t^2 \\
\hat{m}_t &= \cfrac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \cfrac{v_t}{1-\beta_2^t} \\
\Delta w_t &= - \alpha_0 \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha_0$ | alpha | 1e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class Adam(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
alpha_t = self.alpha*np.sqrt(1-self.beta2**t)/(1-self.beta1**t)
dw = -alpha_t*self.mw/(np.sqrt(self.vw+self.eps))
db = -alpha_t*self.mb/(np.sqrt(self.vb+self.eps))
return dw, db
RMSpropGraves This is an improved version of RMSprop devised by Graves. It is used in fields such as handwritten character recognition.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
m_t &= \rho m_{t-1} + (1-\rho) g_t \\
v_t &= \rho v_{t-1} + (1-\rho) g_t^2 \\
\Delta w_t &= - \eta \cfrac{g_t}{\sqrt{v_t - m_t^2 + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \rho m_{t-1} + (1-\rho)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\rho)(g_t - m_{t-1}) \\
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-4 |
$\rho$ | rho | 0.95 |
$\varepsilon$ | eps | 1e-4 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class RMSpropGraves(Optimizer):
def __init__(self, eta=1e-4, rho=0.95, eps=1e-4, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self,grad_w, grad_b, *args, **kwds):
self.mw += (1-self.rho)*(grad_w - self.mw)
self.mb += (1-self.rho)*(grad_b - self.mb)
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw - self.mw**2 + self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb - self.mb**2 + self.eps)
return dw, db
SMORMS3 This is also an improvement of RMSprop. It is said to overcome the dimensional problem using the curvature and LeCun's method. I'm not sure exactly what "LeCun's method" refers to, but LeCun has proposed techniques for backpropagation and announced the LeNet model.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
s_t &= 1 + (1-\zeta_{t-1} s_{t-1}) \\
\rho_t &= \cfrac{1}{s_t + 1} \\
m_t &= \rho_t m_{t-1} + (1-\rho_t) g_t \\
v_t &= \rho_t v_{t-1} + (1-\rho_t) g_t^2 \\
\zeta_t &= \cfrac{m_t^2}{v_t + \varepsilon} \\
\Delta w_t &= - \cfrac{\textrm{min}\{ \eta, \zeta_t \}}{\sqrt{v_t + \varepsilon}} g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 4 and 5 are transformed as follows:
\begin{align}
m_t &= \rho_t m_{t-1} + (1-\rho_t)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\rho_t)(g_t - m_{t-1}) \\
v_t &= \rho_t v_{t-1} + (1-\rho_t)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho_t)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-3 |
$\varepsilon$ | eps | 1e-8 |
$\zeta_0$ | zetaw, zetab | 0 |
$s_0$ | sw, sb | 0 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class SMORMS3(Optimizer):
def __init__(self, eta=1e-3, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.eps = eps
#Hold the value of the previous step
self.zetaw = 0
self.zetab = 0
self.sw = 0
self.sb = 0
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.sw = 1 + (1 - self.zetaw*self.sw)
self.sb = 1 + (1 - self.zetab*self.sb)
rhow = 1/(1+self.sw)
rhob = 1/(1+self.sb)
self.mw += (1-rhow)*(grad_w - self.mw)
self.mb += (1-rhob)*(grad_b - self.mb)
self.vw += (1-rhow)*(grad_w**2 - self.vw)
self.vb += (1-rhob)*(grad_b**2 - self.vb)
self.zetaw = self.mw**2 / (self.vw + self.eps)
        self.zetab = self.mb**2 / (self.vb + self.eps)
dw = -grad_w*(np.minimum(self.eta, self.zetaw)
/np.sqrt(self.vw + self.eps))
db = -grad_b*(np.minimum(self.eta, self.zetab)
/np.sqrt(self.vb + self.eps))
return dw, db
AdaMax It's a dimensionless version of Adam.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
u_t &= \textrm{max}\left\{ \beta_2 u_{t-1}, |g_t| \right\} \\
\Delta w_t &= - \cfrac{\alpha}{1-\beta_1^t}\cfrac{m_t}{u_t} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, the second equation is transformed as follows:

\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 2e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$m_0$ | mw, mb | 0 |
$u_0$ | uw, ub | 0 |
optimizers.py
class AdaMax(Optimizer):
def __init__(self, alpha=2e-3, beta1=0.9, beta2=0.999,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.uw = np.maximum(self.beta2*self.uw, np.abs(grad_w))
self.ub = np.maximum(self.beta2*self.ub, np.abs(grad_b))
alpha_t = self.alpha/(1 - self.beta1**t)
dw = -alpha_t*self.mw/self.uw
db = -alpha_t*self.mb/self.ub
return dw, db
Nadam Nadam incorporates Nesterov's accelerated gradient (NAG) into Adam.

\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \mu m_{t-1} + (1-\mu) g_t \\
v_t &= \nu v_{t-1} + (1-\nu) g_t^2 \\
\hat{m}_t &= \cfrac{\mu}{1 - \mu^{t+1}} m_t + \cfrac{1-\mu}{1-\mu^t} g_t \\
\hat{v}_t &= \cfrac{\nu}{1-\nu^t} v_t \\
\Delta w_t &= - \alpha \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Actually, $\alpha$ and $\mu$ are time-step-dependent parameters; strictly speaking, the method is expressed as
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \mu_t m_{t-1} + (1-\mu_t) g_t \\
v_t &= \nu v_{t-1} + (1-\nu) g_t^2 \\
\hat{m}_t &= \cfrac{\mu_t}{1 - \displaystyle\prod_{\tau=1}^{t+1}{\mu_{\tau}}} m_t + \cfrac{1-\mu_t}{1-\displaystyle\prod_{\tau=1}^{t}{\mu_{\tau}}} g_t \\
\hat{v}_t &= \cfrac{\nu}{1-\nu^t} v_t \\
\Delta w_t &= - \alpha_t \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
However, they seem to be held fixed in most cases, so I fixed them here for the time being.
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \mu m_{t-1} + (1-\mu)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\mu)(g_t - m_{t-1}) \\
v_t &= \nu v_{t-1} + (1-\nu)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\nu)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 2e-3 |
$\mu$ | mu | 0.975 |
$\nu$ | nu | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class Nadam(Optimizer):
def __init__(self, alpha=2e-3, mu=0.975, nu=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.mu = mu
self.nu = nu
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.mu)*(grad_w - self.mw)
self.mb += (1-self.mu)*(grad_b - self.mb)
self.vw += (1-self.nu)*(grad_w**2 - self.vw)
self.vb += (1-self.nu)*(grad_b**2 - self.vb)
mhatw = (self.mu*self.mw/(1-self.mu**(t+1))
+ (1-self.mu)*grad_w/(1-self.mu**t))
mhatb = (self.mu*self.mb/(1-self.mu**(t+1))
+ (1-self.mu)*grad_b/(1-self.mu**t))
vhatw = self.nu*self.vw/(1-self.nu**t)
vhatb = self.nu*self.vb/(1-self.nu**t)
dw = -self.alpha*mhatw/np.sqrt(vhatw + self.eps)
db = -self.alpha*mhatb/np.sqrt(vhatb + self.eps)
return dw, db
Eve This is an improved version of Adam that speeds up learning in flat regions (regions where the gradient is almost zero). The names are fashionable, aren't they? Adam and Eve.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{m}_t &= \cfrac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \cfrac{v_t}{1-\beta_2^t} \\
\textrm{if} \quad &t \gt 1 \\
& d_t = \cfrac{\left| f_t - f_{t-1} \right|}{\textrm{min} \left\{ f_t, f_{t-1} \right\} - f^{\star}} \\
& \hat{d}_t = \textrm{clip}\left( d_t, \left[ \frac{1}{c}, c \right] \right) \\
& \tilde{d}_t = \beta_3 \tilde{d}_{t-1} + (1-\beta_3)\hat{d}_t \\
\textrm{else} \\
& \tilde{d}_t = 1 \\
\textrm{end} &\textrm{if} \\
\Delta w_t &= - \cfrac{\alpha}{\tilde{d}_t} \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
It's quite involved... so a few notes on the symbols in the formula. $f_t$ is the value of the objective function $f$ at time step $t$, and $f^{\star}$ is the minimum value of the objective function; according to the original paper, $f^{\star} = 0$ is sufficient when the squared error or cross-entropy error is used as the loss. $\textrm{clip}$ is a function that, given a minimum and a maximum, rounds values below/above them up/down to those bounds.
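In NumPy the clip function is simply `np.clip`; for Eve's $\hat{d}_t = \textrm{clip}(d_t, [1/c, c])$ with the default $c = 10$ (the sample values of $d_t$ below are made up):

```python
import numpy as np

c = 10
d = np.array([0.01, 0.5, 3.0, 250.0])
print(np.clip(d, 1/c, c))  # -> [ 0.1  0.5  3.  10. ]
```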
For the implementation, equations 2 and 3 are transformed as follows:

\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 1e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\beta_3$ | beta3 | 0.999 |
$c$ | c | 10 |
$\varepsilon$ | eps | 1e-8 |
$f^{\star}$ | fstar | 0 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
$f_0$ | fw, fb | np.inf |
$\tilde{d}_0$ | dtilde_w, dtilde_b | 0 |
optimizers.py
class Eve(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, beta3=0.999,
c=10, eps=1e-8, fstar=0, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
        self.beta2 = beta2
self.beta3 = beta3
self.c = c
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.fw = np.inf
self.fb = np.inf
self.fstar = fstar
self.dtilde_w = 0
self.dtilde_b = 0
def update(self, grad_w, grad_b, t=1, fw=1, fb=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
mhatw = self.mw/(1 - self.beta1**t)
mhatb = self.mb/(1 - self.beta1**t)
vhatw = self.vw/(1 - self.beta2**t)
vhatb = self.vb/(1 - self.beta2**t)
if t > 1:
d_w = (np.abs(fw-self.fstar)
/(np.minimum(fw, self.fw) - self.fstar))
d_b = (np.abs(fb-self.fstar)
/(np.minimum(fb, self.fb) - self.fstar))
dhat_w = np.clip(d_w, 1/self.c, self.c)
dhat_b = np.clip(d_b, 1/self.c, self.c)
self.dtilde_w += (1 - self.beta3)*(dhat_w - self.dtilde_w)
self.dtilde_b += (1 - self.beta3)*(dhat_b - self.dtilde_b)
else:
self.dtilde_w = 1
self.dtilde_b = 1
self.fw = fw
self.fb = fb
dw = -(self.alpha*mhatw
/(self.dtilde_w*(np.sqrt(vhatw) + self.eps)))
db = -(self.alpha*mhatb
/(self.dtilde_b*(np.sqrt(vhatb) + self.eps)))
return dw, db
Santa-E Santa incorporates Markov chain Monte Carlo (MCMC) methods into Adam and RMSprop; among its variants, Santa-E seems to use the Euler scheme. I read through the Original paper when implementing it, but honestly I don't understand it at all... lol. Still, even without understanding, you can write the code if you follow the formulas! So I did my best. One problem was that the original paper does not give the initial values... I took a peek at the implementations in Pytorch, Keras, Chainer, etc. (they were all inconsistent...), and in my own implementation I use only the hyperparameters that seem necessary.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
v_t &= \sigma v_{t-1} + \cfrac{1-\sigma}{N^2} g_t^2 \\
\mathcal{G}_t &= \cfrac{1}{\sqrt{\lambda + \sqrt{v_t}}} \\
\textrm{if} \quad &t \lt burnin \\
& \alpha_t = \alpha_{t-1} + (u_{t-1}^2 - \cfrac{\eta}{\beta_t}) \\
& u_t = \cfrac{\eta}{\beta_t}\cfrac{1-\frac{\mathcal{G}_{t-1}}{\mathcal{G}_t}}{u_{t-1}} + \sqrt{\cfrac{2\eta}{\beta_t}\mathcal{G}_{t-1}} \otimes \zeta_t \\
\textrm{else} \\
& \alpha_t = \alpha_{t-1} \\
& u_t = 0 \\
\textrm{end} &\textrm{if} \\
u_t &= u_t + (1-\alpha_t) \otimes u_{t-1} - \eta \mathcal{G}_t \otimes g_t \\
\Delta w_t &= \mathcal{G}_t \otimes u_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
The element-wise product $\otimes$ appears a lot, but there is nothing to worry about: in a NumPy implementation it is the ordinary `*`; it is matrix products that would need special care.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\sigma$ | sigma | 0.95 |
$\lambda$ | lambda_ | 1e-8 |
burnin | burnin | 100 |
$C$ | C | 5 |
$\alpha_0$ | alpha_w, alpha_b | $\sqrt{\eta} C$ |
$\zeta_t$ | (random numbers generated directly, not stored in variables) | $\mathcal{N}(0, 1)$ |
$v_0$ | vw, vb | 0 |
$\mathcal{G}_0$ | gw, gb | 0 |
$u_0$ | uw, ub | $\sqrt{\eta} \mathcal{N}(0, 1)$ |
$\mathcal{N}(0, 1)$ denotes a random number drawn from the standard normal distribution.
Also, $\beta_t$ is determined by `anne_func` and `anne_rate` in the code. The original paper does not specify it; the only condition is that $\beta_t \to \infty$ as $t \to \infty$. Famous machine learning libraries such as Keras, [Pytorch](https://github.com/ReyhaneAskari/santa/blob/master/santa.py), and Chainer apparently compute $\beta_t = \sqrt{t}$.
But is anything really fine as long as the condition is met?
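That $\beta_t = \sqrt{t}$ schedule corresponds to the default `anne_func`/`anne_rate` of the SantaE class defined below; any schedule with $\beta_t \to \infty$ can be swapped in (the alternative here is a made-up example of mine, assuming the class below is in scope):

```python
import numpy as np

# default schedule: beta_t = t**0.5 = sqrt(t)
opt = SantaE(anne_func=lambda t, n: t**n, anne_rate=0.5)
# a made-up alternative that also tends to infinity as t grows:
opt_alt = SantaE(anne_func=lambda t, n: (1 + np.log(t))*t**n, anne_rate=0.3)
```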
optimizers.py
class SantaE(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += self.uw*self.uw - self.eta/beta
self.alpha_b += self.ub*self.ub - self.eta/beta
uw = (self.eta/beta * (1 - self.gw/gw)/self.uw
+ np.sqrt(2*self.eta/beta * self.gw)
* np.random.randn(*shape_w))
ub = (self.eta/beta * (1 - self.gb/gb)/self.ub
+ np.sqrt(2*self.eta/beta * self.gb)
* np.random.randn(*shape_b))
else:
# Refinement.
uw = 0
ub = 0
uw += (1 - self.alpha_w)*self.uw - self.eta*gw*grad_w
ub += (1 - self.alpha_b)*self.ub - self.eta*gb*grad_b
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw
db = gb*ub
return dw, db
Santa-SSS This is the other algorithm described in the same paper. I don't understand it at all either, but I can implement it. What is Riemannian information geometry?
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
v_t &= \sigma v_{t-1} + \cfrac{1-\sigma}{N^2} g_t^2 \\
\mathcal{G}_t &= \cfrac{1}{\sqrt{\lambda + \sqrt{v_t}}} \\
\Delta w_t &= \mathcal{G}_t \otimes u_{t-1} \times 0.5 \\
\textrm{if} \quad &t \lt burnin \\
& \alpha_t = \alpha_{t-1} + (u_{t-1}^2 - \cfrac{\eta}{\beta_t}) \times 0.5 \\
& u_t = e^{- 0.5\alpha_t} \otimes u_{t-1} \\
& u_t = u_t - \mathcal{G}_t \otimes g_t \eta + \sqrt{2\mathcal{G}_{t-1} \cfrac{\eta}{\beta_t}} \otimes \zeta_t + \cfrac{\eta}{\beta_t}\cfrac{1 - \frac{\mathcal{G}_{t-1}}{\mathcal{G}_t}}{u_{t-1}} \\
& u_t = e^{-0.5\alpha_t} \otimes u_t \\
& \alpha_t = \alpha_t + \left( u_t^2 - \cfrac{\eta}{\beta_t} \right) \times 0.5 \\
\textrm{else} \\
& \alpha_t = \alpha_{t-1} \\
& u_t = e^{-0.5\alpha_t} \otimes u_{t-1} \\
& u_t = u_t - \mathcal{G}_t \otimes g_t \eta \\
& u_t = e^{-0.5\alpha_t} \otimes u_t \\
\textrm{end} &\textrm{if} \\
\Delta w_t &= \Delta w_t + \mathcal{G}_t \otimes u_t \times 0.5 \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\sigma$ | sigma | 0.95 |
$\lambda$ | lambda_ | 1e-8 |
burnin | burnin | 100 |
$C$ | C | 5 |
$\alpha_0$ | alpha_w, alpha_b | $\sqrt{\eta} C$ |
$\zeta_t$ | (random numbers generated directly, not stored in variables) | $\mathcal{N}(0, 1)$ |
$v_0$ | vw, vb | 0 |
$\mathcal{G}_0$ | gw, gb | 0 |
$u_0$ | uw, ub | $\sqrt{\eta} \mathcal{N}(0, 1)$ |
optimizers.py
class SantaSSS(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
dw = 0.5*gw*self.uw
db = 0.5*gb*self.ub
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += (self.uw*self.uw - self.eta/beta)*0.5
self.alpha_b += (self.ub*self.ub - self.eta/beta)*0.5
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw += (-gw*grad_w*self.eta
+ np.sqrt(2*self.gw*self.eta/beta)
* np.random.randn(*shape_w)
+ self.eta/beta*(1-self.gw/gw)/self.uw)
ub += (-gb*grad_b*self.eta
+ np.sqrt(2*self.gb*self.eta/beta)
* np.random.randn(*shape_b)
+ self.eta/beta*(1-self.gb/gb)/self.ub)
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
self.alpha_w += (uw*uw - self.eta/beta)*0.5
self.alpha_b += (ub*ub - self.eta/beta)*0.5
else:
# Refinement.
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw -= gw*grad_w*self.eta
ub -= gb*grad_b*self.eta
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw*0.5
db = gb*ub*0.5
return dw, db
GD by GD I read the paper... but I don't understand it at all, so I'll skip it...
AdaSecant I read the paper... but I don't understand the moving-average part, so I'll skip this one too...
AMSGrad RMSprop, AdaDelta, Adam, and Nadam all decay past gradients exponentially through the exponential moving average. So even when important gradient information flows in, it is quickly forgotten, and convergence to the optimal solution may fail. AMSGrad solves that problem.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{v}_t &= \textrm{max}\{ \hat{v}_{t-1}, v_t \} \\
\Delta w_t &= - \alpha_t \cfrac{m_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Also, the paper suggests that computing $\alpha_t = \frac{\alpha}{\sqrt{t}}$ may be better, so the implementation adopts that.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha_0$ | alpha | 1e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
$\hat{v}_0$ | vhatw, vhatb | 0 |

As for the initial value $\alpha_0$, no specific value was given in the paper, so I took it from Chainer.
optimizers.py
class AMSGrad(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
alpha_t = self.alpha / np.sqrt(t)
dw = - alpha_t * self.mw/np.sqrt(self.vhatw + self.eps)
db = - alpha_t * self.mb/np.sqrt(self.vhatb + self.eps)
return dw, db
AdaBound AMSGrad suppressed the problem of the learning rate becoming unnecessarily large, but did not solve the opposite problem of it becoming unnecessarily small. AdaBound and AMSBound combine Adam's traits ("learns quickly, but cannot suppress the generalization error well") with SGD's ("can ultimately make the generalization error small, but is slow to get there"): they were proposed as optimization methods that behave like Adam at the start of training and like SGD at the end. AdaBound is the improved version of Adam.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{\eta}_t &= \textrm{clip}\left( \cfrac{\alpha}{\sqrt{v_t}}, \left[ \eta_l(t), \eta_u(t) \right] \right) \\
\eta_t &= \cfrac{\hat{\eta}_t}{\sqrt{t}} \\
\Delta w_t &= - \eta_t m_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Also, if $\eta$ is the learning rate as SGD (i.e. the learning rate at the end of training), the paper computes the bounds as

\begin{align}
\eta_l(t) &= \eta \left( 1 - \cfrac{1}{(1-\beta_2)t+1} \right) \\
\eta_u(t) &= \eta \left( 1 + \cfrac{1}{(1-\beta_2)t} \right)
\end{align}

so I (probably correctly) adopt them. In fact, the bounds should satisfy:

- $\eta_l$ is $0$ at $t = 0$ and tends to $\eta^-$ (the limit from below) as $t \to \infty$
- $\eta_u$ is $\infty$ at $t = 0$ and tends to $\eta^+$ (the limit from above) as $t \to \infty$
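Evaluating these bounds numerically with the defaults used below ($\eta = 0.1$, $\beta_2 = 0.999$) shows both closing in on $\eta$ (a quick check of my own, not from the paper):

```python
import numpy as np

eta, beta2 = 1e-1, 0.999
t = np.array([1, 10, 100, 10000])
etal = eta*(1 - 1/((1 - beta2)*t + 1))  # ~0.0001 ... 0.0909: rises towards eta
etau = eta*(1 + 1/((1 - beta2)*t))      # ~100.1  ... 0.11  : falls towards eta
print(etal, etau)
```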
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 1e-3 |
$\eta$ | eta | 1e-1 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class AdaBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
        etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
        etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
AMSBound As above, AMSGrad suppressed the problem of the learning rate becoming unnecessarily large, but did not solve the opposite problem of it becoming unnecessarily small. AdaBound and AMSBound behave like Adam at the start of training and like SGD at the end. AMSBound is the improved version of AMSGrad.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{v}_t &= \textrm{max}(\hat{v}_{t-1}, v_t) \\
\hat{\eta}_t &= \textrm{clip}\left( \cfrac{\alpha}{\sqrt{\hat{v}_t}}, \left[ \eta_l(t), \eta_u(t) \right] \right) \\
\eta_t &= \cfrac{\hat{\eta}_t}{\sqrt{t}} \\
\Delta w_t &= - \eta_t m_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
$\eta_l$ and $\eta_u$ are calculated in the same way as for AdaBound.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 1e-3 |
$\eta$ | eta | 1e-1 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
$\hat{v}_0$ | vhatw, vhatb | 0 |
optimizers.py
class AMSBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vhatw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vhatb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
Here is the complete code listing, together with a simple experiment. A saddle-point comparison may be added over time.
optimizers.py
import numpy as np
class Optimizer():
"""
A superclass inherited by the optimization method.
"""
def __init__(self, *args, **kwds):
pass
def update(self, *args, **kwds):
pass
class SGD(Optimizer):
def __init__(self, eta=1e-2, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
def update(self, grad_w, grad_b, *args, **kwds):
dw = -self.eta*grad_w
db = -self.eta*grad_b
return dw, db
class MSGD(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Hold the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, *args, **kwds):
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
class NAG(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Holds the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, w=0, b=0, df=None, *args, **kwds):
grad_w = df(w + self.mu*self.dw)
grad_b = df(b + self.mu*self.db)
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
class AdaGrad(Optimizer):
def __init__(self, eta=1e-3, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
#Hold the value of the previous step
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.gw += grad_w*grad_w
self.gb += grad_b*grad_b
dw = -self.eta*grad_w/np.sqrt(self.gw)
db = -self.eta*grad_b/np.sqrt(self.gb)
return dw, db
class RMSprop(Optimizer):
def __init__(self, eta=1e-2, rho=0.99, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw+self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb+self.eps)
return dw, db
class AdaDelta(Optimizer):
def __init__(self, rho=0.95, eps=1e-6, *args, **kwds):
super().__init__(*args, **kwds)
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -grad_w*np.sqrt(self.uw+self.eps)/np.sqrt(self.vw+self.eps)
db = -grad_b*np.sqrt(self.ub+self.eps)/np.sqrt(self.vb+self.eps)
self.uw += (1-self.rho)*(dw**2 - self.uw)
self.ub += (1-self.rho)*(db**2 - self.ub)
return dw, db
class Adam(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
alpha_t = self.alpha*np.sqrt(1-self.beta2**t)/(1-self.beta1**t)
dw = -alpha_t*self.mw/(np.sqrt(self.vw+self.eps))
db = -alpha_t*self.mb/(np.sqrt(self.vb+self.eps))
return dw, db
class RMSpropGraves(Optimizer):
def __init__(self, eta=1e-4, rho=0.95, eps=1e-4, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self,grad_w, grad_b, *args, **kwds):
self.mw += (1-self.rho)*(grad_w - self.mw)
self.mb += (1-self.rho)*(grad_b - self.mb)
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw - self.mw**2 + self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb - self.mb**2 + self.eps)
return dw, db
class SMORMS3(Optimizer):
def __init__(self, eta=1e-3, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.eps = eps
#Hold the value of the previous step
self.zetaw = 0
self.zetab = 0
self.sw = 0
self.sb = 0
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.sw = 1 + (1 - self.zetaw*self.sw)
self.sb = 1 + (1 - self.zetab*self.sb)
rhow = 1/(1+self.sw)
rhob = 1/(1+self.sb)
self.mw += (1-rhow)*(grad_w - self.mw)
self.mb += (1-rhob)*(grad_b - self.mb)
self.vw += (1-rhow)*(grad_w**2 - self.vw)
self.vb += (1-rhob)*(grad_b**2 - self.vb)
self.zetaw = self.mw**2 / (self.vw + self.eps)
        self.zetab = self.mb**2 / (self.vb + self.eps)
dw = -grad_w*(np.minimum(self.eta, self.zetaw)
/np.sqrt(self.vw + self.eps))
db = -grad_b*(np.minimum(self.eta, self.zetab)
/np.sqrt(self.vb + self.eps))
return dw, db
class AdaMax(Optimizer):
def __init__(self, alpha=2e-3, beta1=0.9, beta2=0.999,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.uw = np.maximum(self.beta2*self.uw, np.abs(grad_w))
self.ub = np.maximum(self.beta2*self.ub, np.abs(grad_b))
alpha_t = self.alpha/(1 - self.beta1**t)
dw = -alpha_t*self.mw/self.uw
db = -alpha_t*self.mb/self.ub
return dw, db
class Nadam(Optimizer):
def __init__(self, alpha=2e-3, mu=0.975, nu=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.mu = mu
self.nu = nu
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.mu)*(grad_w - self.mw)
self.mb += (1-self.mu)*(grad_b - self.mb)
self.vw += (1-self.nu)*(grad_w**2 - self.vw)
self.vb += (1-self.nu)*(grad_b**2 - self.vb)
mhatw = (self.mu*self.mw/(1-self.mu**(t+1))
+ (1-self.mu)*grad_w/(1-self.mu**t))
mhatb = (self.mu*self.mb/(1-self.mu**(t+1))
+ (1-self.mu)*grad_b/(1-self.mu**t))
vhatw = self.nu*self.vw/(1-self.nu**t)
vhatb = self.nu*self.vb/(1-self.nu**t)
dw = -self.alpha*mhatw/np.sqrt(vhatw + self.eps)
db = -self.alpha*mhatb/np.sqrt(vhatb + self.eps)
return dw, db
class Eve(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, beta3=0.999,
c=10, eps=1e-8, fstar=0, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.beta3 = beta3
self.c = c
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.fw = np.inf
self.fb = np.inf
self.fstar = fstar
self.dtilde_w = 0
self.dtilde_b = 0
def update(self, grad_w, grad_b, t=1, fw=1, fb=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
mhatw = self.mw/(1 - self.beta1**t)
mhatb = self.mb/(1 - self.beta1**t)
vhatw = self.vw/(1 - self.beta2**t)
vhatb = self.vb/(1 - self.beta2**t)
if t > 1:
d_w = (np.abs(fw-self.fstar)
/(np.minimum(fw, self.fw) - self.fstar))
d_b = (np.abs(fb-self.fstar)
/(np.minimum(fb, self.fb) - self.fstar))
dhat_w = np.clip(d_w, 1/self.c, self.c)
dhat_b = np.clip(d_b, 1/self.c, self.c)
self.dtilde_w += (1 - self.beta3)*(dhat_w - self.dtilde_w)
self.dtilde_b += (1 - self.beta3)*(dhat_b - self.dtilde_b)
else:
self.dtilde_w = 1
self.dtilde_b = 1
self.fw = fw
self.fb = fb
dw = -(self.alpha*mhatw
/(self.dtilde_w*(np.sqrt(vhatw) + self.eps)))
db = -(self.alpha*mhatb
/(self.dtilde_b*(np.sqrt(vhatb) + self.eps)))
return dw, db
class SantaE(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += self.uw*self.uw - self.eta/beta
self.alpha_b += self.ub*self.ub - self.eta/beta
uw = (self.eta/beta * (1 - self.gw/gw)/self.uw
+ np.sqrt(2*self.eta/beta * self.gw)
* np.random.randn(*shape_w))
ub = (self.eta/beta * (1 - self.gb/gb)/self.ub
+ np.sqrt(2*self.eta/beta * self.gb)
* np.random.randn(*shape_b))
else:
# Refinement.
uw = 0
ub = 0
uw += (1 - self.alpha_w)*self.uw - self.eta*gw*grad_w
ub += (1 - self.alpha_b)*self.ub - self.eta*gb*grad_b
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw
db = gb*ub
return dw, db
class SantaSSS(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
dw = 0.5*gw*self.uw
db = 0.5*gb*self.ub
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += (self.uw*self.uw - self.eta/beta)*0.5
self.alpha_b += (self.ub*self.ub - self.eta/beta)*0.5
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw += (-gw*grad_w*self.eta
+ np.sqrt(2*self.gw*self.eta/beta)
* np.random.randn(*shape_w)
+ self.eta/beta*(1-self.gw/gw)/self.uw)
ub += (-gb*grad_b*self.eta
+ np.sqrt(2*self.gb*self.eta/beta)
* np.random.randn(*shape_b)
+ self.eta/beta*(1-self.gb/gb)/self.ub)
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
self.alpha_w += (uw*uw - self.eta/beta)*0.5
self.alpha_b += (ub*ub - self.eta/beta)*0.5
else:
# Refinement.
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw -= gw*grad_w*self.eta
ub -= gb*grad_b*self.eta
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw*0.5
db = gb*ub*0.5
return dw, db
class AMSGrad(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
alpha_t = self.alpha / np.sqrt(t)
dw = - alpha_t * self.mw/np.sqrt(self.vhatw + self.eps)
db = - alpha_t * self.mb/np.sqrt(self.vhatb + self.eps)
return dw, db
class AdaBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
class AMSBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vhatw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vhatb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
opt_example.py
%matplotlib nbagg
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
def f(x):
return ((1/4) * x**4 - (1/6) * x**3 - (17/8) * x**2 - (15/8) * x) * 4
def df(x):
return (x**3 - (1/2) * x**2 - (17/4) * x - 15/8) * 4
x = np.arange(-3, 3, 1e-2)
start_x = -3
start_y = f(start_x)
exact = 2.5
epoch = 100
seed=32
np.random.seed(seed=seed)
# epoch=100 is seed=32
opt_dict = {
"SDG": SGD(),
"MSGD": MSGD(),
"NAG": NAG(mu=0.95),
"AdaGrad": AdaGrad(eta=4e-1),
"RMSprop": RMSprop(eta=4e-2),
"AdaDelta": AdaDelta(eps=1e-2),
"Adam": Adam(alpha=4e-1),
"RMSpropGraves": RMSpropGraves(eta=4e-2),
"SMORMS3": SMORMS3(eta=4e-2),
"AdaMax": AdaMax(),
"Nadam": Nadam(),
"Eve": Eve(alpha=4e-1),
"SantaE": SantaE(eta=4e-4, burnin=0.5*epoch),
"SantaSSS": SantaSSS(eta=4e-4, burnin=0.5*epoch),
"AMSGrad": AMSGrad(alpha=4e-1),
"AdaBound": AdaBound(alpha=4e-1),
"AMSBound": AMSBound(alpha=4e-1),
}
current_x = np.full(len(opt_dict), start_x, dtype=float)
current_y = np.full(len(opt_dict), start_y, dtype=float)
err_list = np.zeros((epoch, len(opt_dict)), dtype=float)
cmap = plt.get_cmap("rainbow")
coloring = [cmap(i) for i in np.linspace(0, 1, len(opt_dict))]
fig, ax = plt.subplots(1)
ax.set_position([0.1, 0.1, 0.6, 0.8])
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid()
images = []
fixed_im, = ax.plot(x, f(x))
for i in range(1, epoch+1):
im_buf = []
for j, opt in enumerate(opt_dict):
im, = ax.plot(current_x[j], current_y[j],
marker="o", color=coloring[j])
err_list[i-1, j] = 0.5 * (current_x[j] - exact)**2
dx, _ = opt_dict[opt].update(df(current_x[j]), 1,
t=i, N=epoch,
w=current_x[j],
df=df, fw=f(current_x[j]))
current_x[j] += dx
current_y[j] = f(current_x[j])
im_buf.append(im)
images.append([fixed_im] + im_buf)
for j, opt in enumerate(opt_dict):
im, = ax.plot(current_x[j], current_y[j],
marker="o", color=coloring[j],
label=opt)
im_buf.append(im)
images.append([fixed_im] + im_buf)
fig.suptitle("Optimiser comparison")
fig.legend(bbox_to_anchor=(0.96, 0.9))
ani = animation.ArtistAnimation(fig, images, interval=100)
fig2, ax2 = plt.subplots(1)
fig2.suptitle("Optimiser comparison of errors")
ax2.set_position([0.1, 0.1, 0.6, 0.8])
ax2.set_xlabel("x")
ax2.set_ylabel("errors")
ax2.set_yscale("log")
ax2.grid()
for j, opt in enumerate(opt_dict):
ax2.plot(err_list[:, j], color=coloring[j], label=opt)
fig2.legend(bbox_to_anchor=(0.96, 0.9))
It took a lot of time ... If you find any mistakes, please let me know.
References:

- Deep learning optimization algorithm
- [2020 definitive edition] Super easy-to-understand optimization algorithm -Adam and Newton's method from loss function-
- [Plot different colors with matplotlib](https://www.it-swarm.dev/ja/python/ Plot different colors with matplotlib /1072229568/)
- Adjust the position of the matplotlib legend
- matplotlib reverse lookup memo (frame edition)
- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Coding Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- Introduction to Deep Learning ~ Backpropagation ~
- List of activation functions (2020)
- Gradient descent method list (2020)
- Thorough understanding of im2col
- Complete understanding of numpy.pad function