In this article I summarize stochastic gradient descent methods, the optimization methods used in deep learning. The formulas follow the original papers; for initial values and constants that the papers do not specify, I adopted the choices made by Pytorch, Keras, Chainer, and other libraries. The formulas are written with implementation in mind, so please take a look if you want to implement them yourself.
Please feel free to point out any mistakes. Also, if you have information on new optimization methods, or are developing one yourself, please let me know!
- Introduction
- [What is the steepest descent method](#what-is-the-steepest-descent-method)
- [What is stochastic gradient descent](#what-is-stochastic-gradient-descent)
First, as background, I will explain the difference between the steepest descent method and the stochastic gradient descent method. It's not a big deal.
The steepest descent method descends in the direction that **reduces the loss function the most (the direction in which the gradient is largest)**. It is also called gradient descent.
The chain rule is briefly explained in here.
The **update rule** is the heart of gradient descent methods; performance is determined by this part. Unfortunately, there is no single best gradient descent method for all problems (at least for now). At a certain time $t$, the parameter update formula is
w_{t+1} = w_t - \eta \nabla_{w_t} \mathcal{L}(w_t)
When this is written out for each parameter element, it becomes
w_{t+1}^{(i)} = w_t^{(i)} - \eta \nabla_{w_{t}^{(i)}} \mathcal{L}(w_t)
and so on (of course, the parameters are usually higher-dimensional). The symbols are:

- $w_t$: the parameters at time $t$
- $\eta$: the learning rate
- $\mathcal{L}(w_t)$: the loss function

Parameter updates according to this formula are repeated until the loss function hardly changes, at which point learning is considered complete. Note that this is **not** "until the value of the loss function is small enough". For example, learning may fail to converge because the model has too few parameters, so training runs **"until the amount of change is sufficiently small"**.
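As a minimal sketch of this stopping criterion (the toy loss, learning rate, and tolerance here are my own illustrative choices, not from any paper):

```python
def f(x):
    return x**2  # a toy loss function

def df(x):
    return 2*x   # its gradient

eta = 0.1  # learning rate
w = 3.0    # initial parameter value
for t in range(10000):
    dw = -eta*df(w)     # steepest-descent update
    w += dw
    if abs(dw) < 1e-8:  # stop when the *change* is small enough,
        break           # not when the loss itself is small
print(t, w)  # converges near the minimum at x = 0
```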
Before discussing stochastic gradient descent, let's first introduce the drawbacks of steepest descent.
The drawback of the steepest descent method is that it **updates the parameters using all of the input data at once**. Because of this, **once it falls into a local minimum it cannot get out**. Details are available at here, so I will omit them, but in short, learning on all the data at once makes it easy to fall into such a situation. Moreover, when the parameter dimension is 2 or more, an even more troublesome point called a **saddle point** appears: "a point that is a maximum with respect to one parameter but a minimum with respect to another", which is very annoying. It is known that the number of saddle points grows exponentially as the parameter dimension increases. This is one face of the **curse of dimensionality**: the gradient at a saddle point is still $0$, so learning does not proceed there. Worse, as the dimension grows, saddle points become far more likely than minima and maxima, so the landscape is no different from one with "pitfalls" everywhere. And since the parameter dimension is basically the total number of weights $w$ and biases $b$, it quickly becomes a ridiculous number.
In any case, learning rules are a delicate and complicated problem. Let's get back to the topic. Since the drawback of the steepest descent method is that it uses all of the input data, it seems it could be solved by using data points one at a time. The method of updating the parameters using one input data point at a time is called **stochastic gradient descent**. A learning method that updates the parameters sequentially for each input data point like this is called **online learning**. Since the parameters are updated for each input, a more accurate model can be expected. By the way, the "stochastic" part refers to using random numbers to draw each data point from the full input data.
However, stochastic gradient descent has the disadvantage that it **cannot be parallelized**. It would be a waste not to benefit from modern parallel hardware such as GPUs. As a result, a method called **mini-batch learning** was devised. The idea is simple: update the parameters using a fixed number of input data points at a time (typically $16$ or $32$). This can also be parallelized.
However, mini-batch learning has the drawback of **"oscillating in sharp valleys"**. It seems the only way to overcome this is to leave it to the learning rules themselves.
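To make the variants concrete, here is a rough sketch of how mini-batches are typically drawn (the array `X`, the batch size, and the placeholder gradient step are assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))  # dummy input data
batch_size = 32                     # batch_size=1 would be online learning,
                                    # batch_size=len(X) the steepest descent

indices = rng.permutation(len(X))   # random draws are the "stochastic" part
for start in range(0, len(X), batch_size):
    batch = X[indices[start:start+batch_size]]
    # ...compute the gradient on `batch` only and update the parameters...
```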
SGD First up is the most basic gradient descent method. SGD stands for Stochastic Gradient Descent. In other words, it is exactly the learning rule described earlier, and as a formula it is the same as the steepest descent method.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= - \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
In the implementation, $g_t$ is the error (gradient) that flows in via the chain rule (TensorFlow would obtain it by automatic differentiation), so you don't need to compute $g_t$ yourself. $\Delta w_t$ is the update amount for this time step: simply the gradient $g_t$ multiplied by the learning rate $\eta$. The minus sign in $\Delta w_t = -\eta g_t$ is only there so that the last equation can be an addition; there is no deeper meaning, and if you instead write $w_{t+1} = w_t - \Delta w_t$, then $\Delta w_t = \eta g_t$.
The hyperparameter and its default value (taken from the implementation below) are:

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
optimizers.py
class SGD(Optimizer):
def __init__(self, eta=1e-2, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
def update(self, grad_w, grad_b, *args, **kwds):
dw = -self.eta*grad_w
db = -self.eta*grad_b
return dw, db
MomentumSGD Plain SGD converges slowly and is prone to problems such as oscillation and saddle points, so it needs improvement. One way is to add **momentum**. Momentum is, in short, inertia. Oscillation is suppressed by using information from the previous gradient.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= \mu \Delta w_{t-1} - (1 - \mu) \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
The notation varies quite a bit between sources, depending on where $\eta$ is multiplied and whether the second term on the right-hand side of the second equation carries the factor $1-\mu$. A quick look also turned up
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= \mu \Delta w_{t-1} - \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Or
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
\Delta w_t &= \mu \Delta w_{t-1} - g_t \\
w_{t+1} &= w_t + \eta \Delta w_t
\end{align}
and so on. This article implements the first form. Don't worry about these notational differences: they can all be turned into the same expression by rescaling the hyperparameters $\mu$ and $\eta$ appropriately.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\mu$ | mu | 0.9 |
$\Delta w_0, \Delta b_0$ | dw, db | 0 |
optimizers.py
class MSGD(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Hold the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, *args, **kwds):
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
NAG NAG (Nesterov's Accelerated Gradient) modifies MSGD to speed up convergence. By computing the gradient using the previous **update amount**, it roughly estimates the position one time step ahead.
**The formula in [this article](https://qiita.com/ZoneTsuyoshi/items/8ef6fa1e154d176e25b8#%E3%83%8D%E3%82%B9%E3%83%86%E3%83%AD%E3%83%95%E3%81%AE%E5%8A%A0%E9%80%9F%E6%B3%95nag-1983) may be wrong.** The correct one is probably the following.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t+\mu \Delta w_{t-1}) \\
\Delta w_t &= \mu \Delta w_{t-1} - (1-\mu)\eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Besides this, [here](https://qiita.com/omiita/items/1735c1d048fe5f611f80#54-%E3%83%A2%E3%83%BC%E3%83%A1%E3%83%B3%E3%82%BF%E3%83%A0%E3%81%AE%E6%94%B9%E8%89%AF%E7%89%88-nag) gives
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t - \mu \Delta w_{t-1}) \\
\Delta w_t &= \mu \Delta w_{t-1} + (1-\mu)g_t \\
w_t &= w_{t-1} - \eta \Delta w_t
\end{align}
(the notation has been matched to this article's). They do the same thing; I actually implemented and tried both, and they behaved the same, so either should be fine.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\mu$ | mu | 0.9 |
$\Delta w_0, \Delta b_0$ | dw, db | 0 |
optimizers.py
class NAG(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Holds the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, w=0, b=0, df=None, *args, **kwds):
grad_w = df(w + self.mu*self.dw)
grad_b = df(b + self.mu*self.db)
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
Since the gradient that has already flowed in through the chain rule would have to be recomputed at the shifted point, NAG seems difficult to use in deep learning as-is... I wonder how people usually do it...
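For what it's worth, on a simple problem where the gradient function is available directly, the NAG class above can be driven like this (the toy gradient and all constants are mine, not from any paper):

```python
def df(x):
    return 2*x  # gradient of the toy loss x**2

opt = NAG(eta=1e-2, mu=0.9)  # the class defined above
w = b = 3.0
for t in range(300):
    # NAG re-evaluates df at the look-ahead point w + mu*dw, which is
    # exactly the part that is awkward with backpropagated gradients.
    dw, db = opt.update(df(w), df(b), w=w, b=b, df=df)
    w += dw
    b += db
print(w, b)  # both approach the minimum at 0
```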
AdaGrad Since MSGD does not use information about the direction of convergence, it has the problem that convergence is slow in directions where the gradient is gentle. AdaGrad is the pioneer of the approach that solves this by adjusting the learning rate for each dimension.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
\Delta w_t &= - \cfrac{\eta}{\displaystyle\sqrt{\sum_{\tau=1}^{t}{g_{\tau}^2}}} g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, the sum of squares in the second equation is kept as a running total,
\begin{align}
gw_t = gw_{t-1} + g_t^2
\end{align}
which saves memory and computation.
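A quick check of this identity (the dummy gradient history is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.standard_normal(100)  # a dummy history of gradients g_1..g_100

gw = 0.0
for g in grads:
    gw += g*g  # incremental accumulation, as in the implementation below
assert np.isclose(gw, np.sum(grads**2))  # equals the full sum of squares
```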
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-3 |
$gw_0, gb_0$ | gw, gb | 0 |
optimizers.py
class AdaGrad(Optimizer):
def __init__(self, eta=1e-3, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
#Hold the value of the previous step
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.gw += grad_w*grad_w
self.gb += grad_b*grad_b
dw = -self.eta*grad_w/np.sqrt(self.gw)
db = -self.eta*grad_b/np.sqrt(self.gb)
return dw, db
RMSprop AdaGrad has the problem that after a steep slope followed by a gentle one, it barely descends into any valley that comes after. The reason is that the squared gradients keep accumulating, so the learning rate keeps shrinking. RMSprop improves on this: by "forgetting" the influence of old gradients through the coefficient $\rho$, it resolves the ever-decreasing learning rate.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Delta w_t &= - \cfrac{\eta}{\sqrt{v_t + \varepsilon}}g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, the second equation is transformed as follows:
\begin{align}
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= (v_{t-1} - v_{t-1}) + \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} - (1-\rho)v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\rho$ | rho | 0.99 |
$\varepsilon$ | eps | 1e-8 |
$v_0$ | vw, vb | 0 |
optimizers.py
class RMSprop(Optimizer):
def __init__(self, eta=1e-2, rho=0.99, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw+self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb+self.eps)
return dw, db
AdaDelta RMSprop is actually physically inconsistent: simply put, the dimensions (units) of the two sides of the update do not match. I'd like to say that this doesn't matter since we're not solving a physics problem... but it does cause real trouble. In the end, the hyperparameter $\eta$ has to be fitted to each problem. For example,
\begin{align}
f(x) &= x^2 \\
g(x) &= 10x^2
\end{align}
Consider fitting each of these two functions. With a learning rule whose dimensions do not match, the learning rate must be adjusted separately for each function; in other words, the hyperparameters must be re-tuned. This is a tedious and difficult task, and I can't imagine how much time would be wasted doing it for a large-scale experiment that takes several days per run. So the dimensional mismatch is quite a problem, and AdaDelta is what solves it.
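To see the problem numerically, here is one gradient step on both functions with the same learning rate (the values are illustrative choices of mine):

```python
eta, x0 = 0.15, 1.0
x_f = x0 - eta*2*x0   # f'(x) = 2x  : x goes 1.0 -> 0.7, converging
x_g = x0 - eta*20*x0  # g'(x) = 20x : x goes 1.0 -> -2.0, overshooting
print(x_f, x_g)       # the same eta must be re-tuned for g
```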
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
v_t &= \rho v_{t-1} + (1-\rho) g_t^2 \\
\Delta w_t &= - \cfrac{\sqrt{u_{t-1} + \varepsilon}}{\sqrt{v_t + \varepsilon}} g_t \\
u_t &= \rho u_{t-1} + (1-\rho) (\Delta w_t)^2 \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 4 are transformed as follows:
\begin{align}
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho)(g_t^2 - v_{t-1}) \\
u_t &= \rho u_{t-1} + (1-\rho) (\Delta w_t)^2 \\
\Leftrightarrow
u_t &= u_{t-1} + (1-\rho) \left\{ ( \Delta w_t)^2 - u_{t-1} \right\}
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\rho$ | rho | 0.95 |
$\varepsilon$ | eps | 1e-6 |
$v_0$ | vw, vb | 0 |
$u_0$ | uw, ub | 0 |
optimizers.py
class AdaDelta(Optimizer):
def __init__(self, rho=0.95, eps=1e-6, *args, **kwds):
super().__init__(*args, **kwds)
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -grad_w*np.sqrt(self.uw+self.eps)/np.sqrt(self.vw+self.eps)
db = -grad_b*np.sqrt(self.ub+self.eps)/np.sqrt(self.vb+self.eps)
self.uw += (1-self.rho)*(dw**2 - self.uw)
self.ub += (1-self.rho)*(db**2 - self.ub)
return dw, db
Adam Needless to say, this is the famous Adam. Various improvements have been proposed since, but it still seems to be going strong; its strength is, after all, its high versatility. That said, it is still an improved version of RMSprop, so strictly speaking its dimensions do not match either. Even so, compared to other learning rules, the dimensional problem is thought to be alleviated by its characteristic of **forgetting gradient information**.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
m_t &= \beta_1 m_{t-1} + (1- \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1- \beta_2) g_t^2 \\
\alpha_t &= \alpha_0 \cfrac{\sqrt{1-\beta_2^t}}{1- \beta_1^t} \\
\Delta w_t &= - \alpha_t \cfrac{m_t}{\sqrt{v_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
The above formula uses the improved update found in the text of the Original Paper. Written as plain formulas without that rearrangement, it is
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
m_t &= \beta_1 m_{t-1} + (1- \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1- \beta_2) g_t^2 \\
\hat{m}_t &= \cfrac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \cfrac{v_t}{1-\beta_2^t} \\
\Delta w_t &= - \alpha_0 \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha_0$ | alpha | 1e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class Adam(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
alpha_t = self.alpha*np.sqrt(1-self.beta2**t)/(1-self.beta1**t)
dw = -alpha_t*self.mw/(np.sqrt(self.vw+self.eps))
db = -alpha_t*self.mb/(np.sqrt(self.vb+self.eps))
return dw, db
RMSpropGraves This is an improved version of RMSprop devised by Graves. It is used in fields such as handwritten character recognition.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
m_t &= \rho m_{t-1} + (1-\rho) g_t \\
v_t &= \rho v_{t-1} + (1-\rho) g_t^2 \\
\Delta w_t &= - \eta \cfrac{g_t}{\sqrt{v_t - m_t^2 + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \rho m_{t-1} + (1-\rho)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\rho)(g_t - m_{t-1}) \\
v_t &= \rho v_{t-1} + (1-\rho)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-4 |
$\rho$ | rho | 0.95 |
$\varepsilon$ | eps | 1e-4 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class RMSpropGraves(Optimizer):
def __init__(self, eta=1e-4, rho=0.95, eps=1e-4, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self,grad_w, grad_b, *args, **kwds):
self.mw += (1-self.rho)*(grad_w - self.mw)
self.mb += (1-self.rho)*(grad_b - self.mb)
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw - self.mw**2 + self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb - self.mb**2 + self.eps)
return dw, db
SMORMS3 This is also an improvement of RMSprop. It is said to overcome the dimensional problem using the curvature and LeCun's method. I'm not sure exactly what "LeCun's method" refers to, but LeCun has proposed techniques for backpropagation and announced the LeNet model.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
s_t &= 1 + (1-\zeta_{t-1} s_{t-1}) \\
\rho_t &= \cfrac{1}{s_t + 1} \\
m_t &= \rho_t m_{t-1} + (1-\rho_t) g_t \\
v_t &= \rho_t v_{t-1} + (1-\rho_t) g_t^2 \\
\zeta_t &= \cfrac{m_t^2}{v_t + \varepsilon} \\
\Delta w_t &= - \cfrac{\textrm{min}\{ \eta, \zeta_t \}}{\sqrt{v_t + \varepsilon}} g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 4 and 5 are transformed as follows:
\begin{align}
m_t &= \rho_t m_{t-1} + (1-\rho_t)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\rho_t)(g_t - m_{t-1}) \\
v_t &= \rho_t v_{t-1} + (1-\rho_t)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\rho_t)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-3 |
$\varepsilon$ | eps | 1e-8 |
$\zeta_0$ | zetaw, zetab | 0 |
$s_0$ | sw, sb | 0 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class SMORMS3(Optimizer):
def __init__(self, eta=1e-3, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.eps = eps
#Hold the value of the previous step
self.zetaw = 0
self.zetab = 0
self.sw = 0
self.sb = 0
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.sw = 1 + (1 - self.zetaw*self.sw)
self.sb = 1 + (1 - self.zetab*self.sb)
rhow = 1/(1+self.sw)
rhob = 1/(1+self.sb)
self.mw += (1-rhow)*(grad_w - self.mw)
self.mb += (1-rhob)*(grad_b - self.mb)
self.vw += (1-rhow)*(grad_w**2 - self.vw)
self.vb += (1-rhob)*(grad_b**2 - self.vb)
self.zetaw = self.mw**2 / (self.vw + self.eps)
        self.zetab = self.mb**2 / (self.vb + self.eps)
dw = -grad_w*(np.minimum(self.eta, self.zetaw)
/np.sqrt(self.vw + self.eps))
db = -grad_b*(np.minimum(self.eta, self.zetab)
/np.sqrt(self.vb + self.eps))
return dw, db
AdaMax It's a dimensionless version of Adam.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
u_t &= \textrm{max}\left\{ \beta_2 u_{t-1}, |g_t| \right\} \\
\Delta w_t &= - \cfrac{\alpha}{1-\beta_1^t}\cfrac{m_t}{u_t} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, the second equation is transformed as follows:

\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 2e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$m_0$ | mw, mb | 0 |
$u_0$ | uw, ub | 0 |
optimizers.py
class AdaMax(Optimizer):
def __init__(self, alpha=2e-3, beta1=0.9, beta2=0.999,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.uw = np.maximum(self.beta2*self.uw, np.abs(grad_w))
self.ub = np.maximum(self.beta2*self.ub, np.abs(grad_b))
alpha_t = self.alpha/(1 - self.beta1**t)
dw = -alpha_t*self.mw/self.uw
db = -alpha_t*self.mb/self.ub
return dw, db
Nadam Nadam incorporates Nesterov's accelerated gradient (NAG) into Adam.

\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \mu m_{t-1} + (1-\mu) g_t \\
v_t &= \nu v_{t-1} + (1-\nu) g_t^2 \\
\hat{m}_t &= \cfrac{\mu}{1 - \mu^{t+1}} m_t + \cfrac{1-\mu}{1-\mu^t} g_t \\
\hat{v}_t &= \cfrac{\nu}{1-\nu^t} v_t \\
\Delta w_t &= - \alpha \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Actually, $\alpha$ and $\mu$ are time-step-dependent parameters; strictly speaking, the method is expressed as
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \mu_t m_{t-1} + (1-\mu_t) g_t \\
v_t &= \nu v_{t-1} + (1-\nu) g_t^2 \\
\hat{m}_t &= \cfrac{\mu_t}{1 - \displaystyle\prod_{\tau=1}^{t+1}{\mu_{\tau}}} m_t + \cfrac{1-\mu_t}{1-\displaystyle\prod_{\tau=1}^{t}{\mu_{\tau}}} g_t \\
\hat{v}_t &= \cfrac{\nu}{1-\nu^t} v_t \\
\Delta w_t &= - \alpha_t \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
However, they seem to be held fixed in most cases, so I fixed them here for the time being.
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \mu m_{t-1} + (1-\mu)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\mu)(g_t - m_{t-1}) \\
v_t &= \nu v_{t-1} + (1-\nu)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\nu)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 2e-3 |
$\mu$ | mu | 0.975 |
$\nu$ | nu | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class Nadam(Optimizer):
def __init__(self, alpha=2e-3, mu=0.975, nu=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.mu = mu
self.nu = nu
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.mu)*(grad_w - self.mw)
self.mb += (1-self.mu)*(grad_b - self.mb)
self.vw += (1-self.nu)*(grad_w**2 - self.vw)
self.vb += (1-self.nu)*(grad_b**2 - self.vb)
mhatw = (self.mu*self.mw/(1-self.mu**(t+1))
+ (1-self.mu)*grad_w/(1-self.mu**t))
mhatb = (self.mu*self.mb/(1-self.mu**(t+1))
+ (1-self.mu)*grad_b/(1-self.mu**t))
vhatw = self.nu*self.vw/(1-self.nu**t)
vhatb = self.nu*self.vb/(1-self.nu**t)
dw = -self.alpha*mhatw/np.sqrt(vhatw + self.eps)
db = -self.alpha*mhatb/np.sqrt(vhatb + self.eps)
return dw, db
Eve This is an improved version of Adam that speeds up learning in flat regions (regions where the gradient is almost zero). The names are fashionable, aren't they? Adam and Eve.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{m}_t &= \cfrac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \cfrac{v_t}{1-\beta_2^t} \\
\textrm{if} \quad &t \gt 1 \\
& d_t = \cfrac{\left| f_t - f_{t-1} \right|}{\textrm{min} \left\{ f_t, f_{t-1} \right\} - f^{\star}} \\
& \hat{d}_t = \textrm{clip}\left( d_t, \left[ \frac{1}{c}, c \right] \right) \\
& \tilde{d}_t = \beta_3 \tilde{d}_{t-1} + (1-\beta_3)\hat{d}_t \\
\textrm{else} \\
& \tilde{d}_t = 1 \\
\textrm{end} &\textrm{if} \\
\Delta w_t &= - \cfrac{\alpha}{\tilde{d}_t} \cfrac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
It's quite involved... so a few notes on the symbols in the formula. $f_t$ is the value of the objective function $f$ at time step $t$, and $f^{\star}$ is the minimum value of the objective function; according to the original paper, $f^{\star} = 0$ is sufficient when the squared error or cross-entropy error is used as the loss. $\textrm{clip}$ is a function that, given a minimum and a maximum, rounds values below/above them up/down to those bounds.
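In NumPy the clip function is simply `np.clip`; for Eve's $\hat{d}_t = \textrm{clip}(d_t, [1/c, c])$ with the default $c = 10$ (the sample values of $d_t$ below are made up):

```python
import numpy as np

c = 10
d = np.array([0.01, 0.5, 3.0, 250.0])
print(np.clip(d, 1/c, c))  # -> [ 0.1  0.5  3.  10. ]
```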
For the implementation, equations 2 and 3 are transformed as follows:

\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 1e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\beta_3$ | beta3 | 0.999 |
$c$ | c | 10 |
$\varepsilon$ | eps | 1e-8 |
$f^{\star}$ | fstar | 0 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
$f_0$ | fw, fb | np.inf |
$\tilde{d}_0$ | dtilde_w, dtilde_b | 0 |
optimizers.py
class Eve(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, beta3=0.999,
c=10, eps=1e-8, fstar=0, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
        self.beta2 = beta2
self.beta3 = beta3
self.c = c
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.fw = np.inf
self.fb = np.inf
self.fstar = fstar
self.dtilde_w = 0
self.dtilde_b = 0
def update(self, grad_w, grad_b, t=1, fw=1, fb=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
mhatw = self.mw/(1 - self.beta1**t)
mhatb = self.mb/(1 - self.beta1**t)
vhatw = self.vw/(1 - self.beta2**t)
vhatb = self.vb/(1 - self.beta2**t)
if t > 1:
d_w = (np.abs(fw-self.fstar)
/(np.minimum(fw, self.fw) - self.fstar))
d_b = (np.abs(fb-self.fstar)
/(np.minimum(fb, self.fb) - self.fstar))
dhat_w = np.clip(d_w, 1/self.c, self.c)
dhat_b = np.clip(d_b, 1/self.c, self.c)
self.dtilde_w += (1 - self.beta3)*(dhat_w - self.dtilde_w)
self.dtilde_b += (1 - self.beta3)*(dhat_b - self.dtilde_b)
else:
self.dtilde_w = 1
self.dtilde_b = 1
self.fw = fw
self.fb = fb
dw = -(self.alpha*mhatw
/(self.dtilde_w*(np.sqrt(vhatw) + self.eps)))
db = -(self.alpha*mhatb
/(self.dtilde_b*(np.sqrt(vhatb) + self.eps)))
return dw, db
Santa-E Santa incorporates Markov chain Monte Carlo (MCMC) methods into Adam and RMSprop; among its variants, Santa-E seems to use the Euler scheme. I read through the Original paper when implementing it, but honestly I don't understand it at all... lol. Still, even without understanding, you can write the code if you follow the formulas! So I did my best. One problem was that the original paper does not give the initial values... I took a peek at the implementations in Pytorch, Keras, Chainer, etc. (they were all inconsistent...), and in my own implementation I use only the hyperparameters that seem necessary.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
v_t &= \sigma v_{t-1} + \cfrac{1-\sigma}{N^2} g_t^2 \\
\mathcal{G}_t &= \cfrac{1}{\sqrt{\lambda + \sqrt{v_t}}} \\
\textrm{if} \quad &t \lt burnin \\
& \alpha_t = \alpha_{t-1} + (u_{t-1}^2 - \cfrac{\eta}{\beta_t}) \\
& u_t = \cfrac{\eta}{\beta_t}\cfrac{1-\frac{\mathcal{G}_{t-1}}{\mathcal{G}_t}}{u_{t-1}} + \sqrt{\cfrac{2\eta}{\beta_t}\mathcal{G}_{t-1}} \otimes \zeta_t \\
\textrm{else} \\
& \alpha_t = \alpha_{t-1} \\
& u_t = 0 \\
\textrm{end} &\textrm{if} \\
u_t &= u_t + (1-\alpha_t) \otimes u_{t-1} - \eta \mathcal{G}_t \otimes g_t \\
\Delta w_t &= \mathcal{G}_t \otimes u_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
The element-wise product $\otimes$ appears a lot, but there is nothing to worry about: in a NumPy implementation it is the ordinary `*`; it is matrix products that would need special care.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\sigma$ | sigma | 0.95 |
$\lambda$ | lambda_ | 1e-8 |
burnin | burnin | 100 |
$C$ | C | 5 |
$\alpha_0$ | alpha_w, alpha_b | $\sqrt{\eta} C$ |
$\zeta_t$ | (random numbers generated directly, not stored in variables) | $\mathcal{N}(0, 1)$ |
$v_0$ | vw, vb | 0 |
$\mathcal{G}_0$ | gw, gb | 0 |
$u_0$ | uw, ub | $\sqrt{\eta} \mathcal{N}(0, 1)$ |
$\mathcal{N}(0, 1)$ denotes a random number drawn from the standard normal distribution.
Also, $\beta_t$ is determined by `anne_func` and `anne_rate` in the code. The original paper does not specify it; the only condition is that $\beta_t \to \infty$ as $t \to \infty$. Famous machine learning libraries such as Keras, [Pytorch](https://github.com/ReyhaneAskari/santa/blob/master/santa.py), and Chainer apparently compute $\beta_t = \sqrt{t}$.
But is anything really fine as long as the condition is met?
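That $\beta_t = \sqrt{t}$ schedule corresponds to the default `anne_func`/`anne_rate` of the SantaE class defined below; any schedule with $\beta_t \to \infty$ can be swapped in (the alternative here is a made-up example of mine, assuming the class below is in scope):

```python
import numpy as np

# default schedule: beta_t = t**0.5 = sqrt(t)
opt = SantaE(anne_func=lambda t, n: t**n, anne_rate=0.5)
# a made-up alternative that also tends to infinity as t grows:
opt_alt = SantaE(anne_func=lambda t, n: (1 + np.log(t))*t**n, anne_rate=0.3)
```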
optimizers.py
class SantaE(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += self.uw*self.uw - self.eta/beta
self.alpha_b += self.ub*self.ub - self.eta/beta
uw = (self.eta/beta * (1 - self.gw/gw)/self.uw
+ np.sqrt(2*self.eta/beta * self.gw)
* np.random.randn(*shape_w))
ub = (self.eta/beta * (1 - self.gb/gb)/self.ub
+ np.sqrt(2*self.eta/beta * self.gb)
* np.random.randn(*shape_b))
else:
# Refinement.
uw = 0
ub = 0
uw += (1 - self.alpha_w)*self.uw - self.eta*gw*grad_w
ub += (1 - self.alpha_b)*self.ub - self.eta*gb*grad_b
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw
db = gb*ub
return dw, db
Santa-SSS This is the other algorithm described in the same paper. I don't understand it at all either, but I can implement it. What is Riemannian information geometry?
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L}(w_t) \\
v_t &= \sigma v_{t-1} + \cfrac{1-\sigma}{N^2} g_t^2 \\
\mathcal{G}_t &= \cfrac{1}{\sqrt{\lambda + \sqrt{v_t}}} \\
\Delta w_t &= \mathcal{G}_t \otimes u_{t-1} \times 0.5 \\
\textrm{if} \quad &t \lt burnin \\
& \alpha_t = \alpha_{t-1} + (u_{t-1}^2 - \cfrac{\eta}{\beta_t}) \times 0.5 \\
& u_t = e^{- 0.5\alpha_t} \otimes u_{t-1} \\
& u_t = u_t - \mathcal{G}_t \otimes g_t \eta + \sqrt{2\mathcal{G}_{t-1} \cfrac{\eta}{\beta_t}} \otimes \zeta_t + \cfrac{\eta}{\beta_t}\cfrac{1 - \frac{\mathcal{G}_{t-1}}{\mathcal{G}_t}}{u_{t-1}} \\
& u_t = e^{-0.5\alpha_t} \otimes u_t \\
& \alpha_t = \alpha_t + \left( u_t^2 - \cfrac{\eta}{\beta_t} \right) \times 0.5 \\
\textrm{else} \\
& \alpha_t = \alpha_{t-1} \\
& u_t = e^{-0.5\alpha_t} \otimes u_{t-1} \\
& u_t = u_t - \mathcal{G}_t \otimes g_t \eta \\
& u_t = e^{-0.5\alpha_t} \otimes u_t \\
\textrm{end} &\textrm{if} \\
\Delta w_t &= \Delta w_t + \mathcal{G}_t \otimes u_t \times 0.5 \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\eta$ | eta | 1e-2 |
$\sigma$ | sigma | 0.95 |
$\lambda$ | lambda_ | 1e-8 |
burnin | burnin | 100 |
$C$ | C | 5 |
$\alpha_0$ | alpha_w, alpha_b | $\sqrt{\eta} C$ |
$\zeta_t$ | (random numbers generated directly, not stored in variables) | $\mathcal{N}(0, 1)$ |
$v_0$ | vw, vb | 0 |
$\mathcal{G}_0$ | gw, gb | 0 |
$u_0$ | uw, ub | $\sqrt{\eta} \mathcal{N}(0, 1)$ |
optimizers.py
class SantaSSS(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
dw = 0.5*gw*self.uw
db = 0.5*gb*self.ub
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += (self.uw*self.uw - self.eta/beta)*0.5
self.alpha_b += (self.ub*self.ub - self.eta/beta)*0.5
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw += (-gw*grad_w*self.eta
+ np.sqrt(2*self.gw*self.eta/beta)
* np.random.randn(*shape_w)
+ self.eta/beta*(1-self.gw/gw)/self.uw)
ub += (-gb*grad_b*self.eta
+ np.sqrt(2*self.gb*self.eta/beta)
* np.random.randn(*shape_b)
+ self.eta/beta*(1-self.gb/gb)/self.ub)
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
self.alpha_w += (uw*uw - self.eta/beta)*0.5
self.alpha_b += (ub*ub - self.eta/beta)*0.5
else:
# Refinement.
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw -= gw*grad_w*self.eta
ub -= gb*grad_b*self.eta
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw*0.5
db = gb*ub*0.5
return dw, db
GD by GD I read the paper... but I don't understand it at all, so I'll skip it...
AdaSecant I read the paper... but I don't understand the moving-average part, so I'll skip this one too...
AMSGrad RMSprop, AdaDelta, Adam, and Nadam all decay past gradients exponentially through the exponential moving average. So even when important gradient information flows in, it is quickly forgotten, and convergence to the optimal solution may fail. AMSGrad solves that problem.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{v}_t &= \textrm{max}\{ \hat{v}_{t-1}, v_t \} \\
\Delta w_t &= - \alpha_t \cfrac{m_t}{\sqrt{\hat{v}_t + \varepsilon}} \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Also, the paper suggests that computing $\alpha_t = \frac{\alpha}{\sqrt{t}}$ may be better, so the implementation adopts that.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha_0$ | alpha | 1e-3 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
$\hat{v}_0$ | vhatw, vhatb | 0 |

As for the initial value $\alpha_0$, no specific value was given in the paper, so I took it from Chainer.
optimizers.py
class AMSGrad(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
alpha_t = self.alpha / np.sqrt(t)
dw = - alpha_t * self.mw/np.sqrt(self.vhatw + self.eps)
db = - alpha_t * self.mb/np.sqrt(self.vhatb + self.eps)
return dw, db
AdaBound AMSGrad suppressed the problem of the learning rate becoming unnecessarily large, but did not solve the opposite problem of it becoming unnecessarily small. AdaBound and AMSBound combine Adam's traits ("learns quickly, but cannot suppress the generalization error well") with SGD's ("can ultimately make the generalization error small, but is slow to get there"): they were proposed as optimization methods that behave like Adam at the start of training and like SGD at the end. AdaBound is the improved version of Adam.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{\eta}_t &= \textrm{clip}\left( \cfrac{\alpha}{\sqrt{v_t}}, \left[ \eta_l(t), \eta_u(t) \right] \right) \\
\eta_t &= \cfrac{\hat{\eta}_t}{\sqrt{t}} \\
\Delta w_t &= - \eta_t m_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
Also, if $\eta$ is the learning rate as SGD (i.e. the learning rate at the end of training), the paper computes the bounds as

\begin{align}
\eta_l(t) &= \eta \left( 1 - \cfrac{1}{(1-\beta_2)t+1} \right) \\
\eta_u(t) &= \eta \left( 1 + \cfrac{1}{(1-\beta_2)t} \right)
\end{align}

so I (probably correctly) adopt them. In fact, the bounds should satisfy:

- $\eta_l$ is $0$ at $t = 0$ and tends to $\eta^-$ (the limit from below) as $t \to \infty$
- $\eta_u$ is $\infty$ at $t = 0$ and tends to $\eta^+$ (the limit from above) as $t \to \infty$
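Evaluating these bounds numerically with the defaults used below ($\eta = 0.1$, $\beta_2 = 0.999$) shows both closing in on $\eta$ (a quick check of my own, not from the paper):

```python
import numpy as np

eta, beta2 = 1e-1, 0.999
t = np.array([1, 10, 100, 10000])
etal = eta*(1 - 1/((1 - beta2)*t + 1))  # ~0.0001 ... 0.0909: rises towards eta
etau = eta*(1 + 1/((1 - beta2)*t))      # ~100.1  ... 0.11  : falls towards eta
print(etal, etau)
```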
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 1e-3 |
$\eta$ | eta | 1e-1 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
optimizers.py
class AdaBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
        etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
        etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
AMSBound As above, AMSGrad suppressed the problem of the learning rate becoming unnecessarily large, but did not solve the opposite problem of it becoming unnecessarily small. AdaBound and AMSBound behave like Adam at the start of training and like SGD at the end. AMSBound is the improved version of AMSGrad.
\begin{align}
g_t &= \nabla_{w_t} \mathcal{L} (w_t) \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\
\hat{v}_t &= \textrm{max}(\hat{v}_{t-1}, v_t) \\
\hat{\eta}_t &= \textrm{clip}\left( \cfrac{\alpha}{\sqrt{\hat{v}_t}}, \left[ \eta_l(t), \eta_u(t) \right] \right) \\
\eta_t &= \cfrac{\hat{\eta}_t}{\sqrt{t}} \\
\Delta w_t &= - \eta_t m_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
For the implementation, equations 2 and 3 are transformed as follows:
\begin{align}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
\Leftrightarrow
m_t &= m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
\Leftrightarrow
v_t &= v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1})
\end{align}
$\eta_l$ and $\eta_u$ are calculated in the same way as for AdaBound.
Hyperparameters (defaults taken from the implementation below):

symbol | Notation in code | value |
---|---|---|
$\alpha$ | alpha | 1e-3 |
$\eta$ | eta | 1e-1 |
$\beta_1$ | beta1 | 0.9 |
$\beta_2$ | beta2 | 0.999 |
$\varepsilon$ | eps | 1e-8 |
$m_0$ | mw, mb | 0 |
$v_0$ | vw, vb | 0 |
$\hat{v}_0$ | vhatw, vhatb | 0 |
optimizers.py
class AMSBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vhatw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vhatb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
Here is the complete code listing, together with a simple experiment. A saddle-point comparison may be added over time.
optimizers.py
import numpy as np
class Optimizer():
"""
A superclass inherited by the optimization method.
"""
def __init__(self, *args, **kwds):
pass
def update(self, *args, **kwds):
pass
class SGD(Optimizer):
def __init__(self, eta=1e-2, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
def update(self, grad_w, grad_b, *args, **kwds):
dw = -self.eta*grad_w
db = -self.eta*grad_b
return dw, db
class MSGD(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Hold the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, *args, **kwds):
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
class NAG(Optimizer):
def __init__(self, eta=1e-2, mu=0.9, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.mu = mu
#Holds the value of the previous step
self.dw = 0
self.db = 0
def update(self, grad_w, grad_b, w=0, b=0, df=None, *args, **kwds):
grad_w = df(w + self.mu*self.dw)
grad_b = df(b + self.mu*self.db)
dw = self.mu*self.dw - (1-self.mu)*self.eta*grad_w
db = self.mu*self.db - (1-self.mu)*self.eta*grad_b
        #Assigning the view instead of copying is fine because these
        #values will not be modified afterwards.
self.dw = dw
self.db = db
return dw, db
class AdaGrad(Optimizer):
def __init__(self, eta=1e-3, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
#Hold the value of the previous step
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.gw += grad_w*grad_w
self.gb += grad_b*grad_b
dw = -self.eta*grad_w/np.sqrt(self.gw)
db = -self.eta*grad_b/np.sqrt(self.gb)
return dw, db
class RMSprop(Optimizer):
def __init__(self, eta=1e-2, rho=0.99, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw+self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb+self.eps)
return dw, db
class AdaDelta(Optimizer):
def __init__(self, rho=0.95, eps=1e-6, *args, **kwds):
super().__init__(*args, **kwds)
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.vw = 0
self.vb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -grad_w*np.sqrt(self.uw+self.eps)/np.sqrt(self.vw+self.eps)
db = -grad_b*np.sqrt(self.ub+self.eps)/np.sqrt(self.vb+self.eps)
self.uw += (1-self.rho)*(dw**2 - self.uw)
self.ub += (1-self.rho)*(db**2 - self.ub)
return dw, db
class Adam(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
alpha_t = self.alpha*np.sqrt(1-self.beta2**t)/(1-self.beta1**t)
dw = -alpha_t*self.mw/(np.sqrt(self.vw+self.eps))
db = -alpha_t*self.mb/(np.sqrt(self.vb+self.eps))
return dw, db
class RMSpropGraves(Optimizer):
def __init__(self, eta=1e-4, rho=0.95, eps=1e-4, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.rho = rho
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self,grad_w, grad_b, *args, **kwds):
self.mw += (1-self.rho)*(grad_w - self.mw)
self.mb += (1-self.rho)*(grad_b - self.mb)
self.vw += (1-self.rho)*(grad_w**2 - self.vw)
self.vb += (1-self.rho)*(grad_b**2 - self.vb)
dw = -self.eta*grad_w/np.sqrt(self.vw - self.mw**2 + self.eps)
db = -self.eta*grad_b/np.sqrt(self.vb - self.mb**2 + self.eps)
return dw, db
class SMORMS3(Optimizer):
def __init__(self, eta=1e-3, eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.eta = eta
self.eps = eps
#Hold the value of the previous step
self.zetaw = 0
self.zetab = 0
self.sw = 0
self.sb = 0
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, *args, **kwds):
self.sw = 1 + (1 - self.zetaw*self.sw)
self.sb = 1 + (1 - self.zetab*self.sb)
rhow = 1/(1+self.sw)
rhob = 1/(1+self.sb)
self.mw += (1-rhow)*(grad_w - self.mw)
self.mb += (1-rhob)*(grad_b - self.mb)
self.vw += (1-rhow)*(grad_w**2 - self.vw)
self.vb += (1-rhob)*(grad_b**2 - self.vb)
self.zetaw = self.mw**2 / (self.vw + self.eps)
        self.zetab = self.mb**2 / (self.vb + self.eps)
dw = -grad_w*(np.minimum(self.eta, self.zetaw)
/np.sqrt(self.vw + self.eps))
db = -grad_b*(np.minimum(self.eta, self.zetab)
/np.sqrt(self.vb + self.eps))
return dw, db
class AdaMax(Optimizer):
def __init__(self, alpha=2e-3, beta1=0.9, beta2=0.999,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.uw = 0
self.ub = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.uw = np.maximum(self.beta2*self.uw, np.abs(grad_w))
self.ub = np.maximum(self.beta2*self.ub, np.abs(grad_b))
alpha_t = self.alpha/(1 - self.beta1**t)
dw = -alpha_t*self.mw/self.uw
db = -alpha_t*self.mb/self.ub
return dw, db
class Nadam(Optimizer):
def __init__(self, alpha=2e-3, mu=0.975, nu=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.mu = mu
self.nu = nu
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.mu)*(grad_w - self.mw)
self.mb += (1-self.mu)*(grad_b - self.mb)
self.vw += (1-self.nu)*(grad_w**2 - self.vw)
self.vb += (1-self.nu)*(grad_b**2 - self.vb)
mhatw = (self.mu*self.mw/(1-self.mu**(t+1))
+ (1-self.mu)*grad_w/(1-self.mu**t))
mhatb = (self.mu*self.mb/(1-self.mu**(t+1))
+ (1-self.mu)*grad_b/(1-self.mu**t))
vhatw = self.nu*self.vw/(1-self.nu**t)
vhatb = self.nu*self.vb/(1-self.nu**t)
dw = -self.alpha*mhatw/np.sqrt(vhatw + self.eps)
db = -self.alpha*mhatb/np.sqrt(vhatb + self.eps)
return dw, db
class Eve(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, beta3=0.999,
c=10, eps=1e-8, fstar=0, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.beta3 = beta3
self.c = c
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.fw = np.inf
self.fb = np.inf
self.fstar = fstar
self.dtilde_w = 0
self.dtilde_b = 0
def update(self, grad_w, grad_b, t=1, fw=1, fb=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
mhatw = self.mw/(1 - self.beta1**t)
mhatb = self.mb/(1 - self.beta1**t)
vhatw = self.vw/(1 - self.beta2**t)
vhatb = self.vb/(1 - self.beta2**t)
if t > 1:
d_w = (np.abs(fw-self.fstar)
/(np.minimum(fw, self.fw) - self.fstar))
d_b = (np.abs(fb-self.fstar)
/(np.minimum(fb, self.fb) - self.fstar))
dhat_w = np.clip(d_w, 1/self.c, self.c)
dhat_b = np.clip(d_b, 1/self.c, self.c)
self.dtilde_w += (1 - self.beta3)*(dhat_w - self.dtilde_w)
self.dtilde_b += (1 - self.beta3)*(dhat_b - self.dtilde_b)
else:
self.dtilde_w = 1
self.dtilde_b = 1
self.fw = fw
self.fb = fb
dw = -(self.alpha*mhatw
/(self.dtilde_w*(np.sqrt(vhatw) + self.eps)))
db = -(self.alpha*mhatb
/(self.dtilde_b*(np.sqrt(vhatb) + self.eps)))
return dw, db
class SantaE(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += self.uw*self.uw - self.eta/beta
self.alpha_b += self.ub*self.ub - self.eta/beta
uw = (self.eta/beta * (1 - self.gw/gw)/self.uw
+ np.sqrt(2*self.eta/beta * self.gw)
* np.random.randn(*shape_w))
ub = (self.eta/beta * (1 - self.gb/gb)/self.ub
+ np.sqrt(2*self.eta/beta * self.gb)
* np.random.randn(*shape_b))
else:
# Refinement.
uw = 0
ub = 0
uw += (1 - self.alpha_w)*self.uw - self.eta*gw*grad_w
ub += (1 - self.alpha_b)*self.ub - self.eta*gb*grad_b
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw
db = gb*ub
return dw, db
class SantaSSS(Optimizer):
def __init__(self, eta=1e-2, sigma=0.95, lambda_=1e-8,
anne_func=lambda t, n: t**n, anne_rate=0.5,
burnin=100, C=5,
*args, **kwds):
"""
Args:
eta: Learning rate
sigma: Maybe in other cases;
'rho' in RMSprop, AdaDelta, RMSpropGraves.
'rhow' or 'rhob' in SMORMS3.
'beta2' in Adam, Eve.
'nu' in Nadam.
To use calculation 'v'.
lambda_: Named 'eps'(ε) in other cases.
anne_func: Annealing function.
To use calculation 'beta' at each timestep.
Default is 'timestep'**'annealing rate'.
The calculated value should be towards infinity
as 't' increases.
anne_rate: Annealing rate.
To use calculation 'beta' at each timestep.
The second Argument of 'anne_func'.
            burnin: Switch between exploration and refinement.
This should be specified by users.
C: To calculate first 'alpha'.
"""
super().__init__(*args, **kwds)
self.eta = eta
self.sigma = sigma
self.lambda_ = lambda_
self.anne_func = anne_func
self.anne_rate = anne_rate
self.burnin = burnin
# Keep one step before and Initialize.
self.alpha_w = np.sqrt(eta)*C
self.alpha_b = np.sqrt(eta)*C
self.vw = 0
self.vb = 0
self.gw = 0
self.gb = 0
def update(self, grad_w, grad_b, t=1, N=200, *args, **kwds):
try:
shape_w = grad_w.shape
except:
shape_w = (1, )
try:
shape_b = grad_b.shape
except:
shape_b = (1, )
if t == 1:
# Initialize uw, ub.
self.uw = np.sqrt(self.eta)*np.random.randn(*shape_w)
self.ub = np.sqrt(self.eta)*np.random.randn(*shape_b)
self.vw = (self.sigma*self.vw
+ grad_w*grad_w * (1 - self.sigma) / N**2)
self.vb = (self.sigma*self.vb
+ grad_b*grad_b * (1 - self.sigma) / N**2)
gw = 1/np.sqrt(self.lambda_ + np.sqrt(self.vw))
gb = 1/np.sqrt(self.lambda_ + np.sqrt(self.vb))
dw = 0.5*gw*self.uw
db = 0.5*gb*self.ub
beta = self.anne_func(t, self.anne_rate)
if t < self.burnin:
# Exploration.
self.alpha_w += (self.uw*self.uw - self.eta/beta)*0.5
self.alpha_b += (self.ub*self.ub - self.eta/beta)*0.5
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw += (-gw*grad_w*self.eta
+ np.sqrt(2*self.gw*self.eta/beta)
* np.random.randn(*shape_w)
+ self.eta/beta*(1-self.gw/gw)/self.uw)
ub += (-gb*grad_b*self.eta
+ np.sqrt(2*self.gb*self.eta/beta)
* np.random.randn(*shape_b)
+ self.eta/beta*(1-self.gb/gb)/self.ub)
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
self.alpha_w += (uw*uw - self.eta/beta)*0.5
self.alpha_b += (ub*ub - self.eta/beta)*0.5
else:
# Refinement.
uw = np.exp(-0.5*self.alpha_w)*self.uw
ub = np.exp(-0.5*self.alpha_b)*self.ub
uw -= gw*grad_w*self.eta
ub -= gb*grad_b*self.eta
uw *= np.exp(-0.5*self.alpha_w)
ub *= np.exp(-0.5*self.alpha_b)
# Update values.
self.uw = uw
self.ub = ub
self.gw = gw
self.gb = gb
dw = gw*uw*0.5
db = gb*ub*0.5
return dw, db
class AMSGrad(Optimizer):
def __init__(self, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
*args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
alpha_t = self.alpha / np.sqrt(t)
dw = - alpha_t * self.mw/np.sqrt(self.vhatw + self.eps)
db = - alpha_t * self.mb/np.sqrt(self.vhatb + self.eps)
return dw, db
class AdaBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
class AMSBound(Optimizer):
def __init__(self, alpha=1e-3, eta=1e-1, beta1=0.9, beta2=0.999,
eps=1e-8, *args, **kwds):
super().__init__(*args, **kwds)
self.alpha = alpha
self.eta = eta
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
#Hold the value of the previous step
self.mw = 0
self.mb = 0
self.vw = 0
self.vb = 0
self.vhatw = 0
self.vhatb = 0
def update(self, grad_w, grad_b, t=1, *args, **kwds):
self.mw += (1-self.beta1)*(grad_w - self.mw)
self.mb += (1-self.beta1)*(grad_b - self.mb)
self.vw += (1-self.beta2)*(grad_w**2 - self.vw)
self.vb += (1-self.beta2)*(grad_b**2 - self.vb)
self.vhatw = np.maximum(self.vhatw, self.vw)
self.vhatb = np.maximum(self.vhatb, self.vb)
etal = self.eta*(1 - 1/((1-self.beta2)*t + 1))
etau = self.eta*(1 + 1/((1-self.beta2)*t + self.eps))
etahatw_t = np.clip(self.alpha/np.sqrt(self.vhatw), etal, etau)
etahatb_t = np.clip(self.alpha/np.sqrt(self.vhatb), etal, etau)
etaw_t = etahatw_t/np.sqrt(t)
etab_t = etahatb_t/np.sqrt(t)
dw = - etaw_t*self.mw
db = - etab_t*self.mb
return dw, db
opt_example.py
%matplotlib nbagg
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation
def f(x):
return ((1/4) * x**4 - (1/6) * x**3 - (17/8) * x**2 - (15/8) * x) * 4
def df(x):
return (x**3 - (1/2) * x**2 - (17/4) * x - 15/8) * 4
x = np.arange(-3, 3, 1e-2)
start_x = -3
start_y = f(start_x)
exact = 2.5
epoch = 100
seed=32
np.random.seed(seed=seed)
# epoch=100 is seed=32
opt_dict = {
"SDG": SGD(),
"MSGD": MSGD(),
"NAG": NAG(mu=0.95),
"AdaGrad": AdaGrad(eta=4e-1),
"RMSprop": RMSprop(eta=4e-2),
"AdaDelta": AdaDelta(eps=1e-2),
"Adam": Adam(alpha=4e-1),
"RMSpropGraves": RMSpropGraves(eta=4e-2),
"SMORMS3": SMORMS3(eta=4e-2),
"AdaMax": AdaMax(),
"Nadam": Nadam(),
"Eve": Eve(alpha=4e-1),
"SantaE": SantaE(eta=4e-4, burnin=0.5*epoch),
"SantaSSS": SantaSSS(eta=4e-4, burnin=0.5*epoch),
"AMSGrad": AMSGrad(alpha=4e-1),
"AdaBound": AdaBound(alpha=4e-1),
"AMSBound": AMSBound(alpha=4e-1),
}
current_x = np.full(len(opt_dict), start_x, dtype=float)
current_y = np.full(len(opt_dict), start_y, dtype=float)
err_list = np.zeros((epoch, len(opt_dict)), dtype=float)
cmap = plt.get_cmap("rainbow")
coloring = [cmap(i) for i in np.linspace(0, 1, len(opt_dict))]
fig, ax = plt.subplots(1)
ax.set_position([0.1, 0.1, 0.6, 0.8])
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.grid()
images = []
fixed_im, = ax.plot(x, f(x))
for i in range(1, epoch+1):
im_buf = []
for j, opt in enumerate(opt_dict):
im, = ax.plot(current_x[j], current_y[j],
marker="o", color=coloring[j])
err_list[i-1, j] = 0.5 * (current_x[j] - exact)**2
dx, _ = opt_dict[opt].update(df(current_x[j]), 1,
t=i, N=epoch,
w=current_x[j],
df=df, fw=f(current_x[j]))
current_x[j] += dx
current_y[j] = f(current_x[j])
im_buf.append(im)
images.append([fixed_im] + im_buf)
for j, opt in enumerate(opt_dict):
im, = ax.plot(current_x[j], current_y[j],
marker="o", color=coloring[j],
label=opt)
im_buf.append(im)
images.append([fixed_im] + im_buf)
fig.suptitle("Optimiser comparison")
fig.legend(bbox_to_anchor=(0.96, 0.9))
ani = animation.ArtistAnimation(fig, images, interval=100)
fig2, ax2 = plt.subplots(1)
fig2.suptitle("Optimiser comparison of errors")
ax2.set_position([0.1, 0.1, 0.6, 0.8])
ax2.set_xlabel("x")
ax2.set_ylabel("errors")
ax2.set_yscale("log")
ax2.grid()
for j, opt in enumerate(opt_dict):
ax2.plot(err_list[:, j], color=coloring[j], label=opt)
fig2.legend(bbox_to_anchor=(0.96, 0.9))
It took a lot of time ... If you find any mistakes, please let me know.
References:

- Deep learning optimization algorithm
- [2020 definitive edition] Super easy-to-understand optimization algorithm -Adam and Newton's method from loss function-
- [Plot different colors with matplotlib](https://www.it-swarm.dev/ja/python/ Plot different colors with matplotlib /1072229568/)
- Adjust the position of the matplotlib legend
- matplotlib reverse lookup memo (frame edition)
- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Coding Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- Introduction to Deep Learning ~ Backpropagation ~
- List of activation functions (2020)
- Gradient descent method list (2020)
- Thorough understanding of im2col
- Complete understanding of numpy.pad function