I'm reading the masterpiece **"Deep Learning from Scratch"**. This is a memo of Chapter 6. To run the code, download the entire repository from GitHub and use Jupyter Notebook in ch06.

To actually try out the optimization methods, we use ch06/optimizer_compare_mnist.py with some modifications and additions. The network is a 4-layer network with 100 units per layer that classifies MNIST. In the `optimizer key setting` block of the code below, comment out the optimizers you do not want to use before running.
```python
import os
import sys
sys.path.append(os.pardir)  # Settings for importing files in the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.util import smooth_curve  # smooth_curve: a function that smooths the transition of the loss values
from common.multi_layer_net import MultiLayerNet
from common.optimizer import *

# Read MNIST data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# Initial settings
train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2001

# optimizer key setting
optimizers = {}
optimizers['SGD'] = SGD()
optimizers['Momentum'] = Momentum()
optimizers['Nesterov'] = Nesterov()
optimizers['AdaGrad'] = AdaGrad()
optimizers['RMSprop'] = RMSprop()
optimizers['Adam'] = Adam()

# Set up a network and a train_loss list for each optimizer key
networks = {}
train_loss = {}
for key in optimizers.keys():
    networks[key] = MultiLayerNet(
        input_size=784, hidden_size_list=[100, 100, 100, 100],
        output_size=10)
    train_loss[key] = []

# Learning loop
for i in range(max_iterations):
    # Extract a mini-batch of data
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Update the gradients and record the loss for each optimizer key
    for key in optimizers.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizers[key].update(networks[key].params, grads)
        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)

    # Display the loss (every 500 iterations)
    if i % 500 == 0:
        print("===========" + "iteration:" + str(i) + "===========")
        for key in optimizers.keys():
            loss = networks[key].loss(x_batch, t_batch)
            print(key + ":" + str(loss))

# Draw the graph
fig = plt.figure(figsize=(8, 6))  # Specify the graph size
markers = {"SGD": "o", "Momentum": "x", "Nesterov": "^", "AdaGrad": "s", "RMSprop": "*", "Adam": "D"}
x = np.arange(max_iterations)
for key in optimizers.keys():
    plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
plt.xlabel("iterations")
plt.ylabel("loss")
plt.ylim(0, 1)
plt.legend()
plt.show()
```
3.SGD
`SGD`, which was used up through Chapter 5, is the baseline optimization method: the parameters are simply moved in the direction opposite to the gradient, $W \leftarrow W - \eta \frac{\partial L}{\partial W}$. Looking at the implementation of `SGD` in common/optimizer.py:
```python
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
```
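All of the optimizer classes in this chapter share the same `update(params, grads)` interface, which is why they can be swapped freely in the experiment above. A minimal usage sketch, assuming a hypothetical one-parameter loss $L(x) = x^2$ (not an example from the book):

```python
import numpy as np

# Sketch: minimize L(x) = x^2 with the SGD class above.
# The dictionaries mimic network.params and network.gradient(...).
params = {'x': np.array([5.0])}
optimizer = SGD(lr=0.1)

for i in range(100):
    grads = {'x': 2 * params['x']}  # dL/dx = 2x
    optimizer.update(params, grads)

print(params['x'])  # approaches 0
```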
4.Momentum
`SGD` takes time to optimize, especially in the early stages, because the size of each update depends only on the current gradient. That's where `Momentum` comes in.

`Momentum` is a method that gradually increases the size of the update as long as the direction of the gradient does not change: $v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}$, then $W \leftarrow W + v$. The image is a ball rolling down the slope of the ground, where $\alpha = 0.9$ acts like friction and air resistance.

To make the image more concrete, suppose the gradient $\frac{\partial L}{\partial W}$ is the same for four consecutive updates (say $\eta \frac{\partial L}{\partial W} = 1.0$). Then v becomes -1.0, -1.9, -2.71, -3.439 in turn, and you can see that the size of the update gradually grows. Looking at the implementation of `Momentum` in common/optimizer.py:
```python
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```
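The v sequence quoted above can be checked directly. A quick sketch, assuming a constant gradient with $\eta \frac{\partial L}{\partial W} = 1.0$:

```python
# Reproduce the sequence -1.0, -1.9, -2.71, -3.439 from the text,
# assuming momentum = 0.9 and a constant lr * grad of 1.0.
v, momentum = 0.0, 0.9
for i in range(4):
    v = momentum * v - 1.0
    print(v)  # -1.0, -1.9, -2.71, -3.439
```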
5.Nesterov
Momentum is prone to overshooting when the direction of the gradient reverses after the update size has grown. Therefore, `Nesterov` (also called Nesterov's momentum), a partial modification of Momentum, is introduced.

`Nesterov` computes the gradient not at the current position but at the look-ahead position one step ahead, after the gradient update. Of course, the exact position after the update is not yet known, so it is approximated by finding v from the current gradient. This can be expected to suppress overshooting. Looking at the implementation in common/optimizer.py:
```python
class Nesterov:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] *= self.momentum
            self.v[key] -= self.lr * grads[key]
            params[key] += self.momentum * self.momentum * self.v[key]
            params[key] -= (1 + self.momentum) * self.lr * grads[key]
```
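Written out as formulas, the update performed by this code is (with $\alpha$ = momentum and $\eta$ = lr):

$$v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}, \qquad W \leftarrow W + \alpha^2 v - (1 + \alpha)\,\eta \frac{\partial L}{\partial W}$$

This is a commonly used rearrangement of Nesterov's update that lets the gradient be evaluated at the current parameters rather than at the look-ahead position.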
Now let's compare `SGD`, `Momentum`, and `Nesterov`.

Compared to `SGD`, `Momentum` and `Nesterov` reduce the loss much faster in the early stages and reach a clearly lower final loss. `Nesterov` goes one step further than `Momentum`: the fluctuation of its loss also seems slightly smaller.
6.AdaGrad
`AdaGrad` introduces two important ideas.

The first is an **adaptive learning rate**: with a huge number of parameters, each parameter should be optimized at its own rate rather than all at once.

The second is **decay of the learning rate**: keep the learning rate high at the beginning of learning and lower it as learning progresses, so that learning proceeds efficiently.

`AdaGrad` accumulates the sum of squared gradients in h and corrects the learning rate by multiplying it by $\frac{1}{\sqrt{h} + \epsilon}$ when updating. In other words, parameters that have been updated heavily have their learning rate gradually reduced. By the way, $\epsilon$ is a tiny number that prevents division by zero. Looking at the implementation in common/optimizer.py:
```python
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```
7.RMSprop
`RMSprop` is an improved version of `AdaGrad` that stores in h the **exponential moving average** of the squared gradient (gradually forgetting past results while incorporating new ones), and corrects the learning rate by multiplying it by $\frac{1}{\sqrt{h} + \epsilon}$ when updating. Looking at the implementation in common/optimizer.py:
```python
class RMSprop:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```
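The difference from `AdaGrad` is easy to see with a constant gradient: AdaGrad's h grows without bound, so its effective step size shrinks toward zero, while RMSprop's h converges, so its step size levels off. A minimal sketch with illustrative numbers (not from the book):

```python
import numpy as np

grad, decay = 1.0, 0.99
h_ada, h_rms = 0.0, 0.0

for i in range(1, 1001):
    h_ada += grad * grad                               # AdaGrad: unbounded accumulation
    h_rms = decay * h_rms + (1 - decay) * grad * grad  # RMSprop: exponential moving average
    if i in (10, 100, 1000):
        print(i, 1 / (np.sqrt(h_ada) + 1e-7), 1 / (np.sqrt(h_rms) + 1e-7))
# The AdaGrad factor keeps shrinking (about 0.32 -> 0.10 -> 0.03),
# while the RMSprop factor settles near 1.0 as h approaches grad**2.
```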
8.Adam
`Adam` comes from the idea of taking the best of both `Momentum` and `AdaGrad`.

m is like an exponential-moving-average version of `Momentum`'s velocity, and v is `AdaGrad`'s h itself. In addition, m and v are scaled up while the iteration count is small, and this correction weakens as the iterations increase. The implementation rearranges the formulas slightly, as shown below. Looking at the implementation in common/optimizer.py:
```python
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
```
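Note that `self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])` is just a rearranged exponential moving average, and `lr_t` folds the bias correction of m and v into the learning rate. Written out:

$$m \leftarrow \beta_1 m + (1 - \beta_1) \frac{\partial L}{\partial W}, \qquad v \leftarrow \beta_2 v + (1 - \beta_2) \left( \frac{\partial L}{\partial W} \right)^2, \qquad \eta_t = \eta \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}$$

Because $1 - \beta^t$ is small for small t, the correction effectively scales m and v up early in training and fades as the iteration count t grows.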
Now let's compare `AdaGrad`, `RMSprop`, and `Adam`.

![Screenshot 2020-05-05 19.06.24.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/209705/4ba770be-3cfe-15a1-16bb-030899994296.png)

No optimization method gives the best results on every task. This time, `AdaGrad` gave the best results.

For `RMSprop`, the learning rate correction by the exponential moving average of the squared gradient was probably excessive for this task: the loss oscillated strongly, and the final loss also ended up high.

`Adam` is said to show consistently stable performance and is often recommended as the optimizer to try first. It is not the best on this task, but it shows a solid, steady loss reduction.
9.Initial values of the weights
If the initial weights are all zero, every weight is updated identically and the network's expressiveness is lost. What kind of initialization works well depends on the type of activation function.

For activation functions such as sigmoid and tanh, which are symmetric and roughly linear near the center, a Gaussian distribution with standard deviation $\sqrt{\frac{1}{n}}$ (n being the number of nodes in the previous layer), called Xavier initialization, is said to be optimal.

When using ReLU, a Gaussian distribution with standard deviation $\sqrt{\frac{2}{n}}$, called He initialization, is said to be optimal.
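A minimal sketch of the two initializations, assuming n is the number of input nodes of the layer (illustrative code, not from the book's repository):

```python
import numpy as np

n_in, n_out = 100, 100

# Xavier initialization: std = sqrt(1/n), suited to sigmoid / tanh
w_xavier = np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

# He initialization: std = sqrt(2/n), suited to ReLU
w_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```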
10.Dropout
`Dropout` is a method that suppresses overfitting, even in highly expressive networks, by randomly disconnecting neurons at each iteration during training. Looking at the implementation code:
```python
class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    # Forward propagation
    def forward(self, x, train_flg=True):
        if train_flg:
            # During training, create a mask that decides which neurons stay
            # connected, and multiply the forward-propagating signal by it
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # At inference time, do not mask; instead multiply the whole
            # signal by (1 - dropout_ratio)
            return x * (1.0 - self.dropout_ratio)

    # Mask the backpropagating signal
    def backward(self, dout):
        return dout * self.mask
```
At each training step, a uniform random number per element is compared against the threshold (dropout_ratio) to create a mask (True where the neuron stays connected, False where it is dropped). The forward-propagating signal is then masked (x * self.mask), and the backpropagating signal is masked the same way. At inference time, nothing is masked; the entire signal is simply multiplied by (1 - dropout_ratio) to adjust its overall magnitude.
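A quick sketch of the behavior with illustrative values (dropout_ratio = 0.5 on an all-ones input):

```python
import numpy as np

np.random.seed(0)  # fixed seed so the mask is reproducible
dropout = Dropout(dropout_ratio=0.5)
x = np.ones((2, 4))

# Training: roughly half the elements are zeroed by the random mask
print(dropout.forward(x, train_flg=True))

# Inference: no masking; the whole signal is scaled by (1 - dropout_ratio)
print(dropout.forward(x, train_flg=False))  # every element becomes 0.5
```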
11.Batch Normalization
Batch Normalization is a method announced in 2015 that normalizes each mini-batch so that it has a mean of 0 and a variance of 1. This improves the convergence speed of learning, reduces the need for `Dropout`, and makes training robust to the choice of initial weights.
12.Weight decay
Weight decay is a method of suppressing overfitting by penalizing large weights during learning.
When the weights are W, adding $\frac{1}{2}\lambda W^2$ to the loss function suppresses the growth of W; this method is called **L2 regularization**. Increasing $\lambda$ increases the penalty. ![Screenshot 2020-05-06 14.31.12.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/209705/ddbf678e-4b99-e2f8-a182-76a1329147c5.png)

By the way, since $\frac{1}{2}\lambda W^2$ is added to the loss function, $\lambda W$ is added to the gradient of each weight during backpropagation.
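A minimal sketch of how this enters the loss and the gradients, assuming the weight keys are named 'W1', 'W2', ... as in the book's networks (the book's MultiLayerNet handles this internally via its weight_decay_lambda argument):

```python
import numpy as np

lam = 0.1  # weight decay strength (lambda)

def loss_with_weight_decay(data_loss, params):
    # Add (1/2) * lambda * W^2 for every weight matrix to the loss
    penalty = sum(0.5 * lam * np.sum(params[k] ** 2)
                  for k in params if k.startswith('W'))
    return data_loss + penalty

def add_weight_decay_grads(grads, params):
    # Correspondingly, lambda * W is added to each weight gradient
    for k in params:
        if k.startswith('W'):
            grads[k] += lam * params[k]
    return grads
```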