I'm reading the masterpiece **"Deep Learning from Scratch"**. This is a memo of Chapter 6. To run the code, download the entire repository from GitHub and use Jupyter Notebook in ch06.

To actually try out the optimization methods, we use ch06/optimizer_compare_mnist.py with some modifications and additions. The network is a 4-layer network with 100 units per layer that classifies MNIST. In the `optimizer key setting` block of the code below, comment out the optimizers you do not want to use before running.
```python
import os
import sys
sys.path.append(os.pardir)  # Settings for importing files in the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.util import smooth_curve  # smooth_curve: a function that smooths the transition of the loss values
from common.multi_layer_net import MultiLayerNet
from common.optimizer import *

# Read MNIST data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# Initial settings
train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2001

# optimizer key setting
optimizers = {}
optimizers['SGD'] = SGD()
optimizers['Momentum'] = Momentum()
optimizers['Nesterov'] = Nesterov()
optimizers['AdaGrad'] = AdaGrad()
optimizers['RMSprop'] = RMSprop()
optimizers['Adam'] = Adam()

# Set up a network and a train_loss list for each optimizer key
networks = {}
train_loss = {}
for key in optimizers.keys():
    networks[key] = MultiLayerNet(
        input_size=784, hidden_size_list=[100, 100, 100, 100],
        output_size=10)
    train_loss[key] = []

# Learning loop
for i in range(max_iterations):
    # Extract a mini-batch of data
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Update the gradients and record the loss for each optimizer key
    for key in optimizers.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizers[key].update(networks[key].params, grads)
        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)

    # Display the loss (every 500 iterations)
    if i % 500 == 0:
        print("===========" + "iteration:" + str(i) + "===========")
        for key in optimizers.keys():
            loss = networks[key].loss(x_batch, t_batch)
            print(key + ":" + str(loss))

# Draw the graph
fig = plt.figure(figsize=(8, 6))  # Specify the graph size
markers = {"SGD": "o", "Momentum": "x", "Nesterov": "^", "AdaGrad": "s", "RMSprop": "*", "Adam": "D"}
x = np.arange(max_iterations)
for key in optimizers.keys():
    plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
plt.xlabel("iterations")
plt.ylabel("loss")
plt.ylim(0, 1)
plt.legend()
plt.show()
```
3.SGD
`SGD`, which was used up through Chapter 5, is the baseline optimization method: the parameters are simply moved in the direction opposite to the gradient, $W \leftarrow W - \eta \frac{\partial L}{\partial W}$. Looking at the implementation of `SGD` in common/optimizer.py:
```python
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
```
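All of the optimizer classes in this chapter share the same `update(params, grads)` interface, which is why they can be swapped freely in the experiment above. A minimal usage sketch, assuming a hypothetical one-parameter loss $L(x) = x^2$ (not an example from the book):

```python
import numpy as np

# Sketch: minimize L(x) = x^2 with the SGD class above.
# The dictionaries mimic network.params and network.gradient(...).
params = {'x': np.array([5.0])}
optimizer = SGD(lr=0.1)

for i in range(100):
    grads = {'x': 2 * params['x']}  # dL/dx = 2x
    optimizer.update(params, grads)

print(params['x'])  # approaches 0
```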
4.Momentum
`SGD` takes time to optimize, especially in the early stages, because the size of each update depends only on the current gradient. That's where `Momentum` comes in.

`Momentum` is a method that gradually increases the size of the update as long as the direction of the gradient does not change: $v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}$, then $W \leftarrow W + v$. The image is a ball rolling down the slope of the ground, where $\alpha = 0.9$ acts like friction and air resistance.

To make the image more concrete, suppose the gradient $\frac{\partial L}{\partial W}$ is the same for four consecutive updates (say $\eta \frac{\partial L}{\partial W} = 1.0$). Then v becomes -1.0, -1.9, -2.71, -3.439 in turn, and you can see that the size of the update gradually grows. Looking at the implementation of `Momentum` in common/optimizer.py:
```python
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```
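The v sequence quoted above can be checked directly. A quick sketch, assuming a constant gradient with $\eta \frac{\partial L}{\partial W} = 1.0$:

```python
# Reproduce the sequence -1.0, -1.9, -2.71, -3.439 from the text,
# assuming momentum = 0.9 and a constant lr * grad of 1.0.
v, momentum = 0.0, 0.9
for i in range(4):
    v = momentum * v - 1.0
    print(v)  # -1.0, -1.9, -2.71, -3.439
```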
5.Nesterov
Momentum is prone to overshooting when the direction of the gradient reverses after the update size has grown. Therefore, `Nesterov` (also called Nesterov's momentum), a partial modification of Momentum, is introduced.

`Nesterov` computes the gradient not at the current position but at the look-ahead position one step ahead, after the gradient update. Of course, the exact position after the update is not yet known, so it is approximated by finding v from the current gradient. This can be expected to suppress overshooting. Looking at the implementation in common/optimizer.py:
```python
class Nesterov:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] *= self.momentum
            self.v[key] -= self.lr * grads[key]
            params[key] += self.momentum * self.momentum * self.v[key]
            params[key] -= (1 + self.momentum) * self.lr * grads[key]
```
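Written out as formulas, the update performed by this code is (with $\alpha$ = momentum and $\eta$ = lr):

$$v \leftarrow \alpha v - \eta \frac{\partial L}{\partial W}, \qquad W \leftarrow W + \alpha^2 v - (1 + \alpha)\,\eta \frac{\partial L}{\partial W}$$

This is a commonly used rearrangement of Nesterov's update that lets the gradient be evaluated at the current parameters rather than at the look-ahead position.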
Now let's compare `SGD`, `Momentum`, and `Nesterov`.

Compared to `SGD`, `Momentum` and `Nesterov` reduce the loss much faster in the early stages and reach a clearly lower final loss. `Nesterov` goes one step further than `Momentum`: the fluctuation of its loss also seems slightly smaller.
6.AdaGrad
`AdaGrad` introduces two important ideas.

The first is an **adaptive learning rate**: with a huge number of parameters, each parameter should be optimized at its own rate rather than all at once.

The second is **decay of the learning rate**: keep the learning rate high at the beginning of learning and lower it as learning progresses, so that learning proceeds efficiently.

`AdaGrad` accumulates the sum of squared gradients in h and corrects the learning rate by multiplying it by $\frac{1}{\sqrt{h} + \epsilon}$ when updating. In other words, parameters that have been updated heavily have their learning rate gradually reduced. By the way, $\epsilon$ is a tiny number that prevents division by zero. Looking at the implementation in common/optimizer.py:
```python
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```
7.RMSprop
`RMSprop` is an improved version of `AdaGrad` that stores in h the **exponential moving average** of the squared gradient (gradually forgetting past results while incorporating new ones), and corrects the learning rate by multiplying it by $\frac{1}{\sqrt{h} + \epsilon}$ when updating. Looking at the implementation in common/optimizer.py:
```python
class RMSprop:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```
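The difference from `AdaGrad` is easy to see with a constant gradient: AdaGrad's h grows without bound, so its effective step size shrinks toward zero, while RMSprop's h converges, so its step size levels off. A minimal sketch with illustrative numbers (not from the book):

```python
import numpy as np

grad, decay = 1.0, 0.99
h_ada, h_rms = 0.0, 0.0

for i in range(1, 1001):
    h_ada += grad * grad                               # AdaGrad: unbounded accumulation
    h_rms = decay * h_rms + (1 - decay) * grad * grad  # RMSprop: exponential moving average
    if i in (10, 100, 1000):
        print(i, 1 / (np.sqrt(h_ada) + 1e-7), 1 / (np.sqrt(h_rms) + 1e-7))
# The AdaGrad factor keeps shrinking (about 0.32 -> 0.10 -> 0.03),
# while the RMSprop factor settles near 1.0 as h approaches grad**2.
```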
8.Adam
`Adam` comes from the idea of taking the best of both `Momentum` and `AdaGrad`.

m is like an exponential-moving-average version of `Momentum`'s velocity, and v is `AdaGrad`'s h itself. In addition, m and v are scaled up while the iteration count is small, and this correction weakens as the iterations increase. The implementation rearranges the formulas slightly, as shown below. Looking at the implementation in common/optimizer.py:
```python
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
```
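Note that `self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])` is just a rearranged exponential moving average, and `lr_t` folds the bias correction of m and v into the learning rate. Written out:

$$m \leftarrow \beta_1 m + (1 - \beta_1) \frac{\partial L}{\partial W}, \qquad v \leftarrow \beta_2 v + (1 - \beta_2) \left( \frac{\partial L}{\partial W} \right)^2, \qquad \eta_t = \eta \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}$$

Because $1 - \beta^t$ is small for small t, the correction effectively scales m and v up early in training and fades as the iteration count t grows.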
Now let's compare `AdaGrad`, `RMSprop`, and `Adam`.

![Screenshot 2020-05-05 19.06.24.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/209705/4ba770be-3cfe-15a1-16bb-030899994296.png)

No optimization method gives the best results on every task. This time, `AdaGrad` gave the best results.

For `RMSprop`, the learning rate correction by the exponential moving average of the squared gradient was probably excessive for this task: the loss oscillated strongly, and the final loss also ended up high.

`Adam` is said to show consistently stable performance and is often recommended as the optimizer to try first. It is not the best on this task, but it shows a solid, steady loss reduction.
9.Initial values of the weights
If the initial weights are all zero, every weight is updated identically and the network's expressiveness is lost. What kind of initialization works well depends on the type of activation function.

For activation functions such as sigmoid and tanh, which are symmetric and roughly linear near the center, a Gaussian distribution with standard deviation $\sqrt{\frac{1}{n}}$ (n being the number of nodes in the previous layer), called Xavier initialization, is said to be optimal.

When using ReLU, a Gaussian distribution with standard deviation $\sqrt{\frac{2}{n}}$, called He initialization, is said to be optimal.
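A minimal sketch of the two initializations, assuming n is the number of input nodes of the layer (illustrative code, not from the book's repository):

```python
import numpy as np

n_in, n_out = 100, 100

# Xavier initialization: std = sqrt(1/n), suited to sigmoid / tanh
w_xavier = np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

# He initialization: std = sqrt(2/n), suited to ReLU
w_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```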
10.Dropout
`Dropout` is a method that suppresses overfitting, even in highly expressive networks, by randomly disconnecting neurons at each iteration during training. Looking at the implementation code:
```python
class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    # Forward propagation
    def forward(self, x, train_flg=True):
        if train_flg:
            # During training, create a mask that decides which neurons stay
            # connected, and multiply the forward-propagating signal by it
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # At inference time, do not mask; instead multiply the whole
            # signal by (1 - dropout_ratio)
            return x * (1.0 - self.dropout_ratio)

    # Mask the backpropagating signal
    def backward(self, dout):
        return dout * self.mask
```
At each training step, a uniform random number per element is compared against the threshold (dropout_ratio) to create a mask (True where the neuron stays connected, False where it is dropped). The forward-propagating signal is then masked (x * self.mask), and the backpropagating signal is masked the same way. At inference time, nothing is masked; the entire signal is simply multiplied by (1 - dropout_ratio) to adjust its overall magnitude.
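A quick sketch of the behavior with illustrative values (dropout_ratio = 0.5 on an all-ones input):

```python
import numpy as np

np.random.seed(0)  # fixed seed so the mask is reproducible
dropout = Dropout(dropout_ratio=0.5)
x = np.ones((2, 4))

# Training: roughly half the elements are zeroed by the random mask
print(dropout.forward(x, train_flg=True))

# Inference: no masking; the whole signal is scaled by (1 - dropout_ratio)
print(dropout.forward(x, train_flg=False))  # every element becomes 0.5
```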
11.Batch Normalization
Batch Normalization is a method announced in 2015 that normalizes each mini-batch so that it has a mean of 0 and a variance of 1. This improves the convergence speed of learning, reduces the need for `Dropout`, and makes training robust to the choice of initial weights.
12.Weight decay
Weight decay is a method of suppressing overfitting by penalizing large weights during learning.
When the weights are W, adding $\frac{1}{2}\lambda W^2$ to the loss function suppresses the growth of W; this method is called **L2 regularization**. Increasing $\lambda$ increases the penalty. ![Screenshot 2020-05-06 14.31.12.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/209705/ddbf678e-4b99-e2f8-a182-76a1329147c5.png)

By the way, since $\frac{1}{2}\lambda W^2$ is added to the loss function, $\lambda W$ is added to the gradient of each weight during backpropagation.
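A minimal sketch of how this enters the loss and the gradients, assuming the weight keys are named 'W1', 'W2', ... as in the book's networks (the book's MultiLayerNet handles this internally via its weight_decay_lambda argument):

```python
import numpy as np

lam = 0.1  # weight decay strength (lambda)

def loss_with_weight_decay(data_loss, params):
    # Add (1/2) * lambda * W^2 for every weight matrix to the loss
    penalty = sum(0.5 * lam * np.sum(params[k] ** 2)
                  for k in params if k.startswith('W'))
    return data_loss + penalty

def add_weight_decay_grads(grads, params):
    # Correspondingly, lambda * W is added to each weight gradient
    for k in params:
        if k.startswith('W'):
            grads[k] += lam * params[k]
    return grads
```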