Don't call it Deep Learning: this chapter (chapter 6) deals with things that are important in machine learning in general.
The purpose of neural network training is to find the parameters that make the value of the loss function as small as possible. Solving this kind of problem is called "optimization".
W \leftarrow W - \eta\frac{\partial L}{\partial W}
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
network = TwoLayerNet(...)
# optimizer: the object that performs the optimization
optimizer = SGD()

for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch(...)
    grads = network.gradient(x_batch, t_batch)
    params = network.params
    optimizer.update(params, grads)
    ...
The parameters are updated by the optimizer: all you have to do is pass the parameter and gradient information to it.
Implementing the optimization as a separate class like this makes the functionality easier to modularize.
For example, when implementing Momentum (the next method introduced), give it the same update(params, grads) method. Then you can switch from SGD to Momentum just by changing optimizer = SGD() to optimizer = Momentum().
The drawback of SGD is that if the shape of the function is not isotropic (i.e., it is elongated in some direction), the search follows an inefficient path.
The root cause of this drawback is that the direction of the gradient points somewhere other than toward the true minimum.
To improve on these shortcomings of SGD, three alternative methods are introduced.
・Momentum
・AdaGrad
・Adam
Momentum
Momentum means "momentum" in the physics sense. The image is a ball rolling on the ground: when it receives no force, it gradually decelerates due to friction and air resistance.
v \leftarrow \alpha v - \eta\frac{\partial L}{\partial W}\\
W \leftarrow W + v
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
x-axis direction: the force received is small, but since it always points in the same direction, the ball steadily accelerates that way. y-axis direction: the force received is large, but positive and negative forces alternate and cancel each other out, so the velocity along the y-axis never builds up. → Compared with SGD, you approach along the x-axis faster and the zigzag movement is reduced.
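As a rough illustration of this behavior, here is a minimal sketch (my own, not from the book; it assumes the SGD and Momentum classes above are defined) comparing the two optimizers on the elongated bowl f(x, y) = x²/20 + y², whose minimum is at the origin:

import numpy as np

# gradient of f(x, y) = x**2 / 20 + y**2
def grad_f(x, y):
    return x / 10.0, 2.0 * y

def run(optimizer, steps=30):
    params = {'x': -7.0, 'y': 2.0}
    for _ in range(steps):
        gx, gy = grad_f(params['x'], params['y'])
        optimizer.update(params, {'x': gx, 'y': gy})
    return params['x'], params['y']

print(run(SGD(lr=0.9)))       # y zigzags while x only creeps toward 0
print(run(Momentum(lr=0.1)))  # velocity builds up along x, zigzag is damped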
AdaGrad
This method reduces the learning rate as learning progresses (learning rate decay), adapting it for each parameter. "Ada" comes from Adaptive.
h \leftarrow h + \frac{\partial L}{\partial W}\odot\frac{\partial L}{\partial W}\\
W \leftarrow W - \eta \frac{1}{\sqrt{h}} \frac{\partial L}{\partial W}
Multiplying by 1/√h means that elements that have moved a lot (been updated heavily) so far get a smaller effective learning rate.
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
The small value 1e-7 is added to avoid dividing by 0 when self.h[key] contains 0.
RMSProp
AdaGrad accumulates all past gradients as a sum of squares. Therefore, if learning continues indefinitely, the update amount approaches 0 and the parameters stop being updated at all.
RMSProp is a method that solves this problem. Instead of adding up all past gradients uniformly, it gradually forgets old gradients and weights new gradient information more heavily. This is called an "exponential moving average".
The book gives no formula for it, but working backward from the code it looks like the following.
\begin{align}
&\text{decay rate (initial value) } d = 0.99\\
h &\leftarrow d \cdot h + (1 - d)\frac{\partial L}{\partial W}\odot\frac{\partial L}{\partial W}\\
W &\leftarrow W - \eta \frac{1}{\sqrt{h}} \frac{\partial L}{\partial W}
\end{align}
class RMSprop:
    """RMSprop"""
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
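A small numerical sketch (my own, under the simplifying assumption of a constant gradient) makes the difference concrete: AdaGrad's h grows without bound, so its effective step 1/√h shrinks toward 0, while RMSprop's h converges and the step stays usable:

import numpy as np

g = 1.0  # constant gradient, for illustration only
h_ada, h_rms, d = 0.0, 0.0, 0.99
for t in range(10000):
    h_ada += g * g                       # AdaGrad: unbounded sum of squares
    h_rms = d * h_rms + (1 - d) * g * g  # RMSprop: exponential moving average

print(1.0 / np.sqrt(h_ada))  # -> 0.01, and it keeps shrinking as t grows
print(1.0 / np.sqrt(h_rms))  # -> ~1.0, stays bounded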
Adam
Momentum: movement according to the laws of physics
AdaGrad: adaptive adjustment of the update step for each parameter element
Adam = Momentum + AdaGrad + "bias correction" of the hyperparameters
As with RMSProp, there are no formulas in the book; the ones below came up when searching, so refer to them.
Quote: http://postd.cc/optimizing-gradient-descent/#adam
Adam (Adaptive Moment Estimation) uses yet another method to compute and adapt the learning rate for each parameter. Adadelta and RMSprop accumulate an exponentially decaying average of the past squared gradients v_t. In addition to this, Adam also keeps an exponentially decaying average of the past gradients m_t, similar to Momentum.
\begin{align}
&\text{defaults: } \beta_1 = 0.9,\ \beta_2 = 0.999,\ \epsilon = 10^{-8}\\
\\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\
\\
\hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1}\\
\hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2}\\
\\
\theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{align}
class Adam:
    """Adam (http://arxiv.org/abs/1412.6980v8)"""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        # fold the bias correction of m and v into the learning rate
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for key in params.keys():
            # equivalent to:
            # self.m[key] = self.beta1*self.m[key] + (1-self.beta1)*grads[key]
            # self.v[key] = self.beta2*self.v[key] + (1-self.beta2)*(grads[key]**2)
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])

            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
AdaGrad looks best in the comparison image, but unfortunately no one method is superior across the board (for now). Each has its own characteristics; there are problems each is good at and problems it is not.
Weight decay is a technique that suppresses overfitting and improves generalization performance. It is a method that steers learning toward smaller weight parameter values; making the weights smaller makes overfitting less likely.
If small weights are what we want, it is natural to start from initial values that are as small as possible. Up to now the initial weights were 0.01 * np.random.randn(10, 100), i.e. a Gaussian distribution with a standard deviation of 0.01.
However, setting the weights to 0 is a bad idea (to be exact, the weights must not all be set to the same value). The reason is that **in backpropagation, all the weight values would then be updated identically**. That is why random initial values are needed.
The conclusion is as follows. When the number of nodes in the previous layer is n:
Xavier initial value: Gaussian distribution with standard deviation \frac{1}{\sqrt{n}}
He initial value: Gaussian distribution with standard deviation \sqrt{\frac{2}{n}}
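In code, these rules come out roughly as follows (a minimal sketch; n_in and n_out are example layer sizes, and the book pairs Xavier with sigmoid/tanh and He with ReLU):

import numpy as np

n_in, n_out = 100, 50  # example sizes: the previous layer has n_in nodes

# Xavier initial value: std = 1 / sqrt(n_in), suited to sigmoid / tanh
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initial value: std = sqrt(2 / n_in), suited to ReLU
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)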
Data biased toward 0 and 1 → the backpropagated gradient values keep getting smaller and eventually vanish. This problem is called "**vanishing gradients**".
Every neuron outputting almost the same value → biased activations, which becomes a problem of "**limited expressive power**".
Therefore, it is desirable for the activations to be moderately spread out.
(Figures: distribution of the activations in each layer when the initial weights are a Gaussian with standard deviation 1, a Gaussian with standard deviation 0.01, the Xavier initial value, and the He initial value)
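The histograms come from an experiment along the lines of the following sketch (my paraphrase of the book's experiment): push random data through a 5-layer sigmoid network and histogram the activations of each layer, swapping in each of the four weight scales above.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.random.randn(1000, 100)  # 1000 data points, 100 features
node_num = 100
activations = {}

for i in range(5):
    if i != 0:
        x = activations[i - 1]
    # swap the scale: 1.0, 0.01, 1/np.sqrt(node_num) (Xavier), np.sqrt(2/node_num) (He)
    w = np.random.randn(node_num, node_num) * 1.0
    activations[i] = sigmoid(np.dot(x, w))

# histogram the activations of each layer
for i, a in activations.items():
    plt.subplot(1, len(activations), i + 1)
    plt.title(str(i + 1) + "-layer")
    plt.hist(a.flatten(), 30, range=(0, 1))
plt.show()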
With std = 0.01, learning hardly progresses at all. With He and Xavier, learning proceeds smoothly → you can see that the choice of initial values is very important.
Batch Normalization
"Force" activation adjustments so that the activation distribution of each layer has a moderate spread
Benefits of Batch Normalization (per the book): learning can proceed faster, the result depends less on the initial weight values, and overfitting is suppressed.
This is just a personal impression, but it is like Ajinomoto (umami seasoning) in cooking.
Adjust the activation distribution of each layer so that it has a moderate spread → in other words, insert a layer into the neural network that normalizes the data distribution (to mean 0 and variance 1).
For a mini-batch $B = \{x_1, x_2, \cdots, x_m\}$ of $m$ input data, compute the mean $\mu_B$ and the variance $\sigma_B^2$. Here $\epsilon$ is a very small value such as $10^{-7}$.
\begin{align}
\mu_B &\leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i\\
\sigma_B^2 &\leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2\\
\hat{x_i} &\leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\end{align}
In addition, the Batch Norm layer transforms this normalized data with its own scale and shift. γ and β are parameters that start from γ = 1 and β = 0 and are adjusted to appropriate values through learning.
y_i \leftarrow \gamma \hat{x_i} + \beta
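Put together, the forward pass of the Batch Norm layer is roughly the following (a minimal training-time sketch; the real layer also implements backward and keeps running statistics for test time):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-7):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 100)  # mini-batch of 32, 100 features
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))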
For backpropagation and the rest, read Frederik Kratzert's blog:
https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Causes of overfitting:
・a model with a large number of parameters and high expressive power
・too little training data
For the experiment, overfitting is deliberately induced.
Weight decay
Weight decay works by decaying (reducing) the weights. The purpose of training a neural network is to reduce the value of the loss function; if the squared norm of the weights (the L2 norm) is added to the loss, the weights are kept from becoming large.
If the weights are W, the weight decay term for the L2 norm is
\frac{1}{2}\lambda W^2
λ is a hyperparameter that controls the strength of the regularization. The 1/2 is an adjustment constant so that differentiating \frac{1}{2}\lambda W^2 with respect to W yields λW.
L2 norm
\sqrt{w_1^2+w_2^2+\cdots+w_n^2}
L1 norm
|w_1|+|w_2|+\cdots+|w_n|
To be honest, the reference below was easier to understand than the book's explanation.
Quote: http://qiita.com/supersaiakujin/items/97f4c0017ef76e547976
In a deep neural network, the more layers there are, the more expressive the model becomes. However, more layers also raise the risk of overfitting. The risk of overfitting can be reduced by limiting the degrees of freedom of the parameters while maintaining the model's expressive power. One such method is weight decay. The weight update formula is written as follows.
w \leftarrow w -\eta \frac{\partial C(w)}{\partial w} - \eta \lambda w
What the formula above is trying to do is a bit hard to see, but it actually comes from a cost function like the one below.
\tilde C(w) = C(w) + \frac{\lambda}{2}||w||^2
This is a cost function with an L2 regularization term; that term pushes the weight values down. So in an actual implementation, the L2 regularization term is added to the cost.
Although the book passes over it quickly, the relevant parts of the source are excerpted here. It is easy to follow if you search for weight_decay_lambda: it is used in initialization, in the loss-function calculation, and when setting the gradients.
def __init__(self, input_size, hidden_size_list, output_size,
             activation='relu', weight_init_std='relu', weight_decay_lambda=0):
    self.input_size = input_size
    self.output_size = output_size
    self.hidden_size_list = hidden_size_list
    self.hidden_layer_num = len(hidden_size_list)
    self.weight_decay_lambda = weight_decay_lambda
    self.params = {}

    # Weight initialization
    self.__init_weight(weight_init_std)

    # Layer generation
    activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
    self.layers = OrderedDict()
    for idx in range(1, self.hidden_layer_num+1):
        self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                                  self.params['b' + str(idx)])
        self.layers['Activation_function' + str(idx)] = activation_layer[activation]()

    idx = self.hidden_layer_num + 1
    self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                              self.params['b' + str(idx)])

    self.last_layer = SoftmaxWithLoss()
def loss(self, x, t):
    """Compute the loss function

    Parameters
    ----------
    x : input data
    t : teacher labels

    Returns
    -------
    value of the loss function
    """
    y = self.predict(x)

    weight_decay = 0
    for idx in range(1, self.hidden_layer_num + 2):
        W = self.params['W' + str(idx)]
        weight_decay += 0.5 * self.weight_decay_lambda * np.sum(W ** 2)

    return self.last_layer.forward(y, t) + weight_decay
def gradient(self, x, t):
    """Compute the gradient (error backpropagation)

    Parameters
    ----------
    x : input data
    t : teacher labels

    Returns
    -------
    dictionary holding the gradient of each layer
        grads['W1'], grads['W2'], ... are the weight gradients of each layer
        grads['b1'], grads['b2'], ... are the bias gradients of each layer
    """
    # forward
    self.loss(x, t)

    # backward
    dout = 1
    dout = self.last_layer.backward(dout)

    layers = list(self.layers.values())
    layers.reverse()
    for layer in layers:
        dout = layer.backward(dout)

    # set the gradients; the weight decay term differentiates to lambda * W
    grads = {}
    for idx in range(1, self.hidden_layer_num+2):
        grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
        grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db

    return grads
Dropout
Dropout: a method that learns while randomly erasing neurons
↓
Below, Dropout as implemented in the Chainer framework:
class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # mask is True where the neuron survives
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        return dout * self.mask
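Usage sketch (my own example, assuming numpy is imported as np): during training the mask zeroes out roughly dropout_ratio of the neurons on each forward pass; at test time every neuron fires but the output is scaled down instead:

dropout = Dropout(dropout_ratio=0.5)
x = np.random.randn(2, 5)
print(dropout.forward(x, train_flg=True))   # about half the entries zeroed
print(dropout.forward(x, train_flg=False))  # everything kept, scaled by 0.5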
Results when using Dropout
Hyperparameter examples so far:
・number of neurons in each layer
・batch size
・learning rate
・weight decay
Hyperparameters must not be tuned against the test data → that would cause overfitting to the test set.
Therefore we use *validation data*: data set aside specifically for evaluating hyperparameters.
Depending on the dataset, the user may need to create it themselves. The code below first separates about 20% of the training data as validation data.
(x_train, t_train), (x_test, t_test) = load_mnist()

# shuffle the training data
x_train, t_train = shuffle_dataset(x_train, t_train)

# split off the validation data
validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)

x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]
Repeat the following steps to optimize hyperparameters
STEP0: Specify the range of the hyperparameters (roughly, at first)
STEP1: Sample randomly from the specified hyperparameter range
STEP2: Train with the hyperparameter values sampled in STEP1 and evaluate the recognition accuracy on the validation data (with the number of epochs kept small)
STEP3: Repeat STEP1 and STEP2 a certain number of times (e.g., 100), and narrow the hyperparameter range based on the resulting recognition accuracies
Random sampling implementation:
weight_decay = 10 ** np.random.uniform(-8, -4)
lr = 10 ** np.random.uniform(-6, -2)
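Wrapped into the STEP0 to STEP3 loop, the whole search looks roughly like this sketch (my own scaffolding; train_and_eval is a hypothetical helper that trains for a few epochs with the given values and returns the validation accuracy):

import numpy as np

results = {}
for _ in range(100):                                # STEP3: repeat the trials
    weight_decay = 10 ** np.random.uniform(-8, -4)  # STEP1: sample log-uniformly
    lr = 10 ** np.random.uniform(-6, -2)
    # STEP2: short training run on the sampled values (hypothetical helper)
    val_acc = train_and_eval(lr, weight_decay)
    results[(lr, weight_decay)] = val_acc

# inspect the best trials and narrow the ranges for the next round (back to STEP0)
for (lr, wd), acc in sorted(results.items(), key=lambda kv: -kv[1])[:10]:
    print(f"val acc:{acc:.2f} | lr:{lr}, weight decay:{wd}")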
Best-1(val acc:0.84) | lr:0.008596628403945712, weight decay:3.075068633526172e-06
Best-2(val acc:0.83) | lr:0.009688160706596694, weight decay:5.876005684736357e-08
Best-3(val acc:0.78) | lr:0.007897858091143213, weight decay:3.792675246120474e-08
Best-4(val acc:0.77) | lr:0.008962267845301249, weight decay:4.0961888275354916e-07
Best-5(val acc:0.74) | lr:0.009453193380059509, weight decay:1.5625175027026464e-08
Best-6(val acc:0.73) | lr:0.0066257479672272536, weight decay:4.6591905625864734e-05
Best-7(val acc:0.72) | lr:0.007814005955583136, weight decay:4.9330072714643424e-06
Best-8(val acc:0.72) | lr:0.008895526423573389, weight decay:4.297901358238285e-06
Best-9(val acc:0.71) | lr:0.006419577071135049, weight decay:1.0848308972057103e-08
Best-10(val acc:0.69) | lr:0.006304961469167366, weight decay:1.6652787617252613e-07
Looking at the above results, the next ranges seem to be weight_decay: 10^-8 to 10^-5 and lr: 0.0001 to 0.01.
Narrow the range and run again:
Best-1(val acc:0.82) | lr:0.009567378324697062, weight decay:8.329914422037397e-07
Best-2(val acc:0.81) | lr:0.009548817455702163, weight decay:1.9982550859731867e-08
Best-3(val acc:0.8) | lr:0.009291306660458992, weight decay:2.2402127139457002e-07
Best-4(val acc:0.8) | lr:0.008381207344259718, weight decay:8.66434339086022e-08
Best-5(val acc:0.8) | lr:0.009034895918329205, weight decay:1.2694550788849033e-08
Best-6(val acc:0.78) | lr:0.0057717685490679006, weight decay:5.933415739833589e-08
Best-7(val acc:0.77) | lr:0.005287013083466725, weight decay:5.585759633899539e-06
Best-8(val acc:0.77) | lr:0.006997138970399023, weight decay:3.1968420191793365e-06
Best-9(val acc:0.77) | lr:0.007756581950864435, weight decay:1.0281187459919625e-08
Best-10(val acc:0.77) | lr:0.008298200180190944, weight decay:7.389218444784364e-06
Narrowing once more, to weight_decay: 10^-8 to 10^-6 and lr: 0.001 to 0.01:
Best-1(val acc:0.84) | lr:0.00971135118325034, weight decay:1.0394539789935165e-07
Best-2(val acc:0.83) | lr:0.009584343636422769, weight decay:3.1009381429608424e-07
Best-3(val acc:0.8) | lr:0.00832916652339643, weight decay:6.618592237280191e-07
Best-4(val acc:0.8) | lr:0.00959218016681805, weight decay:1.6405007969017657e-07
Best-5(val acc:0.78) | lr:0.006451172600874767, weight decay:4.0323875599954127e-07
Best-6(val acc:0.77) | lr:0.008024291255610844, weight decay:2.0050763243482884e-07
Best-7(val acc:0.77) | lr:0.009809009860349643, weight decay:4.934310445408953e-07
Best-8(val acc:0.77) | lr:0.009275309843754197, weight decay:5.343909279054936e-08
Best-9(val acc:0.76) | lr:0.00741122584285725, weight decay:1.588771824270857e-07
Best-10(val acc:0.75) | lr:0.006528687212003595, weight decay:1.3251120646717308e-07