Don't call it Deep Learning: this chapter (chapter 6) deals with things that are important in machine learning in general.
The purpose of neural network training is to find the parameters that make the value of the loss function as small as possible. Solving this kind of problem is called "optimization".
W \leftarrow W - \eta\frac{\partial L}{\partial W}
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
network = TwoLayerNet(...)
# optimizer: the object that performs the optimization
optimizer = SGD()

for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch(...)
    grads = network.gradient(x_batch, t_batch)
    params = network.params
    optimizer.update(params, grads)
    ...
The parameters are updated by the optimizer: all you have to do is pass the parameter and gradient information to it.
Implementing the optimization as a separate class like this makes the functionality easier to modularize.
For example, when implementing Momentum (the next method introduced), give it the same update(params, grads) method. Then you can switch from SGD to Momentum just by changing optimizer = SGD() to optimizer = Momentum().
The drawback of SGD is that if the shape of the function is not isotropic (i.e., it is elongated in some direction), the search follows an inefficient path.
The root cause of this drawback is that the direction of the gradient points somewhere other than toward the true minimum.
To improve on these shortcomings of SGD, three alternative methods are introduced.
・Momentum
・AdaGrad
・Adam
Momentum
Momentum means "momentum" in the physics sense. The image is a ball rolling on the ground: when it receives no force, it gradually decelerates due to friction and air resistance.
v \leftarrow \alpha v - \eta\frac{\partial L}{\partial W}\\
W \leftarrow W + v
class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
x-axis direction: the force received is small, but since it always points in the same direction, the ball steadily accelerates that way. y-axis direction: the force received is large, but positive and negative forces alternate and cancel each other out, so the velocity along the y-axis never builds up. → Compared with SGD, you approach along the x-axis faster and the zigzag movement is reduced.
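As a rough illustration of this behavior, here is a minimal sketch (my own, not from the book; it assumes the SGD and Momentum classes above are defined) comparing the two optimizers on the elongated bowl f(x, y) = x²/20 + y², whose minimum is at the origin:

import numpy as np

# gradient of f(x, y) = x**2 / 20 + y**2
def grad_f(x, y):
    return x / 10.0, 2.0 * y

def run(optimizer, steps=30):
    params = {'x': -7.0, 'y': 2.0}
    for _ in range(steps):
        gx, gy = grad_f(params['x'], params['y'])
        optimizer.update(params, {'x': gx, 'y': gy})
    return params['x'], params['y']

print(run(SGD(lr=0.9)))       # y zigzags while x only creeps toward 0
print(run(Momentum(lr=0.1)))  # velocity builds up along x, zigzag is damped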
AdaGrad
This method reduces the learning rate as learning progresses (learning rate decay), adapting it for each parameter. "Ada" comes from Adaptive.
h \leftarrow h + \frac{\partial L}{\partial W}\odot\frac{\partial L}{\partial W}\\
W \leftarrow W - \eta \frac{1}{\sqrt{h}} \frac{\partial L}{\partial W}
Multiplying by 1/√h means that elements that have moved a lot (been updated heavily) so far get a smaller effective learning rate.
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
The small value 1e-7 is added to avoid dividing by 0 when self.h[key] contains 0.
RMSProp
AdaGrad accumulates all past gradients as a sum of squares. Therefore, if learning continues indefinitely, the update amount approaches 0 and the parameters stop being updated at all.
RMSProp is a method that solves this problem. Instead of adding up all past gradients uniformly, it gradually forgets old gradients and weights new gradient information more heavily. This is called an "exponential moving average".
The book gives no formula for it, but working backward from the code it looks like the following.
\begin{align}
&\text{decay rate (initial value) } d = 0.99\\
h &\leftarrow d \cdot h + (1 - d)\frac{\partial L}{\partial W}\odot\frac{\partial L}{\partial W}\\
W &\leftarrow W - \eta \frac{1}{\sqrt{h}} \frac{\partial L}{\partial W}
\end{align}
class RMSprop:
    """RMSprop"""
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
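A small numerical sketch (my own, under the simplifying assumption of a constant gradient) makes the difference concrete: AdaGrad's h grows without bound, so its effective step 1/√h shrinks toward 0, while RMSprop's h converges and the step stays usable:

import numpy as np

g = 1.0  # constant gradient, for illustration only
h_ada, h_rms, d = 0.0, 0.0, 0.99
for t in range(10000):
    h_ada += g * g                       # AdaGrad: unbounded sum of squares
    h_rms = d * h_rms + (1 - d) * g * g  # RMSprop: exponential moving average

print(1.0 / np.sqrt(h_ada))  # -> 0.01, and it keeps shrinking as t grows
print(1.0 / np.sqrt(h_rms))  # -> ~1.0, stays bounded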
Adam
Momentum: movement according to the laws of physics
AdaGrad: adaptive adjustment of the update step for each parameter element
Adam = Momentum + AdaGrad + "bias correction" of the hyperparameters
As with RMSProp, there are no formulas in the book; the ones below came up when searching, so refer to them.
Quote: http://postd.cc/optimizing-gradient-descent/#adam
Adam (Adaptive Moment Estimation) uses yet another method to compute and adapt the learning rate for each parameter. Adadelta and RMSprop accumulate an exponentially decaying average of the past squared gradients v_t. In addition to this, Adam also keeps an exponentially decaying average of the past gradients m_t, similar to Momentum.
\begin{align}
&\text{defaults: } \beta_1 = 0.9,\ \beta_2 = 0.999,\ \epsilon = 10^{-8}\\
\\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\
\\
\hat{m}_t &= \dfrac{m_t}{1 - \beta^t_1}\\
\hat{v}_t &= \dfrac{v_t}{1 - \beta^t_2}\\
\\
\theta_{t+1} &= \theta_{t} - \dfrac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{align}
class Adam:
    """Adam (http://arxiv.org/abs/1412.6980v8)"""
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        # fold the bias correction of m and v into the learning rate
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for key in params.keys():
            # equivalent to:
            # self.m[key] = self.beta1*self.m[key] + (1-self.beta1)*grads[key]
            # self.v[key] = self.beta2*self.v[key] + (1-self.beta2)*(grads[key]**2)
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])

            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
AdaGrad looks best in the comparison image, but unfortunately no one method is superior across the board (for now). Each has its own characteristics; there are problems each is good at and problems it is not.
Weight decay is a technique that suppresses overfitting and improves generalization performance. It is a method that steers learning toward smaller weight parameter values; making the weights smaller makes overfitting less likely.
If small weights are what we want, it is natural to start from initial values that are as small as possible. Up to now the initial weights were 0.01 * np.random.randn(10, 100), i.e. a Gaussian distribution with a standard deviation of 0.01.
However, setting the weights to 0 is a bad idea (to be exact, the weights must not all be set to the same value). The reason is that **in backpropagation, all the weight values would then be updated identically**. That is why random initial values are needed.
The conclusion is as follows. When the number of nodes in the previous layer is n:
Xavier initial value: Gaussian distribution with standard deviation \frac{1}{\sqrt{n}}
He initial value: Gaussian distribution with standard deviation \sqrt{\frac{2}{n}}
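In code, these rules come out roughly as follows (a minimal sketch; n_in and n_out are example layer sizes, and the book pairs Xavier with sigmoid/tanh and He with ReLU):

import numpy as np

n_in, n_out = 100, 50  # example sizes: the previous layer has n_in nodes

# Xavier initial value: std = 1 / sqrt(n_in), suited to sigmoid / tanh
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# He initial value: std = sqrt(2 / n_in), suited to ReLU
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)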
Data biased toward 0 and 1 → the backpropagated gradient values keep getting smaller and eventually vanish. This problem is called "**vanishing gradients**".
Every neuron outputting almost the same value → biased activations, which becomes a problem of "**limited expressive power**".
Therefore, it is desirable for the activations to be moderately spread out.
(Figures: distribution of the activations in each layer when the initial weights are a Gaussian with standard deviation 1, a Gaussian with standard deviation 0.01, the Xavier initial value, and the He initial value)
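The histograms come from an experiment along the lines of the following sketch (my paraphrase of the book's experiment): push random data through a 5-layer sigmoid network and histogram the activations of each layer, swapping in each of the four weight scales above.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.random.randn(1000, 100)  # 1000 data points, 100 features
node_num = 100
activations = {}

for i in range(5):
    if i != 0:
        x = activations[i - 1]
    # swap the scale: 1.0, 0.01, 1/np.sqrt(node_num) (Xavier), np.sqrt(2/node_num) (He)
    w = np.random.randn(node_num, node_num) * 1.0
    activations[i] = sigmoid(np.dot(x, w))

# histogram the activations of each layer
for i, a in activations.items():
    plt.subplot(1, len(activations), i + 1)
    plt.title(str(i + 1) + "-layer")
    plt.hist(a.flatten(), 30, range=(0, 1))
plt.show()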
With std = 0.01, learning hardly progresses at all. With He and Xavier, learning proceeds smoothly → you can see that the choice of initial values is very important.
Batch Normalization
"Force" activation adjustments so that the activation distribution of each layer has a moderate spread
Benefits of Batch Normalization (per the book): learning can proceed faster, the result depends less on the initial weight values, and overfitting is suppressed.
This is just a personal impression, but it is like Ajinomoto (umami seasoning) in cooking.
Adjust the activation distribution of each layer so that it has a moderate spread → in other words, insert a layer into the neural network that normalizes the data distribution (to mean 0 and variance 1).
For a mini-batch $B = \{x_1, x_2, \cdots, x_m\}$ of $m$ input data, compute the mean $\mu_B$ and the variance $\sigma_B^2$. Here $\epsilon$ is a very small value such as $10^{-7}$.
\begin{align}
\mu_B &\leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i\\
\sigma_B^2 &\leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2\\
\hat{x_i} &\leftarrow \frac{x_i-\mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\end{align}
In addition, the Batch Norm layer transforms this normalized data with its own scale and shift. γ and β are parameters that start from γ = 1 and β = 0 and are adjusted to appropriate values through learning.
y_i \leftarrow \gamma \hat{x_i} + \beta
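Put together, the forward pass of the Batch Norm layer is roughly the following (a minimal training-time sketch; the real layer also implements backward and keeps running statistics for test time):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-7):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 100)  # mini-batch of 32, 100 features
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))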
For backpropagation and the rest, read Frederik Kratzert's blog:
https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Causes of overfitting:
・a model with a large number of parameters and high expressive power
・too little training data
For the experiment, overfitting is deliberately induced.
Weight decay
Weight decay works by decaying (reducing) the weights. The purpose of training a neural network is to reduce the value of the loss function; if the squared norm of the weights (the L2 norm) is added to the loss, the weights are kept from becoming large.
If the weights are W, the weight decay term for the L2 norm is
\frac{1}{2}\lambda W^2
λ is a hyperparameter that controls the strength of the regularization. The 1/2 is an adjustment constant so that differentiating \frac{1}{2}\lambda W^2 with respect to W yields λW.
L2 norm
\sqrt{w_1^2+w_2^2+\cdots+w_n^2}
L1 norm
|w_1|+|w_2|+\cdots+|w_n|
To be honest, the reference below was easier to understand than the book's explanation.
Quote: http://qiita.com/supersaiakujin/items/97f4c0017ef76e547976
In a deep neural network, the more layers there are, the more expressive the model becomes. However, more layers also raise the risk of overfitting. The risk of overfitting can be reduced by limiting the degrees of freedom of the parameters while maintaining the model's expressive power. One such method is weight decay. The weight update formula is written as follows.
w \leftarrow w -\eta \frac{\partial C(w)}{\partial w} - \eta \lambda w
What the formula above is trying to do is a bit hard to see, but it actually comes from a cost function like the one below.
\tilde C(w) = C(w) + \frac{\lambda}{2}||w||^2
This is a cost function with an L2 regularization term; that term pushes the weight values down. So in an actual implementation, the L2 regularization term is added to the cost.
Although the book passes over it quickly, the relevant parts of the source are excerpted here. It is easy to follow if you search for weight_decay_lambda: it is used in initialization, in the loss-function calculation, and when setting the gradients.
def __init__(self, input_size, hidden_size_list, output_size,
             activation='relu', weight_init_std='relu', weight_decay_lambda=0):
    self.input_size = input_size
    self.output_size = output_size
    self.hidden_size_list = hidden_size_list
    self.hidden_layer_num = len(hidden_size_list)
    self.weight_decay_lambda = weight_decay_lambda
    self.params = {}

    # Weight initialization
    self.__init_weight(weight_init_std)

    # Layer generation
    activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
    self.layers = OrderedDict()
    for idx in range(1, self.hidden_layer_num+1):
        self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                                  self.params['b' + str(idx)])
        self.layers['Activation_function' + str(idx)] = activation_layer[activation]()

    idx = self.hidden_layer_num + 1
    self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                              self.params['b' + str(idx)])

    self.last_layer = SoftmaxWithLoss()
def loss(self, x, t):
    """Compute the loss function

    Parameters
    ----------
    x : input data
    t : teacher labels

    Returns
    -------
    value of the loss function
    """
    y = self.predict(x)

    weight_decay = 0
    for idx in range(1, self.hidden_layer_num + 2):
        W = self.params['W' + str(idx)]
        weight_decay += 0.5 * self.weight_decay_lambda * np.sum(W ** 2)

    return self.last_layer.forward(y, t) + weight_decay
def gradient(self, x, t):
    """Compute the gradient (error backpropagation)

    Parameters
    ----------
    x : input data
    t : teacher labels

    Returns
    -------
    dictionary holding the gradient of each layer
        grads['W1'], grads['W2'], ... are the weight gradients of each layer
        grads['b1'], grads['b2'], ... are the bias gradients of each layer
    """
    # forward
    self.loss(x, t)

    # backward
    dout = 1
    dout = self.last_layer.backward(dout)

    layers = list(self.layers.values())
    layers.reverse()
    for layer in layers:
        dout = layer.backward(dout)

    # set the gradients; the weight decay term differentiates to lambda * W
    grads = {}
    for idx in range(1, self.hidden_layer_num+2):
        grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
        grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db

    return grads
Dropout
Dropout: a method that learns while randomly erasing neurons
↓
Below, Dropout as implemented in the Chainer framework:
class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # mask is True where the neuron survives
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        return dout * self.mask
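Usage sketch (my own example, assuming numpy is imported as np): during training the mask zeroes out roughly dropout_ratio of the neurons on each forward pass; at test time every neuron fires but the output is scaled down instead:

dropout = Dropout(dropout_ratio=0.5)
x = np.random.randn(2, 5)
print(dropout.forward(x, train_flg=True))   # about half the entries zeroed
print(dropout.forward(x, train_flg=False))  # everything kept, scaled by 0.5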
Results when using Dropout
Hyperparameter examples so far:
・number of neurons in each layer
・batch size
・learning rate
・weight decay
Hyperparameters must not be tuned against the test data → that would cause overfitting to the test set.
Therefore we use *validation data*: data set aside specifically for evaluating hyperparameters.
Depending on the dataset, the user may need to create it themselves. The code below first separates about 20% of the training data as validation data.
(x_train, t_train), (x_test, t_test) = load_mnist()

# shuffle the training data
x_train, t_train = shuffle_dataset(x_train, t_train)

# split off the validation data
validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)

x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]
Repeat the following steps to optimize hyperparameters
STEP0: Specify the range of the hyperparameters (roughly, at first)
STEP1: Sample randomly from the specified hyperparameter range
STEP2: Train with the hyperparameter values sampled in STEP1 and evaluate the recognition accuracy on the validation data (with the number of epochs kept small)
STEP3: Repeat STEP1 and STEP2 a certain number of times (e.g., 100), and narrow the hyperparameter range based on the resulting recognition accuracies
Random sampling implementation:
weight_decay = 10 ** np.random.uniform(-8, -4)
lr = 10 ** np.random.uniform(-6, -2)
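Wrapped into the STEP0 to STEP3 loop, the whole search looks roughly like this sketch (my own scaffolding; train_and_eval is a hypothetical helper that trains for a few epochs with the given values and returns the validation accuracy):

import numpy as np

results = {}
for _ in range(100):                                # STEP3: repeat the trials
    weight_decay = 10 ** np.random.uniform(-8, -4)  # STEP1: sample log-uniformly
    lr = 10 ** np.random.uniform(-6, -2)
    # STEP2: short training run on the sampled values (hypothetical helper)
    val_acc = train_and_eval(lr, weight_decay)
    results[(lr, weight_decay)] = val_acc

# inspect the best trials and narrow the ranges for the next round (back to STEP0)
for (lr, wd), acc in sorted(results.items(), key=lambda kv: -kv[1])[:10]:
    print(f"val acc:{acc:.2f} | lr:{lr}, weight decay:{wd}")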
Best-1(val acc:0.84) | lr:0.008596628403945712, weight decay:3.075068633526172e-06
Best-2(val acc:0.83) | lr:0.009688160706596694, weight decay:5.876005684736357e-08
Best-3(val acc:0.78) | lr:0.007897858091143213, weight decay:3.792675246120474e-08
Best-4(val acc:0.77) | lr:0.008962267845301249, weight decay:4.0961888275354916e-07
Best-5(val acc:0.74) | lr:0.009453193380059509, weight decay:1.5625175027026464e-08
Best-6(val acc:0.73) | lr:0.0066257479672272536, weight decay:4.6591905625864734e-05
Best-7(val acc:0.72) | lr:0.007814005955583136, weight decay:4.9330072714643424e-06
Best-8(val acc:0.72) | lr:0.008895526423573389, weight decay:4.297901358238285e-06
Best-9(val acc:0.71) | lr:0.006419577071135049, weight decay:1.0848308972057103e-08
Best-10(val acc:0.69) | lr:0.006304961469167366, weight decay:1.6652787617252613e-07
Looking at the above results, the next ranges seem to be weight_decay: 10^-8 to 10^-5 and lr: 0.0001 to 0.01.
Narrow the range and run again:
Best-1(val acc:0.82) | lr:0.009567378324697062, weight decay:8.329914422037397e-07
Best-2(val acc:0.81) | lr:0.009548817455702163, weight decay:1.9982550859731867e-08
Best-3(val acc:0.8) | lr:0.009291306660458992, weight decay:2.2402127139457002e-07
Best-4(val acc:0.8) | lr:0.008381207344259718, weight decay:8.66434339086022e-08
Best-5(val acc:0.8) | lr:0.009034895918329205, weight decay:1.2694550788849033e-08
Best-6(val acc:0.78) | lr:0.0057717685490679006, weight decay:5.933415739833589e-08
Best-7(val acc:0.77) | lr:0.005287013083466725, weight decay:5.585759633899539e-06
Best-8(val acc:0.77) | lr:0.006997138970399023, weight decay:3.1968420191793365e-06
Best-9(val acc:0.77) | lr:0.007756581950864435, weight decay:1.0281187459919625e-08
Best-10(val acc:0.77) | lr:0.008298200180190944, weight decay:7.389218444784364e-06
Narrowing once more, to weight_decay: 10^-8 to 10^-6 and lr: 0.001 to 0.01:
Best-1(val acc:0.84) | lr:0.00971135118325034, weight decay:1.0394539789935165e-07
Best-2(val acc:0.83) | lr:0.009584343636422769, weight decay:3.1009381429608424e-07
Best-3(val acc:0.8) | lr:0.00832916652339643, weight decay:6.618592237280191e-07
Best-4(val acc:0.8) | lr:0.00959218016681805, weight decay:1.6405007969017657e-07
Best-5(val acc:0.78) | lr:0.006451172600874767, weight decay:4.0323875599954127e-07
Best-6(val acc:0.77) | lr:0.008024291255610844, weight decay:2.0050763243482884e-07
Best-7(val acc:0.77) | lr:0.009809009860349643, weight decay:4.934310445408953e-07
Best-8(val acc:0.77) | lr:0.009275309843754197, weight decay:5.343909279054936e-08
Best-9(val acc:0.76) | lr:0.00741122584285725, weight decay:1.588771824270857e-07
Best-10(val acc:0.75) | lr:0.006528687212003595, weight decay:1.3251120646717308e-07