As backpropagation proceeds toward the lower layers, the gradient becomes smaller and smaller. As a result, in the update by gradient descent the parameters of the lower layers barely change, and training does not converge to the optimal solution.
For the sigmoid function, large input values give outputs whose gradient is close to zero, which can cause the vanishing gradient problem. The maximum value of the derivative of the sigmoid function is 0.25; since these derivatives are multiplied together layer by layer, the gradient approaches 0. ReLU gives good results in avoiding the vanishing gradient problem: because it returns its input as-is once it exceeds the threshold (0), the gradient is unlikely to vanish. It also has merits in terms of sparsity and delivers good results in terms of accuracy.
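As a quick numerical check (a minimal NumPy sketch, not part of the course code), the sigmoid derivative never exceeds 0.25, while the ReLU derivative is 1 for positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); maximum 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # d/dx ReLU(x) = 1 for x > 0, 0 otherwise; does not shrink the gradient
    return (np.asarray(x) > 0).astype(np.float64)

x = np.linspace(-5.0, 5.0, 101)
print(sigmoid_grad(x).max())  # -> 0.25
print(relu_grad(x).max())     # -> 1.0
```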
Xavier initialization: each weight element is divided by the square root of the number of nodes in the previous layer.
self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size)
self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size)
He initialization: each weight element is divided by the square root of the number of nodes in the previous layer and multiplied by $\sqrt{2}$.
self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size) * np.sqrt(2)
self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size) * np.sqrt(2)
Batch normalization is a method to suppress the bias of the input data in mini-batch units. A layer containing the batch normalization process is added before or after the values are passed to the activation function.
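As an illustration (a minimal sketch of the forward pass only; the names `gamma`, `beta`, and `eps` are assumed here, not taken from the course code), each feature is standardized over the mini-batch and then scaled and shifted by learnable parameters:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """x: mini-batch of shape (batch_size, features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 100) * 5 + 3       # biased, widely spread inputs
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly 0 and 1 per feature
```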
Policy for setting the initial learning rate:
--Set a large initial learning rate and gradually decrease it
--Use a variable learning rate for each parameter
--Optimize the learning rate using a learning-rate optimization method
Momentum: after subtracting the product of the learning rate and the gradient of the error with respect to the parameter, add the product of the inertia (momentum) coefficient and the value obtained by subtracting the previous weight from the current weight.
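Written in the same notation as the AdaGrad and RMSProp updates later in this section (a standard formulation; $\mu$ denotes the momentum coefficient):

V_t = \mu V_{t-1} - \epsilon \Delta E \\
w^{(t+1)} = w^{(t)} + V_t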
--It reaches the global optimum rather than getting stuck in a local optimum.
--Once it falls into a valley, it reaches the lowest point quickly.
AdaGrad: subtract the product of a redefined learning rate and the gradient of the error with respect to the parameter.
h_0 = \theta \\
h_t = h_{t-1} + (\Delta E)^2 \\
w^{(t+1)} = w^{(t)} - \epsilon \frac{1}{\sqrt{h_t}+\theta}\Delta E
--It approaches the optimal value well on gently sloping error surfaces.
--Since the learning rate gradually decreases, it may cause a saddle point problem.
RMSProp: subtract the product of a redefined learning rate and the gradient of the error with respect to the parameter. It eliminates the disadvantages of AdaGrad.
h_0 = \theta \\
h_t = a h_{t-1} + (1-a)(\Delta E)^2 \\
w^{(t+1)} = w^{(t)} - \epsilon \frac{1}{\sqrt{h_t}+\theta}\Delta E
--It reaches the global optimum rather than getting stuck in a local optimum.
--Hyperparameters require less adjustment.
Adam: an optimization algorithm that combines the two ideas above: the exponentially decaying moving average of past gradients (from Momentum) and the exponentially decaying moving average of past squared gradients (from RMSProp).
An algorithm that inherits the merits of both Momentum and RMSProp.
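A minimal NumPy sketch of these update rules for a single parameter array (illustrative only; names such as `lr`, `rho`, `beta1`, and `beta2` are assumptions, not the course implementation):

```python
import numpy as np

def adagrad_update(w, grad, h, lr=0.01, eps=1e-7):
    # accumulate squared gradients; the effective learning rate keeps shrinking
    h += grad ** 2
    w -= lr * grad / (np.sqrt(h) + eps)
    return w, h

def rmsprop_update(w, grad, h, lr=0.01, rho=0.99, eps=1e-7):
    # exponentially decaying average of squared gradients avoids AdaGrad's decay to zero
    h = rho * h + (1 - rho) * grad ** 2
    w -= lr * grad / (np.sqrt(h) + eps)
    return w, h

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # combines a Momentum-style first moment and an RMSProp-style second moment
    # t is the 1-based step count used for bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```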
Section3
A large weight can represent an important feature, but weights that are too large may indicate overfitting. The weights are suppressed by adding a regularization term to the error function.
What is regularization? Imposing rules on the degrees of freedom of the network and restricting them in a certain direction.
E_n(w) + \frac{1}{p}\lambda ||x||_p \\
||x||_p = (|x_1|^p + \dots + |x_n|^p)^{\frac{1}{p}}
When p = 1, it is called L1 regularization. When p = 2, it is called L2 regularization.
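As a sketch (a common practical form of these penalties: λΣ|W| for L1 and (1/2)λΣW² for L2; the dict of weight matrices named `params` is an assumed setup, not the course code):

```python
import numpy as np

def regularization_term(params, lam, p):
    """p=1: L1 (Lasso), p=2: L2 (Ridge); lam is the regularization strength."""
    total = 0.0
    for W in params.values():
        if p == 1:
            total += np.sum(np.abs(W))        # L1 norm of the weights
        else:
            total += 0.5 * np.sum(W ** 2)     # half the squared L2 norm
    return lam * total

def regularization_grad(W, lam, p):
    # extra gradient added to dW during backpropagation
    if p == 1:
        return lam * np.sign(W)
    return lam * W
```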
Dropout: training while randomly deleting (dropping out) nodes.
It can be interpreted as training different models without changing the amount of data.
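A minimal sketch of the idea (a textbook-style, non-inverted dropout; `dropout_ratio` is an assumed parameter name):

```python
import numpy as np

def dropout_forward(x, dropout_ratio=0.5, train=True):
    if train:
        # randomly drop nodes during training
        mask = np.random.rand(*x.shape) > dropout_ratio
        return x * mask, mask
    # at inference time, scale the outputs instead of dropping nodes
    return x * (1.0 - dropout_ratio), None
```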
Section4 CNN It is mainly used to create models for images. It is composed of a convolution layer, a pooling layer, and a fully connected layer.
It can learn 3D data (height, width, and channels) as-is and pass it on to the next layer.
The filter corresponds to the weights in a fully connected layer.
Padding: changing the size of the output by adding values around the edges of the input image. When filled with zeros, it is called zero padding (zeros have little effect on learning).
The amount the filter moves in the x and y directions as it slides over the input is called the stride.
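For reference, with input size $H \times W$, filter size $FH \times FW$, padding $P$, and stride $S$, the output size of the convolution is given by the standard formula:

OH = \frac{H + 2P - FH}{S} + 1 \\
OW = \frac{W + 2P - FW}{S} + 1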
Channels: the depth dimension. The number of feature maps in this direction equals the number of channels, e.g. 3 for RGB.
Max pooling: takes the output of the convolution layer as input and outputs the maximum value in each region.
Average pooling: takes the output of the convolution layer as input and outputs the average value of each region.
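An illustrative NumPy sketch of max and average pooling over non-overlapping 2×2 windows (a simplification; the exercise code uses im2col instead):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """x: feature map of shape (H, W) with H and W divisible by 2."""
    H, W = x.shape
    windows = x.reshape(H // 2, 2, W // 2, 2)   # split into 2x2 blocks
    if mode == "max":
        return windows.max(axis=(1, 3))          # max pooling
    return windows.mean(axis=(1, 3))             # average pooling

x = np.arange(16).reshape(4, 4).astype(float)
print(pool2x2(x, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2x2(x, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]
```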
Section5 AlexNet Winner of ILSVRC 2012. It consists of 5 convolution layers with pooling layers, followed by 3 fully connected layers. Dropout is used on the fully connected layer outputs of size 4096.
(1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45
The derivative of the sigmoid function is $(1-\mathrm{sigmoid}(x)) \cdot \mathrm{sigmoid}(x)$. It takes its maximum when sigmoid(x) = 0.5, giving $(1-0.5) \cdot 0.5 = 0.25$.
(2)
Because all weights start from the same value, they all receive the same update, and parameter tuning does not proceed.
--Calculation time is shortened.
--Vanishing gradients are less likely to occur.
Momentum
--It is easy to obtain a global optimum solution instead of a local optimum solution.
AdaGrad
--It converges easily toward the optimum even on gently sloping surfaces.
RMSProp
--Requires less hyperparameter adjustment.
(a) When the hyperparameter is set to a large value, all weights approach 0 in the limit. (b) Setting the hyperparameter to 0 results in non-linear regression. (c) The bias term is also regularized. (d) In ridge regression, a regularization term is added to the hidden layer.
(a)
(a) This is the behavior of ridge regression (L2 regularization). (b) Linear regression does not become non-linear regression through regularization. (c) The bias term is not regularized. (d) The regularization term is added to the error function, not the hidden layer.
Left: L2 regularization. Right: L1 regularization.
7×7
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_2_1_vanishing_gradient.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_4_optimizer.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_5_overfiting.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_6_simple_convolution_network.ipynb
--The vanishing gradient problem was improved by changing the activation function from sigmoid to ReLU.
--Regarding weight initialization, Xavier improved the vanishing gradient problem with both ReLU and sigmoid, while He showed improvement only with ReLU. From this result, the vanishing gradient problem can be improved by adjusting both the initial weight values and the activation function.
--Learning speed and accuracy change depending on the learning rate, and learning progressed once the learning-rate optimization methods were used.
--If the regularization strength is too large, accuracy does not improve; if it is too small, overfitting cannot be suppressed.
--Overfitting is suppressed by dropout.
--With im2col, the image becomes a two-dimensional array; the results of im2col and col2im are not consistent (col2im does not restore the original array).
--Learning on image data can be advanced by using CNN; I felt that high PC specs are needed for computations such as convolution.