Rabbit Challenge Deep Learning 2Day

Section1 Vanishing Gradient Problem

As backpropagation proceeds toward the lower layers, the gradient becomes smaller and smaller. As a result, in the update by gradient descent, the parameters of the lower layers remain almost unchanged and training does not converge to the optimal values.

Activation function

Sigmoid function

For inputs with large absolute values the derivative becomes small, which can cause the vanishing gradient problem. The maximum value of the derivative of the sigmoid function is 0.25; since the backpropagated gradient is repeatedly multiplied by this, it approaches 0. ReLU gives good results in avoiding the vanishing gradient problem: since it passes the input through unchanged above the threshold of 0, the gradient is unlikely to vanish. It also has merits in terms of sparsity, and accuracy is maintained.
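A minimal sketch (NumPy assumed; the function names are mine) that shows why the sigmoid derivative caps the backpropagated gradient at 0.25 while ReLU passes it through unchanged for positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # at most 0.25, reached at x = 0

def d_relu(x):
    return (x > 0).astype(float)    # 1 for positive inputs, 0 otherwise

x = np.linspace(-5.0, 5.0, 1001)
print(d_sigmoid(x).max())  # ~0.25: each sigmoid layer shrinks the gradient by at least 4x
print(d_relu(x).max())     # 1.0:  ReLU does not shrink the gradient for active units
```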

How to set the initial value

Xavier initialization (suited to S-shaped activation functions such as sigmoid and tanh)

The weights are initialized with values from a standard normal distribution divided by the square root of the number of nodes in the previous layer.

self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size)
self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size)

He initialization (suited to ReLU)

The weights are initialized with values from a standard normal distribution divided by the square root of the number of nodes in the previous layer, multiplied by $\sqrt{2}$.

self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size) * np.sqrt(2)
self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size) * np.sqrt(2)

Batch normalization

A method that suppresses the bias of the input data in mini-batch units. A layer containing the batch normalization operation is added before or after passing values to the activation function.
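A minimal sketch of the batch-normalization forward pass for a single mini-batch (the names `batch_norm_forward`, `gamma`, `beta`, and `eps` are my own, not from the course code):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """Normalize each feature over the mini-batch, then apply a learnable scale and shift."""
    mu = x.mean(axis=0)                   # per-feature mean over the batch
    var = x.var(axis=0)                   # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# biased mini-batch: 32 samples, 100 features, far from zero mean / unit variance
x = np.random.randn(32, 100) * 5.0 + 3.0
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())              # approximately 0 and 1
```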

Section2 Learning rate optimization method

Policy on how to set the initial learning rate

--Set a large initial learning rate and gradually decrease it
--Use a variable learning rate for each parameter
--Optimize the learning rate using learning rate optimization methods

Momentum

Subtract the product of the learning rate and the derivative of the error with respect to the parameter, then add the product of the inertia coefficient and the difference obtained by subtracting the previous weight from the current weight.

Formula

V_t = \mu V_{t-1} - \epsilon \Delta E \\
w^{(t+1)} = w^{(t)} + V_t \\
\text{Inertia: } \mu

merit

--It tends to reach the global optimum rather than getting stuck in a local optimum.
--Once it reaches a valley, it moves to the lowest point quickly.
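A minimal sketch of the momentum update rule above (NumPy assumed; the names `momentum_update`, `lr`, and `mu` are mine):

```python
import numpy as np

def momentum_update(w, grad, v, lr=0.01, mu=0.9):
    """V_t = mu * V_{t-1} - lr * dE ;  w^{(t+1)} = w^{(t)} + V_t"""
    v = mu * v - lr * grad
    return w + v, v

w = np.zeros(3)
v = np.zeros_like(w)
grad = np.array([0.5, -0.2, 0.1])   # gradient of the error w.r.t. w
w, v = momentum_update(w, grad, v)
```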

AdaGrad Subtract the product of the re-defined (adaptive) learning rate and the derivative of the error with respect to the parameter.

Formula

h_0 = \theta \\
h_t = h_{t-1} + (\Delta E)^2 \\
w^{(t+1)} = w^{(t)} - \epsilon \frac{1}{\sqrt{h_t}+\theta}\Delta E

merit

--It approaches the optimal value well even on surfaces with gentle slopes.

Demerit

--Since the learning rate gradually decreases, it may cause a saddle point problem.
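A minimal sketch of the AdaGrad update above (names are mine; `theta` is the small constant that prevents division by zero):

```python
import numpy as np

def adagrad_update(w, grad, h, lr=0.01, theta=1e-7):
    """h_t = h_{t-1} + dE^2 ;  w <- w - lr * dE / (sqrt(h_t) + theta)"""
    h = h + grad * grad   # accumulated squared gradients
    return w - lr * grad / (np.sqrt(h) + theta), h
```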

RMSProp Subtract the product of the re-defined (adaptive) learning rate and the derivative of the error with respect to the parameter. It eliminates the drawback of AdaGrad by using an exponential moving average of the squared gradient.

Formula

h_0 = \theta \\
h_t = a h_{t-1} + (1-a)(\Delta E)^2 \\
w^{(t+1)} = w^{(t)} - \epsilon \frac{1}{\sqrt{h_t}+\theta}\Delta E

merit

--It tends to reach the global optimum rather than getting stuck in a local optimum.
--Hyperparameters need to be adjusted less often.
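A minimal sketch of the RMSProp update above (names are mine; `a` is the decay rate from the formula):

```python
import numpy as np

def rmsprop_update(w, grad, h, lr=0.01, a=0.99, theta=1e-7):
    """h_t = a * h_{t-1} + (1 - a) * dE^2 ;  w <- w - lr * dE / (sqrt(h_t) + theta)"""
    h = a * h + (1.0 - a) * grad * grad   # exponential moving average of squared gradients
    return w - lr * grad / (np.sqrt(h) + theta), h
```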

Adam An optimization algorithm that combines the two methods above: the exponentially decaying moving average of past gradients from Momentum and the exponentially decaying moving average of past squared gradients from RMSProp.

merit

An algorithm that combines the merits of Momentum and RMSProp.
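A minimal sketch of the Adam update (with bias correction; the hyperparameter names `beta1`, `beta2`, and `eps` are conventional defaults, not taken from the course slides):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum-style average of the gradient (m) combined with an RMSProp-style average of its square (v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # exponentially decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # exponentially decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```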

Section3 Overfitting

Weight Decay

A large weight can represent an important feature, but weights that are too large may indicate overfitting. Weight decay suppresses the weights by adding a regularization term to the error function.

L1 regularization, L2 regularization

What is regularization? Adding constraints to the degrees of freedom of the network to restrict it in a certain direction.

Formula

E_n(w) + \frac{1}{p}\lambda ||x||_p \\
||x||_p = (|x_1|^p + \dots + |x_n|^p)^{\frac{1}{p}}

When p = 1, it is called L1 regularization. When p = 2, it is called L2 regularization.
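A minimal sketch of the penalties in their commonly used form (L2: (λ/2)·Σw², L1: λ·Σ|w|); the course notebook may normalize slightly differently, and the name `lam` is mine:

```python
import numpy as np

def l2_penalty(weights, lam=0.1):
    """L2 (weight decay) term added to the error; its gradient contribution is lam * w."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam=0.005):
    """L1 term added to the error; its gradient contribution is lam * sign(w)."""
    return lam * sum(np.sum(np.abs(w)) for w in weights)
```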

Dropout

Nodes are randomly dropped (deactivated) during training.

merit

It can be interpreted as training different models without changing the amount of data.
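A minimal sketch of dropout at training time (NumPy assumed; `dropout_ratio` is the fraction of nodes to drop, and the names are mine):

```python
import numpy as np

def dropout_forward(x, dropout_ratio=0.5, train=True):
    """Randomly zero out nodes during training; scale the output at inference time instead."""
    if train:
        mask = np.random.rand(*x.shape) > dropout_ratio   # keep each node with probability 1 - ratio
        return x * mask
    return x * (1.0 - dropout_ratio)
```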

Section4 CNN

CNNs are mainly used to build models for images. They are composed of convolution layers, pooling layers, and fully connected layers.

Convolution layer

Three-dimensional data (height, width, and channels) can be learned as-is and passed on to the next layer.

filter

Corresponds to the weights in a fully connected layer.

Padding

The output size is adjusted by adding values around the edges of the input image. Filling with zeros is called zero padding (it has little effect on learning).

stride

The amount by which the filter is shifted in the x and y directions as it is applied across the input is called the stride.

Channel

The depth dimension. The number of layers equals the number of channels (e.g., RGB images have 3 channels).

Pooling layer

Max pooling

Takes the output of the convolution layer as input and outputs the maximum value in each region.

Average pooling

Takes the output of the convolution layer as input and outputs the average value of the inputs in each region.
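A minimal sketch (NumPy assumed, function names are mine) of the standard output-size formula for a convolution or pooling layer, plus a tiny max-pooling example:

```python
import numpy as np

def conv_output_size(input_size, filter_size, stride=1, pad=0):
    """Standard formula: (H + 2P - FH) / S + 1."""
    return (input_size + 2 * pad - filter_size) // stride + 1

print(conv_output_size(6, 2, stride=1, pad=1))    # 7 (matches the confirmation test below)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array whose sides are even."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(np.arange(16).reshape(4, 4)))  # [[ 5  7] [13 15]]
```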

Section5 AlexNet

Winner of ILSVRC 2012. It consists of five convolution layers (with pooling layers) followed by three fully connected layers. Dropout is applied to the outputs of the fully connected layers of size 4096.

[Figure: AlexNet architecture]

Confirmation test

Find dz/dx using the chain rule.

z = t^2
t = x + y

answer

\frac{dz}{dt}=2t
\frac{dt}{dx}=1
\frac{dz}{dx}= \frac{dz}{dt}\frac{dt}{dx}
\frac{dz}{dx}=2t=2(x + y)

The derivative of the sigmoid function takes its maximum value when the input is 0. Select the correct value for that maximum from the options.

(1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45

answer

The derivative of the sigmoid function is $(1 - \mathrm{sigmoid}(x)) \cdot \mathrm{sigmoid}(x)$. Since $\mathrm{sigmoid}(0) = 0.5$, the maximum is $(1 - 0.5) \cdot 0.5 = 0.25$.

(2)

What kind of problem occurs when the initial values of the weights are set to 0? Explain briefly.

answer

Since every weight passes on the same value, all weights receive identical updates and the parameters cannot be tuned meaningfully.

List two commonly considered effects of batch normalization

answer

--Computation (training) time is shortened.
--Vanishing gradients are less likely to occur.

Briefly explain the characteristics of Momentum, AdaGrad, and RMSProp.

answer

Momentum

--It tends to reach the global optimum rather than getting stuck in a local optimum.

AdaGrad

--It converges toward the optimum even in regions with gentle slopes.

RMSProp

--Requires less hyperparameter tuning.

Regularization of the linear models used in machine learning (linear regression, principal component analysis, etc.) is possible by constraining the model weights. Among these regularization methods for linear models there is one called ridge regression; select the correct statement describing its characteristics.

(a) When the hyperparameter is set to a large value, all weights approach 0 infinitely.
(b) Setting the hyperparameter to 0 results in non-linear regression.
(c) The bias term is also regularized.
(d) For ridge regression, a regularization term is added to the hidden layer.

answer

(a)

(a) Correct: in ridge regression (L2 regularization), all weights approach 0 as the hyperparameter becomes large.
(b) Linear regression does not become non-linear regression through regularization.
(c) The bias term is not regularized.
(d) The regularization term is added to the error function, not to a hidden layer.

For the figure below, answer which graph shows L1 regularization.

[Figure: graphs of L2 regularization (left) and L1 regularization (right)]

answer

Left: L2 regularization. Right: L1 regularization.

Answer the size of the output image when an input image of size 6x6 is convolved with a filter of size 2x2. The stride and padding are both set to 1.

answer

7×7 (output size = (6 + 2×1 - 2) / 1 + 1 = 7)

Practical exercises

Vanishing gradient problem solution implementation

result

https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_2_1_vanishing_gradient.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_4_optimizer.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_5_overfiting.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_6_simple_convolution_network.ipynb

Consideration

--The vanishing gradient problem was improved by changing the activation function from sigmoid to ReLU.
--Regarding weight initialization, Xavier improved the vanishing gradient problem with both ReLU and sigmoid; with He, only ReLU showed improvement.
--From these results, the vanishing gradient problem can be mitigated by adjusting the weight initialization and the activation function.
--Learning speed and accuracy change depending on the learning rate, and learning progressed once the learning rate optimization methods were used.
--If the regularization strength is too large, accuracy does not improve; if it is too small, overfitting cannot be suppressed.
--Overfitting is suppressed by dropout.
--I confirmed that im2col converts the image into a two-dimensional array; im2col and col2im are not exact inverses of each other.
--It was confirmed that learning on image data proceeds well by using a CNN.
--I felt that high PC specifications are needed for computations such as convolutions.
