As backpropagation proceeds toward the lower layers, the gradient becomes smaller and smaller. As a result, in the update by gradient descent the parameters of the lower layers barely change, and training does not converge to the optimal solution.
For the sigmoid function, large input values give outputs whose gradient is close to zero, which can cause the vanishing gradient problem. The maximum value of the derivative of the sigmoid function is 0.25; since these derivatives are multiplied together layer by layer, the gradient approaches 0. ReLU gives good results in avoiding the vanishing gradient problem: because it returns its input as-is once it exceeds the threshold (0), the gradient is unlikely to vanish. It also has merits in terms of sparsity and delivers good results in terms of accuracy.
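As a quick numerical check (a minimal NumPy sketch, not part of the course code), the sigmoid derivative never exceeds 0.25, while the ReLU derivative is 1 for positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); maximum 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # d/dx ReLU(x) = 1 for x > 0, 0 otherwise; does not shrink the gradient
    return (np.asarray(x) > 0).astype(np.float64)

x = np.linspace(-5.0, 5.0, 101)
print(sigmoid_grad(x).max())  # -> 0.25
print(relu_grad(x).max())     # -> 1.0
```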
Xavier initialization: each weight element is divided by the square root of the number of nodes in the previous layer.
self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size)
self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size)
He initialization: each weight element is divided by the square root of the number of nodes in the previous layer and multiplied by $\sqrt{2}$.
self.params['W1'] = np.random.randn(input_size, hidden_size) / np.sqrt(input_size) * np.sqrt(2)
self.params['W2'] = np.random.randn(hidden_size, output_size) / np.sqrt(hidden_size) * np.sqrt(2)
Batch normalization is a method to suppress the bias of the input data in mini-batch units. A layer containing the batch normalization process is added before or after the values are passed to the activation function.
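As an illustration (a minimal sketch of the forward pass only; the names `gamma`, `beta`, and `eps` are assumed here, not taken from the course code), each feature is standardized over the mini-batch and then scaled and shifted by learnable parameters:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """x: mini-batch of shape (batch_size, features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize each feature
    return gamma * x_hat + beta            # learnable scale and shift

x = np.random.randn(32, 100) * 5 + 3       # biased, widely spread inputs
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly 0 and 1 per feature
```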
Policy for setting the initial learning rate:
--Set a large initial learning rate and gradually decrease it
--Use a variable learning rate for each parameter
--Optimize the learning rate using a learning-rate optimization method
Momentum: after subtracting the product of the learning rate and the gradient of the error with respect to the parameter, add the product of the inertia (momentum) coefficient and the value obtained by subtracting the previous weight from the current weight.
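Written in the same notation as the AdaGrad and RMSProp updates later in this section (a standard formulation; $\mu$ denotes the momentum coefficient):

V_t = \mu V_{t-1} - \epsilon \Delta E \\
w^{(t+1)} = w^{(t)} + V_t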
--It reaches the global optimum rather than getting stuck in a local optimum.
--Once it falls into a valley, it reaches the lowest point quickly.
AdaGrad: subtract the product of a redefined learning rate and the gradient of the error with respect to the parameter.
h_0 = \theta \\
h_t = h_{t-1} + (\Delta E)^2 \\
w^{(t+1)} = w^{(t)} - \epsilon \frac{1}{\sqrt{h_t}+\theta}\Delta E
--It approaches the optimal value well on gently sloping error surfaces.
--Since the learning rate gradually decreases, it may cause a saddle point problem.
RMSProp: subtract the product of a redefined learning rate and the gradient of the error with respect to the parameter. It eliminates the disadvantages of AdaGrad.
h_0 = \theta \\
h_t = a h_{t-1} + (1-a)(\Delta E)^2 \\
w^{(t+1)} = w^{(t)} - \epsilon \frac{1}{\sqrt{h_t}+\theta}\Delta E
--It reaches the global optimum rather than getting stuck in a local optimum.
--Hyperparameters require less adjustment.
Adam: an optimization algorithm that combines the two ideas above: the exponentially decaying moving average of past gradients (from Momentum) and the exponentially decaying moving average of past squared gradients (from RMSProp).
An algorithm that inherits the merits of both Momentum and RMSProp.
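A minimal NumPy sketch of these update rules for a single parameter array (illustrative only; names such as `lr`, `rho`, `beta1`, and `beta2` are assumptions, not the course implementation):

```python
import numpy as np

def adagrad_update(w, grad, h, lr=0.01, eps=1e-7):
    # accumulate squared gradients; the effective learning rate keeps shrinking
    h += grad ** 2
    w -= lr * grad / (np.sqrt(h) + eps)
    return w, h

def rmsprop_update(w, grad, h, lr=0.01, rho=0.99, eps=1e-7):
    # exponentially decaying average of squared gradients avoids AdaGrad's decay to zero
    h = rho * h + (1 - rho) * grad ** 2
    w -= lr * grad / (np.sqrt(h) + eps)
    return w, h

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # combines a Momentum-style first moment and an RMSProp-style second moment
    # t is the 1-based step count used for bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```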
Section3
A large weight can represent an important feature, but weights that are too large may indicate overfitting. The weights are suppressed by adding a regularization term to the error function.
What is regularization? Imposing rules on the degrees of freedom of the network and restricting them in a certain direction.
E_n(w) + \frac{1}{p}\lambda ||x||_p \\
||x||_p = (|x_1|^p + \dots + |x_n|^p)^{\frac{1}{p}}
When p = 1, it is called L1 regularization. When p = 2, it is called L2 regularization.
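As a sketch (a common practical form of these penalties: λΣ|W| for L1 and (1/2)λΣW² for L2; the dict of weight matrices named `params` is an assumed setup, not the course code):

```python
import numpy as np

def regularization_term(params, lam, p):
    """p=1: L1 (Lasso), p=2: L2 (Ridge); lam is the regularization strength."""
    total = 0.0
    for W in params.values():
        if p == 1:
            total += np.sum(np.abs(W))        # L1 norm of the weights
        else:
            total += 0.5 * np.sum(W ** 2)     # half the squared L2 norm
    return lam * total

def regularization_grad(W, lam, p):
    # extra gradient added to dW during backpropagation
    if p == 1:
        return lam * np.sign(W)
    return lam * W
```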
Dropout: training while randomly deleting (dropping out) nodes.
It can be interpreted as training different models without changing the amount of data.
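A minimal sketch of the idea (a textbook-style, non-inverted dropout; `dropout_ratio` is an assumed parameter name):

```python
import numpy as np

def dropout_forward(x, dropout_ratio=0.5, train=True):
    if train:
        # randomly drop nodes during training
        mask = np.random.rand(*x.shape) > dropout_ratio
        return x * mask, mask
    # at inference time, scale the outputs instead of dropping nodes
    return x * (1.0 - dropout_ratio), None
```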
Section4 CNN It is mainly used to create models for images. It is composed of a convolution layer, a pooling layer, and a fully connected layer.
It can learn 3D data (height, width, and channels) as-is and pass it on to the next layer.
The filter corresponds to the weights in a fully connected layer.
Padding: changing the size of the output by adding values around the edges of the input image. When filled with zeros, it is called zero padding (zeros have little effect on learning).
The amount the filter moves in the x and y directions as it slides over the input is called the stride.
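For reference, with input size $H \times W$, filter size $FH \times FW$, padding $P$, and stride $S$, the output size of the convolution is given by the standard formula:

OH = \frac{H + 2P - FH}{S} + 1 \\
OW = \frac{W + 2P - FW}{S} + 1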
Channels: the depth dimension. The number of feature maps in this direction equals the number of channels, e.g. 3 for RGB.
Max pooling: takes the output of the convolution layer as input and outputs the maximum value in each region.
Average pooling: takes the output of the convolution layer as input and outputs the average value of each region.
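An illustrative NumPy sketch of max and average pooling over non-overlapping 2×2 windows (a simplification; the exercise code uses im2col instead):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """x: feature map of shape (H, W) with H and W divisible by 2."""
    H, W = x.shape
    windows = x.reshape(H // 2, 2, W // 2, 2)   # split into 2x2 blocks
    if mode == "max":
        return windows.max(axis=(1, 3))          # max pooling
    return windows.mean(axis=(1, 3))             # average pooling

x = np.arange(16).reshape(4, 4).astype(float)
print(pool2x2(x, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2x2(x, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]
```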
Section5 AlexNet Winner of ILSVRC 2012. It consists of 5 convolution layers with pooling layers, followed by 3 fully connected layers. Dropout is used on the fully connected layer outputs of size 4096.
(1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45
The derivative of the sigmoid function is $(1-\mathrm{sigmoid}(x)) \cdot \mathrm{sigmoid}(x)$. It takes its maximum when sigmoid(x) = 0.5, giving $(1-0.5) \cdot 0.5 = 0.25$.
(2)
Because all weights start from the same value, they all receive the same update, and parameter tuning does not proceed.
--Calculation time is shortened.
--Vanishing gradients are less likely to occur.
Momentum
--It is easy to obtain a global optimum solution instead of a local optimum solution.
AdaGrad
--It converges easily toward the optimum even on gently sloping surfaces.
RMSProp
--Requires less hyperparameter adjustment.
(a) When the hyperparameter is set to a large value, all weights approach 0 in the limit. (b) Setting the hyperparameter to 0 results in non-linear regression. (c) The bias term is also regularized. (d) In ridge regression, a regularization term is added to the hidden layer.
(a)
(a) This is the behavior of ridge regression (L2 regularization). (b) Linear regression does not become non-linear regression through regularization. (c) The bias term is not regularized. (d) The regularization term is added to the error function, not the hidden layer.
Left: L2 regularization. Right: L1 regularization.
7×7
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_2_1_vanishing_gradient.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_4_optimizer.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_5_overfiting.ipynb
https://github.com/Tomo-Horiuchi/rabbit/blob/master/part2/2Day/2_6_simple_convolution_network.ipynb
--The vanishing gradient problem was improved by changing the activation function from sigmoid to ReLU.
--Regarding weight initialization, Xavier improved the vanishing gradient problem with both ReLU and sigmoid, while He showed improvement only with ReLU. From this result, the vanishing gradient problem can be improved by adjusting both the initial weight values and the activation function.
--Learning speed and accuracy change depending on the learning rate, and learning progressed once the learning-rate optimization methods were used.
--If the regularization strength is too large, accuracy does not improve; if it is too small, overfitting cannot be suppressed.
--Overfitting is suppressed by dropout.
--With im2col, the image becomes a two-dimensional array; the results of im2col and col2im are not consistent (col2im does not restore the original array).
--Learning on image data can be advanced by using CNN; I felt that high PC specs are needed for computations such as convolution.