Table of contents: [Deep Learning: Day1 NN](https://qiita.com/matsukura04583/items/6317c57bc21de646da8e) [Deep Learning: Day2 CNN](https://qiita.com/matsukura04583/items/29f0dcc3ddeca4bf69a2) [Deep Learning: Day3 RNN](https://qiita.com/matsukura04583/items/9b77a238da4441e0f973) [Deep Learning: Day4 Reinforcement Learning / TensorFlow](https://qiita.com/matsukura04583/items/50806b750c8d77f2305d)
As a merit, error backpropagation computes the derivatives by propagating them backward from the computed error, avoiding unnecessary recursive calculation and reducing the computational cost.
Section1) Overview of the vanishing gradient problem
(Review of the flow so far and an overall view of the issues)
Vanishing gradient problem
As error backpropagation proceeds toward the lower layers, the gradient becomes smaller and smaller. As a result, the parameters of the lower layers are barely changed by gradient descent updates, and training does not converge to the optimal value. Sigmoid function → (Problem) For inputs with large absolute value, the change in the output is very small, which can cause the vanishing gradient problem.
1-1 Activation function
ReLU function
ReLU has achieved good results by helping to avoid the vanishing gradient problem and by inducing sparsity.
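As a reference, here is a minimal NumPy sketch of ReLU and its gradient (the function names are my own illustration, not the course code):

```python
import numpy as np

def relu(x):
    # ReLU: outputs x where x > 0, otherwise 0
    return np.maximum(0, x)

def relu_grad(x):
    # The gradient is 1 where x > 0 and 0 elsewhere: active units pass the
    # gradient through unchanged (helping against vanishing gradients),
    # inactive units are cut to 0 (producing sparsity).
    return (x > 0).astype(np.float64)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```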
**Initial weight setting: He** Activation function used with He initialization: the ReLU function. How the initial values are set: divide the weight elements by the square root of the number of nodes in the previous layer and multiply by √2.
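A minimal sketch of Xavier and He initialization as described above (the layer sizes and variable names are illustrative, not the course's code):

```python
import numpy as np

n_prev, n_next = 784, 100  # illustrative layer sizes

# Xavier initialization (used with sigmoid/tanh):
# divide by the square root of the number of nodes in the previous layer
W_xavier = np.random.randn(n_prev, n_next) / np.sqrt(n_prev)

# He initialization (used with ReLU):
# divide by sqrt(previous nodes) and multiply by sqrt(2)
W_he = np.random.randn(n_prev, n_next) / np.sqrt(n_prev) * np.sqrt(2)

print(W_xavier.std(), W_he.std())  # roughly 1/sqrt(784) and sqrt(2/784)
```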
Batch normalization is a method that suppresses bias in the input data in mini-batch units. Where is batch normalization used? ⇒ Add a layer containing the batch normalization processing before or after passing values to the activation function.
u^{(l)}=w^{(l)}z^{(l-1)}+b^{(l)} \quad \text{or} \quad z
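A minimal sketch of the batch normalization computation for one mini-batch (training-time statistics only; gamma, beta and the small eps constant are the usual learnable scale/shift and numerical-stability term, and the names are my own, not the course code):

```python
import numpy as np

def batch_norm(u, gamma=1.0, beta=0.0, eps=1e-7):
    # u: (batch_size, nodes) pre-activation values of one mini-batch
    mu = u.mean(axis=0)                    # per-node mean over the mini-batch
    var = u.var(axis=0)                    # per-node variance over the mini-batch
    u_hat = (u - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * u_hat + beta            # scale and shift

u = np.random.randn(32, 10) * 5 + 3        # biased, widely spread inputs
z = batch_norm(u)
print(z.mean(axis=0).round(6), z.std(axis=0).round(3))  # ~0 and ~1 per node
```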
Section2) Overview of learning rate optimization method
(Review of the flow so far and an overall view of the issues)
Review of gradient descent
Review of the learning rate:
・When the learning rate is too large, the optimal value is never reached and the updates diverge.
・When the learning rate is too small, it does not diverge, but convergence takes a very long time.
・It also becomes difficult to converge to the global optimum (the search tends to settle in a local optimum).
Section2) Continued
2-1 Momentum
+ 2-3 RMSProp
+ 2-4 Adam
What is Adam? ⇒ An optimization algorithm that combines both of the above: the exponentially decaying average of past gradients from momentum, and the exponentially decaying average of past squared gradients from RMSProp.
Adam's merit? ⇒ It is an algorithm that combines the merits of both momentum and RMSProp.
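For reference, a minimal NumPy sketch of the three update rules discussed above (momentum, RMSProp, Adam); the hyperparameter values are common textbook defaults and the function names are my own, not necessarily the course's implementation:

```python
import numpy as np

def momentum_update(w, grad, v, lr=0.01, mu=0.9):
    # Keep an exponentially decaying velocity built from past gradients.
    v = mu * v - lr * grad
    return w + v, v

def rmsprop_update(w, grad, h, lr=0.01, rho=0.99, eps=1e-7):
    # Keep an exponentially decaying average of past squared gradients.
    h = rho * h + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(h) + eps), h

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # Adam combines both ideas: m is the momentum-style first moment,
    # v is the RMSProp-style second moment, with bias correction by step t.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```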
Section3) Overview of overfitting (Review of the flow so far and an overall view of the issues) The learning curves for the test error and the training error diverge from each other. ⇒ Learning has become specialized to particular training samples. The causes are a large number of parameters, inappropriate parameter values, a large number of nodes, and so on. ⇒ In short, too much freedom in the network (number of layers, number of nodes, parameter values, etc.).
3-1 L1 regularization, L2 regularization
Regularization is to constrain the degree of freedom of the network (number of layers, number of nodes, parameter values, etc ...).
**⇒ Use regularization methods to suppress overfitting**
Weight decay
**Causes of overfitting**
・Overfitting can occur when some weights take on large values.
・A weight with a large value is treated as an important value in learning, and such large weights cause overfitting.
**Overfitting countermeasures**
・Adding a regularization term to the error and learning so that the weights are suppressed causes the weight values to spread out.
・The weights must be controlled so that they stay below the magnitude at which overfitting tends to occur, while still allowing variation in their magnitudes.
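As a rough sketch of this idea, the following adds an L1 or L2 (weight decay) term to the error and its gradient; the rate value and names are illustrative, and the gradient terms correspond to the `param` / `sign(param)` answers in the example challenges later in this post:

```python
import numpy as np

def regularized_loss_and_grad(loss, grad, W, rate=0.1, kind='L2'):
    # loss, grad: the original error and its gradient with respect to W
    if kind == 'L2':
        # L2 (weight decay): add rate/2 * ||W||^2; its gradient is rate * W ("param")
        return loss + 0.5 * rate * np.sum(W ** 2), grad + rate * W
    else:
        # L1 (Lasso): add rate * sum(|W|); its gradient is rate * sign(W) ("sign(param)")
        return loss + rate * np.sum(np.abs(W)), grad + rate * np.sign(W)
```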
3-2 Dropout
Overfitting issue: a large number of nodes. What is dropout? ⇒ Randomly deleting nodes during training. As a merit, it can be interpreted as training different models without changing the amount of data.
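A minimal sketch of training-time dropout, assuming a dropout ratio of 0.5 (the names and the inference-time scaling convention are illustrative, not necessarily the course's implementation):

```python
import numpy as np

def dropout_forward(x, dropout_ratio=0.5, train=True):
    if train:
        # Randomly "delete" nodes: each unit is kept with probability 1 - dropout_ratio.
        mask = np.random.rand(*x.shape) > dropout_ratio
        return x * mask, mask
    # At inference time, scale the outputs instead of dropping units.
    return x * (1.0 - dropout_ratio), None

x = np.random.randn(4, 5)
out, mask = dropout_forward(x)
print(mask)  # a different random mask each call -> effectively a different sub-model
```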
About convolutional neural networks
Section4) Concept of convolutional neural network
CNN structure diagram
LeNet structure diagram
4-1 Convolution layer
4-1-1 Bias
Convolution layer arithmetic concept (bias)
4-1-2 padding
Convolution layer arithmetic concept (padding)
4-1-3 Stride
Convolution layer arithmetic concept (stride)
4-1-4 channels
Convolution layer arithmetic concept (channel)
Issues when learning images with fully connected layers. Disadvantage of fully connected layers ⇒ An image is 3D data (height, width, and channels), but it is processed as flattened 1D data. ⇒ The relationships between the RGB channels are therefore not reflected in learning.
4-2 Pooling layer
Conceptual diagram of pooling layer
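As a reference for the pooling layer above, here is a minimal NumPy sketch of 2x2 max pooling with stride 2 on a single-channel image (the function name and shapes are my own illustration; the course implementation works via im2col, shown later):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (H, W) single-channel image; H and W are assumed to be even
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = x[i:i+2, j:j+2].max()  # keep the max of each 2x2 window
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5.  7.] [13. 15.]]
```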
Section5) Latest CNN
+ 5-1 AlexNet
AlexNet model description
Model structure
It consists of five convolution layers and pooling layers, followed by three fully connected layers.
Measures to prevent overfitting
Dropout is applied to the outputs of the fully connected layers of size 4096.
[P12] Find dz/dx using the chain rule.
z = t^2, \quad t = x + y
⇒ [Discussion] It can be calculated as follows.
\frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx}
Since $z = t^2$, differentiating with respect to $t$ gives $\frac{dz}{dt}=2t$.
Since $t = x + y$, differentiating with respect to $x$ gives $\frac{dt}{dx}=1$.
Therefore
\frac{dz}{dx}=2t \cdot 1=2t=2(x+y)
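As a quick check of the hand calculation, SymPy gives the same result:

```python
import sympy as sp

x, y = sp.symbols('x y')
t = x + y
z = t ** 2
print(sp.diff(z, x))  # 2*x + 2*y, i.e. 2(x + y)
```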
[P20] When the sigmoid function is differentiated, the maximum value is taken when the input value is 0. Select the correct value from the options. (1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45
⇒ [Discussion] Derivative of the sigmoid:
(sigmoid)'=(1-sigmoid)(sigmoid)
Since the sigmoid function takes the value 0.5 at input 0,
(sigmoid)'=(1-0.5)(0.5)=0.25
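A quick numerical check with NumPy confirms that the maximum of the sigmoid's derivative is 0.25, reached at input 0 (the helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return (1.0 - s) * s  # (sigmoid)' = (1 - sigmoid) * sigmoid

x = np.linspace(-5, 5, 1001)
g = sigmoid_grad(x)
print(g.max(), x[g.argmax()])  # 0.25 at x = 0.0
```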
[P28] What kind of problem occurs when the initial values of the weights are set to 0? Explain briefly. ⇒ [Discussion] All the weights receive exactly the same update, so meaningful gradients that differentiate the weights cannot be obtained and learning does not proceed correctly. Since the formulas for weight initialization were given above, we make use of them.
[P31] List two commonly cited effects of batch normalization. ⇒ [Discussion] The distribution of values in the intermediate layers becomes appropriate, and learning in the intermediate layers stabilizes. Although it is a relatively new method proposed in 2015, it is now widely used.
[P36] Example challenge
Correct answer: data_x[i:i_end], data_t[i:i_end] ・[Explanation] This is the process of extracting one batch of data (batch size at a time). ⇒ [Discussion] The options look similar and it is easy to make a mistake, so be careful.
[P63] Confirmation test
⇒ [Discussion] The answer is (a). It is good to remember it together with the figure.
[P68] Answer which of the graphs shows L1 regularization. ⇒ [Discussion] The answer is the graph on the right.
It is good to remember it together with the figure. Lasso has a characteristic rhombus-shaped figure (Ridge's is circular).
[P69] Example Challenge
⇒ [Discussion] The answer is (4) param. It is good to remember it together with the calculation formula, and to understand L1 and L2 correctly.
[P71] Example challenge
⇒ [Discussion] The answer is sign(param). [Explanation] The L1 norm is |param|, so its gradient, sign(param), is added to the gradient of the error. sign is the sign function; it appears here for the first time, so it also needs to be understood.
[P78] Example Challenge ⇒ [Discussion] Correct answer: image[top:bottom, left:right, :] [Explanation] Consider that the format of the image is (height, width, channel).
[P100] Confirmation test: Answer the size of the output image when an input image of size 6x6 is convolved with a filter of size 2x2. The stride and padding are both 1. ⇒ [Discussion] Answer: 7×7. Let the input height be H, the input width W, the output height OH, the output width OW, the filter height FH, the filter width FW, the stride S, and the padding P.
OH =\frac{H+2P-FH}{S}+1 =\frac{6+2 \cdot 1-2}{1}+1=7
OW =\frac{W+2P-FW}{S}+1 =\frac{6+2 \cdot 1-2}{1}+1=7
Since it is a fixed calculation method, it is convenient to remember it as a formula.
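For example, the formula can be written as a small helper function (the names are my own) and checked against this problem:

```python
def conv_output_size(H, W, FH, FW, stride=1, pad=0):
    # OH = (H + 2P - FH) / S + 1, OW = (W + 2P - FW) / S + 1
    OH = (H + 2 * pad - FH) // stride + 1
    OW = (W + 2 * pad - FW) // stride + 1
    return OH, OW

print(conv_output_size(6, 6, 2, 2, stride=1, pad=1))  # (7, 7)
```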
・Result of changing to the ReLU-Xavier combination
・Result of changing to the Sigmoid-He combination
**[try] Let's check the processing of im2col
・Comment out the line that performs the transpose inside the function and run the code below.
・Try changing the size of each dimension of input_data, as well as the filter size, stride, and padding.**
⇒ [Discussion] The results of the exercise are as follows.
```python
# Checking the behavior of im2col
# (the im2col function is assumed to be defined earlier in the course notebook)
import numpy as np

input_data = np.random.rand(2, 1, 4, 4)*100//1  # (number, channel, height, width)
print('========== input_data ===========\n', input_data)
print('==============================')
filter_h = 3
filter_w = 3
stride = 1
pad = 0
col = im2col(input_data, filter_h=filter_h, filter_w=filter_w, stride=stride, pad=pad)
print('============= col ==============\n', col)
print('==============================')
```
Try changing the size of each dimension of input_data and the filter size, stride, and padding as follows.
```python
filter_h = 6
filter_w = 6
stride = 2
pad = 1
```
・It is necessary to understand that col2im does not exactly invert im2col (the data does not come back in exactly the same form). ・Their use cases are different in the first place: im2col is used for the convolution computation, while col2im is used to return to the image format for the output.
**[try] Let's check the processing of col2im ・Convert the col produced in the im2col check back to an image and inspect it** ⇒ [Discussion]
```python
# Apply col2im to the col produced above
# (the col2im function is assumed to be defined earlier in the course notebook)
img = col2im(col, input_shape=input_data.shape, filter_h=filter_h, filter_w=filter_w, stride=stride, pad=pad)
print(img)
```
## DN37_Jupyter Exercise (3)
・Note that the convolution processing takes a long time to train. To work without stress, it is recommended to use a higher-spec PC or a machine equipped with a GPU.