Table of contents: [Deep Learning: Day1 NN](https://qiita.com/matsukura04583/items/6317c57bc21de646da8e) [Deep Learning: Day2 CNN](https://qiita.com/matsukura04583/items/29f0dcc3ddeca4bf69a2) [Deep Learning: Day3 RNN](https://qiita.com/matsukura04583/items/9b77a238da4441e0f973) [Deep Learning: Day4 Reinforcement Learning / TensorFlow](https://qiita.com/matsukura04583/items/50806b750c8d77f2305d)
As a merit, error backpropagation computes the derivatives by propagating them backward from the computed error, avoiding unnecessary recursive calculation and reducing the computational cost.
Section1) Overview of the vanishing gradient problem
(Review of the flow so far and an overall view of the issues)
Vanishing gradient problem
As error backpropagation proceeds toward the lower layers, the gradient becomes smaller and smaller. As a result, the parameters of the lower layers are barely changed by gradient descent updates, and training does not converge to the optimal value. Sigmoid function → (Problem) For inputs with large absolute value, the change in the output is very small, which can cause the vanishing gradient problem.
1-1 Activation function
ReLU function
ReLU has achieved good results by helping to avoid the vanishing gradient problem and by inducing sparsity.
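As a reference, here is a minimal NumPy sketch of ReLU and its gradient (the function names are my own illustration, not the course code):

```python
import numpy as np

def relu(x):
    # ReLU: outputs x where x > 0, otherwise 0
    return np.maximum(0, x)

def relu_grad(x):
    # The gradient is 1 where x > 0 and 0 elsewhere: active units pass the
    # gradient through unchanged (helping against vanishing gradients),
    # inactive units are cut to 0 (producing sparsity).
    return (x > 0).astype(np.float64)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```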
**Initial weight setting: He** Activation function used with He initialization: the ReLU function. How the initial values are set: divide the weight elements by the square root of the number of nodes in the previous layer and multiply by √2.
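A minimal sketch of Xavier and He initialization as described above (the layer sizes and variable names are illustrative, not the course's code):

```python
import numpy as np

n_prev, n_next = 784, 100  # illustrative layer sizes

# Xavier initialization (used with sigmoid/tanh):
# divide by the square root of the number of nodes in the previous layer
W_xavier = np.random.randn(n_prev, n_next) / np.sqrt(n_prev)

# He initialization (used with ReLU):
# divide by sqrt(previous nodes) and multiply by sqrt(2)
W_he = np.random.randn(n_prev, n_next) / np.sqrt(n_prev) * np.sqrt(2)

print(W_xavier.std(), W_he.std())  # roughly 1/sqrt(784) and sqrt(2/784)
```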
Batch normalization is a method that suppresses bias in the input data in mini-batch units. Where is batch normalization used? ⇒ Add a layer containing the batch normalization processing before or after passing values to the activation function.
u^{(l)}=w^{(l)}z^{(l-1)}+b^{(l)} \quad \text{or} \quad z
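A minimal sketch of the batch normalization computation for one mini-batch (training-time statistics only; gamma, beta and the small eps constant are the usual learnable scale/shift and numerical-stability term, and the names are my own, not the course code):

```python
import numpy as np

def batch_norm(u, gamma=1.0, beta=0.0, eps=1e-7):
    # u: (batch_size, nodes) pre-activation values of one mini-batch
    mu = u.mean(axis=0)                    # per-node mean over the mini-batch
    var = u.var(axis=0)                    # per-node variance over the mini-batch
    u_hat = (u - mu) / np.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * u_hat + beta            # scale and shift

u = np.random.randn(32, 10) * 5 + 3        # biased, widely spread inputs
z = batch_norm(u)
print(z.mean(axis=0).round(6), z.std(axis=0).round(3))  # ~0 and ~1 per node
```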
Section2) Overview of learning rate optimization method
(Review of the flow so far and an overall view of the issues)
Review of gradient descent
Review of the learning rate:
・When the learning rate is too large, the optimal value is never reached and the updates diverge.
・When the learning rate is too small, it does not diverge, but convergence takes a very long time.
・It also becomes difficult to converge to the global optimum (the search tends to settle in a local optimum).
Section2) Continued
2-1 Momentum
+ 2-3 RMSProp
+ 2-4 Adam
What is Adam? ⇒ An optimization algorithm that combines both of the above: the exponentially decaying average of past gradients from momentum, and the exponentially decaying average of past squared gradients from RMSProp.
Adam's merit? ⇒ It is an algorithm that combines the merits of both momentum and RMSProp.
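For reference, a minimal NumPy sketch of the three update rules discussed above (momentum, RMSProp, Adam); the hyperparameter values are common textbook defaults and the function names are my own, not necessarily the course's implementation:

```python
import numpy as np

def momentum_update(w, grad, v, lr=0.01, mu=0.9):
    # Keep an exponentially decaying velocity built from past gradients.
    v = mu * v - lr * grad
    return w + v, v

def rmsprop_update(w, grad, h, lr=0.01, rho=0.99, eps=1e-7):
    # Keep an exponentially decaying average of past squared gradients.
    h = rho * h + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(h) + eps), h

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # Adam combines both ideas: m is the momentum-style first moment,
    # v is the RMSProp-style second moment, with bias correction by step t.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```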
Section3) Overview of overfitting (Review of the flow so far and an overall view of the issues) The learning curves for the test error and the training error diverge from each other. ⇒ Learning has become specialized to particular training samples. The causes are a large number of parameters, inappropriate parameter values, a large number of nodes, and so on. ⇒ In short, too much freedom in the network (number of layers, number of nodes, parameter values, etc.).
3-1 L1 regularization, L2 regularization
Regularization is to constrain the degree of freedom of the network (number of layers, number of nodes, parameter values, etc ...).
**⇒ Use regularization methods to suppress overfitting**
Weight decay
**Causes of overfitting**
・Overfitting can occur when some weights take on large values.
・A weight with a large value is treated as an important value in learning, and such large weights cause overfitting.
**Overfitting countermeasures**
・Adding a regularization term to the error and learning so that the weights are suppressed causes the weight values to spread out.
・The weights must be controlled so that they stay below the magnitude at which overfitting tends to occur, while still allowing variation in their magnitudes.
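As a rough sketch of this idea, the following adds an L1 or L2 (weight decay) term to the error and its gradient; the rate value and names are illustrative, and the gradient terms correspond to the `param` / `sign(param)` answers in the example challenges later in this post:

```python
import numpy as np

def regularized_loss_and_grad(loss, grad, W, rate=0.1, kind='L2'):
    # loss, grad: the original error and its gradient with respect to W
    if kind == 'L2':
        # L2 (weight decay): add rate/2 * ||W||^2; its gradient is rate * W ("param")
        return loss + 0.5 * rate * np.sum(W ** 2), grad + rate * W
    else:
        # L1 (Lasso): add rate * sum(|W|); its gradient is rate * sign(W) ("sign(param)")
        return loss + rate * np.sum(np.abs(W)), grad + rate * np.sign(W)
```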
3-2 Dropout
Overfitting issue: a large number of nodes. What is dropout? ⇒ Randomly deleting nodes during training. As a merit, it can be interpreted as training different models without changing the amount of data.
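A minimal sketch of training-time dropout, assuming a dropout ratio of 0.5 (the names and the inference-time scaling convention are illustrative, not necessarily the course's implementation):

```python
import numpy as np

def dropout_forward(x, dropout_ratio=0.5, train=True):
    if train:
        # Randomly "delete" nodes: each unit is kept with probability 1 - dropout_ratio.
        mask = np.random.rand(*x.shape) > dropout_ratio
        return x * mask, mask
    # At inference time, scale the outputs instead of dropping units.
    return x * (1.0 - dropout_ratio), None

x = np.random.randn(4, 5)
out, mask = dropout_forward(x)
print(mask)  # a different random mask each call -> effectively a different sub-model
```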
About convolutional neural networks
Section4) Concept of convolutional neural network
CNN structure diagram
LeNet structure diagram
4-1 Convolution layer
4-1-1 Bias
Convolution layer arithmetic concept (bias)
4-1-2 padding
Convolution layer arithmetic concept (padding)
4-1-3 Stride
Convolution layer arithmetic concept (stride)
4-1-4 channels
Convolution layer arithmetic concept (channel)
Issues when learning images with fully connected layers. Disadvantage of fully connected layers ⇒ An image is 3D data (height, width, and channels), but it is processed as flattened 1D data. ⇒ The relationships between the RGB channels are therefore not reflected in learning.
4-2 Pooling layer
Conceptual diagram of pooling layer
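As a reference for the pooling layer above, here is a minimal NumPy sketch of 2x2 max pooling with stride 2 on a single-channel image (the function name and shapes are my own illustration; the course implementation works via im2col, shown later):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (H, W) single-channel image; H and W are assumed to be even
    H, W = x.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            out[i // 2, j // 2] = x[i:i+2, j:j+2].max()  # keep the max of each 2x2 window
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5.  7.] [13. 15.]]
```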
Section5) Latest CNN
+ 5-1 AlexNet
AlexNet model description
Model structure
It consists of five convolution layers and pooling layers, followed by three fully connected layers.
Measures to prevent overfitting
Dropout is applied to the outputs of the fully connected layers of size 4096.
[P12] Find dz/dx using the chain rule.
z = t^2, \quad t = x + y
⇒ [Discussion] It can be calculated as follows.
\frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx}
Since $z = t^2$, differentiating with respect to $t$ gives $\frac{dz}{dt}=2t$.
Since $t = x + y$, differentiating with respect to $x$ gives $\frac{dt}{dx}=1$.
Therefore
\frac{dz}{dx}=2t \cdot 1=2t=2(x+y)
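As a quick check of the hand calculation, SymPy gives the same result:

```python
import sympy as sp

x, y = sp.symbols('x y')
t = x + y
z = t ** 2
print(sp.diff(z, x))  # 2*x + 2*y, i.e. 2(x + y)
```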
[P20] When the sigmoid function is differentiated, the maximum value is taken when the input value is 0. Select the correct value from the options. (1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45
⇒ [Discussion] Derivative of the sigmoid:
(sigmoid)'=(1-sigmoid)(sigmoid)
Since the sigmoid function takes the value 0.5 at input 0,
(sigmoid)'=(1-0.5)(0.5)=0.25
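A quick numerical check with NumPy confirms that the maximum of the sigmoid's derivative is 0.25, reached at input 0 (the helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return (1.0 - s) * s  # (sigmoid)' = (1 - sigmoid) * sigmoid

x = np.linspace(-5, 5, 1001)
g = sigmoid_grad(x)
print(g.max(), x[g.argmax()])  # 0.25 at x = 0.0
```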
[P28] What kind of problem occurs when the initial values of the weights are set to 0? Explain briefly. ⇒ [Discussion] All the weights receive exactly the same update, so meaningful gradients that differentiate the weights cannot be obtained and learning does not proceed correctly. Since the formulas for weight initialization were given above, we make use of them.
[P31] List two commonly cited effects of batch normalization. ⇒ [Discussion] The distribution of values in the intermediate layers becomes appropriate, and learning in the intermediate layers stabilizes. Although it is a relatively new method proposed in 2015, it is now widely used.
[P36] Example challenge
Correct answer: data_x[i:i_end], data_t[i:i_end] ・[Explanation] This is the process of extracting one batch of data (batch size at a time). ⇒ [Discussion] The options look similar and it is easy to make a mistake, so be careful.
[P63] Confirmation test
⇒ [Discussion] The answer is (a). It is good to remember it together with the figure.
[P68] Answer which of the graphs shows L1 regularization. ⇒ [Discussion] The answer is the graph on the right.
It is good to remember it together with the figure. Lasso has a characteristic rhombus-shaped figure (Ridge's is circular).
[P69] Example Challenge
⇒ [Discussion] The answer is (4) param. It is good to remember it together with the calculation formula, and to understand L1 and L2 correctly.
[P71] Example challenge
⇒ [Discussion] The answer is sign(param). [Explanation] The L1 norm is |param|, so its gradient, sign(param), is added to the gradient of the error. sign is the sign function; it appears here for the first time, so it also needs to be understood.
[P78] Example Challenge ⇒ [Discussion] Correct answer: image[top:bottom, left:right, :] [Explanation] Consider that the format of the image is (height, width, channel).
[P100] Confirmation test: Answer the size of the output image when an input image of size 6x6 is convolved with a filter of size 2x2. The stride and padding are both 1. ⇒ [Discussion] Answer: 7×7. Let the input height be H, the input width W, the output height OH, the output width OW, the filter height FH, the filter width FW, the stride S, and the padding P.
OH =\frac{H+2P-FH}{S}+1 =\frac{6+2 \cdot 1-2}{1}+1=7
OW =\frac{W+2P-FW}{S}+1 =\frac{6+2 \cdot 1-2}{1}+1=7
Since it is a fixed calculation method, it is convenient to remember it as a formula.
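For example, the formula can be written as a small helper function (the names are my own) and checked against this problem:

```python
def conv_output_size(H, W, FH, FW, stride=1, pad=0):
    # OH = (H + 2P - FH) / S + 1, OW = (W + 2P - FW) / S + 1
    OH = (H + 2 * pad - FH) // stride + 1
    OW = (W + 2 * pad - FW) // stride + 1
    return OH, OW

print(conv_output_size(6, 6, 2, 2, stride=1, pad=1))  # (7, 7)
```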
・Result of changing to the ReLU-Xavier combination
・Result of changing to the Sigmoid-He combination
**[try] Let's check the processing of im2col
・Comment out the line that performs the transpose inside the function and run the code below.
・Try changing the size of each dimension of input_data, as well as the filter size, stride, and padding.**
⇒ [Discussion] The results of the exercise are as follows.
```python
# Checking the behavior of im2col
# (the im2col function is assumed to be defined earlier in the course notebook)
import numpy as np

input_data = np.random.rand(2, 1, 4, 4)*100//1  # (number, channel, height, width)
print('========== input_data ===========\n', input_data)
print('==============================')
filter_h = 3
filter_w = 3
stride = 1
pad = 0
col = im2col(input_data, filter_h=filter_h, filter_w=filter_w, stride=stride, pad=pad)
print('============= col ==============\n', col)
print('==============================')
```
Try changing the size of each dimension of input_data and the filter size, stride, and padding as follows.
```python
filter_h = 6
filter_w = 6
stride = 2
pad = 1
```
・It is necessary to understand that col2im does not exactly invert im2col (the data does not come back in exactly the same form). ・Their use cases are different in the first place: im2col is used for the convolution computation, while col2im is used to return to the image format for the output.
**[try] Let's check the processing of col2im ・Convert the col produced in the im2col check back to an image and inspect it** ⇒ [Discussion]
```python
# Apply col2im to the col produced above
# (the col2im function is assumed to be defined earlier in the course notebook)
img = col2im(col, input_shape=input_data.shape, filter_h=filter_h, filter_w=filter_w, stride=stride, pad=pad)
print(img)
```
## DN37_Jupyter Exercise (3)
・Note that the convolution processing takes a long time to train. To work without stress, it is recommended to use a higher-spec PC or a machine equipped with a GPU.