Review of deep learning (Part 1)
Vanishing gradient problem: As the error is propagated back toward the lower layers, the gradient becomes progressively smaller. A smaller gradient means the parameter updates in gradient descent barely change the parameters. As a result, training can no longer converge to the optimal solution. (Gradient vanishing can be confirmed by visualization, so checking it visually is also important.)
Sigmoid function: A function that changes smoothly between 0 and 1. Whereas the step function could only signal ON/OFF, the sigmoid can convey both ON/OFF and the strength of the signal. Disadvantage: even for large input values the output changes only slightly, which can cause the vanishing gradient problem.
Confirmation test (2-2): When the sigmoid function is differentiated, its derivative takes the maximum value when the input value is 0. Choose the correct maximum value. Sigmoid function:
f(x) = \frac{1}{1+e^{-x}}
When differentiated,
f'(x) = (1 - f(x))\,f(x)
Since f(0) = 0.5, the maximum value of the derivative is (1 - 0.5) × 0.5 = 0.25.
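As a quick numerical check (a minimal NumPy sketch, not part of the original notebook), the derivative can be evaluated on a grid to confirm that its maximum of 0.25 is taken at x = 0:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = (1 - f(x)) * f(x)
    fx = sigmoid(x)
    return (1.0 - fx) * fx

x = np.linspace(-10, 10, 2001)
grad = sigmoid_grad(x)
print(grad.max())          # -> 0.25
print(x[np.argmax(grad)])  # -> 0.0 (the maximum is taken at x = 0)
```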
Solutions to the vanishing gradient problem include the choice of activation function, the setting of the initial weight values, batch normalization, and so on.
・ Selection of activation function: the ReLU function contributes to avoiding the vanishing gradient problem and to sparsification.
・ Initial values of the weights
Method using Xavier initialization. The activation functions used with Xavier initialization are: ・ ReLU function ・ Sigmoid function ・ Hyperbolic tangent function
The initial values are set by dividing the weight elements by the square root of the number of nodes in the previous layer.
Method using He initialization. The activation function used with He initialization is: ・ ReLU function
The initial values are set by dividing the weight elements by the square root of the number of nodes in the previous layer and multiplying by √2.
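As a reference, a minimal NumPy sketch (the layer sizes and the Gaussian base distribution are illustrative assumptions, not from the exercise code) of how the two initializations scale the weights:

```python
import numpy as np

n_prev = 100  # number of nodes in the previous layer (assumed)
n_curr = 50   # number of nodes in the current layer (assumed)

# Xavier: divide Gaussian weights by the square root of the number of nodes in the previous layer
w_xavier = np.random.randn(n_prev, n_curr) / np.sqrt(n_prev)

# He: divide by the square root of the number of nodes in the previous layer and multiply by sqrt(2)
w_he = np.random.randn(n_prev, n_curr) * np.sqrt(2.0 / n_prev)
```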
Confirmation test (2-3): What kind of problem occurs when the initial value of the weight is set to 0? State briefly.
If the weights are set to 0, the product of every input value and its weight is 0, so only the bias value is passed to the output layer. All weights then receive the same update, so the parameters cannot be trained correctly.
Batch normalization: A method of suppressing the bias of the input data in mini-batch units.
Usage: insert a batch normalization layer before or after passing the values to the activation function.
Confirmation test (2-4): List two commonly cited effects of batch normalization. ・ Learning hardly depends on the initial values ・ Overfitting is suppressed ・ Computation is faster (suppressing the bias of the data reduces the amount of computation)
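As an illustration, a minimal sketch of the per-mini-batch normalization (forward pass only; gamma, beta, and eps are illustrative assumptions):

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-7):
    """Normalize each feature over the mini-batch, then scale and shift.
    x: mini-batch of shape (batch_size, n_features)."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 100) * 5.0 + 3.0   # deliberately biased input data
out = batch_norm_forward(x)
print(out.mean(), out.std())               # roughly 0 and 1
```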
Jupyter Exercise (2-5): Sigmoid estimation
In the estimation using the sigmoid function above, the gradient vanishes during error backpropagation, as can be seen from the graph, and the accuracy remained as low as about 0.1.
Next, estimation is performed using the ReLU function. The only difference from the sigmoid estimation is that the activation function is changed to ReLU (the sigmoid version is left commented out). As a result, gradient vanishing did not occur and error backpropagation worked.
Next, we check the effect of the weight initialization. The initial weights of the sigmoid network are set with Xavier initialization and the estimation is run again.
As the graph shows, when Xavier initialization is applied in place of the plain Gaussian initial weights of the first sigmoid estimation, gradient vanishing does not occur.
Next, estimation is performed using He initialization, another way of setting the initial weights.
Gradient vanishing also did not occur in the estimation that uses the ReLU function together with He initialization.
From the ReLU + He combination alone, it was not possible to tell whether the absence of gradient vanishing was due to He initialization itself, so an estimation using the sigmoid function together with He initialization was run as a check.
From the resulting graph, it was confirmed that gradient vanishing does not occur with He initialization either.
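The idea behind these experiments can be reproduced in outline as follows (a sketch under assumed layer sizes, not the exercise notebook itself): data is pushed through several sigmoid layers and the mean local gradient a(1 - a) is checked, since values near 0 indicate saturation and vanishing gradients.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_layer_grads(init_std, n_layers=5, n_nodes=100):
    """Mean local gradient a * (1 - a) of the activations in each sigmoid layer."""
    a = np.random.randn(1000, n_nodes)
    grads = []
    for _ in range(n_layers):
        w = np.random.randn(n_nodes, n_nodes) * init_std
        a = sigmoid(np.dot(a, w))
        grads.append((a * (1.0 - a)).mean())
    return grads

n_nodes = 100
print(sigmoid_layer_grads(init_std=1.0))                     # plain Gaussian: layers saturate, local gradients shrink
print(sigmoid_layer_grads(init_std=1.0 / np.sqrt(n_nodes)))  # Xavier: local gradients stay much closer to the 0.25 maximum
```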
Guidelines for setting the initial learning rate: start with a large learning rate and gradually lower it → use a learning rate optimization method.
・ Momentum: subtract the product of the error gradient and the learning rate, then add the product of the previous weight update and the momentum (inertia) coefficient.
・ AdaGrad
・ RMSProp
・ Adam
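As a reference, a minimal sketch of the momentum update (the class interface and hyperparameter values are illustrative assumptions, in the style of common NumPy implementations):

```python
import numpy as np

class Momentum:
    """SGD with momentum: v <- momentum * v - lr * grad,  w <- w + v."""
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            # keep a fraction of the previous update (inertia) and subtract the scaled gradient
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```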
Overfitting: the learning curves diverge, with the training error continuing to fall while the test error does not.
Overfitting is suppressed by regularization (constraining the degrees of freedom of the network). Degrees of freedom of the network: number of layers, number of nodes, parameter values, etc.
Types of regularization: ・ L1 regularization ・ L2 regularization ・ Dropout
Confirmation test (2-10): Linear models used in machine learning can be regularized by limiting the model weights. Ridge regression is one such regularization method; select the correct statement about its characteristics.
Answer: (a) When the hyperparameter is set to a large value, all weights approach arbitrarily close to 0. Ridge regression is a regularized linear regression obtained by adding the squared norm of the learned weights to the linear regression error.
Weight decay: Overfitting tends to occur when some weights take on large values. As a countermeasure, a regularization term is added to suppress the weight values. However, large weights also indicate parameters that are important for learning, so the point is to control the weights so that their magnitudes stay within a range where overfitting does not occur.
L1, L2 regularization: add the p-norm of the weights to the error function. If p = 1 it is called L1 regularization, and if p = 2 it is called L2 regularization.
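A minimal sketch of adding the penalty to the error function (the weight matrices, the data-term placeholder, and the strength lam are illustrative assumptions):

```python
import numpy as np

def regularization_term(weights, lam, p):
    """Penalty added to the error: p=1 gives L1 regularization, p=2 gives L2 regularization."""
    if p == 1:
        return lam * sum(np.sum(np.abs(w)) for w in weights)
    if p == 2:
        return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    raise ValueError("p must be 1 or 2")

weights = [np.random.randn(100, 50), np.random.randn(50, 10)]  # example weight matrices
data_loss = 0.37                                               # placeholder for the data error term
loss = data_loss + regularization_term(weights, lam=0.1, p=2)
```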
Confirmation test (2-11): Identify the graph showing L1 regularization. Answer: the graph on the right. Because of its characteristics, L1 regularization can drive weights exactly to 0, so it supports sparse estimation.
Dropout: a method commonly used for regularization. Randomly deleting nodes during learning is called dropout. It allows different models to be learned without changing the amount of data.
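A minimal sketch of a dropout layer (the dropout ratio and class interface are illustrative assumptions):

```python
import numpy as np

class Dropout:
    """Randomly delete (zero out) nodes while training."""
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # nodes whose mask entry is False are deleted for this forward pass
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at test time, no nodes are dropped; scale to match the training-time expectation
        return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # gradients flow only through the nodes that survived
        return dout * self.mask
```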
Convolutional neural network (CNN): a method used mainly for image classification. It can be applied not only to images but also to audio data.
CNN flow (example): input layer -> convolution layer -> convolution layer -> pooling layer -> convolution layer -> convolution layer -> pooling layer -> fully connected layer -> output layer
Convolution layer: the input values are multiplied by the filter, the bias is added, and the result is converted to an output value by the activation function. By using a convolution layer, 3D image data consisting of height, width, and channels (the channels carry the spatial structure) can be learned as it is and passed on to the following layers.
・ Bias: a bias is added to the value obtained by multiplying the input by the filter.
・ Padding: the result of multiplying the input by the filter is smaller than the input layer, so by padding the input with fixed values the output can be kept the same size as the input layer. Zero padding, which adds 0s, is mainly used.
・ Stride: determines how far the filter moves between calculation positions on the input.
・ Channels: the number of channels is the number of planes into which the data is decomposed (height, width, and depth) in order to learn the spatial structure.
In a fully connected layer, 3D image data would have to be flattened and processed as 1D data. The convolution layer was devised so that 3D data can be learned as it is.
Jupyter Exercise (2-17)
im2col is an abbreviation of "image to column", a method that converts multidimensional data into a two-dimensional array. For the input_data above, a multidimensional array (a 4x4 matrix with 2 channels) is created with random.rand(), and im2col is executed with a 3x3 filter, stride 1, and padding 0. The im2col function converts the multidimensional array into a two-dimensional array. The im2col function is described next.
The im2col function takes as arguments the input values, the vertical and horizontal filter sizes, the stride (= 1), and the padding (= 0). Each dimension of the array is obtained from input_data.shape, and out_h and out_w are the height and width of the output.
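A sketch of im2col along the lines of the well-known "Deep Learning from Scratch" implementation (details may differ from the exercise notebook):

```python
import numpy as np

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    """Convert a 4D array (N, C, H, W) into a 2D array whose rows are the
    filter-sized patches, so that convolution becomes a matrix product."""
    N, C, H, W = input_data.shape
    out_h = (H + 2 * pad - filter_h) // stride + 1
    out_w = (W + 2 * pad - filter_w) // stride + 1

    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

    for y in range(filter_h):
        y_max = y + stride * out_h
        for x in range(filter_w):
            x_max = x + stride * out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    return col.transpose(0, 4, 5, 1, 2, 3).reshape(N * out_h * out_w, -1)

input_data = np.random.rand(1, 2, 4, 4)   # 2 channels of 4x4 data, as in the exercise
col = im2col(input_data, 3, 3, stride=1, pad=0)
print(col.shape)                          # (4, 18): 2x2 output positions, 2*3*3 values per patch
```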
Next, to convert a two-dimensional array back into a multidimensional array, col2im ("column to image") is used.
Note that an array converted to a two-dimensional array by im2col is not restored to the original multidimensional array simply by applying col2im; because the im2col and col2im procedures differ, exact restoration is not possible.
Pooling layer: there are two main types, max pooling and average pooling. Max pooling outputs the maximum value of the target region of the input image data; average pooling outputs the average value of the target region.
Confirmation test (2-18): Answer the size of the output image when an input image of size 6x6 is convolved with a filter of size 2x2, with stride 1 and padding 1. Answer: the output size is (6 + 2×1 - 2)/1 + 1 = 7, i.e. a 7x7 image.
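The answer follows from the usual output-size formula; a small check with the numbers from the test (a sketch, with an assumed helper name):

```python
def conv_output_size(input_size, filter_size, stride=1, pad=0):
    # output size = (H + 2P - FH) / S + 1
    return (input_size + 2 * pad - filter_size) // stride + 1

print(conv_output_size(6, 2, stride=1, pad=1))  # -> 7, so the output image is 7x7
```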
Jupyter Exercise (2-19): Max pooling function program
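Since the notebook itself is not reproduced here, the following is a sketch of a max-pooling forward pass built on the im2col function shown above (shapes and names follow the same style; details may differ from the exercise):

```python
import numpy as np

class MaxPooling:
    """Output the maximum value of each target region of the input."""
    def __init__(self, pool_h, pool_w, stride=2, pad=0):
        self.pool_h = pool_h
        self.pool_w = pool_w
        self.stride = stride
        self.pad = pad

    def forward(self, x):
        N, C, H, W = x.shape
        out_h = (H + 2 * self.pad - self.pool_h) // self.stride + 1
        out_w = (W + 2 * self.pad - self.pool_w) // self.stride + 1

        # expand each pooling window into a row, then take the row-wise maximum
        col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
        col = col.reshape(-1, self.pool_h * self.pool_w)
        out = np.max(col, axis=1)
        # restore the (batch, channel, height, width) layout
        return out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)
```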
Latest CNN — AlexNet: consists of five convolution layers and pooling layers, followed by three fully connected layers.