An amateur stumbled in Deep Learning from scratch Note: Chapter 6

Introduction

This is a memo of the things I stumbled over while studying Chapter 6 of "Deep Learning from Scratch: The theory and implementation of deep learning learned with Python".

The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Chapter 8 / Summary)

Chapter 6 Learning Techniques

This chapter describes techniques for streamlining learning and measures against overfitting.

6.1 Parameter update

This section explains how, once the gradient has been obtained during neural network training, it is used to optimize the weight parameters.

The book opens with the example of an eccentric adventurer who forbids himself the use of a map, blindfolds himself, and for some reason wants to find the deepest valley floor as efficiently as possible. This setup is a bit too eccentric for me to relate to :sweat: (image: tankenka.png)

After a bit of googling I found an example I liked better: the opening of ml4a's "Neural Network Training". You reach the top of a mountain, the sun goes down, and you have to get back to base camp quickly, but your flashlight is weak and only lights up the ground at your feet. What do you do? There are probably many other parables along the same lines.

Once you put yourself in the adventurer's position, the shortcomings of SGD become very easy to grasp. You can only see the slope at your current location, not the surrounding terrain, so all you can do is keep walking in the direction of the steepest descent. That can certainly mean zigzagging back and forth inefficiently.
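As a minimal sketch, the SGD update itself is nothing more than subtracting the gradient scaled by the learning rate. This follows the small optimizer-class style the book uses, though details may differ slightly from its code:

```python
import numpy as np

class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dL/dW."""

    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # params and grads are dicts keyed by parameter name, e.g. 'W1', 'b1'
        for key in params.keys():
            params[key] -= self.lr * grads[key]
```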

Momentum, AdaGrad, RMSProp, and Adam are introduced as improved methods, but my problem is how to pronounce them. Momentum is fine, but I have no idea how to read the other three. Is something like "Ada-grad", "R-M-S-prop", and "Adam" acceptable?

I'm getting off track, but not knowing how to pronounce a term is a serious problem: if someone starts using a wrong reading at an in-house study session, the staff will embarrass themselves outside the company. A lot of people seem to share this worry. Seeing the huge number of likes and comments on @ryounagaoka's roundup "Embarrassing English pronunciations rampant in the IT industry", I really feel the inconvenience of working in Japanese. If I had known I would end up in this industry, I would have wanted to be born in an English-speaking country :sweat_smile:

Back to the topic. The improved methods introduced in the book are designed to exploit not only the gradient at the current position but also momentum and the history of past updates that brought you there. Momentum and AdaGrad are explained in detail in the book, so there was nothing to stumble over. Details of the other methods are omitted, but @deaikei's [Introduction to OPTIMIZER ~ From Linear Regression to Adam to Eve](https://qiita.com/deaikei/items/29d4550fa5066184329a) organizes them, including methods not covered in the book, in an easy-to-understand way.
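Since the book walks through Momentum and AdaGrad in detail, here is a compact sketch of both update rules in the same dict-of-parameters style as above (a rough reimplementation, not the book's code verbatim):

```python
import numpy as np

class Momentum:
    """Keeps a velocity v so that past gradients carry the update along."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            # New velocity = decayed old velocity minus the current gradient step
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]

class AdaGrad:
    """Accumulates squared gradients and shrinks the step for heavily updated parameters."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # Small constant avoids division by zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```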

As you might guess from the sheer number of methods, there is no single universally best one; the results change depending on the structure of the target neural network and its hyperparameters. Choosing the best method and tuning the hyperparameters seems to be quite difficult.

The book then actually compares the methods on the MNIST dataset: SGD is the slowest to learn, while Momentum, AdaGrad, and Adam are all roughly equally fast. A 5-layer neural network suddenly appears in this experiment. Everything so far had used two layers, so I thought it would be easier to follow if the same network were used, but apparently the differences are hard to see unless the network is somewhat complex.

6.2 Initial value of weight

This section explains what values the weights should be set to before training begins.

The argument goes that the initial weights should be made small in order to suppress overfitting, but I wasn't sure why that suppresses overfitting. Since the inputs are multiplied by the weights, large weights certainly mean a large influence on the result, so overfitting sounds plausible, but it was a slightly unsatisfying start.

Setting aside the overfitting question, it was easy to understand that if the activation distributions do not spread out nicely, gradients vanish and many neurons end up producing similar outputs, which hurts performance. To supplement 5.5.2 "Sigmoid layer" in the previous chapter's memo: for the sigmoid function, the derivative used in backpropagation is $y(1-y)$, so if the weights are large the output $y$ saturates and the derivative becomes small, and multiplying such factors together during backpropagation makes the gradient smaller and smaller, until learning stops making progress. From the standpoint of vanishing gradients, ReLU looks very strong, since its derivative is $1$ when $x > 0$.
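A tiny numerical check of that reasoning (my own sketch, not from the book): even a moderately saturated sigmoid has a small derivative, and raising it to the power of a few layers makes it vanish quickly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The backprop factor for a sigmoid is y * (1 - y), which is at most 0.25.
# The more the unit saturates (large |x|, e.g. from large weights), the
# smaller the factor; multiplied over several layers, the gradient vanishes.
for x in [0.5, 2.0, 5.0]:
    y = sigmoid(x)
    dy = y * (1.0 - y)
    print(f"x={x}: y={y:.3f}, dy={dy:.4f}, dy**5={dy**5:.2e}")  # five layers' worth
```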

Perhaps the unsatisfying point at the beginning refers to this vanishing gradient of the sigmoid: if the initial weights are large, the gradient vanishes on the input data seen early in training, it becomes hard for the subsequent data to have an effect, and overfitting occurs, hence the advice to keep the weights small. Maybe that is what is meant.

The explanation uses the terms Gaussian distribution, standard deviation, and histogram. The Gaussian distribution is just the normal distribution, so googling "what is a normal distribution" turns up plenty of explanations. Standard deviation and histograms are explained very clearly in the video lessons Try IT > Mathematics I > Data distribution and correlation.

Also, here again, there are two methods whose content I understand but whose names I don't know how to pronounce. The Xavier initial value, suited to activation functions that are (roughly) linear, and the He initial value, suited to ReLU: do you just read them as "Xavier's initial value" and "He's initial value"? :sweat:
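For my own notes, the two initialization rules boil down to choosing the standard deviation of the random weights based on the number of nodes in the previous layer. A minimal sketch (`node_num` is a hypothetical layer width):

```python
import numpy as np

node_num = 100  # number of nodes in the previous layer (hypothetical)

# Xavier initial value: standard deviation 1 / sqrt(n), suited to sigmoid / tanh
w_xavier = np.random.randn(node_num, node_num) / np.sqrt(node_num)

# He initial value: standard deviation sqrt(2 / n), suited to ReLU
w_he = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)
```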

6.3 Batch Normalization

Whereas the discussion of initial weight values tried to get each layer's activation distribution to spread out nicely, Batch Normalization is a method that forcibly normalizes the distributions partway through the network. Specifically, treating a mini-batch as one unit, an adjustment layer is inserted between the Affine layer and the ReLU layer, and the values are adjusted so that their mean is 0 and their variance is 1.

This adjustment is given by equation (6.7), and I stumbled a little here.

\begin{align}
&\mu _B \leftarrow \frac {1}{m} \sum _{i=1} ^{m} x_i \\
&\sigma^2 _B \leftarrow \frac {1}{m} \sum _{i=1} ^{m} (x_i - \mu_B)^2 \\
&\hat{x_i} \leftarrow \frac {x_i - \mu_B}{\sqrt{\sigma^2_B+ \epsilon}}
\end{align}

$B$ in the formula is the mini-batch of input data, that is, the $m$ inputs $B = \{x_1, x_2, \cdots, x_m\}$. $\mu_B$ ($\mu$: read "mu") on the first line is the mean of the mini-batch's inputs, and $\sigma^2_B$ ($\sigma$: "sigma") on the second line is their variance. Up to that point I was fine, since I had just reviewed the Try IT lessons mentioned above, but I couldn't see why the third line normalizes the mean to 0 and the variance to 1. The explanation in Mathematics Learned with Concrete Examples > Probability and data processing > The meaning and purpose of standardization in statistics cleared this up for me.
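To convince myself that the third line really does produce mean 0 and variance 1, I tried a minimal NumPy sketch of the forward normalization (the book then additionally scales and shifts the result with learnable parameters, which I leave out here):

```python
import numpy as np

def batchnorm_forward(x, eps=1e-7):
    """Normalize a mini-batch x of shape (m, features) per feature,
    following equation (6.7): subtract the mean, divide by the std."""
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized values
    return x_hat

x = np.random.randn(10, 3) * 5 + 2          # toy mini-batch with mean 2, std 5
x_hat = batchnorm_forward(x)
print(x_hat.mean(axis=0))                   # roughly 0 for every feature
print(x_hat.var(axis=0))                    # roughly 1 for every feature
```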

The book omits the explanation of backpropagation through this Batch Norm layer and refers to [Frederik Kratzert's blog "Understanding the backward pass through Batch Normalization Layer"](https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html) for details. For people like me who shy away from English, I recommend @t-tkd3a's "Understanding Batch Normalization" instead.

6.4 Regularization

This section explains Weight decay and Dropout, two methods for suppressing overfitting.

The word "regularization" appears in the heading, but there is no explanation of the word in the book, and it suddenly appeared like "$ \ lambda $ is a hyperparameter that controls the strength of regularization" and I was a little confused. .. Wikipedia says "to solve well-posed problems and prevent overfitting. A method of adding information to. " It seems that it refers to all methods that penalize the loss function, such as Weight decay. By the way, the reading of $ \ lambda $ is lambda.

What interested me most was the experimental results. The book first reproduces overfitting in a network whose expressiveness has been raised by adding layers while the amount of training data is reduced, and then confirms that Weight decay and Dropout can suppress the overfitting on the training data. However, the recognition accuracy on the test data, which is what really matters, does not improve.

In the Weight decay comparison (Figures 6-20 and 6-21), the training accuracy does come down, but the test accuracy hardly changes. And in the Dropout result (Figure 6-23), the test accuracy actually drops, which leaves me feeling uneasy. Isn't the whole point of suppressing overfitting to improve the accuracy on the test data?

What also bothers me about Figure 6-23 is that both the training and test accuracies are still improving at the final epoch, 301. That made me want to run more epochs, so I set dropout_rate = 0.15 in the book's source code ch06/overfit_dropout.py to match Figure 6-23 and ran it for 601 epochs. (image: dropout.png) Looking at the result, the training and test accuracies around epoch 500, where they have stabilized, are about the same as around epoch 100 without Dropout, so it looks like learning was simply slowed down.
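For reference, the Dropout layer these experiments toggle on and off is itself quite simple: during training each unit is dropped at random, and the same mask is reused in the backward pass. A sketch along the lines of the book's implementation (details may differ slightly):

```python
import numpy as np

class Dropout:
    """Randomly zero out units during training; scale outputs at test time."""

    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # Keep each unit with probability (1 - dropout_ratio)
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # At test time, use all units but scale down to match training output
        return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # Gradients flow only through the units that were kept
        return dout * self.mask
```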

It may just be that in the book's example the amount of training data is too small, so that even when overfitting is suppressed, the recognition accuracy on the test data cannot improve.

6.5 Hyperparameter validation

This section explains how to search for good hyperparameter values.

It is interesting that a seemingly crude approach like random search gives better results than grid search, which looks more systematic. Bayesian optimization is also introduced as a more sophisticated method, but there seem to be various other techniques for tuning hyperparameters automatically; @cvusk's "Various automatic hyperparameter tuning methods" introduces them one by one.
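The random search itself is only a few lines: sample each hyperparameter from a range on a log scale, train briefly, and keep the best trials. A minimal sketch (the ranges and the `train_and_evaluate` helper are hypothetical):

```python
import numpy as np

# Sample each hyperparameter at random on a log scale, train briefly,
# and keep the combinations that score best on the validation data.
for trial in range(10):
    weight_decay = 10 ** np.random.uniform(-8, -4)   # e.g. 1e-8 .. 1e-4
    lr = 10 ** np.random.uniform(-6, -2)             # e.g. 1e-6 .. 1e-2
    # val_acc = train_and_evaluate(lr, weight_decay)  # hypothetical training run
    print(f"trial {trial}: lr={lr:.2e}, weight_decay={weight_decay:.2e}")
# Afterwards, narrow the ranges around the best trials and repeat.
```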

6.6 Summary

I was able to learn various techniques used in training neural networks. The experiment figures in "6.4 Regularization" still leave me with a nagging feeling, but there were no major stumbling blocks.

That's all for this chapter. If you notice any mistakes, I would be grateful if you could point them out. (To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Chapter 8 / Summary)
