This article is for people who have studied machine learning and deep learning but are unsure how the concepts relate when it comes to implementation, and who want to organize their understanding. No detailed mathematical explanation is given; see the Chainer Tutorial for the mathematics.
Deep learning was created by imitating the mechanism of information transmission in human nerve cells, and it has dramatically improved recognition accuracy.
A neural network is a mathematical model that reproduces the behavior of human nerve cells on a computer. Although each individual nerve cell has only simple computing power, advanced recognition and judgment become possible when many cells are connected and work together. The mechanism of information transmission is reproduced with mathematical formulas, using matrix operations and special functions.
A neuron has multiple inputs and one output, and one layer of a neural network consists of multiple neurons.

- **Weight**: A coefficient multiplied with each input. It corresponds to the transmission efficiency of a synapse: the larger the value, the more information is transmitted. It is updated automatically during learning, so only the initial value needs to be set.
- **Bias**: A constant added to the sum of the weighted inputs. It adjusts how easily the neuron is excited. It is also updated automatically during learning, so only the initial value needs to be set.
- **Activation function**: A function that represents the excited state of a neuron. Without an activation function, a neuron's operation would simply be a sum of products, and the neural network would lose the ability to express complex relationships. There are various functions such as the sigmoid function and ReLU, and the problem at hand determines the suitable function to some extent; you have to choose one for each layer. For details, refer to [Type of activation function](#Type of activation function).

As symbols, the weight is written **w** and the bias **b**.
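To make this concrete, here is a minimal NumPy sketch of a single neuron (the values and variable names are placeholders for illustration, not from any particular library):

```python
import numpy as np

def sigmoid(u):
    # Activation function: squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([1.0, 2.0, 3.0])   # three inputs
w = np.array([0.5, -0.2, 0.1])  # weights: transmission efficiency per input
b = 0.3                         # bias: shifts how easily the neuron is excited

u = np.dot(x, w) + b  # sum of inputs multiplied by weights, plus bias
y = sigmoid(u)        # the neuron's single output
print(y)
```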
(Aside) The image recognition library YOLO can recognize objects immediately after importing a trained weight file (weights, biases, number of layers, etc. already adjusted). However, if you ignore the environment the model was trained in, it may misrecognize things, and additional training may fail to converge. For example, as shown in this article, a model trained in the United States recognized "sushi" as "hot dog". In other words, the personality (calculation results) of a network changes depending on the environment (dataset) in which it was raised.
A neural network is constructed by connecting multiple neurons into a network, arranged in layers as shown in the figure (x, v, and y are neurons). Figure: neural network model (quoted from Introduction to Machine Learning Starting with Python).
Layers in a neural network are classified into an input layer, intermediate layers, and an output layer. There is exactly one input layer and one output layer, but the number of intermediate layers can be increased. Increasing the number of intermediate layers changes the behavior of the computation, so it needs to be tuned. In an ordinary neural network, the output of each neuron is connected to the inputs of all neurons in the next layer. Information traveling from input to output is called **forward propagation**; information traveling back from output to input is called **backpropagation**. Backpropagation traces the output results backwards and updates the weights and biases. It is normally done automatically, so once you understand the adjustable parameters and the mechanism, you can use it. The error of the output value is computed by comparing it with the correct label in the dataset, and the weights and biases are adjusted automatically so that the output approaches the correct label.
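For reference, a minimal sketch of forward propagation through one intermediate layer, assuming fully connected layers and a sigmoid activation (sizes and values are placeholders):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([0.5, 0.8])          # input layer (2 neurons)

W1 = np.random.randn(2, 3) * 0.1  # weights: input -> intermediate layer
b1 = np.zeros(3)                  # biases of the intermediate layer
W2 = np.random.randn(3, 1) * 0.1  # weights: intermediate -> output layer
b2 = np.zeros(1)                  # bias of the output layer

# Forward propagation: information flows from input toward output
h = sigmoid(np.dot(x, W1) + b1)   # intermediate layer output
y = np.dot(h, W2) + b2            # output layer (identity activation)
print(y)
```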
Problems handled by neural networks can be divided into classification problems and regression problems.
- **Fully connected**: A model in which all neurons between layers are connected, as in the neural network model figure above.
- **Convolutional**: Often used in image processing. Convolving an image strengthens or weakens the features in it.
- **Recurrent**: A type of neural network that can handle context. Often used for natural language processing and sequence data.
Regression problems predict continuous numbers from trends in data. The basic usage is to collect discrete data, build an approximating function from it, and compute predicted values for unknown data. Typical examples:

- Predict weight from height
- Predict what a user will buy next from their shopping tendencies
- Forecast tomorrow's stock price from past price trends
- Predict rent from property information
Simple regression predicts one result from one input. It is used to express the relationship between two quantities, such as height and weight.
Multiple regression predicts one result from multiple inputs, such as predicting rent from the various attributes of a property. This is the more commonly used form.
A classification problem is the problem of sorting data into a fixed set of classes. Unlike a regression problem, there are multiple outputs, each representing the probability that the input belongs to the corresponding class. Typical examples:

- Classify plants from leaf images
- Classify handwritten characters
- Determine what kind of vehicle appears in an image
A dataset is a group of data in which each piece of training data x is paired with a correct label y. The dataset is further divided into **training data** and **test data**. The training data is used to train the network, while the test data is used to evaluate the performance of the model built from the training data; the training data is usually the larger of the two. If a network trained on the training data also gives good results on the test data, it can handle unknown data. If the test data results are poor, something is wrong with the network itself or with the learning method. The ability of a network to handle unknown data is called **generalization performance**.
Training data is collected manually or automatically, e.g. by scraping. Collection itself can be automated, but judging whether each piece of data is suitable must be done by hand, which makes this the most laborious task. For how to collect data, refer to this article.
Datasets are created manually or with libraries; you can find instructions by searching for "how to make a dataset". For images, create the dataset with an **annotation tool**. Much of the user's time is spent creating this dataset. Image tasks usually require thousands of images to get good results, but since images are hard to collect, the data is usually augmented (padded out).
The training data and correct labels of a dataset are each vectors. For example, correct height labels are represented by a vector of numbers like the following:

[146.2 170.4 169.3 154.5 179.2]

In this case there are 5 neurons in the output layer, and the weight and bias gradients are adjusted so that each output value approaches its correct value. In a classification problem, the correct label is a vector in which the correct class is 1 and everything else is 0, as below:

[0 0 1 0 0]

A sequence of numbers with a single 1 and the rest 0 is called a **one-hot expression**.
Format the data into an easy-to-use form, e.g. by normalization or standardization. Standardization, for example, stabilizes and speeds up learning.
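As a quick sketch of the two transformations (reusing the height vector above as sample data):

```python
import numpy as np

data = np.array([146.2, 170.4, 169.3, 154.5, 179.2])  # e.g. heights

standardized = (data - data.mean()) / data.std()               # mean 0, std 1
normalized = (data - data.min()) / (data.max() - data.min())   # range [0, 1]
```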
In a classification problem, the correct values are expressed in one-hot expression, so insert a program that converts labels to one-hot expression when implementing. With 3 neurons in the output layer, the labels become:

[1 0 0]
[0 1 0]
[0 0 1]
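A minimal sketch of such a conversion, assuming the labels are integers starting at 0:

```python
import numpy as np

labels = np.array([0, 2, 1])  # integer correct-answer labels
n_classes = 3                 # number of neurons in the output layer

one_hot = np.eye(n_classes)[labels]  # each row: 1 at the correct class, else 0
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```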
Finally, the data is divided into training data and test data. The training data is used for learning, while the test data is used to evaluate the generalization performance of the result learned from the training data. Both splits come from the same pool of data; after learning, the results on each are compared in a graph. Setting aside 20% to 30% of the data for testing is said to work well.
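If scikit-learn is available, the split is one line; a sketch with dummy data (test_size=0.2 reserves 20% for testing, matching the guideline above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)        # dummy training data: 100 samples, 4 features
y = np.random.randint(0, 3, 100)  # dummy correct labels

# Reserve 20% of the data for evaluating generalization performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```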
This is the work of building the molds (templates) for the intermediate and output layers used in learning. Set the **activation function** and **loss function** for the intermediate layers and the output layer, respectively. It is a good idea to write the model as a class so it can be reused, and to declare the parameters as variables so they can be changed from outside. If you just want to get learning running quickly rather than understand it, copying someone else's model is enough.
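As one way to write such a class, here is a minimal sketch in Chainer (the framework this article's references are based on); the class name, layer sizes, and activation choices are placeholders to be adjusted for your problem:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    """Fully connected network; sizes are exposed as parameters for reuse."""
    def __init__(self, n_hidden=100, n_out=3):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_hidden)   # input -> intermediate layer
            self.l2 = L.Linear(n_hidden, n_out)  # intermediate -> output layer

    def __call__(self, x):
        h = F.relu(self.l1(x))  # activation function of the intermediate layer
        return self.l2(h)       # output layer; combine with a loss function
```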
**Set the initial weights and biases**. Regarding the types of activation functions, [What is the activation function that I do not understand well](https://newtechnologylifestyle.net/%E3%82%84%E3%81%A3%E3%81%B1%E3%82%8A%E3%82%88%E3%81%8F%E5%88%86%E3%81%8B%E3%82%89%E3%81%AA%E3%81%84%E6%B4%BB%E6%80%A7%E5%8C%96%E9%96%A2%E6%95%B0%E3%81%A8%E3%81%AF/) will be helpful.
Set these by intuition at first, then modify the following four values according to the learning results. The user adjusts them many times based on the output evaluated on the test data. Each is explained [here](#hyperparameters).

- Epoch
- Batch size
- Learning coefficient
- Optimization algorithm
Learning is the process of adjusting the connections between neurons. Learning goes through the following three processes.
Use the ** loss function ** to derive the error.
The **gradient descent method** is used to determine the gradient. The loss function $L$ obtained above is partially differentiated with respect to $w$ to obtain the gradient $\frac{\partial L}{\partial w}$. (Quote: Chainer Tutorial)
The error between the output obtained by forward propagation and the correct answer prepared in advance is propagated backwards, layer by layer. Based on the propagated error, the update amount for the weights and biases is computed for each layer. The updated weight is calculated as follows:
w \leftarrow w - \eta \frac{\partial L}{\partial w}
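In code, one update step corresponds to the sketch below; the gradient would normally come from backpropagation, so its values here are placeholders:

```python
import numpy as np

eta = 0.01              # learning coefficient (the eta in the formula above)
w = np.random.randn(3)  # current weights
grad_w = np.array([0.2, -0.5, 0.1])  # placeholder for dL/dw from backprop

w = w - eta * grad_w    # w <- w - eta * dL/dw
```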
As a result of learning, check what percentage of the training data is judged correctly. The percentage of the test data judged correctly is an important index of whether learning succeeded. Output a graph of epochs (number of passes over the training data) against loss (the value of the loss function) so that it can be checked visually. Estimate what is going wrong from the movement of the graph, adjust the hyperparameters, and train again.
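A sketch of such a graph with matplotlib, assuming the loss values were recorded per epoch during training (the numbers here are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical loss values recorded at each epoch
train_loss = [0.9, 0.6, 0.4, 0.3, 0.25]
test_loss = [0.95, 0.7, 0.5, 0.45, 0.5]  # rising at the end hints at overfitting

plt.plot(train_loss, label="train loss")
plt.plot(test_loss, label="test loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```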
A problem in which the search gets trapped in a locally optimal solution and cannot reach the overall (global) optimum. In some cases the gradient becomes extremely small and learning stops progressing. To find the global optimum, it is sometimes necessary to leave a locally best state once. (Quote: Chainer Tutorial)
**A neural network can become optimized only for a specific range of data and lose the ability to handle unknown data.** In machine learning generally, no estimate can be made for input data that lies too far outside the training data. You could say the network has fallen into a local optimum tuned to one specific pattern. The generated network should be kept a little loose so that it can handle a variety of data. To suppress this overfitting, the user adjusts the various parameters described later. **Too many intermediate layers or neurons causes overfitting through excessive expressiveness.** It can also be caused by an **insufficient number of training samples**.
A problem mainly caused by using the sigmoid function as the activation function. The gradient of the sigmoid function has a maximum value of 0.25 and approaches 0 as the input moves away from 0. During backpropagation, the derivative of the activation function is multiplied into each gradient every time a layer is traced back; with the sigmoid function, the gradients therefore shrink with each layer until they effectively vanish. For this reason, ReLU is often used as the activation function in deep learning.
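The shrinkage is easy to see numerically: the sigmoid derivative is at most 0.25, so multiplying it across many layers drives the gradient toward 0. A small sketch:

```python
import numpy as np

def sigmoid_grad(u):
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)  # maximum value is 0.25, at u = 0

print(sigmoid_grad(0.0))  # 0.25
print(0.25 ** 10)         # ~9.5e-07: best case after tracing back 10 layers
```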
In multi-layer deep learning, the number of weights and biases is enormous, so learning can take days or weeks. To deal with this, you can use a GPU or a high-spec machine, avoid making the network more complicated than necessary, and adjust the parameters.
The problems that come with multi-layering cannot be adjusted away automatically by the machine; the user must adjust for them according to the results. Adjustment and learning therefore have to be repeated many times to get better results. There are the following seven items to adjust, and deciding which one to tweak takes some experience. Hyperparameters, data augmentation, and data preprocessing are adjusted most often.
One pass of learning over all the training data is counted as one epoch. Too many epochs causes over-adaptation (overfitting) to the training data, so stop at an appropriate number.
The number of data samples contained in each of the subsets a dataset is divided into. This parameter affects both learning time and performance. For details, refer to the explanation here.
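For reference, a sketch of how epoch and batch size relate in a typical training loop; the data is dummy and the update itself is left as a comment:

```python
import numpy as np

X = np.random.rand(1000, 4)  # dummy training data
batch_size = 32
n_epochs = 10

for epoch in range(n_epochs):              # one epoch = one pass over all data
    indices = np.random.permutation(len(X))  # shuffle each epoch
    for i in range(0, len(X), batch_size):
        batch = X[indices[i:i + batch_size]]  # one minibatch
        # ... compute the loss on `batch` and update weights and biases here ...
```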
The coefficient applied in gradient descent, which scales the amount of each gradient step. If the learning coefficient is too large, the value of the loss function oscillates or diverges as the parameters are updated repeatedly; if it is too small, convergence takes a long time. Since there is no known optimal value, it has to be found empirically.
Gradient descent optimizes the network by adjusting the weights and biases little by little, based on the gradient, so as to minimize the error. There are various algorithms for this optimization; see [the list of optimization algorithms](#List of optimization algorithms).
Important hyperparameters affecting the success or failure of learning. It is desirable to initialize them with small random values.
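One common way to get "small random values" is to scale a normal distribution; a sketch (the 0.01 scale and layer sizes are placeholder choices):

```python
import numpy as np

n_in, n_out = 4, 100

W = np.random.randn(n_in, n_out) * 0.01  # small random initial weights
b = np.zeros(n_out)                      # biases are often started at zero
```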
A method of stopping learning partway through. As learning progresses, the error on the test data may begin to increase, i.e. overfitting may set in, so learning is terminated before that point. Learning is also terminated when the error stagnates and no further progress is made, to save time.
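A minimal sketch of the idea: stop when the test error has not improved for `patience` consecutive epochs. The per-epoch losses here are hypothetical stand-ins for evaluating the model each epoch:

```python
# Hypothetical test losses per epoch; in practice, evaluate the model each epoch
test_losses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.53, 0.51, 0.54, 0.55, 0.56]

best_loss = float("inf")
patience, wait = 3, 0

for epoch, test_loss in enumerate(test_losses):
    if test_loss < best_loss:
        best_loss, wait = test_loss, 0  # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:            # no improvement for `patience` epochs
            print(f"early stopping at epoch {epoch}")
            break
```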
If the number of training samples is small, overfitting is likely, so it is common to inflate (augment) the samples. Training on varied samples helps improve the generalization performance of the network. This article describes an actual case of overfitting caused by a lack of image samples when training, so refer to it.
Process the input data in advance to make it easier to handle. Preprocessing can improve network performance and speed up learning, and much of the time spent on machine learning goes into it. In Python, the library "pandas" makes preprocessing easy. There are various kinds of preprocessing, such as normalization and standardization; for the methods, refer to this article.
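A sketch of a small pandas preprocessing flow; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [146.2, 170.4, np.nan, 154.5, 179.2],
                   "weight": [40.1, 60.3, 65.2, 48.9, 72.5]})

df = df.fillna(df.mean())         # fill missing values with the column mean
df = (df - df.mean()) / df.std()  # standardize each column: mean 0, std 1
```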
A technique for suppressing overfitting: neurons other than those in the output layer are randomly dropped with a certain probability. Larger networks are more prone to overfitting, so dropout can be used to effectively shrink the network during learning.
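A sketch of (inverted) dropout applied to an intermediate-layer output; during learning each neuron is dropped with probability `p`, and at test time everything is used:

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    if not train:
        return h  # at test time all neurons are used
    mask = np.random.rand(*h.shape) > p  # randomly drop neurons with prob. p
    return h * mask / (1.0 - p)          # rescale to keep the expected value

h = np.random.rand(5)  # placeholder intermediate-layer output
print(dropout(h))
```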
Limiting the weights. Constraining the weights prevents them from taking extreme values and **prevents the network from being trapped in a locally optimal solution.**
ReLU outputs 0.0 for inputs below 0, and outputs the input unchanged for inputs above 0. **It is often used for classification problems.** Because inputs of 0 or less are cut off, it also serves for noise removal. Programmatically, it can be implemented by truncating values of 0 or less with an if statement.
As the input value becomes large, the sigmoid converges to a constant value, so it is not used with large input values. (Wikipedia)
**A function suited to classification problems**, often used in the output layer of a classification network. Summing this function's outputs for k = 1 to n gives 1. The softmax function therefore normalizes so that the sum of all outputs is 1, **no matter what values the input vector x takes**. This makes it a good fit for classification, where the outputs of the output-layer neurons are treated as probabilities summing to 1.
y_i=\frac{e^{x_i}}{\sum_{k=1}^{N}e^{x_k}}
The identity function returns the input unchanged as the output. **It is often used as the activation function of the output layer in regression problems.**
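For reference, minimal NumPy versions of the activation functions above (a sketch; real frameworks provide these built in):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # 0 for inputs below 0, identity above

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # saturates toward 0 or 1 for large |x|

def softmax(x):
    e = np.exp(x - np.max(x))        # shift inputs for numerical stability
    return e / np.sum(e)             # outputs sum to 1: usable as probabilities

def identity(x):
    return x                         # output layer of regression problems
```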
A loss function often used when solving **regression problems**.
L= \frac{1}{N}\sum_{n=1}^{N}(t_n-y_n)^2
A loss function often used when solving **classification problems**. Its advantage is that learning is fast when the gap between the output and the correct value is large.
L=\sum_{k=1}^{K}t_k(-\log(y_k))
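Minimal NumPy versions of both loss functions (t is the correct label, y the network output; the small constant avoids log(0)):

```python
import numpy as np

def mean_squared_error(y, t):
    return np.mean((t - y) ** 2)         # for regression problems

def cross_entropy(y, t, eps=1e-7):
    return -np.sum(t * np.log(y + eps))  # for classification (t is one-hot)

t = np.array([0, 0, 1])        # one-hot correct label
y = np.array([0.1, 0.2, 0.7])  # e.g. softmax output
print(cross_entropy(y, t))     # ~0.357, i.e. -log(0.7)
```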
The following are notes on the typical ones. For other optimization algorithms, see [this article](https://qiita.com/ZoneTsuyoshi/items/8ef6fa1e154d176e25b8#%E7%A2%BA%E7%8E%87%E7%9A%84%E5%8B%BE%E9%85%8D%E9%99%8D%E4%B8%8B%E6%B3%95sgd-stochastic-gradient-descent).
- **SGD**: An algorithm that randomly draws a sample for each update. It has the characteristic of being hard to trap in a local optimum. It can be implemented with simple code, but learning often takes time because the update amount cannot be adapted to the progress of learning.
- **AdaGrad**: The update amount is adjusted automatically; as learning progresses, the learning rate gradually decreases. Since the learning coefficient is the only constant to set, it takes little tuning.
- **RMSProp**: Overcomes AdaGrad's weakness of learning stagnating as the update amount shrinks.
- **Adam**: An improved version of RMSProp. It seems to be the most widely used.
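To show the difference in spirit, here is a sketch of one parameter update under plain SGD and under AdaGrad, based on their standard update formulas; the gradient values are placeholders:

```python
import numpy as np

w = np.zeros(3)
grad = np.array([0.2, -0.5, 0.1])  # placeholder gradient from backpropagation
lr = 0.1

# SGD: a fixed-size step along the gradient
w_sgd = w - lr * grad

# AdaGrad: accumulate squared gradients; the effective learning rate shrinks
# for parameters that have already received large updates
h = np.zeros(3)
h += grad ** 2
w_adagrad = w - lr * grad / (np.sqrt(h) + 1e-7)
```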
References:

- Chainer Tutorial
- Introduction to Machine Learning Starting with Python
- First Deep Learning: Neural Networks and Backpropagation Learned with Python
- Kikagaku: Basics of Deep Learning
- https://www.sbbit.jp/article/cont1/33345
- https://qiita.com/nishiy-k/items/1e795f92a99422d4ba7b
- https://qiita.com/Lickey/items/b97c3450d7def207bfbf