This article is for people who have studied machine learning and deep learning but are unsure how the concepts relate when it comes to implementation, and who want to organize their understanding. No detailed mathematical explanation is given; see the Chainer Tutorial for the mathematics.
Deep learning was created by imitating the mechanism of information transmission in human nerve cells, and it has dramatically improved recognition accuracy.
A neural network is a mathematical model that reproduces the behavior of human nerve cells on a computer. Although each individual nerve cell has only simple computing power, advanced recognition and judgment become possible when many cells are connected and work together. The mechanism of information transmission is reproduced with mathematical formulas, using matrix operations and special functions.
A neuron has multiple inputs and one output, and one layer of a neural network consists of multiple neurons.

- **Weight**: A coefficient multiplied with each input. It corresponds to the transmission efficiency of a synapse: the larger the value, the more information is transmitted. It is updated automatically during learning, so only the initial value needs to be set.
- **Bias**: A constant added to the sum of the weighted inputs. It adjusts how easily the neuron is excited. It is also updated automatically during learning, so only the initial value needs to be set.
- **Activation function**: A function that represents the excited state of a neuron. Without an activation function, a neuron's operation would simply be a sum of products, and the neural network would lose the ability to express complex relationships. There are various functions such as the sigmoid function and ReLU, and the problem at hand determines the suitable function to some extent; you have to choose one for each layer. For details, refer to [Type of activation function](#Type of activation function).

As symbols, the weight is written **w** and the bias **b**.
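To make this concrete, here is a minimal NumPy sketch of a single neuron (the values and variable names are placeholders for illustration, not from any particular library):

```python
import numpy as np

def sigmoid(u):
    # Activation function: squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([1.0, 2.0, 3.0])   # three inputs
w = np.array([0.5, -0.2, 0.1])  # weights: transmission efficiency per input
b = 0.3                         # bias: shifts how easily the neuron is excited

u = np.dot(x, w) + b  # sum of inputs multiplied by weights, plus bias
y = sigmoid(u)        # the neuron's single output
print(y)
```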
(Aside) The image recognition library YOLO can recognize objects immediately after importing a trained weight file (weights, biases, number of layers, etc. already adjusted). However, if you ignore the environment the model was trained in, it may misrecognize things, and additional training may fail to converge. For example, as shown in this article, a model trained in the United States recognized "sushi" as "hot dog". In other words, the personality (calculation results) of a network changes depending on the environment (dataset) in which it was raised.
A neural network is constructed by connecting multiple neurons into a network, arranged in layers as shown in the figure (x, v, and y are neurons). Figure: neural network model (quoted from Introduction to Machine Learning Starting with Python).
Layers in a neural network are classified into an input layer, intermediate layers, and an output layer. There is exactly one input layer and one output layer, but the number of intermediate layers can be increased. Increasing the number of intermediate layers changes the behavior of the computation, so it needs to be tuned. In an ordinary neural network, the output of each neuron is connected to the inputs of all neurons in the next layer. Information traveling from input to output is called **forward propagation**; information traveling back from output to input is called **backpropagation**. Backpropagation traces the output results backwards and updates the weights and biases. It is normally done automatically, so once you understand the adjustable parameters and the mechanism, you can use it. The error of the output value is computed by comparing it with the correct label in the dataset, and the weights and biases are adjusted automatically so that the output approaches the correct label.
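For reference, a minimal sketch of forward propagation through one intermediate layer, assuming fully connected layers and a sigmoid activation (sizes and values are placeholders):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([0.5, 0.8])          # input layer (2 neurons)

W1 = np.random.randn(2, 3) * 0.1  # weights: input -> intermediate layer
b1 = np.zeros(3)                  # biases of the intermediate layer
W2 = np.random.randn(3, 1) * 0.1  # weights: intermediate -> output layer
b2 = np.zeros(1)                  # bias of the output layer

# Forward propagation: information flows from input toward output
h = sigmoid(np.dot(x, W1) + b1)   # intermediate layer output
y = np.dot(h, W2) + b2            # output layer (identity activation)
print(y)
```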
Problems handled by neural networks can be divided into classification problems and regression problems.
- **Fully connected**: A model in which all neurons between layers are connected, as in the neural network model figure above.
- **Convolutional**: Often used in image processing. Convolving an image strengthens or weakens the features in it.
- **Recurrent**: A type of neural network that can handle context. Often used for natural language processing and sequence data.
Regression problems predict continuous numbers from trends in data. The basic usage is to collect discrete data, build an approximating function from it, and compute predicted values for unknown data. Typical examples:

- Predict weight from height
- Predict what a user will buy next from their shopping tendencies
- Forecast tomorrow's stock price from past price trends
- Predict rent from property information
Simple regression predicts one result from one input. It is used to express the relationship between two quantities, such as height and weight.
Multiple regression predicts one result from multiple inputs, such as predicting rent from the various attributes of a property. This is the more commonly used form.
A classification problem is the problem of sorting data into a fixed set of classes. Unlike a regression problem, there are multiple outputs, each representing the probability that the input belongs to the corresponding class. Typical examples:

- Classify plants from leaf images
- Classify handwritten characters
- Determine what kind of vehicle appears in an image
A dataset is a group of data in which each piece of training data x is paired with a correct label y. The dataset is further divided into **training data** and **test data**. The training data is used to train the network, while the test data is used to evaluate the performance of the model built from the training data; the training data is usually the larger of the two. If a network trained on the training data also gives good results on the test data, it can handle unknown data. If the test data results are poor, something is wrong with the network itself or with the learning method. The ability of a network to handle unknown data is called **generalization performance**.
Training data is collected manually or automatically, e.g. by scraping. Collection itself can be automated, but judging whether each piece of data is suitable must be done by hand, which makes this the most laborious task. For how to collect data, refer to this article.
Datasets are created manually or with libraries; you can find instructions by searching for "how to make a dataset". For images, create the dataset with an **annotation tool**. Much of the user's time is spent creating this dataset. Image tasks usually require thousands of images to get good results, but since images are hard to collect, the data is usually augmented (padded out).
The training data and correct labels of a dataset are each vectors. For example, correct height labels are represented by a vector of numbers like the following:

[146.2 170.4 169.3 154.5 179.2]

In this case there are 5 neurons in the output layer, and the weight and bias gradients are adjusted so that each output value approaches its correct value. In a classification problem, the correct label is a vector in which the correct class is 1 and everything else is 0, as below:

[0 0 1 0 0]

A sequence of numbers with a single 1 and the rest 0 is called a **one-hot expression**.
Format the data into an easy-to-use form, e.g. by normalization or standardization. Standardization, for example, stabilizes and speeds up learning.
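As a quick sketch of the two transformations (reusing the height vector above as sample data):

```python
import numpy as np

data = np.array([146.2, 170.4, 169.3, 154.5, 179.2])  # e.g. heights

standardized = (data - data.mean()) / data.std()               # mean 0, std 1
normalized = (data - data.min()) / (data.max() - data.min())   # range [0, 1]
```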
In a classification problem, the correct values are expressed in one-hot expression, so insert a program that converts labels to one-hot expression when implementing. With 3 neurons in the output layer, the labels become:

[1 0 0]
[0 1 0]
[0 0 1]
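A minimal sketch of such a conversion, assuming the labels are integers starting at 0:

```python
import numpy as np

labels = np.array([0, 2, 1])  # integer correct-answer labels
n_classes = 3                 # number of neurons in the output layer

one_hot = np.eye(n_classes)[labels]  # each row: 1 at the correct class, else 0
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```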
Finally, the data is divided into training data and test data. The training data is used for learning, while the test data is used to evaluate the generalization performance of the result learned from the training data. Both splits come from the same pool of data; after learning, the results on each are compared in a graph. Setting aside 20% to 30% of the data for testing is said to work well.
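If scikit-learn is available, the split is one line; a sketch with dummy data (test_size=0.2 reserves 20% for testing, matching the guideline above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)        # dummy training data: 100 samples, 4 features
y = np.random.randint(0, 3, 100)  # dummy correct labels

# Reserve 20% of the data for evaluating generalization performance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```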
This is the work of building the molds (templates) for the intermediate and output layers used in learning. Set the **activation function** and **loss function** for the intermediate layers and the output layer, respectively. It is a good idea to write the model as a class so it can be reused, and to declare the parameters as variables so they can be changed from outside. If you just want to get learning running quickly rather than understand it, copying someone else's model is enough.
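As one way to write such a class, here is a minimal sketch in Chainer (the framework this article's references are based on); the class name, layer sizes, and activation choices are placeholders to be adjusted for your problem:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    """Fully connected network; sizes are exposed as parameters for reuse."""
    def __init__(self, n_hidden=100, n_out=3):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_hidden)   # input -> intermediate layer
            self.l2 = L.Linear(n_hidden, n_out)  # intermediate -> output layer

    def __call__(self, x):
        h = F.relu(self.l1(x))  # activation function of the intermediate layer
        return self.l2(h)       # output layer; combine with a loss function
```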
**Set the initial weights and biases**. Regarding the types of activation functions, [What is the activation function that I do not understand well](https://newtechnologylifestyle.net/%E3%82%84%E3%81%A3%E3%81%B1%E3%82%8A%E3%82%88%E3%81%8F%E5%88%86%E3%81%8B%E3%82%89%E3%81%AA%E3%81%84%E6%B4%BB%E6%80%A7%E5%8C%96%E9%96%A2%E6%95%B0%E3%81%A8%E3%81%AF/) will be helpful.
Set these by intuition at first, then modify the following four values according to the learning results. The user adjusts them many times based on the output evaluated on the test data. Each is explained [here](#hyperparameters).

- Epoch
- Batch size
- Learning coefficient
- Optimization algorithm
Learning is the process of adjusting the connections between neurons. Learning goes through the following three processes.
Use the ** loss function ** to derive the error.
The **gradient descent method** is used to determine the gradient. The loss function $L$ obtained above is partially differentiated with respect to $w$ to obtain the gradient $\frac{\partial L}{\partial w}$. (Quote: Chainer Tutorial)
The error between the output obtained by forward propagation and the correct answer prepared in advance is propagated backwards, layer by layer. Based on the propagated error, the update amount for the weights and biases is computed for each layer. The updated weight is calculated as follows:
w \leftarrow w - \eta \frac{\partial L}{\partial w}
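In code, one update step corresponds to the sketch below; the gradient would normally come from backpropagation, so its values here are placeholders:

```python
import numpy as np

eta = 0.01              # learning coefficient (the eta in the formula above)
w = np.random.randn(3)  # current weights
grad_w = np.array([0.2, -0.5, 0.1])  # placeholder for dL/dw from backprop

w = w - eta * grad_w    # w <- w - eta * dL/dw
```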
As a result of learning, check what percentage of the training data is judged correctly. The percentage of the test data judged correctly is an important index of whether learning succeeded. Output a graph of epochs (number of passes over the training data) against loss (the value of the loss function) so that it can be checked visually. Estimate what is going wrong from the movement of the graph, adjust the hyperparameters, and train again.
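A sketch of such a graph with matplotlib, assuming the loss values were recorded per epoch during training (the numbers here are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical loss values recorded at each epoch
train_loss = [0.9, 0.6, 0.4, 0.3, 0.25]
test_loss = [0.95, 0.7, 0.5, 0.45, 0.5]  # rising at the end hints at overfitting

plt.plot(train_loss, label="train loss")
plt.plot(test_loss, label="test loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```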
A problem in which the search gets trapped in a locally optimal solution and cannot reach the overall (global) optimum. In some cases the gradient becomes extremely small and learning stops progressing. To find the global optimum, it is sometimes necessary to leave a locally best state once. (Quote: Chainer Tutorial)
**A neural network can become optimized only for a specific range of data and lose the ability to handle unknown data.** In machine learning generally, no estimate can be made for input data that lies too far outside the training data. You could say the network has fallen into a local optimum tuned to one specific pattern. The generated network should be kept a little loose so that it can handle a variety of data. To suppress this overfitting, the user adjusts the various parameters described later. **Too many intermediate layers or neurons causes overfitting through excessive expressiveness.** It can also be caused by an **insufficient number of training samples**.
A problem mainly caused by using the sigmoid function as the activation function. The gradient of the sigmoid function has a maximum value of 0.25 and approaches 0 as the input moves away from 0. During backpropagation, the derivative of the activation function is multiplied into each gradient every time a layer is traced back; with the sigmoid function, the gradients therefore shrink with each layer until they effectively vanish. For this reason, ReLU is often used as the activation function in deep learning.
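The shrinkage is easy to see numerically: the sigmoid derivative is at most 0.25, so multiplying it across many layers drives the gradient toward 0. A small sketch:

```python
import numpy as np

def sigmoid_grad(u):
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)  # maximum value is 0.25, at u = 0

print(sigmoid_grad(0.0))  # 0.25
print(0.25 ** 10)         # ~9.5e-07: best case after tracing back 10 layers
```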
In multi-layer deep learning, the number of weights and biases is enormous, so learning can take days or weeks. To deal with this, you can use a GPU or a high-spec machine, avoid making the network more complicated than necessary, and adjust the parameters.
The problems that come with multi-layering cannot be adjusted away automatically by the machine; the user must adjust for them according to the results. Adjustment and learning therefore have to be repeated many times to get better results. There are the following seven items to adjust, and deciding which one to tweak takes some experience. Hyperparameters, data augmentation, and data preprocessing are adjusted most often.
One pass of learning over all the training data is counted as one epoch. Too many epochs causes over-adaptation (overfitting) to the training data, so stop at an appropriate number.
The number of data samples contained in each of the subsets a dataset is divided into. This parameter affects both learning time and performance. For details, refer to the explanation here.
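For reference, a sketch of how epoch and batch size relate in a typical training loop; the data is dummy and the update itself is left as a comment:

```python
import numpy as np

X = np.random.rand(1000, 4)  # dummy training data
batch_size = 32
n_epochs = 10

for epoch in range(n_epochs):              # one epoch = one pass over all data
    indices = np.random.permutation(len(X))  # shuffle each epoch
    for i in range(0, len(X), batch_size):
        batch = X[indices[i:i + batch_size]]  # one minibatch
        # ... compute the loss on `batch` and update weights and biases here ...
```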
The coefficient applied in gradient descent, which scales the amount of each gradient step. If the learning coefficient is too large, the value of the loss function oscillates or diverges as the parameters are updated repeatedly; if it is too small, convergence takes a long time. Since there is no known optimal value, it has to be found empirically.
Gradient descent optimizes the network by adjusting the weights and biases little by little, based on the gradient, so as to minimize the error. There are various algorithms for this optimization; see [the list of optimization algorithms](#List of optimization algorithms).
Important hyperparameters affecting the success or failure of learning. It is desirable to initialize them with small random values.
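One common way to get "small random values" is to scale a normal distribution; a sketch (the 0.01 scale and layer sizes are placeholder choices):

```python
import numpy as np

n_in, n_out = 4, 100

W = np.random.randn(n_in, n_out) * 0.01  # small random initial weights
b = np.zeros(n_out)                      # biases are often started at zero
```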
A method of stopping learning partway through. As learning progresses, the error on the test data may begin to increase, i.e. overfitting may set in, so learning is terminated before that point. Learning is also terminated when the error stagnates and no further progress is made, to save time.
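A minimal sketch of the idea: stop when the test error has not improved for `patience` consecutive epochs. The per-epoch losses here are hypothetical stand-ins for evaluating the model each epoch:

```python
# Hypothetical test losses per epoch; in practice, evaluate the model each epoch
test_losses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.53, 0.51, 0.54, 0.55, 0.56]

best_loss = float("inf")
patience, wait = 3, 0

for epoch, test_loss in enumerate(test_losses):
    if test_loss < best_loss:
        best_loss, wait = test_loss, 0  # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:            # no improvement for `patience` epochs
            print(f"early stopping at epoch {epoch}")
            break
```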
If the number of training samples is small, overfitting is likely, so it is common to inflate (augment) the samples. Training on varied samples helps improve the generalization performance of the network. This article describes an actual case of overfitting caused by a lack of image samples when training, so refer to it.
Process the input data in advance to make it easier to handle. Preprocessing can improve network performance and speed up learning, and much of the time spent on machine learning goes into it. In Python, the library "pandas" makes preprocessing easy. There are various kinds of preprocessing, such as normalization and standardization; for the methods, refer to this article.
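A sketch of a small pandas preprocessing flow; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [146.2, 170.4, np.nan, 154.5, 179.2],
                   "weight": [40.1, 60.3, 65.2, 48.9, 72.5]})

df = df.fillna(df.mean())         # fill missing values with the column mean
df = (df - df.mean()) / df.std()  # standardize each column: mean 0, std 1
```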
A technique for suppressing overfitting: neurons other than those in the output layer are randomly dropped with a certain probability. Larger networks are more prone to overfitting, so dropout can be used to effectively shrink the network during learning.
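A sketch of (inverted) dropout applied to an intermediate-layer output; during learning each neuron is dropped with probability `p`, and at test time everything is used:

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    if not train:
        return h  # at test time all neurons are used
    mask = np.random.rand(*h.shape) > p  # randomly drop neurons with prob. p
    return h * mask / (1.0 - p)          # rescale to keep the expected value

h = np.random.rand(5)  # placeholder intermediate-layer output
print(dropout(h))
```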
Limiting the weights. Constraining the weights prevents them from taking extreme values and **prevents the network from being trapped in a locally optimal solution.**
ReLU outputs 0.0 for inputs below 0, and outputs the input unchanged for inputs above 0. **It is often used for classification problems.** Because inputs of 0 or less are cut off, it also serves for noise removal. Programmatically, it can be implemented by truncating values of 0 or less with an if statement.
As the input value becomes large, the sigmoid converges to a constant value, so it is not used with large input values. (Wikipedia)
**A function suited to classification problems**, often used in the output layer of a classification network. Summing this function's outputs for k = 1 to n gives 1. The softmax function therefore normalizes so that the sum of all outputs is 1, **no matter what values the input vector x takes**. This makes it a good fit for classification, where the outputs of the output-layer neurons are treated as probabilities summing to 1.
y_i=\frac{e^{x_i}}{\sum_{k=1}^{N}e^{x_k}}
The identity function returns the input unchanged as the output. **It is often used as the activation function of the output layer in regression problems.**
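For reference, minimal NumPy versions of the activation functions above (a sketch; real frameworks provide these built in):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)          # 0 for inputs below 0, identity above

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # saturates toward 0 or 1 for large |x|

def softmax(x):
    e = np.exp(x - np.max(x))        # shift inputs for numerical stability
    return e / np.sum(e)             # outputs sum to 1: usable as probabilities

def identity(x):
    return x                         # output layer of regression problems
```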
A loss function often used when solving **regression problems**.
L= \frac{1}{N}\sum_{n=1}^{N}(t_n-y_n)^2
A loss function often used when solving **classification problems**. Its advantage is that learning is fast when the gap between the output and the correct value is large.
L=\sum_{k=1}^{K}t_k(-\log(y_k))
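Minimal NumPy versions of both loss functions (t is the correct label, y the network output; the small constant avoids log(0)):

```python
import numpy as np

def mean_squared_error(y, t):
    return np.mean((t - y) ** 2)         # for regression problems

def cross_entropy(y, t, eps=1e-7):
    return -np.sum(t * np.log(y + eps))  # for classification (t is one-hot)

t = np.array([0, 0, 1])        # one-hot correct label
y = np.array([0.1, 0.2, 0.7])  # e.g. softmax output
print(cross_entropy(y, t))     # ~0.357, i.e. -log(0.7)
```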
The following are notes on the typical ones. For other optimization algorithms, see [this article](https://qiita.com/ZoneTsuyoshi/items/8ef6fa1e154d176e25b8#%E7%A2%BA%E7%8E%87%E7%9A%84%E5%8B%BE%E9%85%8D%E9%99%8D%E4%B8%8B%E6%B3%95sgd-stochastic-gradient-descent).
- **SGD**: An algorithm that randomly draws a sample for each update. It has the characteristic of being hard to trap in a local optimum. It can be implemented with simple code, but learning often takes time because the update amount cannot be adapted to the progress of learning.
- **AdaGrad**: The update amount is adjusted automatically; as learning progresses, the learning rate gradually decreases. Since the learning coefficient is the only constant to set, it takes little tuning.
- **RMSProp**: Overcomes AdaGrad's weakness of learning stagnating as the update amount shrinks.
- **Adam**: An improved version of RMSProp. It seems to be the most widely used.
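To show the difference in spirit, here is a sketch of one parameter update under plain SGD and under AdaGrad, based on their standard update formulas; the gradient values are placeholders:

```python
import numpy as np

w = np.zeros(3)
grad = np.array([0.2, -0.5, 0.1])  # placeholder gradient from backpropagation
lr = 0.1

# SGD: a fixed-size step along the gradient
w_sgd = w - lr * grad

# AdaGrad: accumulate squared gradients; the effective learning rate shrinks
# for parameters that have already received large updates
h = np.zeros(3)
h += grad ** 2
w_adagrad = w - lr * grad / (np.sqrt(h) + 1e-7)
```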
References:

- Chainer Tutorial
- Introduction to Machine Learning Starting with Python
- First Deep Learning: Neural Networks and Backpropagation Learned with Python
- Kikagaku: Basics of Deep Learning
- https://www.sbbit.jp/article/cont1/33345
- https://qiita.com/nishiy-k/items/1e795f92a99422d4ba7b
- https://qiita.com/Lickey/items/b97c3450d7def207bfbf