Aidemy 2020/11/10
Hello, this is Yope! I am a liberal arts student, but I became interested in the possibilities of AI, so I studied at the AI-specialized school "Aidemy". I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read the previous summary article. Thank you! This is the first post on deep learning and image recognition. Nice to meet you.
What to learn this time
- A simple perceptron is a unit that receives multiple inputs and outputs one value (0 or 1). It is the basic unit of a __neural network__.
- The formula of the simple perceptron is as follows.

$$ u = \sum_{i} w_i x_i + \theta, \qquad y = H(u) $$
- In the above equation, "u" is the sum of each input $ x_i $ multiplied by its weight $ w_i $, plus the bias θ. $ H(u) $ is called the __step function__ (an activation function); it is 1 when u is positive and 0 otherwise.
- In this way, the simple perceptron receives multiple inputs and fires when the __threshold__ (0 this time) is exceeded.
- However, it can be used only when the data are __linearly separable__. It cannot represent functions that are not linearly separable, such as the __XOR function__. In those cases, use the __"multilayer perceptron"__ described later.
-Code (implementation example)
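As an illustration of the formula above, here is a minimal sketch of a simple perceptron in NumPy. The function names `step` and `perceptron`, and the weights and bias chosen to realize an AND gate, are assumptions made for this example.

```python
import numpy as np

def step(u):
    # Step function H(u): 1 when u is positive, 0 otherwise
    return 1 if u > 0 else 0

def perceptron(x, w, theta):
    # u = sum_i w_i * x_i + theta, then y = H(u)
    u = np.dot(w, x) + theta
    return step(u)

# Hand-picked (hypothetical) parameters that realize an AND gate
w = np.array([1.0, 1.0])
theta = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(np.array(x), w, theta) for x in inputs]
print(and_outputs)
```

Only the input (1, 1) pushes u above the threshold 0, so the unit fires for that input alone.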
- The "w" and "θ" above have to be set to suitable values, but tuning them by hand is not realistic. The method that __automates updating these values__ is __"error correction learning"__.
- In error correction learning, some initial "w" is chosen, and an output "y" is obtained for an input "x". Based on the difference between y and the correct output "t", "w" is updated with the following formula.

$$ w \leftarrow w + \eta \, (t - y) \, x $$
- Here, for simplicity, the weight w is (w1, w2, θ), the input x is (x1, x2, 1), and u is calculated as $ u = w^{T} x $ (so θ no longer appears separately in the formula for u).
- $ \eta $ is the learning rate. Details will be described later.
- As the formula shows, a component of w is updated only when t and y differ and the corresponding component of x is 1. When y > t, w is updated in the negative direction, and when y < t, w is updated in the positive direction.
·code
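A minimal sketch of error correction learning, assuming an AND gate as the training data. The function name `train_perceptron`, the learning rate 0.5, the zero initial weights, and the epoch limit are assumptions for this example.

```python
import numpy as np

def train_perceptron(X, T, eta=0.5, epochs=100):
    # Append a constant 1 to each input so that theta is learned as the third weight
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(3)  # w = (w1, w2, theta)
    for _ in range(epochs):
        errors = 0
        for x, t in zip(Xb, T):
            y = 1 if np.dot(w, x) > 0 else 0   # y = H(w^T x)
            w += eta * (t - y) * x             # no change when t == y
            errors += int(t != y)
        if errors == 0:  # every sample is classified correctly
            break
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T_and = np.array([0, 0, 0, 1])

w_and = train_perceptron(X, T_and)
pred_and = [1 if np.dot(w_and, np.append(x, 1.0)) > 0 else 0 for x in X]
print(w_and, pred_and)
```

Because AND is linearly separable, the loop stops once a full pass over the data produces no errors.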

・ Output (+ code for output)

- As mentioned above, the simple perceptron cannot handle problems that are __not linearly separable__. The __"multilayer perceptron"__ is used in such cases. As the word "multilayer" suggests, it is a perceptron with more layers. Specifically, in addition to the "input layer" and the "output layer", an __"intermediate layer (hidden layer)"__ is added.
- The concrete computation is as follows. For the input data x, compute "u1" and "u2" with the same formula as for the simple perceptron, obtain the first-layer outputs "z1" and "z2" from them, and compute "u3" from these outputs, the weight w3, and the bias b3. The final output "y" is then "H (u3)".
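The computation above can be sketched as follows, assuming hand-picked parameters that make the network compute XOR. The choice of an OR unit and a NAND unit in the hidden layer, and the concrete values of w1, w2, w3, b1, b2, b3, are assumptions for this example.

```python
import numpy as np

def step(u):
    # Step function H(u): 1 when u is positive, 0 otherwise
    return 1 if u > 0 else 0

# Hypothetical parameters: z1 acts as OR, z2 as NAND, and the output as AND
w1, b1 = np.array([1.0, 1.0]), -0.5
w2, b2 = np.array([-1.0, -1.0]), 1.5
w3, b3 = np.array([1.0, 1.0]), -1.5

def mlp(x):
    u1 = np.dot(w1, x) + b1
    u2 = np.dot(w2, x) + b2
    z = np.array([step(u1), step(u2)])   # hidden-layer outputs z1, z2
    u3 = np.dot(w3, z) + b3
    return step(u3)                      # y = H(u3)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [mlp(np.array(x)) for x in inputs]
print(xor_outputs)
```

A single perceptron cannot produce this output pattern, but one hidden layer with two units is enough.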

- The "error correction learning" learned above can only be used when the input is "0 or 1". __It also cannot be used once the number of hidden layers increases__, so in such cases the __"gradient descent method"__ is used.
- The gradient descent method is the method that has been used in deep learning up to now. It __"learns so that the error function becomes as small as possible"__.
- The __gradient__ is what is used to find this minimum, and it is obtained by __differentiation__.
- A parameter called the __"learning rate"__ appears here as well. It adjusts __"how much to learn in one step"__. If it is too large, the value will not converge to the desired value, and if it is too small, it will take too long to get there, so __an appropriate value has to be set__.
- The learning rate is __basically set by trial and error__, but you can also use a tool to search for an appropriate value.
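To make the effect of the learning rate concrete, here is a minimal sketch of gradient descent on a one-dimensional error function. The error function E(w) = (w - 3)^2, the three learning rates, and the step count are assumptions chosen for illustration.

```python
def gradient_descent(eta, steps=50, w0=0.0):
    # Minimize E(w) = (w - 3)^2, whose derivative is dE/dw = 2 * (w - 3)
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= eta * grad   # move against the gradient
    return w

w_good = gradient_descent(eta=0.1)     # converges close to the minimum at w = 3
w_small = gradient_descent(eta=0.001)  # still far from 3 after 50 steps
w_large = gradient_descent(eta=1.1)    # overshoots more on every step and diverges

print(w_good, w_small, w_large)
```

With the same number of steps, only the middle learning rate ends up near the minimum.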
- The gradient descent method relies on differentiation, but if the step function from before is used as the __activation function__, the output is only 0 or 1, so its __derivative is 0__ wherever it is defined. Therefore, when using this method, __another function has to be used as the activation function__.
- Examples include the __sigmoid function__ and the __ReLU function__.
- This part is mathematical, but the __sigmoid function__ is expressed by the following formula. Its derivative is also shown.

$$ \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x) \, (1 - \sigma(x)) $$
- In code, the sigmoid function can be computed directly from this formula.
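A minimal NumPy sketch of the sigmoid function and its derivative. The function names `sigmoid` and `sigmoid_grad` and the sample inputs are assumptions for this example.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))
print(sigmoid_grad(x))
```

Unlike the step function, the derivative here is nonzero for every input, which is what gradient descent needs.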
- The __ReLU function__ is a simple function that returns x as it is when __x >= 0__ and returns 0 when __x < 0__. Its derivative is always 1 for x > 0 (and 0 for x < 0), so it is often used with the __error backpropagation method__.
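A minimal NumPy sketch of the ReLU function and its derivative. The function names are assumptions, and the derivative at exactly x = 0 is set to 0 here by convention.

```python
import numpy as np

def relu(x):
    # x when x >= 0, 0 when x < 0
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 otherwise (the value at x = 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-1.0, 0.0, 2.0])
print(relu(x))
print(relu_grad(x))
```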
- In a neural network, the weights are updated along the gradient of the error function, which is obtained by __error backpropagation__. Depending on how this gradient is computed, there are the following three methods.
- The first is the __"steepest descent method"__, which updates the weights using the __gradient computed from all the data__. However, this method has the problem that once it reaches a __local solution__, it cannot get out of it.
- The second is __"stochastic gradient descent"__, which computes the gradient using only __one (the i-th) data point__ and updates the weights based on it. This method is less likely to get stuck in a local solution, but because it uses only one data point, the update may go wrong if that point is an outlier.
- The third, the __"mini-batch method"__, reduces this problem. You decide the number of data points __(batch_size)__ used to compute the gradient yourself, and each update is based on that many points.
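The three methods differ only in how many data points go into each gradient, which the following sketch shows on a small linear model. The toy data (t = 2x + 1 without noise), the squared-error function, the learning rate, and the function name `train` are assumptions for this example, not part of the original course material.

```python
import numpy as np

def train(X, t, batch_size, eta=0.1, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)  # shuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            err = X[batch] @ w - t[batch]
            # Gradient of the mean squared error (times 1/2) over this batch
            grad = X[batch].T @ err / len(batch)
            w -= eta * grad
    return w

# Toy data: t = 2 * x + 1, with a constant column for the bias
x = np.linspace(-1.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])
t = 2.0 * x + 1.0

w_full = train(X, t, batch_size=len(X))  # steepest descent: all data per update
w_sgd = train(X, t, batch_size=1)        # stochastic gradient descent: one point
w_mini = train(X, t, batch_size=5)       # mini-batch method
print(w_full, w_sgd, w_mini)
```

On this noise-free toy problem all three settings recover roughly the same weights. The differences described above matter more on noisy, non-convex problems.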
- The basic building block of a neural network is the __"simple perceptron"__, which receives multiple inputs and outputs one value. Adding a __hidden layer__ to it gives the __multilayer perceptron__, which can also handle problems that are not linearly separable.
- __"Error correction learning"__ can be used to update weights and biases, but it cannot be used once the number of intermediate layers increases, so the __"gradient descent method"__ is used in deep learning instead. This method uses the gradient obtained by __differentiating the error function__ to update the values in the direction that makes the error smaller.
- When using the gradient descent method, use the __sigmoid function__ or the __ReLU function__ as the __activation function__.
- When updating weights with the gradient descent method, the method differs depending on __how much of the data is used to compute the gradient__. The most commonly used one lets you decide the number of data points (batch_size) yourself, and it is called the __"mini-batch method"__.
That's all for this time. Thank you for reading to the end.