During the Golden Week holidays, I suddenly remembered "Deep Learning from Scratch: Theory and Implementation of Deep Learning Learned with Python", so I studied it, implemented the examples, and noted down my impressions and my own understanding. For those about to pick up this book, I hope this helps you gauge what to expect and how much learning cost to budget for; that said, since this is really a memo for my future self to look back on, some parts may be hard to follow. I think detailed mathematical formulas are best followed closely in the book itself, so I wrote this aiming only for an outline-level understanding of "so this is how Deep Learning works".
By understanding this book and extending the attached sample code a little, I was able to create a system that recognizes the digits in image files of numbers handwritten in "Paint". (The source code is not released because of possible copyright issues.)
Deep Learning combines two functions: "identification" and "learning".
"Identification" means input value , neural network function , output value There is font>, input value and known neural network function to output value < Ask for/font>. (Example: Read an image that you do not know what the number is and recognize it as 5)
"Learning" is <font color = "" from test data ( input value ) for which the correct answer ( output value ) is known. Find the "Green"> neural network function . (Example: Load images of various handwritten numbers 5 and update neural network function so that the recognition result is 5)
Note that the __activation function__ is the important piece in the "identification" function (Chapters 2 and 3), while the __loss function__, __gradient descent__, and __error backpropagation__ are the important pieces in the "learning" function (Chapters 4 and 5). There are also various ways to improve learning accuracy (Chapters 6, 7, and 8).
As mentioned above, "identification" means input value and known neural network function to <font color = "Red". > Find the output value . As an example, the flow of reading an image whose number is unknown and recognizing it as 5 is like this (see also the figure below).
What is important here is that the image is not flatly determined to be "5!"; rather, it is judged as "the probability of being 5 is 70%, so it is probably 5", and the output value Y is a 1-by-10 matrix. The __activation function__ is designed so that in the output matrix Y "each element is a positive value" and "the elements sum to 1". Typical activation functions are the sigmoid function and the softmax function.
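As a concrete illustration of that point, here is a minimal sketch of the softmax activation producing such a "probability" output; the raw scores are made-up values for illustration, not the book's sample code:

```python
import numpy as np

def softmax(x):
    # Shift by the max for numerical stability; the result is unchanged.
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

# Hypothetical raw scores from the output layer for the digits 0-9.
scores = np.array([0.1, 0.3, 0.2, 0.4, 0.1, 2.0, 0.3, 0.1, 0.2, 0.3])
y = softmax(scores)

print(np.all(y > 0))   # True: every element is positive
print(y.sum())         # ~1.0: the elements sum to one
print(y.argmax())      # 5: "the probability of being 5 is highest"
```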
As mentioned above, "learning" means input value from test data ( input value ) for which the correct answer ( output value ) is known. Find the font color = "Green"> neural network function . This neural network function W is calculated using __loss function __, __ gradient descent __, and __ error backpropagation __.
> __Digression: nostalgic high school mathematics__
> In the function `y = f(x)`, if you know the input value x and the output value y, you can work out the function f(x). For example, in the case of the quadratic function `f(x) = ax^2 + bx + c`, three (x, y) pairs are enough to determine a, b, and c, and hence to determine f(x).
> You can think of "learning" as the matrix version of this, though it is a little more difficult.
> (I miss high school mathematics :relaxed:)
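To make the digression concrete, here is a minimal sketch (with made-up sample points, not from the book) of recovering the coefficients a, b, c of f(x) = ax^2 + bx + c from three (x, y) pairs by solving a linear system:

```python
import numpy as np

# Three known (x, y) pairs from the (made-up) quadratic y = 2x^2 + 3x + 1.
xs = np.array([0.0, 1.0, 2.0])
ys = 2 * xs**2 + 3 * xs + 1

# Each pair gives one linear equation a*x^2 + b*x + c = y;
# solve the 3x3 system for the coefficients a, b, c.
A = np.stack([xs**2, xs, np.ones_like(xs)], axis=1)
a, b, c = np.linalg.solve(A, ys)
print(a, b, c)  # -> 2.0 3.0 1.0
```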
__Loss function__
Consider the "Deep Learning flow (learning)" figure above:
- Using neural network function A, the probability of the digit "5" was 0.7.
- Using neural network function B, the probability of the digit "5" was 0.85.
In such a case, the loss function L is an index for evaluating __how much better__ one neural network function is than the other.
There are various loss functions L, such as the cross-entropy error and the sum of squared errors; the formula for the sum of squared errors is this ↓.
```math
L = \sum (X_{\text{output}} - X_{\text{correct}})^2
```
Here, for example, when the probabilities (output values) obtained by reading an image of a digit are as follows, the loss functions L of neural networks A and B are 0.34 and 0.14 respectively, so we can see that neural network function B is the better one.
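As a minimal sketch of this comparison (the output vectors below are made-up numbers, so the loss values differ from the 0.34 and 0.14 above):

```python
import numpy as np

def sum_squared_error(y, t):
    # Sum-of-squared-errors loss between output y and correct answer t.
    return np.sum((y - t) ** 2)

# One-hot correct answer: the image is the digit 5.
t = np.zeros(10)
t[5] = 1.0

# Hypothetical output probabilities of networks A and B (each sums to 1).
y_a = np.array([0.0, 0.05, 0.1, 0.0, 0.05, 0.70, 0.05, 0.0, 0.05, 0.0])
y_b = np.array([0.0, 0.02, 0.03, 0.0, 0.02, 0.85, 0.03, 0.0, 0.05, 0.0])

print(sum_squared_error(y_a, t))  # larger loss
print(sum_squared_error(y_b, t))  # smaller loss -> network B is better
```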
Also, naturally, the closer the loss function L is to 0, the more accurately the network can identify digits. So in "learning", the point of finding the neural network function W is to reduce the value of the loss function L (__gradient descent__) and recalculate the function W.
__Gradient descent__ A method of finding the minimum value: evaluate f(x) at a point advanced a certain distance in the gradient direction from the current location, then advance a further distance Δx in the gradient direction from that point, and repeat.
In the two-dimensional case (lower-left figure), starting from an arbitrary point (x, f(x)), the gradient method eventually settles at the minimum (the point where the gradient becomes 0 is the minimum value). The same applies when the loss function L of the sum of squared errors is three-dimensional (lower-right figure): the minimum is reached by repeatedly finding the gradient at the current point and advancing by Δx in the gradient direction.
Even for an n-dimensional matrix, the gradient toward the minimum can be obtained by partial differentiation. So the output matrix Y is shifted by Δx in the gradient direction, and the neural network function W can be "learned" by recalculating it from the original matrix X and the shifted matrix Y.
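Here is a minimal sketch of gradient descent in this spirit, with the gradient approximated by central differences; the target function, learning rate, and step count are illustrative assumptions, not the book's exact code:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Central-difference approximation of the gradient of f at x.
    grad = np.zeros_like(x)
    for i in range(x.size):
        tmp = x[i]
        x[i] = tmp + h
        fxh1 = f(x)
        x[i] = tmp - h
        fxh2 = f(x)
        grad[i] = (fxh1 - fxh2) / (2 * h)
        x[i] = tmp
    return grad

def gradient_descent(f, x0, lr=0.1, steps=100):
    # Repeatedly step a distance lr * gradient in the downhill direction.
    x = x0.copy()
    for _ in range(steps):
        x -= lr * numerical_gradient(f, x)
    return x

# Example: minimize f(x0, x1) = x0^2 + x1^2, whose minimum is at (0, 0).
f = lambda x: x[0]**2 + x[1]**2
print(gradient_descent(f, np.array([3.0, -4.0])))  # close to [0, 0]
```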
__Error backpropagation__ Omitted. Roughly, it is a method that keeps the calculation cheap by exploiting the properties of partial derivatives and matrices.
Omitted as well. Differential equations for damped vibration even made an appearance; the world beyond this point is a deep one...
It was astonishing that every "stage boss" I had encountered in my mathematical life gathered here and attacked at once. Namely:
- Partial differential equations (Math IIIC)
- Matrix products and transposes of n-dimensional matrices (university mathematics)
- Differential equations of damped vibration (graduate school entrance exam)
It was a parade of the very topics I struggled with while preparing for university entrance exams and studying at university, so I had to review them lightly before learning about Deep Learning. In other words, even if you are strong at programming, if you are only slightly familiar with matrix calculations and partial differential equations, the learning cost seems a little high.
Thank you for reading this far. : bow: