Learning is the central function of a neural network (deep learning). In this article I try to understand, from scratch, the calculations a model performs in order to improve its predictions.
As before, I referred to O'Reilly's deep learning textbook, which is very easy to follow. https://www.oreilly.co.jp/books/9784873117584/
The overall flow is as follows.
Learning in a model means bringing the predicted values closer to the correct answers, i.e. raising the rate of correct answers. Take image recognition as an example. MNIST, the well-known dataset of handwritten digits, is used to distinguish handwritten numbers.
In this image, any human can see a 5 (the brain has learned to recognize it). Next, let's think about what is needed to build an algorithm that lets a computer recognize this 5. To recognize a 5 from an "image" of a 5, we need to find "features" in the image that identify it as a 5. For the image of a 5, these would be things like "a horizontal bar at the top", "a vertical stroke", and "an arc of roughly 270 degrees that is open to the left". The flow of extracting these features and then learning from the extracted features is the algorithm that makes the computer recognize the 5. The function that finds (= extracts) the features is called a converter (feature extractor). Well-known converters include SIFT, SURF, and HOG. For more information, please see the URL below; it is material from 2011 and covers techniques developed since the 2000s.
https://www.slideshare.net/lawmn/siftsurf
Next, the features are used to convert the image data into a vector, and that vector is trained with a classifier from machine learning. Well-known classifiers include the Support Vector Machine (SVM) and the k-nearest neighbors method (KNN); a rough sketch of this pipeline is shown below.
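As an illustration of this classic pipeline (my own sketch, not from the book; it assumes scikit-image and scikit-learn are installed, and the image and label arrays are placeholders), it could look like this:

```python
import numpy as np
from skimage.feature import hog   # HOG feature extractor (the "converter")
from sklearn.svm import SVC       # Support Vector Machine classifier

def extract_hog_features(images):
    # Convert each 28x28 grayscale image into a fixed-length feature vector.
    return np.array([hog(img, pixels_per_cell=(7, 7), cells_per_block=(2, 2))
                     for img in images])

# train_images, train_labels, test_images are placeholders for MNIST-like data.
# X_train = extract_hog_features(train_images)   # images -> feature vectors
# clf = SVC().fit(X_train, train_labels)         # train the classifier on the vectors
# predictions = clf.predict(extract_hog_features(test_images))
```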
In this pipeline, a person has to judge and select an appropriate converter according to the characteristics of the problem. A neural network, on the other hand, also covers the converter stage: the converter that searches for features is itself an algorithm that can be trained.
The concept is illustrated above. By enlarging the part that the computer decides for itself, the neural network interprets the given data as it is and tries to find the patterns of the problem. In that sense it can be understood as an algorithm with a stronger flavor of artificial intelligence.
Next, I will summarize how to measure the gap between the predicted data and the correct data. To indicate how close a prediction is to the correct answer, we introduce a function called a loss function.
The best-known loss function is the mean squared error, used here in its sum-of-squares form:

$$E = \frac{1}{2}\sum_k (y_k - t_k)^2$$

Here $y_k$ is the output of the neural network, $t_k$ is the teacher data (correct-answer data), and $k$ is the dimension (index) of the data. From the formula we can see that the closer the outputs are to the correct answers, the smaller this value becomes. Let's write it as a simple program.
```python:nn.ipynb
import numpy as np

def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t)**2)

t = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y = [0.1, 0.1, 0.6, 0.1, 0.1, 0, 0, 0, 0, 0]
y1 = [0.1, 0.1, 0.1, 0.1, 0.6, 0, 0, 0, 0, 0]
print(mean_squared_error(np.array(y), np.array(t)))
print(mean_squared_error(np.array(y1), np.array(t)))
```
```
0.10000000000000003
0.6000000000000001
```
The elements of these arrays correspond to the digits "0", "1", "2", ... in order from the first index. y is the output of the neural network; since it has been passed through the softmax function, each value can be read as a probability, and here the probability of judging the digit to be "2" is 0.6. t is the teacher data, i.e. the correct answer is the digit "2". Computing the sum-of-squares error for y and y1, y gives the smaller value: the output y, which assigns the highest probability to the digit "2", is correctly judged to be closer to the teacher data.
Another loss function is the cross-entropy error:

$$E = -\sum_k t_k \log y_k$$

Here $\log$ is the natural logarithm. Since $t_k$ is the one-hot correct label, it is 1 only for the correct class, so in practice this function simply outputs the negative natural logarithm of the network's output for the correct class. Here is the result of an actual implementation.
```python:nn.ipynb
def cross_entropy_error(y, t):
    delta = 1e-7
    return -np.sum(t * np.log(y + delta))

print(cross_entropy_error(np.array(y), np.array(t)))
print(cross_entropy_error(np.array(y1), np.array(t)))
```
```
0.510825457099338
2.302584092994546
```
Here a small value delta (1e-7 = 0.0000001) is added inside the log. It is there to keep the calculation from breaking down, because log(0) diverges to minus infinity. Looking at the results, when the output for the correct label is small (y1) the error is about 2.3, and when the output is high (y) it is about 0.5.
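As a small check (my own, not from the book) of what the delta protects against, consider an output that assigns probability 0 to the correct class:

```python
# np.log(0) is -inf, so without the delta the loss would not be a usable finite number.
y_zero = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0, 0, 0, 0, 0])  # probability 0 for the correct digit "2"
print(np.log(0.0))                               # -inf (with a RuntimeWarning)
print(cross_entropy_error(y_zero, np.array(t)))  # about 16.1: large but finite thanks to the delta
```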
A model with high prediction accuracy is obtained by minimizing the value of the loss function. Therefore we need to find parameters that reduce the loss function, and the parameters are updated using the derivative with respect to each parameter as a clue. Differentiation tells us the gradient of the function. The basics of differentiation themselves are omitted here.
If the value of this gradient is positive, moving the parameter (a in the figure) in the negative direction brings it closer to the minimum value. Conversely, if the gradient is negative, moving the parameter in the positive direction brings it closer to the minimum value.
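For example, for f(a) = a² the gradient at a = 3 is 6 (positive), so an update such as a ← 3 - 0.1 × 6 = 2.4 moves a toward the minimum at a = 0; starting from a = -3 the gradient is -6 (negative), and the same rule moves a in the positive direction.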
Now, let's think about the differentiation of functions. There are two approaches to differentiating a function: (1) solving it analytically, and (2) solving it numerically by taking a difference. When done by hand by a human, (1) is the natural choice, but when solving it in a program, (2) is convenient. This time we implement the central difference shown in the figure below, which approximates the derivative as (f(x+h) - f(x-h)) / (2h).
This time, I would like to numerically differentiate the function f(x) = 0.01x² + 0.1x and find the value of the derivative at x = 5.
```python:nn.ipynb
import numpy as np
import matplotlib.pyplot as plt

def numerical_diff(f, x):
    h = 1e-4  # 0.0001
    return (f(x+h) - f(x-h)) / (2*h)

def function_1(x):
    return 0.01*x**2 + 0.1*x

numerical_diff(function_1, 5)
```
```
0.1999999999990898
```
In the attached figure, the curve is the original function and the straight line has the gradient found at x = 5. (Analytically, df/dx = 0.02x + 0.1 equals exactly 0.2 at x = 5, so the numerical value above is a good approximation.)
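As a rough sketch (my own plotting code, not the book's), the figure can be reproduced like this:

```python
# Plot the original curve and the tangent line whose slope is the numerical derivative at x = 5.
x = np.arange(0.0, 20.0, 0.1)
slope = numerical_diff(function_1, 5)           # about 0.2
tangent = slope * (x - 5) + function_1(5)       # line through (5, f(5)) with that slope
plt.plot(x, function_1(x), label="f(x) = 0.01x^2 + 0.1x")
plt.plot(x, tangent, label="tangent at x = 5")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()
```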
Next, consider performing partial differentiation of the two-variable function $f(x_0, x_1) = x_0^2 + x_1^2$ shown below.
If you draw the original function, it will be a 3D graph as shown below.
```python:nn.ipynb
def function_2(x):
    return x[0]**2 + x[1]**2
```
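One possible way to draw that 3D surface (my own plotting code, not the book's):

```python
# Plot f(x0, x1) = x0^2 + x1^2 as a surface over a grid of (x0, x1) points.
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (enables the 3D projection)

x0 = np.arange(-3.0, 3.0, 0.1)
x1 = np.arange(-3.0, 3.0, 0.1)
X0, X1 = np.meshgrid(x0, x1)
Z = X0**2 + X1**2

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(X0, X1, Z)
ax.set_xlabel("x0")
ax.set_ylabel("x1")
ax.set_zlabel("f(x0, x1)")
plt.show()
```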
Partial differentiation means fixing the variable to differentiate with respect to and treating the other variables as constants. Let's partially differentiate with respect to x0 and find the value at x0 = 3, x1 = 4.
```python:nn.ipynb
def function_tmp1(x0):
    return x0*x0 + 4.0**2.0

numerical_diff(function_tmp1, 3.0)
```
```
6.00000000000378
```
Here we defined a new function of a single variable (with x1 fixed at 4.0) and differentiated it; the analytical partial derivative ∂f/∂x0 = 2x0 indeed gives 6 at x0 = 3. However, with this approach the fixed values have to be substituted by hand, one variable at a time. Suppose we want to differentiate with respect to x0 and x1 together, i.e. compute the gradient. This can be implemented as follows:
```python:nn.ipynb
def numerical_gradient(f, x):
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = tmp_val + h
        fxh1 = f(x)                    # f evaluated with x[idx] shifted by +h
        x[idx] = tmp_val - h
        fxh2 = f(x)                    # f evaluated with x[idx] shifted by -h
        grad[idx] = (fxh1 - fxh2) / (2*h)
        x[idx] = tmp_val               # restore the original value
    return grad

numerical_gradient(function_2, np.array([3.0, 4.0]))
```
```
array([6., 8.])
```
As explained earlier, this differentiated value indicates the gradient of the original function. Now consider drawing the gradient at many points as vectors; for convenience, they are shown below with a minus sign (i.e. the negative gradient).
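A rough sketch (my own plotting code) of how such a figure of negative gradient vectors can be drawn with the numerical_gradient defined above:

```python
# Evaluate the gradient on a grid of points and draw the negative gradient as arrows.
x0 = np.arange(-2.0, 2.25, 0.25)
x1 = np.arange(-2.0, 2.25, 0.25)
points = np.array([[i, j] for i in x0 for j in x1])
grads = np.array([numerical_gradient(function_2, p) for p in points])

plt.quiver(points[:, 0], points[:, 1], -grads[:, 0], -grads[:, 1], angles="xy")
plt.xlabel("x0")
plt.ylabel("x1")
plt.show()
```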
You can see that the arrows point toward (x0, x1) = (0, 0), the minimum of the function. Connecting this back to the discussion of the loss function, this is what lets us find the minimum value and improve the accuracy of the model. **It turns out that this differential operation can find the minimum value of the loss function, leading to model optimization!**
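As a minimal sketch of how this is used (my own summary here; the actual implementation on a neural network is the topic of the second half), the parameters are repeatedly moved a small step against the gradient, which is exactly gradient descent:

```python
# Gradient descent using the numerical_gradient defined above.
def gradient_descent(f, init_x, lr=0.1, step_num=100):
    x = init_x
    for _ in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad                 # step against the gradient
    return x

print(gradient_descent(function_2, np.array([-3.0, 4.0])))
# -> roughly array([-6.1e-10, 8.1e-10]), i.e. very close to the minimum at (0, 0)
```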
This time, I got as far as understanding that this differential operation leads to improving the accuracy of the model. Looking into the contents of learning, which is the heart of neural networks, deepened my understanding. In the second half of the article, I would like to understand learning carefully by actually proceeding to the implementation on a neural network.
The second half is here. https://qiita.com/Fumio-eisan/items/7507d8687ca651ab301d