Summary Note on Deep Learning -4.3 Gradient Method-

About these notes

The purpose of these notes is to work through the parts of Deep Learning that I couldn't understand just by reading the book, and to make them easy to recall when I look back later. I will explain the code as carefully as possible, so I hope it is helpful.

4.3 Gradient method

As explained in the previous article, Summary Note on Deep Learning -4.2 Loss Function-, the goal of neural network learning is to search for the optimal parameters, i.e. the ones that reduce the value of the loss function. So how, concretely, are the parameters updated? The method used is the gradient method explained in this article.

Gradient method

First, here is a brief outline of the flow of the gradient method.

  1. Choose an initial point *x*
  2. Find the gradient (slope) of the function at that point
  3. Update the point *x* based on the gradient
  4. Repeat steps 2-3 until a point *x* where the gradient is 0 is found

The gradient method goes by different names depending on whether the goal is to find a minimum or a maximum: the former is called the **gradient descent method**, the latter the **gradient ascent method** (which simply adds the gradient term instead of subtracting it). This article uses **gradient descent**, the variant that appears most often in neural network learning, as the example.


Formula

The update formula of the gradient method is as follows:

x = x - \eta \frac{\partial f}{\partial x}

As the formula shows, the function f(x) is differentiated with respect to x to obtain the slope, which is multiplied by η and subtracted from the current point to update it. When the slope of f(x) becomes 0, the derivative value is 0 and the parameter is no longer updated.
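For example (a made-up single step, not from the book): take f(x) = x^2 at the current point x = 3 with η = 0.1. The derivative there is 2x = 6, so

x \leftarrow 3 - 0.1 \times 6 = 2.4

and the point moves toward the minimum at x = 0.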

η (eta) is called the learning rate: it determines how far the parameters move in each learning step. The larger the value, the farther the parameter moves per update; the smaller the value, the shorter each step. Note that **the learning rate must be set to an appropriate value**, a point explained later.
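To make the update rule and the role of the learning rate concrete, here is a minimal one-variable sketch (my own addition, not from the book). It repeats the update x ← x − η·2x for f(x) = x², using the analytic derivative 2x for simplicity; the main example below uses numerical differentiation instead.

```python
# Minimal one-variable gradient descent for f(x) = x**2
x = 3.0   # initial point
lr = 0.1  # learning rate (eta)
for i in range(5):
    grad = 2 * x       # analytic derivative: f'(x) = 2x
    x = x - lr * grad  # gradient descent update
    print(x)           # roughly 2.4, 1.92, 1.536, 1.2288, 0.98304
```

Each update multiplies x by (1 - 2 * lr) = 0.8, so the point shrinks steadily toward the minimum at x = 0.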


Gradient method example

In this example, we take the function

f(x_1, x_2) = x_1^2 + x_2^2

and use the gradient method to search for the (x_1, x_2) that minimizes f(x_1, x_2).
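For reference, the analytic answer (simple calculus, not spelled out in the original) is easy to state. The partial derivatives are

\frac{\partial f}{\partial x_1} = 2x_1, \qquad \frac{\partial f}{\partial x_2} = 2x_2

so the gradient vanishes only at (x_1, x_2) = (0, 0), where f takes its minimum value 0. This is the point the gradient method should approach.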

```python
# Module import
import numpy as np

# Define a function that differentiates numerically
def numerical_gradient(function, x):
    h = 1e-4
    # Create an array with the same shape as x and all elements zero;
    # the differentiated values are stored into it below
    grad = np.zeros_like(x)

    for idx in range(x.size):
        tmp_val = x[idx]
        # f(x+h)
        x[idx] = tmp_val + h
        fxh1 = function(x)
        # f(x-h)
        x[idx] = tmp_val - h
        fxh2 = function(x)

        # Central difference: approximate the partial derivative
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        # Restore the original value of x
        x[idx] = tmp_val
    # Return grad once every element of x has been differentiated
    return grad
```
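As a quick sanity check (my own addition, not in the book), the numerical gradient of this f at (3.0, 4.0) should match the analytic gradient (6, 8) almost exactly:

```python
# Compare numerical_gradient with the analytic gradient (2*x1, 2*x2)
numerical_gradient(lambda x: x[0]**2 + x[1]**2, np.array([3.0, 4.0]))
# -> array([6., 8.])
```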
    
```python
# Define the gradient descent function (the main routine this time)
def gradient_descent(function, init_x, lr=0.01, step_num=100):
    # lr is the learning rate and step_num is the number of updates;
    # the defaults here are 0.01 and 100 respectively
    # x is the current point (an array)
    x = init_x
    # Update the point step_num times
    for i in range(step_num):
        grad = numerical_gradient(function, x)
        # The gradient method formula
        x = x - lr * grad
    # Return the point after step_num updates
    return x

# Create a function for testing
def testfunction(x):
    return x[0]**2 + x[1]**2

# Create x for testing
# (note: this is an integer array, so the in-place updates inside
#  numerical_gradient are truncated on the first call; a float array
#  such as np.array([3.0, 2.0]) gives a more accurate first gradient)
testx = np.array([3, 2])

# Run the gradient method, setting the initial point (init_x),
# the learning rate (lr), and the number of updates (step_num)
gradient_descent(testfunction, init_x=testx, lr=0.1, step_num=100)
```


The output is array([-6.35809854e-07, -3.81434987e-07]). The scientific notation makes it hard to read at a glance, but written out it means

x_1 = -6.358 \times 10^{-7} = -0.0000006358

x_2 = -3.814 \times 10^{-7} = -0.0000003814

This is very close to (x_1, x_2) = (0, 0), so the gradient method produced an almost correct result.
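As a cross-check (my own derivation, not in the book): since the gradient of this f is (2x_1, 2x_2), each update multiplies every coordinate by (1 - 2·lr), which is 0.8 for lr = 0.1, so after 100 steps the point should shrink by a factor of 0.8^100 ≈ 2 × 10^{-10}. The printed result is a few orders of magnitude larger only because testx was created as an integer array, which distorts the very first numerical gradient (see the comment in the code); starting from np.array([3.0, 2.0]) gives a result on the order of 10^{-10}, even closer to (0, 0).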


Reasons for setting the learning rate to an appropriate value

The learning rate should be neither too large nor too small. Let's check why using the code written above.

- **If the learning rate is too large**: change lr in the code from 0.1 to 1. The execution result is array([-2499150084997, -1499450054998]), which is nowhere near (x_1, x_2) = (0, 0). When the learning rate is too large, each update overshoots the minimum, and the values can diverge.

- **If the learning rate is too small**: change lr in the code from 0.1 to 0.000001. The execution result is array([2.97441101, 1.98460701]), which is also far from (x_1, x_2) = (0, 0). Here the value is barely updated in each learning step, so step_num iterations are nowhere near enough learning.
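Both failures can be read off the per-step factor (1 - 2·lr) derived above (again my own back-of-envelope analysis): convergence on this function requires |1 - 2·lr| < 1, i.e. 0 < lr < 1, so lr = 1 or larger cannot converge, while lr = 0.000001 gives a factor of about 0.999998 per step, and (0.999998)^100 ≈ 0.9998 barely moves the point at all.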

Summary

- The gradient method is a method of updating variables so as to maximize or minimize the value of a function.
- There are two variants of the gradient method: the gradient descent method and the gradient ascent method.
- The learning rate must be set to an appropriate value.

Reference book

[Deep Learning from Scratch: Theory and Implementation of Deep Learning Learned with Python (in Japanese)](https://www.amazon.co.jp/dp/4873117585)

Execution environment

OS: Windows 10 / Ubuntu 20.04 LTS
Jupyter Notebook
Python Version: Python 3.8
