I have started learning deep learning. This time, I will simulate **stochastic gradient descent (SGD)** with Jupyter Notebook.
Plain gradient descent computes the gradient from all of the data and then updates the weights, so each update is costly, and once it falls into a local solution it has a hard time getting back out. Stochastic gradient descent (SGD) instead computes the gradient from a randomly sampled subset of the data at each update, so the gradient estimate fluctuates; this fluctuation helps it climb out of local solutions and reach a better one, and each update is also cheaper to compute.
I found it very interesting that **this fluctuation is itself a means of reaching a better solution**, so in this post I simulate stochastic gradient descent (SGD) with Jupyter Notebook.
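Before the simulation, here is a minimal sketch of the difference between the two update rules on a toy least-squares problem. The data, variable names, and loss below are illustrative assumptions and are not part of the simulation that follows; the point is only that plain gradient descent uses all the data for every update, while SGD uses a random subset, so its updates fluctuate.
import numpy as np
# Toy data: y is roughly 2 * x plus noise, and we fit a single weight w
rng = np.random.default_rng(0)
x_toy = rng.normal(size=100)
y_toy = 2.0 * x_toy + rng.normal(scale=0.1, size=100)
def grad(w, xs, ys):
    # Gradient of the mean squared error mean((w*xs - ys)**2) with respect to w
    return np.mean(2.0 * (w * xs - ys) * xs)
w_gd, w_sgd, alpha = 0.0, 0.0, 0.1
for step in range(100):
    # Plain gradient descent: gradient from all the data at every step
    w_gd -= alpha * grad(w_gd, x_toy, y_toy)
    # SGD: gradient from a random subset (10 points), so the update fluctuates
    idx = rng.choice(len(x_toy), size=10, replace=False)
    w_sgd -= alpha * grad(w_sgd, x_toy[idx], y_toy[idx])
print(w_gd, w_sgd)  # both approach 2.0; the SGD trajectory is noisier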
This time, for simplicity, we use a single weight. We take 11 (x, y) points and fit them with a 6th-degree polynomial.
import numpy as np
import matplotlib.pyplot as plt
#Data (for polynomial creation)
x = np.array([-5.0, -4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([ 5.0, 1.5, 2.0, 1.5, 0.0, -3.0, -1.0, 2.0, 3.0, 2.5, 5.0])
#Polynomial creation (6 dimensions)
p = np.poly1d(np.polyfit(x, y, 6))
print(p)
#View data and polynomials
xp = np.linspace(-10, 10, 100)
plt.plot(x, y, '.', xp, p(xp), '-')
plt.xlim(-7, 7)
plt.ylim(-5, 10)
plt.show()
Using the fitted polynomial, we compute y at 100 values of x evenly spaced from -10 to 10. Real observations would be noisy, so we add Gaussian noise with mean 0 and standard deviation 0.2 to each y.
# Create 100 data points from the polynomial (with Gaussian noise, mean 0, std 0.2)
x_add, y_add = [], []
for i in np.linspace(-10, 10, 100):
    x_add.append(i)
    y_add.append(p(i) + np.random.normal(0, 0.2))
#Display the created data
plt.scatter(x_add, y_add, alpha=0.5)
plt.xlim(-7, 7)
plt.ylim(-5, 10)
plt.show()
We have created 100 data points with local solutions around x = -4 and x = 4 and the optimal solution around x = 0.
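As a rough check (a minimal sketch using the p and np already defined above; the exact values depend on the fit), the positions of these solutions can be read off from the critical points of the fitted polynomial, i.e. the roots of its derivative:
# Critical points of the fitted polynomial p = roots of its 5th-degree derivative
d_p = p.deriv()
crit = d_p.r                                 # roots (complex in general)
crit = np.sort(crit[np.isreal(crit)].real)   # keep only the real critical points
print(crit)    # the minima among these are the solutions near x = -4, 0, 4
print(p(crit)) # p values there distinguish the minima from the maxima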
This is the main part of the code. train_test_split randomly samples 10 of the 100 data points. Using only those 10 points, we fit a 6th-degree polynomial, take its derivative with d_y = p.deriv(), evaluate the gradient at the current weight, and update the weight. Each update is drawn as one frame and animated with matplotlib's animation module.
from sklearn.model_selection import train_test_split
from matplotlib import animation, rc

# Settings
rc('animation', html='jshtml')
w = np.array([-2.])  # default initial weight (the call below passes its own)

# Random sampling function (sample 10 of the 100 points)
def random_sampling():
    X_train, X_test, y_train, y_test = train_test_split(x_add, y_add, test_size=0.90)
    _x = X_train
    _y = y_train
    return _x, _y

# Function that draws one frame
def animate(frame, w, alpha):
    _x, _y = random_sampling()
    # Fit a 6th-degree polynomial to the 10 sampled points and take its derivative
    p = np.poly1d(np.polyfit(_x, _y, 6))
    d_y = p.deriv()
    # Redraw the frame: the fitted curve and the current weight position
    plt.clf()
    plt.plot(xp, p(xp), '-', color='green')
    plt.plot(w, p(w), '.', color='red', markersize=20)
    plt.xlim(-7, 7)
    plt.ylim(-5, 10)
    # Gradient at the current weight; updating w in place carries it to the next frame
    grad = d_y(w)
    w -= alpha * grad

# Animation creation function
def gradient_descent(alpha, w):
    fig, ax = plt.subplots()
    if type(w) is list:
        w = np.array(w, dtype=np.float32)
    anim = animation.FuncAnimation(fig, animate, fargs=(w, alpha), frames=100, interval=300)
    return anim
Now, let's run the simulation with learning rate alpha = 0.3 and initial weight w = 3.5.
# Run with learning rate 0.3 and initial weight 3.5
gradient_descent(alpha=0.3, w=np.array([3.5]))
When you run the code, the animation controls are displayed; play it with the ▶ button. Since the process is stochastic it may not escape the local solution every time, but try it a few times and you will see the fluctuation at work. It is also interesting to play with the parameters.
Here is an example of a successful run (learning rate alpha = 0.3, initial weight w = 3.5, loop playback). Thanks to the fluctuation in the gradient calculation, the weight does not get stuck at the local solution around x = 4 but reaches the optimal solution around x = 0.
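If you want to explore further, you can call the function again with other settings; the values below are just illustrative examples, for instance starting near the other local solution with a smaller learning rate.
# Illustrative example: different starting point and learning rate
gradient_descent(alpha=0.1, w=np.array([-4.5]))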