Following up on the algorithms covered in "Classification of Machine Learning", I will work step by step through the theory, an implementation in Python, and an analysis using scikit-learn. I am writing this for my own learning, so please forgive any mistakes.
This time I will talk about the **gradient descent method**, a technique for optimizing the parameters of a loss function. It would be nice if a solution could always be found analytically, but such cases are rare in practice. There are various gradient methods for searching for the optimum, and I would like to summarize a few of them together with code. As usual, the sites I referred to are listed below. Thank you very much.
According to Wikipedia, the gradient method is described as follows:
The gradient method is a general term for algorithms that use information about the gradient of a function to search for a solution in an optimization problem.
In simple regression, the loss function $E$ is the sum of the squared residuals of the line $y = Ax + B$:
E=\sum(y_i-Ax_i-B)^2
As usual, we use scikit-learn's diabetes dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the diabetes data and use BMI as the single explanatory variable
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

x = df['bmi'].values
y = diabetes.target

plt.scatter(x, y)
The slope and intercept obtained by solving mathematically were as follows.
Slope A: 949.43526038395
Intercept B: 152.1334841628967
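As a side note (my addition, not in the original post), these closed-form values can be reproduced with NumPy's least-squares polynomial fit:

# Degree-1 least-squares fit: np.polyfit returns [slope, intercept]
A, B = np.polyfit(x, y, 1)
print("Slope A: ", A)       # about 949.435
print("Intercept B: ", B)   # about 152.133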
The steepest descent method is the idea of gradually moving in the direction opposite to the gradient vector $\nabla E$ at a point on the loss function.
For the loss function above, the gradient is
\frac{\partial{E}}{\partial{A}}=\sum(-2x_iy_i+2Ax_i^2+2Bx_i)
\\
=-2\sum(y_i-Ax_i-B)x_i \\
\frac{\partial{E}}{\partial{B}}=\sum(-2y_i+2Ax_i+2B)
\\
=-2\sum(y_i-Ax_i-B) \\
Choose initial values for $A$ and $B$, repeatedly move them in the direction opposite to the gradient using the formulas above, and stop the calculation at a reasonable point.
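Written out as an explicit update rule with learning rate $\eta$ (my notation for what the text describes, corresponding to eta in the code below), each iteration does
A \leftarrow A-\eta\frac{\partial{E}}{\partial{A}} \\
B \leftarrow B-\eta\frac{\partial{E}}{\partial{B}}
which is exactly what the line self.w -= self.eta * self.grad in the class below performs.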
How to choose the initial values, how to set the learning rate, and when to stop the calculation all need further consideration, but that is the general outline.
I created a SteepestGradientDescent class that implements the steepest descent method, trying to match the interface to scikit-learn. The learning rate was chosen rather arbitrarily, and the calculation is cut off after a fixed number of iterations rather than when the error becomes sufficiently small.
class SteepestGradientDescent:
    def __init__(self, eta=0.001, n_iter=2000):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of iterations
        self.grad = np.zeros((2,))
        self.loss = np.array([])

    def fit(self, X, Y, w0):
        self.w = w0.copy()      # copy so the caller's initial values are not modified
        for _ in range(self.n_iter):
            # record the current value of the loss function
            self.loss = np.append(self.loss, np.sum((Y-self.w[0]-self.w[1]*X)**2))
            # gradient with respect to the intercept B (w[0]) and the slope A (w[1])
            self.grad[0] = np.sum(-2*(Y-self.w[0]-self.w[1]*X))
            self.grad[1] = np.sum(-2*X*(Y-self.w[0]-self.w[1]*X))
            # move in the direction opposite to the gradient
            self.w -= self.eta * self.grad

    def predict(self, x):
        return (self.w[0] + self.w[1]*x)

    @property
    def coef_(self):
        return self.w[1]

    @property
    def intercept_(self):
        return self.w[0]

    @property
    def loss_(self):
        return self.loss
In the fit method, the gradient is calculated over all the data and the coefficients are updated. The value of the loss function is saved at each step as well, so that we can see how the learning proceeds.
Let's use this class to fit the diabetes data and plot the change in the loss.
w0 = np.array([1.,1.])
model = SteepestGradientDescent()
model.fit(x, y, w0)
print("A: ", model.coef_)
print("B: ", model.intercept_)
loss = model.loss
plt.plot(np.arange(len(loss)), np.log10(loss))
plt.show()
A: 932.1335010668406
B: 152.13348416289668
The intercept matches the mathematically solved value almost exactly, and the slope has gotten close (after 2000 iterations it has not fully converged to 949.4 yet). You can also see from the plot how the loss keeps decreasing.
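As an extra sanity check (my addition, not something the original relies on), scikit-learn's LinearRegression gives the same least-squares solution:

from sklearn.linear_model import LinearRegression

# LinearRegression expects a 2-D feature array, hence the reshape
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y)
print("A: ", lr.coef_[0])    # should be close to 949.435
print("B: ", lr.intercept_)  # should be close to 152.133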
The steepest descent method calculated the gradient using all the samples. The stochastic gradient descent method instead uses one or a few randomly chosen samples to update the parameters. The variant that uses several samples at a time is also called mini-batch learning.
Why bother doing it this way? Depending on its shape, the loss function may have multiple minima. Think of it as terrain: the valley you can see is not necessarily the lowest, and there may be a deeper valley beyond the mountains. A valley bottom that looks right but is not the lowest is called a **local solution**, and the lowest valley bottom is called the **global solution**.
The steepest descent method can easily fall into a local solution depending on the choice of initial value, whereas the stochastic gradient descent method reselects the data used for the gradient calculation every time the parameters are updated, so it seems more likely to get over a low mountain and reach the global solution.
Furthermore, it is well suited to online or real-time processing because the full data set does not have to be fixed in advance.
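Written as a formula (my notation), each update sums the gradient only over a randomly chosen subset $S$ of the samples instead of the whole data set; in the class below the update is in fact applied one random sample at a time, i.e. $|S|=1$:
A \leftarrow A-\eta\sum_{i \in S}(-2)(y_i-Ax_i-B)x_i \\
B \leftarrow B-\eta\sum_{i \in S}(-2)(y_i-Ax_i-B)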
I created a StochasticGradientDescent class in the same way as before. The contents are almost the same as for the steepest descent method except for the parameter-update part. The mini-batch size can be specified as a fraction of the whole data set (sample_rate), and the parameters that gave the lowest loss so far are remembered and returned.
class StochasticGradientDescent:
    def __init__(self, eta=0.01, n_iter=2000, sample_rate=0.1):
        self.eta = eta                  # learning rate
        self.n_iter = n_iter            # number of iterations
        self.sample_rate = sample_rate  # fraction of the data used per iteration
        self.grad = np.zeros((2,))
        self.loss = np.array([])

    def fit(self, X, Y, w0):
        self.w = w0.copy()       # copy so the caller's initial values are not modified
        self.min_w = w0.copy()
        n_samples = int(np.ceil(len(X)*self.sample_rate))
        min_loss = 10**18
        for _ in range(self.n_iter):
            loss = np.sum((Y-self.w[0]-self.w[1]*X)**2)
            if min_loss > loss:
                # remember the parameters with the lowest loss so far
                # (copy, otherwise min_w would keep tracking the current w)
                min_loss = loss
                self.min_w = self.w.copy()
            self.loss = np.append(self.loss, loss)
            # update the parameters one randomly chosen sample at a time
            for i in range(n_samples):
                index = np.random.randint(0, len(X))
                batch_x = X[index]
                batch_y = Y[index]
                self.grad[0] = np.sum(-2*(batch_y-self.w[0]-self.w[1]*batch_x))
                self.grad[1] = np.sum(-2*batch_x*(batch_y-self.w[0]-self.w[1]*batch_x))
                self.w -= self.eta * self.grad

    def predict(self, x):
        return (self.w[0] + self.w[1]*x)

    @property
    def coef_(self):
        return self.min_w[1]

    @property
    def intercept_(self):
        return self.min_w[0]

    @property
    def loss_(self):
        return self.loss
Run this and look at the change in the loss in the same way.
w0 = np.array([1.,1.])
model = StochasticGradientDescent()
model.fit(x, y, w0)
print("A: ", model.coef_)
print("B: ", model.intercept_)
loss = model.loss
plt.plot(np.arange(len(loss)), np.log10(loss))
plt.show()
A: 925.5203159922469
B: 146.8724188770836
The result is a little different from the exact solution, and the loss jumps around quite a bit. I suspect the spikes would be smaller if the learning rate were lowered, and that because of the random sampling (and the rather weak correlation in the original data) it will not converge exactly to the analytical solution. Tuning the parameters around here remains an issue.
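As one experiment along those lines (my addition; just a sketch using only the parameters the class already exposes), the learning rate could be lowered to see whether the spikes shrink:

# Same data, smaller learning rate: the loss curve should jump around less,
# though it will probably take longer to get close to the optimum
model_small = StochasticGradientDescent(eta=0.001, n_iter=2000, sample_rate=0.1)
model_small.fit(x, y, np.array([1., 1.]))

plt.plot(np.arange(len(model_small.loss_)), np.log10(model_small.loss_), label='eta=0.001')
plt.plot(np.arange(len(loss)), np.log10(loss), label='eta=0.01')
plt.legend()
plt.show()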
As gradient descent methods, I looked into the steepest descent method and the stochastic gradient descent method. In the world of scikit-learn and deep learning there are various other gradient methods, but the basics are the same. Working through them step by step like this may make it easier to picture how the parameters should be tuned.
Next time, I would like to summarize regression, overfitting, and regularization.