Machine learning algorithm (gradient descent method)

Introduction

Continuing from the algorithms covered in "Classification of Machine Learning," I will study each one step by step: the theory, an implementation in Python, and an analysis using scikit-learn. I'm writing this for my own learning, so please forgive any mistakes.

This time, I will talk about the **gradient descent method**, a technique for optimizing the parameters of a loss function. It would be nice if a solution could always be found analytically, but such cases are rare in practice. There are various gradient methods for searching for the optimal value, and I would like to summarize some of them with code. As usual, the sites I referred to are as follows. Thank you very much.

About the gradient method

According to Wikipedia, the gradient method is:

The Gradient method is a general term for algorithms that use information about the gradient of a function to search for a solution in an optimization problem.

So it is written. In simple regression, to approximate data with the function $y = Ax + B$, we defined the loss function $E$ as the sum of squared residuals, $E = \sum_{i=0}^{N} (y_i - (Ax_i + B))^2$, and found the $A$ and $B$ that minimize $E$. In that example $A$ and $B$ could be computed analytically, but even when the loss function is more complicated, gradient descent is an approach for finding the (probably) smallest point.

Theme

As usual, we use scikit-learn's diabetes dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Use only the BMI feature as the explanatory variable
x = df['bmi'].values
y = diabetes.target
plt.scatter(x, y)
bmi_vs_target_1.png

The slope and intercept obtained by solving analytically were:

Slope A:  949.43526038395
Intercept B:  152.1334841628967
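As a quick cross-check (a minimal sketch, assuming the `x` and `y` arrays loaded above), roughly the same values can be obtained with NumPy's least-squares fit:

# Cross-check of the analytic solution with a least-squares fit (sketch).
# np.polyfit(x, y, 1) fits y = A*x + B and returns [A, B].
A_exact, B_exact = np.polyfit(x, y, 1)
print("Slope A: ", A_exact)       # should be about 949.435
print("Intercept B: ", B_exact)   # should be about 152.133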

The steepest descent method

The steepest descent method is the idea of moving little by little in the direction opposite to the gradient vector at the current point on the loss function. Writing the gradient as $\nabla E$, the parameters $w$ are updated as $w_{t+1} = w_{t} - \eta \nabla E(w_{t})$. Here $\eta$ is called the learning rate and controls the step size; if it is too large, the solution will not converge and will diverge.
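To illustrate just the update rule, here is a minimal sketch on a toy one-variable function (not the regression problem yet; the function and values are made up for illustration):

# Sketch: gradient descent on f(w) = (w - 3)**2, whose gradient is 2*(w - 3).
w = 0.0      # arbitrary initial value
eta = 0.1    # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # gradient of f at the current w
    w -= eta * grad      # move against the gradient
print(w)  # approaches the minimizer w = 3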

To obtain $A$ and $B$ for the loss function $E = \sum_{i=0}^{N} (y_i - (Ax_i + B))^2$, we take the partial derivative of $E$ with respect to each of $A$ and $B$. (The index of $\sum$ is omitted below.)

\frac{\partial{E}}{\partial{A}}=\sum(-2x_iy_i+2Ax_i^2+2Bx_i) \\
=-2\sum(y_i-Ax_i-B)x_i \\
\frac{\partial{E}}{\partial{B}}=\sum(-2y_i+2Ax_i+2B) \\
=-2\sum(y_i-Ax_i-B)

Choose the initial values appropriately, update $A$ and $B$ in the direction opposite to the gradient using the formulas above, and stop the calculation at a reasonable point.

Further consideration is needed on how to choose the initial values, how to set the learning rate, and when to stop the calculation, but that is the outline.
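One way to sanity-check the partial derivatives above is a numerical gradient check. A small sketch, assuming the `x` and `y` arrays from the data section and arbitrary test values for $A$ and $B$:

# Sketch: compare the analytic gradient with a central finite difference.
def loss(A, B):
    return np.sum((y - (A * x + B)) ** 2)

A, B, h = 1.0, 1.0, 1e-5
dA_analytic = -2 * np.sum((y - (A * x + B)) * x)
dB_analytic = -2 * np.sum(y - (A * x + B))
dA_numeric = (loss(A + h, B) - loss(A - h, B)) / (2 * h)
dB_numeric = (loss(A, B + h) - loss(A, B - h)) / (2 * h)
print(dA_analytic, dA_numeric)  # the two values should be nearly equal
print(dB_analytic, dB_numeric)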

Implementation in Python

I created a SteepestGradientDescent class that implements the steepest descent method, trying to match the interface to scikit-learn's. The learning rate was chosen somewhat arbitrarily, and training stops after a fixed number of iterations rather than when the error becomes sufficiently small.

class SteepestGradientDescent:
  def __init__(self, eta=0.001, n_iter=2000):
    self.eta = eta          # learning rate
    self.n_iter = n_iter    # number of iterations
    self.grad = np.zeros((2,))
    self.loss = np.array([])

  def fit(self, X, Y, w0):
    self.w = w0.copy()      # copy so the caller's initial values are not modified
    for _ in range(self.n_iter):
      # record the current loss, then compute the gradient over ALL samples
      self.loss = np.append(self.loss, np.sum((Y-self.w[0]-self.w[1]*X)**2))
      self.grad[0] = np.sum(-2*(Y-self.w[0]-self.w[1]*X))      # dE/dB (intercept)
      self.grad[1] = np.sum(-2*X*(Y-self.w[0]-self.w[1]*X))    # dE/dA (slope)

      self.w -= self.eta * self.grad

  def predict(self, x):
    return (self.w[0] + self.w[1]*x)

  @property
  def coef_(self):
    return self.w[1]

  @property
  def intercept_(self):
    return self.w[0]

  @property
  def loss_(self):
    return self.loss

In the fit method, the gradient is calculated over all the data and the coefficients are updated. The value of the loss function is also saved at each step so that we can see how the learning proceeds.

Let's use this class to approximate the diabetes data and plot the change in loss.

w0 = np.array([1.,1.])

model = SteepestGradientDescent()
model.fit(x, y, w0)

print("A: ", model.coef_)
print("B: ", model.intercept_)

loss = model.loss
plt.plot(np.arange(len(loss)), np.log10(loss))

plt.show()

A: 932.1335010668406
B: 152.13348416289668
steepest_loss.png

The result roughly matches the analytically solved value, and you can see how the loss decreases.

Stochastic gradient descent

The steepest descent method calculated the gradient using all the samples. Stochastic gradient descent instead updates the parameters using one randomly chosen sample (or a small number of them) at a time. Using a small subset per update is also called mini-batch learning.
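For example, a single mini-batch update might look like this (a sketch with illustrative values; `x` and `y` are the arrays loaded above, and, as in the classes in this article, `w[0]` is the intercept and `w[1]` is the slope):

# Sketch of one mini-batch parameter update (illustrative values).
eta = 0.01
w = np.array([1.0, 1.0])
batch_size = 32
idx = np.random.choice(len(x), batch_size, replace=False)  # random subset of samples
xb, yb = x[idx], y[idx]
residual = yb - (w[0] + w[1] * xb)
w[0] -= eta * np.sum(-2 * residual)        # gradient w.r.t. the intercept B
w[1] -= eta * np.sum(-2 * residual * xb)   # gradient w.r.t. the slope A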

Why go to this trouble? Depending on its shape, the loss function may have multiple minima. Think of it as terrain: the valley you can see is not necessarily the lowest, and a deeper valley may lie beyond the mountains. A valley bottom that looks correct but is not the deepest is called a **local solution**, and the deepest valley bottom is called the **global solution**.

The steepest descent method can easily get stuck in a local solution depending on the choice of initial value, whereas stochastic gradient descent reselects the data used for the gradient at every update, so it seems more likely to cross a low ridge and reach the global solution.

It is also well suited to online or real-time processing because all of the data does not need to be fixed in advance.

Implementation in Python

I created a StochasticGradientDescent class as before. The contents are almost the same as the steepest descent method except for the parameter-update part. The mini-batch size can be specified as a fraction of the whole dataset (sample_rate). The parameters with the lowest loss are remembered and returned.

class StochasticGradientDescent:
  def __init__(self, eta=0.01, n_iter=2000, sample_rate=0.1):
    self.eta = eta                  # learning rate
    self.n_iter = n_iter            # number of epochs
    self.sample_rate = sample_rate  # fraction of the data used per epoch
    self.grad = np.zeros((2,))
    self.loss = np.array([])

  def fit(self, X, Y, w0):
    self.w = w0.copy()          # copy so the caller's initial values are not modified
    self.min_w = self.w.copy()  # remember the parameters with the lowest loss
    n_samples = int(np.ceil(len(X)*self.sample_rate))
    min_loss = 10**18
    for _ in range(self.n_iter):
      loss = np.sum((Y-self.w[0]-self.w[1]*X)**2)
      if min_loss > loss:
        min_loss = loss
        self.min_w = self.w.copy()
      self.loss = np.append(self.loss, loss)
      # update the parameters one randomly chosen sample at a time
      for i in range(n_samples):
        index = np.random.randint(0, len(X))
        batch_x = X[index]
        batch_y = Y[index]
        self.grad[0] = -2*(batch_y-self.w[0]-self.w[1]*batch_x)           # dE/dB (intercept)
        self.grad[1] = -2*batch_x*(batch_y-self.w[0]-self.w[1]*batch_x)   # dE/dA (slope)
        self.w -= self.eta * self.grad

  def predict(self, x):
    return (self.w[0] + self.w[1]*x)

  @property
  def coef_(self):
    return self.min_w[1]

  @property
  def intercept_(self):
    return self.min_w[0]

  @property
  def loss_(self):
    return self.loss

Run this and look at the change in loss as well.

w0 = np.array([1.,1.])

model = StochasticGradientDescent()
model.fit(x, y, w0)

print("A: ", model.coef_)
print("B: ", model.intercept_)

loss = model.loss
plt.plot(np.arange(len(loss)), np.log10(loss))

plt.show()

A:  925.5203159922469
B:  146.8724188770836
SGD_loss.png

The result differs a little from the exact solution, and the loss fluctuates noticeably. Lowering the learning rate would probably reduce the height of the spikes, and whether it converges to the exact solution may depend on the random sampling and on how strongly the original samples are correlated. Tuning the parameters here remains an issue.
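As a starting point for that tuning, one could compare a few learning rates (a sketch reusing the class above; the `eta` values are arbitrary choices):

# Sketch: compare loss curves for a few arbitrary learning rates.
for eta in (0.001, 0.005, 0.01):
    m = StochasticGradientDescent(eta=eta)
    m.fit(x, y, np.array([1.0, 1.0]))
    plt.plot(np.log10(m.loss_), label="eta={}".format(eta))
plt.legend()
plt.show()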

Summary

As gradient descent methods, I looked into the steepest descent method and stochastic gradient descent. The worlds of scikit-learn and deep learning offer various other gradient-based optimizers, but the basics are the same. Going through them step by step like this may make parameter tuning easier to picture.

Next time, I would like to summarize regression together with overfitting and regularization.
