Following up on the algorithms covered in "Classification of Machine Learning", I will work step by step through the theory, an implementation in Python, and an analysis using scikit-learn. I am writing this for my own learning, so please forgive any mistakes.
This time I will talk about the **gradient descent method**, a technique for optimizing the parameters of a loss function. It would be nice if a solution could always be found analytically, but such cases are rare in practice. There are various gradient methods for searching for the optimum, and I would like to summarize a few of them together with code. As usual, the sites I referred to are listed below. Thank you very much.
According to Wikipedia, the gradient method is described as follows:
The gradient method is a general term for algorithms that use information about the gradient of a function to search for a solution in an optimization problem.
In simple regression, the loss function $E$ is the sum of the squared residuals of the line $y = Ax + B$:
E=\sum(y_i-Ax_i-B)^2
As usual, we use scikit-learn's diabetes dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the diabetes data and use BMI as the single explanatory variable
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

x = df['bmi'].values
y = diabetes.target

plt.scatter(x, y)
The slope and intercept obtained by solving mathematically were as follows.
Slope A: 949.43526038395
Intercept B: 152.1334841628967
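As a side note (my addition, not in the original post), these closed-form values can be reproduced with NumPy's least-squares polynomial fit:

# Degree-1 least-squares fit: np.polyfit returns [slope, intercept]
A, B = np.polyfit(x, y, 1)
print("Slope A: ", A)       # about 949.435
print("Intercept B: ", B)   # about 152.133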
The steepest descent method is the idea of gradually moving in the direction opposite to the gradient vector $\nabla E$ at a point on the loss function.
For the loss function above, the gradient is
\frac{\partial{E}}{\partial{A}}=\sum(-2x_iy_i+2Ax_i^2+2Bx_i)
\\
=-2\sum(y_i-Ax_i-B)x_i \\
\frac{\partial{E}}{\partial{B}}=\sum(-2y_i+2Ax_i+2B)
\\
=-2\sum(y_i-Ax_i-B) \\
Choose initial values for $A$ and $B$, repeatedly move them in the direction opposite to the gradient using the formulas above, and stop the calculation at a reasonable point.
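Written out as an explicit update rule with learning rate $\eta$ (my notation for what the text describes, corresponding to eta in the code below), each iteration does
A \leftarrow A-\eta\frac{\partial{E}}{\partial{A}} \\
B \leftarrow B-\eta\frac{\partial{E}}{\partial{B}}
which is exactly what the line self.w -= self.eta * self.grad in the class below performs.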
How to choose the initial values, how to set the learning rate, and when to stop the calculation all need further consideration, but that is the general outline.
I created a SteepestGradientDescent class that implements the steepest descent method, trying to match the interface to scikit-learn. The learning rate was chosen rather arbitrarily, and the calculation is cut off after a fixed number of iterations rather than when the error becomes sufficiently small.
class SteepestGradientDescent:
    def __init__(self, eta=0.001, n_iter=2000):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of iterations
        self.grad = np.zeros((2,))
        self.loss = np.array([])

    def fit(self, X, Y, w0):
        self.w = w0.copy()      # copy so the caller's initial values are not modified
        for _ in range(self.n_iter):
            # record the current value of the loss function
            self.loss = np.append(self.loss, np.sum((Y-self.w[0]-self.w[1]*X)**2))
            # gradient with respect to the intercept B (w[0]) and the slope A (w[1])
            self.grad[0] = np.sum(-2*(Y-self.w[0]-self.w[1]*X))
            self.grad[1] = np.sum(-2*X*(Y-self.w[0]-self.w[1]*X))
            # move in the direction opposite to the gradient
            self.w -= self.eta * self.grad

    def predict(self, x):
        return (self.w[0] + self.w[1]*x)

    @property
    def coef_(self):
        return self.w[1]

    @property
    def intercept_(self):
        return self.w[0]

    @property
    def loss_(self):
        return self.loss
In the fit method, the gradient is calculated over all the data and the coefficients are updated. The value of the loss function is saved at each step as well, so that we can see how the learning proceeds.
Let's use this class to fit the diabetes data and plot the change in the loss.
w0 = np.array([1.,1.])
model = SteepestGradientDescent()
model.fit(x, y, w0)
print("A: ", model.coef_)
print("B: ", model.intercept_)
loss = model.loss
plt.plot(np.arange(len(loss)), np.log10(loss))
plt.show()
A: 932.1335010668406
B: 152.13348416289668
The intercept matches the mathematically solved value almost exactly, and the slope has gotten close (after 2000 iterations it has not fully converged to 949.4 yet). You can also see from the plot how the loss keeps decreasing.
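As an extra sanity check (my addition, not something the original relies on), scikit-learn's LinearRegression gives the same least-squares solution:

from sklearn.linear_model import LinearRegression

# LinearRegression expects a 2-D feature array, hence the reshape
lr = LinearRegression()
lr.fit(x.reshape(-1, 1), y)
print("A: ", lr.coef_[0])    # should be close to 949.435
print("B: ", lr.intercept_)  # should be close to 152.133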
The steepest descent method calculated the gradient using all the samples. The stochastic gradient descent method instead uses one or a few randomly chosen samples to update the parameters. The variant that uses several samples at a time is also called mini-batch learning.
Why bother doing it this way? Depending on its shape, the loss function may have multiple minima. Think of it as terrain: the valley you can see is not necessarily the lowest, and there may be a deeper valley beyond the mountains. A valley bottom that looks right but is not the lowest is called a **local solution**, and the lowest valley bottom is called the **global solution**.
The steepest descent method can easily fall into a local solution depending on the choice of initial value, whereas the stochastic gradient descent method reselects the data used for the gradient calculation every time the parameters are updated, so it seems more likely to get over a low mountain and reach the global solution.
Furthermore, it is well suited to online or real-time processing because the full data set does not have to be fixed in advance.
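Written as a formula (my notation), each update sums the gradient only over a randomly chosen subset $S$ of the samples instead of the whole data set; in the class below the update is in fact applied one random sample at a time, i.e. $|S|=1$:
A \leftarrow A-\eta\sum_{i \in S}(-2)(y_i-Ax_i-B)x_i \\
B \leftarrow B-\eta\sum_{i \in S}(-2)(y_i-Ax_i-B)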
I created a StochasticGradientDescent class in the same way as before. The contents are almost the same as for the steepest descent method except for the parameter-update part. The mini-batch size can be specified as a fraction of the whole data set (sample_rate), and the parameters that gave the lowest loss so far are remembered and returned.
class StochasticGradientDescent:
    def __init__(self, eta=0.01, n_iter=2000, sample_rate=0.1):
        self.eta = eta                  # learning rate
        self.n_iter = n_iter            # number of iterations
        self.sample_rate = sample_rate  # fraction of the data used per iteration
        self.grad = np.zeros((2,))
        self.loss = np.array([])

    def fit(self, X, Y, w0):
        self.w = w0.copy()       # copy so the caller's initial values are not modified
        self.min_w = w0.copy()
        n_samples = int(np.ceil(len(X)*self.sample_rate))
        min_loss = 10**18
        for _ in range(self.n_iter):
            loss = np.sum((Y-self.w[0]-self.w[1]*X)**2)
            if min_loss > loss:
                # remember the parameters with the lowest loss so far
                # (copy, otherwise min_w would keep tracking the current w)
                min_loss = loss
                self.min_w = self.w.copy()
            self.loss = np.append(self.loss, loss)
            # update the parameters one randomly chosen sample at a time
            for i in range(n_samples):
                index = np.random.randint(0, len(X))
                batch_x = X[index]
                batch_y = Y[index]
                self.grad[0] = np.sum(-2*(batch_y-self.w[0]-self.w[1]*batch_x))
                self.grad[1] = np.sum(-2*batch_x*(batch_y-self.w[0]-self.w[1]*batch_x))
                self.w -= self.eta * self.grad

    def predict(self, x):
        return (self.w[0] + self.w[1]*x)

    @property
    def coef_(self):
        return self.min_w[1]

    @property
    def intercept_(self):
        return self.min_w[0]

    @property
    def loss_(self):
        return self.loss
Run this and look at the change in the loss in the same way.
w0 = np.array([1.,1.])
model = StochasticGradientDescent()
model.fit(x, y, w0)
print("A: ", model.coef_)
print("B: ", model.intercept_)
loss = model.loss
plt.plot(np.arange(len(loss)), np.log10(loss))
plt.show()
A: 925.5203159922469
B: 146.8724188770836
The result is a little different from the exact solution, and the loss jumps around quite a bit. I suspect the spikes would be smaller if the learning rate were lowered, and that because of the random sampling (and the rather weak correlation in the original data) it will not converge exactly to the analytical solution. Tuning the parameters around here remains an issue.
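As one experiment along those lines (my addition; just a sketch using only the parameters the class already exposes), the learning rate could be lowered to see whether the spikes shrink:

# Same data, smaller learning rate: the loss curve should jump around less,
# though it will probably take longer to get close to the optimum
model_small = StochasticGradientDescent(eta=0.001, n_iter=2000, sample_rate=0.1)
model_small.fit(x, y, np.array([1., 1.]))

plt.plot(np.arange(len(model_small.loss_)), np.log10(model_small.loss_), label='eta=0.001')
plt.plot(np.arange(len(loss)), np.log10(loss), label='eta=0.01')
plt.legend()
plt.show()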
As gradient descent methods, I looked into the steepest descent method and the stochastic gradient descent method. In the world of scikit-learn and deep learning there are various other gradient methods, but the basics are the same. Working through them step by step like this may make it easier to picture how the parameters should be tuned.
Next time, I would like to summarize regression, overfitting, and regularization.