Continuing from "Classification of Machine Learning," I will study the algorithms taken up there step by step: the theory, an implementation in Python, and analysis using scikit-learn. I'm writing this for my own learning, so please overlook any mistakes.
So far we have looked at "simple regression" and "multiple regression," but both stayed within the same field of linear regression. This time I would like to summarize the **linear basis function model**, which generalizes linear regression, and the **gradient descent method** used to optimize the loss function. I referred to several sites while writing this.
To draw an approximate curve through a sequence of data points, the multiple regression model

y=w_0x_0+w_1x_1+\cdots+w_nx_n

was used as the approximation. Simple regression, in turn, can be seen as using only two of the terms of this multiple regression equation.
Now, if we write the weight of each term as $(w_0, w_1, \cdots, w_n)$, the function that each weight multiplies can in fact be anything. Writing such a function as $\phi(x)$, the model is expressed as
y(\boldsymbol{x}, \boldsymbol{w}) = \sum_{j=0}^{M-1}w_j\phi_{j}(\boldsymbol{x})
Here $\boldsymbol{w} = (w_0, w_1, \cdots, w_{M-1})^T$ and $\boldsymbol{\phi} = (\phi_0, \phi_1, \cdots, \phi_{M-1})^T$. If $\phi_0 = 1$ (the intercept term), this can be written compactly as
y(\boldsymbol{x}, \boldsymbol{w}) = \boldsymbol{w}^T\boldsymbol{\phi}(\boldsymbol{x})
The functions $\phi_j(x)$ are called **basis functions**.
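As a concrete illustration (my own example, not from the original text): choosing the polynomial basis $\phi_j(x) = x^j$ turns the model into an ordinary polynomial, and Gaussian basis functions are another common choice.

y(x, \boldsymbol{w}) = w_0 + w_1x + w_2x^2 + \cdots + w_{M-1}x^{M-1}

\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2s^2}\right)

Here $\mu_j$ and $s$ are the centers and width of the Gaussians, chosen beforehand rather than learned.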
In this generalized form, linear regression means finding the coefficient vector $\boldsymbol{w}$ that, combined with some basis functions, best represents a given sequence of data.
scikit-learn allows you to use various basis functions for regression.
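As a minimal sketch (my own example; the sine-curve data and degree 3 are arbitrary choices), polynomial basis functions can be combined with ordinary linear regression like this:

```python
# A minimal sketch of regression with polynomial basis functions in
# scikit-learn (my own example; the data and degree are arbitrary).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)[:, None]          # 50 points in [0, 1]
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=50)

# PolynomialFeatures builds the design matrix Phi = [1, x, x^2, x^3],
# i.e. polynomial basis functions; LinearRegression then finds w.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[0.25]])))    # prediction at x = 0.25
```

PolynomialFeatures plays the role of $\boldsymbol{\phi}$, and LinearRegression finds $\boldsymbol{w}$ by least squares.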
For simple regression and multiple regression, we found the coefficients that minimize the residual sum of squares. While $w$ could be found mathematically for simple regression, finding a solution analytically is often very difficult when the basis functions are complicated or the data has many dimensions. In such cases the coefficients have to be found approximately, and this is where the **gradient descent method** comes in. Literally, it is a method that searches for the optimum value while going down the slope (gradient).
Let's think about how to find the coefficients, including how to solve for them mathematically. Each approach is described below.
The analytic approach finds a solution by formula manipulation, as described for simple regression and multiple regression: completing the square, taking partial derivatives, and solving the resulting simultaneous equations. There is no problem when the formula is simple, but when the model is complicated there are cases where it cannot be solved this way.
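For the squared-error loss, the partial-derivative approach leads to the well-known normal equations, $\boldsymbol{w} = (\Phi^T\Phi)^{-1}\Phi^T\boldsymbol{y}$. A minimal NumPy sketch (my own example, reusing the polynomial basis from above):

```python
# A minimal sketch of the analytic (normal-equation) solution for the
# squared-error loss: w = (Phi^T Phi)^(-1) Phi^T y (my own example).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=50)

# Design matrix with polynomial basis functions phi_j(x) = x^j, phi_0 = 1.
Phi = np.vander(x, N=4, increasing=True)    # columns: 1, x, x^2, x^3

# Solving Phi^T Phi w = Phi^T y directly is numerically safer than
# explicitly inverting the matrix.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)                                     # fitted coefficients
```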
The gradient method is, literally, a way of going down the gradient of the loss function. To find the optimum parameters the value of the loss function needs to become small, and the image is that of walking down a slope toward smaller values.
Machine learning sites often introduce the steepest descent method and the stochastic gradient descent method, but in the world of deep learning many more variants of gradient descent are used. With deep learning flourishing, it is fair to say this is an area that is still developing.
Given a loss function $f(x, y)$, the gradient vector is obtained by partially differentiating with respect to $x$ and $y$:

\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)^T

The steepest descent method repeatedly moves the parameters a small step (scaled by a learning rate $\eta$) in the direction opposite to this gradient, which is the direction in which the loss decreases fastest.
However, a weakness of this method is that the loss function does not always have a single minimum; where the method converges depends on how the initial value is taken (it may converge to a local solution).
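A minimal sketch of the steepest descent method on a one-variable function (my own example; the quartic $f(x) = x^4 - 3x^2 + x$ is chosen because it has two local minima), showing how the converged point depends on the initial value:

```python
# A minimal sketch of the steepest descent method (my own example).
# f(x) = x^4 - 3x^2 + x has two local minima, so the result depends
# on the initial value -- the weakness mentioned above.
def grad(x):
    return 4 * x**3 - 6 * x + 1      # derivative of f

def gradient_descent(x, eta=0.01, n_iter=1000):
    for _ in range(n_iter):
        x = x - eta * grad(x)        # step in the downhill direction
    return x

print(gradient_descent(-2.0))        # converges to the left local minimum
print(gradient_descent(2.0))         # converges to the right local minimum
```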
Whereas the steepest descent method computes the gradient from the entire dataset at each step, the stochastic gradient descent (SGD) method uses randomly chosen samples to calculate it.
In most cases SGD seems to converge faster, although a single steepest-descent update can be computed efficiently because the whole dataset is processed at once. In most cases I think it's fine to just use SGD (see the Wikipedia article on stochastic gradient descent).
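A minimal sketch of SGD applied to the linear model with squared error (my own example; the data $y = 1 + 2x$ plus noise and the learning rate 0.1 are arbitrary choices):

```python
# A minimal sketch of stochastic gradient descent for the linear model
# y = w^T phi(x) with squared error (my own example, not from the text).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)
Phi = np.column_stack([np.ones_like(x), x])   # basis: phi_0 = 1, phi_1 = x

w = np.zeros(2)
eta = 0.1
for epoch in range(100):
    for i in rng.permutation(len(y)):          # one random sample per update
        error = Phi[i] @ w - y[i]
        w -= eta * error * Phi[i]              # gradient of the per-sample loss
print(w)                                       # close to [1.0, 2.0]
```

Each update uses only the gradient of the loss for a single sample, $(\boldsymbol{w}^T\boldsymbol{\phi}_i - y_i)\boldsymbol{\phi}_i$, which is why it stays cheap even for large datasets.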
Developing on simple regression and multiple regression, I wrote about the generalized regression model and how to solve for its coefficients. With the theory so far, regression should be possible for a wide variety of samples.
I actually wanted to try the Python implementation as well, but I ran out of energy. Next time, after trying out some implementations in Python, I'd like to summarize overfitting and regularization.