I will go through, step by step, the theory, an implementation in Python, and an analysis with scikit-learn for each of the algorithms previously taken up in "Classification of Machine Learning". I'm writing this for my own learning, so please forgive any mistakes.
This time I'd like to build on simple regression analysis and try "multiple regression analysis". I referred to the following page.
In simple regression analysis, we found the coefficients $A$ and $B$ needed to draw the approximate straight line $y = Ax + B$. In multiple regression there are several explanatory variables, so the equation of the line becomes

y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n

From here on this is almost a copy of the article I referred to, but I will try to write it as plainly as I can.
Writing the equation of the line in matrix form,

y = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \boldsymbol{w}^T\boldsymbol{x}

(where $x_0 = 1$).
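(A supplementary note in my own notation, not from the referenced article: stacking all $N$ observations, each row carrying a leading $x_0 = 1$, the predictions for the whole dataset can be written as $\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{w}$, which is the form used in the derivation below.)

\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{w} =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1n} \\
1 & x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nn}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}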
To determine $\boldsymbol{w}$ we minimize the sum of squared errors $L$ between the measured values $y_i$ and the predicted values $\hat{y}_i$:

L = \sum_{i=1}^{N}(y_i-\hat{y}_i)^2 \\
= (\boldsymbol{y}-\hat{\boldsymbol{y}})^{T}(\boldsymbol{y}-\hat{\boldsymbol{y}}) \\
= (\boldsymbol{y}-\boldsymbol{Xw})^{T}(\boldsymbol{y}-\boldsymbol{Xw}) \\
= (\boldsymbol{y}^{T}-(\boldsymbol{Xw})^{T})(\boldsymbol{y}-\boldsymbol{Xw}) \\
= (\boldsymbol{y}^{T}-\boldsymbol{w}^{T}\boldsymbol{X}^{T})(\boldsymbol{y}-\boldsymbol{Xw}) \\
= \boldsymbol{y}^{T}\boldsymbol{y}-\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{w}-\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{y}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} \\
= \boldsymbol{y}^{T}\boldsymbol{y}-2\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{w}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}
The parts that do not contain $\boldsymbol{w}$ are constants, so putting $\boldsymbol{X}^{T}\boldsymbol{X} = A$, $-2\boldsymbol{y}^{T}\boldsymbol{X} = B$ and $\boldsymbol{y}^{T}\boldsymbol{y} = C$, the sum of squared errors $L$ becomes

L = C + B\boldsymbol{w} + \boldsymbol{w}^{T}A\boldsymbol{w}

Differentiating $L$ with respect to $\boldsymbol{w}$,
\begin{split}\begin{aligned}
\frac{\partial}{\partial {\boldsymbol{w}}} L
&= \frac{\partial}{\partial {\boldsymbol{w}}} (C + B\boldsymbol{w} + \boldsymbol{w}^T{A}\boldsymbol{w}) \\
&= \frac{\partial}{\partial {\boldsymbol{w}}} (C) + \frac{\partial}{\partial {\boldsymbol{w}}} ({B}{\boldsymbol{w}}) + \frac{\partial}{\partial {\boldsymbol{w}}} ({\boldsymbol{w}}^{T}{A}{\boldsymbol{w}}) \\
&= {B} + \boldsymbol{w}^{T}({A} + {A}^{T})
\end{aligned}\end{split}
We want this to be 0 (that is where $L$ is minimized), so
\boldsymbol{w}^T(A+A^T)=-B \\
\boldsymbol{w}^T(\boldsymbol{X}^{T}\boldsymbol{X}+(\boldsymbol{X}^{T}\boldsymbol{X})^T)=2\boldsymbol{y}^{T}\boldsymbol{X} \\
\boldsymbol{w}^T\boldsymbol{X}^{T}\boldsymbol{X}=\boldsymbol{y}^T\boldsymbol{X} \\
\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^T\boldsymbol{y} \\
This is a system of linear equations (simultaneous equations), and it cannot be solved unless $\boldsymbol{X}^{T}\boldsymbol{X}$ is regular (invertible). It is not regular when some of the explanatory variables are strongly correlated, that is, when one data column can be explained by another. This state is called **multicollinearity**.
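As a quick illustration (a minimal sketch with made-up numbers, not the diabetes data): if one column of $\boldsymbol{X}$ is an exact multiple of another, $\boldsymbol{X}^{T}\boldsymbol{X}$ becomes singular and the inverse does not exist.

import numpy as np

# Toy design matrix: the third column is exactly 2x the second,
# so the columns are linearly dependent (perfect multicollinearity)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3, so X^T X is not invertible
# np.linalg.inv(XtX) would raise LinAlgError: Singular matrix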
Assuming it is regular
(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \\
\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}
Now we have $\boldsymbol{w}$.
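As a side note (my own addition, not from the referenced article): in practice it is usually better not to invert $\boldsymbol{X}^{T}\boldsymbol{X}$ explicitly; NumPy's `np.linalg.solve` or `np.linalg.lstsq` solve the same normal equation more stably. A minimal sketch on random made-up data:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept column + 2 features
y = rng.normal(size=50)

w_inv = np.linalg.inv(X.T @ X) @ X.T @ y        # normal equation with an explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)     # same equation, no explicit inverse
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares directly on X

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))  # True True

In this article I will still use the inverse directly, since it matches the formula above.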
For the data we use scikit-learn's diabetes dataset. Let's see how the target (disease progression after one year) is related to the BMI and s5 (ltg: lamotrigine) features.
First, let's plot the data. With two explanatory variables plus the target, the data is three-dimensional; beyond three dimensions we would not be able to draw a graph at all.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
%matplotlib inline

# Load the diabetes dataset and put the features into a DataFrame
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# 3D scatter plot of the two explanatory variables against the target
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches axes in recent matplotlib
x1 = df['bmi']
x2 = df['s5']
y = diabetes.target
ax.scatter3D(x1, x2, y)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
plt.show()
The result is the graph below, which looks like a sloping hillside.
The formula for finding $\boldsymbol{w}$ was the following; it seems to be called the normal equation.

\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}

Writing this in Python:
# Design matrix: a column of ones (for the intercept) plus the bmi and s5 columns
X = pd.concat([pd.Series(np.ones(len(df['bmi']))), df.loc[:,['bmi','s5']]], axis=1, ignore_index=True).values
y = diabetes.target

# Normal equation: w = (X^T X)^(-1) X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)

Result:
[152.13348416 675.06977443 614.95050478]
Matrix calculations in Python are intuitive and pleasant. Incidentally, if there is only one explanatory variable, the result is the same as in simple regression, which is natural since the formula is just the generalization to $n$ explanatory variables.
# Same calculation with only one explanatory variable (bmi)
X = pd.concat([pd.Series(np.ones(len(df['bmi']))), df.loc[:,['bmi']]], axis=1, ignore_index=True).values
y = diabetes.target
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)
[152.13348416 949.43526038]
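As a cross-check (my own sketch, not part of the referenced article), these values should match the usual simple regression formulas, slope $A=\mathrm{Cov}(x,y)/\mathrm{Var}(x)$ and intercept $B=\bar{y}-A\bar{x}$:

# df and diabetes are the objects loaded earlier in this article
x = df['bmi'].values
y = diabetes.target

A = np.cov(x, y, ddof=0)[0, 1] / np.var(x)  # slope: covariance / variance
B = y.mean() - A * x.mean()                 # intercept
print(B, A)  # should print roughly 152.133 and 949.435, matching w above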
Let's draw a graph based on the $\boldsymbol{w}$ calculated with two explanatory variables (note that the one-variable cell above overwrote `X` and `w`, so recompute the two-variable `w` first if necessary).
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# Grid covering the range of the two explanatory variables
mesh_x1 = np.arange(x1.min(), x1.max(), (x1.max()-x1.min())/20)
mesh_x2 = np.arange(x2.min(), x2.max(), (x2.max()-x2.min())/20)
mesh_x1, mesh_x2 = np.meshgrid(mesh_x1, mesh_x2)

x1 = df['bmi'].values
x2 = df['s5'].values
y = diabetes.target
ax.scatter3D(x1, x2, y)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")

# Regression plane: y = w0 + w1*x1 + w2*x2
mesh_y = w[1] * mesh_x1 + w[2] * mesh_x2 + w[0]
ax.plot_wireframe(mesh_x1, mesh_x2, mesh_y, color='red')
plt.show()
The result is shown in the figure below. Just by looking at it, I can't really tell whether the fit is any good (lol).
Let's evaluate how well the plane fits by using the coefficient of determination. To obtain the coefficient of determination $R^2$, we need the "total variation" and the "regression variation".
The coefficient of determination expresses how much of the objective variable is explained by the explanatory variables, in other words "how large the regression variation is relative to the total variation". Using the sums of squares of the total variation and the regression variation,
R^2=\frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}

Since the total variation is the sum of the regression variation and the residual variation (the squared differences between measured and predicted values), this can also be written as

R^2=1-\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}
Let's write this in Python and calculate the coefficient of determination.
# X and w are the two-explanatory-variable versions computed above
u = ((y - (X @ w))**2).sum()   # residual sum of squares
v = ((y - y.mean())**2).sum()  # total sum of squares
R2 = 1 - u/v
print(R2)
0.4594852440167805
So the coefficient of determination is about 0.46.
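As a quick sanity check (my own addition), scikit-learn's `r2_score` should give the same number for these predictions:

from sklearn.metrics import r2_score

y_pred = X @ w              # X, w: the two-variable design matrix and weights from above
print(r2_score(y, y_pred))  # should print roughly 0.4595, the same value as above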
By the way, this example uses the "bmi" and "s5 (ltg)" values, but as the number of variables grows, features on the order of $10^5$ can end up mixed with features on the order of $10^{-5}$. When that happens, the calculation may no longer work well. Rescaling the data to comparable ranges while preserving the information in the original data is called normalization.
Min-Max scaling maps the minimum value to 0 and the maximum value to 1. That is, it computes

x' = \frac{x - x_{min}}{x_{max} - x_{min}}

Standardization transforms the data so that the mean becomes 0 and the variance becomes 1. That is, it computes

x' = \frac{x - \mu}{\sigma}
It is written in detail on the following page.
I tried calculating this in Python, but it made no difference here, because the scikit-learn diabetes data seems to have already been normalized.
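For reference, here is a minimal sketch of both transforms using scikit-learn's MinMaxScaler and StandardScaler on a small made-up array (made-up because, as noted, the diabetes features are already scaled):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0e5, 1.0e-5],
                 [2.0e5, 3.0e-5],
                 [4.0e5, 9.0e-5]])  # two columns on wildly different scales

print(MinMaxScaler().fit_transform(data))    # each column mapped to [0, 1]
print(StandardScaler().fit_transform(data))  # each column to mean 0, variance 1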
To do multiple regression with scikit-learn, just use LinearRegression and fit it to the training data.
from sklearn import linear_model

# Fit a linear regression model on the two explanatory variables
clf = linear_model.LinearRegression()
clf.fit(df[['bmi', 's5']], diabetes.target)

print("coef: ", clf.coef_)
print("intercept: ", clf.intercept_)
print("score: ", clf.score(df[['bmi', 's5']], diabetes.target))
coef: [675.06977443 614.95050478]
intercept: 152.1334841628967
score: 0.45948524401678054
That's all. The coefficients, the intercept, and the score ($R^2$) all match the results computed without scikit-learn.
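Predicting for new inputs is then just a call to `predict` (the bmi and s5 values below are arbitrary made-up numbers on the dataset's already-normalized scale):

import pandas as pd

# clf is the model fitted above; the feature values are arbitrary examples
new_x = pd.DataFrame([[0.05, 0.02]], columns=['bmi', 's5'])
print(clf.predict(new_x))  # predicted disease progression for this (bmi, s5) pair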
We have moved up from simple regression to multiple regression. The key result was the normal equation

\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}

and with it we now understand linear approximation with any number of explanatory variables.