I will go through, step by step, the theory, an implementation in Python, and an analysis with scikit-learn for each of the algorithms previously taken up in "Classification of Machine Learning". I'm writing this for my own learning, so please forgive any mistakes.
This time I'd like to build on simple regression analysis and try "multiple regression analysis". I referred to the following page.
In simple regression analysis, we found the coefficients $A$ and $B$ needed to draw the approximate straight line $y = Ax + B$. In multiple regression there are several explanatory variables, so the equation of the line becomes

y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n

From here on this is almost a copy of the article I referred to, but I will try to write it as plainly as I can.
Writing the equation of the line in matrix form,

y = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \boldsymbol{w}^T\boldsymbol{x}

(where $x_0 = 1$).
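(A supplementary note in my own notation, not from the referenced article: stacking all $N$ observations, each row carrying a leading $x_0 = 1$, the predictions for the whole dataset can be written as $\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{w}$, which is the form used in the derivation below.)

\hat{\boldsymbol{y}} = \boldsymbol{X}\boldsymbol{w} =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1n} \\
1 & x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{Nn}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}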
To determine $\boldsymbol{w}$ we minimize the sum of squared errors $L$ between the measured values $y_i$ and the predicted values $\hat{y}_i$:

L = \sum_{i=1}^{N}(y_i-\hat{y}_i)^2 \\
= (\boldsymbol{y}-\hat{\boldsymbol{y}})^{T}(\boldsymbol{y}-\hat{\boldsymbol{y}}) \\
= (\boldsymbol{y}-\boldsymbol{Xw})^{T}(\boldsymbol{y}-\boldsymbol{Xw}) \\
= (\boldsymbol{y}^{T}-(\boldsymbol{Xw})^{T})(\boldsymbol{y}-\boldsymbol{Xw}) \\
= (\boldsymbol{y}^{T}-\boldsymbol{w}^{T}\boldsymbol{X}^{T})(\boldsymbol{y}-\boldsymbol{Xw}) \\
= \boldsymbol{y}^{T}\boldsymbol{y}-\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{w}-\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{y}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} \\
= \boldsymbol{y}^{T}\boldsymbol{y}-2\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{w}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}
The parts that do not contain $\boldsymbol{w}$ are constants, so putting $\boldsymbol{X}^{T}\boldsymbol{X} = A$, $-2\boldsymbol{y}^{T}\boldsymbol{X} = B$ and $\boldsymbol{y}^{T}\boldsymbol{y} = C$, the sum of squared errors $L$ becomes

L = C + B\boldsymbol{w} + \boldsymbol{w}^{T}A\boldsymbol{w}

Differentiating $L$ with respect to $\boldsymbol{w}$,
\begin{split}\begin{aligned}
\frac{\partial}{\partial {\boldsymbol{w}}} L
&= \frac{\partial}{\partial {\boldsymbol{w}}} (C + B\boldsymbol{w} + \boldsymbol{w}^T{A}\boldsymbol{w}) \\
&= \frac{\partial}{\partial {\boldsymbol{w}}} (C) + \frac{\partial}{\partial {\boldsymbol{w}}} ({B}{\boldsymbol{w}}) + \frac{\partial}{\partial {\boldsymbol{w}}} ({\boldsymbol{w}}^{T}{A}{\boldsymbol{w}}) \\
&= {B} + \boldsymbol{w}^{T}({A} + {A}^{T})
\end{aligned}\end{split}
We want this to be 0 (that is where $L$ is minimized), so
\boldsymbol{w}^T(A+A^T)=-B \\
\boldsymbol{w}^T(\boldsymbol{X}^{T}\boldsymbol{X}+(\boldsymbol{X}^{T}\boldsymbol{X})^T)=2\boldsymbol{y}^{T}\boldsymbol{X} \\
\boldsymbol{w}^T\boldsymbol{X}^{T}\boldsymbol{X}=\boldsymbol{y}^T\boldsymbol{X} \\
\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^T\boldsymbol{y} \\
This is a system of linear equations (simultaneous equations), and it cannot be solved unless $\boldsymbol{X}^{T}\boldsymbol{X}$ is regular (invertible). It is not regular when some of the explanatory variables are strongly correlated, that is, when one data column can be explained by another. This state is called **multicollinearity**.
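As a quick illustration (a minimal sketch with made-up numbers, not the diabetes data): if one column of $\boldsymbol{X}$ is an exact multiple of another, $\boldsymbol{X}^{T}\boldsymbol{X}$ becomes singular and the inverse does not exist.

import numpy as np

# Toy design matrix: the third column is exactly 2x the second,
# so the columns are linearly dependent (perfect multicollinearity)
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 4.0, 8.0]])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3, so X^T X is not invertible
# np.linalg.inv(XtX) would raise LinAlgError: Singular matrix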
Assuming it is regular
(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \\
\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}
Now we have $\boldsymbol{w}$.
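As a side note (my own addition, not from the referenced article): in practice it is usually better not to invert $\boldsymbol{X}^{T}\boldsymbol{X}$ explicitly; NumPy's `np.linalg.solve` or `np.linalg.lstsq` solve the same normal equation more stably. A minimal sketch on random made-up data:

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept column + 2 features
y = rng.normal(size=50)

w_inv = np.linalg.inv(X.T @ X) @ X.T @ y        # normal equation with an explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)     # same equation, no explicit inverse
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares directly on X

print(np.allclose(w_inv, w_solve), np.allclose(w_inv, w_lstsq))  # True True

In this article I will still use the inverse directly, since it matches the formula above.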
For the data we use scikit-learn's diabetes dataset. Let's see how the target (disease progression after one year) is related to the BMI and s5 (ltg: lamotrigine) features.
First, let's plot the data. With two explanatory variables plus the target, the data is three-dimensional; beyond three dimensions we would not be able to draw a graph at all.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
%matplotlib inline

# Load the diabetes dataset and put the features into a DataFrame
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# 3D scatter plot of the two explanatory variables against the target
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches axes in recent matplotlib
x1 = df['bmi']
x2 = df['s5']
y = diabetes.target
ax.scatter3D(x1, x2, y)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")
plt.show()
The result is the graph below, which looks like a sloping hillside.
The formula for finding $\boldsymbol{w}$ was the following; it seems to be called the normal equation.

\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}

Writing this in Python:
# Design matrix: a column of ones (for the intercept) plus the bmi and s5 columns
X = pd.concat([pd.Series(np.ones(len(df['bmi']))), df.loc[:,['bmi','s5']]], axis=1, ignore_index=True).values
y = diabetes.target

# Normal equation: w = (X^T X)^(-1) X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)

Result:
[152.13348416 675.06977443 614.95050478]
Matrix calculations in Python are intuitive and pleasant. Incidentally, if there is only one explanatory variable, the result is the same as in simple regression, which is natural since the formula is just the generalization to $n$ explanatory variables.
# Same calculation with only one explanatory variable (bmi)
X = pd.concat([pd.Series(np.ones(len(df['bmi']))), df.loc[:,['bmi']]], axis=1, ignore_index=True).values
y = diabetes.target
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)
[152.13348416 949.43526038]
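As a cross-check (my own sketch, not part of the referenced article), these values should match the usual simple regression formulas, slope $A=\mathrm{Cov}(x,y)/\mathrm{Var}(x)$ and intercept $B=\bar{y}-A\bar{x}$:

# df and diabetes are the objects loaded earlier in this article
x = df['bmi'].values
y = diabetes.target

A = np.cov(x, y, ddof=0)[0, 1] / np.var(x)  # slope: covariance / variance
B = y.mean() - A * x.mean()                 # intercept
print(B, A)  # should print roughly 152.133 and 949.435, matching w above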
Let's draw a graph based on the $\boldsymbol{w}$ calculated with two explanatory variables (note that the one-variable cell above overwrote `X` and `w`, so recompute the two-variable `w` first if necessary).
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# Grid covering the range of the two explanatory variables
mesh_x1 = np.arange(x1.min(), x1.max(), (x1.max()-x1.min())/20)
mesh_x2 = np.arange(x2.min(), x2.max(), (x2.max()-x2.min())/20)
mesh_x1, mesh_x2 = np.meshgrid(mesh_x1, mesh_x2)

x1 = df['bmi'].values
x2 = df['s5'].values
y = diabetes.target
ax.scatter3D(x1, x2, y)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")

# Regression plane: y = w0 + w1*x1 + w2*x2
mesh_y = w[1] * mesh_x1 + w[2] * mesh_x2 + w[0]
ax.plot_wireframe(mesh_x1, mesh_x2, mesh_y, color='red')
plt.show()
The result is shown in the figure below. Just by looking at it, I can't really tell whether the fit is any good (lol).
Let's evaluate how well the plane fits by using the coefficient of determination. To obtain the coefficient of determination $R^2$, we need the "total variation" and the "regression variation".
The coefficient of determination expresses how much of the objective variable is explained by the explanatory variables, in other words "how large the regression variation is relative to the total variation". Using the sums of squares of the total variation and the regression variation,
R^2=\frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}

Since the total variation is the sum of the regression variation and the residual variation (the squared differences between measured and predicted values), this can also be written as

R^2=1-\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}
Let's write this in Python and calculate the coefficient of determination.
# X and w are the two-explanatory-variable versions computed above
u = ((y - (X @ w))**2).sum()   # residual sum of squares
v = ((y - y.mean())**2).sum()  # total sum of squares
R2 = 1 - u/v
print(R2)
0.4594852440167805
So the coefficient of determination is about 0.46.
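As a quick sanity check (my own addition), scikit-learn's `r2_score` should give the same number for these predictions:

from sklearn.metrics import r2_score

y_pred = X @ w              # X, w: the two-variable design matrix and weights from above
print(r2_score(y, y_pred))  # should print roughly 0.4595, the same value as above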
By the way, this example uses the "bmi" and "s5 (ltg)" values, but as the number of variables grows, features on the order of $10^5$ can end up mixed with features on the order of $10^{-5}$. When that happens, the calculation may no longer work well. Rescaling the data to comparable ranges while preserving the information in the original data is called normalization.
Min-Max scaling maps the minimum value to 0 and the maximum value to 1. That is, it computes

x' = \frac{x - x_{min}}{x_{max} - x_{min}}

Standardization transforms the data so that the mean becomes 0 and the variance becomes 1. That is, it computes

x' = \frac{x - \mu}{\sigma}
It is written in detail on the following page.
I tried calculating this in Python, but it made no difference here, because the scikit-learn diabetes data seems to have already been normalized.
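For reference, here is a minimal sketch of both transforms using scikit-learn's MinMaxScaler and StandardScaler on a small made-up array (made-up because, as noted, the diabetes features are already scaled):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0e5, 1.0e-5],
                 [2.0e5, 3.0e-5],
                 [4.0e5, 9.0e-5]])  # two columns on wildly different scales

print(MinMaxScaler().fit_transform(data))    # each column mapped to [0, 1]
print(StandardScaler().fit_transform(data))  # each column to mean 0, variance 1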
To do multiple regression with scikit-learn, just use LinearRegression and fit it to the training data.
from sklearn import linear_model

# Fit a linear regression model on the two explanatory variables
clf = linear_model.LinearRegression()
clf.fit(df[['bmi', 's5']], diabetes.target)

print("coef: ", clf.coef_)
print("intercept: ", clf.intercept_)
print("score: ", clf.score(df[['bmi', 's5']], diabetes.target))
coef: [675.06977443 614.95050478]
intercept: 152.1334841628967
score: 0.45948524401678054
That's all. The coefficients, the intercept, and the score ($R^2$) all match the results computed without scikit-learn.
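Predicting for new inputs is then just a call to `predict` (the bmi and s5 values below are arbitrary made-up numbers on the dataset's already-normalized scale):

import pandas as pd

# clf is the model fitted above; the feature values are arbitrary examples
new_x = pd.DataFrame([[0.05, 0.02]], columns=['bmi', 's5'])
print(clf.predict(new_x))  # predicted disease progression for this (bmi, s5) pair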
We have moved up from simple regression to multiple regression. The key result was the normal equation

\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}

and with it we now understand linear approximation with any number of explanatory variables.