Machine learning algorithm (multiple regression analysis)

Introduction

Following up on the algorithms covered in "Classification of Machine Learning", I will work through the theory step by step, implement it in Python, and then run the same analysis with scikit-learn. I'm writing this for my own learning, so please forgive any mistakes.

This time I would like to extend simple regression analysis and do "multiple regression analysis". I referred to the following page.

Basic

In simple regression analysis, we found $A$ and $B$ so that the straight line $y = Ax + B$ approximates $N$ points $(x, y)$ on the plane. Specifically, we chose $A$ and $B$ to minimize the sum of squared differences between the line and the $i$-th point, $\sum_{i=1}^{N}(y_i-(Ax_i+B))^2$. Multiple regression finds the coefficients when the single explanatory variable of simple regression is increased to several.

In other words, if the equation of the line is $y = w_0x_0 + w_1x_1 + \cdots + w_nx_n$ (with $x_0 = 1$), then what we want to compute is $(w_0, w_1, \cdots, w_n)$.
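For intuition, with the convention $x_0 = 1$, a prediction is just the inner product of the weight vector and the feature vector. A minimal sketch with made-up numbers (nothing here comes from the diabetes data used later):

import numpy as np

# hypothetical weights: w0 acts as the intercept because x0 = 1
w = np.array([2.0, 0.5, -1.0])
# one sample with x0 = 1 prepended to its two explanatory variables
x = np.array([1.0, 3.0, 4.0])

y_pred = w @ x  # 2.0 + 0.5*3.0 - 1.0*4.0 = -0.5
print(y_pred)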

How to solve multiple regression

From here on this will read almost like a copy of the article I referred to, but I will try to write it as plainly as possible.

Writing the equation of the line in matrix form,

y = \begin{bmatrix} x_0 & x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}

(with $x_0 = 1$). The quantity to minimize is the sum of squared differences between the measured values and the predictions $\hat{y}$, that is $\sum_{i=1}^{N}(y_i-\hat{y}_i)^2$, so we transform this expression. Below, $(w_0, w_1, \cdots, w_n)$ is written as the vector $\boldsymbol{w}$, and all the explanatory variables are collected into the matrix $\boldsymbol{X}$.

\sum_{i=1}^{N}(y_i-\hat{y}_i)^2 \\
= (\boldsymbol{y}-\hat{\boldsymbol{y}})^{T}(\boldsymbol{y}-\hat{\boldsymbol{y}}) \\
= (\boldsymbol{y}-\boldsymbol{Xw})^{T}(\boldsymbol{y}-\boldsymbol{Xw}) \\
= (\boldsymbol{y}^{T}-(\boldsymbol{Xw})^{T})(\boldsymbol{y}-\boldsymbol{Xw}) \\
= (\boldsymbol{y}^{T}-\boldsymbol{w}^{T}\boldsymbol{X}^{T})(\boldsymbol{y}-\boldsymbol{Xw}) \\
= \boldsymbol{y}^{T}\boldsymbol{y}-\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{w}-\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{y}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} \\
= \boldsymbol{y}^{T}\boldsymbol{y}-2\boldsymbol{y}^{T}\boldsymbol{X}\boldsymbol{w}+\boldsymbol{w}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}

The parts that do not contain $\boldsymbol{w}$ are constants with respect to it, so writing $\boldsymbol{X}^{T}\boldsymbol{X} = A$, $-2\boldsymbol{y}^{T}\boldsymbol{X} = B$ and $\boldsymbol{y}^{T}\boldsymbol{y} = C$, the sum of squared errors $L$ becomes $L = C + B\boldsymbol{w} + \boldsymbol{w}^{T}A\boldsymbol{w}$. Since $L$ is a quadratic function of $\boldsymbol{w}$, the $\boldsymbol{w}$ that minimizes $L$ can be found by taking the partial derivative of $L$ with respect to $\boldsymbol{w}$ and setting it to 0.

\begin{split}\begin{aligned}
\frac{\partial}{\partial \boldsymbol{w}} L
&= \frac{\partial}{\partial \boldsymbol{w}} (C + B\boldsymbol{w} + \boldsymbol{w}^{T}A\boldsymbol{w}) \\
&= \frac{\partial}{\partial \boldsymbol{w}} (C) + \frac{\partial}{\partial \boldsymbol{w}} (B\boldsymbol{w}) + \frac{\partial}{\partial \boldsymbol{w}} (\boldsymbol{w}^{T}A\boldsymbol{w}) \\
&= B + \boldsymbol{w}^{T}(A + A^{T})
\end{aligned}\end{split}
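The two matrix-derivative identities used in the last step are standard ones (with the convention used here that the derivative with respect to $\boldsymbol{w}$ is written as a row vector): for a constant row vector $B$ and a constant square matrix $A$,

\frac{\partial}{\partial \boldsymbol{w}}(B\boldsymbol{w}) = B, \qquad
\frac{\partial}{\partial \boldsymbol{w}}(\boldsymbol{w}^{T}A\boldsymbol{w}) = \boldsymbol{w}^{T}(A + A^{T})

When $A$ is symmetric, as $\boldsymbol{X}^{T}\boldsymbol{X}$ is, the second one reduces to $2\boldsymbol{w}^{T}A$.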

I want this to be 0, so

\boldsymbol{w}^T(A+A^T)=-B \\
\boldsymbol{w}^T(\boldsymbol{X}^{T}\boldsymbol{X}+(\boldsymbol{X}^{T}\boldsymbol{X})^T)=2\boldsymbol{y}^{T}\boldsymbol{X} \\
\boldsymbol{w}^T\boldsymbol{X}^{T}\boldsymbol{X}=\boldsymbol{y}^T\boldsymbol{X} \\
\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^T\boldsymbol{y} \\

This is a system of linear equations, and it can only be solved if $\boldsymbol{X}^{T}\boldsymbol{X}$ is regular (invertible). It is not regular when some explanatory variables are exactly linearly dependent, that is, when one data column can be explained by the others; when they are merely strongly correlated, the matrix becomes nearly singular and the solution unstable. This situation is called **multicollinearity**.
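To see why, here is a hypothetical toy example (not the diabetes data): if one column of $\boldsymbol{X}$ is an exact copy of another, $\boldsymbol{X}^{T}\boldsymbol{X}$ becomes singular and the inversion fails.

import numpy as np

# toy design matrix: intercept column, x1, and x2 which is an exact copy of x1
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0],
              [1.0, 7.0, 7.0]])

print(np.linalg.matrix_rank(X.T @ X))  # 2, not 3: the matrix is rank-deficient
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as e:
    print("inversion failed:", e)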

Assuming it is regular

(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y} \\
\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}

Now we have $\boldsymbol{w}$.
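Before trying it on real data, here is a minimal sanity check on synthetic data generated from known weights (all values below are made up):

import numpy as np

np.random.seed(0)
n = 100
x1 = np.random.randn(n)
x2 = np.random.randn(n)
X = np.column_stack([np.ones(n), x1, x2])        # prepend the x0 = 1 column
true_w = np.array([3.0, 2.0, -1.0])
y = X @ true_w + 0.1 * np.random.randn(n)        # add a little noise

w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)  # should come out close to [3, 2, -1]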

Try to implement it straightforwardly in Python

The data is scikit-learn's diabetes dataset. Let's look at how the target (disease progression after one year) is related to the bmi and s5 (ltg) features.

First look at the data

First, let's plot the data. With two explanatory variables plus the target, the data is three-dimensional; beyond 3D we could not draw a graph anyway.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

%matplotlib inline

# load the diabetes dataset bundled with scikit-learn
diabetes = datasets.load_diabetes()

# put the explanatory variables into a DataFrame for easy column access
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# 3D scatter plot of the two explanatory variables against the target
fig = plt.figure()
ax = Axes3D(fig)

x1 = df['bmi']
x2 = df['s5']
y = diabetes.target

ax.scatter3D(x1, x2, y)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")

plt.show()

The result is the following graph, which looks like a slope.

regression_multi_1.png

Try to calculate

The formula for finding $\boldsymbol{w}$ was the following, which is apparently called the normal equation: $\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$. Let's code it exactly as it is.

# build the design matrix: a column of ones (x0 = 1) followed by bmi and s5
X = pd.concat([pd.Series(np.ones(len(df['bmi']))), df.loc[:,['bmi','s5']]], axis=1, ignore_index=True).values
y = diabetes.target

# normal equation
w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)

[152.13348416 675.06977443 614.95050478]
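Explicitly inverting $\boldsymbol{X}^{T}\boldsymbol{X}$ works fine here, but as a side note, solving the linear system directly (or using a least-squares solver) is generally more numerically stable. A small alternative sketch using the same X and y:

# solve (X^T X) w = X^T y without forming the explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# or let numpy solve the least-squares problem directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve)
print(w_lstsq)  # both should match the result above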

Matrix calculations in Python are intuitive and pleasant to write. By the way, with only one explanatory variable the result is the same as simple regression, which is natural since we simply generalized to $n$ explanatory variables.

X = pd.concat([pd.Series(np.ones(len(df['bmi']))), df.loc[:,['bmi']]], axis=1, ignore_index=True).values
y = diabetes.target

w = np.linalg.inv(X.T @ X) @ X.T @ y
print(w)

[152.13348416 949.43526038]
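As a quick cross-check, the single-variable case can also be computed with the familiar simple-regression formulas (slope = covariance of x and y divided by the variance of x); this should reproduce the same two numbers:

x = df['bmi'].values
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
intercept = y.mean() - slope * x.mean()
print(intercept, slope)  # should match the normal-equation result above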

Let's draw a graph based on the calculated values when there are two explanatory variables.

fig = plt.figure()
ax = Axes3D(fig)

# grid covering the range of the two explanatory variables, used to draw the plane
mesh_x1 = np.arange(x1.min(), x1.max(), (x1.max()-x1.min())/20)
mesh_x2 = np.arange(x2.min(), x2.max(), (x2.max()-x2.min())/20)
mesh_x1, mesh_x2 = np.meshgrid(mesh_x1, mesh_x2)

x1 = df['bmi'].values
x2 = df['s5'].values
y = diabetes.target
ax.scatter3D(x1, x2, y)
ax.set_xlabel("x1")
ax.set_ylabel("x2")
ax.set_zlabel("y")

# regression plane from the two-variable weights w obtained with the normal equation
# (if you ran the single-variable example above, recompute the two-variable w first)
mesh_y = w[1] * mesh_x1 + w[2] * mesh_x2 + w[0]
ax.plot_wireframe(mesh_x1, mesh_x2, mesh_y, color='red')

plt.show()

The result is shown in the figure below.

regression_multi_2.png

Evaluation

Let's evaluate how well the plane fits using the coefficient of determination. To obtain the coefficient of determination $R^2$, we need the "total variation" and the "regression variation".

The coefficient of determination expresses how much of the objective variable the explanatory variables explain, in other words "how large the regression variation is relative to the total variation". Using the sums of squares of the total variation and the regression variation,

R^2=\frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}

Since the total variation is the sum of the regression variation and the residual variation (the squared differences between predicted and measured values), this can also be written as

R^2=1-\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}

Write this in python and calculate the coefficient of determination.

# u: residual sum of squares, v: total sum of squares
# (X and w are the two-explanatory-variable versions computed above)
u = ((y-(X @ w))**2).sum()
v = ((y-y.mean())**2).sum()

R2 = 1-u/v
print(R2)

0.4594852440167805

So the coefficient of determination came out to about 0.46.
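As a sanity check, scikit-learn's r2_score should return the same value when given the predictions $\boldsymbol{X}\boldsymbol{w}$ (a small sketch, using the two-variable X and w from above):

from sklearn.metrics import r2_score

# compare the hand-computed R^2 with scikit-learn's implementation
print(r2_score(y, X @ w))  # should also be about 0.459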

Normalization, standardization

By the way, this example uses the "bmi" and "ltg" values, but as the number of variables grows, numbers on the order of $10^5$ and data on the order of $10^{-5}$ may end up mixed together. If that happens, the calculation may become unstable. Rescaling the data to a common range while preserving the original relationships is called normalization.

Min-Max scaling

Min-Max scaling maps the minimum value to 0 and the maximum value to 1. That is, it computes $x_{i_{new}} = \frac{x_i - x_{min}}{x_{max} - x_{min}}$.

Standardization

Standardization transforms the data so that the mean is 0 and the variance is 1. That is, it computes $x_{i_{new}} = \frac{x_i - \bar{x}}{\sigma}$.

It is written in detail on the following page.

Try to standardize with python

I tried it in Python, but it made little difference because scikit-learn's diabetes data appears to be already centered and scaled.
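A quick check supports that. According to the scikit-learn documentation, each feature column of the diabetes data has been mean-centered and scaled so that its squared values sum to 1, which the following sketch confirms:

# each feature column should have mean ~0 and squared values summing to ~1
print(df[['bmi', 's5']].mean())
print((df[['bmi', 's5']]**2).sum())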

Try to calculate with scikit-learn

For multiple regression, you just use scikit-learn's LinearRegression and fit it with the training data.

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(df[['bmi', 's5']], diabetes.target)

print("coef: ", clf.coef_)
print("intercept: ", clf.intercept_)
print("score: ", clf.score(df[['bmi', 's5']], diabetes.target))

coef:  [675.06977443 614.95050478]
intercept:  152.1334841628967
score:  0.45948524401678054

That's all there is to it. The result matches what we calculated without scikit-learn.
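Once fitted, the model can be used for prediction in the usual way. For example, with a hypothetical (bmi, s5) pair (the values below are made up):

# predict disease progression for a made-up (bmi, s5) sample
x_new = pd.DataFrame([[0.05, 0.02]], columns=['bmi', 's5'])
print(clf.predict(x_new))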

Summary

We extended simple regression to multiple regression, and the normal equation $\boldsymbol{w}=(\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{y}$ let us handle multiple explanatory variables. Because multiple explanatory variables can have very different scales, techniques such as standardization are needed to handle them.

Now you understand the linear approximation.
