This is a step-by-step look at the theory, a Python implementation, and an analysis with scikit-learn of an algorithm previously covered in "Classification of Machine Learning". I am writing this for my own learning, so please forgive any mistakes.
This time the topic is basic "simple regression analysis". I referred to the following page.
A straight line on the plane spanned by the $x$ axis and the $y$ axis can be written as

$$y = Ax + B$$

where $A$ is the slope and $B$ is the intercept.
Python's scikit-learn ships with several datasets for testing. This time we will use the diabetes dataset. You can run the code in Google Colaboratory.
First, look at the test data.
A detailed explanation can be found in the API documentation: the dataset provides 10 features (explanatory variables) and a target value representing disease progression one year after baseline.
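To get a feel for that structure before plotting, here is a quick peek of my own (the main code below loads the data again):

from sklearn import datasets
diabetes = datasets.load_diabetes()
print(diabetes.feature_names)    # the 10 explanatory variables, including 'bmi'
print(diabetes.data.shape)       # (number of samples, 10)
print(diabetes.target[:5])       # target: progression one year after baseline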
Let's draw a scatter plot to see how BMI, one of the 10 features, relates to the target. I will touch on why I chose BMI at the end.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
x = df['bmi']          # explanatory variable: BMI
y = diabetes.target    # target: disease progression after one year
plt.scatter(x, y)
The horizontal axis is BMI and the vertical axis is the disease progression. Looking at the figure, it seems we can draw an upward-sloping straight line through the points.
Given $N$ pairs $(x_i, y_i)$, the parameters $A$ and $B$ of a well-fitting straight line $y = Ax + B$ are the ones that minimize the sum of squared differences between the line and each point $(x_i, y_i)$. In other words, find the $A$ and $B$ that minimize

$$L(A, B) = \sum_{i=1}^{N} \left( y_i - (A x_i + B) \right)^2.$$
Specifically, partially differentiate the expression above with respect to $A$ and $B$ and solve the resulting simultaneous equations. I omit the derivation here, but it is well worth working through with paper and pencil. Writing $\sum_{i=1}^{N} x_i = N\bar{x}$ and $\sum_{i=1}^{N} y_i = N\bar{y}$, the solution is

$$A = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad B = \bar{y} - A\bar{x}.$$
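As a sanity check, here is a minimal sketch that computes $A$ and $B$ directly from these sums, using the x and y defined earlier (the names A_direct and B_direct are my own):

x_arr = np.asarray(x)   # BMI column as a plain array
y_arr = np.asarray(y)
x_bar, y_bar = x_arr.mean(), y_arr.mean()
A_direct = np.sum((x_arr - x_bar) * (y_arr - y_bar)) / np.sum((x_arr - x_bar) ** 2)
B_direct = y_bar - A_direct * x_bar
print(A_direct, B_direct)   # should agree with the values computed below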
You could code $A$ and $B$ directly from these formulas, as in the sketch above, but NumPy already provides convenient functions, so let's use those. Apart from a common normalization factor ($1/N$ or $1/(N-1)$, which cancels in the ratio), the denominator of $A$ is the variance of the $x$ column and the numerator is the covariance of the $x$ and $y$ columns.
S_xx = np.var(x, ddof=1)                # unbiased variance of x
S_xy = np.cov(np.array([x, y]))[0][1]   # covariance of x and y (np.cov divides by N - 1 by default)
A = S_xy / S_xx
B = np.mean(y) - A * np.mean(x)
print("S_xx: ", S_xx)
print("S_xy: ", S_xy)
print("A: ", A)
print("B: ", B)
The result is as follows. Note that "variance" comes in two flavors, the sample variance and the unbiased variance; since scikit-learn, which we will use later, works with the unbiased variance, the calculation above uses the unbiased variance (ddof=1) as well. I plan to write about the difference between the sample variance and the unbiased variance separately (a quick illustration appears after the output below).
S_xx: 0.0022675736961455507
S_xy: 2.1529144226397467
A: 949.43526038395
B: 152.1334841628967
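Until then, here is a quick numerical illustration of the difference, using the same x column as above (this aside is my own):

x_np = np.asarray(x)
n = len(x_np)
print(np.var(x_np))                # sample variance: divides by n
print(np.var(x_np, ddof=1))        # unbiased variance: divides by n - 1
print(np.var(x_np) * n / (n - 1))  # converting one into the other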
Strictly speaking, np.cov(np.array([x, y]))[0][0] is already the variance of $x$, so there is no need to compute it separately, but I did it as above for clarity. Now let's draw the obtained straight line on top of the scatter plot.
plt.scatter(df['bmi'], diabetes.target)
plt.plot(df['bmi'], A*df['bmi']+B, color='red')   # fitted line y = Ax + B
Looking at the resulting graph, we can see that a reasonably good straight line has been drawn.
Doing the same thing with scikit-learn is much easier. You can use it without knowing what is going on inside, but I think the results sink in much better if you use it after understanding the theory.
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(x.to_frame(), y)
That's all it takes. The first argument of the fit method must be two-dimensional (n_samples, n_features), so the pandas Series is converted to a DataFrame with to_frame() ([Reference](https://medium.com/@yamasaKit/scikit-learn%E3%81%A7%E5%8D%98%E5%9B%9E%E5%B8%B0%E5%88%86%E6%9E%90%E3%82%92%E8%A1%8C%E3%81%86%E6%96%B9%E6%B3%95-f6baa2cb761e)).
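As an aside, passing a 2-D NumPy array also works; here is a minimal alternative sketch using the same x and y (model_lr2 is just my own name):

X = x.values.reshape(-1, 1)   # reshape the Series into an (n_samples, 1) array
model_lr2 = LinearRegression()
model_lr2.fit(X, y)
print(model_lr2.coef_[0], model_lr2.intercept_)   # same slope and intercept as above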
The slope and the intercept are stored in coef_ and intercept_ respectively (see the [API documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)). Let's compare them with the earlier results.
print("coef_: ", model_lr.coef_[0])
print("intercept: ", model_lr.intercept_)
coef_: 949.4352603839491
intercept: 152.1334841628967
You got the same result.
The correlation coefficient $r$ indicates how strongly two variables are related to each other and takes a value between $-1$ and $1$. It is the covariance of $x$ and $y$ divided by the product of their standard deviations,

$$r = \frac{S_{xy}}{S_x S_y},$$

and can be computed with NumPy's corrcoef function.
r = S_xy / (x.std(ddof=1) * y.std(ddof=1))   # covariance divided by the two standard deviations
rr = np.corrcoef(x, y)[0][1]                 # the same value via np.corrcoef
print(r)
print(rr)
0.5864501344746891
0.5864501344746891
This also gives the same value. The larger the absolute value of $r$, the stronger the relationship between the two variables.
The coefficient of determination is an index of how well the obtained line fits the actual data; the closer it is to 1, the better the line reproduces the original data.
The coefficient of determination can be computed from the total variation and the residual variation, and for simple regression it is equal to the square of the correlation coefficient. For details, see here.
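To make that concrete, here is a small check of my own that computes $R^2 = 1 - (\text{residual variation} / \text{total variation})$ directly, assuming the A, B, x, and y from earlier:

y_hat = A * np.asarray(x) + B            # predictions from the fitted line
ss_res = np.sum((y - y_hat) ** 2)        # residual variation
ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation
print(1 - ss_res / ss_tot)               # should match r**2 and the score below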
The coefficient of determination is obtained by the score method of the LinearRegression class.
R = model_lr.score(x.to_frame(), y)
print("R: ", R)
print("r^2: ", r**2)
R: 0.3439237602253803
r^2: 0.3439237602253809
They agree, up to floating-point rounding.
For simple regression analysis, I tried a Python implementation while going over the theory. I hope this shows how to draw a regression line and how to measure how well the obtained line represents the original data. By the way, I chose BMI as the explanatory variable because it has the highest correlation coefficient with the target. I would like to write about how to check that in a future post.
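As a rough sketch of one possible way to check this (my own approach, not necessarily what the future post will use), the correlation of each feature column with the target can be computed like this:

target = pd.Series(diabetes.target, index=df.index)
print(df.corrwith(target).sort_values(ascending=False))   # 'bmi' should come out on top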