Machine learning algorithm (simple regression analysis)

Introduction

Continuing from the algorithms taken up in "Classification of Machine Learning", I will work step by step through the theory, an implementation in Python, and an analysis using scikit-learn. I'm writing this for my own learning, so please overlook any mistakes.

This time the topic is the basic "simple regression analysis".

Basics

A straight line in the plane formed by the $ x $ axis and the $ y $ axis is written as $ y = Ax + B $, where $ A $ is the slope and $ B $ is the intercept. Simple regression means finding the $ A $ and $ B $ that draw a nice straight line through many combinations of $ x $ and $ y $. A human can eyeball a line and say "something like this?", but the point here is to have a computer draw it.

Theme

Python's scikit-learn comes with several datasets for testing. This time we will use diabetes (the diabetes dataset) from among them. You can try the code in Google Colaboratory.

Preparation

First, look at the test data.

A detailed explanation can be found in the API documentation; in short, 10 feature variables and a target (disease progression one year later) are provided.
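
To take a first look, the following sketch prints the dataset's description and shapes (load_diabetes exposes DESCR, feature_names, data, and target):

from sklearn import datasets

diabetes = datasets.load_diabetes()

print(diabetes.DESCR)            # full description of the dataset
print(diabetes.feature_names)    # the 10 feature names, including 'bmi'
print(diabetes.data.shape)       # (442, 10): samples x features
print(diabetes.target.shape)     # (442,): progression one year after baseline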

Let's draw a scatter plot to see how the BMI feature, one of the 10, relates to the target. I will touch on why I chose BMI at the end.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the diabetes dataset and put the 10 features into a DataFrame.
diabetes = datasets.load_diabetes()

df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Scatter plot of the BMI feature against the target.
x = df['bmi']
y = diabetes.target
plt.scatter(x, y)

The horizontal axis is BMI and the vertical axis is the disease progression. Looking at the figure, it seems that a straight line rising to the right could be drawn.

[Figure: bmi_vs_target_1.png — scatter plot of BMI vs. target]

How to solve simple regression

Given $ N $ pairs $ (x_i, y_i) $, the parameters $ A $ and $ B $ of a nice straight line $ y = Ax + B $ are the ones that minimize the sum of squared differences between the line and each point $ (x_i, y_i) $. In other words, find the $ A $ and $ B $ that minimize $ \sum_{i=1}^{N}(y_i - (Ax_i + B))^2 $.

Specifically, the above expression is partially differentiated with respect to $ A $ and $ B $ and the resulting simultaneous equations are solved. I will omit the algebra here, but I recommend trying it once with paper and pencil. Writing $ \sum_{i=1}^{N} x_i = N\bar{x} $ and $ \sum_{i=1}^{N} y_i = N\bar{y} $, the solution is

$$A = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}, \qquad B = \bar{y} - A\bar{x}$$

Once you have this, plugging the given $ (x_i, y_i) $ into the formula gives $ A $ and $ B $ directly.
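
For reference, here is a sketch of the omitted step (the standard least-squares derivation, filled in by me rather than taken from the original page). Writing $ S(A, B) = \sum_{i=1}^{N}(y_i - (Ax_i + B))^2 $ and setting both partial derivatives to zero:

$$\frac{\partial S}{\partial A} = -2\sum_{i=1}^{N} x_i (y_i - Ax_i - B) = 0, \qquad \frac{\partial S}{\partial B} = -2\sum_{i=1}^{N} (y_i - Ax_i - B) = 0$$

The second equation immediately gives $ B = \bar{y} - A\bar{x} $; substituting this into the first and rearranging yields the expression for $ A $ above.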

Let's implement this directly in Python.

You could code the formulas for $ A $ and $ B $ as-is, but numpy already has convenient functions, so let's use them: the denominator of $ A $ divided by $ n $ is the variance of the $ x $ values, and the numerator divided by $ n $ is the covariance of $ x $ and $ y $, so the $ 1/n $ factors cancel in the ratio.

# np.cov defaults to the unbiased estimate (ddof=1), so use ddof=1 for the
# variance as well; the common factor cancels in the ratio A = S_xy / S_xx.
S_xx = np.var(x, ddof=1)         # variance of x
S_xy = np.cov(x, y)[0][1]        # covariance of x and y

A = S_xy / S_xx                  # slope
B = np.mean(y) - A * np.mean(x)  # intercept

print("S_xx: ", S_xx)
print("S_xy: ", S_xy)
print("A: ", A)
print("B: ", B)

The result is as follows. Note that variance (var) comes in two flavors, the sample variance and the unbiased variance; np.cov defaults to the unbiased estimate (ddof=1), so the variance is also computed with ddof=1 to match. As long as the two are consistent, $ A $ comes out the same either way. I will describe the sample variance and the unbiased variance separately.

S_xx:  0.0022675736961455507
S_xy:  2.1529144226397467
A:  949.43526038395
B:  152.1334841628967
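
As an aside, here is a small illustration of how the two variances relate (my own sketch, not from the original text): they differ only by a factor of $ n/(n-1) $, which cancels out of $ A $ when the covariance uses the same ddof.

n = len(x)
print(np.var(x, ddof=0))                # sample variance: divides by n
print(np.var(x, ddof=1))                # unbiased variance: divides by n - 1
print(np.var(x, ddof=0) * n / (n - 1))  # matches the unbiased variance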

Strictly speaking, np.cov(x, y)[0][0] is already the (unbiased) variance of x, so the separate np.var call was not necessary; I computed it separately above for clarity.
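
A quick check of that claim (a small sketch of my own):

print(np.cov(x, y)[0][0])  # the [0][0] element of the covariance matrix
print(np.var(x, ddof=1))   # same value: the unbiased variance of x

Now let's plot the straight line obtained above on the scatter plot.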

plt.scatter(df['bmi'], diabetes.target)          # the data points
plt.plot(df['bmi'], A*df['bmi']+B, color='red')  # the fitted line y = Ax + B

Looking at the resulting graph, you can see that a reasonably nice straight line has been drawn.

[Figure: bmi_vs_target_2.png — scatter plot with the fitted regression line]

Do the same with scikit-learn

Doing the same thing with scikit-learn makes it much easier. You can use it without knowing what goes on inside, but I think that if you use it after understanding the theory, everything falls into place.

from sklearn.linear_model import LinearRegression

# Fit a simple regression model to the same BMI data.
model_lr = LinearRegression()
model_lr.fit(x.to_frame(), y)

That's all. The first argument of the fit method must be two-dimensional (samples x features), so the pandas Series is converted to a DataFrame with to_frame() ([Reference](https://medium.com/@yamasaKit/scikit-learn%E3%81%A7%E5%8D%98%E5%9B%9E%E5%B8%B0%E5%88%86%E6%9E%90%E3%82%92%E8%A1%8C%E3%81%86%E6%96%B9%E6%B3%95-f6baa2cb761e)).
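
Alternatively (my own note, not from the reference), the 1-D values can be reshaped into a single-column 2-D array:

# Equivalent: an (n_samples, 1) numpy array also works as the first argument.
model_lr.fit(x.values.reshape(-1, 1), y)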

The slope and intercept are stored in coef_ and intercept_ respectively (see the [API](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)), so let's compare them with the previous result.

print("coef_: ", model_lr.coef_[0])
print("intercept: ", model_lr.intercept_)

coef_:  949.4352603839491
intercept:  152.1334841628967

You got the same result.

Further understanding (correlation coefficient R and coefficient of determination R^2)

Correlation coefficient

The correlation coefficient $ R $ indicates how strongly two variables are related (how much they move together) and takes a value from -1 to 1. It is defined as the covariance of $ x $ and $ y $ divided by the product of their standard deviations, $ r = S_{xy} / (\sigma_x \sigma_y) $, and can be computed with numpy's corrcoef function.

r = S_xy / (x.std(ddof=1) * y.std(ddof=1))  # from the definition
rr = np.corrcoef(x, y)[0][1]                # using numpy directly

print(r)
print(rr)

0.5864501344746891
0.5864501344746891

These are the same value. The larger the absolute value, the stronger the relationship between the two variables.

Coefficient of determination

The coefficient of determination is an index of how well the obtained straight line matches the actual data; the closer it is to 1, the better the line reproduces the original data.

The coefficient of determination can be obtained from the total variation and the residual variation, and for simple regression it is equal to the square of the correlation coefficient.
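
In formula form (the standard definition, added here for reference): with predicted values $ \hat{y}_i = Ax_i + B $,

$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$

where the numerator is the residual variation and the denominator is the total variation.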

The coefficient of determination is obtained by the score method of the LinearRegression class.

R = model_lr.score(x.to_frame(), y)

print("R: ", R)
print("r^2: ", r**2)

R:  0.3439237602253803
r^2:  0.3439237602253809

They agree, up to floating-point error.

Summary

For simple regression analysis, I worked through a Python implementation while checking the theory. I hope this conveys how to draw a regression line and how to judge how well the obtained line represents the original data. By the way, I chose BMI as the explanatory variable because it has the highest correlation coefficient with the target. I would like to write about how to check that separately.
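
As a preview, here is a minimal sketch of one way to check that (my own illustration, not the method from the original text): compute the correlation coefficient of each feature against the target and compare.

# Correlation of each of the 10 features with the target; 'bmi' ranks highest.
for name in diabetes.feature_names:
    print(name, np.corrcoef(df[name], diabetes.target)[0][1])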
