Regression analysis in Python

Hello, this is Motty. In this post, I describe regression analysis using Python.

Regression

Regression analysis is a method of predicting a target variable from the data at hand. To do so, a model describing the quantitative relationship between the variables (a regression model) is fitted to the data. If the regression model is a straight line, it is called a regression line; if an nth-degree polynomial is fitted by polynomial regression, it is called a regression curve.

How to determine the model

The least squares method is used to determine the model. It selects the coefficients that minimize the sum of squared residuals when approximating measured data with a function such as a straight line.
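As a minimal sketch of the least squares idea, the snippet below fits a straight line to made-up noisy data in two ways: with `np.linalg.lstsq`, and with the closed-form formulas for simple regression. The data, the seed, and the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sample data: y = 5x plus Gaussian noise (seed fixed for repeatability)
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 30)
y = 5 * x + rng.normal(scale=5, size=x.size)

# Design matrix with columns [x, 1], so the model is y ~ a*x + b
A = np.column_stack([x, np.ones_like(x)])

# lstsq returns the coefficients that minimize sum((A @ coef - y)**2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coef

# The same coefficients from the closed-form solution for a straight line
a_closed = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_closed = y.mean() - a_closed * x.mean()

def sum_sq_residuals(slope, intercept):
    """Sum of squared residuals for the line y = slope*x + intercept."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))

print("slope:", a, "intercept:", b)
print("sum of squared residuals:", sum_sq_residuals(a, b))
```

Moving the slope or the intercept away from the fitted values in either direction can only increase the sum of squared residuals, which is what "least squares" means here.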

Evaluation method

Use the coefficient of determination, R^2. The closer this value is to 1, the better the regression model fits the observed data. With observed values y_i, their mean \bar{y}, and the estimates f_i given by the function, it is expressed by the following equation.

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$

If the model fits the data perfectly, the coefficient of determination value will be 1.
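To make the definition concrete, here is a small sketch that computes the coefficient of determination by hand as 1 minus the residual sum of squares divided by the total sum of squares, and compares it with `r2_score` from sklearn. The helper name `r_squared` and the sample numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    ss_res = np.sum((y - f) ** 2)          # squared residuals of the model
    ss_tot = np.sum((y - y.mean()) ** 2)   # squared deviations from the mean
    return 1.0 - ss_res / ss_tot

y_obs = np.array([1.0, 2.0, 3.0, 4.0])       # made-up observations
f_good = np.array([1.1, 1.9, 3.2, 3.8])      # estimates close to the observations
f_mean = np.full_like(y_obs, y_obs.mean())   # always predicting the mean

print("perfect fit:       ", r_squared(y_obs, y_obs))
print("close fit:         ", r_squared(y_obs, f_good))
print("predicting the mean:", r_squared(y_obs, f_mean))
print("sklearn r2_score:  ", r2_score(y_obs, f_good))
```

A model that always predicts the mean of the observations scores 0, so the value can be read as how much better the model does than that baseline.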

Linear regression

I fitted a regression line to data generated from a linear, a quadratic, and a cubic function, each with added noise.

LinearRegression.py


import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter magic: uncomment only when running in a notebook

#sklearn
from sklearn.linear_model import LinearRegression as reg
from sklearn.metrics import r2_score

#Data related
CI =["black","red","blue","yellow","green","orange","purple","skyblue"]#ColorIndex

N = 30 #The number of samples
x = np.linspace(1,10,N)
y1 = x *5 + np.random.randn(N)*5
y2 =  2*(x-2)*(x-7) +   np.random.randn(N)*5
y3 =  3*(x-1)*(x-4)*(x-7) +  np.random.randn(N)*10

x = x.reshape([-1,1])
y1 = y1.reshape([-1,1])
y2 = y2.reshape([-1,1])
y3 = y3.reshape([-1,1])

# Learning: fit one linear regression model per dataset
clf1 = reg().fit(x, y1)
clf2 = reg().fit(x, y2)
clf3 = reg().fit(x, y3)

# Predicted values for the x data
y1_pred, y2_pred, y3_pred = clf1.predict(x), clf2.predict(x), clf3.predict(x)

# Drawing: one subplot per dataset
fig = plt.figure(figsize=(15, 15))
ax1, ax2, ax3 = fig.add_subplot(3, 3, 1), fig.add_subplot(3, 3, 2), fig.add_subplot(3, 3, 3)

# Data points, labelled with the coefficient of determination
ax1.scatter(x, y1, c=CI[1], label="R^2 = {}".format(r2_score(y1, y1_pred)))
ax2.scatter(x, y2, c=CI[2], label="R^2 = {}".format(r2_score(y2, y2_pred)))
ax3.scatter(x, y3, c=CI[3], label="R^2 = {}".format(r2_score(y3, y3_pred)))

# Regression lines (reuse the predictions computed above)
ax1.plot(x, y1_pred, c=CI[0])
ax2.plot(x, y2_pred, c=CI[0])
ax3.plot(x, y3_pred, c=CI[0])

for ax in (ax1, ax2, ax3):
    ax.legend()

fig.suptitle("LinearRegression", fontsize=15)
ax1.set_title("1")
ax2.set_title("2")
ax3.set_title("3")
plt.show()

Not surprisingly, the results show that the straight line fits best for the data generated from the linear function.

[Figure: scatter plots of datasets 1, 2, and 3 with the fitted regression lines and their R^2 values]

Polynomial regression

For datasets such as 2 and 3, it may be more appropriate to fit a regression curve, such as a higher-degree polynomial.

polynomial.py



import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter magic: uncomment only when running in a notebook
from sklearn.linear_model import LinearRegression as reg
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures as PF


#Data related
CI =["black","red","blue","yellow","green","orange","purple","skyblue"]#ColorIndex
N = 10 #The number of samples
x = np.linspace(1,10,N)
y3 =  3*(x-1)*(x-4)*(x-7) +  np.random.randn(N)*10
x = x.reshape([-1,1])
y3 = y3.reshape([-1,1])


# Learning: fit a polynomial regression for each degree
DegreeSet = [1, 2, 3]
for dg in DegreeSet:
    # Expand x into the features [x, x^2, ..., x^dg]
    pf = PF(degree=dg, include_bias=False)
    x_poly = pf.fit_transform(x)
    poly_reg = reg()
    poly_reg.fit(x_poly, y3)

    # Predicted values for the x data
    polypred = poly_reg.predict(x_poly)

    # Drawing
    plt.scatter(x, y3, c=CI[dg], label="R^2={}".format(r2_score(y3, polypred)))
    plt.plot(x, polypred, c=CI[0])
    plt.legend()
    plt.title("Regression")
    plt.show()

As a result, the fit improves and the coefficient of determination rises each time the degree is increased, from 1 to 2 to 3.

[Figure: scatter plots of dataset 3 with fitted polynomial curves of degree 1, 2, and 3 and their R^2 values]

Should I raise the order?

The higher the degree, the more expressive the model becomes and the better it fits the data at hand, but generalization performance on new data tends to drop (overfitting). To address this, it is advisable to choose the degree with a criterion that penalizes model complexity, such as AIC (Akaike information criterion).

(When fitting the model to the data is recast as minimizing AIC = -2 ln L + 2k, where L is the maximized likelihood and k is the number of parameters, each increase in degree is penalized through k, so the optimum degree can be selected.)
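As a rough sketch of how such a penalty behaves, the snippet below compares R^2 and AIC across polynomial degrees on noisy cubic data like dataset 3. It assumes independent Gaussian residuals, in which case AIC equals n ln(RSS/n) + 2k up to an additive constant. The helper `gaussian_aic`, the seed, and the range of degrees are assumptions made for this illustration, not a library function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

def gaussian_aic(y, y_pred, n_coef):
    """AIC up to an additive constant, assuming i.i.d. Gaussian residuals.

    n_coef counts the regression coefficients including the intercept;
    one more parameter is added for the estimated noise variance.
    """
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    return n * np.log(rss / n) + 2 * (n_coef + 1)

# Noisy cubic data (seed fixed for repeatability)
rng = np.random.default_rng(0)
N = 30
x = np.linspace(1, 10, N).reshape(-1, 1)
y3 = (3 * (x - 1) * (x - 4) * (x - 7)).ravel() + rng.normal(scale=10, size=N)

results = {}  # degree -> (R^2, AIC)
for dg in range(1, 7):
    x_poly = PolynomialFeatures(degree=dg, include_bias=False).fit_transform(x)
    model = LinearRegression().fit(x_poly, y3)
    pred = model.predict(x_poly)
    results[dg] = (r2_score(y3, pred), gaussian_aic(y3, pred, n_coef=dg + 1))
    print(f"degree={dg}  R^2={results[dg][0]:.4f}  AIC={results[dg][1]:.2f}")

best_degree = min(results, key=lambda d: results[d][1])
print("degree with the lowest AIC:", best_degree)
```

R^2 on the fitting data never decreases as the degree grows, whereas AIC only drops when the reduction in residuals outweighs the extra parameters, so the two columns can disagree about which degree is "best".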

Since I could not find a suitable function for this in sklearn, I plan to implement AIC myself and use it to evaluate the models in a follow-up post.
