Hello, this is Motty. This time I will describe regression analysis using Python.
Regression analysis is a method of predicting target data from the data at hand. To do this, a structure describing the quantitative relationship is fitted to the data (a regression model). If the regression model is a straight line it is called a regression line; if an nth-order polynomial is fitted (polynomial regression) it is called a regression curve.
The least squares method is used to fit the model: it selects the coefficients that minimize the sum of squared residuals when approximating the measured data with a function such as a straight line.
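As a minimal sketch (separate from the scripts later in this article), the least squares coefficients of a straight line y ≈ a*x + b can be computed directly with NumPy; np.polyfit performs exactly this minimization:
import numpy as np

# Sample data: a noisy straight line
x = np.linspace(1, 10, 30)
y = 5 * x + np.random.randn(30) * 5

# Least squares: choose a and b that minimize sum((y - (a*x + b))**2)
# np.polyfit solves this minimization for a degree-1 polynomial
a, b = np.polyfit(x, y, deg = 1)
print("slope = {:.3f}, intercept = {:.3f}".format(a, b))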
To evaluate the fitted model, we use the coefficient of determination (R^2). The closer this value is to 1, the better the regression model fits the actual data. If the observed values are y_i, their mean is ȳ, and the values estimated by the function are f_i, it is expressed by the following equation:
R^2 = 1 - Σ(y_i - f_i)^2 / Σ(y_i - ȳ)^2
If the model fits the data perfectly, the coefficient of determination value will be 1.
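As a quick check (the arrays y and f below are just illustrative values), R^2 can be computed directly from this definition and compared with sklearn's r2_score:
import numpy as np
from sklearn.metrics import r2_score

y = np.array([1.0, 2.0, 3.0, 4.0])  # observed values
f = np.array([1.1, 1.9, 3.2, 3.8])  # values estimated by the model

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum((y - f)**2) / np.sum((y - np.mean(y))**2)
print(r2_manual, r2_score(y, f))  # the two values agree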
I fitted a regression line to data generated by adding noise to a linear function, a quadratic function, and a cubic function.
LinearRegression.py
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#sklearn
from sklearn.linear_model import LinearRegression as reg
from sklearn.metrics import r2_score
#Data related
CI =["black","red","blue","yellow","green","orange","purple","skyblue"]#ColorIndex
N = 30 #The number of samples
x = np.linspace(1,10,N)
y1 = x *5 + np.random.randn(N)*5
y2 = 2*(x-2)*(x-7) + np.random.randn(N)*5
y3 = 3*(x-1)*(x-4)*(x-7) + np.random.randn(N)*10
x = x.reshape([-1,1])
y1 = y1.reshape([-1,1])
y2 = y2.reshape([-1,1])
y3 = y3.reshape([-1,1])
#Learning
clf1, clf2, clf3 = reg(),reg(),reg()
clf1.fit(x,y1),clf2.fit(x,y2),clf3.fit(x,y3)
#x Predicted value for the data
y1_pred,y2_pred,y3_pred = clf1.predict(x),clf2.predict(x),clf3.predict(x)
#drawing
fig = plt.figure(figsize = (15,15))
ax1,ax2,ax3 = fig.add_subplot(3,3,1),fig.add_subplot(3,3,2),fig.add_subplot(3,3,3)
#Data
ax1.scatter(x,y1,c = CI[1],label = "R^2 = {}".format(r2_score(y1,y1_pred)))
ax2.scatter(x,y2,c = CI[2],label = "R^2 = {}".format(r2_score(y2,y2_pred)))
ax3.scatter(x,y3,c = CI[3],label = "R^2 = {}".format(r2_score(y3,y3_pred)))
ax1.legend(),ax2.legend(),ax3.legend()
#Regression line
ax1.plot(x,clf1.predict(x), c = CI[0])
ax2.plot(x,clf2.predict(x), c = CI[0])
ax3.plot(x,clf3.predict(x), c = CI[0])
fig.suptitle("RinearLegression", fontsize = 15)
ax1.set_title("1")
ax2.set_title("2")
ax3.set_title("3")
Not surprisingly, the results show that the straight-line fit works best for the data generated from the linear function.
For datasets such as 2 and 3, it is more appropriate to fit a regression curve such as a higher-order polynomial.
polynomial.py
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression as reg
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures as PF
#Data related
CI =["black","red","blue","yellow","green","orange","purple","skyblue"]#ColorIndex
N = 10 #The number of samples
x = np.linspace(1,10,N)
y3 = 3*(x-1)*(x-4)*(x-7) + np.random.randn(N)*10
x = x.reshape([-1,1])
y3 = y3.reshape([-1,1])
#Learning and drawing for each degree
DegreeSet = [1,2,3]
for dg in DegreeSet:
    #Generate polynomial features of the given degree
    pf = PF(degree = dg, include_bias = False)
    x_poly = pf.fit_transform(x)
    #Fit a linear regression on the polynomial features
    poly_reg = reg()
    poly_reg.fit(x_poly,y3)
    #Predicted values for the x data
    polypred = poly_reg.predict(x_poly)
    #drawing
    plt.scatter(x,y3,c = CI[dg], label = "R^2={}".format(r2_score(y3,polypred)))
    plt.plot(x, polypred,c = CI[0])
plt.legend()
plt.title("Regression")
plt.show()
As a result, the fit improves and the coefficient of determination rises each time the degree is increased from 1 to 2 to 3.
The higher the degree, the more expressive the model becomes and the better it fits the training data, but generalization performance deteriorates (overfitting). To address this, it is advisable to keep the regression simple by penalizing model complexity, for example with an information criterion such as AIC.
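As an illustrative sketch of overfitting (this setup mirrors the cubic data above but is not part of the original scripts), one can compare training and test R^2 after a train/test split; a very high degree will typically fit the training points almost perfectly while the test score drops:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

N = 10
x = np.linspace(1, 10, N).reshape(-1, 1)
y = 3*(x - 1)*(x - 4)*(x - 7) + np.random.randn(N, 1)*10
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)

for degree in [1, 3, 8]:
    pf = PolynomialFeatures(degree = degree, include_bias = False)
    model = LinearRegression()
    model.fit(pf.fit_transform(x_train), y_train)
    # Training R^2 keeps rising with the degree, while the test R^2 typically falls off
    print(degree,
          r2_score(y_train, model.predict(pf.transform(x_train))),
          r2_score(y_test, model.predict(pf.transform(x_test))))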
(If fitting the model to the data is recast as an AIC minimization problem, then, as the formula AIC = -2 ln L + 2k shows, a penalty is imposed on the number of parameters, so the optimum degree can be selected.)
Since sklearn does not provide a suitable function for this, I plan to evaluate the model with my own AIC implementation in a follow-up article.
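Until then, here is a rough sketch of one possible approach (the aic_gaussian helper below is my own provisional name, and the formula assumes Gaussian errors): compute AIC = n*ln(RSS/n) + 2k for each degree and choose the degree with the smallest value.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

N = 10
x = np.linspace(1, 10, N).reshape(-1, 1)
y = 3*(x - 1)*(x - 4)*(x - 7) + np.random.randn(N, 1)*10

def aic_gaussian(y_true, y_pred, k):
    # AIC for a least squares fit with Gaussian errors (up to an additive constant):
    # AIC = n * ln(RSS / n) + 2 * k, where k is the number of estimated parameters
    n = len(y_true)
    rss = np.sum((y_true - y_pred)**2)
    return n * np.log(rss / n) + 2 * k

for degree in [1, 2, 3, 4, 5]:
    pf = PolynomialFeatures(degree = degree, include_bias = False)
    x_poly = pf.fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    k = x_poly.shape[1] + 1  # polynomial coefficients plus the intercept
    print(degree, aic_gaussian(y, model.predict(x_poly), k))
# The degree with the smallest AIC balances goodness of fit against model complexity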