This model predicts a single output value from a single input variable. We assume that each observation satisfies the relationship
y_i = wx_i+b+\varepsilon_i
where $\varepsilon_i$ is a noise term. The model is best specified once you have chosen an appropriate __loss function__ (e.g., the sum of squared errors) and found the straight line that minimizes it. Here the loss function $L(w, b)$ is the sum of squared errors (also called the residual sum of squares):
L(w, b):=\sum_{i=1}^n \{y_i-(wx_i+b)\}^2
Minimizing this, you find that
b = \bar{y}-w\bar{x}, \ w=\frac{Cov(x, y)}{Var(x)}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}
where $\bar{x}$ is the mean of $x$, $Var(x)$ is the variance of $x$, and $Cov(x, y)$ is the covariance of $x$ and $y$. In particular, with this $(w, b)$,
\begin{eqnarray}
&&y=wx+b=wx+(\bar{y}-w\bar{x})=\bar{y}+w(x-\bar{x})\\
&\Leftrightarrow& y-\bar{y}=w(x-\bar{x})
\end{eqnarray}
Therefore, the regression line passes through the sample mean $(\bar{x}, \bar{y})$ of the observed data.
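For reference, here is a brief sketch of where the formulas for $w$ and $b$ come from: set the partial derivatives of $L(w, b)$ to zero and solve.
\begin{eqnarray}
\frac{\partial L}{\partial b} &=& -2\sum_{i=1}^n \{y_i-(wx_i+b)\} = 0 \ \Rightarrow \ b=\bar{y}-w\bar{x}\\
\frac{\partial L}{\partial w} &=& -2\sum_{i=1}^n x_i\{y_i-(wx_i+b)\} = 0 \ \Rightarrow \ \sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) = w\sum_{i=1}^n(x_i-\bar{x})^2
\end{eqnarray}
where the second implication uses $b=\bar{y}-w\bar{x}$ from the first. Dividing both sides by $\sum_{i=1}^n(x_i-\bar{x})^2$ gives $w = Cov(x, y)/Var(x)$.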
I will implement this myself in Python and then implement it again with the LinearRegression model from sklearn's linear_model. Random seeds are set for reproducibility. First, import the required libraries.
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import randn
Next, define linear regression as a function.
def LinearRegression(x, y):
    n = len(x)
    # sample means of x and y
    temp_x = x.sum()/n
    temp_y = y.sum()/n
    # slope: covariance of x and y divided by variance of x
    w = ((x-temp_x)*(y-temp_y)).sum()/((x-temp_x)**2).sum()
    # intercept: the regression line passes through the sample mean
    b = temp_y - w * temp_x
    return w, b
As training data, assume $y = x + 1 + noise$, where $noise$ is a standard normal random number multiplied by 0.1.
# Initial values
# assume y = x + 1 plus noise
x = np.array([1,2,3,4,5])
np.random.seed(seed=0)
y = np.array([2,3,4,5,6])+np.random.randn(5)*0.1
print(y)
w, b = LinearRegression(x, y)
print(w, b)
Running this gives roughly $w = 1.02, b = 1.08$. Just to be sure, let's visualize it with a graph.
plt.scatter(x, y, color="k")
xx = np.linspace(0, 5, 50)
yy = xx * w + b
plt.plot(xx, yy)
plt.show()
Looks good!
Let's rewrite the code as a class. I created a class called model with a fit method that computes the parameters, a predict method that outputs predictions, and a score method that computes the coefficient of determination. The coefficient of determination $R^2$ (the closer it is to 1, the better the fit) is defined as
R^2:= 1-\frac{\sum_{i=1}^n (y_i-(wx_i+b))^2}{\sum_{i=1}^n (y_i-\bar{y})^2}
class model():
    def fit(self, X, y):
        n = len(X)
        # sample means
        temp_X = X.sum()/n
        temp_y = y.sum()/n
        # slope and intercept by least squares
        self.w = ((X-temp_X)*(y-temp_y)).sum()/((X-temp_X)**2).sum()
        self.b = temp_y - self.w * temp_X

    def predict(self, X):
        return self.w*X + self.b

    def score(self, X, y):
        n = len(X)
        # coefficient of determination R^2
        return 1 - ((y-(self.w*X+self.b))**2).sum() / ((y-y.sum()/n)**2).sum()
Enter the training data as before, and also run a quick test.
X = np.array([1,2,3,4,5])
np.random.seed(seed=0)
y = np.array([2,3,4,5,6])+np.random.randn(5)*0.1
lr = model()
lr.fit(X, y)
print("w, b={}, {}".format(w, b))
#test data
test_X = np.array([6,7,8,9,10])
pre = lr.predict(test_X)
print(pre)
#Coefficient of determination R^2
print(lr.score(X, y))
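As a quick sanity check, note that in simple regression with an intercept $R^2$ equals the squared correlation coefficient between $x$ and $y$, so the value above can also be computed with numpy:
# Sanity check: squared correlation coefficient should match lr.score(X, y)
print(np.corrcoef(X, y)[0, 1]**2)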
Make a graph.
#Display the result in a graph
plt.scatter(X, y, color="k")
plt.scatter(test_X, pre, color="r")
xx = np.linspace(0, 10, 50)
yy = xx * lr.w + lr.b
plt.plot(xx, yy)
plt.show()
Finally, let's briefly write the same thing using LinearRegression() from linear_model.
from sklearn import linear_model
X = np.array([1,2,3,4,5])
np.random.seed(seed=0)
y = np.array([2,3,4,5,6])+np.random.randn(5)*0.1
X = X.reshape(-1,1)
model = linear_model.LinearRegression()
model.fit(X, y)
print(model.coef_[0])
print(model.intercept_)
#Coefficient of determination R^2
print(model.score(X, y))
$w$ corresponds to `coef_[0]` and $b$ to `intercept_`; you can check that the values match the earlier results. One point to note: for one-dimensional input data, the input must be given as `X = np.array([[1],[2],[3],[4],[5]])`, and `X = np.array([1,2,3,4,5])` will raise an error, as the short sketch below shows.
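Here is a minimal sketch of that shape difference:
X1 = np.array([1, 2, 3, 4, 5])
print(X1.shape)           # (5,)   1-D array: fit() rejects this shape
X2 = X1.reshape(-1, 1)
print(X2.shape)           # (5, 1) 2-D column vector: the shape fit() expects
Next, let's make a prediction using the test data.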
test_X = test_X.reshape(-1,1)
pre = model.predict(test_X)
Make a graph.
plt.scatter(X, y)
plt.scatter(test_X, pre)
plt.plot(xx, model.coef_[0]*xx+model.intercept_)
plt.show()
Writing code with sklearn makes it this easy to call, and other methods can be applied in much the same way. What remains are optional elements such as parameters and splitting the data into training and test sets (see the sketch after the snippet below).
# Define the model
model = linear_model.LinearRegression()
# Train (X: input data, y: correct labels)
model.fit(X, y)
# Predict (XX: unknown data)
model.predict(XX)
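For example, splitting the data into training and test sets could look like the following minimal sketch (train_test_split is sklearn's utility for this; the 80/20 split is just an illustrative choice):
from sklearn.model_selection import train_test_split

# Toy data: y = x + 1 plus noise, as before
X = np.arange(1, 11).reshape(-1, 1)
y = X.ravel() + 1 + np.random.randn(10)*0.1
# Hold out 20% of the samples as test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # R^2 on the held-out test data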
That is all for simple regression (the least squares method). To be continued.
Strictly speaking, $b$ should have been written as the vector $b = (b, b, \cdots, b) \in \mathbf{\mathrm{R}}^n$. In the Python code, thanks to broadcasting, the sum with vectors and matrices is computed automatically even when $b$ is a scalar.
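A minimal sketch of this broadcasting behavior:
x = np.array([1, 2, 3])
w, b = 2.0, 1.0
# The scalar b is broadcast to every element, as if it were the vector (b, b, b)
print(w * x + b)               # [3. 5. 7.]
print(w * x + np.full(3, b))   # [3. 5. 7.]  same result with an explicit vector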
I wrote
from sklearn import linear_model
here, but
from sklearn.linear_model import LinearRegression
is easier. I wonder whether that form is also the more common one in other books.
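With that import the model is created directly, for example (a minimal sketch on noise-free toy data):
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 4, 5, 6])
model = LinearRegression()                # no linear_model. prefix needed
model.fit(X, y)
print(model.coef_[0], model.intercept_)   # approximately 1.0 and 1.0 for this exact y = x + 1 data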