We will find the simple regression equation using only NumPy and pandas for the basic numerical calculations (matplotlib is used only for plotting).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("https://raw.githubusercontent.com/karaage0703/machine-learning-study/master/data/karaage_data.csv")
print(df.head())
The goal of simple regression analysis is to find the two constants in the regression equation: the regression coefficient $a$ and the intercept $b$.
To obtain an accurate simple regression equation, the constants $a$ and $b$ must be determined so that the overall error, that is, the residuals $y - \hat{y}$, is as small as possible.
Keep this **definition of residuals** in mind.
We will solve the simple regression equation based on the least squares method.
mean_x = df['x'].mean()
mean_y = df['y'].mean()
The deviation is the difference between each individual value and the mean. Calculate $x_i - \bar{x}$ for the variable $x$ and $y_i - \bar{y}$ for the variable $y$; each is computed once per data point.
# Deviation of x
dev_x = []
for i in df['x']:
    dx = i - mean_x
    dev_x.append(dx)

# Deviation of y
dev_y = []
for j in df['y']:
    dy = j - mean_y
    dev_y.append(dy)
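A quick sanity check on the deviations (a sketch with made-up numbers, since it does not depend on the tutorial's CSV): deviations from the mean always cancel out to zero, which is exactly why the next step squares them before aggregating.

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # illustrative values, not the tutorial's data
dev_y = y - y.mean()                      # deviations from the mean

# The deviations sum to zero (up to floating-point error),
# so their plain sum carries no information about spread;
# the sum of squared deviations does.
print(dev_y.sum())
print((dev_y ** 2).sum())
```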
Calculate the variance using the deviations obtained above. The variance is the mean of the squared deviations: square each deviation, sum the squares, and divide by (number of data − 1), which gives the unbiased sample variance.
# Sum of squared deviations
ssdev_x = 0
for i in dev_x:
    d = i ** 2
    ssdev_x += d

# Variance
var_x = ssdev_x / (len(df) - 1)
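The loop above can be cross-checked against pandas' built-in `Series.var()`, which also uses the unbiased (n − 1) denominator by default. A minimal sketch with illustrative values (not the tutorial's CSV data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})  # illustrative values

# Loop-based unbiased variance, as in the text
mean_x = df['x'].mean()
ssdev_x = 0
for i in df['x']:
    ssdev_x += (i - mean_x) ** 2
var_x = ssdev_x / (len(df) - 1)

# pandas' Series.var() defaults to ddof=1, matching the manual calculation
assert np.isclose(var_x, df['x'].var())
print(var_x)
```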
The covariance $s_{xy}$ is one of the indices showing the strength of the relationship between two variables, and it is defined by the following equation:

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
# Sum of deviation products
spdev = 0
for i, j in zip(df['x'], df['y']):
    spdev += (i - mean_x) * (j - mean_y)

# Covariance
cov = spdev / (len(df) - 1)
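As with the variance, the loop result can be verified against pandas' built-in `Series.cov()`, which uses the same (n − 1) denominator. A sketch with illustrative values (not the tutorial's CSV data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, 5.0, 4.0, 6.0]})  # illustrative values

# Loop-based covariance, as in the text
mean_x, mean_y = df['x'].mean(), df['y'].mean()
spdev = 0
for i, j in zip(df['x'], df['y']):
    spdev += (i - mean_x) * (j - mean_y)
cov = spdev / (len(df) - 1)

# Series.cov() uses the same unbiased (n - 1) denominator
assert np.isclose(cov, df['x'].cov(df['y']))
print(cov)
```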
Here is the formula for the regression coefficient given by the least squares method: $a = \dfrac{s_{xy}}{s_x^2}$, the covariance divided by the variance of $x$.
a = cov / var_x
By rearranging the simple regression equation $y = ax + b$ into $b = \bar{y} - a\bar{x}$ (the regression line passes through the point of means), substitute the means $\bar{x}$ and $\bar{y}$ computed earlier and the regression coefficient $a$ just obtained.
b = mean_y - (a * mean_x)
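The closed-form solution $a = s_{xy}/s_x^2$ and $b = \bar{y} - a\bar{x}$ can be cross-checked against `np.polyfit` with degree 1, which solves the same least-squares line fit. A sketch with illustrative values (not the tutorial's CSV data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative values
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Closed-form least squares, as derived in the text
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
var_x = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
a = cov / var_x
b = y.mean() - a * x.mean()

# np.polyfit(deg=1) minimizes the same sum of squared residuals
a_np, b_np = np.polyfit(x, y, 1)
assert np.isclose(a, a_np) and np.isclose(b, b_np)
print(a, b)
```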
**As shown above, we obtained the simple regression equation by the least squares method, and it matches the result computed earlier with the machine-learning library scikit-learn. For further confirmation, let's also calculate the coefficient of determination by hand.**
Create the predicted values using the regression equation and find their variance. The coefficient of determination asks what proportion of the variance of the measured values $y$ the predictions account for, that is, how much of the variation in $y$ the regression can explain.
# Create the predicted values z
df['z'] = (a * df['x']) + b
print(df)

# Variance of the predicted values z
mean_z = df['z'].mean()
ssdev_z = 0
for i in df['z']:
    ssdev_z += (i - mean_z) ** 2
var_z = ssdev_z / (len(df) - 1)
print("Variance of predicted values:", var_z)
# Variance of the measured values y
ssdev_y = 0
for i in dev_y:
    ssdev_y += i ** 2
var_y = ssdev_y / (len(df) - 1)
print("Variance of measured value y:", var_y)
#Coefficient of determination
R = var_z / var_y
print("Coefficient of determination R:", R)
This confirms that the coefficient of determination also matches the result computed by scikit-learn above.
plt.plot(df['x'], df['y'], "o")  # Scatter plot of the measured values
plt.plot(df['x'], df['z'], "r")  # Regression line (predicted values)
plt.show()
So far, you have learned the algorithm behind simple regression analysis. In the real world, however, there are few cases where a phenomenon can be explained by a single factor; behind any given phenomenon, various factors are usually intertwined to a greater or lesser degree. Next, you will learn multiple regression analysis, which deals with three or more variables.