We will find the simple regression equation using only NumPy and pandas for the basic numerical calculations (matplotlib is used only for plotting).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("https://raw.githubusercontent.com/karaage0703/machine-learning-study/master/data/karaage_data.csv")
print(df.head())
The goal of simple regression analysis is to find the two constants in the regression equation: the regression coefficient $a$ and the intercept $b$.
To obtain an accurate simple regression equation, the constants $a$ and $b$ must be determined so that the overall error, that is, the residuals $y - \hat{y}$, is as small as possible.
Keep this **definition of residuals** in mind.
We will solve the simple regression equation based on the least squares method.
mean_x = df['x'].mean()
mean_y = df['y'].mean()
The deviation is the difference between each individual value and the mean. Calculate $x_i - \bar{x}$ for the variable $x$ and $y_i - \bar{y}$ for the variable $y$; each is computed once per data point.
# Deviation of x
dev_x = []
for i in df['x']:
    dx = i - mean_x
    dev_x.append(dx)

# Deviation of y
dev_y = []
for j in df['y']:
    dy = j - mean_y
    dev_y.append(dy)
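A quick sanity check on the deviations (a sketch with made-up numbers, since it does not depend on the tutorial's CSV): deviations from the mean always cancel out to zero, which is exactly why the next step squares them before aggregating.

```python
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # illustrative values, not the tutorial's data
dev_y = y - y.mean()                      # deviations from the mean

# The deviations sum to zero (up to floating-point error),
# so their plain sum carries no information about spread;
# the sum of squared deviations does.
print(dev_y.sum())
print((dev_y ** 2).sum())
```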
Calculate the variance using the deviations obtained above. The variance is the mean of the squared deviations: square each deviation, sum the squares, and divide by (number of data − 1), which gives the unbiased sample variance.
# Sum of squared deviations
ssdev_x = 0
for i in dev_x:
    d = i ** 2
    ssdev_x += d

# Variance
var_x = ssdev_x / (len(df) - 1)
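The loop above can be cross-checked against pandas' built-in `Series.var()`, which also uses the unbiased (n − 1) denominator by default. A minimal sketch with illustrative values (not the tutorial's CSV data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})  # illustrative values

# Loop-based unbiased variance, as in the text
mean_x = df['x'].mean()
ssdev_x = 0
for i in df['x']:
    ssdev_x += (i - mean_x) ** 2
var_x = ssdev_x / (len(df) - 1)

# pandas' Series.var() defaults to ddof=1, matching the manual calculation
assert np.isclose(var_x, df['x'].var())
print(var_x)
```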
The covariance $s_{xy}$ is one of the indices showing the strength of the relationship between two variables, and it is defined by the following equation:

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
# Sum of deviation products
spdev = 0
for i, j in zip(df['x'], df['y']):
    spdev += (i - mean_x) * (j - mean_y)

# Covariance
cov = spdev / (len(df) - 1)
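As with the variance, the loop result can be verified against pandas' built-in `Series.cov()`, which uses the same (n − 1) denominator. A sketch with illustrative values (not the tutorial's CSV data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'y': [2.0, 4.0, 5.0, 4.0, 6.0]})  # illustrative values

# Loop-based covariance, as in the text
mean_x, mean_y = df['x'].mean(), df['y'].mean()
spdev = 0
for i, j in zip(df['x'], df['y']):
    spdev += (i - mean_x) * (j - mean_y)
cov = spdev / (len(df) - 1)

# Series.cov() uses the same unbiased (n - 1) denominator
assert np.isclose(cov, df['x'].cov(df['y']))
print(cov)
```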
Here is the formula for the regression coefficient given by the least squares method: $a = \dfrac{s_{xy}}{s_x^2}$, the covariance divided by the variance of $x$.
a = cov / var_x
By rearranging the simple regression equation $y = ax + b$ into $b = \bar{y} - a\bar{x}$ (the regression line passes through the point of means), substitute the means $\bar{x}$ and $\bar{y}$ computed earlier and the regression coefficient $a$ just obtained.
b = mean_y - (a * mean_x)
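The closed-form solution $a = s_{xy}/s_x^2$ and $b = \bar{y} - a\bar{x}$ can be cross-checked against `np.polyfit` with degree 1, which solves the same least-squares line fit. A sketch with illustrative values (not the tutorial's CSV data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative values
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Closed-form least squares, as derived in the text
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
var_x = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
a = cov / var_x
b = y.mean() - a * x.mean()

# np.polyfit(deg=1) minimizes the same sum of squared residuals
a_np, b_np = np.polyfit(x, y, 1)
assert np.isclose(a, a_np) and np.isclose(b, b_np)
print(a, b)
```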
**As shown above, we obtained the simple regression equation by the least squares method, and it matches the result computed earlier with the machine-learning library scikit-learn. For further confirmation, let's also calculate the coefficient of determination by hand.**
Create the predicted values using the regression equation and find their variance. The coefficient of determination asks what proportion of the variance of the measured values $y$ the predictions account for, that is, how much of the variation in $y$ the regression can explain.
# Create the predicted values z
df['z'] = (a * df['x']) + b
print(df)

# Variance of the predicted values z
mean_z = df['z'].mean()
ssdev_z = 0
for i in df['z']:
    ssdev_z += (i - mean_z) ** 2
var_z = ssdev_z / (len(df) - 1)
print("Variance of predicted values:", var_z)
# Variance of the measured values y
ssdev_y = 0
for i in dev_y:
    ssdev_y += i ** 2
var_y = ssdev_y / (len(df) - 1)
print("Variance of measured value y:", var_y)
#Coefficient of determination
R = var_z / var_y
print("Coefficient of determination R:", R)
This confirms that the coefficient of determination also matches the result computed by scikit-learn above.
plt.plot(df['x'], df['y'], "o")  # Scatter plot of the measured values
plt.plot(df['x'], df['z'], "r")  # Regression line (predicted values)
plt.show()
So far, you have learned the algorithm behind simple regression analysis. In the real world, however, there are few cases where a phenomenon can be explained by a single factor; behind any given phenomenon, various factors are usually intertwined to a greater or lesser degree. Next, you will learn multiple regression analysis, which deals with three or more variables.