Over the last few years, with the growing interest in AI and big data, analytical methods such as k-means have become widely known as machine learning approaches. These are all **multivariate analysis** methods that have been used in business and academic fields for decades, since the Shōwa era. One of the most popular is **regression analysis**. So first, let's implement **simple regression analysis** using scikit-learn, a machine learning library. (As a general rule, write the code and check the results on Google Colaboratory.)
Regression analysis can be divided into **simple regression analysis**, which involves two variables, and **multiple regression analysis**, which involves three or more. First, consider **simple regression analysis**. Simple regression analysis derives a **linear** or **non-linear** law from data (a phenomenon); put plainly, it reveals rules such as "$y$ increases or decreases at a constant rate as $x$ increases."
A very simple **linear simple regression analysis** is expressed by the following equation, called the **regression equation** (simple regression equation):

$$y = ax + b$$

Once you have decided on $a$ and $b$, you can draw a straight line. Then $x$ can explain $y$, or $x$ can be used to predict $y$. Since the variable $y$ is explained by the variable $x$, the target $y$ is called the **objective variable** (dependent variable), and the $x$ that explains it is called the **explanatory variable** (independent variable). Also, $a$, which gives the slope of the regression line, is called the **regression coefficient**, and $b$, where the line crosses the $y$ axis, is called the **intercept**. In other words, the goal of simple regression analysis is to find the regression coefficient $a$ and the intercept $b$.
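As a toy illustration, the regression equation simply maps an $x$ value to a predicted $y$. The slope and intercept below are hypothetical numbers chosen for the example, not fitted from any data:

```python
# Toy illustration of the regression equation y = a*x + b
# (a and b are hypothetical values, not fitted from data)
a = 2.0  # regression coefficient (slope)
b = 1.0  # intercept

x = 5.0
y = a * x + b
print(y)  # 11.0
```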
```python
# Libraries required for numerical calculation
import numpy as np
import pandas as pd
# Package to draw graphs
import matplotlib.pyplot as plt
# Linear model from scikit-learn
from sklearn import linear_model

df = pd.read_csv("https://raw.githubusercontent.com/karaage0703/machine-learning-study/master/data/karaage_data.csv")
df
```
The variable `df`, where the data is stored, is a pandas DataFrame. For later calculations, it is converted to NumPy array form and stored in the variables `x` and `y`.
```python
x = df.loc[:, ['x']].values
y = df['y'].values
```
The variable $x$ is stored as two-dimensional data by slicing out the elements of [all rows, column `x`] with pandas' `loc` accessor and converting them to a NumPy array with `values`. The variable $y$ is extracted as one-dimensional data by specifying the column name `y` and converted to a NumPy array in the same way.
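To see why the two extraction styles give different dimensionality, here is a minimal sketch with a made-up DataFrame (the values are illustrative only, not the actual CSV data):

```python
import pandas as pd

# A made-up DataFrame standing in for the CSV data
df = pd.DataFrame({"x": [1, 2, 3], "y": [2.0, 4.1, 6.2]})

x = df.loc[:, ['x']].values  # list of columns -> 2-D array, shape (3, 1)
y = df['y'].values           # single column name -> 1-D array, shape (3,)
print(x.shape, y.shape)  # (3, 1) (3,)
```

This matters because scikit-learn's `fit` expects the explanatory variables as a two-dimensional array (samples × features), while the objective variable can be one-dimensional.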
Using the drawing package matplotlib, pass (variable `x`, variable `y`, "marker type") as the arguments of the `plot` function.

```python
plt.plot(x, y, "o")
```
**From here, we will use the linear regression model of the machine learning library scikit-learn to calculate the regression coefficient $a$ and the intercept $b$.**
```python
# Load the linear regression model as the object clf
clf = linear_model.LinearRegression()
# Fit the model clf to x and y
clf.fit(x, y)
```
The regression coefficient can be obtained with `coef_` and the intercept with `intercept_`.

```python
# Regression coefficient
a = clf.coef_
# Intercept
b = clf.intercept_
```
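To confirm that the fitted coefficient and intercept really define the regression line, a small sketch on made-up data (the numbers below are arbitrary) checks that `predict()` agrees with $ax + b$:

```python
import numpy as np
from sklearn import linear_model

# Made-up data, used only to check that predict() equals a*x + b
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.0])

clf = linear_model.LinearRegression()
clf.fit(x, y)

a = clf.coef_[0]    # slope as a scalar
b = clf.intercept_  # intercept
print(np.isclose(clf.predict(np.array([[5.0]]))[0], a * 5.0 + b))  # True
```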
Then obtain the coefficient of determination with `score(x, y)`.

```python
# Coefficient of determination
r = clf.score(x, y)
```
The coefficient of determination is an index of the accuracy of the obtained **regression equation**, where accuracy means "how well the regression equation can explain the distribution of the data". The actually observed data are called **measured values**. As the scatter plot shows, the measured values are scattered over the coordinates; by summarizing them in a single straight line, we discard some of the information contained in the original variance. This discarded part, the error remaining after regression, is called the **residual**, and the coefficient of determination can be expressed as a fraction:

$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$

The denominator is the variance (total sum of squares) of the measured objective variable $y$, and subtracting the residual term in the numerator leaves the share of variance explained by the predicted values $\hat{y}$. In other words, the coefficient of determination tells us what proportion of the variance of the measured values is accounted for by the variance of the predicted values. Since it is a ratio, the coefficient of determination $R^2$ takes a value between 0 and 1, and the closer it is to 1, the better the accuracy of the regression equation.
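This definition can be checked numerically. The sketch below, on made-up roughly linear data, computes $R^2$ by hand from the residual and total sums of squares and compares it with `score()`:

```python
import numpy as np
from sklearn import linear_model

# Made-up, roughly linear data with small noise
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 2.0 + np.array([0.1, -0.2, 0.05, 0.3, -0.1,
                                      0.0, 0.2, -0.3, 0.1, -0.05])

clf = linear_model.LinearRegression()
clf.fit(x, y)
y_hat = clf.predict(x)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, clf.score(x, y)))  # True
```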
First, generate the $x$ values that form the basis of the regression line using NumPy's `linspace` function, which takes (start point, end point, number of points) as arguments.
```python
fig_x = np.linspace(0, 40, 40)
print(a.shape)      # Regression coefficient
print(b.shape)      # Intercept
print(fig_x.shape)  # x values
```
Tips
Note that the regression coefficient $a$ returned by `coef_` is not a scalar but a NumPy array (shape `(1,)` here, since there is one explanatory variable). If it comes back as a two-dimensional array (shape `(1, 1)`), substituting it directly into $y = ax + b$ broadcasts against the 40-element array of $x$ values into a two-dimensional result, which breaks the plot. Therefore, we convert the regression coefficient $a$ to a one-dimensional value so that all $x$ values are multiplied equally.
When defining the formula for the $y$ values in the variable `fig_y`, we use NumPy's `reshape` method to fix the shape of the regression coefficient $a$.

```python
fig_y = a.reshape(1) * fig_x + b
```
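The shape handling can be seen without any fitted model. Below, hypothetical stand-ins for `coef_` and `intercept_` (the values are made up) show that flattening the coefficient yields a plain one-dimensional line ready for plotting:

```python
import numpy as np

a = np.array([2.5])  # stand-in for clf.coef_ (one-element array)
b = 1.0              # stand-in for clf.intercept_
fig_x = np.linspace(0, 40, 40)

fig_y = a.reshape(1) * fig_x + b  # a[0] or a.item() would work as well
print(fig_y.shape)  # (40,)
```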
```python
# Scatter plot
plt.plot(x, y, "o")
# Regression line; the third argument "r" sets the line color to red
plt.plot(fig_x, fig_y, "r")
plt.show()
```
As shown above, using a machine learning library makes it possible to obtain analysis results without complicated calculations. However, and this is not limited to regression analysis, to interpret the results properly or to fine-tune a method in practice, it is still desirable to understand the underlying calculation mechanism (algorithm). So next, you will implement simple regression analysis entirely on your own, without using scikit-learn.