Here you will learn **multiple regression analysis**, a form of linear regression that deals with three or more variables (one objective variable and two or more explanatory variables). Let's analyze house prices in Boston, a large city in the northeastern US state of Massachusetts, using various explanatory variables.
#Libraries required for numerical calculation
import numpy as np
import pandas as pd
#Package for drawing graphs
import matplotlib.pyplot as plt
#The linear_model module of the machine-learning library scikit-learn
from sklearn import linear_model
Read the "Boston house prices dataset", one of the datasets that come with scikit-learn, and store it in the variable boston.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston)
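Note: load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2. If the import above fails on your version, the data can be fetched from its original source; here is a minimal sketch following the recipe given in scikit-learn's deprecation notice:
import numpy as np
import pandas as pd
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
#Each record is spread over two physical rows in the source file
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]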
First, display the dataset's description (DESCR) to get an overview of the data.
print(boston.DESCR)
Check the description of the Boston dataset. The dataset has 13 items as explanatory variables and 1 item as the objective variable.
Variable name | Original word | Definition |
---|---|---|
CRIM | crime rate | Per-capita crime rate by town |
ZN | zone | Percentage of residential land zoned for lots over 25,000 square feet |
INDUS | industry | Percentage of non-retail business (manufacturing) land per town |
CHAS | Charles River | Charles River dummy variable (1 if the area borders the river, 0 otherwise) |
NOX | nitric oxides | Nitric oxide concentration (parts per 10 million) |
RM | rooms | Average number of rooms per dwelling |
AGE | ages | Percentage of owner-occupied homes built before 1940 |
DIS | distances | Weighted distances to five Boston employment centers |
RAD | radial highways | Index of accessibility to radial highways |
TAX | tax rate | Property tax rate per $10,000 |
PTRATIO | pupil-teacher ratio | Pupil-teacher ratio by town |
B | blacks | Proportion of Black residents by town |
LSTAT | lower status | Percentage of lower-status population |
MEDV | median value | Median value of owner-occupied homes in units of $1,000 |
boston_df = pd.DataFrame(boston.data)
print(boston_df)
The output reads [506 rows x 13 columns]: the data has 506 rows and 13 columns, that is, 506 samples of 13 variables.
#Specify the column names
boston_df.columns = boston.feature_names
print(boston_df)
This specifies boston's feature_names as boston_df's columns, so each column is labeled with its variable name.
#Add objective variable
boston_df['PRICE'] = pd.DataFrame(boston.target)
print(boston_df)
This converts boston's target to a pandas data frame and stores it in boston_df under the column name PRICE. The objective variable "PRICE" is added as the rightmost column.
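As a quick check, the frame should now have 14 columns, the 13 explanatory variables plus PRICE; a one-line sketch:
print(boston_df.shape)  #-> (506, 14)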
From here, we will perform the multiple regression analysis using scikit-learn. The procedure is as follows (a minimal worked sketch appears after the list).
① Create the model (instance)
Model variable name = LinearRegression()
② Build the model based on the explanatory variable X and the objective variable Y
Model variable name.fit(X, Y)
③ Calculate the regression coefficients using the created model
Model variable name.coef_
④ Calculate the intercept using the created model
Model variable name.intercept_
⑤ Calculate the coefficient of determination to obtain the accuracy of the model
Model variable name.score(X, Y)
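Before applying this to the Boston data, here is a minimal end-to-end sketch of the five steps (X_demo and Y_demo are made-up values for illustration):
from sklearn.linear_model import LinearRegression
import numpy as np

#Two explanatory variables, five samples (invented data)
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
#Objective variable generated exactly as 1 + 2*x1 + 3*x2
Y_demo = 1 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

demo_model = LinearRegression()          #① create the instance
demo_model.fit(X_demo, Y_demo)           #② fit with X and Y
print(demo_model.coef_)                  #③ coefficients -> about [2. 3.]
print(demo_model.intercept_)             #④ intercept -> about 1.0
print(demo_model.score(X_demo, Y_demo))  #⑤ coefficient of determination -> 1.0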
#Drop the objective variable and store the remaining explanatory variables in variable X
X = boston_df.drop("PRICE", axis=1)
#Extract only the objective variable and store it in variable Y
Y = boston_df["PRICE"]
print(X)
print(Y)
The drop function deletes a column from the data frame when given the arguments ("column name", axis=1); what remains is stored in the variable X.
A single column of the data frame is selected with data frame name["column name"]; here the objective variable is extracted and stored in the variable Y.
#① Create the model instance
model = linear_model.LinearRegression()
#② Build the model from X and Y
model.fit(X, Y)
#③ Regression coefficients
model.coef_
coef_ means coefficient.
Since the raw coefficient array is a little difficult to read, convert it to a data frame to make it easier to see.
#Store the coefficient value in the variable coefficient
coefficient = model.coef_
#Convert to data frame and specify column name and index name
df_coefficient = pd.DataFrame(coefficient,
columns=["coefficient"],
index=["Crime rate", "Residential land rate", "Manufacturing ratio", "Charles river", "Nitric oxide concentration",
"Average number of rooms", "Home ownership rate", "Employment center", "Highway", "Property tax rate",
"Student / teacher ratio", "Black ratio", "Underclass ratio"])
df_coefficient
#④ Intercept of the fitted model
model.intercept_
#⑤ Coefficient of determination, computed on the same data used for fitting
model.score(X, Y)
There is a concept called cross-validation: a method of verifying the accuracy of the model, that is, the validity of the analysis itself.
In the most common form, the sample is randomly divided into two groups; one group is used to build the model, and the remaining group is used to test it.
The former is called training data and the latter is called test data.
scikit-learn provides a method for splitting data into training and test sets.
sklearn.model_selection.train_test_split
There is no fixed rule for the ratio of training data to test data, but if you do not specify the test_size argument of train_test_split, one quarter, that is, 25% of all samples, is set aside as test data.
#Import the train_test_split method from sklearn
from sklearn.model_selection import train_test_split
#Split the variables X and Y into training and test portions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
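Because train_test_split shuffles the samples randomly, every run produces a different split. If you need a reproducible result, the optional test_size and random_state arguments fix the ratio and the random seed; for example:
#Hold out 25% of the samples, with a fixed seed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)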
Let's check the contents of the training and test portions of the variable X.
print("Training data for variable x:", X_train, sep="\n")
print("Test data for variable x:", X_test, sep="\n")
The variable name multi_lreg is an abbreviation for multiple linear regression.
multi_lreg = linear_model.LinearRegression()
#Fit the model using only the training data
multi_lreg.fit(X_train, Y_train)
#Evaluate the coefficient of determination on the held-out test data
multi_lreg.score(X_test, Y_test)
The training/test pair can be thought of as "known data" and "unknown data": if you build a model from the data you already have, how well will it work when applied to newly obtained data? The data we analyze is always only a part of the whole, gathered in the past or up to the present. But we rarely analyze data just to confirm the current situation; the aim is usually to "read ahead" and predict the future. In that sense, a low coefficient of determination casts doubt on the accuracy of the model itself, but that does not mean the higher the better. On the contrary, a value that is too high is itself a problem. A high coefficient of determination, that is, a small residual (error), means the model fits the data it was trained on; if it fits too closely, the coefficient of determination may drop when the model is applied to newly obtained data. This is the so-called "overfitting" problem.
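A simple way to check for this, reusing the variables defined above, is to compare the coefficient of determination on the training data with that on the test data; a training score far above the test score is a symptom of overfitting:
#Score on known (training) data
print("train:", multi_lreg.score(X_train, Y_train))
#Score on unknown (test) data
print("test:", multi_lreg.score(X_test, Y_test))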
Next, in order to understand how the calculation behind multiple regression analysis works, we will perform the same analysis without using the convenient scikit-learn.
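As a preview, ordinary least squares can be computed directly with NumPy via the normal equation β = (XᵀX)⁻¹Xᵀy. The sketch below assumes the by-hand calculation follows least squares; in practice np.linalg.lstsq is the numerically safer choice.
import numpy as np
#Design matrix with a leading column of 1s for the intercept
X_mat = np.column_stack([np.ones(len(X)), X.values])
#Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X_mat.T @ X_mat) @ X_mat.T @ Y.values
print(beta[0])   #intercept, should match model.intercept_
print(beta[1:])  #coefficients, should match model.coef_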