In this post, I implement machine learning prediction models with scikit-learn and summarize the points to keep in mind when using each method.
The overall flow of building a prediction model is summarized below. Each phase has its own important points, but the details will be organized in a separate post.
(1) Problem framing: clarify the business problem to be solved
(2) Data collection: organize the available data and assess whether the goal can be achieved
(3) Basic data aggregation: visualize the characteristics of the data and compute basic summary statistics
(4) Data preprocessing: clean the data by removing the noise hidden in it (outliers, missing values, and so on)
(5) Feature extraction: remove unnecessary features and keep only the explanatory variables that are needed
(6) Data normalization: scale the data so that the features are on comparable scales
(7) Method selection: choose a method appropriate for the data
(8) Model training: let the chosen method learn the patterns in the data
(9) Model validation / evaluation: check the prediction accuracy of the trained model and assess its validity
scikit-learn is a Python machine learning library.
This time, we will build prediction models using the Boston house price data published in the UCI Machine Learning Repository.
| Item | Overview |
|---|---|
| Data set | Boston house-price |
| Number of samples | 506 |
| Number of columns | 14 |
The python code is below.
#Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
#Data set reading
boston = load_boston()
#Creating a data frame
#Storage of explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)
#Add objective variable
df['MEDV'] = boston.target
#Check the contents of the data
df.head()
The explanation of each column name is omitted here.
- Explanatory variables: 13
- Objective variable: 1 (MEDV)
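If you do want to check what each column means, the dataset object bundled with scikit-learn carries its own text description, which can simply be printed (this step is optional and not part of the original flow):
#Show the dataset's built-in description of the columns (optional)
print(boston.DESCR)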
Since there are 13 explanatory variables, we will use a pair plot (scatterplot matrix) to look at the relationships between each explanatory variable and the objective variable efficiently. For visualization we use the seaborn library. First, create the pair plot.
#Import required libraries
import seaborn as sns
#Multivariate association diagram
sns.pairplot(df, size=1.0)
At first glance, RM (average number of rooms per dwelling) and MEDV (house price) appear to be positively correlated. Let's narrow it down to these two variables and look at them in a little more detail.
#Relationship between RM (average number of rooms per dwelling unit) and MEDV (house price)
sns.regplot(x='RM', y='MEDV', data=df)
Looking at the relationship in detail in this way, it seems that there is a correlation between RM (average number of rooms per dwelling unit) and MEDV (house price).
Next, I would like to find the correlation coefficient matrix.
#Calculate the correlation coefficient matrix
df.corr()
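#As an addition (not in the original), one common way to visualize this correlation matrix is a seaborn heatmap
import matplotlib.pyplot as plt
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm', square=True)
plt.show()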
In preprocessing, we need to clean up the noise hidden in the data (outliers, abnormal values, missing values). Preprocessing is important in data analysis, but this time we will only check for missing values.
#Confirmation of missing values
df.isnull().sum()
CRIM 0
ZN 0
INDUS 0
CHAS 0
NOX 0
RM 0
AGE 0
DIS 0
RAD 0
TAX 0
PTRATIO 0
B 0
LSTAT 0
MEDV 0
dtype: int64
Since there are no missing values in the price data of Boston houses, we will analyze it as it is.
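Had there been missing values, a minimal way to handle them would be to drop or impute the affected rows; the sketch below is purely illustrative, since this dataset has none:
#Only needed if missing values exist (not the case here)
#Option 1: drop rows containing missing values
df_dropped = df.dropna()
#Option 2: fill missing values with each column's median
df_filled = df.fillna(df.median())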
Feature engineering is skipped this time (in practice it should be done). Next, before building the linear regression model, the data is split into training data and evaluation data. After that, standardization is applied so that the explanatory variables are on the same scale.
#Library import
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#Create training data and evaluation data
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:13], df.iloc[:, 13],
test_size=0.2, random_state=1)
#Standardize data
sc = StandardScaler()
sc.fit(x_train) #Standardized with training data
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
Once this is done, we select a method and build the prediction models. This time, I decided to implement the following methods: linear regression, Ridge regression, Lasso regression, Elastic Net regression, Random Forest regression, GBDT, and SVR. First is linear regression.
The general prediction formula for linear regression is as follows.
\begin{eqnarray}
y = \sum_{i=1}^{n}(w_{i}x_{i})+b = w_{1}x_{1}+w_{2}x_{2}+\cdots+w_{n}x_{n}+b
\end{eqnarray}
$w_{i}$: weight (regression coefficient) for explanatory variable $x_{i}$, $b$: bias (intercept)
#Library import
from sklearn.linear_model import LinearRegression
#Library for score calculation
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
#Model learning
lr = LinearRegression()
lr.fit(x_train_std, y_train)
#Forecast
pred_lr = lr.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_lr = r2_score(y_test, pred_lr)
#Average absolute error(MAE)
mae_lr = mean_absolute_error(y_test, pred_lr)
print("R2 : %.3f" % r2_lr)
print("MAE : %.3f" % mae_lr)
#Regression coefficient
print("Coef = ", lr.coef_)
#Intercept
print("Intercept =", lr.intercept_)
The output result is as follows.
R2 : 0.779
MAE : 3.113
Coef = [-0.93451207 0.85487686 -0.10446819 0.81541757 -1.90731862 2.54650028
0.25941464 -2.92654009 2.80505451 -1.95699832 -2.15881929 1.09153332
-3.91941941]
Intercept = 22.44133663366339
Judging a model only by the numbers of the evaluation metrics is risky, so let's plot the predicted values against the measured values in a scatter plot.
#Library import
import matplotlib.pyplot as plt
%matplotlib inline
plt.xlabel("pred_lr")
plt.ylabel("y_test")
plt.scatter(pred_lr, y_test)
plt.show()
Looking at this result, the predictions do not seem unreasonable. In practice we would dig deeper here to improve accuracy, but this time let's move on and try other methods.
Ridge regression adds a regularization term to the loss function of linear regression. The loss function of linear regression is as follows.
\begin{eqnarray}
L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w})
\end{eqnarray}
$\boldsymbol{y}$: vector of measured values of the objective variable, $\boldsymbol{w}$: vector of regression coefficients, $X$: matrix of measured values with $n$ samples and $m$ explanatory variables
In Ridge regression, the loss function changes as follows.
\begin{eqnarray}
L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w}) + \lambda||\boldsymbol{w}||_{2}^{2}
\end{eqnarray}
Ridge regression regularizes the model by adding the squared L2 norm of the weights $\boldsymbol{w}$, as shown above.
The python code is below. It's easy with scikit-learn.
#Library import
from sklearn.linear_model import Ridge
#Model learning
ridge = Ridge(alpha=10)
ridge.fit(x_train_std, y_train)
#Forecast
pred_ridge = ridge.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_ridge = r2_score(y_test, pred_ridge)
#Average absolute error(MAE)
mae_ridge = mean_absolute_error(y_test, pred_ridge)
print("R2 : %.3f" % r2_ridge)
print("MAE : %.3f" % mae_ridge)
#Regression coefficient
print("Coef = ", ridge.coef_)
The regularization parameter was set somewhat arbitrarily here (scikit-learn's default is alpha = 1.0). The output is as follows.
R2 : 0.780
MAE : 3.093
Coef = [-0.86329633 0.7285083 -0.27135102 0.85108307 -1.63780795 2.6270911
0.18222203 -2.64613645 2.17038535 -1.42056563 -2.05032997 1.07266175
-3.76668388]
I would like to show the predicted value and the measured value in a scatter plot.
plt.xlabel("pred_ridge")
plt.ylabel("y_test")
plt.scatter(pred_ridge, y_test)
plt.show()
It's not much different from linear regression because we haven't tuned or selected variables.
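If you do want to tune the regularization strength instead of fixing alpha = 10 by hand, one option is scikit-learn's cross-validated Ridge; the sketch below is my addition and the candidate alphas are arbitrary:
#Sketch: choose alpha by cross-validation instead of fixing it by hand
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(x_train_std, y_train)
print("Best alpha :", ridge_cv.alpha_)
print("Test R2 : %.3f" % ridge_cv.score(x_test_std, y_test))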
The Lasso regression and the Ridge regression have different regularization terms. In Lasso regression, the loss function changes as follows.
\begin{eqnarray}
L = \frac{1}{2}(\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w}) + \lambda||\boldsymbol{w}||_{1}
\end{eqnarray}
Lasso regression differs from Ridge regression in that the regularization term is the L1 norm. I will omit the details this time.
The python code is below.
#Library import
from sklearn.linear_model import Lasso
#Model learning
lasso = Lasso(alpha=0.05)
lasso.fit(x_train_std, y_train)
#Forecast
pred_lasso = lasso.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_lasso = r2_score(y_test, pred_lasso)
#Average absolute error(MAE)
mae_lasso = mean_absolute_error(y_test, pred_lasso)
print("R2 : %.3f" % r2_lasso)
print("MAE : %.3f" % mae_lasso)
#Regression coefficient
print("Coef = ", lasso.coef_)
The regularization parameter was set somewhat arbitrarily here (scikit-learn's default is alpha = 1.0). The output is as follows.
R2 : 0.782
MAE : 3.071
Coef = [-0.80179157 0.66308749 -0.144492 0.81447322 -1.61462819 2.63721307
0.05772041 -2.64430158 2.11051544 -1.40028941 -2.06766744 1.04882786
-3.85778379]
I would like to show the predicted value and the measured value in a scatter plot.
plt.xlabel("pred_lasso")
plt.ylabel("y_test")
plt.scatter(pred_lasso, y_test)
plt.show()
The Lasso regression doesn't change much either.
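Before moving on, one practical property of the L1 penalty is worth seeing: as alpha grows, Lasso drives some coefficients exactly to zero, which acts as a rough form of variable selection. The sketch below is my addition and the alpha value is arbitrary:
#Sketch: with a larger alpha, Lasso sets some coefficients exactly to zero
lasso_strong = Lasso(alpha=0.5)
lasso_strong.fit(x_train_std, y_train)
print("Non-zero coefficients :", np.sum(lasso_strong.coef_ != 0), "/", len(lasso_strong.coef_))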
Elastic Net regression is a method that combines L1 regularization and L2 regularization.
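With the same notation as before, its loss function can be sketched as follows (this equation is my addition; scikit-learn's actual implementation expresses the two terms through a single alpha and an l1_ratio mixing parameter):
\begin{eqnarray}
L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y}-X\boldsymbol{w}) + \lambda_{1}||\boldsymbol{w}||_{1} + \lambda_{2}||\boldsymbol{w}||_{2}^{2}
\end{eqnarray}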
The python code is below.
#Library import
from sklearn.linear_model import ElasticNet
#Model learning
elasticnet = ElasticNet(alpha=0.05)
elasticnet.fit(x_train_std, y_train)
#Forecast
pred_elasticnet = elasticnet.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_elasticnet = r2_score(y_test, pred_elasticnet)
#Average absolute error(MAE)
mae_elasticnet = mean_absolute_error(y_test, pred_elasticnet)
print("R2 : %.3f" % r2_elasticnet)
print("MAE : %.3f" % mae_elasticnet)
#Regression coefficient
print("Coef = ", elasticnet.coef_)
The regularization parameter was set somewhat arbitrarily here (scikit-learn's default is alpha = 1.0). The output is as follows.
R2 : 0.781
MAE : 3.080
Coef = [-0.80547228 0.64625644 -0.27082019 0.84654972 -1.51126947 2.66279832
0.09096052 -2.51833347 1.89798734 -1.21656705 -2.01097151 1.05199894
-3.73854124]
I would like to show the predicted value and the measured value in a scatter plot.
plt.xlabel("pred_elasticnet")
plt.ylabel("y_test")
plt.scatter(pred_elasticnet, y_test)
plt.show()
Elastic Net regression hasn't changed much either.
Next, we will build a prediction model for the decision tree system. First is the Random Forest regression.
Random Forest is an ensemble method based on bagging: it builds many different decision trees and aggregates their outputs. A single decision tree tends to overfit, and Random Forest is one way to mitigate that problem.
The python code is below.
#Library import
from sklearn.ensemble import RandomForestRegressor
#Model learning
RF = RandomForestRegressor()
RF.fit(x_train_std, y_train)
#Forecast
pred_RF = RF.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_RF = r2_score(y_test, pred_RF)
#Average absolute error(MAE)
mae_RF = mean_absolute_error(y_test, pred_RF)
print("R2 : %.3f" % r2_RF)
print("MAE : %.3f" % mae_RF)
#Variable importance
print("feature_importances = ", RF.feature_importances_)
The parameters are left at their defaults. The output result is as follows.
R2 : 0.899
MAE : 2.122
feature_importances = [0.04563176 0.00106449 0.00575792 0.00071877 0.01683655 0.31050293
0.01897821 0.07745557 0.00452725 0.01415068 0.0167309 0.01329619
0.47434878]
I would like to show the predicted value and the measured value in a scatter plot.
plt.xlabel("pred_RF")
plt.ylabel("y_test")
plt.scatter(pred_RF, y_test)
plt.show()
The results look better than the linear family of models (linear regression, Ridge regression, Lasso regression, Elastic Net regression). It is worth knowing that Random Forest can be used for regression as well. Also, since Random Forest has no regression coefficients, we assess the validity of the model by looking at the feature importances instead.
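To read the importances more easily than from the raw array, a short sketch (my addition) plots them as a bar chart:
#Sketch: visualize the Random Forest feature importances
importances = pd.Series(RF.feature_importances_, index=boston.feature_names)
importances.sort_values().plot.barh()
plt.xlabel("feature importance")
plt.show()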
Next is GBDT (gradient boosting tree).
GBDT is also an ensemble method: it builds decision trees sequentially, with each new tree correcting the errors of the trees built so far, with the aim of improving generalization performance.
The python code is below.
#Library import
from sklearn.ensemble import GradientBoostingRegressor
#Model learning
GBDT = GradientBoostingRegressor()
GBDT.fit(x_train_std, y_train)
#Forecast
pred_GBDT = GBDT.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_GBDT = r2_score(y_test, pred_GBDT)
#Average absolute error(MAE)
mae_GBDT = mean_absolute_error(y_test, pred_GBDT)
print("R2 : %.3f" % r2_GBDT)
print("MAE : %.3f" % mae_GBDT)
#Variable importance
print("feature_importances = ", GBDT.feature_importances_)
The parameters are left at their defaults. The output result is as follows.
R2 : 0.905
MAE : 2.097
feature_importances = [0.03411472 0.00042674 0.00241657 0.00070636 0.03040394 0.34353116
0.00627447 0.10042527 0.0014266 0.0165308 0.03114765 0.01129208
0.42130366]
I would like to show the predicted value and the measured value in a scatter plot.
plt.xlabel("pred_GBDT")
plt.ylabel("y_test")
plt.scatter(pred_GBDT, y_test)
plt.show()
It's the most accurate method so far. However, please be aware that GBDT is easy to overfit if you do not set the parameters properly.
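The main knobs for controlling that are the number of trees, the learning rate, and the tree depth; the sketch below is my addition and the values are illustrative rather than tuned:
#Sketch: key parameters for controlling GBDT overfitting (values are illustrative, not tuned)
GBDT_tuned = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
GBDT_tuned.fit(x_train_std, y_train)
print("R2 : %.3f" % r2_score(y_test, GBDT_tuned.predict(x_test_std)))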
The last method is SVR (Support Vector Regression). The Support Vector Machine (SVM) was originally developed to solve binary classification problems, so some people may assume it can only be used for classification. In fact, SVR extends SVM to continuous objective variables so that it can handle regression problems. SVR is characterized by being able to solve nonlinear regression problems with relatively high accuracy.
The python code is below.
#Library import
from sklearn.svm import SVR
#Model learning
svr = SVR(kernel='linear', C=1, epsilon=0.1, gamma='auto')  #use a lowercase name so the SVR class itself is not shadowed
svr.fit(x_train_std, y_train)
#Forecast
pred_SVR = svr.predict(x_test_std)
#Evaluation
#Coefficient of determination(R2)
r2_SVR = r2_score(y_test, pred_SVR)
#Average absolute error(MAE)
mae_SVR = mean_absolute_error(y_test, pred_SVR)
print("R2 : %.3f" % r2_SVR)
print("MAE : %.3f" % mae_SVR)
#Regression coefficient
print("Coef = ", svr.coef_)
A linear kernel was used this time. SVR also has four other kernel functions, so kernel selection and parameter tuning are required as well.
The output result is as follows.
R2 : 0.780
MAE : 2.904
Coef = [[-1.18218512 0.62268229 0.09081358 0.4148341 -1.04510071 3.50961979
-0.40316769 -1.78305137 1.58605612 -1.78749695 -1.54742196 1.01255493
-2.35263548]]
I would like to show the predicted value and the measured value in a scatter plot.
plt.xlabel("pred_SVR")
plt.ylabel("y_test")
plt.scatter(pred_SVR, y_test)
plt.show()
SVR is not so accurate compared to Random Forest and GBDT.
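To be fair to SVR, it has not been tuned here. A coarse grid search over the kernel and regularization parameters could look like the sketch below (my addition; the grid values are arbitrary):
#Sketch: coarse grid search over SVR parameters (grid values are arbitrary)
from sklearn.model_selection import GridSearchCV
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "epsilon": [0.1, 1.0]}
grid = GridSearchCV(SVR(), param_grid, cv=5, scoring="r2")
grid.fit(x_train_std, y_train)
print("Best params :", grid.best_params_)
print("Test R2 : %.3f" % grid.best_estimator_.score(x_test_std, y_test))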
Finally, let's organize the scripts so far into a single script.
#Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
def preprocess_sc(df):
"""Divide the data into training data and evaluation data and standardize
Parameters
----------
df : pd.DataFrame
Data set (explanatory variable + objective variable)
Returns
-------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
y_test : pd.DataFrame
Evaluation data (objective variable)
"""
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 0:13], df.iloc[:, 13],
test_size=0.2, random_state=1)
#Standardize data
sc = StandardScaler()
sc.fit(x_train) #Standardized with training data
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
return x_train_std, x_test_std, y_train, y_test
def Linear_Regression(x_train_std, y_train, x_test_std):
"""Predict by linear regression
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
Returns
-------
pred_lr : pd.DataFrame
Prediction results of linear regression
"""
lr = LinearRegression()
lr.fit(x_train_std, y_train)
pred_lr = lr.predict(x_test_std)
return pred_lr
def Ridge_Regression(x_train_std, y_train, x_test_std, ALPHA=10.0):
"""Predict with Ridge regression
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
ALPHA : float
Regularization parameter α
Returns
-------
pred_ridge : pd.DataFrame
Ridge regression prediction results
"""
ridge = Ridge(alpha=ALPHA)
ridge.fit(x_train_std, y_train)
pred_ridge = ridge.predict(x_test_std)
return pred_ridge
def Lasso_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05):
"""Predict by Lasso regression
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
ALPHA : float
Regularization parameter α
Returns
-------
pred_lasso : pd.DataFrame
Lasso regression prediction results
"""
lasso = Lasso(alpha=ALPHA)
lasso.fit(x_train_std, y_train)
pred_lasso = lasso.predict(x_test_std)
return pred_lasso
def ElasticNet_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05):
"""Predict with Elastic Net regression
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
ALPHA : float
Regularization parameter α
Returns
-------
pred_elasticnet : pd.DataFrame
Elastic Net regression prediction results
"""
elasticnet = ElasticNet(alpha=ALPHA)
elasticnet.fit(x_train_std, y_train)
pred_elasticnet = elasticnet.predict(x_test_std)
return pred_elasticnet
def RandomForest_Regressor(x_train_std, y_train, x_test_std):
"""Predict with Random Forest regression
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
Returns
-------
pred_RF : pd.DataFrame
Predicted results of Random Forest regression
"""
RF = RandomForestRegressor()
RF.fit(x_train_std, y_train)
pred_RF = RF.predict(x_test_std)
return pred_RF
def GradientBoosting_Regressor(x_train_std, y_train, x_test_std):
"""Predict with GBDT
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
Returns
-------
pred_GBDT : pd.DataFrame
GBDT prediction results
"""
GBDT = GradientBoostingRegressor()
GBDT.fit(x_train_std, y_train)
pred_GBDT = GBDT.predict(x_test_std)
return pred_GBDT
def SVR_Regression(x_train_std, y_train, x_test_std):
"""Predict with SVR
Parameters
----------
x_train_std : pd.DataFrame
Training data after standardization (explanatory variable)
y_train : pd.DataFrame
Training data (objective variable)
x_test_std : pd.DataFrame
Evaluation data after standardization (explanatory variable)
Returns
-------
pred_SVR : pd.DataFrame
        SVR prediction results
"""
    svr = SVR(kernel='linear', C=1, epsilon=0.1, gamma='auto')
svr.fit(x_train_std, y_train)
pred_SVR = svr.predict(x_test_std)
return pred_SVR
def main():
#Data set reading
boston = load_boston()
#Creating a data frame
#Storage of explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)
#Add objective variable
df['MEDV'] = boston.target
#Data preprocessing
x_train_std, x_test_std, y_train, y_test = preprocess_sc(df)
pred_lr = pd.DataFrame(Linear_Regression(x_train_std, y_train, x_test_std))
pred_ridge = pd.DataFrame(Ridge_Regression(x_train_std, y_train, x_test_std, ALPHA=10.0))
pred_lasso = pd.DataFrame(Lasso_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05))
pred_elasticnet = pd.DataFrame(ElasticNet_Regression(x_train_std, y_train, x_test_std, ALPHA=0.05))
pred_RF = pd.DataFrame(RandomForest_Regressor(x_train_std, y_train, x_test_std))
pred_GBDT = pd.DataFrame(GradientBoosting_Regressor(x_train_std, y_train, x_test_std))
pred_SVR = pd.DataFrame(SVR_Regression(x_train_std, y_train, x_test_std))
    pred_all = pd.concat([pred_lr, pred_ridge, pred_lasso, pred_elasticnet, pred_RF, pred_GBDT, pred_SVR], axis=1, sort=False)
    pred_all.columns = ["pred_lr", "pred_ridge", "pred_lasso", "pred_elasticnet", "pred_RF", "pred_GBDT", "pred_SVR"]
return pred_all
if __name__ == "__main__":
pred_all = main()
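#As a possible follow-up (my addition), the predictions collected in pred_all can be scored in a loop.
#This assumes y_test is also kept available, for example by returning it from main() as well.
from sklearn.metrics import r2_score, mean_absolute_error
for col in pred_all.columns:
    print("%s R2 : %.3f MAE : %.3f" % (col, r2_score(y_test, pred_all[col]), mean_absolute_error(y_test, pred_all[col])))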
Thank you for reading to the end. I hope this gave you a sense of the variety of methods available for building a prediction model, and that all of them are easy to implement with scikit-learn.
In practice, the work continues from here: evaluating each model, tuning parameters, doing feature engineering, and so on, in order to improve accuracy.
If you spot anything that needs correcting, I would appreciate it if you let me know.