I tried Yellowbrick briefly before [^1], but at the time I had only run its sample code, so I looked into what Yellowbrick can actually do. This time I will use Yellowbrick to build a LightGBM model, which is often used on Kaggle, and go as far as saving the model. However, preprocessing such as feature engineering, and detailed evaluation of model accuracy, are things Yellowbrick cannot always handle, so they are not covered here.
The execution environment is as follows.
$sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G8037
$python3 --version
Python 3.7.4
The installation of Yellowbrick is described in [^1], so it is omitted here. For the installation of LightGBM, see [^2].
Import the libraries used this time.
import pandas as pd
import numpy as np
import yellowbrick
from yellowbrick.datasets import load_bikeshare
from yellowbrick.model_selection import LearningCurve, ValidationCurve, FeatureImportances
from yellowbrick.regressor import ResidualsPlot
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from joblib import dump, load
For the data, use load_bikeshare, which comes bundled with Yellowbrick.
# Load data
X, y = load_bikeshare()
print(X.head())
There are 12 explanatory variables, all of which are numerical. The target variable is the number of shared bikes rented. This time I will feed this data into LightGBM as-is and build a model.
season year month hour holiday weekday workingday weather temp \
0 1 0 1 0 0 6 0 1 0.24
1 1 0 1 1 0 6 0 1 0.22
2 1 0 1 2 0 6 0 1 0.22
3 1 0 1 3 0 6 0 1 0.24
4 1 0 1 4 0 6 0 1 0.24
feelslike humidity windspeed
0 0.2879 0.81 0.0
1 0.2727 0.80 0.0
2 0.2727 0.80 0.0
3 0.2879 0.75 0.0
4 0.2879 0.75 0.0
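As a supplementary check (a small sketch, not part of the original flow), the dimensions of the data and a few target values can be confirmed:
# Quick sanity check of the feature matrix and target.
# (The exact row count depends on the dataset version shipped with Yellowbrick.)
print(X.shape)  # (n_samples, 12)
print(y[:5])    # number of bikes rented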
Split the data into training and validation sets before training. The split ratio is set to 8:2, chosen somewhat arbitrarily.
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
As mentioned above, the model is LightGBM. However, since Yellowbrick is essentially an extended version of scikit-learn, LightGBM is used through its scikit-learn API [^3].
# Model
model = lgb.LGBMRegressor()
Now, let's use Yellowbrick's ValidationCurve to determine the hyperparameters. This time I will examine how the values of max_depth, n_estimators, and num_leaves relate to accuracy. See [^5] for the API specification of ValidationCurve.
Pass the model, the name of the parameter to examine, and the parameter range as arguments to ValidationCurve, as shown below. cv accepts either the number of cross-validation splits or a CV generator; here the number of splits is set to 5. The last argument, scoring, specifies the metric used to evaluate accuracy; of the metrics defined by scikit-learn [^4], neg_mean_squared_error is used.
visualizer = ValidationCurve(
    model, param_name="max_depth",
    param_range=np.arange(1, 11), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()
The output is shown in the figure below; the vertical axis is neg_mean_squared_error. As the name suggests, this metric is the mean squared error multiplied by (-1), so higher points in the figure (closer to 0) mean higher accuracy. Looking at the Cross Validation Score, the accuracy hardly changes once max_depth is 6 or more, so set max_depth to 6.
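To make the sign convention concrete, here is a minimal sketch (not in the original article) that computes the same metric directly with scikit-learn's cross_val_score:
# neg_mean_squared_error is the MSE multiplied by -1, so every fold's score is
# negative and values closer to 0 are better.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train,
                         cv=5, scoring='neg_mean_squared_error')
print(scores)          # five negative values, one per fold
print(-scores.mean())  # flip the sign to recover an ordinary, positive MSE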
Next, let's examine n_estimators in the same way. The program is as follows.
visualizer = ValidationCurve(
    model, param_name="n_estimators",
    param_range=np.arange(100, 1100, 100), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()
The output is shown in the figure below. Looking at the Cross Validation Score, the accuracy is almost unchanged once n_estimators is 600 or more, so set n_estimators to 600.
Finally, check num_leaves in the same way. The program is as follows.
visualizer = ValidationCurve(
    model, param_name="num_leaves",
    param_range=np.arange(2, 54, 4), cv=5, scoring='neg_mean_squared_error'
)
visualizer.fit(X_train, y_train)
visualizer.show()
The output is shown in the figure below. Looking at the Cross Validation Score, the accuracy hardly changes once num_leaves is 20 or more, so set it to 20.
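Incidentally, the three scans above differ only in the parameter name and range, so they can also be written as a single loop. A compact sketch under the same settings as before:
# Run the same ValidationCurve scan for each hyperparameter in turn.
param_ranges = {
    "max_depth": np.arange(1, 11),
    "n_estimators": np.arange(100, 1100, 100),
    "num_leaves": np.arange(2, 54, 4),
}
for name, rng in param_ranges.items():
    viz = ValidationCurve(
        lgb.LGBMRegressor(), param_name=name, param_range=rng,
        cv=5, scoring='neg_mean_squared_error'
    )
    viz.fit(X_train, y_train)
    viz.show()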
As shown above, the parameters could be tuned easily with ValidationCurve. Now define the model again.
# Model
model = lgb.LGBMRegressor(
    boosting_type='gbdt',
    num_leaves=20,
    max_depth=6,
    n_estimators=600,
    random_state=1234,
    importance_type='gain')
To check whether the model is underfitting or overfitting, let's look at the model's accuracy while varying the amount of training data. This is easy to visualize with LearningCurve.
visualizer = LearningCurve(model, cv=5, scoring='neg_mean_squared_error')
visualizer.fit(X_train, y_train)
visualizer.show()
The result is shown in the figure below. The Cross Validation Score improves as the amount of data increases, but it seems that even if more data could be collected, the accuracy would not improve dramatically.
It is also easy to display the explanatory variables in order of their importance for predicting the number of shared bikes rented.
visualizer = FeatureImportances(model)
visualizer.fit(X_train, y_train)
visualizer.show()
The result is shown in the figure below. The most influential variable was hour, i.e., the time of day. The other variables are displayed relative to it, with its importance set to 100.
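Incidentally, if you prefer raw importance values over percentages, Yellowbrick's FeatureImportances also takes a relative option; a sketch:
# With relative=False, the bars show the raw importances (here 'gain',
# following importance_type) instead of percentages of the top feature.
visualizer = FeatureImportances(model, relative=False)
visualizer.fit(X_train, y_train)
visualizer.show()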
As in the previous article [^1], check the accuracy with the residual distribution.
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.show()
The output is shown in the figure below. The scatter plot shows some points where the prediction is off, but the $R^2$ value is 0.9 or more, and the histogram of the residuals peaks around 0, so the accuracy seems good.
It seems that the score displayed in the figure cannot be changed to a metric other than $R^2$, so I computed the RMSE separately; it came to about 38. Now that we have a reasonable model, this concludes the model construction.
model = model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
# The rmse of prediction is: 38.82245025441572
Save the built model. The scikit-learn documentation describes how to save a model with joblib [^6], so save it the same way.
dump(model, 'lightgbm.joblib')
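To confirm that the saved file can actually be reused, here is a minimal sketch that reloads the model with joblib (load was imported above) and predicts:
# Reload the saved model and check that it still predicts.
loaded_model = load('lightgbm.joblib')
print(loaded_model.predict(X_test[:5]))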
The LightGBM sample code [^7] was also consulted for the programs above. My impression is that Yellowbrick is convenient because you can easily check model accuracy and produce plots useful for validation. On the other hand, it feels a little odd to have to "fit" the Visualizer every single time...