When predicting with a model built with LightGBM, I had simply been using the `predict` function, like this:

```python
pred = model.predict(data)
```
Then, when I happened to look at the [Official Document](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.predict), I noticed that `predict` takes a `pred_contrib` parameter, and that with it you can get each feature's contribution to the prediction using SHAP, so I tried it out.
# Environment
The environment is as follows.
```bash
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G14042
```

Since I was working in JupyterLab (Version 0.35.4), I will also list the version of the Python kernel.

```
Python 3.7.3 (default, Mar 27 2019, 16:54:48)
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.
```
Prepare the data and the model for prediction. The data used is the Boston housing dataset provided by scikit-learn.
```python
import pandas as pd
import sklearn.datasets as skd

data = skd.load_boston()
df_X = pd.DataFrame(data.data, columns=data.feature_names)
df_y = pd.DataFrame(data.target, columns=['y'])
```
As shown below, the data has 506 rows and 13 columns, all of which are non-null float columns, so we will build the model on it as-is.
```python
df_X.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
```
We build the LightGBM model with mostly default hyperparameters. Note that SHAP, which we use later, reads the `objective` parameter stored in the model. The default value itself would be fine, but `objective` is stated explicitly in `params` (otherwise `explainer.shap_values` raises an error later).
```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, test_size=0.2, random_state=4)

lgb_train = lgb.Dataset(df_X_train, df_y_train)
lgb_eval = lgb.Dataset(df_X_test, df_y_test)

params = {
    'seed': 4,
    'objective': 'regression',
    'metric': 'rmse'}

lgbm = lgb.train(params,
                 lgb_train,
                 valid_sets=lgb_eval,
                 num_boost_round=200,
                 early_stopping_rounds=20,
                 verbose_eval=50)
```

```
Training until validation scores don't improve for 20 rounds
[50]	valid_0's rmse: 3.58803
[100]	valid_0's rmse: 3.39545
[150]	valid_0's rmse: 3.31867
[200]	valid_0's rmse: 3.28222
Did not meet early stopping. Best iteration is:
[192]	valid_0's rmse: 3.27283
```
Prepare the data to predict on, which we will use below.
```python
# Data to predict
data_for_pred = pd.DataFrame([df_X_test.iloc[0, :]])
print(data_for_pred)
```

```
      CRIM    ZN  INDUS  CHAS    NOX     RM    AGE     DIS  RAD    TAX  \
8  0.21124  12.5   7.87   0.0  0.524  5.631  100.0  6.0821  5.0  311.0

   PTRATIO       B  LSTAT
8     15.2  386.63  29.93
```
# predict

You can predict by passing the data to `predict`.
```python
# Ordinary prediction
print(lgbm.predict(data_for_pred))
```

```
[16.12018486]
```
Now run `predict` with `pred_contrib=True` as an argument.
```python
# pred_contrib=True
print(lgbm.predict(data=data_for_pred, pred_contrib=True))
```

```
[[ 8.11013815e-01  1.62335755e-03 -6.90242856e-02  9.22244470e-03
   4.92616768e-01 -3.16444968e+00 -1.22276730e+00 -1.11934703e-01
   2.56615903e-02 -1.99428008e-01  1.25166390e+00  3.43507676e-02
  -4.03663118e+00  2.22982674e+01]]
```
So with this data, we get a two-dimensional array with 1 row and 14 columns. The official documentation says:

> Note that unlike the shap package, with pred_contrib we return a matrix with an extra column, where the last column is the expected value.

So the contribution of each feature is in columns 1 through 13, and the expected value is in column 14. Let's check this with SHAP.
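Incidentally, since SHAP decomposes a prediction as the expected value plus the feature contributions, these 14 values should add up exactly to the ordinary prediction. A quick sanity check first (a minimal sketch using numpy):

```python
import numpy as np

contrib = lgbm.predict(data=data_for_pred, pred_contrib=True)
pred = lgbm.predict(data_for_pred)

# 13 feature contributions + expected value (last column) = prediction
print(np.isclose(contrib.sum(axis=1), pred))  # -> [ True]
```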
If you pass the trained model to `TreeExplainer` and output `shap_values` and `expected_value`, they indeed match the values output by `predict`. In other words, the same information that goes into the figure produced by `force_plot` can be obtained from `predict` alone.
```python
import shap

explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(data_for_pred)

print('shap_values: ', shap_values)
print('expected_value: ', explainer.expected_value)
shap.force_plot(base_value=explainer.expected_value, shap_values=shap_values[0,:], features=data_for_pred.iloc[0,:], matplotlib=True)
```

```
shap_values:  [[ 8.11013815e-01  1.62335755e-03 -6.90242856e-02  9.22244470e-03
   4.92616768e-01 -3.16444968e+00 -1.22276730e+00 -1.11934703e-01
   2.56615903e-02 -1.99428008e-01  1.25166390e+00  3.43507676e-02
  -4.03663118e+00]]
expected_value:  22.29826737657883
```
By the way, what is this `expected_value`? Looking at SHAP's [Document](https://github.com/slundberg/shap), `expected_value` is

> the average model output over the training dataset we passed

Hmm, so maybe it's the mean of the target variable of the training data? When I computed the mean, it indeed matched.
```python
print(df_y_train.mean())
```

```
y    22.298267
dtype: float64
```
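Strictly speaking, the SHAP documentation says it is the average *model output* over the training data, not the mean of the targets; for a regression model like this the two essentially coincide. A quick check of the documentation's literal wording (a sketch):

```python
# Mean model output over the training data; for this regression model it
# should essentially match both df_y_train.mean() and explainer.expected_value.
print(lgbm.predict(df_X_train).mean())
```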
So we found that LightGBM's `predict` can give us `shap_values` and `expected_value`. Note, however, that the predicted value itself is not included in the output when `pred_contrib=True` is set.
So I suppose it should be used something like this:

```python
pred = model.predict(data)
contrib = model.predict(data, pred_contrib=True)

# When pred is larger than the mean of the training targets: report what pushed the prediction up
# When pred is smaller than the mean of the training targets: report what pushed the prediction down
```
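As one concrete example, here is a minimal sketch of that idea (the `explain_prediction` helper below is my own invention, not a LightGBM API): it ranks the features by their contribution and reports which ones pushed the prediction up or down.

```python
import numpy as np

def explain_prediction(model, data, feature_names):
    # Hypothetical helper: break a single prediction down into its SHAP contributions.
    row = model.predict(data, pred_contrib=True)[0]
    expected, shap_vals = row[-1], row[:-1]
    print(f'expected value: {expected:.3f}, prediction: {row.sum():.3f}')
    # Largest absolute contributions first
    for i in np.argsort(-np.abs(shap_vals)):
        direction = 'up' if shap_vals[i] > 0 else 'down'
        print(f'{feature_names[i]}: pushed {direction} by {shap_vals[i]:+.3f}')

explain_prediction(lgbm, data_for_pred, df_X.columns)
```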
(I didn't know that Jupyter can convert notebook files to Markdown. I'd like to make use of that from now on.)