When analyzing a dataset with many features, tree-based ensemble models such as Random Forest can compute the importance of each feature. So far I have used this functionality as a black box, but as the number of tools I use has grown, I would like to sort out how each of them exposes it.
First, here is the idea behind the importance calculation, quoted from "First Pattern Recognition" (Chapter 11).
The Out-Of-Bag (OOB) error rate is calculated as follows. When building each tree, a random forest selects its training data by bootstrap sampling, so roughly one third of the samples are not used for training that tree. For a given training sample, the trees that did not use it can be collected into a partial forest, and that sample can then serve as test data for estimating the error.
When training a decision-tree ensemble in this way, it seems that the importance of the features, i.e. how much they contribute to improved accuracy, can be measured by monitoring the OOB error rate.
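As a side note, scikit-learn can report this OOB-based accuracy estimate directly via the oob_score option. Below is a minimal sketch; note that the oob_score_ attribute is the OOB accuracy of the forest as a whole, not a per-feature importance.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# oob_score=True evaluates each sample only with the trees that did not
# see it during bootstrap sampling
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(iris.data, iris.target)
print('OOB score = {:.4f}'.format(clf.oob_score_))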
Below, we will look at the following tools (Python packages).
- Scikit-learn's RandomForest (RandomForestClassifier / RandomForestRegressor)
- XGBoost
- LightGBM
XGBoost and LightGBM each provide both a native API and a Scikit-learn-compatible API; I would like to use the Scikit-learn API as much as possible to keep the learning effort small.
(The tools and libraries used are as follows: Python 3.5.2, Scikit-learn 0.18.1, XGBoost 0.6a2, LightGBM 0.1.)
First, the basic form, using Scikit-learn's RandomForestClassifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
accu = accuracy_score(y_test, y_pred)
print('accuracy = {:>.4f}'.format(accu))

# Feature Importance
fti = clf_rf.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(iris['feature_names']):
    print('\t{0:20s} : {1:>.6f}'.format(feat, fti[i]))
As many of you will know, all you have to do is create an instance of the classifier, call fit(), and then read the feature_importances_ attribute.
Feature Importances:
sepal length (cm) : 0.074236
sepal width (cm) : 0.015321
petal length (cm) : 0.475880
petal width (cm) : 0.434563
As shown above, for the "iris" dataset the features related to "petal" turn out to be the important ones. (Note that these importances are relative values.)
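Since the values are relative (in scikit-learn they are normalized so that they sum to 1), it is sometimes convenient to sort them. A minimal sketch, reusing fti and iris from the code above:

# Sort the features by importance in descending order
indices = np.argsort(fti)[::-1]
for idx in indices:
    print('{0:20s} : {1:>.6f}'.format(iris['feature_names'][idx], fti[idx]))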
Next, let's look at the regression case, using the "Boston" housing dataset that also comes with Scikit-learn.
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

# Load the Boston dataset and split it as before
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=0)

rf_reg = RandomForestRegressor(n_estimators=10)
rf_reg = rf_reg.fit(X_train, y_train)
fti = rf_reg.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(boston['feature_names']):
    print('\t{0:10s} : {1:>.6f}'.format(feat, fti[i]))
The class name changes to RandomForestRegressor(), but the feature importances are obtained in exactly the same way.
Feature Importances:
CRIM : 0.040931
ZN : 0.001622
INDUS : 0.005131
CHAS : 0.000817
NOX : 0.042200
RM : 0.536127
AGE : 0.018797
DIS : 0.054397
RAD : 0.001860
TAX : 0.010357
PTRATIO : 0.011388
B : 0.007795
LSTAT : 0.268579
For the regression problem, the "Feature Importances" were obtained in the same way as for the classification problem. The result shows that the features "RM" and "LSTAT" are important for the "Boston" dataset. (Please note that, since the goal this time is only to look at feature importance, we have hardly tuned the hyperparameters at all.)
Next, let's check XGBoost, a representative gradient boosting library, starting with classification.
import xgboost as xgb

# Re-split the iris data, as in the first example
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf_xgb = xgb.XGBClassifier(objective='multi:softmax',
                            max_depth=5,
                            learning_rate=0.1,
                            n_estimators=100)
clf_xgb.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_metric='mlogloss',
            early_stopping_rounds=10)
y_pred_proba = clf_xgb.predict_proba(X_test)
y_pred = np.argmax(y_pred_proba, axis=1)
accu = accuracy_score(y_test, y_pred)
print('accuracy = {:>.4f}'.format(accu))

# Feature Importance
fti = clf_xgb.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(iris['feature_names']):
    print('\t{0:20s} : {1:>.6f}'.format(feat, fti[i]))
Since the Scikit-learn API can be used, the Feature Importances are obtained in exactly the same way as with the RandomForestClassifier above.
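If matplotlib is installed, XGBoost also provides a plot_importance() helper for visualizing these scores. A minimal sketch using the fitted clf_xgb from above; note that it plots split counts by default, so the ranking can differ slightly from feature_importances_, and depending on the version it may expect the underlying Booster rather than the scikit-learn wrapper.

import matplotlib.pyplot as plt
import xgboost as xgb

# Bar chart of the importance scores of the fitted model
xgb.plot_importance(clf_xgb)
plt.tight_layout()
plt.show()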
For regression, I wanted to use the Scikit-learn API in the same way, but the calculation of feature_importances_ did not seem to be supported by XGBRegressor at the time of writing. It was posted on GitHub as an open issue.
https://github.com/dmlc/xgboost/issues/1543
Instead, we use the API that is closer to XGBoost's native interface.
'''
# Error
reg_xgb = xgb.XGBRegressor(max_depth=3)
reg_xgb.fit(X_train, y_train)
fti = reg_xgb.feature_importances_
'''
def create_feature_map(features):
    # Write one line per feature: "<index>\t<name>\tq" (q = quantitative)
    outfile = open('boston_xgb.fmap', 'w')
    i = 0
    for feat in features:
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
        i = i + 1
    outfile.close()

create_feature_map(boston['feature_names'])

# X_train, y_train: the Boston train split from the RandomForestRegressor example
xgb_params = {"objective": "reg:linear", "eta": 0.1, "max_depth": 6, "silent": 1}
num_rounds = 100

dtrain = xgb.DMatrix(X_train, label=y_train)
gbdt = xgb.train(xgb_params, dtrain, num_rounds)

fti = gbdt.get_fscore(fmap='boston_xgb.fmap')

print('Feature Importances:')
for feat in boston['feature_names']:
    print('\t{0:10s} : {1:>12.4f}'.format(feat, fti[feat]))
The feature names are first written out to a feature map file, which is then passed to get_fscore(). The above produced the following results.
Feature Importances:
CRIM : 565.0000
ZN : 55.0000
INDUS : 99.0000
CHAS : 10.0000
NOX : 191.0000
RM : 377.0000
AGE : 268.0000
DIS : 309.0000
RAD : 50.0000
TAX : 88.0000
PTRATIO : 94.0000
B : 193.0000
LSTAT : 285.0000
There are some inconsistencies with the results of Scikit-learn's RandomForestRegressor, but this is presumably due to insufficient parameter tuning. Note also that get_fscore() returns the number of times each feature is used in a split, so the scale differs from the normalized importances shown earlier.
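For a rough comparison with scikit-learn's normalized values, the split counts returned by get_fscore() can simply be divided by their sum. A minimal sketch, reusing the fti dictionary from above:

# Normalize the split counts so that they sum to 1, like feature_importances_
total = sum(fti.values())
print('Feature Importances (normalized):')
for feat in boston['feature_names']:
    print('\t{0:10s} : {1:>.6f}'.format(feat, fti[feat] / total))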
Moving on to LightGBM: here, the Scikit-learn API could be used for both classification and regression.
Classification
import lightgbm as lgb

# Uses the iris train/test split from the examples above
gbm = lgb.LGBMClassifier(objective='multiclass',
                         num_leaves=23,
                         learning_rate=0.1,
                         n_estimators=100)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='multi_logloss',
        early_stopping_rounds=10)
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred_proba = gbm.predict_proba(X_test, num_iteration=gbm.best_iteration)
accu = accuracy_score(y_test, y_pred)
print('accuracy = {:>.4f}'.format(accu))

# Feature Importance
fti = gbm.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(iris['feature_names']):
    print('\t{0:20s} : {1:>.6f}'.format(feat, fti[i]))
Regression
# Uses the Boston train/test split from the regression examples above
gbm_reg = lgb.LGBMRegressor(objective='regression',
                            num_leaves=31,
                            n_estimators=100)
gbm_reg.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_metric='l2',
            verbose=0)

# Feature Importances
fti = gbm_reg.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(boston['feature_names']):
    print('\t{0:10s} : {1:>12.4f}'.format(feat, fti[i]))
It's still a "young" tool, but LightGBM is useful!
So far we have looked at three tools, and the feature importances show a similar tendency. I think the remaining inconsistencies are, again, due to insufficient hyperparameter tuning.
As a supplement, out of curiosity I also thought a little about "feature selection" in statistical modeling. The orthodox way is to create multiple models, each corresponding to a subset of the features, fit them to the data, and compare a model selection criterion (e.g. AIC, Akaike's information criterion) across the models. However, this does not give the degree of influence of every feature in a single pass.
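As a toy illustration of that subset-comparison idea (restricted to a few hand-picked candidate features of the Boston data rather than all possible subsets), one could fit one OLS model per subset and compare AIC. A minimal sketch:

import itertools

import statsmodels.api as sm
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target
names = list(boston.feature_names)

# Fit one OLS model per subset of the candidate features and record its AIC
candidates = ['RM', 'LSTAT', 'PTRATIO']
results = []
for k in range(1, len(candidates) + 1):
    for subset in itertools.combinations(candidates, k):
        cols = [names.index(f) for f in subset]
        X_sub = sm.add_constant(X[:, cols])
        results.append((sm.OLS(y, X_sub).fit().aic, subset))

# A smaller AIC indicates a better trade-off between fit and model complexity
for aic, subset in sorted(results):
    print('AIC = {0:8.2f} : {1}'.format(aic, ', '.join(subset)))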
Incidentally, each model fit also produces a p-value (or z-value) for each coefficient. I would like to see whether there is any relationship between this number and the Feature Importance reported by the random forest family of tools.
I used **StatsModels** as the statistical modeling tool. When I came back to it for the first time in a while, version 0.8.0 had been released. (It seems that the 0.7.x series, which was under development, was skipped. Also, the documentation used to be on SourceForge, but with 0.8.0 it has moved to http://www.statsmodels.org/stable/ .)
import statsmodels.api as sm
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

print("Boston Housing: regression")
boston = load_boston()
y = boston['target']
X = boston['data']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=201703
)

# Using StatsModels
X1_train = sm.add_constant(X_train)
X1_test = sm.add_constant(X_test)
model = sm.OLS(y_train, X1_train)
fitted = model.fit()
print('summary = \n', fitted.summary())
The summary printed by the above code is as follows.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.758
Model: OLS Adj. R-squared: 0.749
Method: Least Squares F-statistic: 78.38
Date: Fri, 10 Mar 2017 Prob (F-statistic): 8.08e-92
Time: 15:14:41 Log-Likelihood: -1011.3
No. Observations: 339 AIC: 2051.
Df Residuals: 325 BIC: 2104.
Df Model: 13
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 38.3323 6.455 5.939 0.000 25.634 51.031
x1 -0.1027 0.035 -2.900 0.004 -0.172 -0.033
x2 0.0375 0.018 2.103 0.036 0.002 0.073
x3 0.0161 0.081 0.198 0.843 -0.144 0.176
x4 2.8480 1.058 2.692 0.007 0.767 4.929
x5 -19.4905 5.103 -3.819 0.000 -29.530 -9.451
x6 3.9906 0.501 7.973 0.000 3.006 4.975
x7 0.0004 0.016 0.024 0.980 -0.031 0.032
x8 -1.5980 0.256 -6.236 0.000 -2.102 -1.094
x9 0.3687 0.090 4.099 0.000 0.192 0.546
x10 -0.0139 0.005 -2.667 0.008 -0.024 -0.004
x11 -0.9445 0.167 -5.640 0.000 -1.274 -0.615
x12 0.0086 0.004 2.444 0.015 0.002 0.015
x13 -0.6182 0.063 -9.777 0.000 -0.743 -0.494
==============================================================================
Omnibus: 115.727 Durbin-Watson: 2.041
Prob(Omnibus): 0.000 Jarque-Bera (JB): 485.421
Skew: 1.416 Prob(JB): 3.91e-106
Kurtosis: 8.133 Cond. No. 1.53e+04
==============================================================================
The main statistics are shown, but this time I would like to focus on the p-values, which are printed with the following code.
pvalues = fitted.pvalues
feats = boston['feature_names']

print('P>|t| :')
for i, feat in enumerate(feats):
    # pvalues[0] belongs to the constant term, so shift the index by one
    print('\t{0:10s} : {1:>.6f}'.format(feat, pvalues[(i + 1)]))
P>|t| :
CRIM : 0.003980
ZN : 0.036198
INDUS : 0.843357
CHAS : 0.007475
NOX : 0.000160
RM : 0.000000
AGE : 0.980493
DIS : 0.000000
RAD : 0.000053
TAX : 0.008030
PTRATIO : 0.000000
B : 0.015065
LSTAT : 0.000000
Of the 13 features, "INDUS" and "AGE" have p-values close to 1.0, while all the others are below 0.05. I plotted these against the Feature Importance obtained with LightGBM to see whether there is any relationship.
Fig.1 Feature Importance vs. StatsModels' p-value
Expand the vertical axis and look at the neighborhood of y = 0.
Fig.2 Feature Importance vs. StatsModels' p-value
The horizontal axis is the Feature Importance and the vertical axis the p-value. In this region, the spread of the p-values seems to shrink as the importance increases.
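For reference, a plot like Fig. 1 and Fig. 2 can be produced with a simple matplotlib scatter. A minimal sketch, assuming fti holds the LightGBM regression importances and pvalues the p-values from the OLS fit above (the two came from separate scripts, so the train/test splits differ):

import matplotlib.pyplot as plt

# Scatter of LightGBM feature importance (x) against OLS p-value (y);
# pvalues[0] belongs to the constant term and is skipped
plt.scatter(fti, pvalues[1:])
for i, feat in enumerate(boston['feature_names']):
    plt.annotate(feat, (fti[i], pvalues[i + 1]))
plt.xlabel('Feature Importance (LightGBM)')
plt.ylabel('p-value (StatsModels OLS)')
plt.grid(True)
plt.show()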
Next, I examined a binary classification problem, the "Titanic" dataset made famous by Kaggle, and ran a logistic regression with StatsModels as follows.
import statsmodels.api as sm

# by StatsModels (Logit = logistic regression)
# X_train, y_train, feats: preprocessed Titanic data (preprocessing omitted, see below)
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

model_lr = sm.Logit(y_train, X_train)
res = model_lr.fit()
print('res.summary = ', res.summary())
y_pred = res.predict(X_test)
y_pred = np.asarray([int(y_i > 0.5) for y_i in y_pred])

# check feature importance
pvalues = res.pvalues

print('P>|z|:')
for i, feat in enumerate(feats):
    # pvalues[0] belongs to the constant term, so shift the index by one
    print('\t{0:15s} : {1:>12.4f}'.format(feat, pvalues[i + 1]))
As shown above, sm.Logit() is used. The result is as follows.
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 668
Model: Logit Df Residuals: 657
Method: MLE Df Model: 10
Date: Fri, 10 Mar 2017 Pseudo R-squ.: 0.3219
Time: 15:36:15 Log-Likelihood: -305.07
converged: True LL-Null: -449.89
LLR p-value: 2.391e-56
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.2615 8.99e+06 1.4e-07 1.000 -1.76e+07 1.76e+07
x1 0.0001 0.000 0.309 0.758 -0.001 0.001
x2 -0.9141 0.162 -5.644 0.000 -1.232 -0.597
x3 2.7590 0.231 11.939 0.000 2.306 3.212
x4 -0.0322 0.009 -3.657 0.000 -0.049 -0.015
x5 -0.4421 0.132 -3.350 0.001 -0.701 -0.183
x6 -0.0709 0.141 -0.503 0.615 -0.347 0.206
x7 2.456e-07 1.77e-07 1.386 0.166 -1.02e-07 5.93e-07
x8 0.0023 0.003 0.926 0.354 -0.003 0.007
x9 0.6809 8.99e+06 7.57e-08 1.000 -1.76e+07 1.76e+07
x10 0.3574 8.99e+06 3.97e-08 1.000 -1.76e+07 1.76e+07
x11 0.2231 8.99e+06 2.48e-08 1.000 -1.76e+07 1.76e+07
==============================================================================
P>|z|:
PassengerId : 0.7575
Pclass : 0.0000
Sex : 0.0000
Age : 0.0003
SibSp : 0.0008
Parch : 0.6151
Ticket : 0.1657
Fare : 0.3543
C : 1.0000
Q : 1.0000
S : 1.0000
I will omit the explanation of the data preprocessing before feeding the data into the model, but the "Embarked" column (port of embarkation) in the original dataset was One-Hot encoded into the three features ['C', 'Q', 'S']. ("Embarked" was, as expected, not a very important feature.) I will compare the result with LightGBM again: the horizontal axis is LightGBM's Feature Importance, and the vertical axis is the p-value.
Fig.3 Feature Importance vs. StatsModels' p-value
Since the concepts behind machine learning, in particular decision-tree ensemble models, and statistical modeling are quite different, it may be unreasonable to look for a correlation in these plots (Fig. 1, 2, 3).
- Feature Importance: the effect of a feature in reducing the error (e.g. the Out-of-Bag error rate) during the learning process.
- p-value: an indicator of how well the data sample fits the assumed statistical model (population).
This is only my rough understanding, and the two numbers are based on different concepts. However, they are not completely unrelated, and I imagine there may be a (situation-dependent) causal relationship in which the Feature Importance does not become large unless the p-value becomes reasonably small. (There is no basis for this; it is just speculation.)
In addition, for the purpose of obtaining a "good model", I feel that the same approach matters in both kinds of modeling: incorporate effective features as they are, apply preprocessing or model improvements to features that are not effective or whose p-value is not small, and discard what is unlikely to help.
References
- "First Pattern Recognition" by Hirai, Morikita Publishing
- "Introduction to Statistical Modeling for Data Analysis" by Kubo, Iwanami Shoten