When analyzing a dataset with many features, tree-based ensemble models such as Random Forest can compute the importance of each feature. So far I have used this functionality as a black box, but as the number of tools I use has grown, I would like to sort out how each of them exposes it.
First, here is the idea behind the importance calculation, quoted from "First Pattern Recognition" (Chapter 11).
The Out-Of-Bag (OOB) error rate is calculated as follows. When building each tree, a random forest selects its training data by bootstrap sampling, so roughly one third of the samples are not used for training that tree. For a given training sample, the trees that did not use it can be collected into a partial forest, and that sample can then serve as test data for estimating the error.
When training a decision-tree ensemble in this way, it seems that the importance of the features, i.e. how much they contribute to improved accuracy, can be measured by monitoring the OOB error rate.
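As a side note, scikit-learn can report this OOB-based accuracy estimate directly via the oob_score option. Below is a minimal sketch; note that the oob_score_ attribute is the OOB accuracy of the forest as a whole, not a per-feature importance.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
# oob_score=True evaluates each sample only with the trees that did not
# see it during bootstrap sampling
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(iris.data, iris.target)
print('OOB score = {:.4f}'.format(clf.oob_score_))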
Below, we will look at the following tools (Python packages).
- Scikit-learn's RandomForest (RandomForestClassifier / RandomForestRegressor)
- XGBoost
- LightGBM
XGBoost and LightGBM each provide both a native API and a Scikit-learn-compatible API; I would like to use the Scikit-learn API as much as possible to keep the learning effort small.
(The tools and libraries used are as follows: Python 3.5.2, Scikit-learn 0.18.1, XGBoost 0.6a2, LightGBM 0.1.)
First, the basic form, using Scikit-learn's RandomForestClassifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
accu = accuracy_score(y_test, y_pred)
print('accuracy = {:>.4f}'.format(accu))

# Feature Importance
fti = clf_rf.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(iris['feature_names']):
    print('\t{0:20s} : {1:>.6f}'.format(feat, fti[i]))
As many of you will know, all you have to do is create an instance of the classifier, call fit(), and then read the feature_importances_ attribute.
Feature Importances:
sepal length (cm) : 0.074236
sepal width (cm) : 0.015321
petal length (cm) : 0.475880
petal width (cm) : 0.434563
As shown above, for the "iris" dataset the features related to "petal" turn out to be the important ones. (Note that these importances are relative values.)
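Since the values are relative (in scikit-learn they are normalized so that they sum to 1), it is sometimes convenient to sort them. A minimal sketch, reusing fti and iris from the code above:

# Sort the features by importance in descending order
indices = np.argsort(fti)[::-1]
for idx in indices:
    print('{0:20s} : {1:>.6f}'.format(iris['feature_names'][idx], fti[idx]))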
Next, let's look at the regression case, using the "Boston" housing dataset that also comes with Scikit-learn.
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

# Load the Boston dataset and split it as before
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=0)

rf_reg = RandomForestRegressor(n_estimators=10)
rf_reg = rf_reg.fit(X_train, y_train)
fti = rf_reg.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(boston['feature_names']):
    print('\t{0:10s} : {1:>.6f}'.format(feat, fti[i]))
The class name changes to RandomForestRegressor(), but the feature importances are obtained in exactly the same way.
Feature Importances:
CRIM : 0.040931
ZN : 0.001622
INDUS : 0.005131
CHAS : 0.000817
NOX : 0.042200
RM : 0.536127
AGE : 0.018797
DIS : 0.054397
RAD : 0.001860
TAX : 0.010357
PTRATIO : 0.011388
B : 0.007795
LSTAT : 0.268579
For the regression problem, the "Feature Importances" were obtained in the same way as for the classification problem. The result shows that the features "RM" and "LSTAT" are important for the "Boston" dataset. (Please note that, since the goal this time is only to look at feature importance, we have hardly tuned the hyperparameters at all.)
Next, let's check XGBoost, a representative gradient boosting library, starting with classification.
import xgboost as xgb

# Re-split the iris data, as in the first example
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf_xgb = xgb.XGBClassifier(objective='multi:softmax',
                            max_depth=5,
                            learning_rate=0.1,
                            n_estimators=100)
clf_xgb.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_metric='mlogloss',
            early_stopping_rounds=10)
y_pred_proba = clf_xgb.predict_proba(X_test)
y_pred = np.argmax(y_pred_proba, axis=1)
accu = accuracy_score(y_test, y_pred)
print('accuracy = {:>.4f}'.format(accu))

# Feature Importance
fti = clf_xgb.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(iris['feature_names']):
    print('\t{0:20s} : {1:>.6f}'.format(feat, fti[i]))
Since the Scikit-learn API can be used, the Feature Importances are obtained in exactly the same way as with the RandomForestClassifier above.
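If matplotlib is installed, XGBoost also provides a plot_importance() helper for visualizing these scores. A minimal sketch using the fitted clf_xgb from above; note that it plots split counts by default, so the ranking can differ slightly from feature_importances_, and depending on the version it may expect the underlying Booster rather than the scikit-learn wrapper.

import matplotlib.pyplot as plt
import xgboost as xgb

# Bar chart of the importance scores of the fitted model
xgb.plot_importance(clf_xgb)
plt.tight_layout()
plt.show()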
For regression, I wanted to use the Scikit-learn API in the same way, but the calculation of feature_importances_ did not seem to be supported by XGBRegressor at the time of writing. It was posted on GitHub as an open issue.
https://github.com/dmlc/xgboost/issues/1543
Instead, we use the API that is closer to XGBoost's native interface.
'''
# Error
reg_xgb = xgb.XGBRegressor(max_depth=3)
reg_xgb.fit(X_train, y_train)
fti = reg_xgb.feature_importances_
'''
def create_feature_map(features):
    # Write one line per feature: "<index>\t<name>\tq" (q = quantitative)
    outfile = open('boston_xgb.fmap', 'w')
    i = 0
    for feat in features:
        outfile.write('{0}\t{1}\tq\n'.format(i, feat))
        i = i + 1
    outfile.close()

create_feature_map(boston['feature_names'])

# X_train, y_train: the Boston train split from the RandomForestRegressor example
xgb_params = {"objective": "reg:linear", "eta": 0.1, "max_depth": 6, "silent": 1}
num_rounds = 100

dtrain = xgb.DMatrix(X_train, label=y_train)
gbdt = xgb.train(xgb_params, dtrain, num_rounds)

fti = gbdt.get_fscore(fmap='boston_xgb.fmap')

print('Feature Importances:')
for feat in boston['feature_names']:
    print('\t{0:10s} : {1:>12.4f}'.format(feat, fti[feat]))
The feature names are first written out to a feature map file, which is then passed to get_fscore(). The above produced the following results.
Feature Importances:
CRIM : 565.0000
ZN : 55.0000
INDUS : 99.0000
CHAS : 10.0000
NOX : 191.0000
RM : 377.0000
AGE : 268.0000
DIS : 309.0000
RAD : 50.0000
TAX : 88.0000
PTRATIO : 94.0000
B : 193.0000
LSTAT : 285.0000
There are some inconsistencies with the results of Scikit-learn's RandomForestRegressor, but this is presumably due to insufficient parameter tuning. Note also that get_fscore() returns the number of times each feature is used in a split, so the scale differs from the normalized importances shown earlier.
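For a rough comparison with scikit-learn's normalized values, the split counts returned by get_fscore() can simply be divided by their sum. A minimal sketch, reusing the fti dictionary from above:

# Normalize the split counts so that they sum to 1, like feature_importances_
total = sum(fti.values())
print('Feature Importances (normalized):')
for feat in boston['feature_names']:
    print('\t{0:10s} : {1:>.6f}'.format(feat, fti[feat] / total))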
Moving on to LightGBM: here, the Scikit-learn API could be used for both classification and regression.
Classification
import lightgbm as lgb

# Uses the iris train/test split from the examples above
gbm = lgb.LGBMClassifier(objective='multiclass',
                         num_leaves=23,
                         learning_rate=0.1,
                         n_estimators=100)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='multi_logloss',
        early_stopping_rounds=10)
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
y_pred_proba = gbm.predict_proba(X_test, num_iteration=gbm.best_iteration)
accu = accuracy_score(y_test, y_pred)
print('accuracy = {:>.4f}'.format(accu))

# Feature Importance
fti = gbm.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(iris['feature_names']):
    print('\t{0:20s} : {1:>.6f}'.format(feat, fti[i]))
Regression
# Uses the Boston train/test split from the regression examples above
gbm_reg = lgb.LGBMRegressor(objective='regression',
                            num_leaves=31,
                            n_estimators=100)
gbm_reg.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_metric='l2',
            verbose=0)

# Feature Importances
fti = gbm_reg.feature_importances_

print('Feature Importances:')
for i, feat in enumerate(boston['feature_names']):
    print('\t{0:10s} : {1:>12.4f}'.format(feat, fti[i]))
It's still a "young" tool, but LightGBM is useful!
So far we have looked at three tools, and the feature importances show a similar tendency. I think the remaining inconsistencies are, again, due to insufficient hyperparameter tuning.
As a supplement, out of curiosity I also thought a little about "feature selection" in statistical modeling. The orthodox way is to create multiple models, each corresponding to a subset of the features, fit them to the data, and compare a model selection criterion (e.g. AIC, Akaike's information criterion) across the models. However, this does not give the degree of influence of every feature in a single pass.
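As a toy illustration of that subset-comparison idea (restricted to a few hand-picked candidate features of the Boston data rather than all possible subsets), one could fit one OLS model per subset and compare AIC. A minimal sketch:

import itertools

import statsmodels.api as sm
from sklearn.datasets import load_boston

boston = load_boston()
X, y = boston.data, boston.target
names = list(boston.feature_names)

# Fit one OLS model per subset of the candidate features and record its AIC
candidates = ['RM', 'LSTAT', 'PTRATIO']
results = []
for k in range(1, len(candidates) + 1):
    for subset in itertools.combinations(candidates, k):
        cols = [names.index(f) for f in subset]
        X_sub = sm.add_constant(X[:, cols])
        results.append((sm.OLS(y, X_sub).fit().aic, subset))

# A smaller AIC indicates a better trade-off between fit and model complexity
for aic, subset in sorted(results):
    print('AIC = {0:8.2f} : {1}'.format(aic, ', '.join(subset)))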
Incidentally, each model fit also produces a p-value (or z-value) for each coefficient. I would like to see whether there is any relationship between this number and the Feature Importance reported by the random forest family of tools.
I used **StatsModels** as the statistical modeling tool. When I came back to it for the first time in a while, version 0.8.0 had been released. (It seems that the 0.7.x series, which was under development, was skipped. Also, the documentation used to be on SourceForge, but with 0.8.0 it has moved to http://www.statsmodels.org/stable/ .)
import statsmodels.api as sm
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

print("Boston Housing: regression")
boston = load_boston()
y = boston['target']
X = boston['data']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=201703
)

# Using StatsModels
X1_train = sm.add_constant(X_train)
X1_test = sm.add_constant(X_test)
model = sm.OLS(y_train, X1_train)
fitted = model.fit()
print('summary = \n', fitted.summary())
The summary printed by the above code is as follows.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.758
Model: OLS Adj. R-squared: 0.749
Method: Least Squares F-statistic: 78.38
Date: Fri, 10 Mar 2017 Prob (F-statistic): 8.08e-92
Time: 15:14:41 Log-Likelihood: -1011.3
No. Observations: 339 AIC: 2051.
Df Residuals: 325 BIC: 2104.
Df Model: 13
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 38.3323 6.455 5.939 0.000 25.634 51.031
x1 -0.1027 0.035 -2.900 0.004 -0.172 -0.033
x2 0.0375 0.018 2.103 0.036 0.002 0.073
x3 0.0161 0.081 0.198 0.843 -0.144 0.176
x4 2.8480 1.058 2.692 0.007 0.767 4.929
x5 -19.4905 5.103 -3.819 0.000 -29.530 -9.451
x6 3.9906 0.501 7.973 0.000 3.006 4.975
x7 0.0004 0.016 0.024 0.980 -0.031 0.032
x8 -1.5980 0.256 -6.236 0.000 -2.102 -1.094
x9 0.3687 0.090 4.099 0.000 0.192 0.546
x10 -0.0139 0.005 -2.667 0.008 -0.024 -0.004
x11 -0.9445 0.167 -5.640 0.000 -1.274 -0.615
x12 0.0086 0.004 2.444 0.015 0.002 0.015
x13 -0.6182 0.063 -9.777 0.000 -0.743 -0.494
==============================================================================
Omnibus: 115.727 Durbin-Watson: 2.041
Prob(Omnibus): 0.000 Jarque-Bera (JB): 485.421
Skew: 1.416 Prob(JB): 3.91e-106
Kurtosis: 8.133 Cond. No. 1.53e+04
==============================================================================
The main statistics are shown, but this time I would like to focus on the p-values, which are printed with the following code.
pvalues = fitted.pvalues
feats = boston['feature_names']

print('P>|t| :')
for i, feat in enumerate(feats):
    # pvalues[0] belongs to the constant term, so shift the index by one
    print('\t{0:10s} : {1:>.6f}'.format(feat, pvalues[(i + 1)]))
P>|t| :
CRIM : 0.003980
ZN : 0.036198
INDUS : 0.843357
CHAS : 0.007475
NOX : 0.000160
RM : 0.000000
AGE : 0.980493
DIS : 0.000000
RAD : 0.000053
TAX : 0.008030
PTRATIO : 0.000000
B : 0.015065
LSTAT : 0.000000
Of the 13 features, "INDUS" and "AGE" have p-values close to 1.0, while all the others are below 0.05. I plotted these against the Feature Importance obtained with LightGBM to see whether there is any relationship.
Fig.1 Feature Importance vs. StatsModels' p-value
Expand the vertical axis and look at the neighborhood of y = 0.
Fig.2 Feature Importance vs. StatsModels' p-value
The horizontal axis is the Feature Importance and the vertical axis the p-value. In this region, the spread of the p-values seems to shrink as the importance increases.
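For reference, a plot like Fig. 1 and Fig. 2 can be produced with a simple matplotlib scatter. A minimal sketch, assuming fti holds the LightGBM regression importances and pvalues the p-values from the OLS fit above (the two came from separate scripts, so the train/test splits differ):

import matplotlib.pyplot as plt

# Scatter of LightGBM feature importance (x) against OLS p-value (y);
# pvalues[0] belongs to the constant term and is skipped
plt.scatter(fti, pvalues[1:])
for i, feat in enumerate(boston['feature_names']):
    plt.annotate(feat, (fti[i], pvalues[i + 1]))
plt.xlabel('Feature Importance (LightGBM)')
plt.ylabel('p-value (StatsModels OLS)')
plt.grid(True)
plt.show()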
Next, I examined a binary classification problem, the "Titanic" dataset made famous by Kaggle, and ran a logistic regression with StatsModels as follows.
import statsmodels.api as sm

# by StatsModels (Logit = logistic regression)
# X_train, y_train, feats: preprocessed Titanic data (preprocessing omitted, see below)
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

model_lr = sm.Logit(y_train, X_train)
res = model_lr.fit()
print('res.summary = ', res.summary())
y_pred = res.predict(X_test)
y_pred = np.asarray([int(y_i > 0.5) for y_i in y_pred])

# check feature importance
pvalues = res.pvalues

print('P>|z|:')
for i, feat in enumerate(feats):
    # pvalues[0] belongs to the constant term, so shift the index by one
    print('\t{0:15s} : {1:>12.4f}'.format(feat, pvalues[i + 1]))
As shown above, sm.Logit() is used. The result is as follows.
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 668
Model: Logit Df Residuals: 657
Method: MLE Df Model: 10
Date: Fri, 10 Mar 2017 Pseudo R-squ.: 0.3219
Time: 15:36:15 Log-Likelihood: -305.07
converged: True LL-Null: -449.89
LLR p-value: 2.391e-56
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const 1.2615 8.99e+06 1.4e-07 1.000 -1.76e+07 1.76e+07
x1 0.0001 0.000 0.309 0.758 -0.001 0.001
x2 -0.9141 0.162 -5.644 0.000 -1.232 -0.597
x3 2.7590 0.231 11.939 0.000 2.306 3.212
x4 -0.0322 0.009 -3.657 0.000 -0.049 -0.015
x5 -0.4421 0.132 -3.350 0.001 -0.701 -0.183
x6 -0.0709 0.141 -0.503 0.615 -0.347 0.206
x7 2.456e-07 1.77e-07 1.386 0.166 -1.02e-07 5.93e-07
x8 0.0023 0.003 0.926 0.354 -0.003 0.007
x9 0.6809 8.99e+06 7.57e-08 1.000 -1.76e+07 1.76e+07
x10 0.3574 8.99e+06 3.97e-08 1.000 -1.76e+07 1.76e+07
x11 0.2231 8.99e+06 2.48e-08 1.000 -1.76e+07 1.76e+07
==============================================================================
P>|z|:
PassengerId : 0.7575
Pclass : 0.0000
Sex : 0.0000
Age : 0.0003
SibSp : 0.0008
Parch : 0.6151
Ticket : 0.1657
Fare : 0.3543
C : 1.0000
Q : 1.0000
S : 1.0000
I will omit the explanation of the data preprocessing before feeding the data into the model, but the "Embarked" column (port of embarkation) in the original dataset was One-Hot encoded into the three features ['C', 'Q', 'S']. ("Embarked" was, as expected, not a very important feature.) I will compare the result with LightGBM again: the horizontal axis is LightGBM's Feature Importance, and the vertical axis is the p-value.
Fig.3 Feature Importance vs. StatsModels' p-value
Since the concepts behind machine learning, in particular decision-tree ensemble models, and statistical modeling are quite different, it may be unreasonable to look for a correlation in these plots (Fig. 1, 2, 3).
- Feature Importance: the effect of a feature in reducing the error (e.g. the Out-of-Bag error rate) during the learning process.
- p-value: an indicator of how well the data sample fits the assumed statistical model (population).
This is only my rough understanding, and the two numbers are based on different concepts. However, they are not completely unrelated, and I imagine there may be a (situation-dependent) causal relationship in which the Feature Importance does not become large unless the p-value becomes reasonably small. (There is no basis for this; it is just speculation.)
In addition, for the purpose of obtaining a "good model", I feel that the same approach matters in both kinds of modeling: incorporate effective features as they are, apply preprocessing or model improvements to features that are not effective or whose p-value is not small, and discard what is unlikely to help.
References
- "First Pattern Recognition" by Hirai, Morikita Publishing
- "Introduction to Statistical Modeling for Data Analysis" by Kubo, Iwanami Shoten