Introduction

Basic machine learning procedure: (1) Classification model organizes the procedure for creating a basic classification model. This time, I would like to focus on the selection of features and compare and examine various selection methods of features.

Procedure so far

-Basic machine learning procedure: ① Classification model -Basic machine learning procedure: ② Prepare data

Analytical environment

Google BigQuery Google Colaboratory

Target data

(1) Similar to the classification model, purchase data is stored in the following table structure.

id	result	product1	product2	product3	product4	product5
001	1	2500	1200	1890	530	null
002	0	750	3300	null	1250	2000

Since the purpose is to select the feature amount, the horizontal axis should be about 300.

0. How to select the target feature

From Feature selection, I chose the following method.

Embedded Method(SelectFromModel)
Wrapper Method(RFE)

Also, although it is not scikit-learn, it was introduced in Feature selection method Boruta using random forest and test. I would also like to use Boruta, which is one of the Wrapper Methods.

Wrapper Method(Boruta)

In order to compare under the same conditions, I would like to use RandomForestClassifier as the classifier used for feature selection.

1.Embedded Method(SelectFromModel)

First, use the Embedded Method used in Basic Machine Learning Procedure: (1) Classification Model. The Embedded Method embeds features in a particular model and selects the best features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

#Change to numpy array
label = np.array(df.loc[0:, label_col]) 
features = np.array(df.loc[0:, feature_cols])

#Variable selection
clf = RandomForestClassifier(max_depth=7)

##Select variables using Embedded Method
feat_selector = SelectFromModel(clf)
feat_selector.fit(features, label)
df_feat_selected = df.loc[0:, feature_cols].loc[0:, feat_selector.get_support()]

36 variables were selected. The accuracy obtained using these variables is as follows. It's quite expensive, but I want to improve Recall a little.

Accuracy : 92%
Precision : 99%
Recall : 84%

2.Wrapper Method(RFE) Then use the Wrapper Method. This is a method of finding the optimal subset by turning the prediction model with a subset of features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

#Change to numpy array
label = np.array(df.loc[0:, label_col]) 
features = np.array(df.loc[0:, feature_cols])

#Variable selection
clf = RandomForestClassifier(max_depth=7)

##Select variables using Embedded Method
feat_selector = RFE(clf)
feat_selector.fit(features, label)
df_feat_selected = df.loc[0:, feature_cols].loc[0:, feat_selector.get_support()]

146 variables were selected. Compared to the Embedded Method, there are quite a lot. The accuracy obtained using these variables is as follows. The numbers after the decimal point are not so different, but they are almost the same as the Embedded Method.

Accuracy : 92%
Precision : 99%
Recall : 84%

3.Wrapper Method(Boruta) The last is Boruta. Boruta is not installed as standard with Colaboratory, so pip install it first.

pip install boruta

This is also the Wrapper Method, so we will find the optimal subset. However, it takes a lot of time compared to the previous RFE. There is progress, so let's wait slowly.

from boruta import BorutaPy

#Change to numpy array
label = np.array(df.loc[0:, label_col]) 
features = np.array(df.loc[0:, feature_cols])

#Variable selection
##Here, assuming classification, random forest(RandomForestClassifier)Use
clf = RandomForestRegressor(max_depth=7)

##Select variables using Boruta
feat_selector = BorutaPy(clf, n_estimators='auto', two_step=False, verbose=2, random_state=42)
feat_selector.fit(features, label)
df_feat_selected=df.loc[0:, feature_cols].loc[0:, feat_selector.support_]

97 variables were selected. The accuracy obtained using these variables is as follows. does not change. .. ..

Accuracy : 92%
Precision : 99%
Recall : 84%

in conclusion

In fact, the accuracy changes considerably depending on how you select variables! I wanted to get the result, but unfortunately the result was about the same. (I wonder if the sample data was wrong)

~~ This time, we compared only three types, but in the Summary of feature selection that I referred to earlier, There are some methods that I haven't tried this time, such as Step Forward and Step backward of Wrapper Method, so I would like to try them in the future. ~~

2/26 postscript

I tried Step Forward and Step backward of Wrapper Method by referring to Summary of feature selection, but it is slow. Or rather, it doesn't end.

It may be because the features are as large as 300, or it may be due to the power of Colab, but isn't it difficult to actually use the method of adding and subtracting features?

Other than that, there seems to be something like Optuna, which is an automated framework for feature selection, so in a word, feature selection. But there are various things that I can study.

Basic machine learning procedure: ③ Compare and examine the selection method of features