Feature engineering is an important factor in building regression and classification models. Features are often selected using domain knowledge, but since I tried out feature selection with scikit-learn, I have organized the methods here.
RFE (Recursive Feature Elimination) is a recursive feature-elimination method. It builds a model starting with all features and removes the least important feature in that model, then builds the model again and removes the least important feature. This procedure is repeated until the specified number of features remains.
The Python code is below. (Note that load_boston is deprecated in recent versions of scikit-learn and was removed in 1.2, so this code assumes an older version.)
#Import required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingRegressor
#Load the dataset
boston = load_boston()
#Create a DataFrame
#Store the explanatory variables
df = pd.DataFrame(boston.data, columns = boston.feature_names)
#Add the objective variable
df['MEDV'] = boston.target
#Use GBDT as an estimator. Select 5 features
selector = RFE(GradientBoostingRegressor(n_estimators=100, random_state=10), n_features_to_select=5)
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()
print(boston.feature_names)
print(mask)
#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))
The execution result is as follows.
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[False False False False True True False True False False True False
True]
X.shape=(506, 13), X_selected.shape=(506, 5)
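For reference, the fitted RFE object also exposes a ranking_ attribute: selected features get rank 1, and larger values mean a feature was eliminated earlier. A short sketch continuing from the code above:
#Ranking of each feature (1 = selected)
print(selector.ranking_)
#Names of the selected features
print(boston.feature_names[mask])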
SelectFromModel is a method that selects features using feature_importances_, which expresses the importance of each feature as computed by the fitted model; features whose importance meets the given threshold (here, the median importance) are kept.
The Python code is below.
from sklearn.feature_selection import SelectFromModel
#Use GBDT as an estimator.
selector = SelectFromModel(GradientBoostingRegressor(n_estimators=100, random_state=10), threshold="median")
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()
print(boston.feature_names)
print(mask)
#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))
The execution result is as follows.
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[ True False False False True True False True False False True True
True]
X.shape=(506, 13), X_selected.shape=(506, 7)
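After fitting, the concrete importance cutoff that "median" resolved to can be inspected via the threshold_ attribute, and the fitted estimator itself is available as estimator_. For example:
#Importance value corresponding to the "median" threshold
print(selector.threshold_)
#Importances computed by the underlying GBDT
print(selector.estimator_.feature_importances_)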
SelectKBest is a method that selects the top k explanatory variables, scored here by the univariate F-test f_regression.
The Python code is below.
from sklearn.feature_selection import SelectKBest, f_regression
#Select 5 features
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support() #Get a mask indicating whether each feature was selected
print(boston.feature_names)
print(mask)
#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))
The execution result is as follows.
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[False False True False False True False False False True True False
True]
X.shape=(506, 13), X_selected.shape=(506, 5)
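Because SelectKBest scores each feature independently, the per-feature scores and p-values can be examined after fitting; with f_regression, the score is the F-statistic of a univariate linear regression:
#F-statistic and p-value for each feature
print(selector.scores_)
print(selector.pvalues_)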
SelectPercentile is a method that selects the top k% of the explanatory variables; here, the top 50% are kept.
The Python code is below.
from sklearn.feature_selection import SelectPercentile, f_regression
#Select 50% of the features
selector = SelectPercentile(score_func=f_regression, percentile=50)
selector.fit(df.iloc[:, 0:13], df.iloc[:, 13])
mask = selector.get_support()
print(boston.feature_names)
print(mask)
#Get only the selected feature column
X_selected = selector.transform(df.iloc[:, 0:13])
print("X.shape={}, X_selected.shape={}".format(df.iloc[:, 0:13].shape, X_selected.shape))
The execution result is as follows.
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[False False True False True True False False False True True False
True]
X.shape=(506, 13), X_selected.shape=(506, 6)
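In practice, these selectors are often combined with a downstream estimator in a Pipeline, so that feature selection is refit only on the training folds during cross-validation. A minimal sketch, reusing the same df as above:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

#Feature selection and regression as a single estimator
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=5)),
    ("model", GradientBoostingRegressor(n_estimators=100, random_state=10)),
])
scores = cross_val_score(pipe, df.iloc[:, 0:13], df.iloc[:, 13], cv=5)
print(scores.mean())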
Thank you for reading to the end. This time, we organized feature selection methods available in sklearn. In actual work, I think it is important to carry out appropriate feature engineering by combining these library functions with domain knowledge.
If you find anything that needs correction, I would appreciate it if you could let me know.