Univariate selection computes the relationship between each explanatory variable and the target variable, and keeps the features with the strongest association.
SelectKBest keeps the top k explanatory variables. The argument score_func is normally set to f_classif (the default) for classification and to f_regression for regression. The number of features to keep is specified with the argument k.
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# these examples assume an older version
boston = load_boston()
X = boston.data
y = boston.target
# Select 5 features
selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X, y)
mask = selector.get_support()  # Get a boolean mask of which features were selected
print(boston.feature_names)
print(mask)
# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))
output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[False False True False False True False False False True True False
True]
X.shape=(506, 13), X_selected.shape=(506, 5)
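Because the mask lines up with boston.feature_names (a NumPy array), the names of the selected features can be recovered with boolean indexing:

print(boston.feature_names[mask])
# => ['INDUS' 'RM' 'TAX' 'PTRATIO' 'LSTAT']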
SelectPercentile keeps the top k% of the explanatory variables. The argument score_func is normally set to f_classif (the default) for classification and to f_regression for regression. The percentage (0 to 100) of features to keep is specified with the argument percentile; with the 13 Boston features, percentile=40 ends up keeping 5 of them, as the output below shows.
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectPercentile, f_regression
boston = load_boston()
X = boston.data
y = boston.target
# Select 40% of the features
selector = SelectPercentile(score_func=f_regression, percentile=40)
selector.fit(X, y)
mask = selector.get_support()
print(boston.feature_names)
print(mask)
# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))
output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[False False True False False True False False False True True False
True]
X.shape=(506, 13), X_selected.shape=(506, 5)
GenericUnivariateSelect combines the selectors above into one class: set the strategy ('percentile', 'k_best', 'fpr', 'fdr', 'fwe') with the argument mode, and the parameter for the chosen mode with param.
For example,
selector = GenericUnivariateSelect(mode='percentile', score_func=f_regression, param=40)
and
selector = SelectPercentile(score_func=f_regression, percentile=40)
are equivalent.
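A minimal runnable version of this (a sketch using the same Boston data as the examples above, so it assumes scikit-learn < 1.2):

from sklearn.datasets import load_boston
from sklearn.feature_selection import GenericUnivariateSelect, f_regression

boston = load_boston()
X = boston.data
y = boston.target

# 'percentile' mode with param=40 behaves like SelectPercentile(percentile=40)
selector = GenericUnivariateSelect(mode='percentile', score_func=f_regression, param=40)
selector.fit(X, y)
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))

This should print the same shapes as the SelectPercentile example above.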
SelectFromModel selects features based on the feature_importances_ (or coef_) attribute of a fitted model, which represents how important each feature is to that model.
As arguments, specify the estimator and the argument threshold; threshold accepts a float or strings such as "mean" and "median".
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
boston = load_boston()
X = boston.data
y = boston.target
# Use RandomForestRegressor as the estimator; keep features whose importance is at or above the median
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42), threshold="median")
selector.fit(X, y)
mask = selector.get_support()
print(boston.feature_names)
print(mask)
# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))
output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[ True False False False True True False True False True True False
True]
X.shape=(506, 13), X_selected.shape=(506, 7)
Wrapper methods select features iteratively: either start with no features and add them one at a time until some criterion is satisfied (forward selection, sketched below), or start with all features and remove them one at a time (backward elimination, which is what RFE does).
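As a minimal sketch of the forward direction, assuming scikit-learn >= 0.24 (which added SequentialFeatureSelector) but < 1.2 (where load_boston was removed):

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

boston = load_boston()
X = boston.data
y = boston.target

# Start from zero features and greedily add whichever feature most improves
# the cross-validated score, stopping once 5 features have been added
selector = SequentialFeatureSelector(
    RandomForestRegressor(n_estimators=100, random_state=42),
    n_features_to_select=5, direction='forward')
selector.fit(X, y)
print(selector.get_support())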
RFE (Recursive Feature Elimination) starts with all features, builds a model, and removes the feature the model ranks as least important. It then builds a model again on the remaining features and removes the least important one, repeating until the predetermined number of features is reached.
As arguments, specify the estimator and the number of features to keep, n_features_to_select.
Because the build-a-model-then-drop-a-feature cycle runs (number of features - n_features_to_select) times, RFE can take a long time.
from sklearn.datasets import load_boston
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
boston = load_boston()
X = boston.data
y = boston.target
# Use RandomForestRegressor as the estimator; select 5 features
selector = RFE(RandomForestRegressor(n_estimators=100, random_state=42), n_features_to_select=5)
selector.fit(X, y)
mask = selector.get_support()
print(boston.feature_names)
print(mask)
# Keep only the selected feature columns
X_selected = selector.transform(X)
print("X.shape={}, X_selected.shape={}".format(X.shape, X_selected.shape))
output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[ True False False False True True False True False False False False
True]
X.shape=(506, 13), X_selected.shape=(506, 5)