This is a memo from my investigation of feature selection using null importances. Please point out anything that seems off. Reference: Feature Selection with Null Importances
The goal is to remove features that only add noise and to keep the features that are genuinely important. To do this, the importance of each feature is also measured on training data in which the target variable has been randomly shuffled; these "null" importances form a baseline distribution against which the actual importances can be compared.
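As a tiny illustration of why shuffling the target destroys any real signal (a toy example of my own, not part of the pipeline below):
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1000)
y = 2 * x + 0.1 * rng.rand(1000)
print(np.corrcoef(x, y)[0, 1])                   # ~1.0: genuine relationship
print(np.corrcoef(x, rng.permutation(y))[0, 1])  # ~0.0: relationship destroyed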
Import required libraries
import pandas as pd
import numpy as np
np.random.seed(123)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import time
import lightgbm as lgb
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
import warnings
warnings.simplefilter('ignore', UserWarning)
import gc
gc.enable()
Prepare the data. This time we use the data from the Kaggle House Prices tutorial: House Prices: Advanced Regression Techniques.
# Read the data
data = pd.read_csv("./House_Price/train.csv")
target = data['SalePrice']
# Get the categorical variables
cat_features = [
    f for f in data.columns if data[f].dtype == 'object'
]
for feature in cat_features:
    # Convert categorical variables to integer codes
    data[feature], _ = pd.factorize(data[feature])
    # Convert the type to category
    data[feature] = data[feature].astype('category')
# For the time being, features containing missing values are dropped
drop_cols = [f for f in data.columns if data[f].isnull().any()]
# drop_cols.append('SalePrice')  # not needed: the target column is used below
data = data.drop(drop_cols, axis=1)
# Keep only the categorical features that survived the drop, so that
# categorical_feature below does not reference dropped columns
cat_features = [f for f in cat_features if f in data.columns]
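As a quick aside, this is what pd.factorize does on a small example (standalone, not part of the pipeline):
import pandas as pd

s = pd.Series(['a', 'b', 'a', None])
codes, uniques = pd.factorize(s)
print(codes)    # [ 0  1  0 -1] -- missing values are encoded as -1
print(uniques)  # Index(['a', 'b'], dtype='object')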
Prepare a function that returns the feature importances. As in the article I referenced, I used LightGBM.
def get_feature_importances(data, cat_features, shuffle, seed=None):
    # Get the features (everything except the target)
    train_features = [f for f in data.columns if f != 'SalePrice']
    # Shuffle the target variable if requested; LightGBM uses the underlying
    # values, so the shuffled order is what the model sees
    y = data['SalePrice'].copy()
    if shuffle:
        y = data['SalePrice'].copy().sample(frac=1.0, random_state=seed)
    # Train with LightGBM
    dtrain = lgb.Dataset(data[train_features], y, free_raw_data=False, silent=True)
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'regression',
        'metric': {'l2'},
        'num_leaves': 128,
        'learning_rate': 0.01,
        'feature_fraction': 0.38,
        'bagging_fraction': 0.68,
        'bagging_freq': 5,
        'verbose': 0
    }
    clf = lgb.train(params=params, train_set=dtrain, num_boost_round=100,
                    categorical_feature=cat_features)
    # Get the feature importances (split counts by default)
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(train_features)
    imp_df["importance"] = clf.feature_importance()
    return imp_df
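Note that clf.feature_importance() returns split counts by default; LightGBM can also report total gain, which often separates signal from noise more sharply. A standalone toy illustration of the two importance types (my own example, not from the referenced article):
import numpy as np
import pandas as pd
import lightgbm as lgb

X = pd.DataFrame(np.random.rand(200, 3), columns=['a', 'b', 'c'])
y = 3 * X['a'] + 0.1 * np.random.rand(200)  # only 'a' carries signal
booster = lgb.train({'objective': 'regression', 'verbose': -1},
                    lgb.Dataset(X, y), num_boost_round=50)
print(booster.feature_importance(importance_type='split'))  # splits per feature
print(booster.feature_importance(importance_type='gain'))   # total gain per feature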
Create the null importance distribution by repeating the shuffled training runs.
null_imp_df = pd.DataFrame()
nb_runs = 80
start = time.time()
for i in range(nb_runs):
    imp_df = get_feature_importances(data=data, cat_features=cat_features, shuffle=True)
    imp_df['run'] = i + 1
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
print(f'{nb_runs} runs finished in {time.time() - start:.1f} s')
actual_imp_df = get_feature_importances(data=data, cat_features=cat_features, shuffle=False)
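The matplotlib import above can be used to compare a feature's null distribution with its actual importance, as the referenced article does. A minimal sketch of my own ('OverallQual' is just an example feature):
def display_distribution(actual_imp_df, null_imp_df, feature):
    # Histogram of the null importances with the actual importance overlaid
    null_imps = null_imp_df.loc[null_imp_df['feature'] == feature, 'importance']
    actual_imp = actual_imp_df.loc[actual_imp_df['feature'] == feature, 'importance'].mean()
    plt.figure(figsize=(8, 4))
    plt.hist(null_imps, bins=20, label='null importances')
    plt.axvline(actual_imp, color='r', linestyle='--', label='actual importance')
    plt.title(f'Null importance distribution: {feature}')
    plt.legend()
    plt.show()

display_distribution(actual_imp_df, null_imp_df, 'OverallQual')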
Score each feature by the logarithm of its actual importance divided by the 75th percentile of its null importance distribution.
feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance'].values
    f_act_imp = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance'].mean()
    imp_score = np.log(1e-10 + f_act_imp / (1 + np.percentile(f_null_imps, 75)))
    feature_scores.append((_f, imp_score))
scores_df = pd.DataFrame(feature_scores, columns=['feature', 'imp_score'])
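For intuition, a quick worked example of the score (my own numbers):
# Actual importance 100 against a null 75th percentile of 24:
print(np.log(1e-10 + 100 / (1 + 24)))  # log(4) ~ 1.39, comfortably positive
# Actual importance 0 (the feature never beats its nulls):
print(np.log(1e-10 + 0 / (1 + 24)))    # log(1e-10) ~ -23.0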
Set an appropriate threshold and select the features. This time, I decided to keep the features with a score of 0.5 or higher.
sorted_features = scores_df.sort_values(by=['imp_score'], ascending=False).reset_index(drop=True)
new_features = sorted_features.loc[sorted_features.imp_score >= 0.5, 'feature'].values
print(new_features)
# ['CentralAir' 'GarageCars' 'OverallQual' 'HalfBath' 'OverallCond' 'BsmtFullBath']
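The KFold and mean_squared_error imports above can be used to sanity-check the selection, for example by comparing cross-validated RMSE with all features versus the selected ones. A sketch of my own (not from the referenced article):
def cv_rmse(features):
    # 5-fold CV RMSE of a plain LightGBM model on the given feature subset
    kf = KFold(n_splits=5, shuffle=True, random_state=123)
    X = data[list(features)]
    rmses = []
    for train_idx, valid_idx in kf.split(X):
        dtrain = lgb.Dataset(X.iloc[train_idx], target.iloc[train_idx])
        booster = lgb.train({'objective': 'regression', 'verbose': -1},
                            dtrain, num_boost_round=200)
        preds = booster.predict(X.iloc[valid_idx])
        rmses.append(np.sqrt(mean_squared_error(target.iloc[valid_idx], preds)))
    return np.mean(rmses)

all_features = [f for f in data.columns if f != 'SalePrice']
print('all features     :', cv_rmse(all_features))
print('selected features:', cv_rmse(new_features))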
I checked this method because the top prize winners used it in a competition I participated in recently. There are various other feature selection methods as well, so I would like to investigate those too.