In this article, we introduce the Predictive Power Score, an index for feature selection released in April this year, and ppscore, the library that implements it.
First, some background: when building a prediction model, choosing which features (explanatory variables) to use is called feature selection. It is done in order to:

- Reduce noise from data with no predictive value
- Reduce redundant data and thereby cut computational costs
Feature selection methods can be roughly classified as follows. The Predictive Power Score formally belongs to the Wrapper methods, but its implementation also gives it the advantages of a Filter method.
Method | Overview | Characteristics |
---|---|---|
Filter | Select features by computing statistics on the data itself and cutting off at a threshold | Relatively the least computationally expensive; suited to large datasets |
Embedded | Perform feature selection and model building at the same time, e.g. via regularization | Intermediate between the Filter and Wrapper methods |
Wrapper | Select features useful for prediction by repeatedly building models and trying feature subsets | Because models are actually built, features useful for prediction can be chosen precisely, but the computational cost is high |
Predictive Power Score
Predictive Power Score (hereafter PPS) is a concept developed by 8080 Labs, a software company based in Germany. It is being developed with the motivation of creating a more universal index that can be used in the same way as the matrix of Pearson correlation coefficients often used in EDA.
PPS has the following features.
- It is applicable to both categorical and numerical variables.
- PPS takes a value between 0 and 1: 0 means the feature x has no power to predict the target y, and 1 means x predicts y perfectly.
- The Pearson correlation coefficient captures only the linear relationship between x and y, whereas PPS can also evaluate non-linear relationships.
- However, interactions between variables are not considered.
- As described later, a simple model is built for each score, so the calculation is slower than a Filter method, but faster than typical Wrapper approaches such as variable importance or Permutation Importance.
- Although PPS values lie between 0 and 1, comparing PPS values calculated for different targets has no strict mathematical meaning.
- The calculation of PPS is implemented only with MAE and F1; if you want to try other metrics, you need to implement them yourself.
The defining formulas mentioned in the second point above are shown in the table below.
Task | Definition of PPS |
---|---|
Regression | PPS = 1 - (MAE_model / MAE_naive) |
Classification | PPS = (F1_model - F1_naive) / (1 - F1_naive) |
MAE_model is the MAE when predicting y from x, and MAE_naive is the MAE when always predicting the median of y. The naive score provides the baseline that normalizes PPS into the 0-to-1 range. In the classification case, F1_naive is the weighted F1 obtained by always predicting the most frequent class.
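To make the regression formula concrete, here is a minimal sketch on toy data, using a single-variable decision tree evaluated with cross-validation in the spirit of the library (a sketch, not its exact internals):

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

# toy data with a non-linear relationship between x and y
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.random(500)})
toy["y"] = np.sin(toy["x"] * 10) + rng.normal(0, 0.1, 500)

# MAE_model: cross-validated predictions of a single-feature decision tree
pred = cross_val_predict(DecisionTreeRegressor(), toy[["x"]], toy["y"], cv=4)
mae_model = mean_absolute_error(toy["y"], pred)

# MAE_naive: always predict the median of y
mae_naive = mean_absolute_error(toy["y"], np.full(len(toy), toy["y"].median()))

pps_regression = 1 - mae_model / mae_naive
print(pps_regression)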
You may be wondering at this point: how exactly is the "prediction" made?
Equivalent to a Wrapper method, but with the advantages of a Filter method
As mentioned above, the calculation of PPS falls under the Wrapper methods in the feature selection taxonomy: the prediction is made by building a decision tree model, and the score is mediated by that model (with cross-validation used when the score is calculated). However, once PPS has been calculated, the narrowed-down feature set can be fed into a more complex model, so it can also serve as a Filter method. This works because, as the developers note, the score is computed with a simple decision tree on a single variable, and a decision tree itself trains faster than SVMs, GBDTs, NNs, and the like. Decision trees were also chosen because they capture non-linear relationships and offer relatively robust predictive performance.
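Concretely, the classification case can be sketched the same way: a single-variable decision tree evaluated by cross-validation, a most-frequent-class baseline, and weighted F1, continuing the toy data above (again a sketch, not the library's exact internals):

from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# an imbalanced binary target derived from x
toy["label"] = (toy["x"] > 0.7).astype(int)

# F1_model: cross-validated predictions of a single-feature decision tree
pred_cls = cross_val_predict(DecisionTreeClassifier(), toy[["x"]], toy["label"], cv=4)
f1_model = f1_score(toy["label"], pred_cls, average="weighted")

# F1_naive: always predict the most frequent class
naive = DummyClassifier(strategy="most_frequent").fit(toy[["x"]], toy["label"])
f1_naive = f1_score(toy["label"], naive.predict(toy[["x"]]), average="weighted")

pps_classification = (f1_model - f1_naive) / (1 - f1_naive)
print(pps_classification)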
- Exclude features with low PPS, just as you would with variable importance (see the sketch after this list).
- Build a PPS matrix between features: pairs with high mutual PPS likely contain redundant information, much like spotting multicollinearity in a correlation matrix, so keep only the more important of the pair.
- If you want to select features more precisely, you can greedily add or remove variables, or try adding or removing them at random.
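A hedged sketch of this screening workflow, assuming a DataFrame df with target column "y" and the wide matrix returned by this version of ppscore (rows are the target, columns the feature, as in the heatmap later in this article); the thresholds are illustrative, not official values:

import ppscore as pps

matrix = pps.matrix(df)                              # pairwise PPS for all columns
target_pps = matrix.loc["y"].drop("y")               # PPS of each feature for the target
keep = target_pps[target_pps > 0.05].index.tolist()  # drop features with low PPS

# kept features that predict each other strongly may carry redundant information
redundant = [(a, b) for a in keep for b in keep
             if a != b and matrix.loc[b, a] > 0.8]
print(keep, redundant)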
Find patterns in the data

Because PPS can be calculated for both categorical and numerical variables, it is convenient for finding relationships, including non-linear ones, between all kinds of variables.

Detecting data leakage

If one feature's PPS is conspicuously higher than the others', you can suspect it of causing leakage, that is, of containing information that would not be available at prediction time.
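Reusing target_pps from the screening sketch above, a feature whose PPS stands far above the rest can be flagged for inspection (0.9 is an illustrative cutoff):

# candidates to check for information unavailable at prediction time
leak_suspects = target_pps[target_pps > 0.9].index.tolist()
print(leak_suspects)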
The library can be installed with pip. The basic API consists of pps.score, which computes the PPS of one feature for one target, and pps.matrix, which computes it for every pair of columns.

pip install ppscore

import ppscore as pps
pps.score(df, "feature_column", "target_column")
pps.matrix(df)
Telco Customer Churn is a dataset of customer information and churn records for an Internet service. The environment is a Kaggle notebook: the dataset page has a blue "New Notebook" button that launches a notebook with immediate access to the dataset.
Install it.
!pip install ppscore
Import the library and load the data.
import numpy as np
import pandas as pd
import ppscore as pps
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
PATH = '/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(PATH)
df.shape
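(7043, 21)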
Check the column names.
list(df.columns)
Churn is the target.
['customerID',
'gender',
'SeniorCitizen',
'Partner',
'Dependents',
'tenure',
'PhoneService',
'MultipleLines',
'InternetService',
'OnlineSecurity',
'OnlineBackup',
'DeviceProtection',
'TechSupport',
'StreamingTV',
'StreamingMovies',
'Contract',
'PaperlessBilling',
'PaymentMethod',
'MonthlyCharges',
'TotalCharges',
'Churn']
Check the data type.
df.dtypes
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object
First, calculate the PPS for a single pair of columns.

pps.score(df, 'InternetService', 'Churn')
The result is returned as a dictionary.
{'x': 'InternetService',
'y': 'Churn',
'task': 'classification',
'ppscore': 1.625853361551631e-07,
'metric': 'weighted F1',
'baseline_score': 0.6235392486748098,
'model_score': 0.6235393098818076,
'model': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')}
Note that the ppscore here is nearly zero: model_score barely improves on baseline_score, so InternetService alone has almost no power to predict Churn.

Next, compute the PPS matrix for all pairs of columns and draw it as a heatmap (code below). Looking at the Churn row, which shows the PPS of each feature with respect to the target, the scores of tenure, MonthlyCharges, and TotalCharges stand out. These correspond to the length of service use, the monthly fee, and the cumulative fee, all features closely tied to churn.

Looking at the MonthlyCharges row, the PPS values from InternetService through StreamingTV show strong contrast. As the data types above show, these features are categorical variables, so it is convenient to see their relationships with the numerical variable MonthlyCharges at a glance. The interpretation is straightforward: which Internet service options a customer subscribes to is strongly related to the fee. Furthermore, looking at the PPS among the features from InternetService through StreamingTV themselves, the dark contrast suggests they carry similar information to one another, so dimensionality reduction could be considered.
Calculating the matrix for this (7043, 21) DataFrame took 1 minute 55 seconds. That is not exactly fast, but with data on the order of 10,000 rows it is a tolerable wait, and as the data grows you can sample rows to get a rough trend.
df_matrix = pps.matrix(df)
plt.figure(figsize=(18,18))
sns.heatmap(df_matrix, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
plt.show()
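If the dataset grows further, here is a small sketch of the sampling idea mentioned above (5,000 rows is an illustrative size):

# compute the matrix on a random subset of rows to get a rough trend faster
df_sample = df.sample(n=5000, random_state=0)
sns.heatmap(pps.matrix(df_sample), vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
plt.show()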
We have now introduced the Predictive Power Score (PPS). It is easy to apply to data and makes relationships between variables easy to visualize, so it is useful for both EDA and feature selection. The ppscore implementation calculates PPS with MAE and F1, but you could try other metrics while keeping the underlying idea of PPS.
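For example, here is a hedged sketch of swapping RMSE in for MAE while keeping the same normalization idea, reusing toy and pred from the regression sketch earlier (this is not part of the ppscore API):

from sklearn.metrics import mean_squared_error

# RMSE of the model's cross-validated predictions vs. the median baseline
rmse_model = mean_squared_error(toy["y"], pred, squared=False)
rmse_naive = mean_squared_error(toy["y"], np.full(len(toy), toy["y"].median()), squared=False)

pps_rmse = 1 - rmse_model / rmse_naive  # same 0-to-1 normalization as the MAE version
print(pps_rmse)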