As I recall, Target Encoding became a hot topic around the end of 2019 to the beginning of 2020.
Target Encoding replaces a categorical variable with the average value of the objective variable for each category. If you do this naively, a leak occurs, because the target of the row being encoded is included in its own feature, so some care is needed. To prevent leaks, you can use measures such as the Leave One Out method, which uses the average over all rows other than the row being converted, or K-fold splitting, which replaces each value with the average computed from the folds other than the one that contains the target row.
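To make the leak concrete, here is a minimal sketch of the naive version on a tiny, made-up data frame (the data and the variable name `naive` are my own assumptions, not from the referenced site). Category B appears only once, so its encoded value is exactly its own target.

```python
import pandas as pd

# Hypothetical toy data: category B has a single row
df = pd.DataFrame({
    "Feature": ["A", "A", "A", "B"],
    "Target":  [1, 0, 1, 1],
})

# Naive target encoding: every row gets the mean of its category,
# and that mean includes the row's own target value (this is the leak)
naive = df.groupby("Feature")["Target"].transform("mean")

print(naive.tolist())
```

For the single B row, the feature simply copies the answer, which is why the out-of-fold or leave-one-out variants are used instead.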
There are already many explanations of Target Encoding, so I will not explain it in detail in this article (for example, the site linked below is very helpful). Instead, I take the nice Target Encoding code from that site, actually run it, and see how it works.
- Python: How to use Target Encoding (https://blog.amedama.jp/entry/target-mean-encoding-types#Leave-one-out-TS-%E4%BD%BF%E3%81%A3%E3%81%A1%E3%82%83%E3%83%80%E3%83%A1)
A function that creates a sample data frame.
import numpy as np
import pandas as pd
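def getRandomDataFrame(data, numCol):
    # numCol is the number of rows to generate (the name is kept from the original code)
    if data == 'train':
        # Random category (A or B) and random binary target
        key = ["A" if x == 0 else 'B' for x in np.random.randint(2, size=(numCol,))]
        value = np.random.randint(2, size=(numCol,))
        df = pd.DataFrame({'Feature': key, 'Target': value})
        return df
    elif data == 'test':
        # Test data has no Target column
        key = ["A" if x == 0 else 'B' for x in np.random.randint(2, size=(numCol,))]
        df = pd.DataFrame({'Feature': key})
        return df
    else:
        raise ValueError("data must be 'train' or 'test'")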
You can generate a data frame with the following code. If 'test' is specified as the first argument, the objective variable column is not generated. The second argument specifies the number of rows.
train = getRandomDataFrame('train', 10)
test = getRandomDataFrame('test', 10)
The contents are as shown in the figure below.
K-fold Target Encoding
These are the classes for K-fold Target Encoding. They have fit and transform methods, so they can be used in the same way as scikit-learn's preprocessing classes. The encoder for test data takes the encoded train data as input and adds the Target Encoding feature to the test data.
In addition, the line marked (1) in the comments fills the rows that became nan during the K-fold processing with the overall average value. We will look at this later.
from sklearn import base
from sklearn.model_selection import KFold

class KFoldTargetEncoderTrain(base.BaseEstimator,
                              base.TransformerMixin):
    """How to use.
    targetc = KFoldTargetEncoderTrain('Feature','Target',n_fold=5)
    new_train = targetc.fit_transform(train)
    """
    def __init__(self, colnames, targetName,
                 n_fold=5, verbosity=True,
                 discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.n_fold = n_fold
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert(type(self.targetName) == str)
        assert(type(self.colnames) == str)
        assert(self.colnames in X.columns)
        assert(self.targetName in X.columns)
        mean_of_target = X[self.targetName].mean()
        # random_state only matters when shuffle=True; recent scikit-learn
        # versions raise an error if it is set together with shuffle=False
        kf = KFold(n_splits=self.n_fold, shuffle=False)
        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan
        for tr_ind, val_ind in kf.split(X):
            X_tr, X_val = X.iloc[tr_ind], X.iloc[val_ind]
            # Encode the held-out fold with category means from the other folds
            X.loc[X.index[val_ind], col_mean_name] = X_val[self.colnames].map(
                X_tr.groupby(self.colnames)[self.targetName].mean())
        X[col_mean_name] = X[col_mean_name].fillna(mean_of_target)  # Fill in the places that became nan with the average value--(1)
        if self.verbosity:
            encoded_feature = X[col_mean_name].values
            print('Correlation between the new feature, {} and, {} is {}.'.format(
                col_mean_name, self.targetName,
                np.corrcoef(X[self.targetName].values, encoded_feature)[0][1]))
        if self.discardOriginal_col:
            # As in the original code, this drops the target column
            X = X.drop(self.targetName, axis=1)
        return X
class TargetEncoderTest(base.BaseEstimator, base.TransformerMixin):
    """How to use.
    test_targetc = TargetEncoderTest(new_train,
                                     'Feature',
                                     'Feature_Kfold_Target_Enc')
    new_test = test_targetc.fit_transform(test)
    """
    def __init__(self, train, colNames, encodedName):
        self.train = train
        self.colNames = colNames
        self.encodedName = encodedName

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Average of the encoded train feature for each category
        mean = self.train[[self.colNames, self.encodedName]].groupby(self.colNames).mean().reset_index()
        dd = {}
        for index, row in mean.iterrows():
            dd[row[self.colNames]] = row[self.encodedName]
        # Copy the category column, then replace each category with its average
        X[self.encodedName] = X[self.colNames]
        X = X.replace({self.encodedName: dd})
        return X
Use them as follows. In the constructor of KFoldTargetEncoderTrain, specify the column name of the categorical variable to encode, the column name of the objective variable, and the number of folds. In the constructor of TargetEncoderTest, specify the encoded train data frame, the column name of the categorical variable that was encoded, and the column name of the Target Encoding feature ([categorical variable column name]_Kfold_Target_Enc).
targetc = KFoldTargetEncoderTrain('Feature','Target',n_fold=5)
new_train = targetc.fit_transform(train)
test_targetc = TargetEncoderTest(new_train, 'Feature', 'Feature_Kfold_Target_Enc')
new_test = test_targetc.fit_transform(test)
Each has the following contents.
Let's check new_train. Since this is 5-fold, the 10 rows are divided into 5 folds of 2 rows each. The first fold consists of the first and second rows from the top. To encode these rows, look at the combined data of the other four folds, that is, rows 3 to 10. There, the average value of Target in each group is 3/4 = 0.75 for A and 1/4 = 0.25 for B. These values are used to encode the first fold. The first and second rows are both A, so both are encoded as 0.75. The same procedure is performed for all folds.
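The calculation for the first fold can be reproduced by hand. The sketch below uses a data frame I made up so that rows 3 to 10 give 3/4 for A and 1/4 for B as in the walkthrough; it is not the actual random data from the figure. I use `np.array_split` as a stand-in for an unshuffled 5-fold split of 10 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen to match the walkthrough for the first fold
df = pd.DataFrame({
    "Feature": ["A", "A", "B", "A", "B", "A", "B", "A", "B", "A"],
    "Target":  [1, 0, 0, 1, 1, 1, 0, 0, 0, 1],
})

# Five consecutive folds of two rows each (no shuffling)
folds = np.array_split(np.arange(len(df)), 5)
val_ind = folds[0]                   # rows 1-2: the fold to encode
tr_ind = np.concatenate(folds[1:])   # rows 3-10: the other four folds

# Category means of Target computed only from the other folds
fold_means = df.iloc[tr_ind].groupby("Feature")["Target"].mean()

# Encode the first fold with those means
encoded_first_fold = df.iloc[val_ind]["Feature"].map(fold_means)

print(fold_means)
print(encoded_first_fold.tolist())
```

Because the targets of rows 1 and 2 are never used for their own encoding, the feature does not leak their answers.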
Let's check new_test. Test data is encoded, for each category, with the average of the Target Encoding feature over the train data. A is (0.75 + 0.75 + 0.6 + 0.8 + 0.5 + 0.5) / 6 = 0.65, and B is (0.3333333333333333 + 0.3333333333333333 + 0.0 + 0.0) / 4 ≈ 0.1667.
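As a quick check of this arithmetic, the following sketch rebuilds only the encoded column from the values quoted above and averages it per category. The row order and the contents of the small `test` frame are my own assumptions for illustration.

```python
import pandas as pd

# Encoded train values quoted in the text (row order is assumed)
new_train = pd.DataFrame({
    "Feature": ["A"] * 6 + ["B"] * 4,
    "Feature_Kfold_Target_Enc": [0.75, 0.75, 0.6, 0.8, 0.5, 0.5,
                                 1 / 3, 1 / 3, 0.0, 0.0],
})

# Hypothetical test rows
test = pd.DataFrame({"Feature": ["A", "B", "B", "A"]})

# Average of the encoded train feature per category
category_means = new_train.groupby("Feature")["Feature_Kfold_Target_Enc"].mean()

# Look up each test category in those averages
test["Feature_Kfold_Target_Enc"] = test["Feature"].map(category_means)

print(category_means)
print(test)
```

Unlike the `replace` call in TargetEncoderTest, `map` in this sketch turns a category that never appeared in train into nan rather than leaving the original string.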
Next, consider the case where nan occurs.
train = getRandomDataFrame('train', 10)
train.loc[train.index[0], 'Feature'] = "C"  # .loc avoids chained assignment, which may not modify train
With this data, encoding the first row requires the mean value of group C in the remaining folds, but C does not appear in any of them, so the Target Encoding feature becomes nan. It is therefore filled with the average of the objective variable over all rows: C is (1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1) / 10 = 0.5.
By the way, if you comment out line (1), the value is left as np.nan. LightGBM can train and predict even with nan, so it may be better not to fill in with the average value.
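The nan case can also be checked by hand. In the sketch below, the Target values are the ten values from the calculation above, while the Feature column (apart from the C in the first row) is made up for illustration. It shows both options: leaving nan as it is, or filling it with the overall mean.

```python
import numpy as np
import pandas as pd

# Targets are taken from the text; the A/B assignment is hypothetical
df = pd.DataFrame({
    "Feature": ["C", "A", "A", "B", "B", "A", "B", "A", "A", "B"],
    "Target":  [1, 0, 1, 0, 0, 0, 0, 1, 1, 1],
})

val = df.iloc[:2]   # first fold (rows 1-2)
tr = df.iloc[2:]    # remaining four folds (rows 3-10)

# C does not appear in the remaining folds, so it has no mean there
fold_means = tr.groupby("Feature")["Target"].mean()

# Option 1: leave the missing category as nan
encoded = val["Feature"].map(fold_means)

# Option 2: fill nan with the mean of Target over all rows
filled = encoded.fillna(df["Target"].mean())

print(encoded.tolist())
print(filled.tolist())
```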
Leave-one-out Target Encoding
It is said that this method should not be used because it leaks more than K-fold Target Encoding. However, since I am at it, I'll put the code here as well.
class LOOTargetEncoderTrain(base.BaseEstimator,
                            base.TransformerMixin):
    """How to use.
    targetc = LOOTargetEncoderTrain('Feature','Target')
    new_train = targetc.fit_transform(train)
    """
    def __init__(self, colnames, targetName,
                 verbosity=True, discardOriginal_col=False):
        self.colnames = colnames
        self.targetName = targetName
        self.verbosity = verbosity
        self.discardOriginal_col = discardOriginal_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert(type(self.targetName) == str)
        assert(type(self.colnames) == str)
        assert(self.colnames in X.columns)
        assert(self.targetName in X.columns)
        # The column name suffix is kept from the K-fold version
        col_mean_name = self.colnames + '_' + 'Kfold_Target_Enc'
        X[col_mean_name] = np.nan
        # Sum and count of the target for each category
        self.agg_X = X.groupby(self.colnames).agg({self.targetName: ['sum', 'count']})
        X[col_mean_name] = X.apply(self._loo_ts, axis=1)
        return X

    def _loo_ts(self, row):
        # Mean of the target over the category, excluding the row itself
        group_ts = self.agg_X.loc[row[self.colnames]]
        loo_sum = group_ts.loc[(self.targetName, 'sum')] - row[self.targetName]
        loo_count = group_ts.loc[(self.targetName, 'count')] - 1
        return loo_sum / loo_count
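The same leave-one-out calculation can also be written without a row-wise apply. This is a minimal sketch on made-up data, and the column name `Feature_LOO_Target_Enc` is my own choice. A category with only one row has nothing left after removing the row itself, so I leave it as nan here.

```python
import pandas as pd

# Hypothetical toy data: category C has a single row
df = pd.DataFrame({
    "Feature": ["A", "A", "A", "B", "B", "C"],
    "Target":  [1, 0, 1, 0, 1, 1],
})

grp = df.groupby("Feature")["Target"]

# Category sum and count with the current row removed
loo_sum = grp.transform("sum") - df["Target"]
loo_count = grp.transform("count") - 1

# Where no other row remains, the denominator becomes nan instead of 0
df["Feature_LOO_Target_Enc"] = loo_sum / loo_count.where(loo_count > 0)

print(df)
```

Even here you can see the weakness: within category A, the rows with Target 1 get 0.5 and the row with Target 0 gets 1.0, so the encoded value still reveals each row's own target.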
This time I tried K-Fold Target Encoding.
When the objective variable is binary, there also seem to be techniques for preventing overfitting, such as smoothing.
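I have not tried it in this article, but one common form of smoothing pulls each category mean toward the overall mean, more strongly for categories with few rows. The sketch below is only an illustration under my own assumptions: the data, the weight `m`, and the column name are all hypothetical, and this is not necessarily the variant used elsewhere.

```python
import pandas as pd

# Hypothetical toy data: category B has only one row
df = pd.DataFrame({
    "Feature": ["A", "A", "A", "A", "B"],
    "Target":  [1, 1, 1, 0, 1],
})

m = 2.0                        # hypothetical smoothing weight
prior = df["Target"].mean()    # overall mean of the target

# Per-category mean and row count
stats = df.groupby("Feature")["Target"].agg(["mean", "count"])

# Weighted average of the category mean and the overall mean
smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

df["Feature_Smoothed_Target_Enc"] = df["Feature"].map(smoothed)

print(smoothed)
```

With this data, the single-row category B moves from a raw mean of 1.0 to about 0.87, while A, which has four rows, barely changes. Smoothing alone does not remove the leak, so in practice it would be combined with the K-fold scheme above.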