I participated in the 1st Beginner Limited Competition (https://signate.jp/competitions/292) held on SIGNATE in August. It was the first competition I worked on seriously, and my final score was AUC = 0.8588949, which put me in 13th place (although the result feels rather half-finished ...). In this competition, anyone who scored above a certain threshold was promoted from Beginner to Intermediate, and I was promoted successfully.
I would like to summarize what I did and what I should reflect on, for my own future reference.
The model and analysis results of this competition are disclosed in accordance with the information disclosure policy.
The data is campaign data for time deposits at a financial institution. The source of the data is here, but I think it has been slightly processed. The evaluation metric is AUC. See the link above for details.
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.13.6
BuildVersion: 17G14019
$ python --version
Python 3.7.3
It's like washing your hands before cooking, but it matters: without it, results may not be reproducible later. Whenever you use a function that takes a `random_seed` or `random_state` argument, be sure to set it so that the result can be reproduced.
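As a minimal sketch of the idea, the global random number generators can also be seeded in one place. The helper name `seed_everything` is hypothetical and not from the original notebooks. It only covers Python's `random` module and NumPy's global generator; estimators that take `random_state` still need the value passed explicitly.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)


# Seeding twice with the same value reproduces the same draws.
seed_everything(1234)
first = (random.random(), np.random.rand(3))
seed_everything(1234)
second = (random.random(), np.random.rand(3))
same = first[0] == second[0] and np.array_equal(first[1], second[1])
print(same)
```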
I loaded the data into H2O to inspect it and to see which kinds of algorithms rank highest when it is run through AutoML. Please see Past Articles for H2O. Decision-tree-based algorithms came out on top, so I decided to go with LightGBM from here on.
I prepared separate notebook files for data processing and for model building (with a single file, readability suffers and unnecessary processing may be re-run every time).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
%matplotlib inline
pd.set_option('display.max_columns', None)
random_state = 1234
df = pd.read_csv('./0_rawdata/train.csv')
Here is some code for checking the data. Check the data types and whether there are any nulls ↓
df.info()
df.describe()
Visualization of numerical data ↓
df.hist( figsize=(14, 10), bins=20)
Visualization of character string data ↓
plt.figure(figsize=(20, 15))
cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
# One bar chart of value counts per categorical column, laid out on a 3x3 grid
for i, col in enumerate(cols):
    plt.subplot(3, 3, i + 1)
    df[col].value_counts().plot.bar()
    plt.title(col)
In the visualization above, `id` of course, but also `balance` and `pdays`, appeared to be uniformly distributed, so I removed them from the data used for training later.
Since `default` was "no" for most of the data, I removed it as well.
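The judgment that a column such as `default` is dominated by a single value can also be checked numerically rather than by eye. The following is a rough sketch on toy data: the helper `dominant_share`, the toy values, and the 0.95 threshold are all assumptions for illustration, not what was actually used in the competition.

```python
import pandas as pd


def dominant_share(df, cols):
    """Return, per column, the share of rows taken by its most frequent value."""
    shares = {
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in cols
    }
    return pd.Series(shares).sort_values(ascending=False)


# Toy data standing in for the real training set (hypothetical values).
toy = pd.DataFrame({
    'default': ['no'] * 98 + ['yes'] * 2,
    'housing': ['yes'] * 55 + ['no'] * 45,
})
shares = dominant_share(toy, ['default', 'housing'])

# Columns where one value covers more than 95% of rows are drop candidates.
near_constant = shares[shares > 0.95].index.tolist()
print(shares)
print(near_constant)
```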
In addition, I created the training data by converting the string and categorical columns to numbers.
df2 = df.copy()
df2 = df2.drop( columns=['id', 'balance', 'pdays', 'default'])
# month
month_map={
'jan':1,
'feb':2,
'mar':3,
'apr':4,
'may':5,
'jun':6,
'jul':7,
'aug':8,
'sep':9,
'oct':10,
'nov':11}
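# Caution: fillna(0) runs before map(), so the 0 it fills in is not a key of
# month_map and turns back into NaN; any month missing from the map
# (e.g. 'dec') also ends up as NaN.
df2['month'] = df2['month'].fillna(0)
df2['month'] = df2['month'].map(month_map)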
# job, marital, education, housing, loan, contact, poutcome
cols = ['job', 'marital', 'education', 'housing', 'loan', 'contact', 'poutcome']
ce_onehot = ce.OneHotEncoder(cols=cols,handle_unknown='impute')
ce_onehot.fit( df2 )
df2 = ce_onehot.transform( df2 )
df2['duration'] = df2['duration'] / 3600
df2.to_csv('mytrain.csv', index=False)
import pandas as pd
import numpy as np
import category_encoders as ce
import lightgbm as lgb
#import optuna
from optuna.integration import lightgbm as lgb_optuna
from sklearn import preprocessing
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_validate
from sklearn.metrics import roc_auc_score
pd.set_option('display.max_columns', None)
random_state = 1234
version = 'v1'
Split the data into training and hold-out validation sets (8:2).
df_train = pd.read_csv('mytrain.csv')
X = df_train.drop( columns=['y'] )
y = df_train['y']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=random_state)
The following approach was used for model building and accuracy verification.
--Cross-validation with the training data split into 5 folds by stratified sampling
--Hyperparameter tuning is left to Optuna
--The metric used for optimization is logloss
--Retrain the model on the entire training data and calculate the AUC on the hold-out data
def build():
    # 5-fold stratified cross-validation
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    lgb_train = lgb_optuna.Dataset(X_train, y_train)
    lgbm_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'random_state': random_state,
        'verbosity': 0
    }
    # Let Optuna's LightGBM integration tune the hyperparameters
    tunecv = lgb_optuna.LightGBMTunerCV(
        lgbm_params,
        lgb_train,
        num_boost_round=100,
        early_stopping_rounds=20,
        seed=random_state,
        verbose_eval=20,
        folds=kf
    )
    tunecv.run()
    print('Best score = ', tunecv.best_score)
    print('Best params= ', tunecv.best_params)
    return tunecv

tunecv = build()
Retrain the model with the entire training data and calculate the AUC using the verification data ↓
train_data = lgb.Dataset( X_train, y_train )
eval_data = lgb.Dataset(X_holdout, label=y_holdout, reference= train_data)
clf = lgb.train( tunecv.best_params,
train_data,
valid_sets=eval_data,
num_boost_round=50,
verbose_eval=0
)
y_pred = clf.predict( X_holdout )
print('AUC: ', roc_auc_score(y_holdout, y_pred))
# AUC: 0.8486429810797091
| # | What I did | AUC | Submit score | Impressions |
|---|---|---|---|---|
| 00 | Use the process above as the baseline | 0.8486 | --- | --- |
| 01 | Change the encoding of job, marital, education, and poutcome to target encoding | 0.8458 | --- | It went down slightly, but I'll go with this for now. |
| 02 | num_boost_round=200 (the learning curve suggested the score could still improve a little) | 0.8536 | --- | It went up. Go with this. |
| 03 | Noticed that the parameters used when retraining on the entire training data differed from those used for hyperparameter tuning. Unified both to num_boost_round=200 and early_stopping_rounds=20. | 0.8585 | --- | Go with this. |
| 04 | Try AUC as the optimization metric | 0.8557 | --- | It went down. Keep logloss. |
| 05 | Change loan, housing, and contact to ordinal encoding | 0.8593 | 0.8556 | The AUC went up, so I'll go with this. However, the submit score is a little low. |
| 06 | Check the difference between the test data and the training data. A visual comparison shows no big difference. I also built a model that predicts whether a row is test data, but its AUC was about 0.5, so I judged that the test data and training data do not differ. | --- | --- | --- |
| 07 | Change the encoding of month (combine several months with a small amount of data) | 0.8583 | 0.8585 | Almost the same as the AUC of 03. Rejected. |
| 08 | Change the encoding of month (combine several months with a small amount of data) | 0.8583 | 0.8585 | AUC dropped from 05. Rejected. |
| 09 | Add the previous month's mean of y as a column, like a time-series lag variable | 0.8629 | 0.8559 | The score improved on the training data, but the test score dropped, so rejected. |
| 10 | Bin age into categories (merging ages with few rows) | 0.8599 | 0.8588 | Slightly improved. I'll go with this. |
| 11 | Try applying PCA | 0.8574 | --- | It went down. |
| 12 | Try other algorithms (SVM, RandomForest, LogisticRegression) | --- | --- | It went down. |
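The check in row 06 (a model that tries to tell test rows from training rows) might look roughly like the sketch below. The original post does not say which model was used, so the RandomForestClassifier, the helper name `adversarial_auc`, and the synthetic data are all assumptions. An AUC near 0.5 means the classifier cannot separate the two sets, while a value well above 0.5 suggests a distribution shift.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def adversarial_auc(train_X, test_X, random_state=1234):
    """Mean CV AUC of a classifier that predicts whether a row is test data."""
    X = pd.concat([train_X, test_X], ignore_index=True)
    # Label: 0 for training rows, 1 for test rows
    is_test = np.r_[np.zeros(len(train_X), dtype=int),
                    np.ones(len(test_X), dtype=int)]
    clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=20,
                                 random_state=random_state)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    return cross_val_score(clf, X, is_test, cv=cv, scoring='roc_auc').mean()


# Synthetic stand-ins for the real feature tables (hypothetical data).
rng = np.random.default_rng(1234)
cols = list('abcde')
toy_train = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)
toy_test = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)

auc_same = adversarial_auc(toy_train, toy_test)           # same distribution
auc_shifted = adversarial_auc(toy_train, toy_test + 1.0)  # shifted features
print(auc_same, auc_shifted)
```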
I also tried various other small changes besides the above, but the accuracy did not improve. Recording every trial also became tedious ... and before I knew it, the competition period was over.
--Data processing
  --If I had looked at the data more closely, for example with a few more cross tabulations, I might have discovered something.
  --Try merging with the original data (UCI) (it has probably been partly processed, so some ingenuity is required)
  --Consider interaction terms
--Model
  --I could have tried an ensemble of LightGBM models trained with different random_state values.
--Interpretation
  --I should have dug deeper into the cases where accuracy was poor (I did find that age could be categorized, but I would have liked to go a little further)
--Tools and others
  --I should have used a code management tool; even just Git would do
  --Similarly, I should have introduced an experiment management tool (MLOps-style)
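The seed-ensemble idea from the list above could be sketched as follows. To keep the example light on dependencies it swaps in scikit-learn's GradientBoostingClassifier for LightGBM and uses synthetic data, so the helper `seed_average_predict`, the model settings, and the seeds are assumptions, not something that was run in the competition. Row and feature subsampling are turned on so that the seed actually changes each model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def seed_average_predict(X_tr, y_tr, X_va, seeds):
    """Train one model per seed and average the predicted probabilities."""
    preds = []
    for seed in seeds:
        model = GradientBoostingClassifier(
            n_estimators=50,
            subsample=0.8,      # row subsampling makes each seed differ
            max_features=0.8,   # feature subsampling per split
            random_state=seed,
        )
        model.fit(X_tr, y_tr)
        preds.append(model.predict_proba(X_va)[:, 1])
    return np.mean(preds, axis=0), preds


# Synthetic binary classification data (hypothetical stand-in).
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   n_informative=5, random_state=1234)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1234)

avg_pred, single_preds = seed_average_predict(X_tr, y_tr, X_va,
                                              seeds=[0, 1, 2, 3, 4])
ensemble_auc = roc_auc_score(y_va, avg_pred)
print('Ensemble AUC:', ensemble_auc)
```

Averaging mainly reduces the variance that comes from the random subsampling, so any gain is usually small; whether it would have helped here is untested.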
There are surely many other things I should have done, and I would appreciate any comments. For the next competition, I would like to incorporate new techniques while referring to this reflection and to Kaggle kernels.