SIGNATE: Review of the 1st Beginner Limited Competition

Introduction

I participated in the 1st Beginner Limited Competition (https://signate.jp/competitions/292) held on SIGNATE in August. It was the first competition I really committed to; my final score was AUC = 0.8588949, which put me in 13th place (though it felt like a rather half-finished result...). In this competition, anyone scoring above a certain threshold was promoted from Beginner to Intermediate, and I was successfully promoted.

I would like to summarize, for my future self, what I did and what I should have done differently.

The model and analysis results for this competition are disclosed in accordance with the competition's information disclosure policy.

Overview of the competition

The data is campaign data for time deposits at a financial institution. The original source of the data is here, though I believe it has been slightly processed. The evaluation metric is AUC. See the link above for details.

Environment


$sw_vers 
ProductName:	Mac OS X
ProductVersion:	10.13.6
BuildVersion:	17G14019

$python --version
Python 3.7.3

What I did

0. Determine random_seed

It's like washing your hands before cooking: easy to skip, but important, because otherwise results may not be reproducible later. Whenever a function takes a `random_seed` or `random_state` argument, be sure to set it so that the result can be reproduced.
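
For example, I fix the seed once at the top of the notebook and pass it to everything that accepts it (a minimal sketch; the variable name is just my convention):

import random
import numpy as np

random_state = 1234
random.seed(random_state)
np.random.seed(random_state)

# Then pass it explicitly wherever it is accepted, e.g.:
# train_test_split(X, y, test_size=0.2, random_state=random_state)
# lgb.LGBMClassifier(random_state=random_state)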

1. Let's see what it looks like (H2O)

I loaded the data into H2O, checked the basic data information, and ran AutoML to see which algorithms came out on top. See my past article for an introduction to H2O. While inspecting the data at this point, decision-tree-based algorithms topped the AutoML leaderboard, so I decided to go with LightGBM from here on.
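
For reference, "putting it in H2O" looks roughly like the following minimal sketch (assuming this competition's train.csv with target column y; max_models and seed are illustrative):

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Load the training data and mark the target as a factor for classification
train = h2o.import_file('./0_rawdata/train.csv')
train['y'] = train['y'].asfactor()

# Let AutoML try a range of algorithms and rank them by AUC,
# excluding the id column from the features
features = [c for c in train.columns if c not in ('id', 'y')]
aml = H2OAutoML(max_models=20, seed=1234, sort_metric='AUC')
aml.train(x=features, y='y', training_frame=train)
print(aml.leaderboard)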

2. Create a flow from data acquisition through model construction to prediction (Jupyter Notebook)

I prepared separate notebook files for data processing and for model construction (with a single file, readability suffers and unnecessary processing may run every time).

2-1. Data processing part

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce


%matplotlib inline
pd.set_option('display.max_columns', None)
random_state = 1234

df = pd.read_csv('./0_rawdata/train.csv')

Here is some code for inspecting the data. Check the data types and the presence or absence of nulls ↓

df.info()
df.describe()
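
df.info() already reports the non-null counts; to list the number of missing values per column directly, something like this also works:

df.isnull().sum()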

Visualization of numerical data ↓

df.hist( figsize=(14, 10), bins=20)

(Figure vis01.png: histograms of the numeric columns)

Visualization of character string data ↓

plt.figure( figsize = (20, 15))

cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
for i, col in enumerate(cols):
    plt.subplot(3,3,i+1)
    df[col].value_counts().plot.bar()
    plt.title(col)

(Figure vis02.png: bar charts of the categorical columns)

From the visualizations above, `id` (of course), but also `balance` and `pdays`, seemed to have near-uniform distributions, so I deleted them from the training data. `default` was almost entirely "no", so I deleted it as well. In addition, I converted the string and categorical columns to numbers to create the training data.

df2 = df.copy()
df2 = df2.drop( columns=['id', 'balance', 'pdays', 'default'])

# month: map month names to numbers
month_map={
    'jan':1,
    'feb':2,
    'mar':3,
    'apr':4,
    'may':5,
    'jun':6,
    'jul':7,
    'aug':8,
    'sep':9,
    'oct':10,
    'nov':11}
# Map first, then fill anything unmapped (missing values, or months
# absent from the map) with 0; calling fillna before map has no effect
df2['month'] = df2['month'].map(month_map).fillna(0)

# job, marital, education, housing, loan, contact, poutcome
cols = ['job', 'marital', 'education', 'housing', 'loan', 'contact', 'poutcome']
ce_onehot = ce.OneHotEncoder(cols=cols,handle_unknown='impute')
ce_onehot.fit( df2 )
df2 = ce_onehot.transform( df2 )

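# duration appears to be in seconds in the original data; convert to hours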
df2['duration'] = df2['duration'] / 3600

df2.to_csv('mytrain.csv', index=False)

2-2. Model construction / prediction part


import pandas as pd
import numpy as np
import category_encoders as ce
import lightgbm as lgb
#import optuna
from optuna.integration import lightgbm as lgb_optuna
from sklearn import preprocessing
from sklearn.model_selection import train_test_split,StratifiedKFold,cross_validate
from sklearn.metrics import roc_auc_score

pd.set_option('display.max_columns', None)

random_state = 1234
version = 'v1'

Split the data into training and validation sets (80:20).


df_train = pd.read_csv('mytrain.csv')

X = df_train.drop( columns=['y'] )
y = df_train['y']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=random_state)

The following methods were used for model construction and accuracy verification.

- Cross-validation, splitting the training data into 5 folds by stratified sampling
- Hyperparameter tuning left to Optuna
- The metric used for optimization is logloss
- Retrain the model on the entire training data and compute the AUC on the validation data


def build():
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    
    lgb_train = lgb_optuna.Dataset(X_train, y_train)
    
    lgbm_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'random_state':random_state,
        'verbosity': 0
    }
    
    tunecv = lgb_optuna.LightGBMTunerCV(
        lgbm_params,
        lgb_train,
        num_boost_round=100,
        early_stopping_rounds=20,
        seed = random_state,
        verbose_eval=20,
        folds=kf
    )
    
    tunecv.run()
    
    print( 'Best score = ',tunecv.best_score)
    print( 'Best params= ',tunecv.best_params)
    
    return tunecv

tunecv = build()

Retrain the model on the entire training data and compute the AUC on the validation data ↓

train_data = lgb.Dataset( X_train, y_train )
eval_data = lgb.Dataset(X_holdout, label=y_holdout, reference= train_data)
clf = lgb.train( tunecv.best_params, 
                train_data,
                valid_sets=eval_data,
                num_boost_round=50,
                verbose_eval=0
               )
y_pred = clf.predict( X_holdout )
print('AUC: ', roc_auc_score(y_holdout, y_pred))
# AUC:  0.8486429810797091

3. Trial and error while looking at data and accuracy

| # | What I did | AUC | Submit score | Impressions |
|---|---|---|---|---|
| 00 | Use the process above as the baseline | 0.8486 | --- | --- |
| 01 | Change the encoding of job, marital, education, poutcome to target encoding (see the encoding sketch after this table) | 0.8458 | --- | Dropped slightly; set aside for now |
| 02 | num_boost_round=200 (plotting the learning curve suggested the score could improve a little more) | 0.8536 | --- | Improved; go with this |
| 03 | Noticed that the parameters used when retraining on the full training data differed from those used for hyperparameter tuning; unified to num_boost_round=200, early_stopping_rounds=20 | 0.8585 | --- | Go with this |
| 04 | Try AUC as the optimization metric | 0.8557 | --- | Dropped; keep logloss |
| 05 | Change loan, housing, contact to ordinal encoding | 0.8593 | 0.8556 | AUC is up, so go with this, though the submit score is a little low |
| 06 | Check the difference between test data and training data: no big difference by visualization, and a model trained to distinguish test from training rows scored only AUC ≈ 0.5 (see the sketch after this table), so I judged there was no distribution shift | --- | --- | --- |
| 07 | Change the encoding of month (combine several months with little data) | 0.8583 | 0.8585 | Almost the same AUC as 03; rejected |
| 08 | Change the encoding of month (combine several months with little data) | 0.8583 | 0.8585 | AUC dropped from 05; rejected |
| 09 | Add the previous month's mean of y as a column, like a time-series lag feature | 0.8629 | 0.8559 | Training score improved, but the submit score dropped; rejected |
| 10 | Categorize age (combining ages with few rows) | 0.8599 | 0.8588 | Slightly improved; go with this |
| 11 | Try adding PCA | 0.8574 | --- | Dropped |
| 12 | Try other algorithms (SVM, RandomForest, LogisticRegression) | --- | --- | Dropped |
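
For reference, minimal sketches of the encoding changes in 01 and 05 using category_encoders (these replace the one-hot step in 2-1; the target encoder is fit on the training data only, so the target does not leak):

import category_encoders as ce

# Attempt 01: target-encode the higher-cardinality columns
te_cols = ['job', 'marital', 'education', 'poutcome']
ce_target = ce.TargetEncoder(cols=te_cols)
df2[te_cols] = ce_target.fit_transform(df2[te_cols], df2['y'])

# Attempt 05: ordinal-encode the mostly-binary columns
oe_cols = ['loan', 'housing', 'contact']
ce_ordinal = ce.OrdinalEncoder(cols=oe_cols)
df2[oe_cols] = ce_ordinal.fit_transform(df2[oe_cols])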
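
And the check in 06 is what is often called adversarial validation; a minimal sketch (mytest.csv is hypothetical here, meaning test.csv run through the same preprocessing as mytrain.csv):

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import cross_val_score

# Label training rows 0 and test rows 1, then try to tell them apart
train_X = pd.read_csv('mytrain.csv').drop(columns=['y'])
test_X = pd.read_csv('mytest.csv')  # hypothetical: test.csv after the same preprocessing
X_adv = pd.concat([train_X, test_X], ignore_index=True)
y_adv = [0] * len(train_X) + [1] * len(test_X)

clf = lgb.LGBMClassifier(random_state=1234)
scores = cross_val_score(clf, X_adv, y_adv, cv=5, scoring='roc_auc')
print(scores.mean())  # an AUC near 0.5 means train and test are indistinguishable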

I tried various other small changes besides the above, but accuracy did not improve. Also, recording every attempt got tedious... and before I knew it, the competition period was over.

What I should have done

- Data processing
  - Looking at the data more closely, including more cross-tabulation, might have revealed something
  - Try merging with the original data (UCI) (it has probably been partly processed, so some ingenuity would be required)
  - Consider interaction terms
- Modeling
  - Could have tried an ensemble of LightGBM models trained with different random_state values (a sketch follows this list)
- Interpretation
  - Should have dug deeper into the cases the model got wrong (categorizing age helped, so there was probably more to find there)
- Tools
  - Should have used a code management tool (even just Git would do)
  - Likewise, should have added an experiment management tool (MLOps-style)
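
A minimal sketch of the seed-ensemble idea (reusing tunecv.best_params, train_data, and X_holdout from section 2-2; the seed list is arbitrary):

import numpy as np
import lightgbm as lgb

preds = []
for seed in [0, 1, 2, 3, 4]:
    # random_state is a LightGBM alias for seed; everything else stays fixed
    params = dict(tunecv.best_params, random_state=seed)
    model = lgb.train(params, train_data, num_boost_round=200)
    preds.append(model.predict(X_holdout))

# Average the predicted probabilities across the differently-seeded models
y_pred_ens = np.mean(preds, axis=0)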

In closing

There are surely many other things I should have done; I would appreciate any comments. In my next competition, I would like to incorporate new techniques while referring back to this reflection and to Kaggle kernels.
