On SIGNATE, one of Japan's machine learning competition platforms, I participated in the "[1st Beginner-Limited Competition] Bank Customer Targeting" held in August 2020, and this post is my memorandum of how I tackled the problem. **Note that my solution is nothing particularly original; I hope it will be useful to machine learning beginners** (the post is long).
On SIGNATE, titles are awarded according to competition results, and the title you start with when you register is "Beginner". This competition was open only to people in the bottom Beginner class (apparently the first Beginner-limited competition ever held).
Normally you are promoted from Beginner to the next title, Intermediate, by finishing in the top 60% of a competition even once, but in this competition anyone who achieved the specified score was automatically promoted to Intermediate on the spot.
I had only just registered with SIGNATE and was still a Beginner, so I entered.
The task is to predict, from customer attribute data and contact information from past campaigns, whether a customer opened an account as a result of a campaign run by a bank. In machine learning terms, this is a so-called "classification" problem.
The data provided is as follows. The training data has 27100 records and the test data has 18050 records.
No. | Column name | Data type | Description |
---|---|---|---|
0 | id | int | Row serial number |
1 | age | int | Age |
2 | job | varchar | Occupation |
3 | marital | varchar | Marital status (unmarried/married) |
4 | default | varchar | Credit in default (yes, no) |
5 | education | varchar | Education level |
6 | balance | int | Average annual balance (€) |
7 | housing | varchar | Housing loan (yes, no) |
8 | loan | varchar | Personal loan (yes, no) |
9 | contact | varchar | Contact method |
10 | day | int | Day of last contact |
11 | month | char | Month of last contact |
12 | duration | int | Duration of last contact (seconds) |
13 | campaign | int | Number of contacts during the current campaign |
14 | pdays | int | Days elapsed since contact in the previous campaign |
15 | previous | int | Number of contacts with the customer before the current campaign |
16 | poutcome | varchar | Outcome of the previous campaign |
17 | y | boolean | Whether the customer applied for a term deposit (1: yes, 0: no) |
OS: Windows 10
Processor: Core i7-5500U
Memory: 16 GB
Anaconda3 environment (Python 3.7.6)
Bank_Prediction
├ notebook/ ●●●.ipynb
├ input/ train.csv, test.csv
└ output/ prediction results are written here
I built the prediction model in the following order: EDA, data preprocessing, then training and prediction.
First, exploratory data analysis (EDA) to check the structure and characteristics of the given data. To keep the article short, I omit the EDA results for the test data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
#Set the maximum number of display columns to 50
pd.set_option('display.max_columns', 50)
#Reading various data
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
train.info()
There are 27100 records and 18 features, and we can see which features are numerical variables and which are categorical variables. The data also appears to have no missing values. Since this dataset was prepared for the competition, it is clean data with no missing values; with real-world data, it is common to have many missing values and to have to impute them.
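As a quick check (my own addition, not essential for this clean dataset), you can count missing values per column; with real data this is where imputation would come in.

#Count missing values per column (all zeros for this competition's data)
print(train.isnull().sum())
#With real data, a simple imputation might look like:
#train['age'] = train['age'].fillna(train['age'].median())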
train.describe()
train.hist(figsize=(20,20), color='r')
y indicates whether an account was opened; the opened cases (1) are far fewer than the unopened ones (0), so this is imbalanced data.
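To quantify the imbalance (a small check of my own):

#Class counts and ratios for the target y
print(train['y'].value_counts())
print(train['y'].value_counts(normalize=True))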
colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.select_dtypes(exclude='object').astype(int).corr(),linewidths=0.1,vmax=1.0, vmin=-1.0,
square=True, cmap=colormap, linecolor='white', annot=True)
Among the features, previous (the number of contacts with the customer so far) appears to have the highest correlation with whether an account was opened.
g = sns.pairplot(train, hue='y', palette='seismic', height=1.2, diag_kind='kde', diag_kws=dict(shade=True), plot_kws=dict(s=10))
g.set(xticklabels=[])
Blue is the distribution of customers who did not open an account, and red is the distribution of customers who did. Looking at the diagonal histogram for age, younger people seem more likely not to open an account. There is also a difference in the distributions for day (day of last contact).
#Number of distinct values in each categorical variable
category_cols = train.select_dtypes(include='object')
for col in category_cols.columns:
    print(col)
    print(category_cols[col].nunique())
    print(category_cols[col].value_counts().to_dict())
This confirms the number of distinct values contained in each categorical variable.
From here, we will perform data preprocessing for creating a prediction model.
First, for the categorical variables, I added one feature that concatenates the three loan-related features (default, housing, loan).
I also added numerical features: for several existing features, the difference between each record's value and the median of that feature. Adding squared or cubed versions of features as new features can reportedly improve generalization performance, but I did not try that this time (see the sketch after the code block below).
#Merge train and test data
train2 = pd.concat([train,test],axis=0)
#Feature addition
train2['default_housing_loan'] = train2['default'].str.cat([train2['housing'],train2['loan']], sep='_')
train2['age_median'] = train2['age'] - train2['age'].median()
train2['day_median'] = train2['day'] - train2['day'].median()
train2['duration_median'] = train2['duration'] - train2['duration'].median()
train2['campaign_median'] = train2['campaign'] - train2['campaign'].median()
train2['previous_median'] = train2['previous'] - train2['previous'].median()
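For reference, the squared/cubed idea mentioned above (which I did not actually use) would look something like the following; the choice of columns here is just illustrative.

#Sketch only (not part of my solution): polynomial versions of numerical features
for col in ['age', 'balance', 'duration']:
    train2[col + '_sq'] = train2[col] ** 2
    train2[col + '_cube'] = train2[col] ** 3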
Label Encoding: categorical variables cannot be fed into the prediction model as training data as they are, so they must be encoded. There are several encoding methods; since the algorithm used for training this time is a gradient boosting tree, I used Label Encoding (for algorithms such as linear regression, One-Hot Encoding is usually the better choice).
For example, Label Encoding the marital feature maps each category to an integer. Note that scikit-learn's LabelEncoder assigns codes in sorted order:
divorced → 0, married → 1, single → 2
#Label Encoding
from sklearn.preprocessing import LabelEncoder

category = train2.select_dtypes(include='object')
for col in list(category):
    le = LabelEncoder()
    train2[col] = le.fit_transform(train2[col])
Now that preprocessing of the given data is complete, we move on to training and prediction. The algorithm used for training is LightGBM. This time I built 20 models, changing the random seed used to split the data into training and validation sets, and took the average of their predictions as the final result (Random Seed Averaging). The hyperparameters are tuned with Optuna.
~~Also, because of the **imbalanced data, I specified "'class_weight': 'balanced'" in the LightGBM params**.~~ **(Correction) This was unnecessary, because AUC is an evaluation metric that is not affected by class imbalance. Also, it is LightGBM's scikit-learn API (LGBMClassifier) that accepts class_weight as a parameter.**
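For reference, here is a minimal sketch of what that would look like with the scikit-learn API; this is an illustration only, not part of my final solution.

#Sketch: class_weight is accepted by the scikit-learn API (LGBMClassifier)
from lightgbm import LGBMClassifier

clf = LGBMClassifier(objective='binary', class_weight='balanced')
#clf.fit(train_X, train_y)                #using the split prepared below
#pred = clf.predict_proba(test_X)[:, 1]   #probability of the positive class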
Training and prediction
#Import LightGBM through Optuna's integration (hyperparameter tuning with Optuna)
import optuna.integration.lightgbm as lgb
from sklearn.model_selection import train_test_split
import datetime
#Divide the merged train2 into train and test again
train = train2[:27100]
test = train2[27100:].drop(['y'],axis=1)
#Get the values of the objective and explanatory variables of train
target = train['y'].values
features = train.drop(['id','y'],axis=1).values
#test data
test_X = test.drop(['id'],axis=1).values
lgb_params = {'objective': 'binary',
'metric': 'auc', #The evaluation index specified by the competition is AUC
#'class_weight': 'balanced' #I didn't need it here
}
#Random seed averaging: 20 runs
for i in range(20):
    #Split train into training data and validation data (a different split each run)
    train_X, val_X, train_y, val_y = train_test_split(features, target, test_size=0.2)
    #Create the datasets for LightGBM
    lgb_train = lgb.Dataset(train_X, train_y, feature_name=list(train.drop(['id','y'], axis=1)))  #for training
    lgb_eval = lgb.Dataset(val_X, val_y, reference=lgb_train)  #for validation / early stopping
    #Specify the categorical variables
    categorical_features = ['job', 'marital', 'education', 'default', 'contact', 'month',
                            'housing', 'loan', 'poutcome', 'default_housing_loan']
    #Training
    model = lgb.train(lgb_params, lgb_train, valid_sets=lgb_eval,
                      categorical_feature=categorical_features,
                      num_boost_round=1000,
                      early_stopping_rounds=20,
                      verbose_eval=10)
    pred = model.predict(test_X)  #predicted probability of opening an account
    #Store the prediction from each run
    output = pd.DataFrame(pred, columns=['pred' + str(i + 1)])
    if i == 0:
        output2 = output
    else:
        output2 = pd.concat([output2, output], axis=1)
#End of for
#Average each prediction result
df_mean = output2.mean(axis='columns')
df_result = pd.concat([test['id'],df_mean],axis=1)
#Export with time attached to file name
now = datetime.datetime.now()
df_result.to_csv('../output/submission' + now.strftime('%Y%m%d_%H%M%S') + '.csv',index=None,header=None)
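Although not shown in the output above, the validation AUC can also be checked locally with scikit-learn. In the sketch below, val_X and val_y come from the split inside the loop, so this only evaluates the last of the 20 models.

#Check the validation AUC of the last trained model
from sklearn.metrics import roc_auc_score

val_pred = model.predict(val_X)
print('validation AUC:', roc_auc_score(val_y, val_pred))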
The score (AUC) required by the competition was 0.85, and my **final score was 0.855**, so I was successfully promoted to Intermediate. **My final ranking was 62nd out of 787 participants**: not bad, but not outstanding either.
By the way, the transition of the score is as follows.
**0.8470: without Random Seed Averaging**
↓ (+0.0034)
**0.8504: Random Seed Averaging, 5 runs**
↓ (+0.0035)
~~**0.8539: specifying "'class_weight': 'balanced'"**~~
↓ (+0.0016)
**0.8555: Random Seed Averaging, 20 runs**
~~In my case, specifying "'class_weight': 'balanced'" felt quite effective.~~
Also, although I have corrected it in the code posted here, there was one fatal mistake; without it, I feel the score could have reached about 0.857 (a little disappointing).
Incidentally, someone wrote on the forum (the competition's bulletin board) that averaging over 100 random seeds raises the score considerably. I should have increased the number of runs (though I wasn't prepared to train for 10 hours lol).
**(Correction) As described above, the evaluation metric AUC is not affected by class imbalance, so this handling was not actually necessary this time. Also, it is LightGBM's scikit-learn API (LGBMClassifier) that accepts class_weight as a parameter.**
I noticed that the training data this time was imbalanced. When a model is trained on imbalanced data, it tends to simply predict the negative (majority) class, so the following treatments are common:

1. Undersample the majority class
2. Weight the classes during training
This time I did not undersample; I used approach 2, referring to the following page.
It is better to set class weight when classifying biased data in random forest
If you want to implement the undersampling of approach 1, the following page is helpful.
Downsampling + bagging with LightGBM --a memorandum of u ++
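As a rough sketch of approach 1, undersampling can also be done with plain pandas (my own illustration, much simpler than the bagging approach in the article above):

#Sketch: undersample the majority class (0) down to the size of the minority class (1)
positives = train[train['y'] == 1]
negatives = train[train['y'] == 0].sample(n=len(positives), random_state=0)
train_balanced = pd.concat([positives, negatives]).sample(frac=1, random_state=0)  #shuffle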
By the way, in my experience, whether undersampling or class weighting works better depends on the problem, so I recommend trying both and adopting whichever scores better.
I also tried pseudo-labeling, but it was not very effective in this competition, so I did not use it.
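For the record, pseudo-labeling adds confidently-predicted test rows to the training data. A minimal sketch follows; the 0.95/0.05 thresholds are my own illustrative choice, not what I actually tuned.

#Sketch: pseudo-labeling with the averaged predictions from above
confident = test.copy()
confident['y'] = df_mean.values
confident = confident[(confident['y'] > 0.95) | (confident['y'] < 0.05)]
confident['y'] = (confident['y'] > 0.5).astype(int)
train_pseudo = pd.concat([train, confident], axis=0)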
According to other participants, Target Encoding and Stacking were not very effective either, so it seems this was a competition where an orthodox attack with a single model worked well.
Since the same task is available as a SIGNATE practice problem, you can download the data from the page below and check how the code behaves. Please try running it if you are interested.
[Practice question] Bank customer targeting
Although it was a Beginner-limited competition, it was very rewarding, with a lot to learn. Going forward, I would like to take on Kaggle's MoA (Mechanisms of Action) competition and ProbSpace's Splatoon competition. Incidentally, I have also applied to "AI QUEST", the AI human resources development program sponsored by the Ministry of Economy, Trade and Industry, so if I am lucky enough to be accepted, my days are going to be busy.
P.S. It took a lot of time to draw the SIGNATE title pyramid ...