Recently I've been hooked on data analysis competitions such as Kaggle and Signate, and I'm studying a little every day while taking part in several competitions. Before digging into the data of a new competition, I always run a LightGBM template first to get a feel for the difficulty of the competition and the character of the data, so I'd like to share it here.
Please let me know if you'd like to see more posts like this!
First, load the data and import the required libraries. If you start working without looking at the training data, it may turn out to be unexpectedly huge, so check the amount of data first.
from datetime import datetime
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Data reading
train_df = pd.read_csv("./train.csv")
test_df = pd.read_csv("./test.csv")
print(train_df.shape, test_df.shape)
(891, 12) (418, 11)
train_df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 ...
...
891 rows × 12 columns
It's important to look at the data in any competition.
At a minimum, check things such as not using the objective variable `Survived` as a feature, and not using `PassengerId` and `Name` because they are unique to each row.
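For example, a quick check like the following makes missing values and unique columns easy to spot (a minimal sketch using only standard pandas methods on the `train_df` loaded above):
# Minimal data check: missing values, cardinality, and column types
print(train_df.isnull().sum())   # Missing values per column
print(train_df.nunique())        # Unique values per column (to confirm PassengerId and Name are unique)
print(train_df.dtypes)           # object columns will need encoding later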
Split the data into explanatory variables and the objective variable.
train_x, train_y = train_df.drop("Survived", axis=1), train_df["Survived"]
Feature processing is also kept to a minimum. Only the following three steps are performed:
- Fill missing values
- Convert qualitative variables to quantitative variables (label encoding)
- Drop unnecessary columns (`PassengerId` and `Name`)
def label_encoding(data_col):
    '''
    Label encoding
    data_col : one column of the target data frame
    '''
    le = LabelEncoder()
    le = le.fit(data_col)
    # Convert labels to integers
    data_col = le.transform(data_col)
    return data_col


def preprocess(df):
    '''
    Perform preprocessing
    df : pandas.DataFrame
        Target data frame
    '''
    df = df.drop("PassengerId", axis=1)
    df = df.drop("Name", axis=1)

    # Convert qualitative variables to numbers
    for column_name in df:
        if df[column_name].dtype == object:
            # Fill missing values with the string "NULL"
            df[column_name] = df[column_name].fillna("NULL")
            df[column_name] = label_encoding(df[column_name])
        elif df[column_name].dtype in ("int64", "float64"):
            # Fill missing numeric values with -999
            df[column_name] = df[column_name].fillna(-999)
    return df
When performing label encoding, the mapping between labels in the training data and the test data must stay consistent, so feature processing is applied to the training and test data together.
all_x = pd.concat([train_x, test_df])
preprocessed_all_x = preprocess(all_x)
# Split the preprocessed data back into training data and test data
preprocessed_train_x, preprocessed_test_x = preprocessed_all_x[:train_x.shape[0]], preprocessed_all_x[train_x.shape[0]:]
print(preprocessed_train_x.head(5))
Create a class that trains LightGBM. See the official documentation for detailed parameter explanations.
`objective` and `metrics` should be changed according to the training data and the competition.
# LightGBM
import lightgbm as lgb


class lightGBM:
    def __init__(self, params=None):
        self.model = None
        if params is not None:
            self.params = params
        else:
            self.params = {'objective': 'binary',
                           'seed': 0,
                           'verbose': 10,
                           'boosting_type': 'gbdt',
                           'metrics': 'auc',
                           'reg_alpha': 0.0,
                           'reg_lambda': 0.0,
                           'learning_rate': 0.01,
                           'drop_rate': 0.5
                           }
        self.num_round = 20000
        self.early_stopping_rounds = self.num_round / 100

    def fit(self, tr_x, tr_y, va_x, va_y):
        self.target_columns = tr_x.columns
        print(self.target_columns)
        # Convert to LightGBM datasets
        lgb_train = lgb.Dataset(tr_x, tr_y)
        lgb_eval = lgb.Dataset(va_x, va_y)
        self.model = lgb.train(self.params,
                               lgb_train,
                               num_boost_round=self.num_round,
                               early_stopping_rounds=self.early_stopping_rounds,
                               valid_names=['train', 'valid'],
                               valid_sets=[lgb_train, lgb_eval],
                               verbose_eval=self.num_round / 100
                               )
        return self.model

    def predict(self, x):
        pred = self.model.predict(x, num_iteration=self.model.best_iteration)
        return pred

    def get_feature_importance(self, target_columns=None):
        '''
        Output feature importance
        '''
        if target_columns is not None:
            self.target_columns = target_columns
        feature_imp = pd.DataFrame(sorted(zip(self.model.feature_importance(), self.target_columns)),
                                   columns=['Value', 'Feature'])
        return feature_imp
Define the learner.
def model_learning(model, x, y):
    '''
    Train the model.
    '''
    tr_x, va_x, tr_y, va_y = train_test_split(x, y, test_size=0.2, random_state=0)
    return model.fit(tr_x, tr_y, va_x, va_y)
By defining each model as a class and passing it to the learner, changes to the source code can be kept to a minimum when switching models.
For example, when you want to use XGBoost, you can swap in the model to be trained just by rewriting it as follows (a fleshed-out sketch follows the skeleton below).
class XGBoost:
    def __init__(self, params=None):
        ...  # Initialization process

    def fit(self, tr_x, tr_y, va_x, va_y):
        ...  # Training process

    def predict(self, x):
        ...  # Prediction process


xgboost_model = XGBoost()
model_learning(xgboost_model, preprocessed_train_x, train_y)
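For reference, here is a minimal sketch of what such an XGBoost wrapper could look like. The parameters and xgboost API calls below are my own assumptions rather than part of the original template, and `iteration_range` assumes a reasonably recent xgboost version:
import xgboost as xgb


class XGBoost:
    def __init__(self, params=None):
        self.model = None
        # Assumed default parameters for a binary task (not from the original template)
        if params is not None:
            self.params = params
        else:
            self.params = {'objective': 'binary:logistic',
                           'eval_metric': 'auc',
                           'eta': 0.01,
                           'seed': 0}
        self.num_round = 20000
        self.early_stopping_rounds = int(self.num_round / 100)

    def fit(self, tr_x, tr_y, va_x, va_y):
        # Convert to XGBoost's DMatrix format and train with early stopping
        dtrain = xgb.DMatrix(tr_x, label=tr_y)
        dvalid = xgb.DMatrix(va_x, label=va_y)
        self.model = xgb.train(self.params,
                               dtrain,
                               num_boost_round=self.num_round,
                               evals=[(dtrain, 'train'), (dvalid, 'valid')],
                               early_stopping_rounds=self.early_stopping_rounds)
        return self.model

    def predict(self, x):
        # best_iteration is available when early stopping was triggered
        return self.model.predict(xgb.DMatrix(x),
                                  iteration_range=(0, self.model.best_iteration + 1))
Now train the LightGBM model defined earlier.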
lightgbm_model = lightGBM()
model_learning(lightgbm_model, preprocessed_train_x, train_y)
Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin',
'Embarked'],
dtype='object')
Training until validation scores don't improve for 200.0 rounds
Early stopping, best iteration is:
[172] train's auc: 0.945026 valid's auc: 0.915613
Training is complete! It finished in no time.
With LightGBM, you can check which features the model relied on most during training. This gives you hints for EDA in the next step.
`Age`, `Ticket`, and `Fare` are at the top, so age and seat position seem to matter, and it would be worth looking at, for example, the correlation between `Age` and `Survived`.
lightgbm_model.get_feature_importance()
Value Feature
0 32 Parch
1 58 SibSp
2 158 Embarked
3 165 Cabin
4 172 Sex
5 206 Pclass
6 1218 Fare
7 1261 Ticket
8 1398 Age
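Since matplotlib is already imported, the importances can also be turned into a quick bar chart (a small sketch that simply reuses the `get_feature_importance` output above):
# Plot feature importance as a horizontal bar chart
feature_imp = lightgbm_model.get_feature_importance()
plt.barh(feature_imp['Feature'], feature_imp['Value'])
plt.xlabel('Importance')
plt.tight_layout()
plt.show()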
Evaluate the model. The output is a probability, but for this submission it must be either 0 or 1, so format it accordingly.
# Evaluate the model on the test data
proba_ = lightgbm_model.predict(preprocessed_test_x)
proba = list(map(lambda x: 0 if x < 0.5 else 1, proba_))
Format the predicted values to match the submission format. This is a surprisingly easy place to trip up ...
# Create the submission data
submit_df = pd.DataFrame({"Survived": proba})
submit_df.index.name = "PassengerId"
submit_df.index = submit_df.index + len(train_df) + 1
The file is saved with a name in the `submit_{%Y-%m-%d-%H%M%S}` format.
This prevents accidental overwriting, and you don't have to think up a file name every time, which is convenient.
#Save
save_folder = "results"
if not os.path.exists(save_folder):
os.makedirs(save_folder)
submit_df.to_csv("{}/submit_{}.csv".format(save_folder, datetime.now().strftime("%Y-%m-%d-%H%M%S")),index=True)
When I submitted this result, the Public Score was 0.77033, which placed 6610th out of 20114 participants (as of 2020-08-25).
I think it's not a bad template for getting a rough sense of a competition's difficulty and character by running it once at the start.
I always feel that my EDA is lacking, so I'd like to work on EDA more thoroughly going forward.