I wrote this article hoping that posting this unfinished code would catch many people's eyes and bring me suggestions and fixes: what I did wrong and what I should have done instead. Honestly, there are probably many places where you will wonder why I did things this way, but I would be happy if you could look at it kindly.
This time I entered a competition that started on October 1st. https://signate.jp/competitions/295
To briefly introduce myself: I started attending an AI programming school in April of this year. I am currently in the process of changing jobs, have no programming experience, and come from a Faculty of Arts.
I joined this competition a little late and started slowly on October 13th. For the first week, I just looked at the data and wrote code based on what I had learned so far. However, I could not even submit because of repeated errors...
Then, one week before the end of the competition, I finally got the Kaggle Start Book and decided to copy it and build something similar. For the EDA, SIGNATE had opened QUEST for free, so I referred to that.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import optuna
import lightgbm as lgb #Plain LightGBM; optuna.integration.lightgbm.train would run its own tuner inside every trial
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
First, load the data and look at the contents. (I used a Kaggle notebook as the development environment.)
train = pd.read_csv("../input/signatecomp/train.csv",header=0)
test = pd.read_csv("../input/signatecomp/test.csv",header=0)
print(train.info())
print(train.head())
print(test.info())
print(test.head())
Let's look at the characteristics of the data from here. First look at the numeric variables.
test.hist(figsize=(20,20), color='r')
Next, let's look at categorical variables.
emplength_var = train['employment_length'].value_counts()
#Draw the bar chart with a title
emplength_var.plot.bar(title="Frequency of employment_length")
#Specify x-axis name
plt.xlabel('employment_length')
#Specify y-axis name
plt.ylabel('count')
#Display the created graph
plt.show()
#Bar chart of purpose
purpose_var = train['purpose'].value_counts()
purpose_var.plot.bar()
#Display the purpose bar chart
plt.show()
#Bar chart of application_type
application_var = train['application_type'].value_counts()
application_var.plot.bar()
#Display the application_type bar chart
plt.show()
#Bar chart of grade
grade_var = train['grade'].value_counts()
grade_var.plot.bar()
#Display the grade bar chart
plt.show()
Next, let's look at the relationship between the objective variable and the categorical variable.
#Cross-tabulate the term column (rows) against the loan_status column (columns)
cross_term = pd.crosstab(train['term'],train['loan_status'], margins = True)
#Divide the ChargedOff column by the All column and assign the result to c_rate
c_rate = cross_term['ChargedOff'] / cross_term['All']
#Divide the FullyPaid column by the All column and assign the result to f_rate
f_rate = cross_term['FullyPaid'] / cross_term['All']
#Add c_rate and f_rate to cross_term as the new columns c_rate and f_rate
cross_term['c_rate'] = c_rate
cross_term['f_rate'] = f_rate
#Display the cross-tabulation table
print(cross_term)
#Drop the All row from cross_term and reassign the result to cross_term
cross_term = cross_term.drop(index = ["All"])
#Show cross tabulation
print(cross_term)
#Create a DataFrame for only the columns you want to use for the stacked bar chart
df_bar = cross_term[['c_rate', 'f_rate']]
#Create a stacked bar chart
df_bar.plot.bar(stacked=True)
#Graph title settings
plt.title('Bad debt rate and repayment rate for each repayment period')
#x-axis label settings
plt.xlabel('period')
#y-axis label settings
plt.ylabel('Percentage')
#Graph display
plt.show()
I applied the same steps to every categorical variable. (I could paste all of the code, but it is almost identical, so I will omit it.)
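As a rough sketch of how the repeated cross-tabulation could be wrapped up, the helper below computes the per-category rates in one call. The function name `status_rates` and the tiny demo frame are made up for illustration; `normalize="index"` is a standard `pd.crosstab` option that divides each row by its total, which replaces the manual division by the All column.

```python
import pandas as pd

def status_rates(df, column, target="loan_status"):
    """Return the share of each target class within every category of `column`."""
    # normalize="index" divides each row by its row total,
    # so the rates in one row sum to 1
    return pd.crosstab(df[column], df[target], normalize="index")

# Tiny made-up frame standing in for train (values are illustrative only)
demo = pd.DataFrame({
    "term": ["3 years", "3 years", "5 years", "5 years"],
    "grade": ["A1", "B2", "A1", "B2"],
    "loan_status": ["FullyPaid", "FullyPaid", "ChargedOff", "FullyPaid"],
})

# One rate table per categorical column
rate_tables = {col: status_rates(demo, col) for col in ["term", "grade"]}
print(rate_tables["term"])
```

Each table can then be passed to `.plot.bar(stacked=True)` in the same way as the stacked bar chart for term.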
For the features, I subtracted the median from loan_amnt and interest_rate, and log-transformed credit_score because its distribution was skewed.
#Feature addition
train["log_cre"] = np.log(train.credit_score - train.credit_score.min() + 1)
test["log_cre"] = np.log(test.credit_score - test.credit_score.min() + 1)
train['loam_median'] = train['loan_amnt'] - train['loan_amnt'].median()
train['inter_median'] = train['interest_rate'] - train['interest_rate'].median()
test['loam_median'] = test['loan_amnt'] - test['loan_amnt'].median()
test['inter_median'] = test['interest_rate'] - test['interest_rate'].median()
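One thing to note about the code above is that the minimum and the medians are computed separately on train and on test, so the same raw value can end up as a different feature value in the two tables. A minimal sketch of one alternative, assuming the statistics are taken from train only and reused for test, could look like this. The helper `add_features`, the `stats` dictionary and the demo rows are hypothetical; the clipping step is an extra assumption to keep the logarithm defined when a test score falls below the train minimum.

```python
import numpy as np
import pandas as pd

def add_features(df, stats):
    """Add the derived columns using statistics computed once on train."""
    out = df.copy()
    # Shift by the train minimum; clip at 0 so that a score below the
    # train minimum cannot produce the log of a non-positive number
    shifted = (out["credit_score"] - stats["credit_min"]).clip(lower=0)
    out["log_cre"] = np.log(shifted + 1)
    out["loam_median"] = out["loan_amnt"] - stats["loan_median"]
    out["inter_median"] = out["interest_rate"] - stats["interest_median"]
    return out

# Made-up rows standing in for train and test
train_demo = pd.DataFrame({"credit_score": [650.0, 700.0, 750.0],
                           "loan_amnt": [1000.0, 2000.0, 3000.0],
                           "interest_rate": [5.0, 10.0, 15.0]})
test_demo = pd.DataFrame({"credit_score": [640.0, 700.0],
                          "loan_amnt": [2500.0, 500.0],
                          "interest_rate": [12.0, 8.0]})

# Statistics come from train only and are reused for test
stats = {"credit_min": train_demo["credit_score"].min(),
         "loan_median": train_demo["loan_amnt"].median(),
         "interest_median": train_demo["interest_rate"].median()}

train_feat = add_features(train_demo, stats)
test_feat = add_features(test_demo, stats)
print(test_feat)
```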
This time I did label encoding.
#Convert train data
Label_Enc_list = ['term','grade','purpose','application_type',"employment_length","loan_status"]
#Implementation of Label Encoding
import category_encoders as ce
ce_oe = ce.OrdinalEncoder(cols=Label_Enc_list,handle_unknown='impute')
#Convert the strings to ordinal integers
train = ce_oe.fit_transform(train)
#Shift the codes so that they start at 0 instead of 1
for i in Label_Enc_list:
    train[i] = train[i] - 1
#Convert test data
category = test.select_dtypes(include='object')
for col in list(category):
    le = LabelEncoder()
    le.fit(test[col])
    test[col] = le.transform(test[col])
print(train.head())
print(test.head())
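In the code above, train goes through the category_encoders OrdinalEncoder while test goes through a LabelEncoder fitted on test itself, so the same category may receive different integers in the two tables. A minimal sketch of one way to share a single mapping, using only pandas, is shown below. The helpers `fit_category_maps` and `encode` and the demo frames are hypothetical, and the target column loan_status would still need to be encoded separately because it does not exist in test.

```python
import pandas as pd

def fit_category_maps(df, columns):
    """Record the sorted list of categories seen in each training column."""
    return {col: sorted(df[col].dropna().unique()) for col in columns}

def encode(df, category_maps):
    """Replace each listed column with integer codes from the shared mapping."""
    out = df.copy()
    for col, cats in category_maps.items():
        # Categories that were not seen when fitting (and missing values) become -1
        out[col] = pd.Categorical(out[col], categories=cats).codes
    return out

# Made-up rows standing in for train and test
train_demo = pd.DataFrame({"grade": ["B1", "A1", "C1"],
                           "term": ["5 years", "3 years", "3 years"]})
test_demo = pd.DataFrame({"grade": ["C1", "A1", "D1"],
                          "term": ["3 years", "5 years", "3 years"]})

# Fit the mapping on train only, then apply it to both tables
maps = fit_category_maps(train_demo, ["grade", "term"])
train_enc = encode(train_demo, maps)
test_enc = encode(test_demo, maps)
print(test_enc)
```

Here the grade D1, which appears only in the made-up test rows, is encoded as -1 instead of silently taking the code of another grade.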
#Get the values of the objective and explanatory variables of train
target = train['loan_status'].values
features = train.drop(['id','loan_status'],axis=1).values
#test data
test_X = test.drop(['id'],axis=1).values
#Divide train into training data and verification data
(features , val_X , target , val_y) = train_test_split(features, target , test_size = 0.2)
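If one of the two loan_status classes is much rarer than the other (an assumption here, not something checked in this article), a plain random split can leave the validation data with a different class ratio from the training data. The sketch below shows the `stratify` and `random_state` arguments of `train_test_split` on made-up arrays; the names `X_demo` and `y_demo` are placeholders for features and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 negatives and 20 positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the class ratio the same in both parts
    random_state=0,    # fixed seed so that the split is reproducible
)
print(y_tr.mean(), y_val.mean())
```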
def objective(trial):
    lgb_params = {'objective': 'binary',
                  'max_bin': trial.suggest_int("max_bin", 255, 500),
                  "learning_rate": 0.05,
                  "num_leaves": trial.suggest_int("num_leaves", 32, 128)
                  }
    lgb_train = lgb.Dataset(features, target) #For training
    lgb_eval = lgb.Dataset(val_X, val_y,reference=lgb_train) #For validation and early stopping
    #Training
    model = lgb.train(lgb_params, lgb_train,
                      valid_sets=[lgb_train,lgb_eval],
                      num_boost_round=1000,
                      early_stopping_rounds=10,
                      verbose_eval=10)
    #Predict on the validation data and score the trial by log loss
    y_pred = model.predict(val_X,
                           num_iteration=model.best_iteration)
    score = log_loss(val_y,y_pred)
    return score
study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=0))
study.optimize(objective, n_trials=20)
study.best_params
lgb_params = {'boosting_type': 'gbdt',
'objective': 'binary',
'max_bin': study.best_params["max_bin"],
"learning_rate": 0.05,
"num_leaves": study.best_params["num_leaves"]
}
lgb_train = lgb.Dataset(features, target) #For training
lgb_eval = lgb.Dataset(val_X, val_y,reference=lgb_train) #For validation and early stopping
#Training
model = lgb.train(lgb_params, lgb_train, valid_sets=[lgb_train,lgb_eval],
num_boost_round=1000,
early_stopping_rounds=10,
verbose_eval=10)
pred = model.predict(test_X,num_iteration=model.best_iteration)
The Start Book converts the predictions to binary labels with the condition "greater than 0.5". When I printed about 50 rows of predictions, though, I only got a usable score after changing the condition to "greater than 0.1", so I changed it, although I do not really know whether this is the right way to handle it... I also wanted to assign the predictions into the submission file, but I did not know how to specify the column, so I am doing the silly thing of adding a column once, then opening the CSV file and deleting the unneeded part by hand (; _;) (Would it work to specify the column normally with header=0?)
pred1 = (pred > 0.1).astype(int)
submit = pd.read_csv("../input/signatecomp/submit.csv")
#Prediction result file output
submit.loc[:,0] = pred1[1:]
submit.to_csv("submit1.csv", index = False)
print("Your submission was successfully saved!")
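A possible way to avoid the manual CSV editing, assuming the sample submit.csv has no header row and holds the id in the first column and the prediction in the second (the use of `pred1[1:]` above suggests that the first row was consumed as a header), is to read the file with `header=None` and write it back with `header=False`. The helper `fill_submission` and the demo frame are hypothetical stand-ins.

```python
import pandas as pd

def fill_submission(sample, predictions):
    """Return a copy of the sample submission with column 1 replaced by predictions."""
    if len(sample) != len(predictions):
        raise ValueError("row counts differ: was the file read with header=None?")
    out = sample.copy()
    out[1] = predictions
    return out

# Stand-in for pd.read_csv("../input/signatecomp/submit.csv", header=None),
# which names the columns 0 and 1 when the file has no header row
sample_demo = pd.DataFrame({0: [101, 102, 103], 1: [0, 0, 0]})
filled = fill_submission(sample_demo, [1, 0, 1])

# index=False and header=False keep the file in the same headerless shape;
# passing a path such as "submit1.csv" would write the file instead
csv_text = filled.to_csv(index=False, header=False)
print(csv_text)
```

With this shape, every prediction is kept and no row has to be deleted afterwards.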
A score above F1Score = 0.355 would have earned a promotion, but my score was 0.3275218, so I was not promoted.
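Since the score is an F1 score, the 0.1 cut-off chosen by eye could instead be picked by scanning candidate thresholds on the validation data. The following is only a sketch under that assumption: `best_f1_threshold` is a hypothetical helper, the demo arrays are made up, and in the real notebook the inputs would be val_y and the model's predictions for val_X.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, f1) with the highest F1 among the candidates."""
    if thresholds is None:
        # Candidate cut-offs from 0.05 to 0.95 in steps of 0.05
        thresholds = np.arange(0.05, 0.96, 0.05)
    scores = [f1_score(y_true, (y_prob > t).astype(int), zero_division=0)
              for t in thresholds]
    best = int(np.argmax(scores))  # first candidate that reaches the maximum
    return float(thresholds[best]), float(scores[best])

# Made-up validation labels and predicted probabilities
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
y_prob_demo = np.array([0.02, 0.12, 0.22, 0.32, 0.27, 0.62])

threshold, score = best_f1_threshold(y_true_demo, y_prob_demo)
print(threshold, score)
```

The threshold found this way depends on the validation split, so it is a starting point rather than a guaranteed improvement on the leaderboard.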
First, I wasted a lot of time because I was trying to analyze the data almost from scratch.
Next, I had gained a certain amount of knowledge and implementation skill by studying for the E qualification at the programming school, but I regret that I neglected to actually solve problems and review them, and that showed in the result.
Lastly, I was very bad at finding similar competitions on Kaggle, at implementing code, at data engineering, and so on.
Entering a competition for the first time taught me a lot: how weak my skills are, what I should do from now on, how to deal with errors, and more.
Thank you to everyone who has read this far. I still have a lot to learn, so I will keep working hard.