This is the story of my first Kaggle competition. In the previous article, "Try all scikit-learn models with Kaggle's Titanic", I performed cross-validation on scikit-learn models and managed to raise my score a little. This time, I would like to do what should have been done first: check the raw data.
Background
I read a book called "The Power of Analysis that Changes the Company". One of its points is: check the raw data before analyzing it. Outliers cannot be found without looking at the raw data, so before starting any analysis you should visualize the raw data and look for anomalies; the book argues this should become a habit. So this time: check the raw data, check for abnormal values, and reconsider how each column is used.
The upshot: by scrutinizing the input data, the score improved a little, to 0.80382, which put me in the top 9% (as of January 7, 2020). Let me walk through the flow up to submission.
Let's check some raw data.
Let's make a scatter plot of Fare for each Pclass (ticket class). It looks like this:
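For reference, the plot can be produced like this (a minimal sketch; the same plot is drawn in the full listing at the end of the article):

```python
import pandas
import matplotlib.pyplot as plt

df = pandas.read_csv('/kaggle/input/titanic/train.csv')

# Scatter plot of Fare against Pclass (ticket class 1, 2, 3)
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks([1, 2, 3])
plt.xlabel('Pclass')
plt.ylabel('Fare')
plt.show()
```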
The horizontal axis is Pclass. Fares for class 1 tend to be high, and the grade seems to improve in the order 1 > 2 > 3. The scatter plot also shows fares of 0 in every Pclass. Let's look at the raw data, sorted by Fare in ascending order.
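The table below can be reproduced with a one-liner (assuming `df` is the DataFrame loaded above):

```python
# Show the cheapest fares first
df.sort_values('Fare').head(20)
```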
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
180 | 0 | 3 | male | 36 | 0 | 0 | LINE | 0 | | S |
264 | 0 | 1 | male | 40 | 0 | 0 | 112059 | 0 | B94 | S |
272 | 1 | 3 | male | 25 | 0 | 0 | LINE | 0 | | S |
278 | 0 | 2 | male | | 0 | 0 | 239853 | 0 | | S |
303 | 0 | 3 | male | 19 | 0 | 0 | LINE | 0 | | S |
414 | 0 | 2 | male | | 0 | 0 | 239853 | 0 | | S |
467 | 0 | 2 | male | | 0 | 0 | 239853 | 0 | | S |
482 | 0 | 2 | male | | 0 | 0 | 239854 | 0 | | S |
598 | 0 | 3 | male | 49 | 0 | 0 | LINE | 0 | | S |
634 | 0 | 1 | male | | 0 | 0 | 112052 | 0 | | S |
675 | 0 | 2 | male | | 0 | 0 | 239856 | 0 | | S |
733 | 0 | 2 | male | | 0 | 0 | 239855 | 0 | | S |
807 | 0 | 1 | male | 39 | 0 | 0 | 112050 | 0 | A36 | S |
816 | 0 | 1 | male | | 0 | 0 | 112058 | 0 | B102 | S |
823 | 0 | 1 | male | 38 | 0 | 0 | 19972 | 0 | | S |
379 | 0 | 3 | male | 20 | 0 | 0 | 2648 | 4.0125 | | C |
873 | 0 | 1 | male | 33 | 0 | 0 | 695 | 5 | B51 B53 B55 | S |
327 | 0 | 3 | male | 61 | 0 | 0 | 345364 | 6.2375 | | S |
844 | 0 | 3 | male | 34.5 | 0 | 0 | 2683 | 6.4375 | | C |
Sorted in ascending order of Fare, fares of 0 appear in Pclass 1, 2, and 3 alike. Fare 0 does not seem to mean "free"; it more likely means "fare unknown". Let's exclude Fare 0 from the training data. After excluding it, the scatter plot looks like this.
It's a little easier to read now. One low point at Pclass 1 still bothers me: the table above shows a Fare of 5 for Pclass 1. This may also be an outlier, so let's exclude it too.
Now the fares for each Pclass fall within a sensible range.
The ticket number is a nominal scale. Let's sort by ticket number in ascending order.
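Again a one-line sketch (Ticket is a string column, so the order is lexicographic):

```python
# Sort by ticket number in ascending (lexicographic) order
df.sort_values('Ticket').head(10)
```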
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
258 | 1 | 1 | female | 30 | 0 | 0 | 110152 | 86.5 | B77 | S |
505 | 1 | 1 | female | 16 | 0 | 0 | 110152 | 86.5 | B79 | S |
760 | 1 | 1 | female | 33 | 0 | 0 | 110152 | 86.5 | B77 | S |
263 | 0 | 1 | male | 52 | 1 | 1 | 110413 | 79.65 | E67 | S |
559 | 1 | 1 | female | 39 | 1 | 1 | 110413 | 79.65 | E67 | S |
586 | 1 | 1 | female | 18 | 0 | 2 | 110413 | 79.65 | E68 | S |
111 | 0 | 1 | male | 47 | 0 | 0 | 110465 | 52 | C110 | S |
476 | 0 | 1 | male | | 0 | 0 | 110465 | 52 | A14 | S |
431 | 1 | 1 | male | 28 | 0 | 0 | 110564 | 26.55 | C52 | S |
367 | 1 | 1 | female | 60 | 1 | 0 | 110813 | 75.25 | D37 | C |
Looking at the ticket numbers, there is no obvious regularity: some are plain numbers, others mix letters and numbers. You can also see that several people share the same ticket number, and those people often share a surname, so they are presumably families. Furthermore, people with the same ticket number tend to have the same Survived value. So the policy is to label passengers by ticket number, like this:
PassengerId | Survived | Ticket | Ticket (label) |
---|---|---|---|
505 | 1 | 110152 | Ticket A |
258 | 1 | 110152 | Ticket A |
760 | 1 | 110152 | Ticket A |
586 | 1 | 110413 | Ticket B |
559 | 1 | 110413 | Ticket B |
263 | 0 | 110413 | Ticket B |
111 | 0 | 110465 | Ticket C |
476 | 0 | 110465 | Ticket C |
431 | 1 | 110564 | NaN |
367 | 1 | 110813 | NaN |
We want to group identical ticket numbers, so tickets held by only one passenger become NaN. Labels like Ticket A and Ticket B could be converted to numbers directly, but one-hot encoding makes it explicit that these are labels, not quantities. The image is as follows; the source code comes later, but the one-hot encoding itself is done with pandas.get_dummies.
PassengerId | Survived | Ticket A | Ticket B | Ticket C |
---|---|---|---|---|
505 | 1 | 1 | 0 | 0 |
258 | 1 | 1 | 0 | 0 |
760 | 1 | 1 | 0 | 0 |
586 | 1 | 0 | 1 | 0 |
559 | 1 | 0 | 1 | 0 |
263 | 0 | 0 | 1 | 0 |
111 | 0 | 0 | 0 | 1 |
476 | 0 | 0 | 0 | 1 |
431 | 1 | 0 | 0 | 0 |
367 | 1 | 0 | 0 | 0 |
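A minimal sketch of these two steps (illustrative only; the full listing below reaches the same result with pandas.Categorical, so that train and test share the same columns):

```python
# Count how many passengers share each ticket number
counts = df['Ticket'].value_counts()

# Tickets shared by two or more passengers keep their value; unique tickets become NaN
df['Ticket'] = df['Ticket'].where(df['Ticket'].map(counts) > 1)

# One-hot encode: one column per shared ticket; NaN rows get all zeros
df = pandas.get_dummies(df, columns=['Ticket'])
```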
SibSp and Parch were also graphed last time, but let's graph them again. The correlation coefficients show no significant difference, but the graphs for both SibSp and Parch show the following:

- When SibSp or Parch is 0, Survived is 0 about twice as often as 1.
- When SibSp or Parch is 1 or 2, Survived 0 and 1 occur in roughly equal numbers.
- When SibSp or Parch is 3 or more, the sample size is too small to judge.
Last time, I excluded these columns from the training data because their correlation coefficients were small, but the values 0, 1, and 2 look usable as label data. Extracting only the rows where SibSp is less than 3 and checking the correlation again gives the following.
# Check Cramér's V when SibSp is less than 3
df_SibSp = df[df['SibSp'] < 3]
cramersV(df_SibSp['Survived'], df_SibSp['SibSp'])
# Display the cross-tabulation of Survived and SibSp (less than 3)
cross_sibsp = pandas.crosstab(df_SibSp['Survived'], df_SibSp['SibSp'])
cross_sibsp
cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
0.16260950922794606
The coefficient was 0.16, which counts as weakly correlated. I'll omit the details, but Parch gives a similar result. So, as with Ticket, let's try one-hot encoding SibSp and Parch. The image is as follows.
PassengerId | Survived | SibSp_1 | SibSp_2 | SibSp_3 | SibSp_4 | SibSp_5 | SibSp_8 |
---|---|---|---|---|---|---|---|
505 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
258 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
760 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
586 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
559 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
263 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
476 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
431 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
367 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
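A minimal sketch of this encoding (mirroring the full listing below): fix the category set from the combined train and test data so that both DataFrames get identical dummy columns.

```python
# Every SibSp value that occurs in train + test
categories = sorted(df_all['SibSp'].unique())

# Fix the category set, then one-hot encode
df['SibSp'] = pandas.Categorical(df['SibSp'], categories=categories)
df_test['SibSp'] = pandas.Categorical(df_test['SibSp'], categories=categories)
df = pandas.get_dummies(df, columns=['SibSp'])
df_test = pandas.get_dummies(df_test, columns=['SibSp'])
```

Parch is handled the same way.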
Let's check Cabin. Of the roughly 900 rows of training data (train.csv), only about 200 have a Cabin value. Cabin is a nominal scale. Grouping cabins by their first character gives the following.
In every group, Survived "1" is comparatively frequent, so the first character looks usable as label data. For Cabin, too, let's try one-hot encoding the first character. The image is as follows.
PassengerId | Survived | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
---|---|---|---|---|---|---|---|---|---|
505 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
258 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
760 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
586 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
559 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
263 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
111 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
476 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
431 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
367 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
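Cabin is dropped from the final model (see below), but a minimal sketch of this encoding, assuming `df` still holds the raw Cabin strings, would be:

```python
# Deck letter = first character of the cabin string; missing cabins stay NaN
df['Cabin'] = df['Cabin'].str[0]

# One column per deck; passengers without a cabin get all zeros
df = pandas.get_dummies(df, columns=['Cabin'])
```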
Let's train on what we have so far. The input data is as follows.
No | Item | Description | Conversion method |
---|---|---|---|
1 | Pclass | Ticket class | Standardization |
2 | Sex | Sex | Label encoding |
3 | SibSp | Siblings/spouses aboard | One-hot encoding |
4 | Parch | Parents/children aboard | One-hot encoding |
5 | Ticket | Ticket number | One-hot encoding |
6 | Fare | Fare | Standardization |
7 | Cabin | Room number | One-hot encoding of the first character |
Selecting the model as in "Try all models" (kaggle⑤) and applying the parameters found by grid search (kaggle④), I got the following model:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='exponential', max_depth=6,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=1, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
The full code is below. Note, however, that when I actually trained, the score did not improve with "Cabin" included, so in the end I excluded Cabin.
import numpy
import pandas
import matplotlib.pyplot as plt
######################################
# Cramér's V (coefficient of association)
#   >= 0.5  : very strong correlation
#   >= 0.25 : strong correlation
#   >= 0.1  : slightly weak correlation
#   <  0.1  : no correlation
######################################
def cramersV(x, y):
"""
Calc Cramer's V.
Parameters
----------
x : {numpy.ndarray, pandas.Series}
y : {numpy.ndarray, pandas.Series}
"""
table = numpy.array(pandas.crosstab(x, y)).astype(numpy.float32)
n = table.sum()
colsum = table.sum(axis=0)
rowsum = table.sum(axis=1)
expect = numpy.outer(rowsum, colsum) / n
chisq = numpy.sum((table - expect) ** 2 / expect)
return numpy.sqrt(chisq / (n * (numpy.min(table.shape) - 1)))
######################################
# Correlation ratio
#   >= 0.5  : very strong correlation
#   >= 0.25 : strong correlation
#   >= 0.1  : slightly weak correlation
#   <  0.1  : no correlation
######################################
def CorrelationV(x, y):
"""
Calc Correlation ratio
Parameters
----------
x : nominal scale {numpy.ndarray, pandas.Series}
y : ratio scale {numpy.ndarray, pandas.Series}
"""
variation = ((y - y.mean()) ** 2).sum()
    within_class = sum([((y[x == i] - y[x == i].mean()) ** 2).sum() for i in numpy.unique(x)])
    # Correlation ratio (eta squared) = 1 - within-class variation / total variation
    return 1 - within_class / variation
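# Example usage (illustrative): correlation ratio between a nominal column and
# a numeric column, e.g. CorrelationV(df['Pclass'], df['Fare'])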
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
# Extract 'PassengerId' (to combine with the result later)
df_test_index = df_test[['PassengerId']]
df_all = pandas.concat([df, df_test], sort=False)
##############################
# Data preprocessing
# Extract the required items
##############################
df = df[['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']]
df_test = df_test[['Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']]
##############################
# Scatter plot of Fare by Pclass
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()
##############################
# Exclude Fare 0
##############################
df = df[df['Fare'] != 0].reset_index(drop=True)
##############################
# Scatter plot of Fare by Pclass (Fare 0 excluded)
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()
##############################
# Exclude Fare 5 (the Pclass 1 outlier)
##############################
df = df[df['Fare'] != 5].reset_index(drop=True)
##############################
# Scatter plot of Fare by Pclass (outliers excluded)
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()
##############################
# Display Survived and Age crosstabulation
##############################
cross_age = pandas.crosstab(df_all['Survived'], round(df_all['Age'],-1))
cross_age
cross_age.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
##############################
# Display Survived and SibSp crosstabulation
##############################
cross_sibsp = pandas.crosstab(df['Survived'], df['SibSp'])
cross_sibsp
cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
# Check Cramér's V when SibSp is less than 3
df_SibSp = df[df['SibSp'] < 3]
cramersV(df_SibSp['Survived'], df_SibSp['SibSp'])
##############################
# Display crosstabulation of Survived and SibSp (less than 3)
##############################
cross_sibsp = pandas.crosstab(df_SibSp['Survived'], df_SibSp['SibSp'])
cross_sibsp
cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
##############################
# Display Survived and Parch crosstabulation
##############################
cross_parch = pandas.crosstab(df['Survived'], df['Parch'])
cross_parch
cross_parch.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
# Check Cramér's V when Parch is less than 3
df_Parch = df[df['Parch'] < 3]
cramersV(df_Parch['Survived'], df_Parch['Parch'])
##############################
# Display crosstabulation of Survived and Parch (less than 3)
##############################
cross_parch = pandas.crosstab(df_Parch['Survived'], df_Parch['Parch'])
cross_parch
cross_parch.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
from sklearn.preprocessing import LabelEncoder
##############################
# Data preprocessing
# Encode string labels as numbers
##############################
##############################
# Sex
##############################
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)
##############################
# Data preprocessing
# One-hot encoding
##############################
##############################
# SibSp
##############################
SibSp_values = df_all['SibSp'].value_counts()
SibSp_values = pandas.Series(SibSp_values.index, name='SibSp')
categories = sorted(set(SibSp_values.tolist()))  # sorted list: newer pandas rejects a raw set here
df['SibSp'] = pandas.Categorical(df['SibSp'], categories=categories)
df_test['SibSp'] = pandas.Categorical(df_test['SibSp'], categories=categories)
df = pandas.get_dummies(df, columns=['SibSp'])
df_test = pandas.get_dummies(df_test, columns=['SibSp'])
##############################
# Parch
##############################
Parch_values = df_all['Parch'].value_counts()
Parch_values = pandas.Series(Parch_values.index, name='Parch')
categories = sorted(set(Parch_values.tolist()))  # sorted list, as above
df['Parch'] = pandas.Categorical(df['Parch'], categories=categories)
df_test['Parch'] = pandas.Categorical(df_test['Parch'], categories=categories)
df = pandas.get_dummies(df, columns=['Parch'])
df_test = pandas.get_dummies(df_test, columns=['Parch'])
##############################
# Ticket
##############################
ticket_values = df_all['Ticket'].value_counts()
ticket_values = ticket_values[ticket_values > 1]
ticket_values = pandas.Series(ticket_values.index, name='Ticket')
categories = sorted(set(ticket_values.tolist()))  # sorted list, as above
df['Ticket'] = pandas.Categorical(df['Ticket'], categories=categories)
df_test['Ticket'] = pandas.Categorical(df_test['Ticket'], categories=categories)
df = pandas.get_dummies(df, columns=['Ticket'])
df_test = pandas.get_dummies(df_test, columns=['Ticket'])
##############################
# Data preprocessing
# Standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler
# Standardize Pclass and Fare
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']
df_test_std = pandas.DataFrame(standard.transform(df_test[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df_test['Pclass'] = df_test_std['Pclass']
df_test['Fare'] = df_test_std['Fare']
##############################
# Data preprocessing
# Handle missing values
##############################
# The test set has a missing Fare; fill it with 0 (the mean, after standardization)
df_test = df_test.fillna({'Fare':0})
# Prepare training data
x_train = df.drop(columns='Survived').values
y_train = df[['Survived']].values
# Flatten y_train to a 1-D array
y_train = numpy.ravel(y_train)
##############################
# Build the model
# GradientBoostingClassifier
##############################
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1, loss='exponential', learning_rate=0.1, max_depth=6)
import os

# Remove any previous submission file
if os.path.exists('./result.csv'):
    os.remove('./result.csv')
##############################
# Training
##############################
model.fit(x_train, y_train)
##############################
# Predict results
##############################
x_test = df_test.values
y_test = model.predict(x_test)
# Combine PassengerId with the predicted result
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)
# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)
Submitting this gave a score of 0.80382.
The score exceeded 0.8 and put me in the top 10%. The input data finally used is as follows.
No | Item | Description | Conversion method |
---|---|---|---|
1 | Pclass | Ticket class | Standardization |
2 | Sex | Sex | Label encoding |
3 | SibSp | Siblings/spouses aboard | One-hot encoding |
4 | Parch | Parents/children aboard | One-hot encoding |
5 | Ticket | Ticket number | One-hot encoding |
6 | Fare | Fare | Standardization |
So far I have been working with scikit-learn, but there are other machine-learning frameworks as well. Next time I would like to train with Keras.
2020/01/29 First edition released
2020/02/03 Corrected typographical errors
2020/02/15 Added link to the next article