This is the story of my first Kaggle competition. In the previous article, "Try all scikit-learn models with Kaggle's Titanic", I performed cross-validation on scikit-learn models and managed to raise my score a little. This time, I would like to do what should have been done first: check the raw data.
Background
I read a book called "The Power of Analysis that Changes the Company". One of its points is: check the raw data before analyzing it. Outliers cannot be found without looking at the raw data, so before starting any analysis you should visualize the raw data and look for anomalies; the book argues this should become a habit. So this time: check the raw data, check for abnormal values, and reconsider how each column is used.
The upshot: by scrutinizing the input data, the score improved a little, to 0.80382, which put me in the top 9% (as of January 7, 2020). Let me walk through the flow up to submission.
Let's check some raw data.
Let's make a scatter plot of Fare for each Pclass (ticket class). It looks like this:
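For reference, the plot can be produced like this (a minimal sketch; the same plot is drawn in the full listing at the end of the article):

```python
import pandas
import matplotlib.pyplot as plt

df = pandas.read_csv('/kaggle/input/titanic/train.csv')

# Scatter plot of Fare against Pclass (ticket class 1, 2, 3)
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks([1, 2, 3])
plt.xlabel('Pclass')
plt.ylabel('Fare')
plt.show()
```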
The horizontal axis is Pclass. Fares for class 1 tend to be high, and the grade seems to improve in the order 1 > 2 > 3. The scatter plot also shows fares of 0 in every Pclass. Let's look at the raw data, sorted by Fare in ascending order.
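The table below can be reproduced with a one-liner (assuming `df` is the DataFrame loaded above):

```python
# Show the cheapest fares first
df.sort_values('Fare').head(20)
```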
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
180 | 0 | 3 | male | 36 | 0 | 0 | LINE | 0 | | S |
264 | 0 | 1 | male | 40 | 0 | 0 | 112059 | 0 | B94 | S |
272 | 1 | 3 | male | 25 | 0 | 0 | LINE | 0 | | S |
278 | 0 | 2 | male | | 0 | 0 | 239853 | 0 | | S |
303 | 0 | 3 | male | 19 | 0 | 0 | LINE | 0 | | S |
414 | 0 | 2 | male | | 0 | 0 | 239853 | 0 | | S |
467 | 0 | 2 | male | | 0 | 0 | 239853 | 0 | | S |
482 | 0 | 2 | male | | 0 | 0 | 239854 | 0 | | S |
598 | 0 | 3 | male | 49 | 0 | 0 | LINE | 0 | | S |
634 | 0 | 1 | male | | 0 | 0 | 112052 | 0 | | S |
675 | 0 | 2 | male | | 0 | 0 | 239856 | 0 | | S |
733 | 0 | 2 | male | | 0 | 0 | 239855 | 0 | | S |
807 | 0 | 1 | male | 39 | 0 | 0 | 112050 | 0 | A36 | S |
816 | 0 | 1 | male | | 0 | 0 | 112058 | 0 | B102 | S |
823 | 0 | 1 | male | 38 | 0 | 0 | 19972 | 0 | | S |
379 | 0 | 3 | male | 20 | 0 | 0 | 2648 | 4.0125 | | C |
873 | 0 | 1 | male | 33 | 0 | 0 | 695 | 5 | B51 B53 B55 | S |
327 | 0 | 3 | male | 61 | 0 | 0 | 345364 | 6.2375 | | S |
844 | 0 | 3 | male | 34.5 | 0 | 0 | 2683 | 6.4375 | | C |
Sorted in ascending order of Fare, fares of 0 appear in Pclass 1, 2, and 3 alike. Fare 0 does not seem to mean "free"; it more likely means "fare unknown". Let's exclude Fare 0 from the training data. After excluding it, the scatter plot looks like this.
It's a little easier to read now. One low point at Pclass 1 still bothers me: the table above shows a Fare of 5 for Pclass 1. This may also be an outlier, so let's exclude it too.
Now the fares for each Pclass fall within a sensible range.
The ticket number is a nominal scale. Let's sort by ticket number in ascending order.
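Again a one-line sketch (Ticket is a string column, so the order is lexicographic):

```python
# Sort by ticket number in ascending (lexicographic) order
df.sort_values('Ticket').head(10)
```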
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
258 | 1 | 1 | female | 30 | 0 | 0 | 110152 | 86.5 | B77 | S |
505 | 1 | 1 | female | 16 | 0 | 0 | 110152 | 86.5 | B79 | S |
760 | 1 | 1 | female | 33 | 0 | 0 | 110152 | 86.5 | B77 | S |
263 | 0 | 1 | male | 52 | 1 | 1 | 110413 | 79.65 | E67 | S |
559 | 1 | 1 | female | 39 | 1 | 1 | 110413 | 79.65 | E67 | S |
586 | 1 | 1 | female | 18 | 0 | 2 | 110413 | 79.65 | E68 | S |
111 | 0 | 1 | male | 47 | 0 | 0 | 110465 | 52 | C110 | S |
476 | 0 | 1 | male | | 0 | 0 | 110465 | 52 | A14 | S |
431 | 1 | 1 | male | 28 | 0 | 0 | 110564 | 26.55 | C52 | S |
367 | 1 | 1 | female | 60 | 1 | 0 | 110813 | 75.25 | D37 | C |
Looking at the ticket numbers, there is no obvious regularity: some are plain numbers, others mix letters and numbers. You can also see that several people share the same ticket number, and those people often share a surname, so they are presumably families. Furthermore, people with the same ticket number tend to have the same Survived value. So the policy is to label passengers by ticket number, like this:
PassengerId | Survived | Ticket | Ticket (label) |
---|---|---|---|
505 | 1 | 110152 | Ticket A |
258 | 1 | 110152 | Ticket A |
760 | 1 | 110152 | Ticket A |
586 | 1 | 110413 | Ticket B |
559 | 1 | 110413 | Ticket B |
263 | 0 | 110413 | Ticket B |
111 | 0 | 110465 | Ticket C |
476 | 0 | 110465 | Ticket C |
431 | 1 | 110564 | NaN |
367 | 1 | 110813 | NaN |
We want to group identical ticket numbers, so tickets held by only one passenger become NaN. Labels like Ticket A and Ticket B could be converted to numbers directly, but one-hot encoding makes it explicit that these are labels, not quantities. The image is as follows; the source code comes later, but the one-hot encoding itself is done with pandas.get_dummies.
PassengerId | Survived | Ticket A | Ticket B | Ticket C |
---|---|---|---|---|
505 | 1 | 1 | 0 | 0 |
258 | 1 | 1 | 0 | 0 |
760 | 1 | 1 | 0 | 0 |
586 | 1 | 0 | 1 | 0 |
559 | 1 | 0 | 1 | 0 |
263 | 0 | 0 | 1 | 0 |
111 | 0 | 0 | 0 | 1 |
476 | 0 | 0 | 0 | 1 |
431 | 1 | 0 | 0 | 0 |
367 | 1 | 0 | 0 | 0 |
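A minimal sketch of these two steps (illustrative only; the full listing below reaches the same result with pandas.Categorical, so that train and test share the same columns):

```python
# Count how many passengers share each ticket number
counts = df['Ticket'].value_counts()

# Tickets shared by two or more passengers keep their value; unique tickets become NaN
df['Ticket'] = df['Ticket'].where(df['Ticket'].map(counts) > 1)

# One-hot encode: one column per shared ticket; NaN rows get all zeros
df = pandas.get_dummies(df, columns=['Ticket'])
```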
SibSp and Parch were also graphed last time, but let's graph them again. The correlation coefficients show no significant difference, but the graphs for both SibSp and Parch show the following:

- When SibSp or Parch is 0, Survived is 0 about twice as often as 1.
- When SibSp or Parch is 1 or 2, Survived 0 and 1 occur in roughly equal numbers.
- When SibSp or Parch is 3 or more, the sample size is too small to judge.
Last time, I excluded these columns from the training data because their correlation coefficients were small, but the values 0, 1, and 2 look usable as label data. Extracting only the rows where SibSp is less than 3 and checking the correlation again gives the following.
# Check Cramér's V when SibSp is less than 3
df_SibSp = df[df['SibSp'] < 3]
cramersV(df_SibSp['Survived'], df_SibSp['SibSp'])
# Display the cross-tabulation of Survived and SibSp (less than 3)
cross_sibsp = pandas.crosstab(df_SibSp['Survived'], df_SibSp['SibSp'])
cross_sibsp
cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
0.16260950922794606
The coefficient was 0.16, which counts as weakly correlated. I'll omit the details, but Parch gives a similar result. So, as with Ticket, let's try one-hot encoding SibSp and Parch. The image is as follows.
PassengerId | Survived | SibSp_1 | SibSp_2 | SibSp_3 | SibSp_4 | SibSp_5 | SibSp_8 |
---|---|---|---|---|---|---|---|
505 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
258 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
760 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
586 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
559 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
263 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
111 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
476 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
431 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
367 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
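A minimal sketch of this encoding (mirroring the full listing below): fix the category set from the combined train and test data so that both DataFrames get identical dummy columns.

```python
# Every SibSp value that occurs in train + test
categories = sorted(df_all['SibSp'].unique())

# Fix the category set, then one-hot encode
df['SibSp'] = pandas.Categorical(df['SibSp'], categories=categories)
df_test['SibSp'] = pandas.Categorical(df_test['SibSp'], categories=categories)
df = pandas.get_dummies(df, columns=['SibSp'])
df_test = pandas.get_dummies(df_test, columns=['SibSp'])
```

Parch is handled the same way.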
Let's check Cabin. Of the roughly 900 rows of training data (train.csv), only about 200 have a Cabin value. Cabin is a nominal scale. Grouping cabins by their first character gives the following.
In every group, Survived "1" is comparatively frequent, so the first character looks usable as label data. For Cabin, too, let's try one-hot encoding the first character. The image is as follows.
PassengerId | Survived | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
---|---|---|---|---|---|---|---|---|---|
505 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
258 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
760 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
586 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
559 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
263 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
111 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
476 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
431 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
367 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
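Cabin is dropped from the final model (see below), but a minimal sketch of this encoding, assuming `df` still holds the raw Cabin strings, would be:

```python
# Deck letter = first character of the cabin string; missing cabins stay NaN
df['Cabin'] = df['Cabin'].str[0]

# One column per deck; passengers without a cabin get all zeros
df = pandas.get_dummies(df, columns=['Cabin'])
```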
Let's train on what we have so far. The input data is as follows.
No | Item | Description | Conversion method |
---|---|---|---|
1 | Pclass | Ticket class | Standardization |
2 | Sex | Sex | Label encoding |
3 | SibSp | Siblings/spouses aboard | One-hot encoding |
4 | Parch | Parents/children aboard | One-hot encoding |
5 | Ticket | Ticket number | One-hot encoding |
6 | Fare | Fare | Standardization |
7 | Cabin | Room number | One-hot encoding of the first character |
Selecting the model as in "Try all models" (kaggle⑤) and applying the parameters found by grid search (kaggle④), I got the following model:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='exponential', max_depth=6,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='auto',
random_state=1, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
The full code is below. Note, however, that when I actually trained, the score did not improve with "Cabin" included, so in the end I excluded Cabin.
import numpy
import pandas
import matplotlib.pyplot as plt
######################################
# Cramér's V (coefficient of association)
#   >= 0.5  : very strong correlation
#   >= 0.25 : strong correlation
#   >= 0.1  : slightly weak correlation
#   <  0.1  : no correlation
######################################
def cramersV(x, y):
"""
Calc Cramer's V.
Parameters
----------
x : {numpy.ndarray, pandas.Series}
y : {numpy.ndarray, pandas.Series}
"""
table = numpy.array(pandas.crosstab(x, y)).astype(numpy.float32)
n = table.sum()
colsum = table.sum(axis=0)
rowsum = table.sum(axis=1)
expect = numpy.outer(rowsum, colsum) / n
chisq = numpy.sum((table - expect) ** 2 / expect)
return numpy.sqrt(chisq / (n * (numpy.min(table.shape) - 1)))
######################################
# Correlation ratio
#   >= 0.5  : very strong correlation
#   >= 0.25 : strong correlation
#   >= 0.1  : slightly weak correlation
#   <  0.1  : no correlation
######################################
def CorrelationV(x, y):
"""
Calc Correlation ratio
Parameters
----------
x : nominal scale {numpy.ndarray, pandas.Series}
y : ratio scale {numpy.ndarray, pandas.Series}
"""
variation = ((y - y.mean()) ** 2).sum()
    within_class = sum([((y[x == i] - y[x == i].mean()) ** 2).sum() for i in numpy.unique(x)])
    # Correlation ratio (eta squared) = 1 - within-class variation / total variation
    return 1 - within_class / variation
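# Example usage (illustrative): correlation ratio between a nominal column and
# a numeric column, e.g. CorrelationV(df['Pclass'], df['Fare'])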
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
# Extract 'PassengerId' (to combine with the result later)
df_test_index = df_test[['PassengerId']]
df_all = pandas.concat([df, df_test], sort=False)
##############################
# Data preprocessing
# Extract the required items
##############################
df = df[['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']]
df_test = df_test[['Pclass', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']]
##############################
# Scatter plot of Fare by Pclass
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()
##############################
# Exclude Fare 0
##############################
df = df[df['Fare'] != 0].reset_index(drop=True)
##############################
# Scatter plot of Fare by Pclass (Fare 0 excluded)
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()
##############################
# Exclude Fare 5 (the Pclass 1 outlier)
##############################
df = df[df['Fare'] != 5].reset_index(drop=True)
##############################
# Scatter plot of Fare by Pclass (outliers excluded)
##############################
plt.scatter(df['Pclass'], df['Fare'])
plt.xticks(numpy.linspace(1, 3, 3))
plt.ylim(0, 300)
plt.show()
##############################
# Display Survived and Age crosstabulation
##############################
cross_age = pandas.crosstab(df_all['Survived'], round(df_all['Age'],-1))
cross_age
cross_age.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
##############################
# Display Survived and SibSp crosstabulation
##############################
cross_sibsp = pandas.crosstab(df['Survived'], df['SibSp'])
cross_sibsp
cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
# Check Cramér's V when SibSp is less than 3
df_SibSp = df[df['SibSp'] < 3]
cramersV(df_SibSp['Survived'], df_SibSp['SibSp'])
##############################
# Display crosstabulation of Survived and SibSp (less than 3)
##############################
cross_sibsp = pandas.crosstab(df_SibSp['Survived'], df_SibSp['SibSp'])
cross_sibsp
cross_sibsp.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
##############################
# Display Survived and Parch crosstabulation
##############################
cross_parch = pandas.crosstab(df['Survived'], df['Parch'])
cross_parch
cross_parch.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
# Check Cramér's V when Parch is less than 3
df_Parch = df[df['Parch'] < 3]
cramersV(df_Parch['Survived'], df_Parch['Parch'])
##############################
# Display crosstabulation of Survived and Parch (less than 3)
##############################
cross_parch = pandas.crosstab(df_Parch['Survived'], df_Parch['Parch'])
cross_parch
cross_parch.T.plot(kind='bar', stacked=False, width=0.8)
plt.show()
from sklearn.preprocessing import LabelEncoder
##############################
# Data preprocessing
# Encode string labels as numbers
##############################
##############################
# Sex
##############################
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)
##############################
# Data preprocessing
# One-hot encoding
##############################
##############################
# SibSp
##############################
SibSp_values = df_all['SibSp'].value_counts()
SibSp_values = pandas.Series(SibSp_values.index, name='SibSp')
categories = sorted(set(SibSp_values.tolist()))  # sorted list: newer pandas rejects a raw set here
df['SibSp'] = pandas.Categorical(df['SibSp'], categories=categories)
df_test['SibSp'] = pandas.Categorical(df_test['SibSp'], categories=categories)
df = pandas.get_dummies(df, columns=['SibSp'])
df_test = pandas.get_dummies(df_test, columns=['SibSp'])
##############################
# Parch
##############################
Parch_values = df_all['Parch'].value_counts()
Parch_values = pandas.Series(Parch_values.index, name='Parch')
categories = sorted(set(Parch_values.tolist()))  # sorted list, as above
df['Parch'] = pandas.Categorical(df['Parch'], categories=categories)
df_test['Parch'] = pandas.Categorical(df_test['Parch'], categories=categories)
df = pandas.get_dummies(df, columns=['Parch'])
df_test = pandas.get_dummies(df_test, columns=['Parch'])
##############################
# Ticket
##############################
ticket_values = df_all['Ticket'].value_counts()
ticket_values = ticket_values[ticket_values > 1]
ticket_values = pandas.Series(ticket_values.index, name='Ticket')
categories = sorted(set(ticket_values.tolist()))  # sorted list, as above
df['Ticket'] = pandas.Categorical(df['Ticket'], categories=categories)
df_test['Ticket'] = pandas.Categorical(df_test['Ticket'], categories=categories)
df = pandas.get_dummies(df, columns=['Ticket'])
df_test = pandas.get_dummies(df_test, columns=['Ticket'])
##############################
# Data preprocessing
# Standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler
# Standardize Pclass and Fare
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']
df_test_std = pandas.DataFrame(standard.transform(df_test[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df_test['Pclass'] = df_test_std['Pclass']
df_test['Fare'] = df_test_std['Fare']
##############################
# Data preprocessing
# Handle missing values
##############################
# The test set has a missing Fare; fill it with 0 (the mean, after standardization)
df_test = df_test.fillna({'Fare':0})
# Prepare training data
x_train = df.drop(columns='Survived').values
y_train = df[['Survived']].values
# Flatten y_train to a 1-D array
y_train = numpy.ravel(y_train)
##############################
# Build the model
# GradientBoostingClassifier
##############################
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=1, loss='exponential', learning_rate=0.1, max_depth=6)
import os

# Remove any previous submission file
if os.path.exists('./result.csv'):
    os.remove('./result.csv')
##############################
# Training
##############################
model.fit(x_train, y_train)
##############################
# Predict results
##############################
x_test = df_test.values
y_test = model.predict(x_test)
# Combine PassengerId with the predicted result
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)
# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)
Submitting this gave a score of 0.80382.
The score exceeded 0.8 and put me in the top 10%. The input data finally used is as follows.
No | Item | Description | Conversion method |
---|---|---|---|
1 | Pclass | Ticket class | Standardization |
2 | Sex | Sex | Label encoding |
3 | SibSp | Siblings/spouses aboard | One-hot encoding |
4 | Parch | Parents/children aboard | One-hot encoding |
5 | Ticket | Ticket number | One-hot encoding |
6 | Fare | Fare | Standardization |
So far I have been working with scikit-learn, but there are other machine-learning frameworks as well. Next time I would like to train with Keras.
2020/01/29 First edition released
2020/02/03 Corrected typographical errors
2020/02/15 Added link to the next article