This is the story of my first time participating in a Kaggle competition. In the previous post, "First Kaggle", I mainly covered: how to sign up for Kaggle, how to enter a competition, how to join and write code, and how to submit results. This time I would like to go as far as actually training a model on the Titanic competition. Can we beat the sample code's accuracy of 76%?
First, some context on how much machine learning I knew. About half a year ago (April 2019) I became interested in machine learning and studied mainly from the following books. ・ [Python Machine Learning Programming: Theory and Practice by Expert Data Scientists](https://www.amazon.co.jp/dp/4295003379) ・ [Detailed Deep Learning: Time Series Data Processing with TensorFlow and Keras](https://www.amazon.co.jp/dp/4839962510)
In other words, I was at the level of knowing what "scikit-learn", "TensorFlow", and "Keras" are, but not much more.
My understanding is roughly as follows.
At my level, I can just about manage to write training code using scikit-learn or Keras.
The machine learning workflow is as follows.
First, inspect and prepare the data.
First of all, since we will start with a new Notebook, separate from the previous one, click "New Notebook" and, as before, select "Python" as the language and "Notebook" as the type.
Check train.csv. Since we can write code, we could display the data with pandas' `head()`, but the file can also be downloaded, so let's download it. Click train.csv and the first 100 rows are shown on screen; the download button saves the file.
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | 1 | 1 | Cumings, | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
4 | 1 | 1 | Futrelle, | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
5 | 0 | 3 | Allen, | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
6 | 0 | 3 | Moran, | male | | 0 | 0 | 330877 | 8.4583 | | Q |
7 | 0 | 1 | McCarthy | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, | male | 2 | 3 | 1 | 349909 | 21.075 | | S |
9 | 1 | 3 | Johnson, | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
10 | 1 | 2 | Nasser, | female | 14 | 1 | 0 | 237736 | 30.0708 | | C |
Check the CSV in Excel or similar. Some columns were unclear to me, but they are described on the competition's Data page. As an aside, as explained in the Overview, the sample gender_submission.csv apparently assumes that "only women survived". Indeed, the values of "Sex" in test.csv and "Survived" in gender_submission.csv match exactly. That makes the sample's 76% accuracy quite a formidable baseline.
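The "only women survived" claim can also be checked in code. This is a minimal sketch with a few inline rows standing in for the real test.csv and gender_submission.csv files (the PassengerId values and labels here are made up for illustration).

```python
import pandas

# A few toy rows standing in for test.csv and gender_submission.csv
test = pandas.DataFrame({
    'PassengerId': [892, 893, 894, 895],
    'Sex': ['male', 'female', 'male', 'female'],
})
submission = pandas.DataFrame({
    'PassengerId': [892, 893, 894, 895],
    'Survived': [0, 1, 0, 1],
})

# "Only women survived" means Survived == 1 exactly where Sex == 'female'
merged = test.merge(submission, on='PassengerId')
only_women_survived = ((merged['Sex'] == 'female') == (merged['Survived'] == 1)).all()
print(only_women_survived)  # True
```

Running the same merge against the real files should print True as well, if the sample submission really is the gender rule.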
Data Dictionary
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Consider which columns to use for training. "Survived" is what we are asked to predict, so it becomes the training label. Since women and children were more likely to board lifeboats first, "Sex" and "Age" are used. Wealth may also have had an effect, so let's use "Pclass" (ticket class) and "Fare" as well. "Name", "Ticket", and "Embarked" do not seem relevant, so they are excluded. The tricky ones are "SibSp" and "Parch". Aggregating them in Excel gives the tables below. They do look related to survival, but this time I excluded them for simplicity.
value of sibsp | Survived | Total | Survival rate |
---|---|---|---|
0 | 210 | 608 | 35% |
1 | 112 | 209 | 54% |
2 | 13 | 28 | 46% |
3 | 4 | 16 | 25% |
4 | 3 | 18 | 17% |
5 | 0 | 5 | 0% |
8 | 0 | 7 | 0% |
value of parch | Survived | Total | Survival rate |
---|---|---|---|
0 | 233 | 678 | 34% |
1 | 65 | 118 | 55% |
2 | 40 | 80 | 50% |
3 | 3 | 5 | 60% |
4 | 0 | 4 | 0% |
5 | 1 | 5 | 20% |
6 | 0 | 1 | 0% |
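The Excel aggregation above can also be done with pandas `groupby`. This is a sketch on a handful of made-up rows; the real numbers come from train.csv.

```python
import pandas

# Toy rows standing in for train.csv
df = pandas.DataFrame({
    'SibSp':    [0, 0, 0, 1, 1, 2],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# Survivors, totals, and survival rate per SibSp value
summary = df.groupby('SibSp')['Survived'].agg(['sum', 'count'])
summary['rate'] = summary['sum'] / summary['count']
print(summary)
```

Replacing the toy DataFrame with the real train.csv (and 'SibSp' with 'Parch') reproduces the tables above.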
Delete the sample code and write the following code, which loads train.csv and extracts only the required columns ('Survived', 'Pclass', 'Sex', 'Age', 'Fare').
import numpy
import pandas
##############################
# Data preprocessing 1
# Extract the required columns
##############################
# Load train.csv
df_train = pandas.read_csv('/kaggle/input/titanic/train.csv')
df_train = df_train.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]
df_train.head()
index | Survived | Pclass | Sex | Age | Fare |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22 | 7.25 |
1 | 1 | 1 | female | 38 | 71.2833 |
2 | 1 | 3 | female | 26 | 7.925 |
3 | 1 | 1 | female | 35 | 53.1 |
4 | 0 | 3 | male | 35 | 8.05 |
Only the required columns have been extracted.
Check for missing values.
##############################
# Data preprocessing 2
# Handle missing values
##############################
# Check for missing values
df_train.isnull().sum()
Column | count |
---|---|
Survived | 0 |
Pclass | 0 |
Sex | 0 |
Age | 177 |
Fare | 0 |
Many rows are missing Age. Ideally we would fill in the missing values, but this time let's simply delete those rows.
# Delete rows where Age is null
df_train = df_train.dropna(subset=['Age']).reset_index(drop=True)
len(df_train)
count |
---|
714 |
The rows with null Age have been removed.
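As an alternative to dropping rows, the missing ages could be filled instead. This is a sketch on toy data, using the median age as the fill value (one common choice; the real df_train would be used in practice).

```python
import pandas

# Toy Age column with missing values
df = pandas.DataFrame({'Age': [22.0, None, 38.0, None, 26.0]})

# Fill missing ages with the median instead of deleting the rows
median_age = df['Age'].median()  # 26.0 for this toy data
df['Age'] = df['Age'].fillna(median_age)
print(df['Age'].isnull().sum())  # 0
```

Filling keeps all 891 training rows instead of the 714 that survive the dropna.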
Gender "male" and "female" are difficult to handle as they are, so digitize them. Since there are only two types, male and female, you can convert them yourself, but scikit-learn has a convenient class called LabelEncoder </ b>, so let's use it. LabelEncoder: The fit method and fit_transform method replace the character string with an integer from 0 to N-1 when there are N types of character strings appearing in the input.
##############################
# Data preprocessing 3
# Encode string labels as numbers
##############################
from sklearn.preprocessing import LabelEncoder
# Encode Sex numerically with LabelEncoder
encoder = LabelEncoder()
df_train['Sex'] = encoder.fit_transform(df_train['Sex'].values)
df_train.head()
index | Survived | Pclass | Sex | Age | Fare |
---|---|---|---|---|---|
0 | 0 | 3 | 1 | 22 | 7.25 |
1 | 1 | 1 | 0 | 38 | 71.2833 |
2 | 1 | 3 | 0 | 26 | 7.925 |
3 | 1 | 1 | 0 | 35 | 53.1 |
4 | 0 | 3 | 1 | 35 | 8.05 |
"Sex" has been quantified. This encoder will also be used later when quantifying sex in test.csv.
Rather than feeding raw numeric values into training, it often works better to adjust their scale first (standardization). For example, when analyzing exam results, standardized scores are easier to compare than raw points (out of 100 or out of 200). Let's standardize "Age" and "Fare". As with label encoding, scikit-learn has a handy class for standardization: StandardScaler.
##############################
# Data preprocessing 4
# Standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler
# Fit the scaler on Age and Fare
standard = StandardScaler()
df_train_std = pandas.DataFrame(standard.fit_transform(df_train.loc[:, ['Age', 'Fare']]), columns=['Age', 'Fare'])
# Standardize Age
df_train['Age'] = df_train_std['Age']
# Standardize Fare
df_train['Fare'] = df_train_std['Fare']
df_train.head()
index | Survived | Pclass | Sex | Age | Fare |
---|---|---|---|---|---|
0 | 0 | 3 | 1 | -0.530376641 | -0.518977865 |
1 | 1 | 1 | 0 | 0.571830994 | 0.69189675 |
2 | 1 | 3 | 0 | -0.254824732 | -0.506213563 |
3 | 1 | 1 | 0 | 0.365167062 | 0.348049152 |
4 | 0 | 3 | 1 | 0.365167062 | -0.503849804 |
Age and Fare have been standardized. Data preparation is now complete.
Once the data is ready, it's time to build the model. For now, let's build one with scikit-learn. Below is the algorithm-selection flowchart from the scikit-learn site.
Let's select a model by following this flowchart. Predicting a category? Yes. Do we have labeled data? Yes. That takes us to "classification" at the upper left, which corresponds to supervised classification, and the chart arrives at "LinearSVC".
When training, the features (x_train) and the answers (y_train) are passed to the model separately. The image is as follows.
y_train | x_train | ||||
---|---|---|---|---|---|
index | Survived | Pclass | Sex | Age | Fare |
0 | 0 | 3 | 1 | -0.530376641 | -0.518977865 |
1 | 1 | 1 | 0 | 0.571830994 | 0.69189675 |
2 | 1 | 3 | 0 | -0.254824732 | -0.506213563 |
3 | 1 | 1 | 0 | 0.365167062 | 0.348049152 |
4 | 0 | 3 | 1 | 0.365167062 | -0.503849804 |
The code is below.
##############################
# Model building
##############################
from sklearn.svm import LinearSVC
# Prepare the training data
x_train = df_train.loc[:, ['Pclass', 'Sex', 'Age', 'Fare']].values
y_train = df_train.loc[:, ['Survived']].values
# Flatten y_train to a 1-D array
y_train = numpy.reshape(y_train, (-1))
# Create the model
model = LinearSVC(random_state=1)
Training is simply a matter of passing the training data to the model.
##############################
# Training
##############################
model.fit(x_train, y_train)
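Before submitting, the model's accuracy can also be estimated locally with cross-validation. This is a sketch only: random toy data stands in for the real x_train and y_train here, so the score itself is meaningless; with the real data, `scores.mean()` gives a rough preview of the leaderboard score.

```python
import numpy
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy data standing in for the real x_train / y_train
rng = numpy.random.default_rng(1)
x_train = rng.normal(size=(100, 4))        # 4 features, like Pclass/Sex/Age/Fare
y_train = (x_train[:, 1] > 0).astype(int)  # toy labels tied to one feature

# 5-fold cross-validation: train on 4/5 of the data, score on the held-out 1/5
scores = cross_val_score(LinearSVC(random_state=1), x_train, y_train, cv=5)
print(len(scores), 0.0 <= scores.mean() <= 1.0)
```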
Let's check the training result with the test data. test.csv must be transformed the same way as the training data (x_train). Some Age and Fare values are missing, but we still have to predict a result for every row, so for the test data the missing values are converted to 0 instead of being dropped.
##############################
# Convert test.csv
##############################
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
# Extract 'PassengerId' (to combine with the results later)
df_test_index = df_test.loc[:, ['PassengerId']]
# Extract 'Pclass', 'Sex', 'Age', 'Fare'
df_test = df_test.loc[:, ['Pclass', 'Sex', 'Age', 'Fare']]
# Encode Sex numerically with the fitted LabelEncoder
df_test['Sex'] = encoder.transform(df_test['Sex'].values)
# Standardize Age and Fare with the fitted StandardScaler
df_test_std = pandas.DataFrame(standard.transform(df_test.loc[:, ['Age', 'Fare']]), columns=['Age', 'Fare'])
# Standardize Age
df_test['Age'] = df_test_std['Age']
# Standardize Fare
df_test['Fare'] = df_test_std['Fare']
# Convert NaN in Age and Fare to 0
df_test = df_test.fillna({'Age': 0, 'Fare': 0})
df_test.head()
Index | Pclass | Sex | Age | Fare |
---|---|---|---|---|
0 | 3 | 1 | 0.298549339 | -0.497810518 |
1 | 3 | 0 | 1.181327932 | -0.512659955 |
2 | 2 | 1 | 2.240662243 | -0.464531805 |
3 | 3 | 1 | -0.231117817 | -0.482887658 |
4 | 3 | 0 | -0.584229254 | -0.417970618 |
The test data has been converted in the same way. Now predict the results: simply pass the test data to predict().
##############################
# Predict results
##############################
x_test = df_test.values
y_test = model.predict(x_test)
The result is in y_test. Save the result in the same format as gender_submission.csv.
# Combine PassengerId and the predicted results into one DataFrame
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)
# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)
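Before committing, it is worth confirming that the output file matches the gender_submission.csv format: exactly two columns, PassengerId and Survived. Sketched here with a few made-up rows in place of the real predictions.

```python
import pandas

# Toy rows standing in for the real PassengerId / prediction output
df_output = pandas.DataFrame({
    'PassengerId': [892, 893, 894],
    'Survived': [0, 1, 0],
})
df_output.to_csv('result.csv', index=False)

# Read the file back and confirm the submission columns
check = pandas.read_csv('result.csv')
print(list(check.columns))  # ['PassengerId', 'Survived']
```

A submission with extra columns, a missing header, or an index column is rejected by the competition scorer.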
With that, we have our results. Run it with "Commit" as before. When execution finishes, click "Open Version" and you can see that result.csv has been created.
Click "Submit to Competition" to submit. What will happen ...
The result was "0.75119", i.e. 75%. That's worse than the sample submission ^^;
So, how was it? I didn't tune the learning parameters at all, but the overall training workflow should now be clear. Next time I will examine the data and the learning parameters more closely so that the score improves a little.
2019/12/11 First edition released
2019/12/26 Added link to the next article
2020/01/03 Fixed source comments
2020/03/26 Partially revised the source code of "6. Predict the result with test data"