1. Purpose

I started studying programming (Python) around December 2018 and have been working on Kaggle for the last few months.

Along the way there were many things that made me wonder "how do I do this?", and I looked each of them up as I went. In this article I focus on the preprocessing part and summarize what I found.

2. Data used and required imports

We use the familiar Titanic dataset from Kaggle. https://www.kaggle.com/c/titanic

import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns

import numpy as np
import pandas as pd

Next, read in the data.

df_train =  pd.read_csv(r"C:///train.csv")
df_test = pd.read_csv(r"C:///test.csv")

3. Data appearance (statistics)

(1) Check the number of rows and columns [.shape]

df_train.shape
df_test.shape

This shows that the training data has shape (891, 12) and the test data has shape (418, 11).

(2) Take a look at the data [.head()]

Let's display the first 5 rows to see what kind of data they contain.

df_train.head()
df_test.head()

The data looks like this. [Image: キャプチャ1.PNG]

As df_test.head() shows, the test data has the same columns as the training data except that the objective variable "Survived" is missing.

(3) Look at the data types [.info()]

Let's print the information for the training data and the test data together.

df_train.info()
print("-"*48)
df_test.info()

This gives a rough view of the number of non-null values and the data type of each column.

(4) Look at summaries of numerical and categorical data [.describe()]
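If the two `.info()` printouts are hard to compare, the same information can be gathered into a small table. The following is only a sketch on a made-up toy frame; the `toy` data and the `summary` name are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a few Titanic columns (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", np.nan],
})

# One row per column: its dtype and how many values are not missing
summary = pd.DataFrame({"dtype": toy.dtypes, "non_null": toy.count()})
print(summary)
```

The same two expressions should work on df_train or df_test as they are.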

◆ Numerical data

df_train.describe()

Summary statistics for the numerical columns are displayed.

◆ Categorical data

df_train.describe(include=['O'])

For the categorical variables Name, Sex, Ticket, Cabin, and Embarked, this displays the count, the number of unique values, the most frequent category (top), and how many times it occurs (freq).
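The "top" and "freq" rows can be cross-checked for a single column with `value_counts()`. This is a small sketch on made-up values; the `toy` frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up embarkation ports for five passengers
toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S"]})

counts = toy["Embarked"].value_counts()  # occurrences of each category
n_unique = toy["Embarked"].nunique()     # corresponds to "unique"
top = counts.idxmax()                    # corresponds to "top"
freq = counts.max()                      # corresponds to "freq"
print(n_unique, top, freq)
```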

4. Integrate training data and test data

I stumbled over this step a lot in the early days, so it is an important one for me.

In the end the model is trained on the training data and applied to the test data separately. However, performing preprocessing such as missing value handling and categorical variable conversion once for the training data and once for the test data is cumbersome, so we first combine the two and split them apart again later.

#Create a new column called TrainFlag and set it to True for training data and False for test data.
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False

#Combine training and test data
#(DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
df_all = pd.concat([df_train, df_test])

#PassengerId is probably not used for features, so I want to delete it.
#However, since it is necessary when submitting test data later, it will not be completely deleted.
#Keep as an index
df_all.index = df_all["PassengerId"]
df_all.drop("PassengerId", axis = 1, inplace = True)

If you look at df_all now, the index is PassengerId and the rightmost column is the TrainFlag column we just added. True marks the training data and False (not visible in the first rows) marks the test data.
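Splitting the combined data apart again can be done with the same TrainFlag column. The following is a minimal sketch on made-up toy frames; `toy_train`, `toy_test`, `train_part`, and `test_part` are hypothetical names, and the frames are combined with `pd.concat`.

```python
import pandas as pd

# Made-up stand-ins for the training and test data
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Fare": [7.25, 71.28]})
toy_test = pd.DataFrame({"PassengerId": [3], "Fare": [8.05]})
toy_train["TrainFlag"] = True
toy_test["TrainFlag"] = False

# Combine, keeping PassengerId as the index
toy_all = pd.concat([toy_train, toy_test]).set_index("PassengerId")

# ... shared preprocessing on toy_all would go here ...

# Split back using the flag; the test part has no real Survived values
train_part = toy_all[toy_all["TrainFlag"]].drop(columns="TrainFlag")
test_part = toy_all[~toy_all["TrainFlag"]].drop(columns=["TrainFlag", "Survived"])
```

Because PassengerId was kept as the index, `test_part.index` can be used later when building the submission file.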

5. Missing value processing

(1) Check the number of missing values

The following counts the missing values in each column and sorts the counts in descending order.

df_all.isnull().sum().sort_values(ascending=False)

With this few variables that is fine, but as the number of variables grows, a list that shows the missing value count of every explanatory variable becomes very hard to read.

So let's narrow the list down to the variables that actually have missing values and sort those in descending order.

#Build the mask from df_all itself so that columns missing only in the test data are not skipped
df_all.isnull().sum()[df_all.isnull().sum() > 0].sort_values(ascending = False)

Now only the variables with missing values are listed, sorted in descending order.
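The filtering idea can be tried on a small frame first. This is a sketch with made-up values; `toy`, `missing`, and `missing_only` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Made-up data: Age has two gaps, Fare has one, Sex has none
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Fare": [7.25, np.nan, 8.05],
    "Sex": ["male", "female", "female"],
})

missing = toy.isnull().sum()  # missing count per column
# Keep only columns with at least one missing value, largest first
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```

Storing the counts in a variable first also avoids computing `isnull().sum()` twice.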

(2) Perform missing value processing

◆ Cabin: df_all.shape shows that the combined training and test data has 1,309 rows. Cabin is missing in 1,014 of them, so this time I exclude the whole column from the analysis and do not perform missing value processing on it.

◆ Age: Age also has missing values, but far fewer. I will not go into the details here, but age seems likely to affect the model, so we fill in its missing values.

There are several ways to do this; this time I take the orthodox approach and fill with the mean.

df_all["Age"] = df_all["Age"].fillna(df_all["Age"].mean())
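As one example of the "several ways", the median can be used in place of the mean, which is less affected by a few very large values. This is only an alternative sketch on made-up ages, not the method used in this article; `ages` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value and one large value
ages = pd.Series([20.0, 30.0, np.nan, 70.0], name="Age")

mean_age = ages.mean()      # NaN is skipped: (20 + 30 + 70) / 3
median_age = ages.median()  # NaN is skipped: middle of 20, 30, 70

# Fill the gap with the median instead of the mean
filled = ages.fillna(median_age)
print(mean_age, median_age, filled.tolist())
```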

◆ Embarked: df_all.describe(include=['O']) shows that Embarked has only 3 unique values and that most rows are "S", so this time we fill the missing values with "S".

Ideally, the data should be visualized and analyzed more thoroughly before making this kind of decision.

df_all["Embarked"] = df_all["Embarked"].fillna("S")
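Instead of hard-coding "S", the most frequent value can be looked up with `mode()`. This is a sketch on made-up values; `embarked`, `most_common`, and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up embarkation ports with one missing value
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

# mode() returns the most frequent value(s); take the first one
most_common = embarked.mode()[0]

filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```

This gives the same result as filling with "S" whenever "S" is the most frequent category, and it keeps working if the data changes.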

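(3) Finally, check which missing values remain

df_all.isnull().sum()[df_all.isnull().sum() > 0].sort_values(ascending = False)

Age and Embarked no longer appear. Cabin still has missing values, but that column is dropped in the next section. Because the mask is taken from df_all, Survived (blank for every test row, as expected) and Fare (one missing value in the test data) are listed as well; Fare can be filled in the same way as Age if it is used as a feature. With that, the missing value processing is complete.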

6. Delete unnecessary columns

I omit the detailed examination this time, but let's assume the data analysis showed that Cabin, Name, PassengerId, and Ticket are not needed for building this model.

Let's drop these columns.

#PassengerId is already the index (section 4), so only the remaining columns are dropped from df_all
df_all = df_all.drop(["Cabin", "Name", "Ticket"], axis = 1)

7. Conversion of categorical variables

df_all = pd.get_dummies(df_all, drop_first=True)

If you check with df_all.head(), you can see that the categorical variables have been converted like this. [Image: キャプチャ9.PNG]

That completes this very orthodox preprocessing; after this we move on to building the model in earnest.
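What `drop_first=True` does is easier to see on a tiny frame. This is a sketch with made-up values; `toy` and `encoded` are hypothetical names.

```python
import pandas as pd

# Made-up data with two categorical columns and one numerical column
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# Each categorical column becomes indicator columns; drop_first=True
# removes the first category of each (female, C) to avoid redundancy
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A row where every Embarked indicator is 0 therefore corresponds to the dropped category "C".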

8. Conclusion

For intermediate and advanced readers this is very basic material, but at first it was hard for me to make progress while looking up each of these steps, and it was stressful every time.

I hope this helps people in the same situation deepen their understanding.

[Kaggle] Summary of pre-processing (statistics, missing value processing, etc.)