[Kaggle] From data reading to preprocessing and encoding

0. Introduction

When you set out to analyze data with machine learning, you rarely get to take a freshly obtained CSV file and say
"OK! Let's do regression!" "OK! Let's do classification!"
and feed it straight into model building. If anything, the steps before that point are a fairly high barrier for beginners. So this time I have summarized the data preprocessing steps.

1. Read data

This time I borrowed the data from Kaggle's Titanic competition. By the way, I use Jupyter. (If you know a good way to paste Jupyter input and output together into Qiita, please let me know ...)


%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv('./train.csv')
df = df.set_index('PassengerId') #Set a unique column to index


2. Delete columns that are not needed for analysis


df = df.drop(['Name', 'Ticket'], axis=1) #Drop columns that are not needed for analysis


3. Check the data types and missing values


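df.info() #Show the data types and non-null counts together
print(df.isnull().sum()) #Show the number of missing values per column
#print(df.dtypes) #Use this if you only want to check the data types
#df.isnull().any(axis=0) #Use this if you only want to check whether any nulls exist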


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.6+ KB
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
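
To compare columns at a glance, the missing counts can be placed next to the share of rows they represent. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical, not the Titanic data); the same two expressions should work on `df` as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values and share of missing values per column
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'ratio': toy.isnull().mean(),
})
print(missing.sort_values('ratio', ascending=False))
```

Sorting by the ratio puts columns like Cabin, where most rows are missing, at the top.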

4. Count the categories of the nominal variables


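#Count the categories of the nominal variables
import collections
c = collections.Counter(df['Sex'])
print('Sex:', c)
c = collections.Counter(df['Cabin'])
print('Cabin:', len(c)) #Too many distinct values to list, so print only how many there are
c = collections.Counter(df['Embarked'])
print('Embarked:', c)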


Sex: Counter({'male': 577, 'female': 314})
Cabin: 148
Embarked: Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
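
The same counts can probably be obtained with pandas alone. Below is a minimal sketch using `value_counts` and `nunique` on a made-up column (`embarked` and its values are hypothetical stand-ins for `df['Embarked']`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['Embarked']
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'], name='Embarked')

# dropna=False keeps NaN as its own entry, similar to what Counter shows
counts = embarked.value_counts(dropna=False)
print(counts)

# nunique() ignores NaN by default, so it counts only real categories
n_categories = embarked.nunique()
print(n_categories)
```

Note that `nunique()` leaves NaN out, while `len(Counter(...))` includes it as one key, so the two numbers can differ by one.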

5. Drop / impute missing values


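df = df.drop(['Cabin'], axis=1) #Dropped because it seems difficult to use for analysis
df = df.dropna(subset = ['Embarked']) #Embarked has only a few missing values, so drop those rows with dropna
df = df.ffill() #Fill the other columns from the previous row (fillna(method='ffill') is deprecated in recent pandas)
print(df.isnull().any(axis=0)) #Confirm that no missing values remain
print(df.shape)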


Survived    False
Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool
(889, 8)
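
Forward fill copies the value from the row above, so the result depends on the order of the rows. As one possible alternative (not what this article uses), a numeric column can be filled with its median instead. A minimal sketch comparing the two on made-up values (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['Age']
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 40.0, np.nan]})

# Forward fill: each NaN takes the value of the row above it
ffilled = toy['Age'].ffill()

# Median fill: every NaN takes the same value, regardless of row order
median_filled = toy['Age'].fillna(toy['Age'].median())

print(ffilled.tolist())
print(median_filled.tolist())
```

Which one is more appropriate depends on the data; for rows with no meaningful order, such as passengers, the median version may be easier to justify.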

6. Label encoding

Convert object (string) type data to numeric data using label encoding.


from sklearn.preprocessing import LabelEncoder
for column in ['Sex','Embarked']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column]) #fit must run before transform, so use fit_transform


Sex and Embarked are now label encoded.
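
To check which integer was assigned to which category, the fitted encoder can be inspected. This is a minimal sketch on a made-up column (`embarked` is a hypothetical stand-in for `df['Embarked']`), using `LabelEncoder`'s `classes_` attribute and `inverse_transform` method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for df['Embarked']
embarked = pd.Series(['S', 'C', 'Q', 'S'])

le = LabelEncoder()
codes = le.fit_transform(embarked)

# classes_ is sorted; the position of each class is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(codes)
print(list(restored))
```

Because the loop in this article creates a new `LabelEncoder` for each column, keep a reference to each encoder if you need to decode the values later.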

Once the data is label encoded, you can also get an overview of it with seaborn's pairplot and similar tools.


I think it is a good idea to select only the continuous variables and look at those. (Although the selection below includes some that are not strictly continuous ...)

df_continuous = df[['Age','SibSp','Parch','Fare']]


7. One hot encoding

Use one-hot encoding for nominal data, including nominal data that is stored as numbers. I couldn't figure out how to use scikit-learn's OneHotEncoder well, so I used pandas' get_dummies.


df = pd.get_dummies(df, columns = ['Pclass','Embarked'])


Pclass and Embarked are now one-hot encoded.

8. Summary

I think this flow is common to many data analyses. Please use it as a reference for your own preprocessing.

