[Kaggle] From data reading to preprocessing and encoding

0. Introduction

When you set out to analyze data with machine learning, you rarely get to take the CSV data you obtained and jump straight in with
"OK! Let's run a regression!" "OK! Let's classify!"
and start building a model. If anything, the hurdles before that point are quite high for beginners. So this time I have summarized the data preprocessing steps.

1. Read data

This time I borrowed the data from Kaggle's Titanic competition. By the way, I use Jupyter. (If you know how to post Jupyter input and output together on Qiita, please let me know ...)

In[1]


%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv('./train.csv')
df = df.set_index('PassengerId') #Use the unique PassengerId column as the index
print(df.shape)
df.head()

[Image: output of df.head()]

2. Delete columns that are not needed for analysis

In[2]


df = df.drop(['Name', 'Ticket'], axis=1) #Drop columns that are not needed for analysis
df.head()

[Image: output of df.head() after dropping Name and Ticket]

3. Check the data types and missing values

In[3]


print(df.info())
#print(df.dtypes) #Use this if you only want to check the data types
df.isnull().sum(axis=0)
#df.isnull().any(axis=0) #Use this if you only want to check whether any nulls exist

out[3]


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.6+ KB
None
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
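Raw counts are easier to judge as a share of the rows. The following is a minimal sketch on a small made-up frame (the `sample` data is hypothetical, not the Titanic file) showing how the fraction of missing values per column could be computed:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the Titanic data
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False per cell; the column mean of that is the missing ratio
missing_ratio = sample.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

In the output above, Cabin is missing in 687 of 891 rows, which is why it is a candidate for being dropped later.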

4. Count the categories of the nominal-scale columns

In[4]


#Count the categories of the nominal-scale columns
import collections
c = collections.Counter(df['Sex'])
print('Sex:',c)
c = collections.Counter(df['Cabin'])
print('Cabin:',len(c))
c = collections.Counter(df['Embarked'])
print('Embarked:',c)

out[4]


Sex: Counter({'male': 577, 'female': 314})
Cabin: 148
Embarked: Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
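The same counts can probably be obtained with pandas alone. This is a sketch on a hypothetical `sample` frame; note that `Counter` treats NaN as a key of its own (as in the Embarked output above), while `nunique()` excludes NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Embarked column
sample = pd.DataFrame({'Embarked': ['S', 'C', 'S', np.nan, 'Q', 'S']})

# Frequency of each category; dropna=False keeps NaN as its own row
counts = sample['Embarked'].value_counts(dropna=False)
print(counts)

# Number of distinct categories, NaN not counted
n_categories = sample['Embarked'].nunique()
print('Embarked categories:', n_categories)
```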

5. Drop / fill in missing values

In[5]


df = df.drop(['Cabin'], axis=1) #Dropped because it is mostly missing and seems hard to use for analysis
df = df.dropna(subset = ['Embarked']) #Embarked has only a few missing values, so drop those rows with dropna
df = df.ffill() #Fill the remaining missing values from the previous row (same as fillna(method='ffill'), which is deprecated in recent pandas)
print(df.isnull().any(axis=0))
df.shape

out[5]


Survived    False
Pclass      False
Sex         False
Age         False
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool
(889, 8)
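Forward fill depends on the order of the rows, so the filled Age values are somewhat arbitrary. As one possible alternative (an assumption on my part, not what this article uses), a missing numeric column could be filled with its median. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Age column
sample = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

# median() ignores NaN, so this is the median of 22, 26 and 35
age_median = sample['Age'].median()

# Replace every missing Age with that single value
sample['Age'] = sample['Age'].fillna(age_median)
print(sample)
```

Which strategy is appropriate depends on the data; the median keeps the result independent of row order but shrinks the spread of the column.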

6. Label encoding

Label encoding converts object (string) type data into numeric data.

In[6]


from sklearn.preprocessing import LabelEncoder
for column in ['Sex','Embarked']:
    le = LabelEncoder()
    le.fit(df[column])
    df[column] = le.transform(df[column])
df.head()

[Image: output of df.head() after label encoding]

You can see that Sex and Embarked have been label encoded.
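The loop above reuses the variable `le`, so the fitted encoders are not kept. If you want to know which number corresponds to which label, something like the following sketch may help (the `sample` frame and the `encoders` dict are hypothetical names for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the Embarked column
sample = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

encoders = {}  # keep one fitted encoder per column
le = LabelEncoder()
codes = le.fit_transform(sample['Embarked'])
encoders['Embarked'] = le

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'C': 0, 'Q': 1, 'S': 2}

# inverse_transform turns the codes back into the original labels
restored = encoders['Embarked'].inverse_transform(codes)
print(list(restored))  # ['S', 'C', 'Q', 'S']
```

The codes follow the sorted order of the labels, so they do not express any meaningful ranking.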

Once the data is label encoded, you can also get an overview of it with seaborn's pairplot and similar tools.

sns.pairplot(df);

I think it is better to select only the continuous variables and look at those. (Although the selection below includes some that are not strictly continuous ...)

df_continuous = df[['Age','SibSp','Parch','Fare']]
sns.pairplot(df_continuous);

[Image: pair plot of Age, SibSp, Parch, and Fare]

7. One-hot encoding

Numeric columns that are really categories (here Pclass) and other nominal-scale data are converted with one-hot encoding. I didn't know how to use scikit-learn's OneHotEncoder well, so I used get_dummies from pandas.

In[7]


df = pd.get_dummies(df, columns = ['Pclass','Embarked'])
df.head()

[Image: output of df.head() after one-hot encoding]

You can see that Pclass and Embarked have been one-hot encoded.

8. Summary

I think this workflow is common to many data analyses. I hope it serves as a reference for the overall flow of preprocessing.

Comments and suggestions for article topics are welcome.
