Basic visualization techniques learned from Kaggle Titanic data

Introduction

[Updated from time to time] This post visualizes the basics of the Kaggle Titanic data (https://www.kaggle.com/c/titanic/data), mainly using the snippets from "EDA / Feature Engineering Snippets Used in Kaggle Table Data Competition".

Prerequisites

import numpy as np 
import pandas as pd
import pandas_profiling as pdp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
cmap = plt.get_cmap("tab10")
plt.style.use('fivethirtyeight')
%matplotlib inline

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option("display.max_colwidth", 10000)
target_col = "Survived"
data_dir = "/kaggle/input/titanic/"

Check the folder

!ls -GFlash /kaggle/input/titanic/
total 100K
4.0K drwxr-xr-x 2 nobody 4.0K Jan  7  2020 ./
4.0K drwxr-xr-x 5 root   4.0K Jul 12 00:15 ../
4.0K -rw-r--r-- 1 nobody 3.2K Jan  7  2020 gender_submission.csv
 28K -rw-r--r-- 1 nobody  28K Jan  7  2020 test.csv
 60K -rw-r--r-- 1 nobody  60K Jan  7  2020 train.csv

Read data

train = pd.read_csv(data_dir + "train.csv")
test = pd.read_csv(data_dir + "test.csv")
submit = pd.read_csv(data_dir + "gender_submission.csv")

Check the data

train.head()

Check the number of records and columns

print("{} rows and {} features in train set".format(train.shape[0], train.shape[1]))
print("{} rows and {} features in test set".format(test.shape[0], test.shape[1]))
print("{} rows and {} features in submit set".format(submit.shape[0], submit.shape[1]))
891 rows and 12 features in train set
418 rows and 11 features in test set
418 rows and 2 features in submit set

Check the number of missing values for each column

Count how many values are missing in each column.

train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
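Raw counts are easier to compare when shown next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the values are made up and only stand in for `train`); the name `missing` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and share of missing values per column, largest share first.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "ratio": toy.isnull().mean(),
}).sort_values("ratio", ascending=False)
print(missing)
```

Replacing `toy` with `train` should give the same kind of table for the real data.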

Visualization of missing values

Check whether the missing values follow a pattern.

plt.figure(figsize=(18,9))
sns.heatmap(train.isnull(), cbar=False)
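A pattern in the missing values can also be checked numerically, for example by comparing the share of missing values across groups. This is only a sketch on hypothetical toy data; grouping by `Pclass` is an assumption chosen for illustration, not a finding about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 3],
    "Age": [38.0, 35.0, np.nan, 22.0, np.nan, np.nan],
})

# Share of rows with a missing Age within each passenger class.
age_missing_by_class = toy["Age"].isnull().groupby(toy["Pclass"]).mean()
print(age_missing_by_class)
```

If the shares differ a lot between groups, the values are probably not missing at random.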

Check the summary statistics for each column

Check summary statistics such as the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column to get a rough idea of the data.

train.describe()
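By default `describe()` summarizes only the numeric columns. For categorical columns, calling it on those columns reports the number of distinct values and the most frequent value (the mode) instead. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# For non-numeric columns describe() returns count, unique, top (mode) and freq.
cat_summary = toy[["Sex", "Embarked"]].describe()
print(cat_summary)

# The mode of a single column is also available directly.
print(toy["Sex"].mode()[0])
```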

Aggregate the number (frequency) of data

Check the distribution of the target

sns.countplot(x=target_col, data=train)
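`countplot` draws raw counts. If the actual proportions are needed, `value_counts(normalize=True)` returns them as numbers. A minimal sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy target column; the values are made up.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 1]})

# Share of each target value instead of its raw count.
ratio = toy["Survived"].value_counts(normalize=True)
print(ratio)
```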

Check the distribution of category values

col = "Pclass"
sns.countplot(x=col, data=train)

Check the distribution of a column for each target value

col = "Pclass"
sns.countplot(x=col, hue=target_col, data=train)
col = "Sex"
sns.countplot(x=col, hue=target_col, data=train)
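The grouped bar charts can be backed up with numbers. Because the target is 0/1, its mean within each category is the survival rate, and `pd.crosstab` with `normalize="index"` gives the full row-wise breakdown. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 1, 1],
})

# The mean of a 0/1 target per category is the rate of 1s in that category.
rate = toy.groupby("Sex")["Survived"].mean()
print(rate)

# Row-wise shares of each target value per category.
shares = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(shares)
```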

Histogram

A histogram shows the distribution of the data, with frequency on the vertical axis and bins (class intervals) on the horizontal axis. Try several bin counts, because different bin sizes bring out different characteristics of the data.

col = "Age"
train[col].plot(kind="hist", bins=10, title='Distribution of {}'.format(col))
col = "Fare"
train[col].plot(kind="hist", bins=50, title='Distribution of {}'.format(col))
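To see how the bin count changes the picture, the counts behind a histogram can be computed directly with `np.histogram`. The ages below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical ages; the values are made up.
ages = np.array([2, 4, 21, 22, 24, 29, 35, 38, 54, 58])

# Same data, two bin counts: the coarse version hides the gap between children and adults.
counts_2, edges_2 = np.histogram(ages, bins=2)
counts_4, edges_4 = np.histogram(ages, bins=4)
print(edges_2, counts_2)
print(edges_4, counts_4)
```

With two bins the data looks like one broad group, while four bins separate the two youngest values from the rest.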

Histogram by category

f, ax = plt.subplots(1, 3, figsize=(15, 4))
# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
for i, pclass in enumerate([1, 2, 3]):
    sns.histplot(train.loc[train['Pclass'] == pclass, "Fare"], kde=True, stat="density", ax=ax[i])
    ax[i].set_title('Fares in Pclass {}'.format(pclass))
plt.show()

Histogram of columns by target category

col = "Age"
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
train[train[target_col]==1][col].plot(kind="hist", bins=50, title='{} - {} 1'.format(col, target_col), color=cmap(0), ax=ax[0])
train[train[target_col]==0][col].plot(kind="hist", bins=50, title='{} - {} 0'.format(col, target_col), color=cmap(1), ax=ax[1])
plt.show()

Histogram of a column for each target value (overlaid)

col = "Age"
train[train[target_col]==1][col].plot(kind="hist", bins=50, alpha=0.3, color=cmap(0))
train[train[target_col]==0][col].plot(kind="hist", bins=50, alpha=0.3, color=cmap(1))
plt.title("Histogram of {}".format(col))
plt.xlabel(col)
plt.show()

Kernel density estimation

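Roughly speaking, a kernel density estimate is a smoothed histogram. Because it is a continuous curve, you can read off an estimated density (Y) for any value of X.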

# fill=True replaces the deprecated shade=True (seaborn 0.11 and later)
sns.kdeplot(x=train["Age"], label="Age", fill=True)
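To make "Y for X" concrete, here is a minimal hand-written Gaussian kernel density estimate using only NumPy. It is a sketch of the idea, not seaborn's implementation: the helper name `gaussian_kde_at`, the toy ages, and the fixed bandwidth of 5 are all assumptions for illustration, whereas seaborn picks a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_at(x, samples, bandwidth):
    """Estimated density at x: the average of Gaussian bumps centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    u = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum() / (len(samples) * bandwidth)

# Hypothetical ages; the bandwidth of 5 years is hand-picked for illustration.
ages = [22, 26, 35, 38, 54]
for x in (30, 100):
    print(x, gaussian_kde_at(x, ages, bandwidth=5.0))

# The curve integrates to roughly 1, as a density should (simple Riemann sum).
grid = np.linspace(-50.0, 150.0, 2001)
area = sum(gaussian_kde_at(x, ages, 5.0) for x in grid) * (grid[1] - grid[0])
print(area)
```

The density is much higher near the samples (X = 30) than far from them (X = 100). Missing values would have to be dropped before calling the helper.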

Cross tabulation

Count the number of occurrences of each combination of categories in categorical data.

pd.crosstab(train["Sex"], train["Pclass"])
pd.crosstab([train["Sex"], train["Survived"]], train["Pclass"])
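`pd.crosstab` can also add row and column totals (`margins=True`) or convert counts to shares (`normalize`). A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "male"],
    "Pclass": [3, 3, 1, 3, 1],
})

# Counts with an extra "All" row and column holding the totals.
counts = pd.crosstab(toy["Sex"], toy["Pclass"], margins=True)
print(counts)

# Shares within each row: how each sex is spread over the classes.
shares = pd.crosstab(toy["Sex"], toy["Pclass"], normalize="index")
print(shares)
```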

Pivot table

Average of quantitative data by category

pd.pivot_table(index="Pclass", columns="Sex", data=train[["Age", "Fare", "Survived", "Pclass", "Sex"]])

Minimum value of quantitative data for each category

pd.pivot_table(index="Pclass", columns="Sex", data=train[["Age", "Fare", "Pclass", "Sex"]], aggfunc="min")
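Several aggregations can be requested in one call by passing a list to `aggfunc`, and the same table can be built with `groupby` followed by `unstack`. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male", "male"],
    "Fare": [80.0, 100.0, 50.0, 8.0, 7.0, 9.0],
})

# Mean and minimum fare per (Pclass, Sex) in a single pivot table.
pivot = pd.pivot_table(toy, index="Pclass", columns="Sex", values="Fare",
                       aggfunc=["mean", "min"])
print(pivot)

# Equivalent mean table built with groupby + unstack.
by_group = toy.groupby(["Pclass", "Sex"])["Fare"].mean().unstack()
print(by_group)
```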

Scatter plot

Check the relationship between the two columns.

Scatter plot

sns.scatterplot(x="Age", y="Fare", data=train)

Scatter plot (color-coded by category)

sns.scatterplot(x="Age", y="Fare", hue=target_col, data=train)

Scatterplot matrix

sns.pairplot(data=train[["Fare", "Survived", "Age", "Pclass"]], hue="Survived", dropna=True)

Box plot

Visualize data variability.

Box plot by category

Check the variation of data for each category.

sns.boxplot(x='Pclass', y='Age', data=train)
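The elements of a box plot can be computed by hand: the box spans the first to third quartile, the line inside is the median, and with the default setting points beyond 1.5 times the interquartile range from the box are drawn as outliers. A sketch on hypothetical ages:

```python
import pandas as pd

# Hypothetical ages; the values are made up.
ages = pd.Series([2, 20, 22, 24, 26, 28, 30, 80])

# Box edges and median.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box; points outside are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(q1, median, q3, iqr)
print(outliers.tolist())
```

To get the same numbers per category, the quantiles could be computed after a `groupby`, for example on `Pclass`.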

Strip chart

A strip plot represents each data point as a dot. It is used when one of the two variables is categorical.

sns.stripplot(x="Survived", y="Fare", data=train)
sns.stripplot(x='Pclass', y='Age', data=train)

Heat map

Heat map of correlation coefficient for each column

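# Correlate numeric columns only; pandas 2.0 and later raises an error on string columns
sns.heatmap(train.select_dtypes(include="number").corr(), annot=True)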
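When the matrix gets large, it can help to list only the correlations with the target, ordered by strength. The sketch below uses hypothetical toy data; the name `target_corr` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1],
    "Fare": [7.0, 8.0, 50.0, 80.0],
    "Pclass": [3, 3, 1, 1],
    "Sex": ["male", "male", "female", "female"],
})

# Correlation matrix over numeric columns only (Sex is a string column and is skipped).
corr = toy.select_dtypes(include="number").corr()

# Correlation of each feature with the target, strongest (by absolute value) first.
target_corr = corr["Survived"].drop("Survived")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)
print(target_corr)
```

Note that this is the Pearson coefficient, so it only captures linear relationships.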
