Introduction

[Updated from time to time] This article visualizes basic data from the [Kaggle Titanic data](https://www.kaggle.com/c/titanic/data), mainly using the snippets in "EDA / Feature Engineering Snippets Used in Kaggle Table Data Competition".

Prerequisites

import numpy as np 
import pandas as pd
import pandas_profiling as pdp
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
cmap = plt.get_cmap("tab10")
plt.style.use('fivethirtyeight')
%matplotlib inline

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option("display.max_colwidth", 10000)

target_col = "Survived"
data_dir = "/kaggle/input/titanic/"

Check the folder

!ls -GFlash /kaggle/input/titanic/

total 100K
4.0K drwxr-xr-x 2 nobody 4.0K Jan  7  2020 ./
4.0K drwxr-xr-x 5 root   4.0K Jul 12 00:15 ../
4.0K -rw-r--r-- 1 nobody 3.2K Jan  7  2020 gender_submission.csv
 28K -rw-r--r-- 1 nobody  28K Jan  7  2020 test.csv
 60K -rw-r--r-- 1 nobody  60K Jan  7  2020 train.csv

Read data

train = pd.read_csv(data_dir + "train.csv")
test = pd.read_csv(data_dir + "test.csv")
submit = pd.read_csv(data_dir + "gender_submission.csv")

Check the data

train.head()

Check the number of records and columns

print("{} rows and {} features in train set".format(train.shape[0], train.shape[1]))
print("{} rows and {} features in test set".format(test.shape[0], test.shape[1]))
print("{} rows and {} features in submit set".format(submit.shape[0], submit.shape[1]))

891 rows and 12 features in train set
418 rows and 11 features in test set
418 rows and 2 features in submit set

Check the number of missing values for each column

Check how many values are missing in each column.

train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
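Raw counts are easier to judge as a share of the rows. The following is a minimal sketch of that idea, assuming a small made-up frame named `toy` that stands in for `train` (the values are not from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() yields booleans, so mean() is the fraction of missing values per column.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Applied to `train`, the same expression would rank the columns by how incomplete they are.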

Visualization of missing values

Check whether the missing values follow a pattern.

plt.figure(figsize=(18,9))
sns.heatmap(train.isnull(), cbar=False)
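A pattern seen in the heat map can also be checked numerically. The sketch below compares the share of missing values across groups of another column; the frame `toy` and its values are assumptions made for illustration, not the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Cabin": ["C85", "C123", np.nan, np.nan, "G6"],
})

# Fraction of rows with a missing Cabin inside each Pclass group.
cabin_missing_by_class = toy["Cabin"].isnull().groupby(toy["Pclass"]).mean()
print(cabin_missing_by_class)
```

A large difference between groups suggests the values are not missing at random.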

Check the summary statistics for each column

Check summary statistics such as the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column to get a rough idea of the data.

train.describe()
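By default `describe()` only summarizes numeric columns. For a text column, calling `describe()` on that column returns the count, the number of unique values, the most frequent value (`top`), and its frequency. A small sketch, assuming a made-up frame `toy`:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Numeric summary: count, mean, std, min, quartiles, max (only "Fare" appears).
numeric_summary = toy.describe()

# Summary of a text column: count, unique, top (most frequent value), freq.
sex_summary = toy["Sex"].describe()
print(sex_summary)
```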

Count the data (frequencies)

Check the target distribution

sns.countplot(x=target_col, data=train)

Check the distribution of category values

col = "Pclass"
sns.countplot(x=col, data=train)

Check the distribution of a column for each target value

col = "Pclass"
sns.countplot(x=col, hue=target_col, data=train)

col = "Sex"
sns.countplot(x=col, hue=target_col, data=train)
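`countplot` draws raw counts. When actual proportions are wanted, they can be computed directly in pandas. The sketch below assumes a made-up frame `toy` with the same column names as `train`:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Survived": [0, 1, 0, 1, 1],
})

# Share of each target value across all rows.
target_share = toy["Survived"].value_counts(normalize=True)

# Because the target is 0/1, its mean per group is the survival rate.
survival_rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(target_share)
print(survival_rate_by_sex)
```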

Histogram

A histogram visualizes the distribution of the data: the horizontal axis is divided into bins (value ranges) and the vertical axis shows how many records fall into each bin. Try several bin counts, because different bin sizes bring out different characteristics of the data.

col = "Age"
train[col].plot(kind="hist", bins=10, title='Distribution of {}'.format(col))

col = "Fare"
train[col].plot(kind="hist", bins=50, title='Distribution of {}'.format(col))
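To see why the bin count matters, the counts behind a histogram can be computed with `np.histogram`. This is a small illustration on made-up values, not on the Titanic columns:

```python
import numpy as np

# Made-up values with one isolated point at 9.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 9])

# Two wide bins hide the gap between the main cluster and the isolated point.
counts_coarse, edges_coarse = np.histogram(values, bins=2)

# Four narrower bins expose an empty bin between them.
counts_fine, edges_fine = np.histogram(values, bins=4)

print(counts_coarse, edges_coarse)
print(counts_fine, edges_fine)
```

With four bins an empty bin appears between the cluster and the outlying value, which the two-bin version cannot show.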

Histogram by category

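f, ax = plt.subplots(1, 3, figsize=(15, 4))
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws a comparable density-scaled histogram with a density curve (seaborn >= 0.11).
sns.histplot(train[train['Pclass']==1]["Fare"], kde=True, stat="density", ax=ax[0])
ax[0].set_title('Fares in Pclass 1')
sns.histplot(train[train['Pclass']==2]["Fare"], kde=True, stat="density", ax=ax[1])
ax[1].set_title('Fares in Pclass 2')
sns.histplot(train[train['Pclass']==3]["Fare"], kde=True, stat="density", ax=ax[2])
ax[2].set_title('Fares in Pclass 3')
plt.show()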

Histogram of columns by target category

col = "Age"
fig, ax = plt.subplots(1, 2, figsize=(15, 6))
train[train[target_col]==1][col].plot(kind="hist", bins=50, title='{} - {} 1'.format(col, target_col), color=cmap(0), ax=ax[0])
train[train[target_col]==0][col].plot(kind="hist", bins=50, title='{} - {} 0'.format(col, target_col), color=cmap(1), ax=ax[1])
plt.show()

Histogram of a column for each target value (when overlapping)

col = "Age"
train[train[target_col]==1][col].plot(kind="hist", bins=50, alpha=0.3, color=cmap(0))
train[train[target_col]==0][col].plot(kind="hist", bins=50, alpha=0.3, color=cmap(1))
plt.title("Histogram for {}".format(col))
plt.xlabel(col)
plt.show()

Kernel density estimation

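Roughly speaking, it is a smoothed histogram: a continuous curve that estimates the probability density of the data, so an estimated density can be read off for any value of X.

# `shade` is deprecated in recent seaborn releases; `fill` is the replacement (seaborn >= 0.11).
sns.kdeplot(data=train["Age"], label="Age", fill=True)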
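The idea behind the curve can be sketched by hand: place a Gaussian bump on every sample and average them. The function name `gaussian_kde_1d`, the sample values, and the hand-picked bandwidth are all assumptions for illustration; seaborn chooses the bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0])  # made-up values
grid = np.linspace(-50, 130, 1801)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)  # bandwidth chosen by hand

# A density should integrate to roughly 1 (Riemann sum over the grid).
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve and a larger one gives a smoother curve, which mirrors the effect of the bin size in a histogram.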

Cross tabulation

Count how many records fall into each combination of categories for two or more categorical columns.

pd.crosstab(train["Sex"], train["Pclass"])

pd.crosstab([train["Sex"], train["Survived"]], train["Pclass"])
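`pd.crosstab` can also add totals and convert counts to proportions through its `margins` and `normalize` parameters. A short sketch on a made-up frame `toy`:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Pclass": [3, 1, 3, 3],
})

# margins=True appends row and column totals labelled "All".
counts = pd.crosstab(toy["Sex"], toy["Pclass"], margins=True)

# normalize="index" turns each row into proportions that sum to 1.
row_share = pd.crosstab(toy["Sex"], toy["Pclass"], normalize="index")

print(counts)
print(row_share)
```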

Pivot table

Average of quantitative data by category

pd.pivot_table(index="Pclass", columns="Sex", data=train[["Age", "Fare", "Survived", "Pclass", "Sex"]])

Minimum value of quantitative data for each category

# Passing the aggregation by name avoids the FutureWarning that recent pandas raises for np.min.
pd.pivot_table(index="Pclass", columns="Sex", data=train[["Age", "Fare", "Pclass", "Sex"]], aggfunc="min")
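A pivot table is essentially a group-by followed by a reshape. The sketch below shows the equivalence for the mean of one column, using a made-up frame `toy`:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "male", "female", "male", "male"],
    "Fare": [80.0, 100.0, 50.0, 10.0, 8.0, 6.0],
})

# Mean fare per (Pclass, Sex) cell as a pivot table.
pivot = pd.pivot_table(toy, index="Pclass", columns="Sex", values="Fare", aggfunc="mean")

# The same numbers from groupby + unstack.
grouped = toy.groupby(["Pclass", "Sex"])["Fare"].mean().unstack("Sex")

print(pivot)
```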

Scatter plot

Check the relationship between the two columns.

Scatter plot

sns.scatterplot(x="Age", y="Fare", data=train)

Scatter plot (color-coded by category)

sns.scatterplot(x="Age", y="Fare", hue=target_col, data=train)

Scatterplot matrix

sns.pairplot(data=train[["Fare", "Survived", "Age", "Pclass"]], hue="Survived", dropna=True)

Box plot

Visualize data variability.

Box plot by category

Check the variation of data for each category.

sns.boxplot(x='Pclass', y='Age', data=train)
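The numbers behind a box plot are the quartiles. With the default settings, points more than 1.5 times the interquartile range (IQR) beyond the box are drawn as outliers. A minimal sketch on made-up values:

```python
import numpy as np

ages = np.array([2.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 80.0])  # made-up values

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are plotted individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, outliers)
```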

Strip chart

A strip chart plots each record as a dot. It is used when one of the two variables is categorical.

sns.stripplot(x="Survived", y="Fare", data=train)

sns.stripplot(x='Pclass', y='Age', data=train)

Heat map

Heat map of correlation coefficient for each column

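# Text columns such as Name or Sex make corr() raise an error in pandas 2.x, so
# restrict the calculation to numeric columns (numeric_only requires pandas >= 1.5).
sns.heatmap(train.corr(numeric_only=True), annot=True)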
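The matrix fed to the heat map can be inspected on its own. The sketch below selects the numeric columns explicitly before computing Pearson correlations; the frame `toy` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Fare": [80.0, 30.0, 8.0, 7.0],
    "Sex": ["female", "male", "male", "female"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

In this toy frame the fare falls as the class number rises, so the Pclass/Fare entry is negative.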

References

- EDA To Prediction(DieTanic)
- A Simple Tutorial To Data Visualization
- [Python: Try visualization using seaborn](https://blog.amedama.jp/entry/seaborn-plot#scatter-plot-%E6%95%A3%E5%B8%83%E5%9B%B3)
- Calculate statistics for each category with pandas pivot table

Basic visualization techniques learned from Kaggle Titanic data
