In this post I would like to organize the basic visualization methods using the Auto MPG dataset. The dataset records the fuel economy of automobiles from model years 1970 to 1982.
# Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
file_name = os.path.splitext(os.path.basename(file_path))[0]
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(
    file_path,                # File path (URL)
    names=column_names,       # Specify the column names
    na_values='?',            # Read '?' as a missing value
    comment='\t',             # Ignore everything after a TAB (the car name)
    sep=' ',                  # Use a space as the delimiter
    skipinitialspace=True,    # Skip extra spaces after the delimiter
    encoding='utf-8'
)
df.head()
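To see what these read_csv options do without downloading anything, here is a minimal sketch that parses two lines written in the same layout as the file (space-padded columns, then a TAB and the quoted car name). The sample text and the names `raw` and `sample` are assumptions for illustration only.

```python
import io
import pandas as pd

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

# Two illustrative lines in the file's layout; the second has '?' for Horsepower
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

sample = pd.read_csv(
    io.StringIO(raw),
    names=column_names,
    na_values='?',          # '?' becomes NaN
    comment='\t',           # drop the TAB and the car name after it
    sep=' ',
    skipinitialspace=True,  # collapse the runs of padding spaces
)
print(sample)
```

The car name is discarded here because everything from the TAB onward is treated as a comment, which is why only eight column names are needed.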
#Check the number of records and columns
df.shape
#Check the number of missing values
df.isnull().sum()
#Check the attributes of each column of DataFrame
df.dtypes
A heatmap of missing values is useful when you want to check whether the missing values follow a pattern. It is also handy because it is easy to understand when explaining the data to people in the field.
plt.figure(figsize=(14,7))
sns.heatmap(df.isnull())
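The heatmap shows where values are missing; the same information can also be summarized as numbers. The sketch below uses a small hypothetical frame (`toy`, not the Auto MPG data) to compute the share of missing values per column and to pull out the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing Horsepower values
toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 21.0, 40.9],
    'Horsepower': [130.0, np.nan, 100.0, np.nan],
})

missing_rate = toy.isnull().mean()               # fraction of missing values per column
rows_with_gaps = toy[toy.isnull().any(axis=1)]   # rows containing at least one NaN

print(missing_rate)
print(rows_with_gaps)
```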
#Summary statistics
df.describe()
#histogram
df['MPG'].plot(kind='hist', bins=12)
A histogram looks different when you change the bin width, so a plot based on kernel density estimation (KDE) is often used instead.
#Kernel density estimation
sns.kdeplot(data=df['MPG'], fill=True)  # fill replaces the deprecated shade argument
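To make the difference concrete, here is a rough NumPy-only sketch on synthetic data (not the real MPG column). It counts the same values with two bin settings, then builds a Gaussian kernel density estimate by hand. The helper `gaussian_kde_1d` and the rule-of-thumb bandwidth are assumptions for illustration and are not necessarily what seaborn uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=23.0, scale=7.0, size=400)  # synthetic stand-in for an MPG-like column

# Same data, different bin counts: the bar heights (and the apparent shape) change
counts_6, _ = np.histogram(values, bins=6)
counts_24, _ = np.histogram(values, bins=24)

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample (hypothetical helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(values.min() - 10, values.max() + 10, 500)
# Normal-reference rule of thumb for the bandwidth (an assumption for this sketch)
bandwidth = 1.06 * values.std(ddof=1) * len(values) ** (-1 / 5)
density = gaussian_kde_1d(values, grid, bandwidth)

# A density curve should integrate to roughly 1
area = (density * (grid[1] - grid[0])).sum()
print(counts_6, counts_24, round(area, 3))
```

The KDE still depends on a smoothing parameter (the bandwidth), so it trades the bin-width choice for a bandwidth choice rather than removing the choice altogether.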
#Scatter plot+histogram
sns.jointplot(x='Model Year', y='MPG', data=df, alpha=0.3)
Hexagonal bins: a slightly more modern, stylish alternative to the scatter plot.
# hexagonal bins
sns.jointplot(x='Model Year', y='MPG', data=df, kind='hex')
Density estimates: this generates a contour-like graph.
# density estimates
sns.jointplot(x='Model Year', y='MPG', data=df, kind='kde', fill=True)
#Scatterplot matrix
sns.pairplot(df[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
Visualize data variability.
countplot
# Count plot by model year
ax = sns.countplot(x='Model Year', data=df, color='cornflowerblue')
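The bar heights in a count plot are simply the number of rows per category, so the same numbers can be checked with pandas. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical model-year column
toy = pd.DataFrame({'Model Year': [70, 70, 71, 72, 72, 72]})

# Rows per model year, ordered by year like the x-axis of the count plot
counts = toy['Model Year'].value_counts().sort_index()
print(counts)
```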
#Box plot(boxplot)
sns.boxplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), color='cornflowerblue')
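For reference, the numbers behind each box can be computed directly. The sketch below uses made-up values in a hypothetical `toy` frame and assumes the conventional 1.5 × IQR rule for the whisker limits (seaborn's default `whis=1.5`); points beyond those limits are drawn as outliers.

```python
import pandas as pd

# Hypothetical MPG values for two model years
toy = pd.DataFrame({
    'Model Year': [70] * 5 + [82] * 5,
    'MPG': [14.0, 15.0, 18.0, 21.0, 26.0, 28.0, 31.0, 32.0, 36.0, 44.0],
})

# Quartiles per model year: the box spans Q1 to Q3, the line inside is the median
q = toy.groupby('Model Year')['MPG'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['Q1', 'median', 'Q3']
q['IQR'] = q['Q3'] - q['Q1']

# Whiskers reach the most extreme data points that still lie inside these limits
q['lower_limit'] = q['Q1'] - 1.5 * q['IQR']
q['upper_limit'] = q['Q3'] + 1.5 * q['IQR']
print(q)
```

In this toy data the value 44.0 for model year 82 lies above the upper limit, so a box plot would show it as an individual outlier point.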
violin plot: a graph that lets you check the density of the data distribution.
# violin plot
sns.violinplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), color='cornflowerblue')
swarm plot: a graph that shows the data distribution as individual points.
# swarm plot
fig, ax = plt.subplots(figsize=(20, 5))
ax.tick_params(labelsize=20)
sns.swarmplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), ax=ax)
#Correlation coefficient matrix (excluding rows with a value of 0)
df = df[(df!=0).all(axis=1)]
corr = df.corr()
corr
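To show what the zero filter and `corr()` do, here is a small sketch on a hypothetical `toy` frame. Note one assumption worth checking on real data: a row with NaN is not removed by this filter, because NaN != 0 evaluates to True.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row contains a 0, another contains a NaN
toy = pd.DataFrame({
    'MPG':    [18.0, 25.0, 30.0, 0.0, 36.0],
    'Weight': [3500.0, 2600.0, 2200.0, 2000.0, np.nan],
})

mask = (toy != 0).all(axis=1)   # True where no column equals 0
filtered = toy[mask]            # drops the row with MPG == 0, keeps the NaN row

# Pearson correlation by default; pairs with a NaN are left out of the calculation
corr = filtered.corr()

# Cross-check against NumPy on the rows that have both values
complete = filtered.dropna()
manual = np.corrcoef(complete['MPG'], complete['Weight'])[0, 1]
print(corr)
print(manual)
```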
I personally like the "coolwarm" colormap for cmap. If you do not specify one, the default colors are subdued and hard to read in presentation materials.
#Correlation coefficient heat map
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
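One optional refinement, sketched here on a hypothetical `toy` frame: fixing the color scale to the full correlation range from -1 to 1 makes the neutral midpoint of "coolwarm" correspond to zero correlation, so blue and red read as negative and positive. The `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical values, not the real dataset
toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 30.0, 36.0],
    'Weight': [3500.0, 2600.0, 2200.0, 2000.0],
    'Acceleration': [12.0, 19.0, 14.5, 16.0],
})

fig, ax = plt.subplots(figsize=(5, 4))
# vmin/vmax pin the color scale so that 0 sits at the middle of the colormap
sns.heatmap(toy.corr(), annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1, vmax=1, ax=ax)
```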
Thank you for reading to the end. In this post I organized the basic visualization methods. I will keep updating it as my own notes.
If you find anything that needs correcting, I would appreciate it if you could let me know.