[Updated from time to time] Organizing the basic visualization methods

Introduction

In this post I organize the basic visualization methods using the Auto MPG dataset, which records the fuel economy of automobiles from the 1970s to the early 1980s (model years 70 to 82).

Checking the data

# Import the required libraries
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os

file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
file_name = os.path.splitext(os.path.basename(file_path))[0]
column_names = ['MPG','Cylinders', 'Displacement', 'Horsepower', 'Weight',
                  'Acceleration', 'Model Year', 'Origin'] 

df = pd.read_csv(
    file_path, # File path (URL of the data file)
    names = column_names, # Specify the column names
    na_values ='?', # Read "?" as a missing value
    comment = '\t', # Ignore everything after a TAB (this drops the car name)
    sep = ' ', # Use a space as the delimiter
    skipinitialspace = True, # Skip the extra spaces that follow a delimiter
    encoding = 'utf-8'
) 
df.head()
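
To see what these `read_csv` options do without downloading the file, here is a minimal sketch that parses three lines laid out like `auto-mpg.data`. The rows are hand-typed stand-ins, and the names `sample_text` and `sample` are only for illustration.

```python
import io
import pandas as pd

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

# Hand-typed stand-in rows: space-padded numbers, then a TAB and a car name.
sample_text = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
    '27.0   4   97.00      88.00      2130.      14.5   70  3\t"car c"\n'
)

sample = pd.read_csv(
    io.StringIO(sample_text),
    names=column_names,
    na_values='?',          # "?" becomes NaN
    comment='\t',           # everything from the TAB onward is dropped
    sep=' ',                # fields are separated by spaces
    skipinitialspace=True,  # extra spaces after a separator are skipped
)

print(sample.shape)
print(sample['Horsepower'].isna().sum())
```

The car name never reaches the DataFrame because it sits after the TAB, which is why only eight column names are needed.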

Check the number of records and columns

#Check the number of records and columns
df.shape

Checking for missing values

#Check the number of missing values
df.isnull().sum()
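
As a small illustration of what this check returns, the sketch below runs it on a toy frame with made-up values. The names `toy`, `missing_counts`, and `missing_rows` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing Horsepower entry.
toy = pd.DataFrame({
    'MPG': [18.0, 25.0, 27.0],
    'Horsepower': [130.0, np.nan, 88.0],
})

missing_counts = toy.isnull().sum()            # missing values per column
missing_rows = toy[toy.isnull().any(axis=1)]   # rows with at least one missing value

print(missing_counts)
print(missing_rows)
```

Looking at the affected rows as well as the counts can help when deciding whether to drop or fill them.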

Check the data type of each column

# Check the data type (dtype) of each column of the DataFrame
df.dtypes

Visualization of missing values

A heatmap of missing values is useful when you want to see whether the missing values follow a pattern. It is also easy to understand, which makes it handy when explaining the data to people in the field.

plt.figure(figsize=(14,7))
sns.heatmap(df.isnull())
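
A pattern that is visible in the heatmap can also be checked numerically. The following is only a sketch on a toy frame with made-up values, where `Horsepower` is deliberately missing for a single model year; `toy` and `missing_rate` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): Horsepower is missing only for model year 71.
toy = pd.DataFrame({
    'Model Year': [70, 70, 71, 71, 72, 72],
    'Horsepower': [130.0, 165.0, np.nan, np.nan, 95.0, 88.0],
})

# Share of missing Horsepower values within each model year.
missing_rate = toy['Horsepower'].isnull().groupby(toy['Model Year']).mean()
print(missing_rate)
```

A rate that differs strongly between groups suggests the values are not missing at random.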

Checking summary statistics

#Summary statistics
df.describe()

Histogram creation

#histogram
df['MPG'].plot(kind='hist', bins=12)
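
The choice of `bins` changes how the same data looks. The sketch below counts one synthetic sample (a stand-in, not the real MPG column) with two different bin counts using `np.histogram`.

```python
import numpy as np

# Synthetic stand-in for the MPG column (not the real data).
rng = np.random.default_rng(0)
values = rng.normal(loc=23.0, scale=7.0, size=400)

# Same data, two different bin counts.
counts_coarse, edges_coarse = np.histogram(values, bins=5)
counts_fine, edges_fine = np.histogram(values, bins=30)

print(counts_coarse)
print(counts_fine)
```

With few bins the shape is smooth but coarse; with many bins it becomes detailed but noisy.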

Creating a kernel density estimate

The shape of a histogram changes with the bin width, so a plot based on kernel density estimation (KDE), which draws a smooth curve instead of bars, is often used instead.

# Kernel density estimation (fill replaces the deprecated shade argument)
sns.kdeplot(data=df['MPG'], fill=True)
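
To make the idea behind the curve concrete, here is a minimal hand-written Gaussian KDE in NumPy. It is a sketch, not what seaborn runs internally: the data are synthetic, the function `gaussian_kde_1d` is a hypothetical helper, and the fixed `bandwidth=2.0` is an arbitrary assumption (seaborn chooses a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic stand-in for the MPG column (not the real data).
rng = np.random.default_rng(0)
samples = rng.normal(loc=23.0, scale=7.0, size=400)

grid = np.linspace(samples.min() - 30.0, samples.max() + 30.0, 2000)
density = gaussian_kde_1d(samples, grid, bandwidth=2.0)

# A density should integrate to about 1 over a wide enough range.
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

The bandwidth plays a role similar to the bin width of a histogram: a larger value gives a smoother curve.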

Creating a scatter plot

Scatter plot + histogram

#Scatter plot+histogram
sns.jointplot(x='Model Year', y='MPG', data=df, alpha=0.3)

Hexagonal scatter plot

A hexbin plot groups the points into hexagonal bins and colors each bin by the number of points in it, giving a slightly more modern and stylish look than a plain scatter plot.

# hexagonal bins
sns.jointplot(x='Model Year', y='MPG', data=df, kind='hex')

Scatter plot of kernel density estimation

Draws the estimated joint density as contour-like filled regions.

# density estimates (fill replaces the deprecated shade argument)
sns.jointplot(x='Model Year', y='MPG', data=df, kind='kde', fill=True)

Scatterplot matrix

#Scatterplot matrix
sns.pairplot(df[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")

Creating a boxplot

Visualize how the data vary within each category.

Count plot

# Count plot by model year
ax = sns.countplot(x='Model Year', data=df, color='cornflowerblue')
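
The bar heights in a count plot are simply the number of rows in each category. A small sketch on a toy column with made-up values shows the same numbers computed with pandas.

```python
import pandas as pd

# Toy Model Year column (made-up values).
toy = pd.DataFrame({'Model Year': [70, 70, 71, 72, 72, 72]})

# Number of rows per model year, ordered by year.
counts = toy['Model Year'].value_counts().sort_index()
print(counts)
```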

Box plot

#Box plot(boxplot)
sns.boxplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), color='cornflowerblue')
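
To read a box plot it helps to know what is drawn: the box spans the first to third quartile, the line inside is the median, and points far from the box are shown individually. The sketch below computes these quantities for one toy group of made-up values, assuming the default whisker rule of 1.5 times the interquartile range.

```python
import pandas as pd

# Toy MPG values for one model year (made-up numbers, one extreme value).
mpg = pd.Series([14.0, 15.0, 16.0, 18.0, 19.0, 21.0, 22.0, 40.0])

q1, median, q3 = mpg.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = mpg[(mpg < lower_limit) | (mpg > upper_limit)]

print(q1, median, q3)
print(outliers.tolist())
```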

Violin plot

A graph that allows you to check the density of the data distribution.

# violin plot 
sns.violinplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'), color='cornflowerblue')

Swarm plot

A graph that shows the data distribution as individual dots.

# swarm plot
fig, ax = plt.subplots(figsize=(20, 5))
ax.tick_params(labelsize=20)
sns.swarmplot(x='Model Year', y='MPG', data=df.sort_values('Model Year'))

Heat map

Correlation coefficient matrix

# Correlation coefficient matrix (after dropping rows that contain a 0)
df = df[(df!=0).all(axis=1)]
corr = df.corr()
corr
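
The sketch below shows, on a toy frame with made-up values, what the zero filter keeps and how a Pearson correlation coefficient follows from its definition. Note that a NaN compares as not equal to 0, so rows with missing values pass this filter. The names `toy`, `filtered`, `manual_r`, and `pandas_r` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); the last row contains a 0.
toy = pd.DataFrame({
    'MPG':    [18.0, 25.0, 27.0, 15.0, 30.0],
    'Weight': [3504.0, 2046.0, 2130.0, 4000.0, 0.0],
})

# Keep only the rows in which no column equals 0.
filtered = toy[(toy != 0).all(axis=1)]

# Pearson correlation from its definition: covariance / (std_x * std_y).
x = filtered['MPG'].to_numpy()
y = filtered['Weight'].to_numpy()
manual_r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

pandas_r = filtered.corr().loc['MPG', 'Weight']
print(len(filtered), round(manual_r, 4), round(pandas_r, 4))
```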

Heat map of correlation coefficient matrix

I personally like the 'coolwarm' colormap for cmap. If you do not specify one, the default colors are subtle and hard to read in presentation materials.

#Correlation coefficient heat map
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

Conclusion

Thank you for reading to the end. In this post I organized the basic visualization methods. I will keep updating it as a personal memo.

If you find anything that needs correcting, I would appreciate it if you let me know.
