Choosing a data visualization method by type of explanatory variable and objective variable

Introduction

When doing machine learning, for example in a Kaggle competition, the first step is to visualize the data, and seaborn is often used for that. But seaborn offers many kinds of graphs, and you may wonder which one to use (I do). There are many explanations of which function draws which graph, but I feel there are few that explain in which situations a given graph is a good choice. So here I summarize which seaborn function to use for each combination of explanatory variable type and objective variable type.

Environment

python: 3.6.6
seaborn: 0.10.0

Explanatory variable: discrete (categorical), objective variable: discrete

First, the case where both the explanatory variable and the objective variable are discrete (categorical). Use seaborn's countplot, which draws how many rows fall into each category. Pass the explanatory variable to the x argument of countplot and the objective variable to hue, so that each category of x gets one bar per class of the objective variable. The data is Kaggle's Titanic dataset.

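import pandas as pd
import seaborn as sns

# train.csv from the Titanic competition
data = pd.read_csv("train.csv")
# One group of bars per port of embarkation, split by survival
sns.countplot(x='Embarked', data=data, hue='Survived')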

countplot.png

You can also swap x and hue; which arrangement to use is probably a matter of taste.

sns.countplot(x='Survived', data=data, hue='Embarked')

countplot2.png
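
The bar heights in a count plot are simply the row counts for each pair of explanatory and objective category, so they can be checked numerically with pandas' crosstab. The sketch below uses a small made-up DataFrame in place of train.csv; the column names mirror the Titanic data, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic train.csv (values are made up)
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "Q", "S", "C", "S", "Q"],
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
})

# Rows: explanatory variable, columns: objective variable.
# Each cell is the height of one bar in
# sns.countplot(x='Embarked', hue='Survived', data=toy).
counts = pd.crosstab(toy["Embarked"], toy["Survived"])
print(counts)
```

Passing normalize='index' to crosstab gives the share of each objective class within a category instead of raw counts, which can be easier to compare when the category sizes differ a lot.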

Explanatory variable: continuous, objective variable: discrete

Next, the case where the explanatory variable is continuous and the objective variable is discrete. Draw the distribution of the explanatory variable for each category of the objective variable with seaborn's distplot.

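# One overlaid distribution of Fare per value of Survived
g = sns.FacetGrid(data=data, hue='Survived', height=5)  # 'size' was renamed to 'height' in seaborn 0.9
g.map(sns.distplot, 'Fare')
g.add_legend()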

distplot.png

For how to color-code by category with functions that have no hue argument, see the separate article "How to color-code even methods that do not have hue as arguments in Seaborn" (mr160 / items / 112477ae98990216dae4).
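
What the FacetGrid call does is overlay one distribution per category of the objective variable. As a rough sketch of the same idea without distplot (which was deprecated in seaborn 0.11, a later release than the 0.10.0 used here), the code below loops over groupby and draws one normalized histogram per class with matplotlib. The DataFrame is randomly generated stand-in data, and the helper name plot_by_class is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_by_class(df, value_col, class_col, bins=20):
    """Overlay one normalized histogram of value_col per category of class_col."""
    fig, ax = plt.subplots(figsize=(5, 5))
    for label, group in df.groupby(class_col):
        ax.hist(group[value_col], bins=bins, density=True, alpha=0.5,
                label=f"{class_col}={label}")
    ax.set_xlabel(value_col)
    ax.set_ylabel("density")
    ax.legend()
    return ax


# Made-up data standing in for the Titanic 'Fare' and 'Survived' columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Fare": np.concatenate([rng.exponential(20, 200), rng.exponential(50, 100)]),
    "Survived": [0] * 200 + [1] * 100,
})
ax = plot_by_class(toy, "Fare", "Survived")
```

To view the result, call plt.show() in an interactive session or ax.figure.savefig with a file name of your choice.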

Explanatory variable: discrete, objective variable: continuous

Next, the case where the explanatory variable is discrete and the objective variable is continuous. Draw the distribution of the objective variable for each category of the explanatory variable with seaborn's violinplot. The data is Kaggle's House Prices dataset.

# train.csv from the House Prices competition (a different file from the Titanic one above)
train_data = pd.read_csv("train.csv")
sns.violinplot(x="MSZoning", y="SalePrice", data=train_data)

violinplot.png
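
When there are several categories, the violins can be easier to compare if they are sorted. As one possible refinement (not part of the original code), the sketch below orders the categories by the median of the objective variable and passes that list to violinplot's order argument. The data is a randomly generated stand-in for the House Prices columns, not real values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the House Prices columns (not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "MSZoning": ["RL"] * 50 + ["RM"] * 50 + ["FV"] * 50,
    "SalePrice": np.concatenate([
        rng.normal(180_000, 30_000, 50),
        rng.normal(120_000, 20_000, 50),
        rng.normal(220_000, 25_000, 50),
    ]),
})

# Sort the categories by median of the objective variable, lowest first
order = toy.groupby("MSZoning")["SalePrice"].median().sort_values().index.tolist()

# Draw one violin per category in that order
ax = sns.violinplot(x="MSZoning", y="SalePrice", data=toy, order=order)
```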

Explanatory variable: continuous, objective variable: continuous

Finally, the case where both the explanatory variable and the objective variable are continuous. Draw the relationship between the explanatory variable and the objective variable with seaborn's jointplot.

sns.jointplot(x="LotArea", y="SalePrice", data=train_data)

jointplot.png

jointplot is convenient because it shows the relationship between the two variables and the distribution of each one at the same time.
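
To put a number on the relationship that the joint plot shows, the Pearson correlation coefficient can be computed with pandas' Series.corr. The sketch below uses invented values in place of LotArea and SalePrice, chosen to be exactly linear so that the result is easy to check.

```python
import pandas as pd

# Invented values standing in for the House Prices columns;
# SalePrice is exactly 10 * LotArea + 50000 here, so the correlation is 1.
toy = pd.DataFrame({"LotArea": [5000, 8000, 9500, 11000, 14000]})
toy["SalePrice"] = 10 * toy["LotArea"] + 50_000

# Pearson correlation between the explanatory and the objective variable
r = toy["LotArea"].corr(toy["SalePrice"])
print(round(r, 6))  # 1.0
```

On real data the coefficient will generally be below 1, and a few extreme points can dominate it; the scatter part of the joint plot is where such points become visible.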

Summary

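The above is summarized in the table below.

sns_summary.png

In text form:

Explanatory discrete, objective discrete: countplot
Explanatory continuous, objective discrete: distplot (through FacetGrid with hue)
Explanatory discrete, objective continuous: violinplot
Explanatory continuous, objective continuous: jointplot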
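
The summary can also be written down as a small lookup. This is only a sketch of the choices made in this article; the function name suggest_plot and its boolean arguments are hypothetical, and deciding whether a column counts as discrete or continuous is left to the reader (for example, Survived is stored as integers but is discrete).

```python
def suggest_plot(explanatory_is_continuous, objective_is_continuous):
    """Return the seaborn function name this article uses for the given variable types."""
    table = {
        (False, False): "countplot",   # discrete vs. discrete
        (True, False): "distplot",     # continuous vs. discrete (via FacetGrid with hue)
        (False, True): "violinplot",   # discrete vs. continuous
        (True, True): "jointplot",     # continuous vs. continuous
    }
    return table[(bool(explanatory_is_continuous), bool(objective_is_continuous))]


print(suggest_plot(False, False))  # countplot
print(suggest_plot(True, True))    # jointplot
```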

If you find any mistakes or know of a more appropriate method, please point it out.
