When you work on machine learning problems such as Kaggle competitions, the first thing to do is visualize the data, and seaborn is often used for that. But seaborn offers many kinds of plots, and you may have wondered which one to use (I have). There are many explanations of which method draws a given graph, but few of the circumstances in which a given graph is the right choice. So here I summarize which seaborn method to use depending on the types of the explanatory variable and the objective variable.
Environment: Python 3.6.6, seaborn 0.10.0
First, the case where both the explanatory variable and the objective variable are discrete (categorical). Use seaborn's countplot, which draws how many rows fall into each category of the objective variable. Pass the explanatory variable to the x argument of countplot and the objective variable to hue. The data is the Titanic dataset.
import pandas as pd
import seaborn as sns
data=pd.read_csv("train.csv")
sns.countplot(x='Embarked', data=data, hue='Survived')
You can also swap x and hue; which arrangement is better is probably a matter of taste.
sns.countplot(x='Survived', data=data, hue='Embarked')
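To check the numbers behind a count plot, the same counts can be tabulated with pandas. The sketch below is only an illustration: the `toy` DataFrame is made-up data that borrows the Titanic column names, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "C", "Q", "S"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Rows are the x categories, columns are the hue categories.
# Each cell corresponds to the height of one bar in the count plot.
counts = pd.crosstab(toy["Embarked"], toy["Survived"])
print(counts)
```

Swapping the two arguments of `pd.crosstab` gives the transposed table, which matches swapping x and hue in the plot.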
Next, the case where the explanatory variable is continuous and the objective variable is discrete. Draw the distribution of the explanatory variable for each category of the objective variable with seaborn's distplot.
# `size` was renamed to `height` in seaborn 0.9, so use `height` here
g=sns.FacetGrid(data=data, hue='Survived', height=5)
g.map(sns.distplot, 'Fare')
g.add_legend()
For how to color-code with methods that do not take hue as an argument, see the separate article "How to color-code even methods that do not have hue as arguments in Seaborn" (mr160/items/112477ae98990216dae4).
Next, the case where the explanatory variable is discrete and the objective variable is continuous. Draw the distribution of the objective variable for each category of the explanatory variable with seaborn's violinplot. The data is Kaggle's House Prices.
train_data=pd.read_csv("train.csv")
sns.violinplot(x="MSZoning", y="SalePrice", data=train_data)
Finally, the case where both the explanatory variable and the objective variable are continuous. Draw the correlation between the explanatory variable and the objective variable with seaborn's jointplot.
sns.jointplot(x="LotArea", y="SalePrice", data=train_data)
This joint plot is excellent because you can see the correlation between two variables and their distribution at the same time.
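As a numeric companion to what the joint plot shows visually, the Pearson correlation coefficient can be computed with pandas. This is a minimal sketch on made-up values; the `toy` DataFrame only borrows the House Prices column names and is not the real data.

```python
import pandas as pd

# Hypothetical toy values, not taken from the House Prices data.
toy = pd.DataFrame({
    "LotArea": [1, 2, 3, 4, 5],
    "SalePrice": [2, 4, 5, 4, 5],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = toy["LotArea"].corr(toy["SalePrice"])
print(round(r, 4))
```

A single coefficient hides the shape of the relationship, which is why looking at the plot itself is still worthwhile.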
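The above is summarized in the table below.

Explanatory variable | Objective variable | seaborn method
discrete             | discrete           | countplot
continuous           | discrete           | distplot (with FacetGrid)
discrete             | continuous         | violinplot
continuous           | continuous         | jointplot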
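The four cases above can also be written down as a small helper that suggests a method name from the column types. This is only a sketch: `is_continuous`, `suggest_plot`, and the `max_categories` threshold are hypothetical names and a rough heuristic of my own, not part of seaborn, and the `toy` DataFrame is made-up data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_continuous(series, max_categories=10):
    """Rough heuristic (an assumption): numeric with many distinct values."""
    return is_numeric_dtype(series) and series.nunique() > max_categories


def suggest_plot(df, explanatory, objective):
    """Return the seaborn method name for the four cases in this article."""
    x_cont = is_continuous(df[explanatory])
    y_cont = is_continuous(df[objective])
    if not x_cont and not y_cont:
        return "countplot"
    if x_cont and not y_cont:
        return "distplot"
    if not x_cont and y_cont:
        return "violinplot"
    return "jointplot"


# Hypothetical toy data with two categorical and two continuous columns.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q"] * 4,
    "Survived": [0, 1] * 6,
    "Fare": [7.25 + i for i in range(12)],
    "Age": [20 + 2 * i for i in range(12)],
})

print(suggest_plot(toy, "Embarked", "Survived"))
print(suggest_plot(toy, "Fare", "Survived"))
```

The threshold of 10 distinct values is arbitrary; integer-coded categories with many levels, or continuous columns with few rows, would need a different rule.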
Please let me know if you find any mistakes or know of more appropriate methods.