I often struggle to decide which graph to use when visualizing data. So in my previous article (Visualization method of data by explanatory variable and objective variable), I summarized which graphs suit each combination of explanatory-variable and objective-variable types. Even while writing it, though, I was thinking "I'll forget this soon!" So I wrote a function that automatically determines the type of each variable and draws a suitable graph.
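The seaborn functions that suit each combination of explanatory and objective variable type (discrete or continuous) are as follows, matching what the code below uses. For details, please refer to the previous article linked above.
- Discrete explanatory variable, discrete objective variable: countplot (split by the objective variable with hue)
- Discrete explanatory variable, continuous objective variable: violinplot
- Continuous explanatory variable, discrete objective variable: distplot (one histogram per class)
- Continuous explanatory variable, continuous objective variable: jointplot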
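The pairing of variable types and seaborn functions can also be written down as a small lookup table. The following is only a minimal sketch and not code from this article: the names PLOT_FOR and choose_plot are hypothetical, and the values are simply the names of the seaborn functions that visualize_data calls in each case.

```python
# Hypothetical lookup table: (explanatory is discrete?, objective is discrete?)
# -> name of the seaborn function used for that combination.
PLOT_FOR = {
    (True, True): "countplot",    # counts of the explanatory variable, split by hue
    (True, False): "violinplot",  # distribution of the objective per category
    (False, True): "distplot",    # histogram of the explanatory variable per class
    (False, False): "jointplot",  # scatter plot with marginal distributions
}


def choose_plot(explanatory_is_discrete, objective_is_discrete):
    """Return the plot name for one explanatory/objective pair."""
    return PLOT_FOR[(bool(explanatory_is_discrete), bool(objective_is_discrete))]


print(choose_plot(True, False))
```

Keeping the mapping in one place like this makes it easy to see at a glance that all four combinations are covered.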
Below is the code for the function I wrote.
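import matplotlib.pyplot as plt
import seaborn as sns

def visualize_data(data, target_col):
    length = 10
    subplot_size = (length, length / 2)
    # The objective variable's type does not change, so check it once
    target_is_cat = is_categorical(data, target_col)
    for key in data.keys():
        if key == target_col:
            continue
        key_is_cat = is_categorical(data, key)
        if key_is_cat and target_is_cat:
            # Discrete vs. discrete: overall counts, then counts split by the target
            fig, axes = plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.countplot(x=key, data=data, hue=target_col, ax=axes[1])
            fig.tight_layout()
            plt.show()
        elif key_is_cat and not target_is_cat:
            # Discrete vs. continuous: counts, then the target's distribution per category
            fig, axes = plt.subplots(1, 2, figsize=subplot_size)
            sns.countplot(x=key, data=data, ax=axes[0])
            sns.violinplot(x=key, y=target_col, data=data, ax=axes[1])
            fig.tight_layout()
            plt.show()
        elif not key_is_cat and target_is_cat:
            # Continuous vs. discrete: overall histogram, then one histogram per class
            # (distplot is deprecated since seaborn 0.11; histplot is its replacement)
            fig, axes = plt.subplots(1, 2, figsize=subplot_size)
            sns.distplot(data[key], ax=axes[0], kde=False)
            g = sns.FacetGrid(data, hue=target_col)
            g.map(sns.distplot, key, ax=axes[1], kde=False)
            axes[1].legend()
            fig.tight_layout()
            # FacetGrid opened its own, unused figure, which is now the current one;
            # close it so that only fig is shown
            plt.close()
            plt.show()
        else:
            # Continuous vs. continuous: scatter plot with marginal distributions
            sns.jointplot(x=key, y=target_col, data=data, height=length * 2 / 3)
            plt.show()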
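The is_categorical helper is as follows.
from pandas.api.types import is_float_dtype, is_integer_dtype

def is_categorical(data, key):
    col_type = data[key].dtype
    # is_integer_dtype matches every integer width; comparing the dtype with
    # the string 'int' only matches the platform's default integer type
    if is_integer_dtype(col_type):
        # Integers count as discrete only when they take 5 or fewer distinct values
        nunique = data[key].nunique()
        return nunique < 6
    elif is_float_dtype(col_type):
        # Floats are always treated as continuous
        return False
    else:
        # Strings, booleans, and so on are treated as discrete
        return True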
In outline, it works like this:
- Pass the data you want to visualize (a pandas.DataFrame) as data and the column name of the objective variable as target_col.
- is_categorical determines whether the explanatory variable and the objective variable are discrete or continuous, and each pair is drawn with the appropriate seaborn function.
When the data type is int, a column with 6 or more distinct values is treated as continuous, and a column with 5 or fewer distinct values is treated as discrete. float columns are always treated as continuous, and every other type is treated as discrete. To be honest, there is room for improvement in this rule.
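As one possible direction for that improvement, the threshold could be made a parameter, so that it can be tuned per data set instead of being fixed at 5. This is only a sketch under that assumption: the name is_categorical_v2 and the max_unique parameter are hypothetical, the dtype checks use pandas.api.types, and the sample columns are made up.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def is_categorical_v2(data, key, max_unique=5):
    """Treat a column as discrete unless it looks like a continuous number."""
    column = data[key]
    if is_float_dtype(column.dtype):
        return False  # floats are always treated as continuous
    if is_integer_dtype(column.dtype):
        # integers are discrete only when they take few distinct values
        return column.nunique() <= max_unique
    return True  # strings, booleans, categories, ...


# Made-up sample columns, only for illustration
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],                     # int, 2 distinct values
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0],   # float
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],                # strings
    "SibSp": [1, 0, 3, 4, 2, 5, 8, 6],                        # int, 8 distinct values
})

for key in sample.columns:
    print(key, is_categorical_v2(sample, key))
```

With the default threshold, Survived and Sex come out as discrete while Age and SibSp come out as continuous. Passing max_unique=10 would make SibSp discrete as well.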
Let's apply it to the Titanic data (only part of the output is shown, because the full result is long).
import pandas as pd

data = pd.read_csv("train.csv")
# Drop identifier-like columns whose values are mostly unique to each passenger
data = data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1)
visualize_data(data, "Survived")
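The columns dropped here take a different value on most rows, so a count plot of them says very little. If you want to spot such columns before plotting, one rough heuristic (an assumption, not part of the code in this article) is to compare the number of distinct values with the number of rows. The function name find_id_like_columns, the 0.5 threshold, and the sample rows are all hypothetical. Float columns are skipped because continuous measurements naturally have many distinct values.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def find_id_like_columns(data, min_unique_ratio=0.5):
    """Return non-float columns whose share of distinct values reaches the ratio."""
    n_rows = len(data)
    if n_rows == 0:
        return []
    return [
        key for key in data.columns
        if not is_float_dtype(data[key].dtype)
        and data[key].nunique() / n_rows >= min_unique_ratio
    ]


# Made-up rows, only for illustration
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

print(find_id_like_columns(sample))
```

The result is only a list of candidates. Whether to drop a column is still a judgment call, since a high-cardinality column can also be a legitimate feature.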
I was able to automatically draw an appropriate graph for each type!
I have uploaded this to GitHub together with the code from my earlier post, Method to get an overview of data with Pandas. Please use it! In the future, I would like to automate various preprocessing steps as well.