Python has several data visualization libraries, but Pandas alone goes a long way. Plotting with Pandas fits into a method chain, which cuts down on the clutter of temporary variables. In this article, I introduce visualization recipes, focusing on the ones I often use in practice.
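This article borrows two datasets: the Titanic data (train.csv in titanic.zip) and the Boston crime data (crime.csv in crimes-in-boston.zip).
Load them into the DataFrames titanic and crime, respectively.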
import zipfile

import pandas as pd

# Read each CSV straight out of its zip archive.
with zipfile.ZipFile('titanic.zip') as myzip:
    with myzip.open('train.csv') as myfile:
        titanic = pd.read_csv(myfile)

with zipfile.ZipFile('crimes-in-boston.zip') as myzip:
    with myzip.open('crime.csv') as myfile:
        # Decode as latin-1 and parse the timestamp column as datetime.
        crime = pd.read_csv(myfile, encoding='latin-1', parse_dates=['OCCURRED_ON_DATE'])
A histogram is the quickest way to see the distribution of numerical data. When there are only a few unique values, a bar chart may be more appropriate.
titanic['Age'].plot.hist()
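As a rough sketch of the two cases above, the following draws a histogram for a column with many distinct values and a bar chart of counts for a column with few. The DataFrame `toy` and its values are made up for illustration (they stand in for the Titanic columns `Age` and `SibSp`), and the `Agg` backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": rng.normal(30, 12, 200).clip(0, 80),
    "SibSp": rng.integers(0, 4, 200),
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Many distinct values: histogram with an explicit number of bins.
toy["Age"].plot.hist(bins=20, ax=ax_hist)

# Few distinct values: count each value and draw one bar per value.
counts = toy["SibSp"].value_counts().sort_index()
counts.plot.bar(ax=ax_bar)
```

Sorting the counts by index keeps the bars in value order rather than in order of frequency.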
A box plot is used to look at quartiles. Points more than 1.5 times the box length (the interquartile range) beyond either end of the box are marked as outliers. Pandas cannot draw violin plots, so use Seaborn for those.
titanic['Age'].plot.box()
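To make the 1.5 x box length rule concrete, here is a small worked example that computes the outlier bounds by hand. The ages are made-up values, not taken from the Titanic data, and the variable names are only illustrative.

```python
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([2, 18, 22, 25, 28, 30, 33, 36, 40, 80], name="Age")

# The box spans the first to the third quartile.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1  # the "box length"

# Anything beyond 1.5 box lengths from the box is drawn as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
```

With these values the bounds come out at 4.0 and 54.0, so 2 and 80 are the points that would be drawn individually.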
Kernel density estimation estimates the probability density function (PDF) from the data. For one-dimensional data, a histogram may be enough.
For more information on kernel density estimation in Python, see here.
It relies on SciPy, so if SciPy is not installed, install it with pip install scipy.
titanic['Age'].plot.kde()
A scatter plot is used to see the relationship between two real-valued variables. If too many points overlap, the density becomes invisible, so making the markers transparent is the standard remedy. If either variable is categorical or has few unique values, the grouped histograms and box plots described below are a better choice.
titanic.plot.scatter(x='Age', y='Fare', alpha=0.3)
Hexagonal binning plot: I have never used it in practice, but I include it for completeness.
titanic.plot.hexbin(x='Age', y='Fare', gridsize=30)
A bar plot is often used to show an aggregated value for each category.
titanic['Embarked'].value_counts(dropna=False).plot.bar()
The same plot laid on its side, as a horizontal bar plot.
titanic['Embarked'].value_counts(dropna=False).plot.barh()
Horizontal Bar Plot with DataFrame Styling
DataFrame styling can draw bars inside the table cells, so the table itself looks like a bar graph. I use it a lot because the values stay as text and remain searchable.
titanic['Embarked'].value_counts(dropna=False).to_frame().style.bar(vmin=0)
A line plot is often used to see how values change along a series.
crime['OCCURRED_ON_DATE'].dt.date.value_counts().plot.line(figsize=(16, 4))
Like the line plot, an area plot shows changes along a series, but it also shows the magnitude measured from zero. If the series is too fine-grained, the valleys become hard to see, so it is better to aggregate it into somewhat coarser bins.
crime['OCCURRED_ON_DATE'].dt.date.value_counts().plot.area(figsize=(16, 4), linewidth=0)
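One possible way to discretize the series a little is to count per week instead of per day. The sketch below assumes a synthetic timestamp Series `occurred` in place of `crime['OCCURRED_ON_DATE']` (the timestamps are random and made up), and weekly bins are just one choice of granularity.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for crime['OCCURRED_ON_DATE'] (timestamps are made up).
start = pd.Timestamp("2018-01-01")
occurred = pd.Series(
    start + pd.to_timedelta(rng.integers(0, 180 * 24, 5000), unit="h"),
    name="OCCURRED_ON_DATE",
)

# Daily counts, as in the recipe above.
daily = occurred.dt.date.value_counts().sort_index()

# Coarser bins: map each timestamp to the start of its week, then count.
week_start = occurred.dt.to_period("W").dt.start_time
weekly = week_start.value_counts().sort_index()

ax = weekly.plot.area(figsize=(16, 4), linewidth=0)
```

The total number of events is unchanged; only the number of points along the x-axis shrinks, which makes the valleys easier to read.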
I don't use pie charts because they are hard to read, but I include them for completeness. The reasons why pie charts are hard to read are summarized in the following article.
- Do you still use pie charts? (Data Visualization Ideabook)
titanic['Embarked'].value_counts(dropna=False).plot.pie()
Grouped Histogram
Often used to compare distributions between two groups (though it also works with more than two).
titanic.groupby('Survived')['Age'].plot.hist(alpha=0.5, legend=True)
Or
titanic['Age'].groupby(titanic['Survived']).plot.hist(alpha=0.5, legend=True)
The latter form also lets you group by an external Series that is not a column of the DataFrame.
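A minimal sketch of grouping by an external Series: here the key `sex` lives outside the DataFrame and is matched to the rows by index. Both `toy` and `sex` are synthetic, made-up data used only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the Titanic data (values are made up).
toy = pd.DataFrame({"Age": rng.uniform(1, 70, 300)})

# An external key: a separate Series sharing the same index as toy.
sex = pd.Series(rng.choice(["female", "male"], 300), index=toy.index, name="Sex")

fig, ax = plt.subplots()
# Group the Age column by the external Series and overlay the histograms.
toy["Age"].groupby(sex).plot.hist(alpha=0.5, legend=True, ax=ax)

group_sizes = toy["Age"].groupby(sex).size()
```

Because the key is aligned by index, it can come from another table or from a computed condition without first being added as a column.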
Grouped Box Plot
Box plots do not work well with groupby, so write it as follows.
titanic.boxplot(column='Age', by='Survived')
Grouped Kernel Density Estimation
Like the histogram, it can be used to compare distributions between groups.
titanic['Age'].groupby(titanic['Survived']).plot.kde(legend=True)
Grouped Scatter Plot
I think this is a commonly needed plot, but I have not found a neat way to write it.
With groupby, the result is a separate plot for each group.
titanic.groupby('Survived').plot.scatter(x='Age', y='Fare', alpha=0.3)
The following only works when the key is numeric, but it draws a single scatter plot with a different color for each group.
titanic.plot.scatter(x='Age', y='Fare', c='Survived', cmap='viridis', alpha=0.3)
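When the key is not numeric, one workaround is to convert it to integer category codes first. This is only a sketch on made-up data; the column name `Embarked_code` is hypothetical and introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": rng.uniform(1, 70, 150),
    "Fare": rng.gamma(2.0, 15.0, 150),
    "Embarked": np.tile(["C", "Q", "S"], 50),
})

# Map each category to an integer code (C -> 0, Q -> 1, S -> 2).
coded = toy.assign(Embarked_code=toy["Embarked"].astype("category").cat.codes)

ax = coded.plot.scatter(x="Age", y="Fare", c="Embarked_code", cmap="viridis", alpha=0.3)
```

Note that missing values in the key would receive the code -1, so they may need to be handled before plotting.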
The official Pandas documentation shows how to draw two plots on a shared Axes.
ax = titanic[titanic['Survived'] == 0].plot.scatter(x='Age', y='Fare', label=0, alpha=0.3)
titanic[titanic['Survived'] == 1].plot.scatter(x='Age', y='Fare', c='tab:orange', label=1, alpha=0.3, ax=ax)
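The shared-Axes approach can be generalized to any number of groups with a loop over groupby. This is a sketch rather than an established idiom; `toy` is synthetic, made-up data, and the colors simply follow matplotlib's default cycle (`C0`, `C1`, ...).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": rng.uniform(1, 70, 200),
    "Fare": rng.gamma(2.0, 15.0, 200),
    "Survived": rng.integers(0, 2, 200),
})

fig, ax = plt.subplots()
# Draw one scatter per group on the same Axes, one color per group.
for i, (key, group) in enumerate(toy.groupby("Survived")):
    group.plot.scatter(x="Age", y="Fare", color=f"C{i}", label=str(key), alpha=0.3, ax=ax)
```

This gives up the method chain, but the key no longer has to be numeric and each group gets its own legend entry.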
Grouped Hexagonal Binning Plot
titanic.groupby('Survived').plot.hexbin(x='Age', y='Fare', gridsize=30)
Grouped Bar Plot
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.bar()
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.bar()
Grouped Horizontal Bar Plot
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.barh()
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.barh()
Grouped Horizontal Bar Plot with DataFrame Styling
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).style.bar(vmin=0, axis=None)
Grouped Line Plot
crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).plot.line(figsize=(16, 4), alpha=0.5)
crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).iloc[:, :4].plot.line(figsize=(16, 4), alpha=0.5)
Stacked Area Plot
crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).plot.area(figsize=(16, 4), linewidth=0)
crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).iloc[:, :4].plot.area(figsize=(16, 4), linewidth=0)
Grouped Pie Plot
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.pie(subplots=True)
Stacked Bar Plot
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.bar(stacked=True)
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.bar(stacked=True)
Stacked Horizontal Bar Plot
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.barh(stacked=True)
titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.barh(stacked=True)
Percent Stacked Bar Plot
To draw a 100% stacked bar chart, you have to calculate the percentage.
(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack()
    .div(titanic['Survived'].value_counts(dropna=False), axis=0)
    .plot.bar(stacked=True))
(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0)
    .div(titanic['Embarked'].value_counts(dropna=False), axis=0)
    .plot.bar(stacked=True))
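As an alternative sketch for the percentage calculation, `pd.crosstab` with `normalize='index'` returns row-wise shares directly. The data below is made up, and note one difference from the recipe above: `crosstab` drops missing values by default, so it is not an exact replacement for `value_counts(dropna=False)`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

# Small made-up sample standing in for the Titanic columns.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    "Embarked": ["S", "S", "C", "Q", "S", "C", "C", "S", "Q", "S"],
})

# Share of each port within each Survived group; every row sums to 1.
share = pd.crosstab(toy["Survived"], toy["Embarked"], normalize="index")

ax = share.plot.bar(stacked=True)
```

Using `normalize='columns'` instead would give the share of each Survived value within each port.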
Percent Stacked Horizontal Bar Plot
(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack()
    .div(titanic['Survived'].value_counts(dropna=False), axis=0)
    .plot.barh(stacked=True))
(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0)
    .div(titanic['Embarked'].value_counts(dropna=False), axis=0)
    .plot.barh(stacked=True))
Overlay Plots
Overlay the histogram and the kernel density estimate. The density is drawn against a secondary y-axis because its scale differs from that of the counts.
titanic['Age'].groupby(titanic['Survived']).plot.hist(alpha=0.5, legend=True)
titanic['Age'].groupby(titanic['Survived']).plot.kde(legend=True, secondary_y=True)
Grouped Bar Plot with Error Bars
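To draw error bars, you have to calculate the error yourself. The code below uses the standard deviation (std); to show the standard error instead, replace std() with sem().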
yerr = titanic.groupby(['Survived', 'Pclass'])['Fare'].std().unstack(0)
titanic.groupby(['Survived', 'Pclass'])['Fare'].mean().unstack(0).plot.bar(yerr=yerr)
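The following sketch shows the standard-error variant and checks how it relates to the standard deviation. It runs on a synthetic DataFrame `toy` with made-up fares; `sem()` is the pandas groupby method for the standard error of the mean.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the Titanic data: 20 rows per (Survived, Pclass) pair.
toy = pd.DataFrame({
    "Survived": np.tile([0, 1], 60),
    "Pclass": np.repeat([1, 2, 3], 40),
    "Fare": rng.gamma(2.0, 15.0, 120),
})

grouped = toy.groupby(["Survived", "Pclass"])["Fare"]
mean = grouped.mean().unstack(0)

# Standard error of the mean: std / sqrt(n) within each group.
sem = grouped.sem().unstack(0)
sem_by_hand = (grouped.std() / np.sqrt(grouped.count())).unstack(0)

ax = mean.plot.bar(yerr=sem)
```

Standard-error bars shrink as the group size grows, whereas standard-deviation bars describe the spread of the data itself, so which one to draw depends on what the chart is meant to show.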
Heat Map with DataFrame Styling
(pd.crosstab(crime['DAY_OF_WEEK'], crime['HOUR'].div(3).map(int).mul(3), normalize=True)
    .reindex(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
    .style.background_gradient(axis=None).format('{:.3%}'))
With a green color map such as YlGn, it looks like a lawn.
(pd.crosstab(crime['DAY_OF_WEEK'], crime['MONTH'], normalize=True)
    .reindex(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
    .style.background_gradient(axis=None, cmap='YlGn').format('{:.3%}'))
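The first heat map above groups HOUR into 3-hour slots with `.div(3).map(int).mul(3)` and normalizes the whole table. The sketch below, on made-up data, shows what those two steps produce; the integer-division form `// 3 * 3` is an equivalent spelling for non-negative hours, offered here as an assumption-free check rather than as the author's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the crime data (values are made up).
toy = pd.DataFrame({
    "DAY_OF_WEEK": rng.choice(["Sunday", "Monday", "Tuesday"], 500),
    "HOUR": rng.integers(0, 24, 500),
})

# 0-2 -> 0, 3-5 -> 3, ..., 21-23 -> 21
slot = toy["HOUR"].div(3).map(int).mul(3)
slot_floor = toy["HOUR"] // 3 * 3  # same result via integer division

# normalize=True divides every cell by the grand total.
table = pd.crosstab(toy["DAY_OF_WEEK"], slot, normalize=True)
```

Because the normalization is over the whole table, each cell is the share of all events that fall on that weekday and time slot.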
Correlation Heat Map with DataFrame Styling
This is covered in the following article.
- [One line] Heatmap the correlation matrix with Pandas only
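# numeric_only=True (pandas 1.5+) skips string columns, which raise an error in pandas 2.0 and later
corr = titanic.corr(numeric_only=True)
# Extend the low end of the color range so that it starts at -1 (the maximum is the diagonal, 1)
low = (1 + corr.values.min()) / (1 - corr.values.min())
corr.style.background_gradient(axis=None, cmap='viridis', low=low).format('{:.6f}')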
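The `low` value above can be checked with a little arithmetic. According to the pandas documentation, `low` extends the color range below the minimum by that multiple of the data range; the sketch below, on made-up data, confirms that the formula moves the lower end to -1, assuming the maximum of the matrix is the diagonal value 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic numeric data (values are made up); Fare is made to correlate negatively with Age.
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["Age", "Fare", "SibSp"])
toy["Fare"] = toy["Fare"] - 0.5 * toy["Age"]

corr = toy.corr()
cmin, cmax = corr.values.min(), corr.values.max()  # cmax is the diagonal, 1

low = (1 + cmin) / (1 - cmin)

# Lower end of the color range after extending by low * (data range).
lower_bound = cmin - low * (cmax - cmin)
```

With the range running from -1 to 1, zero correlation sits at the middle of the color map regardless of the smallest value in the matrix.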
I have introduced the plots that seem relatively easy to use. If you know of others, please let me know. If you want to draw more elaborate graphs, the following page will be helpful.