Aligning histogram bin widths neatly in matplotlib and seaborn

Introduction

Histograms are often overlaid to compare the distribution of data by label, but depending on the data the bin widths can differ noticeably between the overlaid histograms. BI tools such as Tableau handle this for you, but matplotlib and seaborn do not align the bins automatically, so you need to do it yourself.

Method

Use the bins argument.

bins : int or sequence or str, optional (matplotlib.pyplot.hist)

Because bins accepts a sequence of bin edges as well as an integer, all you have to do is pass the minimum and maximum values to the range function, together with the desired bin width as the step.
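As a minimal sketch of the difference between the integer form and the sequence form, the example below uses np.histogram, which plt.hist relies on for its binning, so no plot is needed. The sample values and the edges are made up for illustration.

```python
import numpy as np

# Made-up sample values, purely for illustration
values = np.array([1.0, 2.5, 3.0, 7.5, 9.0, 12.0])

# Integer bins: the edges are derived from this data's own min and max
counts_int, edges_int = np.histogram(values, bins=3)

# Sequence bins: the edges are exactly the ones passed in
edges = list(range(0, 20, 5))  # 0, 5, 10, 15
counts_seq, edges_seq = np.histogram(values, bins=edges)

print(edges_int)   # three equal bins spanning 1.0 to 12.0
print(edges_seq)   # the edges supplied above
print(counts_seq)  # observations falling into each supplied bin
```

With an integer, two datasets with different ranges end up with different edges; with a sequence, every dataset is counted against the same edges.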


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Prepare a DataFrame with two labels whose data follow different distributions
df_1st = pd.DataFrame(np.random.normal(loc=20, scale=10, size=100), columns=["val"])
df_1st["target"] = "class_1"
df_2nd = pd.DataFrame(np.random.normal(loc=15, scale=20, size=100), columns=["val"])
df_2nd["target"] = "class_2"

df = pd.concat([df_1st, df_2nd])

Before bin width correction

import matplotlib.pyplot as plt
import seaborn as sns

# Plot by target
# (distplot is deprecated since seaborn 0.11; sns.histplot is its replacement)
for val in df["target"].unique():
    ax = sns.distplot(df.query('target == @val')["val"], kde=False, label=f"target is {val}")

ax.legend()

After bin width correction

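# Minimum value, rounded down so the smallest observation is covered
x_min = int(np.floor(df["val"].min()))

# Maximum value, rounded up so the largest observation is covered
x_max = int(np.ceil(df["val"].max()))

# Bin edges every 5 units from the minimum to the maximum
# (the stop is x_max + bin_width because range excludes its stop value)
bin_width = 5
range_bin_width = range(x_min, x_max + bin_width, bin_width)

# Plot by target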
for val in df["target"].unique():
    ax = sns.distplot(df.query('target == @val')["val"], bins=range_bin_width, kde=False, label=f"target is {val}")

ax.legend()
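If you would rather fix the number of bins than the bin width, one possible alternative is to compute a single set of edges from the combined data with np.histogram_bin_edges and reuse it for every class. This is only a sketch: the variable names, the random seed, and the choice of 20 bins are assumptions for illustration, not part of the method above.

```python
import numpy as np

# Synthetic data with the same shape as the example above (seed is arbitrary)
rng = np.random.default_rng(0)
class_1 = rng.normal(loc=20, scale=10, size=100)
class_2 = rng.normal(loc=15, scale=20, size=100)

# One set of edges computed over both classes together
combined = np.concatenate([class_1, class_2])
shared_edges = np.histogram_bin_edges(combined, bins=20)

# Count each class against the same edges
counts_1, edges_1 = np.histogram(class_1, bins=shared_edges)
counts_2, edges_2 = np.histogram(class_2, bins=shared_edges)

# All bins share one width, and no observation falls outside the edges
widths = np.diff(shared_edges)
print(np.allclose(widths, widths[0]))
print(counts_1.sum(), counts_2.sum())
```

The shared_edges array can be passed as the bins argument in the plotting loop in the same way as range_bin_width.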

Supplement

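If bins is not set, seaborn's distplot determines the number of bins with the Freedman-Diaconis rule (matplotlib's own hist simply defaults to 10 bins). The rule works reasonably well, and a single dataset is generally plotted without problems. When histograms are overlaid, however, the rule is applied to each dataset separately, so each one ends up with its own bin width.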

distributions.py


def _freedman_diaconis_bins(a):
    """Calculate number of hist bins using Freedman-Diaconis rule."""
    # From https://stats.stackexchange.com/questions/798/
    a = np.asarray(a)
    if len(a) < 2:
        return 1
    h = 2 * iqr(a) / (len(a) ** (1 / 3))
    # fall back to sqrt(a) bins if iqr is 0
    if h == 0:
        return int(np.sqrt(a.size))
    else:
        return int(np.ceil((a.max() - a.min()) / h))

https://github.com/mwaskom/seaborn/blob/master/seaborn/distributions.py#L24
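To see why per-dataset defaults lead to mismatched bins, here is a small sketch that computes the Freedman-Diaconis bin width, 2 * IQR / n^(1/3), for two samples with different spreads. The helper fd_bin_width is hypothetical and written only for this illustration; it uses np.percentile for the interquartile range rather than seaborn's internal iqr helper.

```python
import numpy as np

def fd_bin_width(a):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3). Hypothetical helper."""
    a = np.asarray(a)
    q75, q25 = np.percentile(a, [75, 25])
    return 2 * (q75 - q25) / (len(a) ** (1 / 3))

# Two samples of the same size but with different spreads (seed is arbitrary)
rng = np.random.default_rng(0)
narrow = rng.normal(loc=20, scale=10, size=100)
wide = rng.normal(loc=15, scale=20, size=100)

# The wider sample has a larger IQR, so the rule assigns it a larger bin width
print(fd_bin_width(narrow), fd_bin_width(wide))
```

Because the width depends on each sample's own interquartile range, two overlaid histograms only share a bin width if the edges are supplied explicitly.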

Conclusion

A messy plot is discourteous to the person you show it to, so I think it is only polite to at least make it clean.
