Histograms are often overlaid to compare how data is distributed across labels, but depending on the data, the bin widths of the overlaid histograms can differ noticeably. BI tools such as Tableau do not have this problem, but matplotlib and seaborn do not align the bins for you, so you need to handle it yourself.
Use the bins argument.
bins : int or sequence or str, optional (matplotlib.pyplot.hist)
Since bins accepts not only an integer but also a sequence, all you have to do is build the bin edges with the range function, passing the minimum value, the maximum value, and the desired bin width as the step.
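As a minimal sketch of the three forms bins can take, the example below uses numpy.histogram, which accepts the same int, sequence, and str forms. The seed, the sample data, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
data = rng.normal(loc=20, scale=10, size=100)

# int: number of equal-width bins between data.min() and data.max()
_, edges_int = np.histogram(data, bins=10)

# sequence: explicit bin edges, so the width is fully under your control
_, edges_seq = np.histogram(data, bins=[-20, 0, 20, 40, 60])

# str: name of an estimator, here the Freedman-Diaconis rule
_, edges_str = np.histogram(data, bins="fd")

print(len(edges_int) - 1)  # number of bins for the int form
print(np.diff(edges_seq))  # bin widths for the sequence form
print(len(edges_str) - 1)  # number of bins chosen by the estimator
```

Only the sequence form fixes the edges independently of the data, which is why it is the one to use when several histograms must line up.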
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Prepare a DataFrame with two labels whose data are distributed differently
df_1st = pd.DataFrame(np.random.normal(loc=20, scale=10, size=100), columns=["val"])
df_1st["target"] = "class_1"
df_2nd = pd.DataFrame(np.random.normal(loc=15, scale=20, size=100), columns=["val"])
df_2nd["target"] = "class_2"
df = pd.concat([df_1st, df_2nd])
import matplotlib.pyplot as plt
import seaborn as sns
# Plot by target (bins not specified)
for val in df["target"].unique():
    ax = sns.distplot(df.query('target == @val')["val"], kde=False, label=f"target is {val}")
    ax.legend()
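# Minimum value, rounded down so that no data falls below the first edge
x_min = int(np.floor(df["val"].min()))
# Maximum value, rounded up
x_max = int(np.ceil(df["val"].max()))
# Bin edges every 5 units, extended so that the last edge covers the maximum
range_bin_width = range(x_min, x_max + 5, 5)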
# Plot by target (shared bin edges)
for val in df["target"].unique():
    ax = sns.distplot(df.query('target == @val')["val"], bins=range_bin_width, kde=False, label=f"target is {val}")
    ax.legend()
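An alternative to building the edges with range is to let NumPy compute one set of edges from the pooled data and reuse it for every label. The following is only a sketch under assumed values: the seed, the array names, and the choice of 20 bins are illustrative, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
class_1 = rng.normal(loc=20, scale=10, size=100)
class_2 = rng.normal(loc=15, scale=20, size=100)

# Compute one set of edges from the pooled data so every label shares them
combined = np.concatenate([class_1, class_2])
shared_edges = np.histogram_bin_edges(combined, bins=20)  # 20 is arbitrary

# Bin each label separately, but with the same edges
counts_1, edges_1 = np.histogram(class_1, bins=shared_edges)
counts_2, edges_2 = np.histogram(class_2, bins=shared_edges)

print(np.array_equal(edges_1, edges_2))  # both labels use identical edges
print(counts_1.sum(), counts_2.sum())    # every sample falls inside the edges
```

The shared_edges array can be passed as bins to plt.hist or to seaborn in the same way as range_bin_width above.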
If bins is not set, the number of bins is determined by a method called the **Freedman-Diaconis rule**.
This technique is reasonably good, and a single dataset is generally plotted without problems. When several datasets are overlaid, however, the rule is applied to each one separately, so the resulting bin widths can differ.
distributions.py
def _freedman_diaconis_bins(a):
    """Calculate number of hist bins using Freedman-Diaconis rule."""
    # From https://stats.stackexchange.com/questions/798/
    a = np.asarray(a)
    if len(a) < 2:
        return 1
    h = 2 * iqr(a) / (len(a) ** (1 / 3))
    # fall back to sqrt(a) bins if iqr is 0
    if h == 0:
        return int(np.sqrt(a.size))
    else:
        return int(np.ceil((a.max() - a.min()) / h))
https://github.com/mwaskom/seaborn/blob/master/seaborn/distributions.py#L24
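To see why this rule gives different widths per label, the sketch below computes the bin width h = 2 * IQR / n^(1/3) for two samples shaped like the ones in this article. The helper fd_bin_width is a hypothetical name, and np.percentile stands in for seaborn's iqr helper; the seed is also an assumption.

```python
import numpy as np


def fd_bin_width(a):
    """Bin width h = 2 * IQR / n^(1/3) (Freedman-Diaconis rule)."""
    a = np.asarray(a)
    # np.percentile is used here as a stand-in for seaborn's iqr helper
    q75, q25 = np.percentile(a, [75, 25])
    return 2 * (q75 - q25) / (len(a) ** (1 / 3))


rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
class_1 = rng.normal(loc=20, scale=10, size=100)
class_2 = rng.normal(loc=15, scale=20, size=100)

h1 = fd_bin_width(class_1)
h2 = fd_bin_width(class_2)
print(h1, h2)  # the more spread-out sample gets the wider bins
```

Because the width depends on each sample's own interquartile range, two labels with different spreads end up with different bins unless the edges are supplied explicitly.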
A messy plot is discourteous to the people you show it to, so I think it's polite to at least make it clean.