When you pass an array to matplotlib's hist() to draw a histogram, the data is divided into bins and drawn automatically, but you do not choose where the bin boundaries fall. Sometimes you want to know which data point ended up in which bin.
This time, we will create the bins ourselves and get the correspondence between the data and the bins.
First, create an array with a suitable distribution.
import numpy
n = 100
dist = numpy.random.normal(0, 1, n)
There seems to be some debate about how to choose the number of bins, but tools such as Microsoft Excel reportedly default to setting the number of bins k to the square root of the number of data points n.
k=\sqrt{n}
This time, we will use this rule to determine k and create an array of k evenly spaced bin edges covering the range of the data. Note that k edges delimit k - 1 intervals.
import math
bin_num = int(math.sqrt(n))  # numpy.linspace needs an integer number of points
bins = numpy.linspace(min(dist), max(dist), bin_num)  # bin_num evenly spaced edges
An array similar to the following was created.
[-2.28875045 -1.72785426 -1.16695807 -0.60606188 -0.0451657 0.51573049 1.07662668 1.63752287 2.19841906 2.75931524]
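As a small sketch of the relationship between edges and intervals: the array above has 10 edges, which delimit 9 intervals. If you want exactly k intervals, you need k + 1 edges. The names sample, edges_k, and edges_k_plus_1 and the seed are assumptions made for this illustration and do not appear in the original code.

```python
import math
import numpy

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = numpy.random.default_rng(0)
sample = rng.normal(0, 1, 100)

k = int(math.sqrt(len(sample)))  # square-root rule

# k edges delimit k - 1 intervals (this is what the article's code produces)
edges_k = numpy.linspace(sample.min(), sample.max(), k)

# k + 1 edges delimit exactly k intervals
edges_k_plus_1 = numpy.linspace(sample.min(), sample.max(), k + 1)

print(len(edges_k) - 1, "intervals from", len(edges_k), "edges")
print(len(edges_k_plus_1) - 1, "intervals from", len(edges_k_plus_1), "edges")
```

Either choice works with hist() and digitize(); what matters is knowing how many intervals you actually get.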
Let's see how the data is plotted using the bins we created.
import matplotlib.pyplot as plt
plt.hist(dist, bins=bins)
plt.show()
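If you want the numbers behind the plot, numpy.histogram() computes the count per interval from the same edges without drawing anything. The following is a minimal sketch; sample and edges are stand-ins for dist and bins, and the seed is arbitrary.

```python
import numpy

rng = numpy.random.default_rng(0)  # arbitrary seed for reproducibility
sample = rng.normal(0, 1, 100)
edges = numpy.linspace(sample.min(), sample.max(), 10)

# counts[i] is the number of values in the interval [edges[i], edges[i + 1]).
# The last interval is closed on the right, so it also includes the maximum.
counts, returned_edges = numpy.histogram(sample, bins=edges)

for left, right, count in zip(returned_edges[:-1], returned_edges[1:], counts):
    print(f"{left: .3f} to {right: .3f}: {count}")
```

plt.hist() also returns the counts and edges as its first two return values, so you can capture them there instead if you are plotting anyway.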
numpy.digitize() returns, for each data point, the index of the bin it falls into.
bin_indice = numpy.digitize(dist, bins)
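In the result below, dist[0] falls in the 4th bin and dist[1] in the 5th. The indices start at 1. The single 10 is the maximum value: it is equal to the last edge, so by default digitize() places it past the final interval.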
[ 4 5 8 6 4 6 8 1 6 6 8 2 6 3 5 4 5 4 5 3 8 2 5 5 4 4 4 4 2 3 5 6 5 3 4 3 7 6 4 3 4 4 8 2 4 4 8 6 6 3 6 2 9 5 5 4 4 1 8 6 5 5 5 5 4 1 10 3 1 8 7 3 4 3 8 2 6 5 6 3 6 7 5 3 3 5 5 5 4 1 3 6 5 6 7 3 4 7 8 4]
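To check that these indices agree with what the histogram shows, you can count them and compare with numpy.histogram(). This is a sketch under the assumption that the edges span exactly from the minimum to the maximum of the data, as in this article; the clipping step folds the maximum value back into the last interval, which is where histogram() counts it. The variable names are illustrative.

```python
import numpy

rng = numpy.random.default_rng(0)  # arbitrary seed for reproducibility
sample = rng.normal(0, 1, 100)
edges = numpy.linspace(sample.min(), sample.max(), 10)

# Default (right=False): index i means edges[i - 1] <= x < edges[i]
indices = numpy.digitize(sample, edges)

# The maximum equals edges[-1] and gets index len(edges); move it into the last interval
clipped = numpy.clip(indices, 1, len(edges) - 1)

# bincount counts occurrences of each index; drop slot 0, which is unused here
counts_from_digitize = numpy.bincount(clipped, minlength=len(edges))[1:]
counts_from_histogram, _ = numpy.histogram(sample, bins=edges)

print(counts_from_digitize)
print(counts_from_histogram)
```

Whether to clip is a design choice: keep the raw indices if you want to detect values on or beyond the last edge, and clip if you want them to match the plotted bars.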
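Finally, pair each value with its bin index using zip(). In Python 3, zip() returns an iterator, so wrap it in list() to see the pairs.
bin_data_map = list(zip(dist, bin_indice))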
[(-0.16840296791127732, 4), (0.43715458127052381, 5), (1.8635306330264274, 8), (0.89273121368100206, 6),...
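If you would rather look up all the values that belong to a given bin, one option is to group the pairs into a dictionary keyed by bin index. The following is a minimal sketch; values_by_bin and the other names are hypothetical and not part of the original code.

```python
from collections import defaultdict

import numpy

rng = numpy.random.default_rng(0)  # arbitrary seed for reproducibility
sample = rng.normal(0, 1, 100)
edges = numpy.linspace(sample.min(), sample.max(), 10)
indices = numpy.digitize(sample, edges)

# Map each bin index to the list of values that fell into it
values_by_bin = defaultdict(list)
for value, index in zip(sample, indices):
    values_by_bin[int(index)].append(float(value))

for index in sorted(values_by_bin):
    print(index, len(values_by_bin[index]))
```

With this structure, values_by_bin[4] gives every value assigned to the 4th bin.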