In mathematics and statistics, you may come across a table of classes, class values, frequencies, cumulative frequencies, relative frequencies, and cumulative relative frequencies. It looks like this.
class | class value | frequency | cumulative frequency | relative frequency | cumulative relative frequency |
---|---|---|---|---|---|
0 or more and less than 3 | 1.5 | 1 | 1 | 0.07143 | 0.0714 |
3 or more and less than 6 | 4.5 | 6 | 7 | 0.42857 | 0.5000 |
6 or more and less than 9 | 7.5 | 2 | 9 | 0.14286 | 0.6429 |
9 or more and less than 12 | 10.5 | 2 | 11 | 0.14286 | 0.7857 |
12 or more and less than 15 | 13.5 | 3 | 14 | 0.21429 | 1.0000 |
total | - | 14 | - | 1.00000 | - |
I wrote this because I couldn't find a Python function that produces this table in one shot. There is no function that creates the complete table, but the following convenient functions retrieve parts of the necessary information, and with a little extra calculation you can get all the values.
```python
# numpy: get the cumulative frequency with cumsum()
data.cumsum()

# pandas: count the frequency of each value with value_counts()
pd.Series(data).value_counts()
```
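As a quick check, the two helpers can be combined on the sample data used later in this article:

```python
import pandas as pd

data = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]

# value_counts() returns the frequency of each distinct value
counts = pd.Series(data).value_counts().sort_index()
print(counts.loc[5])  # the value 5 appears 4 times

# cumsum() turns those frequencies into cumulative frequencies
cumulative = counts.cumsum()
print(cumulative.iloc[-1])  # 14, the total sample size
```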
There are no hard rules for choosing the number of classes or the class width. However, Sturges' formula gives a rough guide, so I will use it.

**Sturges' formula**

A formula that gives a guideline for the number of classes when creating frequency distribution tables and histograms. With N the sample size and k the number of classes:

k = log₂N + 1

The class width is then obtained by dividing the range of the data (maximum minus minimum) by k.
```python
# Find the number of classes with Sturges' formula
class_size = 1 + np.log2(len(data))
class_size = int(round(class_size))

# Find the class width: the range of the data divided by the number of classes
class_width = (max(data) - min(data)) / class_size
class_width = round(class_width)
```
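Worked through for the 14-point sample used later in this article, Sturges' formula gives 5 classes and a class width of 3, which is exactly the layout of the table at the top:

```python
import numpy as np

data = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]

class_size = int(round(1 + np.log2(len(data))))            # 1 + log2(14) ≈ 4.81 -> 5
class_width = round((max(data) - min(data)) / class_size)  # 14 / 5 = 2.8 -> 3
print(class_size, class_width)  # 5 3
```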
However, since there are no hard rules, there are also times when you want to set the class width to a round number such as 5, so the function supports that as well.
To use the value given by Sturges' formula, pass None as the second argument of the function.
To use an arbitrary width, pass that value instead; the number of classes is then adjusted accordingly.
```python
def Frequency_Distribution(data, class_width):
    if class_width is None:
        # Find the class width: range of the data divided by the number of classes
        class_width = (max(data) - min(data)) / class_size
        class_width = round(class_width)  # rounding
    else:
        # Use the given width and recompute the number of classes from it
        class_size = max(data) // class_width
```
Each class has the form "X or more and less than Y". When creating the frequency distribution table I want to use the classes as the index, but typing them in by hand for each data set is tedious. Instead, the index labels can be generated with a list comprehension over the class width and the number of classes, using the string format operator.
```python
class_width = 5  # class width
class_size = 3   # number of classes

['%s or more and less than %s' % (w, w + class_width)
 for w in range(0, class_size * class_width * 2, class_width)]
# ['0 or more and less than 5', '5 or more and less than 10', '10 or more and less than 15',
#  '15 or more and less than 20', '20 or more and less than 25', '25 or more and less than 30']
```
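The same list-comprehension pattern also produces the class values (the midpoints of each class), which the table needs as its first column; here sketched with a width of 5 and 3 classes:

```python
class_width = 5
class_size = 3

labels = ['%s or more and less than %s' % (w, w + class_width)
          for w in range(0, class_size * class_width * 2, class_width)]
midpoints = [(w + (w + class_width)) / 2
             for w in range(0, class_size * class_width * 2, class_width)]
print(midpoints)  # [2.5, 7.5, 12.5, 17.5, 22.5, 27.5]
```

Note that the list is deliberately generated twice as long as needed (`class_size * class_width * 2`) and sliced to the right length later in the function.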
All that remains is to add the rows and columns and to set the column and index names with pandas.
```python
import pandas as pd
import numpy as np

# Make a frequency distribution table
def Frequency_Distribution(data, class_width):
    # Find the number of classes with Sturges' formula
    class_size = 1 + np.log2(len(data))
    class_size = int(round(class_size))
    if class_width is None:
        # Find the class width: range of the data divided by the number of classes
        class_width = (max(data) - min(data)) / class_size
        class_width = round(class_width)  # rounding
    else:
        # Use the given width and recompute the number of classes from it
        class_size = max(data) // class_width
    # print('Number of classes:', class_size)
    # print('Class width:', class_width)

    # Sort into classes: map each observation to its class number
    cut_data = []
    for row in data:
        cut = row // class_width
        cut_data.append(cut)

    # Count the frequency
    Frequency_data = pd.Series(cut_data).value_counts()
    Frequency_data = pd.DataFrame(Frequency_data)

    # Transpose once, since I want to sort by index and insert rows at arbitrary positions
    F_data = Frequency_data.sort_index().T

    # If a class has frequency 0, insert it into the data frame
    for i in range(0, max(F_data.columns)):
        if i not in F_data:
            F_data.insert(i, i, 0)
    F_data = F_data.T.sort_index()

    # Rename the index and columns
    F_data.index = ['%s or more and less than %s' % (w, w + class_width)
                    for w in range(0, class_size * class_width * 2, class_width)][:len(F_data)]
    F_data.columns = ['frequency']
    F_data.insert(0, 'Class value',
                  [(w + (w + class_width)) / 2
                   for w in range(0, class_size * class_width * 2, class_width)][:len(F_data)])
    F_data['Cumulative frequency'] = F_data['frequency'].cumsum()
    F_data['Relative frequency'] = F_data['frequency'] / sum(F_data['frequency'])
    F_data['Cumulative relative frequency'] = F_data['Cumulative frequency'] / max(F_data['Cumulative frequency'])
    F_data.loc['total'] = [None, sum(F_data['frequency']), None, sum(F_data['Relative frequency']), None]
    return F_data
```
```python
# Sample data
x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
Frequency_Distribution(x, None)
```
class | class value | frequency | cumulative frequency | relative frequency | cumulative relative frequency |
---|---|---|---|---|---|
0 or more and less than 3 | 1.5 | 1 | 1 | 0.07143 | 0.0714 |
3 or more and less than 6 | 4.5 | 6 | 7 | 0.42857 | 0.5000 |
6 or more and less than 9 | 7.5 | 2 | 9 | 0.14286 | 0.6429 |
9 or more and less than 12 | 10.5 | 2 | 11 | 0.14286 | 0.7857 |
12 or more and less than 15 | 13.5 | 3 | 14 | 0.21429 | 1.0000 |
total | - | 14 | - | 1.00000 | - |
The following code, suggested in a comment by @nkay, is recommended because it is written much more concisely.
```python
def Frequency_Distribution(data, class_width=None):
    data = np.asarray(data)
    if class_width is None:
        class_size = int(np.log2(data.size).round()) + 1
        class_width = round((data.max() - data.min()) / class_size)
    bins = np.arange(0, data.max() + class_width + 1, class_width)
    hist = np.histogram(data, bins)[0]
    cumsum = hist.cumsum()
    return pd.DataFrame({'Class value': (bins[1:] + bins[:-1]) / 2,
                         'frequency': hist,
                         'Cumulative frequency': cumsum,
                         'Relative frequency': hist / cumsum[-1],
                         'Cumulative relative frequency': cumsum / cumsum[-1]},
                        index=pd.Index([f'{bins[i]} or more and less than {bins[i+1]}'
                                        for i in range(hist.size)],
                                       name='class'))

x = [0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14]
Frequency_Distribution(x)
```
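The heavy lifting in this version is done by `np.histogram`, which bins the data and counts the frequencies in a single call; a minimal check on the same sample reproduces the frequency column of the table above:

```python
import numpy as np

data = np.asarray([0, 3, 3, 5, 5, 5, 5, 7, 7, 10, 11, 14, 14, 14])
bins = np.arange(0, 18, 3)          # bin edges 0, 3, 6, 9, 12, 15
hist = np.histogram(data, bins)[0]  # frequency per class
print(hist)           # [1 6 2 2 3]
print(hist.cumsum())  # cumulative frequencies: [ 1  7  9 11 14]
```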
In writing the code above, I mainly referred to the Data Scientist Statistical Glossary.