First, create two types of data with different distributions.
Data generation
import numpy as np
np.random.seed(seed=32)
groupA = np.random.normal(100, 20, 100000)  # randomly generate 100,000 values with mean 100 and standard deviation 20
groupB = np.random.normal(100, 50, 100000)  # randomly generate 100,000 values with mean 100 and standard deviation 50
print("groupA sample: {}".format(groupA[0:5]))
print("groupB sample: {}".format(groupB[0:5]))
groupA sample: [ 93.02211098 119.67406867 111.61845661 101.40568882 115.55065353]
groupB sample: [ 122.71129387 134.76068077 88.52182502 92.52887435 107.2470646 ]
First, look at the means: as shown below, both groups average around 100 (naturally, since 100 was specified as the mean).
mean
meanA, meanB = np.mean(groupA), np.mean(groupB)
print("group A average = {}, group B average = {}".format(meanA, meanB))
group A average = 100.0881959959255, group B average = 100.13663565328969
However, plotting histograms as follows shows that the distributions are clearly different: group B's peak is lower and its spread wider than group A's (as expected, since their standard deviations differ).
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(groupA, bins=100, label='groupA', kde=False)
sns.distplot(groupB, bins=100, label='groupB', kde=False)
plt.legend()
plt.show()
Thus, comparing only the means, group A and group B appear to be equivalent groups, and the information that group B's values vary more than group A's is lost. The summary statistics that express this variation as numbers are the variance and the standard deviation.
The variance can be calculated with the following formula
S^2 = \frac{1}{n}{\sum_{i=1}^n(x_i-\bar{x})^2}
In other words: square (each value − mean value), add the squares together, and divide by the number of data points.
Subtracting the mean from each value gives an index of how far that value deviates from the mean, but the result can be negative, so it is squared. Summing these squares and dividing by the number of data points then expresses the degree of variation in the data.
Incidentally, (each value − mean value) is called the deviation, and the value obtained by squaring each deviation and adding them all together is called the sum of squared deviations.
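As a quick sanity check, here is the same calculation worked through on a tiny made-up dataset of five values (chosen only for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
mean = x.mean()                       # 3.0
deviations = x - mean                 # [-2, -1, 0, 1, 2]
sq_dev_sum = (deviations ** 2).sum()  # 4 + 1 + 0 + 1 + 4 = 10
variance = sq_dev_sum / len(x)        # 10 / 5 = 2.0
print(variance)                       # 2.0
print(np.var(x))                      # 2.0, matches numpy's own calculation
```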
Writing this in Python looks like this.
deviation
groupA[0] - groupA.mean()
Sum of squared deviations
s = np.sum((groupA - groupA.mean()) ** 2)
print(s)
40178555.707617663
Variance
var = s / len(groupA)
print(var)
401.78555707618096
Now compute the variance of both groupA and groupB.
a = ((groupA - groupA.mean())**2).sum()/len(groupA)
b = ((groupB - groupB.mean())**2).sum()/len(groupB)
print("variance of groupA: {:.2f}\nvariance of groupB: {:.2f}".format(a, b))
variance of groupA: 401.79
variance of groupB: 2496.21
With the variance of each group in hand, we can see that group B's variance is larger than A's, meaning its values are less concentrated around the mean; in other words, they vary more.
We now know that B varies more, but a variance of 2496 is hard to interpret on its own for data whose values average around 100 in both groups. In such cases it is better to report the standard deviation.
S=\sqrt{S^2}
The standard deviation is the square root of the variance. Since the deviations were squared when computing the variance, taking the square root returns the value to the original scale of the data, which makes it easier to grasp how much the numbers vary.
standard deviation
import math
print("standard deviation of groupA: {:.2f}\nstandard deviation of groupB: {:.2f}".format(math.sqrt(a), math.sqrt(b)))
standard deviation of groupA: 20.04
standard deviation of groupB: 49.96
NumPy conveniently lets you compute the mean, variance, and standard deviation directly.
mean = groupA.mean()
var = groupA.var()
std = groupA.std()
print("mean: {:.2f} variance: {:.2f} standard deviation: {:.2f}".format(mean, var, std))
mean: 100.09 variance: 401.79 standard deviation: 20.04
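One caveat not covered above: np.var and the .var()/.std() methods divide by n by default (ddof=0), which matches the formula in this article (the population variance). For the unbiased sample variance, which divides by n − 1, pass ddof=1. A minimal sketch with made-up numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])  # mean 4, squared deviations sum to 4 + 0 + 4 = 8
print(x.var())                 # population variance: 8 / 3
print(x.var(ddof=1))           # sample variance:     8 / 2 = 4.0
```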