Statistics for Programmers-Table of Contents
I have the following data for A and B. In both cases the total is 15 and the average is 3, but the variability of the data in A and B is different.
A | B |
---|---|
1 | 3 |
2 | 3 |
3 | 3 |
4 | 3 |
5 | 3 |
Total: 15, Average: 3 (for both A and B)
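As a quick check, the totals and averages can be reproduced in a few lines of Python. This is only a minimal sketch; the variable names `a` and `b` are illustrative labels for the two columns and do not come from the original data.

```python
# Data from the table above; `a` and `b` are illustrative names.
a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]

for name, data in (("A", a), ("B", b)):
    total = sum(data)
    mean = total / len(data)
    print(name, "total:", total, "average:", mean)
# A total: 15 average: 3.0
# B total: 15 average: 3.0
```

Both columns give the same total and average, so neither number tells A and B apart.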
To check for such data variability, we use something called variance.
To understand the variance, it also helps to understand the deviation and the mean deviation, so I will explain those two before the variance.
The deviation is the difference between each data value and the mean. In the case of the above example, it would be:
A | Difference from average | B | Difference from average |
---|---|---|---|
1 | -2 | 3 | 0 |
2 | -1 | 3 | 0 |
3 | 0 | 3 | 0 |
4 | 1 | 3 | 0 |
5 | 2 | 3 | 0 |
total | 0 | - | 0 |
average | 0 | - | 0 |
The deviations always sum to 0, because the positive and negative differences cancel out. Their average is therefore also 0, so the plain deviation cannot be used to check how much the data varies.
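The cancellation can be seen with a small Python sketch. It assumes the convention "value minus mean" for the deviation (subtracting the other way flips the signs, but the sum is 0 either way). The function name `deviations` is hypothetical.

```python
a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]

def deviations(data):
    """Return each value minus the mean of the data."""
    mean = sum(data) / len(data)
    return [x - mean for x in data]

dev_a = deviations(a)
dev_b = deviations(b)
print(dev_a, sum(dev_a))  # [-2.0, -1.0, 0.0, 1.0, 2.0] 0.0
print(dev_b, sum(dev_b))  # [0.0, 0.0, 0.0, 0.0, 0.0] 0.0
```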
The mean deviation is the average of the absolute values of the differences between each data value and the mean. In the case of the above example, it would be:
A | Absolute difference from average | B | Absolute difference from average |
---|---|---|---|
1 | 2 | 3 | 0 |
2 | 1 | 3 | 0 |
3 | 0 | 3 | 0 |
4 | 1 | 3 | 0 |
5 | 2 | 3 | 0 |
total | 6 | - | 0 |
average | 1.2 | - | 0 |
Since it is the average of the absolute values of the differences, the mean deviation is always 0 or more, and it shows how much the data varies. However, it becomes troublesome as the number of data points grows, because every difference must be converted to its absolute value before calculation.
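A minimal Python sketch of the mean deviation, assuming the data is a plain list of numbers. The function name `mean_deviation` is hypothetical.

```python
def mean_deviation(data):
    """Average of the absolute differences between each value and the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

print(mean_deviation([1, 2, 3, 4, 5]))  # 1.2
print(mean_deviation([3, 3, 3, 3, 3]))  # 0.0
```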
The variance is the average of the squared differences between each data value and the mean.
V = variance
n = number of data points
x_i = the i-th data value
\bar{x} = mean
Then, the following holds.
V = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
I will actually calculate it.
V_A = \frac{1}{5} \{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\} = 2
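The same calculation can be sketched in Python. The function name `variance` is hypothetical. It divides by n, as in the formula above, which corresponds to `statistics.pvariance` in the standard library (`statistics.variance` divides by n - 1 instead).

```python
import statistics

def variance(data):
    """Average of the squared differences from the mean (divides by n)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]

print(variance(a))  # 2.0
print(variance(b))  # 0.0
# Cross-check against the standard library's population variance.
print(variance(a) == statistics.pvariance(a))  # True
```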
In a table
A | Difference from average | Squared difference from average |
---|---|---|
1 | -2 | 4 |
2 | -1 | 1 |
3 | 0 | 0 |
4 | 1 | 1 |
5 | 2 | 4 |
total | 0 | 10 |
average | 0 | Variance = 2 |
B | Difference from average | Squared difference from average |
---|---|---|
3 | 0 | 0 |
3 | 0 | 0 |
3 | 0 | 0 |
3 | 0 | 0 |
3 | 0 | 0 |
total | 0 | 0 |
average | 0 | Variance = 0 |
In this case, the variance of A is 2 and the variance of B is 0.
V_A = 2
V_B = 0
The smaller the variance, the closer each data value is to the mean and the less the data varies; the larger the variance, the more the data varies.
The variance in this example is 2 for A and 0 for B, so A has the larger variation.
The standard deviation, like the variance, is an indicator of data variability. It is the square root of the variance.
Because the variance is computed from squared differences, its unit is also squared. You can compare one variance with another, but you cannot directly compare a variance with the original data or with the mean.
For example, if the original data is measured in meters (m), the variance is in square meters (m^2), so it cannot be compared with the original data or the mean.
Taking the square root of the variance restores the original unit, which makes it possible to compare the result with the mean and to use the two together in calculations. The standard deviation can be calculated using the following formula.
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}
Let's actually calculate the standard deviation of the data in A.
Since the data for A is 1, 2, 3, 4, 5 and the mean value is 3, the standard deviation can be calculated as follows.
\sigma_A = \sqrt{\frac{1}{5} \{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\}} = \sqrt{2}
In a table
A | Difference from average | Squared difference from average |
---|---|---|
1 | -2 | 4 |
2 | -1 | 1 |
3 | 0 | 0 |
4 | 1 | 1 |
5 | 2 | 4 |
total | 0 | 10 |
average | 0 | Variance = 2 |
- | - | Standard deviation = √2 |
The result is √2, so the standard deviation of A is about 1.4.
For B, it is 0 without any need to calculate.
In other words
\sigma_A \simeq 1.4
\sigma_B = 0
Therefore, it can be seen that the data variation is larger in A.
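A minimal Python sketch of the standard deviation, following the formula above (dividing by n). The function name `std_dev` is hypothetical. `statistics.pstdev` is the standard library's population standard deviation and is used here only as a cross-check.

```python
import math
import statistics

def std_dev(data):
    """Square root of the variance (variance divides by n)."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / n)

a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]

print(round(std_dev(a), 2))  # 1.41
print(std_dev(b))            # 0.0
# Cross-check against the standard library.
print(math.isclose(std_dev(a), statistics.pstdev(a)))  # True
```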
The coefficient of variation is the standard deviation divided by the mean.
Suppose I visit 10 stores and check the prices of a 500 ml PET bottle of water and of a car (same model), to find out how much each price varies from store to store. The table below summarizes the means and standard deviations.
Product | Average price (yen) | Standard deviation (yen) |
---|---|---|
water | 89 | 14 |
car | 3,136,500 | 284,869 |
The car has an overwhelmingly larger standard deviation, which seems to say that car prices vary more. However, the unit prices of water and cars are so different that a larger standard deviation for the car is only natural; this does not compare how much each price varies relative to its own level.
Therefore, we use the coefficient of variation.
The coefficient of variation allows you to compare variability by relative value rather than absolute value. The coefficient of variation is calculated by dividing the standard deviation by the mean.
The formula is as follows.
CV = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}}{\bar{x}}
I will actually calculate it.
Product | Average price (yen) | Standard deviation (yen) |
---|---|---|
water | 89 | 14 |
car | 3,136,500 | 284,869 |
For each product, divide the standard deviation by the average price.
Coefficient of variation of water
14 \div 89 \approx 0.16
Coefficient of variation of the car
284,869 \div 3,136,500 \approx 0.09
The coefficient of variation of water is about 0.16, and that of the car is about 0.09.
Therefore, we can see that the price of water is relatively more variable.
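The division can be sketched in Python using the summary values from the table (standard deviation 14 and mean 89 for water; 284,869 and 3,136,500 for the car). The function name `coefficient_of_variation` is hypothetical, and the sketch assumes a nonzero mean.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean (mean must not be 0)."""
    return std_dev / mean

cv_water = coefficient_of_variation(14, 89)
cv_car = coefficient_of_variation(284_869, 3_136_500)

print(round(cv_water, 3))  # 0.157
print(round(cv_car, 3))    # 0.091
print(cv_water > cv_car)   # True
```

Because the result is a ratio, the yen units cancel out, which is what makes the two products comparable.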
That's all.