[Statistics for programmers] Variance, standard deviation and coefficient of variation


What is variance?

Consider the following two data sets, A and B. Both have a total of 15 and a mean of 3, but the values in A and B are spread out very differently.

| A | B |
|---|---|
| 1 | 3 |
| 2 | 3 |
| 3 | 3 |
| 4 | 3 |
| 5 | 3 |

- Total: 15
- Average: 3
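As a quick check of the totals and averages above, here is a minimal Python sketch. The names `a`, `b`, and `mean` are chosen for illustration only and are not part of the original article.

```python
# Data sets A and B from the table above.
a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]


def mean(values):
    """Arithmetic mean: the sum divided by the number of data points."""
    return sum(values) / len(values)


# Both data sets have the same total and the same mean.
print(sum(a), mean(a))  # 15 3.0
print(sum(b), mean(b))  # 15 3.0
```

The total and the mean cannot tell A and B apart, which is why a separate measure of variability is needed.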

To measure this kind of variability, we use the variance.

To understand the variance, you also need to understand the deviation and the mean deviation, so I will explain those two before moving on to the variance.

Deviation

The deviation is the difference obtained by subtracting the mean from each data value. For the example above:

| A | Difference from average | B | Difference from average |
|---|---|---|---|
| 1 | -2 | 3 | 0 |
| 2 | -1 | 3 | 0 |
| 3 | 0 | 3 | 0 |
| 4 | 1 | 3 | 0 |
| 5 | 2 | 3 | 0 |
| Total | 0 | - | 0 |
| Average | 0 | - | 0 |

The deviations always sum to 0, so their average is also 0. The deviation on its own therefore cannot be used to measure how much the data vary.
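The following sketch shows the deviations cancelling out. It assumes the convention "value minus mean"; subtracting the other way round flips every sign, but the total is 0 either way. The function name `deviations` is illustrative.

```python
a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]


def deviations(values):
    """Return each value minus the mean of the data set."""
    m = sum(values) / len(values)
    return [x - m for x in values]


dev_a = deviations(a)
dev_b = deviations(b)

# Positive and negative deviations cancel, so both sums are zero.
print(dev_a, sum(dev_a))  # [-2.0, -1.0, 0.0, 1.0, 2.0] 0.0
print(dev_b, sum(dev_b))  # [0.0, 0.0, 0.0, 0.0, 0.0] 0.0
```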

Mean deviation

The mean deviation is the average of the absolute values of the differences between each data value and the mean. For the example above:

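| A | Absolute difference from average | B | Absolute difference from average |
|---|---|---|---|
| 1 | 2 | 3 | 0 |
| 2 | 1 | 3 | 0 |
| 3 | 0 | 3 | 0 |
| 4 | 1 | 3 | 0 |
| 5 | 2 | 3 | 0 |
| Total | 6 | - | 0 |
| Average | 1.2 | - | 0 |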

Because it averages the absolute values of the differences, the mean deviation is always 0 or more, and it shows how much the data vary. However, it becomes cumbersome as the number of data points grows, because every difference must be converted to its absolute value before the calculation.
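The mean deviation for A and B can be reproduced with a short sketch like the one below. The function name `mean_deviation` is an illustrative choice.

```python
a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]


def mean_deviation(values):
    """Average of the absolute differences between each value and the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


print(mean_deviation(a))  # 1.2
print(mean_deviation(b))  # 0.0
```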

Variance

The variance is the average of the squared differences between each data value and the mean.

V = variance
n = number of data points
\bar{x} = mean

Then the variance is given by the following formula.

V = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

Let's calculate it for A.

V_A = \frac{1}{5} \{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\} = 2

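In table form:

| A | Difference from average | Squared difference from average |
|---|---|---|
| 1 | -2 | 4 |
| 2 | -1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
| Total | 0 | 10 |
| Average | 0 | Variance = 2 |

| B | Difference from average | Squared difference from average |
|---|---|---|
| 3 | 0 | 0 |
| 3 | 0 | 0 |
| 3 | 0 | 0 |
| 3 | 0 | 0 |
| 3 | 0 | 0 |
| Total | 0 | 0 |
| Average | 0 | Variance = 0 |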

In this case, the variance of A is 2 and B is 0.

V_A = 2
V_B = 0
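The same calculation can be sketched in Python. The hand-written `variance` function below is illustrative; it divides by n, as in the formula above, which corresponds to `statistics.pvariance` in the standard library (note that `statistics.variance` divides by n - 1 instead and would give a different result).

```python
import statistics

a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]


def variance(values):
    """Population variance: the mean of the squared deviations from the mean."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


print(variance(a))  # 2.0
print(variance(b))  # 0.0

# The standard library's population variance agrees with the hand-written version.
print(variance(a) == statistics.pvariance(a))  # True
```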

What you can see from the variance value

The smaller the variance, the closer the data values are to the mean and the less they vary; the larger the variance, the more they vary. In this example the variance is 2 for A and 0 for B, so A varies more.

Standard deviation

The standard deviation, like the variance, is an indicator of data variability. It is the square root of the variance.

Why standard deviation is needed

Because the variance is computed from squared differences, variances can be compared with one another, but a variance cannot be compared or combined with the mean.

For example, if the data are measured in meters, the unit is squared along with the values.

The unit of the original data is meters,

m

whereas the unit of the variance is square meters,

m^2

so the variance cannot be compared directly with the original data or the mean.
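One way to see the unit problem is to change the unit of the data and watch how the mean and the variance respond. The sketch below uses hypothetical lengths (not data from the article) and converts them from meters to centimeters.

```python
import statistics

# Hypothetical lengths, first in meters and then converted to centimeters.
lengths_m = [1.0, 2.0, 3.0, 4.0, 5.0]
lengths_cm = [x * 100 for x in lengths_m]

# The mean scales with the unit (x100), but the variance scales with the unit squared (x10000).
mean_ratio = statistics.mean(lengths_cm) / statistics.mean(lengths_m)
var_ratio = statistics.pvariance(lengths_cm) / statistics.pvariance(lengths_m)

print(mean_ratio)  # 100.0
print(var_ratio)   # 10000.0
```

The mean behaves like a length, while the variance behaves like an area, which is why the two cannot be compared directly.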

How to calculate standard deviation

Taking the square root of the variance restores the original unit, which makes it possible to compare and combine the result with the mean. The standard deviation is calculated with the following formula.

\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}

Let's actually calculate the standard deviation of the data in A. Since the data for A is 1, 2, 3, 4, 5 and the mean value is 3, the standard deviation can be calculated by the following formula.

\sigma_A = \sqrt{\frac{1}{5} \{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2\}} = \sqrt{2}

In table form:

| A | Difference from average | Squared difference from average |
|---|---|---|
| 1 | -2 | 4 |
| 2 | -1 | 1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
| 5 | 2 | 4 |
| Total | 0 | 10 |
| Average | 0 | Variance = 2 |
| - | - | Standard deviation = √2 |

The result is √2, so the standard deviation of A is about 1.4. For B it is 0 without any calculation, because every value equals the mean.

In other words

\sigma_A \simeq 1.4
\sigma_B = 0

Therefore, it can be seen that the data variation is larger in A.
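As a sketch, the standard deviation can be computed by taking the square root of the population variance. The function name `std_dev` is illustrative; `statistics.pstdev` is the standard library's equivalent for the divide-by-n definition used here.

```python
import math
import statistics

a = [1, 2, 3, 4, 5]
b = [3, 3, 3, 3, 3]


def std_dev(values):
    """Population standard deviation: the square root of the population variance."""
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((x - m) ** 2 for x in values) / n)


print(round(std_dev(a), 4))  # 1.4142
print(std_dev(b))            # 0.0

# Cross-check against the standard library.
print(math.isclose(std_dev(a), statistics.pstdev(a)))  # True
```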

Coefficient of variation

The coefficient of variation is the standard deviation divided by the mean.

Example

Suppose we visit 10 stores and check the prices of a 500 ml PET bottle of water and of a car (same model), to find out how much each price varies from store to store. The table below summarizes the means and standard deviations.

| Product | Average price (yen) | Standard deviation (yen) |
|---|---|---|
| Water | 89 | 14 |
| Car | 3,136,500 | 284,869 |

The car has an overwhelmingly larger standard deviation, which might suggest that car prices vary more. However, because the unit prices of water and cars are so different, it is only natural for the car's standard deviation to be larger; this comparison says nothing about the relative degree of price variation.

Therefore, we use the coefficient of variation.

Coefficient of variation formula

The coefficient of variation allows you to compare variability by relative value rather than absolute value. The coefficient of variation is calculated by dividing the standard deviation by the mean.

The formula is as follows.

CV = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}}{\bar{x}}

Let's calculate it.

| Product | Average price (yen) | Standard deviation (yen) |
|---|---|---|
| Water | 89 | 14 |
| Car | 3,136,500 | 284,869 |

For each product, divide the standard deviation by the average price.

Coefficient of variation of water

14 \div 89 \approx 0.16

Coefficient of variation of the car

284,869 \div 3,136,500 \approx 0.09

The coefficient of variation is about 0.16 for water and about 0.09 for the car.

Therefore, we can see that the price of water is relatively more variable.
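The comparison can be sketched in Python using the summary values from the table (standard deviation 14 and mean 89 for water). The function names are illustrative. Note that 14 / 89 is about 0.157, so the two-decimal value depends on whether you round or truncate.

```python
import statistics


def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean (a unitless ratio)."""
    return std_dev / mean


def cv_from_data(values):
    """Coefficient of variation computed directly from raw data (population std dev)."""
    return statistics.pstdev(values) / statistics.mean(values)


cv_water = coefficient_of_variation(14, 89)
cv_car = coefficient_of_variation(284_869, 3_136_500)

print(round(cv_water, 2))  # 0.16
print(round(cv_car, 2))    # 0.09
print(cv_water > cv_car)   # True: water prices vary more in relative terms

# For raw data, such as data set A from earlier in the article:
print(round(cv_from_data([1, 2, 3, 4, 5]), 3))  # 0.471
```

Because the units cancel in the division, the result can be compared across products with very different price levels.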

that's all

