Statistics for Programmers-Table of Contents
A value that summarizes a set of numerical data is called a representative value. There are three typical representative values: the average value (mean), the median, and the mode. Which one is appropriate depends on the shape of the data distribution.
The average value is the sum of all the data values divided by the number of data points.
\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}
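As a minimal sketch in Python, the formula above can be written directly with `sum` and `len`. The function name `mean` and the sample data are illustrative assumptions, not something taken from the original article.

```python
def mean(data):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)


scores = [1, 3, 4, 5, 7]  # illustrative data
average = mean(scores)
print(average)  # 4.0
```

In practice, the standard library's `statistics.mean` could be used for the same purpose.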
For a frequency distribution table, you can compute the average value from the "class values" and the "frequencies". If there are n classes, where the class value of the i-th class is v_i and its frequency is f_i, the average can be calculated with the following formula.
\bar{X} = \frac{f_1v_1 + f_2v_2 + \cdots + f_nv_n}{f_1 + f_2 + \cdots + f_n}
As an example, let's calculate the average value based on the frequency distribution table of the test scores of 10 students.
class | Class value | frequency |
---|---|---|
0 points or more and less than 25 points | 12.5 | 1 |
25 points or more and less than 50 points | 37.5 | 3 |
50 points or more and less than 75 points | 62.5 | 4 |
75 points or more | 87.5 | 2 |
The average score for this test is calculated as follows.
\bar{X} = \frac{(1\times12.5) + (3\times37.5) + (4\times62.5) + (2\times87.5)}{1+3+4+2} = \frac{550}{10} = 55
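The same calculation can be sketched in Python. The class values and frequencies come from the table above; the names `table` and `grouped_mean` are made up for this illustration.

```python
# (class value, frequency) pairs taken from the test score table
table = [(12.5, 1), (37.5, 3), (62.5, 4), (87.5, 2)]


def grouped_mean(rows):
    """Mean from a frequency table: sum(f_i * v_i) / sum(f_i)."""
    total_frequency = sum(frequency for _, frequency in rows)
    if total_frequency == 0:
        raise ValueError("the table contains no data")
    weighted_sum = sum(value * frequency for value, frequency in rows)
    return weighted_sum / total_frequency


average_score = grouped_mean(table)
print(average_score)  # 55.0
```

Because every score in a class is treated as if it were equal to the class value, the result is an approximation of the mean of the raw scores.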
As a slight aside, there are several ways of calculating an average, depending on the application. Please refer to the related article as well: There is more than one way to calculate the average value.
The median is the value in the middle when the data is arranged in ascending or descending order. If the number of data points is even, there are two middle values, and the median is their sum divided by two.
1, 3, 4, 5, 7
In this case, the median is 4.
1, 3, 4, 5, 7, 10
In this case, the two middle values are 4 and 5, so the median is 4.5, calculated by the following formula.
4.5 = \frac{4+5}{2}
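Both cases (odd and even number of data points) can be sketched in a few lines of Python. The function name `median` is chosen for this illustration; the two data sets are the ones used above.

```python
def median(data):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("median requires at least one data point")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle value
        return ordered[middle]
    # Even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([1, 3, 4, 5, 7]))      # 4
print(median([1, 3, 4, 5, 7, 10]))  # 4.5
```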
The mode is the value that occurs most frequently in the data.
1, 3, 4, 5, 7, 7, 10
For example, the mode in the above case is 7.
In the case of the frequency distribution table, the class value with the highest frequency is the mode.
In the frequency distribution table of the test scores used earlier, the highest frequency is 4, for the class "50 points or more and less than 75 points", so the mode is the class value 62.5.
class | Class value | frequency |
---|---|---|
0 points or more and less than 25 points | 12.5 | 1 |
25 points or more and less than 50 points | 37.5 | 3 |
50 points or more and less than 75 points | 62.5 | 4 |
75 points or more | 87.5 | 2 |
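As a rough sketch, the mode of a frequency table can be found in Python by picking the row with the highest frequency. The row layout and the names `rows` and `modal_class` are assumptions made for this example.

```python
# (class, class value, frequency) rows from the test score table
rows = [
    ("0 or more, less than 25", 12.5, 1),
    ("25 or more, less than 50", 37.5, 3),
    ("50 or more, less than 75", 62.5, 4),
    ("75 or more", 87.5, 2),
]

# Pick the row whose frequency (third field) is the largest.
# Note: max() returns only the first such row if several are tied.
modal_class = max(rows, key=lambda row: row[2])
mode_value = modal_class[1]
print(mode_value)  # 62.5
```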
Also, if 5 and 7 both occur the largest number of times, the mode is both 5 and 7, as shown below.
1, 3, 4, 5, 5, 7, 7, 10
Also, when every value occurs only once, as in the following data, the mode does not exist.
1, 3, 4, 5, 7, 10
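The three situations above (one mode, several modes, no mode) can be sketched with `collections.Counter`. The function name `modes` is illustrative, and returning an empty list when every value occurs only once is a choice made here to follow this article's convention.

```python
from collections import Counter


def modes(data):
    """Return all most frequent values, or [] if every value occurs only once."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # Convention from this article: no value repeats, so there is no mode
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 3, 4, 5, 7, 7, 10]))     # [7]
print(modes([1, 3, 4, 5, 5, 7, 7, 10]))  # [5, 7]
print(modes([1, 3, 4, 5, 7, 10]))        # []
```

Note that the standard library's `statistics.multimode` uses a different convention for the last case and returns every value.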
When a histogram has a single peak, the following relationships often hold. This is called Pearson's rule of thumb.
Of the following three cases, the symmetrical one always holds, but the other two are empirical rules and do not always hold.
If the distribution of the histogram is symmetrical as shown below, the mean, median, and mode are all the same at the position of the red line.
If the distribution is not symmetrical but has its peak on the left (with a long tail to the right), as shown below, the mode, median, and mean are often arranged in that order from smallest to largest. (The lines are drawn at approximate positions.)
If the distribution is not symmetrical but has its peak on the right (with a long tail to the left), as shown below, the mean, median, and mode are often arranged in that order from smallest to largest. (The lines are drawn at approximate positions.)
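To get a feel for these three cases, here is a small sketch using Python's `statistics` module. The three data sets are hand-made for illustration and are not from the original article; real data may not follow the rule.

```python
from statistics import mean, median, mode

# Hand-made illustrative data sets (assumptions, not measured data)
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 8, 12]  # peak on the left
left_tailed = [13 - x for x in right_tailed]    # mirror image, peak on the right


def summary(data):
    """Return (mode, median, mean) for one data set."""
    return mode(data), median(data), mean(data)


for name, data in [
    ("symmetric", symmetric),
    ("right-tailed", right_tailed),
    ("left-tailed", left_tailed),
]:
    print(name, summary(data))
```

For these data sets the symmetric case gives the same value three times, the right-tailed case gives mode < median < mean, and the left-tailed case gives mean < median < mode.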
Which of the mean, median, and mode should be used as the representative value depends on the distribution of the data. The advantages and disadvantages of each are summarized below.
Representative value | Advantage | Disadvantage |
---|---|---|
Average value | Reflects all of the data | Is pulled toward extreme values when they are present |
Median | Less susceptible to extreme values | Insensitive to changes in values other than the middle one |
Mode | Less susceptible to extreme values | Not very informative when the number of data points is small |
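The effect of an extreme value on the average value, compared with the median, can be seen in a small sketch like the following. The numbers are made up for illustration.

```python
from statistics import mean, median

values = [3, 4, 5, 6, 7]        # illustrative data without extreme values
with_outlier = values + [100]   # the same data plus one extreme value

# Without the extreme value, the mean and the median agree.
print(mean(values), median(values))

# With the extreme value, the mean jumps to about 20.8,
# while the median only moves from 5 to 5.5.
print(mean(with_outlier), median(with_outlier))
```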
Which one should be the representative value depends on how the data is distributed. As a basic guideline, if the difference between the average value and the median is small, I think it is better to use the average value as the representative value. If the difference between the two is large, I think it is safer to look at the median and the mode as well.
In the histogram examples above, every distribution had a single peak, but a distribution can also have multiple peaks. In such a case it is difficult to determine a representative value, and it may be necessary to reconsider how the data was collected in the first place.
That's all.
- There is more than one way to calculate the average value
- Statistics web - Average / Median / Mode
- How to find the mean, median, mode and some examples
- [Basic] How to use the average value, median value, and mode value properly?