The Tokyo gubernatorial election was held, and the early projection that Yuriko Koike had won came out very quickly. That made me wonder, "How many votes do you actually have to count?", so here is a rough estimate. The first thing to know is that **the number of votes you need to count before you can call the result with confidence depends on how close the race is**. If one candidate has an overwhelming share of the votes, the result can be called almost immediately; if 1st and 2nd place are close, you can imagine that you will not know until many votes have been counted. Here we consider the Tokyo gubernatorial election held in July 2020 and focus on Yuriko Koike, who came first, and Kenji Utsunomiya, who came second.
Now let's take a look at the results. I referred to the following page. https://www3.nhk.or.jp/news/html/20200705/k10012497581000.html It shows that Yuriko Koike, in 1st place, has about 60% of the vote and Kenji Utsunomiya, in 2nd place, has about 15%.
What does "**probably**" mean in the first place? Let's start from there. To pin it down, we need to think about the "error", or the "confidence interval". For example, if 100 votes are counted and Yuriko Koike has 60 of them, we can say that "Yuriko Koike has about 60%." However, it is dangerous to conclude from this alone that the share is exactly 60%. Once all the votes are counted, the true share may turn out to be a little different, say 62% or 59%. When we can say that the share lies between 59% and 61%, we write "60% with an error of ±1%". This notation is something you learn even in elementary school mathematics. With a statistical method, on the other hand, we would say "there is a 95% chance that Yuriko Koike's vote share is between 59% and 61%." Here the range 59-61% is called the confidence interval, and 95% is the confidence level.

This time I implemented this in Python. Given a confidence level of XX% and an observed vote share r (as a fraction, for example 0.6) after counting N votes, the upper and lower limits of the confidence interval can be obtained by the following method. I will set the detailed formulas aside for now, but they can be found in almost any statistics textbook, so those who are interested can deepen their understanding of statistics there. For example, the following sites may be helpful:
import math

def getR(r, N):
    """
    Return the lower and upper limits of the confidence interval for the
    vote share, in that order (lower limit, upper limit).
    r: vote share computed from the votes counted so far (a fraction, e.g. 0.6)
    N: number of votes counted
    The confidence level is chosen through k:
    k = 1.96 : 95% confidence interval
    k = 2.58 : 99% confidence interval
    k = 3.29 : 99.9% confidence interval
    """
    k = 3.29  # 99.9%
    # Lower and upper limits
    lower_limit = r - k * math.sqrt(r*(1-r)/N)
    upper_limit = r + k * math.sqrt(r*(1-r)/N)
    return lower_limit, upper_limit
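As a quick sanity check of the formula above, here is a minimal sketch that takes k as an argument instead of hard-coding it. The names `interval` and `votes_for_margin` are hypothetical and not part of the original code; the second one simply inverts the same formula to ask how many votes give a chosen margin, such as the ±1% mentioned earlier.

```python
import math

def interval(r, N, k=1.96):
    """Same formula as getR, but with k passed in (hypothetical variant)."""
    half_width = k * math.sqrt(r * (1 - r) / N)
    return r - half_width, r + half_width

def votes_for_margin(r, margin, k=1.96):
    """Smallest N whose half-width k*sqrt(r(1-r)/N) is at most `margin`."""
    return math.ceil(k ** 2 * r * (1 - r) / margin ** 2)

low, high = interval(0.6, 100)  # 60 of 100 votes, 95% level
needed = votes_for_margin(0.6, 0.01)
print(f"95% interval after 100 votes: {low:.3f} to {high:.3f}")
print("votes needed for +/-1% at 95%:", needed)
```

Under these assumptions, 100 votes at a 60% share give an interval of roughly 0.504 to 0.696 at the 95% level, and about 9,220 votes are needed before the margin shrinks to ±1%.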
Now that we have defined the function in Python, let's visualize it. Let's plot the approximate vote share and confidence interval for Yuriko Koike (Yuriko) and Kenji Utsunomiya (Kenji), with the number of votes counted on the horizontal axis. The vote share is held fixed at 0.6 for Yuriko Koike and 0.15 for Kenji Utsunomiya regardless of the number of votes counted. (In reality this value changes every time more votes are counted, but there is no way to know how ...) Well, it shouldn't be that far off.

A confidence interval is usually computed at the 95% level, but let's use 99.9% here. After all, we are talking about a **confirmed** win, so a 5% chance of getting it wrong is a little scary. You can easily change this percentage by changing the value of k in the function defined above. By the way, the k values are taken from the standard normal distribution table.
https://www.koka.ac.jp/morigiwa/sjs/standard_normal_distribution.htm
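Instead of reading k off a table, it can also be computed. The following is a small sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the function name `k_for_confidence` is made up for this illustration.

```python
from statistics import NormalDist

def k_for_confidence(confidence):
    """Two-sided critical value k with P(-k < Z < k) = confidence,
    where Z follows the standard normal distribution."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.95, 0.99, 0.999):
    print(f"{level:.1%}: k = {k_for_confidence(level):.2f}")
```

Rounded to two decimals, this reproduces the values 1.96, 2.58 and 3.29 listed in the docstring of getR.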
import numpy as np
import matplotlib.pyplot as plt
import math

# Approximate vote shares (held fixed for every N)
yuriko_rate = 0.6
kenji_rate = 0.15

yuriko_upper = []
yuriko_lower = []
kenji_upper = []
kenji_lower = []

# Number of votes counted: 100 to 900 in steps of 100
N_open = [i for i in range(100, 1000, 100)]
for n_open in N_open:
    yuriko_lower.append(getR(yuriko_rate, n_open)[0])
    yuriko_upper.append(getR(yuriko_rate, n_open)[1])
    kenji_lower.append(getR(kenji_rate, n_open)[0])
    kenji_upper.append(getR(kenji_rate, n_open)[1])

# Convert to arrays; the midpoint of each interval is the vote share itself
yuriko_upper = np.array(yuriko_upper)
yuriko_lower = np.array(yuriko_lower)
yuriko_mean = (yuriko_lower + yuriko_upper) / 2
kenji_upper = np.array(kenji_upper)
kenji_lower = np.array(kenji_lower)
kenji_mean = (kenji_lower + kenji_upper) / 2

# Yuriko: line for the vote share, shaded band for the confidence interval
plt.plot(N_open, yuriko_mean,
         color='blue', marker='o',
         markersize=5, label='Yuriko')
plt.fill_between(N_open,
                 yuriko_upper,
                 yuriko_lower,
                 alpha=0.15, color='blue')

# Kenji: same, drawn in green with a dashed line
plt.plot(N_open, kenji_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='Kenji')
plt.fill_between(N_open,
                 kenji_upper,
                 kenji_lower,
                 alpha=0.15, color='green')

plt.grid()
plt.xlabel('Number of votes')
plt.ylabel('Rates')
plt.legend(loc='upper right')
plt.ylim([0., 1.0])
plt.tight_layout()
plt.show()
The output result is as follows. Despite the strict 99.9% confidence level, Yuriko's lower limit is already above Kenji's upper limit when only 100 votes have been counted. As the number of votes on the horizontal axis increases, the estimate becomes more precise and the intervals narrow, but you can see that Yuriko's win is already confirmed with a small number of votes.
Let's see what happens if Yuriko and Kenji are a little closer. In the actual early projection Yuriko had 60% and Kenji 15%, an overwhelming win for Yuriko, but let's consider a somewhat closer race in which Yuriko has 40% and Kenji has 30%.
Looking at the graph, the confidence intervals of the two candidates still overlap at 900 votes, the largest count plotted; with this formula they only separate at a little under 1000 votes. This shows that even if Yuriko is ahead of Kenji after nearly 1000 votes have been counted, that is still not statistically sufficient to call the result.
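The two scenarios can also be compared without a plot by solving for the smallest N at which the leader's lower limit exceeds the runner-up's upper limit. The sketch below assumes the same fixed shares and k = 3.29 as the text; `min_votes_to_separate` is a hypothetical helper, and for small N the normal approximation behind the formula is itself only rough.

```python
import math

def min_votes_to_separate(r1, r2, k=3.29):
    """Smallest N with
        r1 - k*sqrt(r1(1-r1)/N) > r2 + k*sqrt(r2(1-r2)/N),
    assuming r1 > r2. Solving for N gives
        N > (k * (s1 + s2) / (r1 - r2))**2,
    where s = sqrt(r(1-r))."""
    s1 = math.sqrt(r1 * (1 - r1))
    s2 = math.sqrt(r2 * (1 - r2))
    threshold = (k * (s1 + s2) / (r1 - r2)) ** 2
    return math.floor(threshold) + 1

landslide = min_votes_to_separate(0.6, 0.15)  # shares from the actual result
close_race = min_votes_to_separate(0.4, 0.3)  # the closer hypothetical race
print("60% vs 15%:", landslide)
print("40% vs 30%:", close_race)
```

With these inputs the landslide case separates after 39 votes, while the 40% vs 30% case needs 974, which is consistent with the intervals still overlapping at 900 votes.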
In this Tokyo gubernatorial election, Yuriko Koike was projected as the winner at a fairly early stage, but given the gap in vote share between 1st and 2nd place, you can see that the number of votes needed for that is quite small. There are people out there who say, "The election results can't come out so quickly! It's an unfair election! How many people do you think there are in Tokyo!" I would like those people to try this calculation. A phrase I once saw on Twitter and found convincing is, "Do I have to drink the whole bowl to taste the miso soup?" You can tell whether miso soup is salty from a single sip, right? No matter how many voters there are, it is not necessary to count all the votes in order to issue a correct early projection.
The above calculation contains one important assumption: **the votes counted are not biased**. For example, even with the same 100 votes, you will not get the correct result if there is a bias such as "start counting from Kenji Utsunomiya's home district" or "count only the votes of people in their 20s". It is difficult to eliminate bias completely, but the votes should be sampled at random to reduce it as much as possible. Taking such bias into account, the number of votes actually needed for this early projection may be somewhat larger, but as the figure above shows, there is a significant difference even at 100 votes, so in any case there is probably no need to count that many votes. To explain bias with the miso soup example: after plopping the miso into the hot water, **if you do not stir it properly, the same single sip may be very salty or may have no taste at all**. You mix it as evenly as possible before tasting, right? Vote counting and early projections work the same way.
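The effect of bias can be illustrated with a toy simulation. Everything in the sketch below is made up for illustration: a hypothetical city with two districts of 50,000 ballots each, one where Yuriko gets 30% and one where she gets 90%, so that her overall share is exactly 60%.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical ballots: 1 = vote for Yuriko, 0 = vote for someone else.
district_a = [1] * 15_000 + [0] * 35_000  # 30% Yuriko (rival's home district)
district_b = [1] * 45_000 + [0] * 5_000   # 90% Yuriko
rng.shuffle(district_a)
rng.shuffle(district_b)
all_ballots = district_a + district_b     # overall share is exactly 60%

biased_sample = district_a[:1000]              # count one district first
random_sample = rng.sample(all_ballots, 1000)  # "stir well" before tasting

biased_share = sum(biased_sample) / len(biased_sample)
random_share = sum(random_sample) / len(random_sample)
print(f"biased count: {biased_share:.3f}")
print(f"random count: {random_share:.3f}")
```

Both counts use 1000 ballots, but only the random one should land near the true 60%; the count that starts from one district stays near that district's 30%, no matter how narrow the computed confidence interval looks.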
This part is a little more advanced, but I would like to touch on the derivation of the lower and upper limits of the confidence interval that appeared in getR() above:
lower_limit = r - k * math.sqrt(r*(1-r)/N)
upper_limit = r + k * math.sqrt(r*(1-r)/N)
First, convert the code into a formula. Let R be the true vote share (the share once all votes are counted). The confidence interval says that (lower limit) < R < (upper limit). In other words,
$$
r - k\sqrt{r(1-r)/N} < R < r + k\sqrt{r(1-r)/N}
$$
Rearranging this a little gives
$$
\begin{aligned}
-k\sqrt{r(1-r)/N} &< R - r < k\sqrt{r(1-r)/N} \\
-k &< \frac{R-r}{\sqrt{r(1-r)/N}} < k
\end{aligned}
$$
Consider the meaning of this formula. The condition "between -k and k" is where the meaning lies. As mentioned briefly above, k is taken from the **standard normal distribution table**. The standard normal distribution is the normal distribution with mean 0 and variance (= σ^2) 1, and k is the value for which x% of that distribution lies between -k and k on the horizontal axis. In other words, the formula above assumes that
$$
\frac{R-r}{\sqrt{r(1-r)/N}}
$$
follows a standard normal distribution. Now let's check whether this value really does. In general, if an observed value X follows a normal distribution with expected value μ and variance σ^2, the following quantity follows the standard normal distribution.
$$
\frac{X-\mu}{\sigma}
$$
By the way, the normal distribution is symmetric, so the result is the same even if the two terms in the numerator are swapped. With this in mind, in R - r the r is the vote share observed partway through the count (after only N votes) and R is the true vote share, so R - r corresponds directly to μ - X. The remaining part, √(r(1-r)/N), should then correspond to σ. Let's derive this as well.
Repeating a two-way outcome such as "is this counted vote for Yuriko Koike or not?" can be modeled with a binomial distribution. For a single vote (one trial), the mean and variance are
$$
\begin{aligned}
\mu &= r \\
\sigma^2 &= r(1-r)
\end{aligned}
$$
where r is the probability of success in one trial. Here it corresponds to "the probability that a counted vote is for Yuriko Koike". Next we need the spread of the vote share when N votes are counted. In general, the standard deviation of a mean over N observations (the standard error) is σ / √N. So in this case it is
$$
\sqrt{r(1-r)/N}
$$
which matches the formula above. For large N the mean of N votes is approximately normally distributed (the central limit theorem), so from the above
$$
\frac{R-r}{\sqrt{r(1-r)/N}}
$$
approximately follows a standard normal distribution, and by definition the confidence interval is obtained from -k < (R-r)/√(r(1-r)/N) < k.
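The claim that this quantity is roughly standard normal can be checked numerically. The following is a minimal simulation sketch with NumPy; the true share R = 0.6, the count size N = 1000 and the number of repetitions are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R, N, trials = 0.6, 1000, 20_000  # assumed true share, votes per count, repetitions

counts = rng.binomial(N, R, size=trials)  # Yuriko votes in each simulated count
r = counts / N                            # observed share in each count
z = (R - r) / np.sqrt(r * (1 - r) / N)    # the standardized quantity from the text

coverage = np.mean(np.abs(z) < 1.96)      # share of counts with -1.96 < z < 1.96
print("mean of z:", z.mean())
print("std of z:", z.std())
print("share with |z| < 1.96:", coverage)
```

If the derivation holds, the mean should come out near 0, the standard deviation near 1, and about 95% of the simulated counts should fall between -1.96 and 1.96.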