I tried to confirm whether the unbiased estimator of standard deviation is really unbiased by "throwing a coin 10,000 times"

Back when I was a student[^major] and first touched on statistics in my math courses, I was taught something like this.

When you draw a sample from a population and infer the population's characteristics from it (**sampling**):

  • the **sample mean** is, probabilistically, a good stand-in for the **population mean**, but
  • measures of **"spread," such as the standard deviation**, tend to come out stochastically **smaller** in the sample than in the population.

In short:

**The standard deviation of a sample is smaller than that of the population!**

That's what I was taught.

Immediately after that, I was taught the formula for the **unbiased variance**, which divides by $n - 1$ instead of the sample size $n$. But haven't you ever wanted to check it for yourself?

A proper algebraic proof will tell you that "the expected value of the unbiased variance equals the population variance," and sure, that settles it. But haven't you had the experience of **not quite being satisfied because you never get to see concrete data**[^abstraction]?

The trouble is that **there is usually no real data to check it against**, and given sampling bias, **computing it once or twice proves nothing** …….

Wait a minute. Isn't the box (or board) in front of you **exactly for this**? ~~Something feels off. But at least it shouldn't exist just so people can flame each other on **5ch**.~~ With a computer, **throwing a coin 10,000 times** is something you can do **before breakfast**.

So let's take this simple question, left unanswered since childhood, and **hit it with computing power**.

[^major]: My major was ghost studies.

[^abstraction]: Mathematics is a discipline that emphasizes the abstract (generalizations) over the concrete (examples), so it can't be helped.

Living in the virtual world

This time I'll run the simulation with **Python + JupyterLab**, but **it really doesn't matter what you use**. I just happen to **use Python for work** and have lately been posting a strange number of **Python articles to Qiita**; there's no deeper meaning than that. So **stop the pointless flame wars over programming languages right now**. Especially over natural languages[^English].

In[]


import math

import matplotlib.pyplot as plt
import numpy as np

This time, the population is assumed to follow a normal distribution with mean 50 and standard deviation 10. As an aside, this $\mathcal{N}(50, 10^2)$ is also the **definition of the deviation value** (hensachi). Most people only know the "transformed so that the mean becomes 50" part, but the **standard deviation 10** part is **just as important**: thanks to it, **"how outstanding a result is"** can be discussed on a unified scale. Incidentally, because round numbers like 50 are used, plenty of people hold the absurd misconception that **the lower bound is 0 and the upper bound is 100**[^deviation].
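As a side note on that definition, the deviation value of a raw score $x$ is just $50 + 10\,(x - \mu)/\sigma$. A minimal sketch (the function name `to_hensachi` is my own invention):

```python
import numpy as np

def to_hensachi(scores):
    """Deviation value: linearly rescale scores so mean -> 50 and SD -> 10."""
    scores = np.asarray(scores, dtype=float)
    return 50 + 10 * (scores - scores.mean()) / scores.std()

print(to_hensachi([40, 50, 60]))
```

Note that nothing clamps the result to [0, 100]: a score far enough above the mean happily yields a deviation value over 100.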

In[]


mu = 50.
sigma = 10.

The sample size will be **10**. It depends on the population, of course, but for a survey that is **quite small**. Even the "Trivia no Tane" polls back in the day reportedly used at least 2,000 people **hey, hasn't this article derailed enough already**? Will even such a small sample match the population in expectation? I can't wait to find out.

In[]


samples_num = 10

[^English]: I wonder why there is a certain contingent of people who diss the irregularities of English as if they were the devil incarnate. It's unpleasant.

[^deviation]: When I was a kid, I worried that mapping below-average scores to 0–50 and above-average scores to 50–100 would produce a badly distorted distribution. Or is that just me?

Mean

First, let's look at the **mean**, the statistic that is said to be "safe (probabilistically) to use as-is in a sample survey."

In[]


np.random.seed(555)

mean_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    mean_list.append(np.mean(samples))
mean = np.array(mean_list)

plt.hist(mean, bins=30, density=True, histtype='stepfilled')
plt.title(F'Mean: $\\mu = {np.mean(mean):.3f}$, $\\sigma = {np.std(mean):.3f}$')
plt.show()

mean.png

As a result, the mean over the 10,000 sample surveys was **50.022**. The population mean is **50**, so the estimate is **spot on**. Whether the standard deviation of 3.218 is large or small is hard to say in isolation, but presumably **more samples would mean less variability**.
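As a rough cross-check on that spread: in theory the standard error of the sample mean is $\sigma/\sqrt{n}$, so with $n = 10$ we'd expect about $10/\sqrt{10} \approx 3.16$. A quick sketch (the variable names are my own):

```python
import math

import numpy as np

np.random.seed(555)
mu, sigma, samples_num = 50., 10., 10

# Empirical standard deviation of 10,000 sample means.
means = np.array([np.mean(np.random.normal(mu, sigma, samples_num))
                  for _ in range(10000)])

theoretical_se = sigma / math.sqrt(samples_num)
print(np.std(means), theoretical_se)
```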

Standard deviation

Next is today's protagonist, ** Standard Deviation **.

In[]


np.random.seed(913)

std_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    std_list.append(np.std(samples))
std = np.array(std_list)

plt.hist(std, bins=30, density=True, histtype='stepfilled')
plt.title(F'Standard Deviation: $\\mu = {np.mean(std):.3f}$, $\\sigma = {np.std(std):.3f}$')
plt.show()

std.png

As a result, the mean over the 10,000 sample surveys was **9.235**. The population standard deviation is **10**, so it is indeed **underestimated**.

Bias of an estimator of standard deviation[^unbiased]

The **standard deviation** is the **(non-negative) square root** of the variance, and the **variance** is the **mean of each data point's squared deviation from the mean**: $s = \sqrt{s^2} = \sqrt{\frac{1}{n}\sum_{i = 1}^n(x_i - \overline{x})^2}$. When drawing a random sample, **values that deviate far from the mean, whether large or small, are unlikely to be drawn** (setting aside pathological cases such as the uniform distribution), so the spread tends to be underestimated.

As for how to fix this: according to books and websites that treat statistics more rigorously, the value obtained by **dividing by $n - 1$ instead of $n$** when averaging the squared deviations **matches the population variance in expectation**: $u^2 = \frac{1}{n - 1}\sum_{i = 1}^{n}(x_i - \overline{x})^2 = \frac{n}{n - 1}s^2$

$E(u^2) = \sigma^2$
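In NumPy this correction is exposed as the `ddof` ("delta degrees of freedom") argument: `np.var(x, ddof=1)` divides by $n - 1$. A minimal check of the $u^2 = \frac{n}{n - 1}s^2$ relationship:

```python
import numpy as np

x = np.array([2., 4., 4., 4., 5., 5., 7., 9.])
n = len(x)

s2 = np.var(x)          # biased sample variance: divides by n
u2 = np.var(x, ddof=1)  # unbiased variance: divides by n - 1

print(s2, u2, n / (n - 1) * s2)  # u2 equals n/(n-1) * s2
```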

Then does the **expected value of its square root** match the **population standard deviation**? **Not quite**: that apparently requires the following formula. $D = \sqrt{\frac{n - 1}{2}}\frac{\Gamma\left(\frac{n - 1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}u = \sqrt{\frac{n}{2}}\frac{\Gamma\left(\frac{n - 1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}s$

$E(D) = \sigma$

**What even is this formula**. For what it's worth, $\Gamma(n) = (n - 1)!$ for natural numbers $n$. **I still don't get it**.

Anyway, referring to this description, let's write a function that computes the coefficient of $s$.

In[]


def ustd_coefficient(n):
    try:
        # Coefficient of s from the gamma-function formula above.
        return math.sqrt(n / 2) * math.gamma((n - 1) / 2) / math.gamma(n / 2)
    except OverflowError:
        # math.gamma overflows for large n, so fall back to an approximation.
        return math.sqrt(n / (n - 1.5))
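For what it's worth, here is a quick look at how the exact coefficient behaves and how close the fallback approximation $\sqrt{n/(n - 1.5)}$ is (the helper names `exact`/`approx` are mine): for $n = 10$ the coefficient is about 1.084, and both expressions approach 1 as $n$ grows.

```python
import math

def exact(n):
    # Coefficient of s from the gamma-function formula above.
    return math.sqrt(n / 2) * math.gamma((n - 1) / 2) / math.gamma(n / 2)

def approx(n):
    # Fallback used when math.gamma overflows (roughly n > 343).
    return math.sqrt(n / (n - 1.5))

for n in (5, 10, 50, 100):
    print(n, exact(n), approx(n))
```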

In[]


np.random.seed(333)

D_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    D_list.append(np.std(samples) * ustd_coefficient(len(samples)))
D = np.array(D_list)

plt.hist(D, bins=30, density=True, histtype='stepfilled')
plt.title(F'Unbiased Standard Deviation: $\\mu = {np.mean(D):.3f}$, $\\sigma = {np.std(D):.3f}$')
plt.show()

unbiased_std.png

As a result, the mean over the 10,000 sample surveys was **10.024**. The population standard deviation is **10**, so this estimator does indeed seem to **estimate the population standard deviation without bias**.

[^unbiased]: This absurdly long-winded phrasing is due to the terminological confusion in this area. Is "sample variance (sample standard deviation)" the variance (standard deviation) of the sample itself, or is the unbiased estimator called the "sample variance (sample standard deviation)" because it is used in sample surveys? Is the "unbiased standard deviation" the unbiased estimator of the population standard deviation, or the square root of the unbiased estimator of the population variance? Ugh, what a mess.

Median

Everything from here on is a **bonus**.

People who have dabbled in statistics a little tend to get infected with ideas like **"The mean is garbage! Free yourself from mean addiction!"** [citation needed], and what such people invariably bring out is the **median**. While we're at it, let's test that too.

**In a normal distribution the mean and median coincide**, so the population median is also **50**.

Let me try to predict the outcome first. The **median** is the value that splits all the values **1:1**. So when drawing a sample, the probability of drawing a value **below** the population median and the probability of drawing one **above** it are, by definition, exactly **equal**. The value that splits the sample 1:1 should therefore still **converge to the population's**.

Let's check it.

In[]


np.random.seed(753)

median_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    median_list.append(np.median(samples))
median = np.array(median_list)

plt.hist(median, bins=30, density=True, histtype='stepfilled')
plt.title(F'Median: $\\mu = {np.mean(median):.3f}$, $\\sigma = {np.std(median):.3f}$')
plt.show()

median.png

As a result, the mean over the 10,000 sample surveys was **50.040**. The population median is **50**, so, as predicted, it can be **estimated correctly from the sample**.

Normalized interquartile range

Finally, let's look at the **normalized interquartile range**. As confirmed in the previous article, **the standard deviation and the normalized interquartile range coincide for a normal distribution**, so the population's **normalized interquartile range is also 10**.

Ah, NumPy doesn't have a function for the normalized interquartile range, so I'll whip one up.

In[]


def np_iqr(a, axis=None, out=None, overwrite_input=False, keepdims=False):
    q1 = np.quantile(a, q=0.25, axis=axis, overwrite_input=overwrite_input, keepdims=keepdims)
    q3 = np.quantile(a, q=0.75, axis=axis, overwrite_input=overwrite_input, keepdims=keepdims)
    return np.subtract(q3, q1, out=out)

def np_niqr(a, axis=None, out=None, overwrite_input=False, keepdims=False):
    # 1.3489... is the interquartile range of the standard normal distribution
    # (2 * Phi^-1(0.75)); dividing by it rescales the IQR to estimate sigma.
    return np.divide(np_iqr(a, axis=axis, overwrite_input=overwrite_input, keepdims=keepdims), 1.3489795003921634, out=out)

np.iqr = np_iqr
np.niqr = np_niqr

Let me try to predict the outcome again. The **normalized interquartile range** is (a rescaling of) the **difference between an upper and a lower quantile**. Consider the extreme quantiles, the **minimum and maximum**: the **sample minimum** will almost always be **larger than the population minimum**, and the **sample maximum** will almost always be **smaller than the population maximum**. In other words, **sample quantiles tend to be biased inward relative to the population's** (the median, being the middle quantile, was unaffected). We can therefore expect the **normalized interquartile range to be underestimated** relative to the population's.

Let's check it.

In[]


np.random.seed(315)

niqr_list = []
for _ in range(10000):
    samples = np.random.normal(mu, sigma, samples_num)
    niqr_list.append(np.niqr(samples))
niqr = np.array(niqr_list)

plt.hist(niqr, bins=30, density=True, histtype='stepfilled')
plt.title(F'Normalized Interquartile Range: $\\mu = {np.mean(niqr):.3f}$, $\\sigma = {np.std(niqr):.3f}$')
plt.show()

niqr.png

As a result, the mean over the 10,000 sample surveys was **8.640**. The population's normalized interquartile range is **10**, so, as predicted, it is indeed **underestimated**.

Afterword

So how would one construct an unbiased estimator for the normalized interquartile range?
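I don't know of a closed-form answer, but as a sketch of one simulation-based approach (entirely my own, not from any reference): estimate $E[\text{NIQR}]$ for a standard normal at the given $n$ by Monte Carlo, and use its reciprocal as a correction factor.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 10, 100000

# Normalized IQR of each standard-normal sample of size n.
samples = rng.standard_normal((trials, n))
q1, q3 = np.quantile(samples, [0.25, 0.75], axis=1)
niqr = (q3 - q1) / 1.3489795003921634

# E[NIQR] underestimates sigma = 1 at small n; invert that bias.
factor = 1.0 / niqr.mean()
print(factor)
```

Multiplying a sample's normalized interquartile range by this factor should, in expectation, recover the population value, at least for normal populations; with $n = 10$ the factor comes out around 1.15, consistent with the 8.640 observed above.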
