When studying statistics, you eventually run into a theorem with a rather imposing name: the central limit theorem. Wikipedia describes it as follows:

> By the law of large numbers, the mean of a sample drawn at random from a population approaches the true mean as the sample size grows. The central limit theorem, by contrast, concerns the error between the sample mean and the true mean: in many cases, whatever the distribution of the population, this error approximately follows a normal distribution as the sample size increases. (http://ja.wikipedia.org/wiki/中心極限定理)

So it is written, but that is hard to grasp on its own ^^; Whatever the shape of the original distribution, the sample means of samples drawn from it will be approximately normally distributed. The same turns out to hold for the sample variance. (More precisely, the sample variance follows a chi-square distribution, which for large N can itself be approximated by a normal distribution.) Verbal explanations and formal proofs (e.g. showing that the moment generating functions match) are not very intuitive, so the aim of this article is to understand the theorem by drawing graphs.

I will draw the graphs with Python. The preparation is as follows: importing the libraries and defining helper functions for plotting.
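Before the plots, here is a quick numerical sketch of my own (not from the original article) of what the theorem claims. For a uniform(0, 1) population, the CLT predicts that sample means of size-n samples cluster around the true mean 0.5 with standard deviation $\sigma/\sqrt{n} = \sqrt{1/12}/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 5000, 1000

# Draw many samples from a uniform(0, 1) population and take each sample's mean
means = rng.uniform(0, 1, size=(n_trials, n_samples)).mean(axis=1)

# CLT prediction: means cluster around 0.5
# with standard deviation sigma / sqrt(n) = sqrt(1/12) / sqrt(1000)
print(np.mean(means))  # close to 0.5
print(np.std(means))   # close to sqrt(1/12)/sqrt(1000), about 0.00913
```

The empirical spread of the means matches the $\sigma/\sqrt{n}$ prediction closely, even though we never assumed anything normal about the uniform population.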
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rd
import scipy.stats as st

# Sample parameters
n = 10000            # number of repetitions
sample_size = 10000  # size of each sample

# Function to calculate the mean and variance of one sample
def sample_to_mean_var(sample):
    mean = np.mean(sample)
    var = np.var(sample)
    return [mean, var]

# Function to draw histograms of the sample means and sample variances
def plot_mean_var(stats, dist_name=""):
    mu = stats[:, 0]
    var = stats[:, 1]
    bins = 40

    # Histogram of the sample mean
    plt.figure(figsize=(7, 5))
    plt.hist(mu, bins=bins, density=True, color="plum")
    plt.title("mu from %s distribution" % dist_name)
    plt.show()

    # Histogram of the sample variance
    plt.figure(figsize=(7, 5))
    plt.hist(var, bins=bins, density=True, color="lightblue")
    plt.title("var from %s distribution" % dist_name)
    plt.show()

# Function to draw a histogram of raw sample data
def plot_dist(data, bins, title=""):
    plt.figure(figsize=(7, 5))
    plt.title(title)
    plt.hist(data, bins, density=True, color="lightgreen")
    plt.show()
First, let's try the [exponential distribution](http://qiita.com/kenmatsu4/items/c1a64cf69bc8c9e07aa2#geometricp-sizenone). The following graph sets the exponential distribution's parameter $\lambda$ to 0.1 and generates 10,000 samples. It is a completely asymmetric distribution with a long tail to the right.
#Graph drawing of exponential distribution
lam = 0.1
x = rd.exponential(1./lam, size=sample_size)
plot_dist(x, 100, "exponential dist")
From one such set of 10,000 samples, we compute the sample mean and sample variance. Repeating this 10,000 times and drawing histograms of the resulting sample means and sample variances gives the following.
#Generate a lot of exponential distributions and draw a histogram of sample mean and sample variance
lam = 0.1
stats = np.array([sample_to_mean_var(rd.exponential(1./lam, size=sample_size)) for i in range(n)])
plot_mean_var(stats, dist_name="exponential")
Even though the original distribution was quite skewed, the sample mean and sample variance both form a beautifully symmetric bell shape. The central limit theorem says that these follow a normal distribution.
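As a sanity check of my own (not in the original article), we can compare the spread of those sample means with what the CLT predicts. For an exponential distribution with rate $\lambda = 0.1$, the population mean and standard deviation are both $1/\lambda = 10$, so the mean of 10,000 draws should have standard deviation $10/\sqrt{10000} = 0.1$:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n_trials, sample_size = 0.1, 2000, 10000

# Sample means of many exponential samples
mu = np.array([rng.exponential(1 / lam, sample_size).mean()
               for _ in range(n_trials)])

# Theory: mean 1/lam = 10, std (1/lam)/sqrt(sample_size) = 0.1
print(mu.mean())  # close to 10
print(mu.std())   # close to 0.1
```

The bell shape in the histogram is therefore not just qualitatively normal: its center and width agree with the CLT's quantitative prediction.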
Below, I will try some other skewed distributions.
Next is the [chi-square distribution](http://qiita.com/kenmatsu4/items/c1a64cf69bc8c9e07aa2#chisquaredf-sizenone). This one is also quite skewed.
#Chi-square distribution with 5 degrees of freedom
df = 5
x = rd.chisquare(df, sample_size)
plot_dist(x, 50, "chi square dist")
#Histogram of mean and variance of chi-square distribution
df = 5 #Degree of freedom
#Generate a lot of chi-square distributions
chi_stats = np.array([sample_to_mean_var(rd.chisquare(df, sample_size)) for i in range(n)])
plot_mean_var(chi_stats, dist_name="chi square")
Again, you can see that a symmetric bell-shaped histogram emerges.
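This time, let me check the sample variance histogram against theory (my own addition, not from the original article). A chi-square distribution with df degrees of freedom has population variance $2 \cdot df$, so with df = 5 the sample variances should cluster tightly around 10:

```python
import numpy as np

rng = np.random.default_rng(3)
df, n_trials, sample_size = 5, 2000, 10000

# Sample variances of many chi-square samples
var = np.array([rng.chisquare(df, sample_size).var()
                for _ in range(n_trials)])

# Theory: population variance of chi-square(df) is 2 * df = 10
print(var.mean())  # close to 10
```

So the bell for the variance histogram is centered exactly where the theory puts it.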
I will also try an oddly shaped distribution with two peaks.
# Bimodal normal distribution
def generate_bimodal_norm():
    x = np.random.normal(0, 4, sample_size)
    y = np.random.normal(25, 8, sample_size)
    return np.append(x, y)
z = generate_bimodal_norm()
plot_dist(z, 70, "bi-modal normal dist")
#Histogram of mean and variance of bimodal normal distribution
#Generate a lot of bimodal normal distributions
binorm_stats = np.array([sample_to_mean_var(generate_bimodal_norm()) for i in range(n)])
plot_mean_var(binorm_stats, dist_name="bi-modal normal")
Even with a distribution like this, the sample mean and sample variance come out normally distributed. The central limit theorem really is remarkable.
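We can go one step beyond eyeballing the histogram (this check is my own addition, not from the original article) and run a formal normality test on the sample means of the bimodal distribution with `scipy.stats.normaltest`:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
sample_size, n_trials = 10000, 2000

def generate_bimodal(size):
    # Mix N(0, 4) and N(25, 8) in equal halves, as in the article's example
    return np.append(rng.normal(0, 4, size), rng.normal(25, 8, size))

means = np.array([generate_bimodal(sample_size).mean() for _ in range(n_trials)])

# D'Agostino-Pearson normality test: a large p-value means the data
# are consistent with a normal distribution
stat, p = st.normaltest(means)
print(means.mean())  # close to the mixture mean (0 + 25) / 2 = 12.5
print(p)
```

With a large p-value the test finds no evidence against normality, even though the population itself has two well-separated peaks.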
So that is the central limit theorem: it looks daunting in formulas and proofs, but it can be understood intuitively by looking at graphs. This also seems to be the reason the normal distribution is so important in statistics :smile: