**Notes on this article** These are amateur scribbles. There are probably many mistakes and improper terms, and nothing is proved. Thank you for your understanding.
I want to estimate the average information (the differential entropy) of the original probability distribution from samples generated by a continuous probability distribution.

The average information $h(X)$ of a continuous probability distribution with probability density function $f$ is given by Equation 1. If the number of samples $N$ is large enough, it can (or at least should) also be obtained from the information content of each sample $x_i$, as in Equation 2.
\begin{align}
&h(X) = -\int_{\chi} f(x)\,\log f(x)\,dx \qquad\qquad \cdots \text{Equation 1}\\
&h(X) \approx \frac{1}{N}\sum_{i=1}^{N} -\log P(X = x_i) \qquad\quad\; \cdots \text{Equation 2}
\end{align}
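For later reference: when $f$ is a $d$-dimensional Gaussian with covariance matrix $\Sigma$, Equation 1 can be evaluated in closed form, giving the standard result

\begin{align}
h(X) = \frac{1}{2}\log\left((2\pi e)^d \det\Sigma\right)
\end{align}

For the two-dimensional standard normal used in the experiment below ($d = 2$, $\Sigma = I$), this gives $h(X) = \log 2\pi e \approx 2.84$, which will serve as the reference value for the estimate.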
To compute $h(X)$ from Equation 2, we need $P(X = x_i)$ for each sample $x_i$. From here, I will explain a little about how to find $P(X = x_i)$ and $h(X)$.
First, we define some quantities:

- Let $d(x_i,\,x_j)$ be the distance between samples $x_i$ and $x_j$ $^*$.
- For each sample $x_i$, let $n_i$ be the number of samples whose distance to $x_i$ is at most $r$ (including $x_i$ itself).
- Let $V(r)$ be the volume of the region within distance $r$ of a point $x$.

With these definitions, choosing an appropriate $r$ lets us approximate $P(X = x_i)$ by Equation 3.
P(X = x_i) \approx \frac{n_i}{N\,V(r)} \qquad \cdots \text{Equation 3}
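To get a feel for Equation 3, here is a minimal one-dimensional sketch (my own illustration, not from the original article; the point $x_0$ and radius $r$ are arbitrary choices): samples within distance $r$ of a point are counted and divided by $N\,V(r)$, and the result is compared to the true density of a standard normal at that point.

```python
import numpy as np

# Minimal 1-D sketch of the counting estimate in Equation 3.
# In one dimension the "ball" of radius r is an interval, so V(r) = 2r.
rng = np.random.default_rng(0)
samples = rng.normal(0, 1, 10000)

x0 = 0.5   # point at which to estimate the density (arbitrary)
r = 0.1    # neighborhood radius (arbitrary)
n = np.sum(np.abs(samples - x0) <= r)   # samples within distance r of x0
estimate = n / (len(samples) * 2 * r)   # Equation 3 with V(r) = 2r

true_density = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)
print(estimate, true_density)           # the two values should be close
```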
By substituting Equation 3 into Equation 2, Equation 4 is obtained.
\begin{align}
h(X, r) &\approx \frac{1}{N}\sum_{i=1}^{N} -\log\frac{n_i}{N\,V(r)}\\
&= \log V(r) + \log N - \frac{1}{N}\sum_{i=1}^{N}\log n_i \qquad \cdots \text{Equation 4}
\end{align}
For the approximation in Equation 3 to hold, $r$ should be as small as possible. However, as long as the number of samples is finite, an extremely small $r$ leaves too few samples in each neighborhood for the law of large numbers to apply, and the approximation of Equation 3 breaks down. Therefore, we need to think about how to choose $r$ appropriately while looking at the actual data.

$^*$ Any distance should be fine as long as it satisfies $\lim_{d(x_i,\,x_j) \to 0} P(X = x_i) = \lim_{d(x_i,\,x_j) \to 0} P(X = x_j)$.
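As a concrete illustration of this footnote (my own sketch, not part of the original method): the experiment below uses the Euclidean distance, but it could just as well use, say, the Chebyshev (max) distance, as long as $V(r)$ is replaced by the volume of the matching ball, which in two dimensions is a square of side $2r$.

```python
import numpy as np

# Hypothetical variant of the distance/volume pair using the Chebyshev
# (max) distance. The "ball" of radius r is then an axis-aligned square,
# so V(r) = (2r)^2 in two dimensions.
def calc_d_chebyshev(x):
    N = len(x)
    x_tiled = np.tile(x, (N, 1, 1))
    return np.max(np.abs(x_tiled - x_tiled.transpose((1, 0, 2))), axis=2)

def calc_v_chebyshev(r):
    return (2 * r) ** 2
```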
Let us generate samples from a suitable two-dimensional Gaussian distribution and compute $h(X)$.
```python
import numpy as np
from matplotlib import pyplot as plt

# Matrix of pairwise Euclidean distances between all samples
def calc_d(x):
    N = len(x)
    x_tiled = np.tile(x, (N, 1, 1))
    d = np.linalg.norm(x_tiled - x_tiled.transpose((1, 0, 2)), axis=2)
    return d

# Apply the area-of-a-circle formula because the number of dimensions is 2
def calc_v(r):
    v = np.pi * np.power(r, 2)
    return v

# Equation 4: h(X, r) = log V(r) + log N - (1/N) * sum_i log n_i
def calc_h(d, v, N, r):
    n = np.sum(d <= r, axis=0)  # n_i: samples within distance r of x_i (self included)
    h = np.log(v) + np.log(N) - np.sum(np.log(n)) / N
    return h

# Generate data from a suitable 2D Gaussian distribution
data = np.random.normal(0, 1, (1000, 2))

# Compute h(X, r) while varying r
r_list = [(i + 1) * 0.01 for i in range(10000)]  # the range of r was chosen somewhat arbitrarily
d = calc_d(data)
N = len(data)
h_list = [calc_h(d, calc_v(r), N, r) for r in r_list]

# Draw the graph
# Plot the computed values with a solid blue line
plt.figure(0)
plt.plot(r_list, h_list, color='blue', linestyle='solid')

# Plot the value computed from the sample covariance with a blue dotted line
Z = np.cov(data[:, 0], data[:, 1])
h_s = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Z))
plt.plot(r_list, [h_s for _ in range(len(r_list))], color='blue', linestyle='dotted')

# Plot the value computed from the population covariance with an orange dotted line
h_u = np.log(2 * np.pi * np.e)
plt.plot(r_list, [h_u for _ in range(len(r_list))], color='orange', linestyle='dotted')

plt.xlim([0, 3])
plt.ylim([0, 5])
plt.show()
```
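A practical note on the listing above: `calc_d` materializes an $N \times N \times 2$ intermediate array, so memory grows quadratically with the number of samples. If SciPy happens to be available (an assumption; it is not used in the original code), the same distance matrix can be computed without that intermediate array:

```python
from scipy.spatial.distance import cdist

# N x N matrix of pairwise Euclidean distances, equivalent to calc_d(data)
d = cdist(data, data)
```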
When the script above is executed, a graph like the following is obtained.
The horizontal axis is $r$ and the vertical axis is $h(X, r)$. As the theory suggests, the smaller $r$ is, the closer $h(X, r)$ gets to the true value, but if $r$ is too small, the estimate instead diverges toward negative infinity. Looking at the graph, it seems reasonable to choose $r$ where the slope of the curve is smallest. In fact, if the approximation of Equation 3 holds over some range of $r$, then $h(X, r)$ should not depend on $r$ there, so $\frac{\partial}{\partial r} h(X, r) = 0$ should hold; in that light, this way of choosing $r$ is not so strange, I think. But there is no proof, so don't trust it too much.
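If one wants to automate this rule of thumb, the flattest point of the curve can be located numerically. A minimal sketch (my own addition, restricted to the plotted range of $r$, with no guarantee beyond the heuristic above) using `np.gradient`:

```python
# Pick r where the slope of h(X, r) is smallest in absolute value,
# searching only over the plotted range 0 < r <= 3.
r_arr = np.array(r_list)
h_arr = np.array(h_list)
mask = r_arr <= 3.0
slopes = np.gradient(h_arr[mask], r_arr[mask])
best = np.argmin(np.abs(slopes))
print(r_arr[mask][best], h_arr[mask][best])  # chosen r and the estimate there
```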