I want to calculate the mutual information $I(X; Y)$ of continuous random variables $X$ and $Y$ in Python.
```python
import numpy as np

def mutual_information(X, Y, bins=10):
    # Joint probability density p(x, y)
    p_xy, xedges, yedges = np.histogram2d(X, Y, bins=bins, density=True)
    # Marginal densities p(x) and p(y), using the same bin edges
    p_x, _ = np.histogram(X, bins=xedges, density=True)
    p_y, _ = np.histogram(Y, bins=yedges, density=True)
    # Product of the marginals p(x)p(y) as a 2-D array
    p_x_y = p_x[:, np.newaxis] * p_y
    # Bin widths dx and dy
    dx = xedges[1] - xedges[0]
    dy = yedges[1] - yedges[0]
    # Integrand; np.ma.log masks the bins where the argument is zero
    elem = p_xy * np.ma.log(p_xy / p_x_y)
    # Return the mutual information together with p(x, y) and p(x)p(y)
    return np.sum(elem * dx * dy), p_xy, p_x_y
```
If you just want to compute the mutual information, the function above is enough. Below I note a few points that matter for the implementation.
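As a quick sanity check (my addition, not in the original post): for a bivariate normal with correlation $\rho$, the mutual information has the closed form $-\frac{1}{2}\log(1-\rho^2)$, so the histogram estimate can be compared against a known value. The snippet repeats the function so it runs standalone, and adds `np.ma.masked_invalid` as a small hardening against the `0/0` cells that empty edge bins can produce.

```python
import numpy as np

def mutual_information(X, Y, bins=10):
    # Same estimator as above; masked_invalid additionally guards the
    # 0/0 cells that empty edge bins can produce (a small hardening).
    p_xy, xedges, yedges = np.histogram2d(X, Y, bins=bins, density=True)
    p_x, _ = np.histogram(X, bins=xedges, density=True)
    p_y, _ = np.histogram(Y, bins=yedges, density=True)
    p_x_y = p_x[:, np.newaxis] * p_y
    dx = xedges[1] - xedges[0]
    dy = yedges[1] - yedges[0]
    with np.errstate(divide='ignore', invalid='ignore'):
        elem = np.ma.masked_invalid(p_xy * np.ma.log(p_xy / p_x_y))
    return np.sum(elem * dx * dy), p_xy, p_x_y

rng = np.random.default_rng(0)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]
X, Y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

mi, _, _ = mutual_information(X, Y, bins=30)
true_mi = -0.5 * np.log(1 - rho ** 2)  # closed form, ~0.511 nats
print(float(mi), true_mi)
```

With this many samples the estimate lands close to the analytic value; a histogram estimator like this is biased slightly upward for finite samples and downward by discretization, so exact agreement is not expected.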
## `density` of np.histogram2d

I had vaguely assumed that setting `density=True` would return probabilities, so I was briefly alarmed when `np.sum(p_xy)` did not come out to 1. The point to note is that `p_xy` is a **probability density**, not a probability.
Since $X$ and $Y$ are continuous variables, the histogram approximates the probability density; the sum only becomes 1 once the bin widths are taken into account.
`np.histogram` and `np.histogram2d` return the probability density together with the bin edges (`edges` in the code), and `dx` and `dy` have to be computed from those edges.
```python
import numpy as np

N = 1000
X = np.random.normal(loc=0, scale=1, size=N)
p_x, edges = np.histogram(X, bins=10, density=True)

# Summing the densities alone does not, of course, give 1.
print(np.sum(p_x))  # Output example: 1.580769264599771

# Summing with the bin width taken into account gives 1.
dx = edges[1] - edges[0]
print(np.sum(p_x * dx))  # Output example: 1.0000000000000002
```
## `p_x_y`
`p_x_y` in the code is meant to compute $p(x)p(y)$. I actually computed it with the following line first, and it did not work:

```python
p_x_y = p_x * p_y
```

The correct version is

```python
p_x_y = p_x[:, np.newaxis] * p_y
```

In the former, `p_x_y` is a one-dimensional array (an elementwise product); in the latter, broadcasting makes it a two-dimensional array holding the product for every pair of bins.
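To see the difference concretely, here is a tiny broadcasting sketch with made-up marginals (the numbers are illustrative only):

```python
import numpy as np

p_x = np.array([0.2, 0.5, 0.3])  # hypothetical marginal over 3 bins
p_y = np.array([0.4, 0.4, 0.2])  # hypothetical marginal over 3 bins

wrong = p_x * p_y                 # elementwise: shape (3,), only "diagonal" pairs
right = p_x[:, np.newaxis] * p_y  # broadcasting: shape (3, 3), every (i, j) pair

print(wrong.shape)  # (3,)
print(right.shape)  # (3, 3)
print(np.allclose(right, np.outer(p_x, p_y)))  # True
```

Reshaping `p_x` to a column via `np.newaxis` makes NumPy broadcast it against the row `p_y`, which is equivalent to `np.outer(p_x, p_y)` and matches the shape of `p_xy`.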
When the two variables are not independent, $p(x, y)$ and $p(x)p(y)$ differ, and the mutual information becomes large.
```python
import matplotlib.pyplot as plt

# Sine and cosine waves
t = np.linspace(-5, 5, num=1000)
X = np.sin(2 * np.pi * t)
Y = np.cos(3 * np.pi * t)

# Compute the mutual information
mi, p_xy, p_x_y = mutual_information(X, Y, bins=30)

# Plot the results
plt.figure(dpi=100)
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
ax1.set_title(r'$P_{XY}(x, y)$')
ax1.imshow(p_xy)
ax2.set_title(r'$P_{X}(x) P_{Y}(y)$')
ax2.imshow(p_x_y)
plt.suptitle('MI = {}'.format(mi))
plt.show()
```
When the two variables are independent, $p(x, y)$ and $p(x)p(y)$ match, and the mutual information becomes small.
```python
import matplotlib.pyplot as plt

# Two independent normal distributions
N = 10000
X = np.random.normal(size=N)
Y = np.random.normal(size=N)

# Compute the mutual information
mi, p_xy, p_x_y = mutual_information(X, Y, bins=30)

# Plot the results
plt.figure(dpi=100)
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
ax1.set_title(r'$P_{XY}(x, y)$')
ax1.imshow(p_xy)
ax2.set_title(r'$P_{X}(x) P_{Y}(y)$')
ax2.imshow(p_x_y)
plt.suptitle('MI = {}'.format(mi))
plt.show()
```