When the distribution obtained from a sample is multimodal, a single Gaussian distribution is not an appropriate model. A multimodal distribution can instead be modeled with a **Gaussian mixture distribution**, which combines multiple Gaussian distributions. This article shows an example of using the EM algorithm to determine the parameters of a **Gaussian mixture distribution**.
Let the sample be $x_n\ (n = 1, \dots, N)$. Maximum likelihood estimation for a single Gaussian distribution gives the mean and variance in closed form:
```math
\mu_{ML}=\frac{1}{N}\sum_{n=1}^N x_n \\
\sigma^2_{ML}=\frac{1}{N-1}\sum_{n=1}^N (x_n-\mu_{ML})^2
```
Strictly speaking, maximum likelihood gives the factor $1/N$ for the variance; the unbiased estimator with $1/(N-1)$ is used here instead (hence `ddof=1` in the code below).
```python
# -*- coding: utf-8 -*-
import math

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


def draw_hist(xs, bins):
    # Plot a normalized histogram of the sample
    plt.hist(xs, bins=bins, density=True, alpha=0.5)


def predict(data):
    # Estimate the mean and variance from the sample by maximum likelihood
    mu = np.mean(data)
    var = np.var(data, ddof=1)  # unbiased estimator (divides by N - 1)
    return mu, var


def main():
    # Generate N samples from a Gaussian with mean mu and standard deviation std
    mu = 3.0
    v = 2.0
    std = math.sqrt(v)
    N = 10000
    data = np.random.normal(mu, std, N)

    # Estimate the mean and variance by maximum likelihood
    mu_predicted, var_predicted = predict(data)
    std_predicted = math.sqrt(var_predicted)
    print("original: mu={0}, var={1}".format(mu, v))
    print(" predict: mu={0}, var={1}".format(mu_predicted, var_predicted))

    # Plot the histogram with the fitted Gaussian density on top
    draw_hist(data, bins=40)
    xs = np.linspace(min(data), max(data), 200)
    plt.plot(xs, norm.pdf(xs, mu_predicted, std_predicted), color="red")
    plt.xlim(min(xs), max(xs))
    plt.xlabel("x")
    plt.ylabel("Probability")
    plt.show()


if __name__ == '__main__':
    main()
```
The histogram (blue) represents the sample and the red line represents the Gaussian distribution modeled using the estimated values.
```
original: mu=3.0, var=2.0
 predict: mu=2.98719564872, var=2.00297779707
```
The estimated parameters give a Gaussian distribution that models the sample well.
Next, we use the Old Faithful geyser data, which can be downloaded online (the "Old Faithful" dataset). This time only the eruption duration (the first column) is used as the sample. From this sample, the following bimodal distribution was obtained.
Looking at the distribution, a single Gaussian clearly cannot model it. Let us instead model it with a Gaussian mixture distribution that combines two Gaussian distributions. This requires determining the means $\mu_1, \mu_2$, the variances $\sigma_1^2, \sigma_2^2$, and the mixing probability $\pi$. Writing the Gaussian density as $\phi(x|\mu, \sigma^2)$, the mixture distribution is given by the following equation.
```math
y=(1-\pi)\phi(x|\mu_1, \sigma^2_1)+\pi\phi(x|\mu_2, \sigma^2_2)
```
For a single Gaussian distribution, the mean and variance can be found by maximizing the likelihood function analytically (see PRML §2.3.4). For a Gaussian mixture, however, the likelihood cannot be maximized analytically, so the maximization is performed with the EM algorithm, an iterative optimization method that alternates between two steps, the E step and the M step.
See §8.5 of The Elements of Statistical Learning for the details of the algorithm; the English PDF can be downloaded for free.
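Concretely, for this two-component model the two steps implemented in the code below are as follows. The E step computes, for each sample $x_n$, the responsibility of the second component (called `burden_rate` in the code):

```math
\gamma_n = \frac{\pi\,\phi(x_n|\mu_2,\sigma^2_2)}{(1-\pi)\,\phi(x_n|\mu_1,\sigma^2_1)+\pi\,\phi(x_n|\mu_2,\sigma^2_2)}
```

The M step then re-estimates the parameters as responsibility-weighted averages:

```math
\mu_1=\frac{\sum_n (1-\gamma_n)x_n}{\sum_n (1-\gamma_n)},\quad
\sigma^2_1=\frac{\sum_n (1-\gamma_n)(x_n-\mu_1)^2}{\sum_n (1-\gamma_n)} \\
\mu_2=\frac{\sum_n \gamma_n x_n}{\sum_n \gamma_n},\quad
\sigma^2_2=\frac{\sum_n \gamma_n (x_n-\mu_2)^2}{\sum_n \gamma_n},\quad
\pi=\frac{1}{N}\sum_n \gamma_n
```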
```python
# -*- coding: utf-8 -*-
import math
import random

import numpy as np
import matplotlib.pyplot as plt


def gaussian(x, m, v):
    # Gaussian density with mean m and variance v (works on scalars and arrays)
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)


def e_step(xs, ms, vs, p):
    # E step: responsibility of the second component for each sample
    burden_rates = []
    for x in xs:
        d = (1 - p) * gaussian(x, ms[0], vs[0]) + p * gaussian(x, ms[1], vs[1])
        n = p * gaussian(x, ms[1], vs[1])
        burden_rates.append(n / d)
    return burden_rates


def m_step(xs, burden_rates):
    # M step: re-estimate means, variances, and the mixing probability
    d = sum(1 - r for r in burden_rates)
    n = sum((1 - r) * x for x, r in zip(xs, burden_rates))
    mu1 = n / d
    n = sum((1 - r) * (x - mu1) ** 2 for x, r in zip(xs, burden_rates))
    var1 = n / d

    d = sum(burden_rates)
    n = sum(r * x for x, r in zip(xs, burden_rates))
    mu2 = n / d
    n = sum(r * (x - mu2) ** 2 for x, r in zip(xs, burden_rates))
    var2 = n / d

    p = sum(burden_rates) / len(xs)
    return [mu1, mu2], [var1, var2], p


def calc_log_likelihood(xs, ms, vs, p):
    # Log-likelihood of the two-component mixture
    s = 0
    for x in xs:
        g1 = gaussian(x, ms[0], vs[0])
        g2 = gaussian(x, ms[1], vs[1])
        s += math.log((1 - p) * g1 + p * g2)
    return s


def draw_hist(xs, bins):
    # Plot a normalized histogram of the sample
    plt.hist(xs, bins=bins, density=True, alpha=0.5)


def main():
    # Read the first column (eruption duration) of the dataset
    with open("faithful.txt") as fp:
        data = [float(row.split()[0]) for row in fp]

    # Initial values: random samples for the means, the sample variance, p = 0.5
    p = 0.5
    ms = [random.choice(data), random.choice(data)]
    vs = [np.var(data), np.var(data)]

    T = 50   # number of iterations
    ls = []  # log-likelihood at each step

    # EM algorithm
    for t in range(T):
        burden_rates = e_step(data, ms, vs, p)
        ms, vs, p = m_step(data, burden_rates)
        ls.append(calc_log_likelihood(data, ms, vs, p))

    print("predict: mu1={0}, mu2={1}, v1={2}, v2={3}, p={4}".format(
        ms[0], ms[1], vs[0], vs[1], p))

    # Plot the fitted mixture density over the histogram
    plt.subplot(211)
    xs = np.linspace(min(data), max(data), 200)
    norm1 = gaussian(xs, ms[0], vs[0])
    norm2 = gaussian(xs, ms[1], vs[1])
    draw_hist(data, 20)
    plt.plot(xs, (1 - p) * norm1 + p * norm2, color="red", lw=3)
    plt.xlim(min(data), max(data))
    plt.xlabel("x")
    plt.ylabel("Probability")

    # Plot the log-likelihood over the EM iterations
    plt.subplot(212)
    plt.plot(np.arange(len(ls)), ls)
    plt.xlabel("step")
    plt.ylabel("log_likelihood")
    plt.show()


if __name__ == '__main__':
    main()
```
```
predict: mu1=2.01860781706, mu2=4.27334342119, v1=0.0555176191851, v2=0.191024193785, p=0.651595365985
```
The upper figure shows the sample from the dataset modeled with the Gaussian mixture distribution. The lower figure shows the log-likelihood increasing as the EM algorithm iterates.
As initial values, the means were set to randomly chosen samples, the variances to the sample variance, and the mixing probability to 0.5. There are many ways to choose initial values; apparently the choice of initialization is a research topic in its own right. One simple safeguard is to run EM from several random initializations and keep the run with the highest log-likelihood, as sketched below.
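Here is a minimal sketch of such random restarts, reusing `e_step`, `m_step`, and `calc_log_likelihood` from the script above (the restart loop itself is an illustration, not part of the original script):

```python
import random
import numpy as np

# Assumes data, e_step, m_step, calc_log_likelihood from the script above.
best = None
for trial in range(10):
    # Fresh random initialization for each trial
    p = 0.5
    ms = [random.choice(data), random.choice(data)]
    vs = [np.var(data), np.var(data)]
    for t in range(50):
        ms, vs, p = m_step(data, e_step(data, ms, vs, p))
    ll = calc_log_likelihood(data, ms, vs, p)
    # Keep the run with the highest log-likelihood
    if best is None or ll > best[0]:
        best = (ll, ms, vs, p)
ll, ms, vs, p = best  # parameters of the best run
```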
Since this sample has a bimodal distribution, we combined two Gaussian distributions. Of course, three or more Gaussian distributions can be combined, although the number of parameters to determine grows accordingly; the sketch below shows how the updates generalize to $K$ components.
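Below is a minimal sketch, written for this article rather than taken from the original script, of the same E and M steps for $K$ one-dimensional components, vectorized with numpy (the function name `em_gmm_1d` is my own):

```python
import numpy as np

def em_gmm_1d(xs, K, T=100, seed=0):
    # EM for a K-component 1-D Gaussian mixture
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    N = len(xs)
    # Initialization: random samples as means, overall variance, uniform weights
    mus = rng.choice(xs, size=K, replace=False)
    vars_ = np.full(K, xs.var())
    pis = np.full(K, 1.0 / K)
    for _ in range(T):
        # E step: responsibilities gamma[n, k], shape (N, K)
        dens = np.exp(-(xs[:, None] - mus) ** 2 / (2 * vars_)) \
            / np.sqrt(2 * np.pi * vars_)
        gamma = pis * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted means, variances, and mixing weights
        Nk = gamma.sum(axis=0)
        mus = (gamma * xs[:, None]).sum(axis=0) / Nk
        vars_ = (gamma * (xs[:, None] - mus) ** 2).sum(axis=0) / Nk
        pis = Nk / N
    return mus, vars_, pis
```

For example, `mus, vars_, pis = em_gmm_1d(data, K=3)` would fit a three-component mixture to the same sample.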
This time we used a one-dimensional sample, but the EM algorithm also applies to multivariate Gaussian mixtures. A multivariate Gaussian has significantly more parameters to determine than a one-dimensional one (a mean vector and a covariance matrix per component); in practice a library implementation can handle this, as in the sketch below.
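As one option, scikit-learn (not used in this article) provides a `GaussianMixture` class that runs the EM algorithm for multivariate data. A minimal sketch, assuming `faithful.txt` holds two whitespace-separated numeric columns (duration and waiting time):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumption: faithful.txt has two numeric columns (duration, waiting time)
X = np.loadtxt("faithful.txt")
gm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
print(gm.weights_)       # mixing probabilities
print(gm.means_)         # component mean vectors
print(gm.covariances_)   # component covariance matrices
```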