I implemented a Gaussian mixture model (mixed normal distribution) in Python. I used ["First Pattern Recognition" by Yuzo Hirai](https://www.amazon.co.jp/dp/4627849710) as a textbook.
By fitting a probabilistic model to the data distribution, we can determine probabilistically which cluster each data point belongs to. Since many probability models can only represent a unimodal distribution, the overall distribution has to be modeled as a weighted linear sum of several probability models. Let the number of clusters be $K$ and the probability model of the $k$-th cluster be $p_k(\boldsymbol x)$; the overall probability distribution is then expressed as follows.
p(\boldsymbol x) = \sum_{k = 1}^K \pi_kp_k(\boldsymbol x)
$\pi_k$ is the weight of the $k$-th probability model. A model that uses normal distributions as the component models is called a **Gaussian mixture model** (mixed normal distribution model).
The $d$-dimensional normal distribution representing the $k$-th cluster is expressed as follows.
N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac{1}{(2\pi)^{d/2} \ |\boldsymbol \Sigma_k|^{1/2}}\exp\bigl(-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x - \boldsymbol \mu_k)\bigr)
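As a sanity check of this formula, it can be evaluated directly with numpy. This is only a minimal sketch (the helper name `gaussian_density` is my own; the implementation later in this article handles just the one-dimensional case):

```python
import numpy

def gaussian_density(x, mu, Sigma):
    # d-dimensional normal density N(x | mu, Sigma)
    d = x.shape[0]
    diff = x - mu
    # normalization constant 1 / ((2 pi)^(d/2) * |Sigma|^(1/2))
    norm = 1.0 / numpy.sqrt((2 * numpy.pi) ** d * numpy.linalg.det(Sigma))
    # quadratic form (x - mu)^T Sigma^-1 (x - mu), computed via a linear solve
    quad = diff @ numpy.linalg.solve(Sigma, diff)
    return norm * numpy.exp(-0.5 * quad)

# standard 2-dimensional normal at the origin: 1 / (2 pi) ~= 0.159
print(gaussian_density(numpy.zeros(2), numpy.zeros(2), numpy.eye(2)))
```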
The overall distribution is expressed as a linear sum of these.
p(\boldsymbol x) = \sum_{k = 1}^K \pi_k N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k), \ 0 \leq \pi_k \leq 1, \ \sum_{k = 1}^K \pi_k = 1
$\pi_k$ is called the mixture ratio, and the parameters of the Gaussian mixture model are as follows.
\boldsymbol \pi = (\pi_1, ..., \pi_K), \ \boldsymbol \mu = (\boldsymbol \mu_1, ..., \boldsymbol \mu_K), \ \boldsymbol \Sigma = (\boldsymbol \Sigma_1, ..., \boldsymbol \Sigma_K)
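With the `gaussian_density` sketch above, evaluating the mixture density is just this weighted sum (again only an illustrative sketch with names of my own choosing; `pi` is a length-$K$ array of mixture ratios, `mus` and `Sigmas` hold the $K$ component parameters):

```python
def mixture_density(x, pi, mus, Sigmas):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(len(pi)))
```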
To represent the whole data set well, we need $\boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma$. However, since it is unknown which cluster each data point belongs to, these parameters cannot be obtained directly.
To estimate $\boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma$ from the whole data set, the Gaussian mixture model introduces a $K$-dimensional hidden variable $\boldsymbol z = (z_1, ..., z_K)^T$ that represents which cluster each data point $\boldsymbol x$ belongs to. $z_k$ is $1$ if the data point belongs to the $k$-th cluster and $0$ otherwise; for example, with $K = 3$, a point in the second cluster has $\boldsymbol z = (0, 1, 0)^T$.
\sum_{k = 1}^K z_k = 1, \ \boldsymbol z = (0, ..., 0, 1, 0, ..., 0)^T
The joint distribution $p(\boldsymbol x, \boldsymbol z)$ of the observed data $\boldsymbol x$ and the hidden variable $\boldsymbol z$ is as follows.
p(\boldsymbol x, \boldsymbol z) = p(\boldsymbol z)p(\boldsymbol x \ | \ \boldsymbol z)
The distribution of the hidden variable, $p(\boldsymbol z)$, and the conditional distribution of the observed data given the hidden variable, $p(\boldsymbol x \ | \ \boldsymbol z)$, can be expressed as follows.
\begin{align}
& p(\boldsymbol z) = \prod_{k = 1}^K \pi_k^{z_k} \\
& p(\boldsymbol x \ | \ \boldsymbol z) = \prod_{k = 1}^K \bigl[N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)\bigr]^{z_k}
\end{align}
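For example, for the value of $\boldsymbol z$ with $z_k = 1$ and all other components $0$, every factor except the $k$-th equals $1$, so these products reduce to

p(\boldsymbol z) = \pi_k, \quad p(\boldsymbol x \ | \ \boldsymbol z) = N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k), \quad p(\boldsymbol x, \boldsymbol z) = \pi_k N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)

which is exactly the $k$-th term of the sum derived next.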
Furthermore, summing the joint distribution $p(\boldsymbol x, \boldsymbol z)$ over all possible values of $\boldsymbol z$ recovers $p(\boldsymbol x)$.
\begin{align}
p(\boldsymbol x) &= \sum_{\substack{k = 1 \\ (z_k = 1, z_{j \neq k} = 0)}}^K p(\boldsymbol x, \boldsymbol z) \\
&= \sum_{\substack{k = 1 \\ (z_k = 1, z_{j \neq k} = 0)}}^K p(\boldsymbol z)p(\boldsymbol x \ | \ \boldsymbol z) \\
&= \sum_{\substack{k = 1 \\ (z_k = 1, z_{j \neq k} = 0)}}^K \prod_{j = 1}^K \pi_j^{z_j} \prod_{j = 1}^K \bigl[N(\boldsymbol x \ | \ \boldsymbol \mu_j, \boldsymbol \Sigma_j)\bigr]^{z_j} \\
&= \sum_{k = 1}^K \pi_kN(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)
\end{align}
Using the above distributions, the posterior probability $\gamma(z_k)$ of the hidden variable is expressed as follows.
\begin{align}
\gamma(z_k) &\equiv p(z_k = 1 \ | \ \boldsymbol x) \\
&\\
&= \frac{p(z_k = 1)p(\boldsymbol x \ | \ z_k = 1)}{\sum_{j = 1}^K p(z_j = 1)p(\boldsymbol x \ | \ z_j = 1)} \\
&\\
&= \frac{\pi_kN(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{j = 1}^K \pi_jN(\boldsymbol x \ | \ \boldsymbol \mu_j, \boldsymbol \Sigma_j)} \tag{*}
\end{align}
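Equation (*) is what the E step of the algorithm below computes for every data point. Here is a minimal sketch for the one-dimensional case (the function name is my own; `sigma` holds variances, matching the `gaussian` function in the implementation later on):

```python
import numpy

def responsibilities(X, pi, mu, sigma):
    # gamma[i, k] = pi_k * N(x_i | mu_k, sigma_k) / sum_j pi_j * N(x_i | mu_j, sigma_j)
    # X: (N,) data, pi / mu / sigma: (K,) mixture ratios, means and variances
    X = X.reshape(-1, 1)  # (N, 1) so broadcasting against (K,) arrays gives (N, K)
    density = numpy.exp(-0.5 * (X - mu) ** 2 / sigma) / numpy.sqrt(2 * numpy.pi * sigma)
    weighted = pi * density                            # pi_k * N(x_i | mu_k, sigma_k)
    return weighted / weighted.sum(axis = 1, keepdims = True)
```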
The observed data $\boldsymbol X$ and the hidden variables $\boldsymbol Z$ are represented as follows, where $C_k$ denotes cluster $k$.
\begin{align}
& \boldsymbol X = (\boldsymbol x_1, ..., \boldsymbol x_N), \ \boldsymbol x_i = (x_{i1}, ..., x_{id})^T \\
& \boldsymbol Z = (\boldsymbol z_1, ..., \boldsymbol z_N), \ \boldsymbol z_i = (z_{i1}, ..., z_{iK})^T \\
& z_{ik} = \begin{cases}
1 & (\boldsymbol x_i \in C_k) \\
0 & (\boldsymbol x_i \notin C_k)
\end{cases}
\end{align}
The pair of the observed data and the hidden variables is called the complete data.
\boldsymbol Y = (\boldsymbol x_1, ..., \boldsymbol x_N, \boldsymbol z_1, ..., \boldsymbol z_N) = (\boldsymbol X, \boldsymbol Z)
The parameters $\boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma$ of the Gaussian mixture model are the parameters that maximize the likelihood of the complete data, which can be expressed as follows.
\begin{align}
p(\boldsymbol Y \ | \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) &= p(\boldsymbol Z \ | \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) p(\boldsymbol X \ | \ \boldsymbol Z, \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) \\
&\\
&= \prod_{i = 1}^N \prod_{k = 1}^K \bigl[\pi_kN(\boldsymbol x_i \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k) \bigr]^{z_{ik}}
\end{align}
To find the maximum likelihood estimate, we take the logarithm and work with the log-likelihood.
\begin{align}
\tilde{L} &= \ln p(\boldsymbol Y \ | \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) \\
&\\
&= \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \ln N(\boldsymbol x_i \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k) \\
&\\
&= \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \bigl(-\frac{d}{2}\ln (2\pi) + \frac{1}{2} \ln \bigl|\boldsymbol \Sigma_k \bigr|^{-1} - \frac{1}{2}(\boldsymbol x_i - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x_i - \boldsymbol \mu_k) \bigr)
\end{align}
Next, we take the expectation of the log-likelihood with respect to the hidden variables.
\begin{align}
L &= E_z \bigl\{\tilde{L} \bigr\} \\
&\\
&= \sum_{i = 1}^N \sum_{k = 1}^K E_{z_{ik}} \bigl\{z_{ik} \bigr\} \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K E_{z_{ik}} \bigl\{z_{ik} \bigr\} \bigl(-\frac{d}{2}\ln (2\pi) + \frac{1}{2} \ln \bigl|\boldsymbol \Sigma_k \bigr|^{-1} - \frac{1}{2}(\boldsymbol x_i - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x_i - \boldsymbol \mu_k) \bigr)
\end{align}
From equation (*), the posterior probability of the hidden variable, this expectation is as follows.
\begin{align}
E_{z_{ik}} \bigl\{z_{ik} \bigr\} &= \sum_{z_{ik} = \{0, 1\}} z_{ik}p(z_{ik} \ | \ \boldsymbol x_i, \pi_k, \boldsymbol \mu_k, \boldsymbol \Sigma_k) \\
&\\
&= 1 \times p(z_{ik} = 1 \ | \ \boldsymbol x_i) \\
&\\
& = \frac{p(z_{ik} = 1)p(\boldsymbol x_i \ | \ z_{ik} = 1)}{\sum_{j = 1}^K p(z_{ij} = 1)p(\boldsymbol x_i \ | \ z_{ij} = 1)} \\
&\\
&= \frac{\pi_kN(\boldsymbol x_i \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{j = 1}^K \pi_jN(\boldsymbol x_i \ | \ \boldsymbol \mu_j, \boldsymbol \Sigma_j)} \\
&\\
&= \gamma(z_{ik})
\end{align}
The function obtained by replacing the expectations of the hidden variables in $L$ with the posterior probabilities $\gamma(z_{ik})$ is called the **Q function**.
Q = \sum_{i = 1}^N \sum_{k = 1}^K \gamma(z_{ik}) \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K \gamma(z_{ik}) \bigl(-\frac{d}{2}\ln (2\pi) + \frac{1}{2} \ln \bigl|\boldsymbol \Sigma_k \bigr|^{-1} - \frac{1}{2}(\boldsymbol x_i - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x_i - \boldsymbol \mu_k) \bigr)
The **EM algorithm** is used to find maximum likelihood estimates of the parameters of a probability model with hidden variables. It consists of two steps: the E step, which computes the posterior probabilities $\gamma(z_{ik})$ of the hidden variables from the current parameters, and the M step, which re-estimates $\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma$ so as to maximize the Q function.
The E and M steps are repeated until the log-likelihood of the complete data stops changing. Since the value it converges to depends on the initial values, only a local optimum is obtained; to get a good maximum likelihood estimate, the EM algorithm should be run several times with different initial values and the best result selected. The EM algorithm can be summarized as follows.

1. Initialize $\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma$.
2. E step: compute the posterior probabilities $\gamma(z_{ik})$ using the current parameters.
3. M step: update $\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma$ so as to maximize the Q function.
4. Evaluate the Q function (or the log-likelihood) and return to step 2 unless the change is below a threshold.
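Setting the derivatives of the Q function with respect to each parameter to zero (with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$) gives the standard update formulas applied in the M step. I state them here without derivation; the implementation below applies their one-dimensional version.

\begin{align}
& N_k = \sum_{i = 1}^N \gamma(z_{ik}) \\
& \boldsymbol \mu_k = \frac{1}{N_k} \sum_{i = 1}^N \gamma(z_{ik}) \boldsymbol x_i \\
& \boldsymbol \Sigma_k = \frac{1}{N_k} \sum_{i = 1}^N \gamma(z_{ik}) (\boldsymbol x_i - \boldsymbol \mu_k)(\boldsymbol x_i - \boldsymbol \mu_k)^T \\
& \pi_k = \frac{N_k}{N}
\end{align}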
I implemented it as follows. The implementation handles the one-dimensional case, so each cluster is described by a scalar mean and variance.
```python:mixturesofgaussian.py
import sys

import numpy
from matplotlib import pyplot


def create_data(N, K):
    # generate N samples from K one-dimensional normal distributions
    X, mu_star, sigma_star = [], [], []
    for i in range(K):
        loc = (numpy.random.rand() - 0.5) * 10.0    # range: -5.0 - 5.0
        scale = numpy.random.rand() * 3.0           # range: 0.0 - 3.0
        X = numpy.append(X, numpy.random.normal(loc = loc, scale = scale, size = N // K))
        mu_star = numpy.append(mu_star, loc)
        sigma_star = numpy.append(sigma_star, scale ** 2)   # store the true variance
    return (X, mu_star, sigma_star)


def gaussian(mu, sigma):
    # one-dimensional normal density; sigma is the variance
    def f(x):
        return numpy.exp(-0.5 * (x - mu) ** 2 / sigma) / numpy.sqrt(2 * numpy.pi * sigma)
    return f


def estimate_posterior_likelihood(X, pi, gf):
    # E step: gamma[i, k] = pi_k * N(x_i | mu_k, sigma_k) / sum_j pi_j * N(x_i | mu_j, sigma_j)
    l = numpy.zeros((X.size, pi.size))
    for (i, x) in enumerate(X):
        l[i, :] = gf(x)
    weighted = pi * l
    return weighted / weighted.sum(axis = 1).reshape(-1, 1)


def estimate_gmm_parameter(X, gamma):
    # M step: update mu, sigma (variance) and pi from the responsibilities
    N = gamma.sum(axis = 0)
    mu = (gamma * X.reshape((-1, 1))).sum(axis = 0) / N
    sigma = (gamma * (X.reshape(-1, 1) - mu) ** 2).sum(axis = 0) / N
    pi = N / X.size
    return (mu, sigma, pi)


def calc_Q(X, mu, sigma, pi, gamma):
    # Q function: expected complete-data log-likelihood
    Q = (gamma * (numpy.log(pi * (2 * numpy.pi * sigma) ** (-0.5)))).sum()
    for (i, x) in enumerate(X):
        Q += (gamma[i, :] * (-0.5 * (x - mu) ** 2 / sigma)).sum()
    return Q


if __name__ == '__main__':
    # data
    K = 2
    N = 1000 * K
    X, mu_star, sigma_star = create_data(N, K)

    # termination condition
    epsilon = 0.000001

    # initialize gmm parameter
    pi = numpy.random.rand(K)
    mu = numpy.random.randn(K)
    sigma = numpy.abs(numpy.random.randn(K))
    Q = -sys.float_info.max
    delta = None

    # EM algorithm
    while delta is None or delta >= epsilon:
        gf = gaussian(mu, sigma)

        # E step: estimate posterior probability of hidden variable gamma
        gamma = estimate_posterior_likelihood(X, pi, gf)

        # M step: maximize Q function by estimating mu, sigma and pi
        mu, sigma, pi = estimate_gmm_parameter(X, gamma)

        # calculate Q function
        Q_new = calc_Q(X, mu, sigma, pi, gamma)
        delta = Q_new - Q
        Q = Q_new

    # result
    print('mu*: %s, sigma*: %s' % (str(numpy.sort(numpy.around(mu_star, 3))), str(numpy.sort(numpy.around(sigma_star, 3)))))
    print('mu : %s, sigma : %s' % (str(numpy.sort(numpy.around(mu, 3))), str(numpy.sort(numpy.around(sigma, 3)))))

    # plot
    n, bins, _ = pyplot.hist(X, 50, density = True, alpha = 0.3)
    seq = numpy.arange(-15, 15, 0.02)
    for i in range(K):
        pyplot.plot(seq, gaussian(mu[i], sigma[i])(seq), linewidth = 2.0)
    pyplot.savefig('gmm_graph.png')
    pyplot.show()
```
When I ran the script, the estimated means and variances were close to the true values, so it seems that a reasonable estimate has been made (the fitted components are plotted over the data histogram in gmm_graph.png).
I was able to implement a Gaussian mixture model in Python. There are still some parts I do not fully understand, such as how the parameter updates in the M step are derived and the properties of the EM algorithm, so I will study them and add to this article in the future.