I implemented a Gaussian mixture model (mixed normal distribution) in Python. I used ["First Pattern Recognition" by Yuzo Hirai](https://www.amazon.co.jp/dp/4627849710) as a textbook.
By fitting a probabilistic model to the data distribution, we can determine probabilistically which cluster each data point belongs to. Since many probability models can only represent a unimodal distribution, the overall distribution has to be modeled as a weighted linear sum of several probability models. Let the number of clusters be $K$ and the probability model of the $k$-th cluster be $p_k(\boldsymbol x)$; the overall probability distribution is then expressed as follows.
p(\boldsymbol x) = \sum_{k = 1}^K \pi_kp_k(\boldsymbol x)
$\pi_k$ is the weight of the $k$-th probability model. A model that uses normal distributions as the component models is called a **Gaussian mixture model** (mixed normal distribution model).
The $d$-dimensional normal distribution representing the $k$-th cluster is expressed as follows.
N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac{1}{(2\pi)^{d/2} \ |\boldsymbol \Sigma_k|^{1/2}}\exp\bigl(-\frac{1}{2}(\boldsymbol x - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x - \boldsymbol \mu_k)\bigr)
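As a sanity check of this formula, it can be evaluated directly with numpy. This is only a minimal sketch (the helper name `gaussian_density` is my own; the implementation later in this article handles just the one-dimensional case):

```python
import numpy

def gaussian_density(x, mu, Sigma):
    # d-dimensional normal density N(x | mu, Sigma)
    d = x.shape[0]
    diff = x - mu
    # normalization constant 1 / ((2 pi)^(d/2) * |Sigma|^(1/2))
    norm = 1.0 / numpy.sqrt((2 * numpy.pi) ** d * numpy.linalg.det(Sigma))
    # quadratic form (x - mu)^T Sigma^-1 (x - mu), computed via a linear solve
    quad = diff @ numpy.linalg.solve(Sigma, diff)
    return norm * numpy.exp(-0.5 * quad)

# standard 2-dimensional normal at the origin: 1 / (2 pi) ~= 0.159
print(gaussian_density(numpy.zeros(2), numpy.zeros(2), numpy.eye(2)))
```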
The overall distribution is expressed as a linear sum of these.
p(\boldsymbol x) = \sum_{k = 1}^K \pi_k N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k), \ 0 \leq \pi_k \leq 1, \ \sum_{k = 1}^K \pi_k = 1
$\pi_k$ is called the mixture ratio, and the parameters of the Gaussian mixture model are as follows.
\boldsymbol \pi = (\pi_1, ..., \pi_K), \ \boldsymbol \mu = (\boldsymbol \mu_1, ..., \boldsymbol \mu_K), \ \boldsymbol \Sigma = (\boldsymbol \Sigma_1, ..., \boldsymbol \Sigma_K)
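With the `gaussian_density` sketch above, evaluating the mixture density is just this weighted sum (again only an illustrative sketch with names of my own choosing; `pi` is a length-$K$ array of mixture ratios, `mus` and `Sigmas` hold the $K$ component parameters):

```python
def mixture_density(x, pi, mus, Sigmas):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(len(pi)))
```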
To represent the whole data set well, we need $\boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma$. However, since it is unknown which cluster each data point belongs to, these parameters cannot be obtained directly.
To estimate $\boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma$ from the whole data set, the Gaussian mixture model introduces a $K$-dimensional hidden variable $\boldsymbol z = (z_1, ..., z_K)^T$ that represents which cluster each data point $\boldsymbol x$ belongs to. $z_k$ is $1$ if the data point belongs to the $k$-th cluster and $0$ otherwise; for example, with $K = 3$, a point in the second cluster has $\boldsymbol z = (0, 1, 0)^T$.
\sum_{k = 1}^K z_k = 1, \ \boldsymbol z = (0, ..., 0, 1, 0, ..., 0)^T
The joint distribution $p(\boldsymbol x, \boldsymbol z)$ of the observed data $\boldsymbol x$ and the hidden variable $\boldsymbol z$ is as follows.
p(\boldsymbol x, \boldsymbol z) = p(\boldsymbol z)p(\boldsymbol x \ | \ \boldsymbol z)
The distribution of the hidden variable, $p(\boldsymbol z)$, and the conditional distribution of the observed data given the hidden variable, $p(\boldsymbol x \ | \ \boldsymbol z)$, can be expressed as follows.
\begin{align}
& p(\boldsymbol z) = \prod_{k = 1}^K \pi_k^{z_k} \\
& p(\boldsymbol x \ | \ \boldsymbol z) = \prod_{k = 1}^K \bigl[N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)\bigr]^{z_k}
\end{align}
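For example, for the value of $\boldsymbol z$ with $z_k = 1$ and all other components $0$, every factor except the $k$-th equals $1$, so these products reduce to

p(\boldsymbol z) = \pi_k, \quad p(\boldsymbol x \ | \ \boldsymbol z) = N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k), \quad p(\boldsymbol x, \boldsymbol z) = \pi_k N(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)

which is exactly the $k$-th term of the sum derived next.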
Furthermore, summing the joint distribution $p(\boldsymbol x, \boldsymbol z)$ over all possible values of $\boldsymbol z$ recovers $p(\boldsymbol x)$.
\begin{align}
p(\boldsymbol x) &= \sum_{\substack{k = 1 \\ (z_k = 1, z_{j \neq k} = 0)}}^K p(\boldsymbol x, \boldsymbol z) \\
&= \sum_{\substack{k = 1 \\ (z_k = 1, z_{j \neq k} = 0)}}^K p(\boldsymbol z)p(\boldsymbol x \ | \ \boldsymbol z) \\
&= \sum_{\substack{k = 1 \\ (z_k = 1, z_{j \neq k} = 0)}}^K \prod_{j = 1}^K \pi_j^{z_j} \prod_{j = 1}^K \bigl[N(\boldsymbol x \ | \ \boldsymbol \mu_j, \boldsymbol \Sigma_j)\bigr]^{z_j} \\
&= \sum_{k = 1}^K \pi_kN(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)
\end{align}
Using the above distributions, the posterior probability $\gamma(z_k)$ of the hidden variable is expressed as follows.
\begin{align}
\gamma(z_k) &\equiv p(z_k = 1 \ | \ \boldsymbol x) \\
&\\
&= \frac{p(z_k = 1)p(\boldsymbol x \ | \ z_k = 1)}{\sum_{j = 1}^K p(z_j = 1)p(\boldsymbol x \ | \ z_j = 1)} \\
&\\
&= \frac{\pi_kN(\boldsymbol x \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{j = 1}^K \pi_jN(\boldsymbol x \ | \ \boldsymbol \mu_j, \boldsymbol \Sigma_j)} \tag{*}
\end{align}
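Equation (*) is what the E step of the algorithm below computes for every data point. Here is a minimal sketch for the one-dimensional case (the function name is my own; `sigma` holds variances, matching the `gaussian` function in the implementation later on):

```python
import numpy

def responsibilities(X, pi, mu, sigma):
    # gamma[i, k] = pi_k * N(x_i | mu_k, sigma_k) / sum_j pi_j * N(x_i | mu_j, sigma_j)
    # X: (N,) data, pi / mu / sigma: (K,) mixture ratios, means and variances
    X = X.reshape(-1, 1)  # (N, 1) so broadcasting against (K,) arrays gives (N, K)
    density = numpy.exp(-0.5 * (X - mu) ** 2 / sigma) / numpy.sqrt(2 * numpy.pi * sigma)
    weighted = pi * density                            # pi_k * N(x_i | mu_k, sigma_k)
    return weighted / weighted.sum(axis = 1, keepdims = True)
```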
The observed data $\boldsymbol X$ and the hidden variables $\boldsymbol Z$ are represented as follows, where $C_k$ denotes cluster $k$.
\begin{align}
& \boldsymbol X = (\boldsymbol x_1, ..., \boldsymbol x_N), \ \boldsymbol x_i = (x_{i1}, ..., x_{id})^T \\
& \boldsymbol Z = (\boldsymbol z_1, ..., \boldsymbol z_N), \ \boldsymbol z_i = (z_{i1}, ..., z_{iK})^T \\
& z_{ik} = \begin{cases}
1 & (\boldsymbol x_i \in C_k) \\
0 & (\boldsymbol x_i \notin C_k)
\end{cases}
\end{align}
The pair of the observed data and the hidden variables is called the complete data.
\boldsymbol Y = (\boldsymbol x_1, ..., \boldsymbol x_N, \boldsymbol z_1, ..., \boldsymbol z_N) = (\boldsymbol X, \boldsymbol Z)
The parameters $\boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma$ of the Gaussian mixture model are the parameters that maximize the likelihood of the complete data, which can be expressed as follows.
\begin{align}
p(\boldsymbol Y \ | \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) &= p(\boldsymbol Z \ | \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) p(\boldsymbol X \ | \ \boldsymbol Z, \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) \\
&\\
&= \prod_{i = 1}^N \prod_{k = 1}^K \bigl[\pi_kN(\boldsymbol x_i \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k) \bigr]^{z_{ik}}
\end{align}
To find the maximum likelihood estimate, we take the logarithm and work with the log-likelihood.
\begin{align}
\tilde{L} &= \ln p(\boldsymbol Y \ | \ \boldsymbol \pi, \ \boldsymbol \mu, \ \boldsymbol \Sigma) \\
&\\
&= \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \ln N(\boldsymbol x_i \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k) \\
&\\
&= \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K z_{ik} \bigl(-\frac{d}{2}\ln (2\pi) + \frac{1}{2} \ln \bigl|\boldsymbol \Sigma_k \bigr|^{-1} - \frac{1}{2}(\boldsymbol x_i - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x_i - \boldsymbol \mu_k) \bigr)
\end{align}
Next, we take the expectation of the log-likelihood with respect to the hidden variables.
\begin{align}
L &= E_z \bigl\{\tilde{L} \bigr\} \\
&\\
&= \sum_{i = 1}^N \sum_{k = 1}^K E_{z_{ik}} \bigl\{z_{ik} \bigr\} \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K E_{z_{ik}} \bigl\{z_{ik} \bigr\} \bigl(-\frac{d}{2}\ln (2\pi) + \frac{1}{2} \ln \bigl|\boldsymbol \Sigma_k \bigr|^{-1} - \frac{1}{2}(\boldsymbol x_i - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x_i - \boldsymbol \mu_k) \bigr)
\end{align}
From equation (*), the posterior probability of the hidden variable, this expectation is as follows.
\begin{align}
E_{z_{ik}} \bigl\{z_{ik} \bigr\} &= \sum_{z_{ik} = \{0, 1\}} z_{ik}p(z_{ik} \ | \ \boldsymbol x_i, \pi_k, \boldsymbol \mu_k, \boldsymbol \Sigma_k) \\
&\\
&= 1 \times p(z_{ik} = 1 \ | \ \boldsymbol x_i) \\
&\\
& = \frac{p(z_{ik} = 1)p(\boldsymbol x_i \ | \ z_{ik} = 1)}{\sum_{j = 1}^K p(z_{ij} = 1)p(\boldsymbol x_i \ | \ z_{ij} = 1)} \\
&\\
&= \frac{\pi_kN(\boldsymbol x_i \ | \ \boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{j = 1}^K \pi_jN(\boldsymbol x_i \ | \ \boldsymbol \mu_j, \boldsymbol \Sigma_j)} \\
&\\
&= \gamma(z_{ik})
\end{align}
The function obtained by replacing the expectations of the hidden variables in $L$ with the posterior probabilities $\gamma(z_{ik})$ is called the **Q function**.
Q = \sum_{i = 1}^N \sum_{k = 1}^K \gamma(z_{ik}) \ln \pi_k + \sum_{i = 1}^N \sum_{k = 1}^K \gamma(z_{ik}) \bigl(-\frac{d}{2}\ln (2\pi) + \frac{1}{2} \ln \bigl|\boldsymbol \Sigma_k \bigr|^{-1} - \frac{1}{2}(\boldsymbol x_i - \boldsymbol \mu_k)^T \boldsymbol \Sigma_k^{-1}(\boldsymbol x_i - \boldsymbol \mu_k) \bigr)
The **EM algorithm** is used to find maximum likelihood estimates of the parameters of a probability model with hidden variables. It consists of two steps: the E step, which computes the posterior probabilities $\gamma(z_{ik})$ of the hidden variables from the current parameters, and the M step, which re-estimates $\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma$ so as to maximize the Q function.
The E and M steps are repeated until the log-likelihood of the complete data stops changing. Since the value it converges to depends on the initial values, only a local optimum is obtained; to get a good maximum likelihood estimate, the EM algorithm should be run several times with different initial values and the best result selected. The EM algorithm can be summarized as follows.

1. Initialize $\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma$.
2. E step: compute the posterior probabilities $\gamma(z_{ik})$ using the current parameters.
3. M step: update $\boldsymbol \pi, \boldsymbol \mu, \boldsymbol \Sigma$ so as to maximize the Q function.
4. Evaluate the Q function (or the log-likelihood) and return to step 2 unless the change is below a threshold.
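Setting the derivatives of the Q function with respect to each parameter to zero (with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$) gives the standard update formulas applied in the M step. I state them here without derivation; the implementation below applies their one-dimensional version.

\begin{align}
& N_k = \sum_{i = 1}^N \gamma(z_{ik}) \\
& \boldsymbol \mu_k = \frac{1}{N_k} \sum_{i = 1}^N \gamma(z_{ik}) \boldsymbol x_i \\
& \boldsymbol \Sigma_k = \frac{1}{N_k} \sum_{i = 1}^N \gamma(z_{ik}) (\boldsymbol x_i - \boldsymbol \mu_k)(\boldsymbol x_i - \boldsymbol \mu_k)^T \\
& \pi_k = \frac{N_k}{N}
\end{align}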
I implemented it as follows. The implementation handles the one-dimensional case, so each cluster is described by a scalar mean and variance.
```python:mixturesofgaussian.py
import sys

import numpy
from matplotlib import pyplot


def create_data(N, K):
    # generate N samples from K one-dimensional normal distributions
    X, mu_star, sigma_star = [], [], []
    for i in range(K):
        loc = (numpy.random.rand() - 0.5) * 10.0    # range: -5.0 - 5.0
        scale = numpy.random.rand() * 3.0           # range: 0.0 - 3.0
        X = numpy.append(X, numpy.random.normal(loc = loc, scale = scale, size = N // K))
        mu_star = numpy.append(mu_star, loc)
        sigma_star = numpy.append(sigma_star, scale ** 2)   # store the true variance
    return (X, mu_star, sigma_star)


def gaussian(mu, sigma):
    # one-dimensional normal density; sigma is the variance
    def f(x):
        return numpy.exp(-0.5 * (x - mu) ** 2 / sigma) / numpy.sqrt(2 * numpy.pi * sigma)
    return f


def estimate_posterior_likelihood(X, pi, gf):
    # E step: gamma[i, k] = pi_k * N(x_i | mu_k, sigma_k) / sum_j pi_j * N(x_i | mu_j, sigma_j)
    l = numpy.zeros((X.size, pi.size))
    for (i, x) in enumerate(X):
        l[i, :] = gf(x)
    weighted = pi * l
    return weighted / weighted.sum(axis = 1).reshape(-1, 1)


def estimate_gmm_parameter(X, gamma):
    # M step: update mu, sigma (variance) and pi from the responsibilities
    N = gamma.sum(axis = 0)
    mu = (gamma * X.reshape((-1, 1))).sum(axis = 0) / N
    sigma = (gamma * (X.reshape(-1, 1) - mu) ** 2).sum(axis = 0) / N
    pi = N / X.size
    return (mu, sigma, pi)


def calc_Q(X, mu, sigma, pi, gamma):
    # Q function: expected complete-data log-likelihood
    Q = (gamma * (numpy.log(pi * (2 * numpy.pi * sigma) ** (-0.5)))).sum()
    for (i, x) in enumerate(X):
        Q += (gamma[i, :] * (-0.5 * (x - mu) ** 2 / sigma)).sum()
    return Q


if __name__ == '__main__':
    # data
    K = 2
    N = 1000 * K
    X, mu_star, sigma_star = create_data(N, K)

    # termination condition
    epsilon = 0.000001

    # initialize gmm parameter
    pi = numpy.random.rand(K)
    mu = numpy.random.randn(K)
    sigma = numpy.abs(numpy.random.randn(K))
    Q = -sys.float_info.max
    delta = None

    # EM algorithm
    while delta is None or delta >= epsilon:
        gf = gaussian(mu, sigma)

        # E step: estimate posterior probability of hidden variable gamma
        gamma = estimate_posterior_likelihood(X, pi, gf)

        # M step: maximize Q function by estimating mu, sigma and pi
        mu, sigma, pi = estimate_gmm_parameter(X, gamma)

        # calculate Q function
        Q_new = calc_Q(X, mu, sigma, pi, gamma)
        delta = Q_new - Q
        Q = Q_new

    # result
    print('mu*: %s, sigma*: %s' % (str(numpy.sort(numpy.around(mu_star, 3))), str(numpy.sort(numpy.around(sigma_star, 3)))))
    print('mu : %s, sigma : %s' % (str(numpy.sort(numpy.around(mu, 3))), str(numpy.sort(numpy.around(sigma, 3)))))

    # plot
    n, bins, _ = pyplot.hist(X, 50, density = True, alpha = 0.3)
    seq = numpy.arange(-15, 15, 0.02)
    for i in range(K):
        pyplot.plot(seq, gaussian(mu[i], sigma[i])(seq), linewidth = 2.0)
    pyplot.savefig('gmm_graph.png')
    pyplot.show()
```
When I ran the script, the estimated means and variances were close to the true values, so it seems that a reasonable estimate has been made (the fitted components are plotted over the data histogram in gmm_graph.png).
I was able to implement a Gaussian mixture model in Python. There are still some parts I do not fully understand, such as how the parameter updates in the M step are derived and the properties of the EM algorithm, so I will study them and add to this article in the future.