I'm currently studying Bayesian inference. This time I would like to write down my understanding of the **EM algorithm for Gaussian mixture models**.
As I continue studying, I find that **even with simple examples, understanding of formulas and terminology progresses much faster when I think things through by illustrating and making them concrete**. So I will try to keep this article easy to follow by including implementations wherever possible.
I referred to the following article a great deal; it explains everything from the concepts to the formula derivations and the implementation very clearly.
Thorough explanation of EM algorithm https://qiita.com/kenmatsu4/items/59ea3e5dfa3d4c161efb
The EM algorithm (Expectation-Maximization algorithm) is an algorithm **used to learn and optimize models that contain hidden (latent) variables**.
As a first piece of terminology, let us deepen our understanding of the mixing coefficient.
Consider a probability model $p(x)$ for two-dimensional observations $x$ like the ones below. The data appear to be generated from two clusters, A and B, so we build a model that reflects this.
Assuming each cluster is described by a Gaussian distribution, the model can be written as follows.
\begin{align}
p(x) &= \pi_A\mathcal N(x|\mu_A, \Sigma_A) +\pi_B\mathcal N(x|\mu_B, \Sigma_B)\\
\end{align}
Here, $\pi_A$ and $\pi_B$ are weights satisfying $\pi_A + \pi_B = 1$. Generalizing to $K$ clusters gives the following.
\begin{align}
p(x) &= \sum_{k=1}^{K} \pi_k\mathcal N(x|\mu_k, \Sigma_k) \hspace{1cm}(Equation 1)
\end{align}
This $\pi_k$ is called the **mixing coefficient** and satisfies the following:
\sum_{k=1}^{K} \pi_k =1\\
0 \leqq \pi_k \leqq 1
Here $K$ is the number of clusters. In other words, the mixing coefficient is the **weight of each cluster**, i.e. the prior probability that a data point belongs to that cluster.
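To make this generative interpretation concrete, here is a one-dimensional sketch of my own (not code from the referenced article): a cluster $k$ is first chosen with probability $\pi_k$, and $x$ is then drawn from that cluster's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.3, 0.7])      # mixing coefficients (sum to 1)
mu = np.array([-2.0, 3.0])     # cluster means
sigma = np.array([1.0, 1.5])   # cluster standard deviations

# Generative process: pick cluster k with probability pi[k],
# then draw x from N(mu[k], sigma[k]^2).
k = rng.choice(len(pi), size=1000, p=pi)
x = rng.normal(loc=mu[k], scale=sigma[k])

# Empirical cluster frequencies approach the mixing coefficients.
print(np.bincount(k) / len(k))   # roughly [0.3, 0.7]
```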
Next, consider the term **responsibility** (the posterior probability of a cluster given a data point).
If we write the prior probability of selecting the $k$th cluster as $\pi_k = p(k)$, then
p(x) = \sum_{k=1}^{K} p(k)p(x|k)\hspace{1cm}(Equation 2)\\
which is equivalent to Equation 1 above. The posterior probability $p(k|x)$ here is called the **responsibility**. Writing the responsibility as $\gamma_k(x)$ and applying Bayes' theorem, we obtain
\begin{align}
γ_k(x) &\equiv p(k|x)\\
&=\frac {p(k)p(x|k)}{\sum_lp(l)p(x|l)}\\
&=\frac {\pi_k\mathcal N(x|\mu_k, \Sigma_k)}{\sum_l\pi_l\mathcal N(x|\mu_l, \Sigma_l)}
\end{align}
What does this responsibility actually mean? Let us check its meaning through an implementation.
```python:EM.ipynb
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats as st

# ======================================
# Parameters
K = 4                         # number of clusters
n = 300                       # number of grid points
xx = np.linspace(-4, 10, n)   # evaluation grid
mu = [-1.2, 0.4, 2, 4]        # mean of each cluster
sigma = [1.1, 0.7, 1.5, 2.0]  # standard deviation of each cluster
pi = [0.2, 0.3, 0.3, 0.2]     # mixing coefficients (sum to 1)

# Weighted density of each component: pi_k * N(x | mu_k, sigma_k^2)
pdfs = np.zeros((n, K))
for k in range(K):
    pdfs[:, k] = pi[k] * st.norm.pdf(xx, loc=mu[k], scale=sigma[k])

# =======================================
# Visualization
plt.figure(figsize=(14, 6))
for k in range(K):
    plt.plot(xx, pdfs[:, k])
plt.title("pdfs")
plt.show()

plt.figure(figsize=(14, 6))
plt.stackplot(xx, pdfs[:, 0], pdfs[:, 1], pdfs[:, 2], pdfs[:, 3])
plt.title("stacked")
plt.show()
```
As the stacked plot shows, given a point $x$, the responsibility of cluster $k$ is the share that cluster contributes to the total density of the mixture at that point.
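Continuing from the code above, the responsibilities can be obtained simply by normalizing the weighted component densities by their sum at every point (a small addition of my own that reuses the `pdfs` array defined above).

```python
# Responsibility gamma_k(x): each weighted component density divided by
# the total mixture density at that point (the K values sum to 1 at every x).
gamma = pdfs / pdfs.sum(axis=1, keepdims=True)

plt.figure(figsize=(14, 6))
for k in range(K):
    plt.plot(xx, gamma[:, k], label=f"cluster {k}")
plt.title("responsibilities")
plt.legend()
plt.show()
```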
Now, when learning a Gaussian distribution, the covariance matrix $\Sigma$ is sometimes restricted to the form $\Sigma = \sigma^2 I$. To see what this means, let us summarize the three types of covariance matrices.
Consider a $D$-dimensional Gaussian distribution. Its covariance matrix $\Sigma$ can be written as follows.
\Sigma = \begin{pmatrix}
\sigma_{1}^2 & \sigma_{12} & \cdots & \sigma_{1D} \\
\sigma_{21} & \sigma_{2}^2 & & \\
\vdots & & \ddots & \\
\sigma_{D1} & \sigma_{D2} & \cdots & \sigma_{D}^2
\end{pmatrix}\\
A covariance matrix of this form is called a general symmetric covariance matrix. It has $D(D+1)/2$ free parameters (count the distinct entries in the matrix above).
Next, consider the case where the covariance matrix is diagonal.
\Sigma =\mathrm{diag}(\sigma_i^2)=\begin{pmatrix}
\sigma_{1}^2 & 0 & \cdots & 0 \\
0 & \sigma_{2}^2 & & \\
\vdots & & \ddots & \\
0 & 0 & \cdots & \sigma_{D}^2
\end{pmatrix}\\
In this case, the number of free parameters is $D$, the same as the number of dimensions.
Finally, consider the case where the covariance matrix is proportional to the identity matrix. This is called the isotropic covariance matrix.
\Sigma =\sigma^2\mathbf{I}= \sigma^2\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & & \\
\vdots & & \ddots & \\
0 & 0 & \cdots & 1
\end{pmatrix}\\
In this case there is only one free parameter, $\sigma^2$. The probability densities for these three cases can be visualized as follows.
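Here is a quick way to see the difference (a sketch of my own using `scipy.stats.multivariate_normal`; the specific matrices are chosen only as examples): draw the contours of a 2-D Gaussian for each covariance type.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st

# Example covariance matrices (values chosen only for illustration)
covs = {
    "general symmetric": np.array([[1.0, 0.8], [0.8, 1.5]]),  # D(D+1)/2 = 3 free parameters
    "diagonal":          np.array([[1.0, 0.0], [0.0, 1.5]]),  # D = 2 free parameters
    "isotropic":         1.0 * np.eye(2),                     # 1 free parameter (sigma^2)
}

x, y = np.meshgrid(np.linspace(-4, 4, 200), np.linspace(-4, 4, 200))
pos = np.dstack((x, y))

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, (name, cov) in zip(axes, covs.items()):
    rv = st.multivariate_normal(mean=[0, 0], cov=cov)
    ax.contour(x, y, rv.pdf(pos))   # tilted ellipses, axis-aligned ellipses, circles
    ax.set_title(name)
    ax.set_aspect("equal")
plt.show()
```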
Reducing the number of free parameters makes the computation easier and faster. On the other hand, it reduces the expressive power of the probability density: a general symmetric matrix can also represent diagonal or isotropic shapes, but not the other way around. One idea for achieving both fast computation and sufficient expressiveness is to introduce **latent (unobserved) variables**. A common approach is to gain expressiveness by combining a latent variable with multiple Gaussian distributions, i.e. a Gaussian mixture model.
Now, let's return to the main topic, the EM algorithm.
Writing the latent variables as an $N \times K$ matrix $\mathbf Z$ whose rows are the vectors $\mathbf z_n^T$, the log-likelihood function can be expressed as follows.
\begin{align}
\ln p(\mathbf X\mid \pi,\mu,\Sigma) &=\sum_{n=1}^{N}\ln\Bigl\{ \sum_{k=1}^{K} \pi_k\mathcal N(x_n\mid\mu_k, \Sigma_k) \Bigr\}
\end{align}
By maximizing this **log-likelihood function**, we obtain a model under which the observed data $x$ have high probability; in other words, we perform maximum likelihood estimation.
For a detailed explanation of the concept of likelihood, I found the following article very helpful.
[Statistics] What is likelihood? Let's explain graphically. https://qiita.com/kenmatsu4/items/b28d1b3b3d291d0cc698
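As a concrete check, the log-likelihood itself is easy to evaluate for given parameters; the sketch below (my own, for a one-dimensional mixture) simply implements the formula above.

```python
import numpy as np
from scipy import stats as st

def gmm_log_likelihood(x, pi, mu, sigma):
    """ln p(X | pi, mu, Sigma) for a one-dimensional Gaussian mixture."""
    # densities[n, k] = pi_k * N(x_n | mu_k, sigma_k^2)
    densities = np.array(
        [p * st.norm.pdf(x, loc=m, scale=s) for p, m, s in zip(pi, mu, sigma)]
    ).T
    # Sum over components inside the log, then sum the logs over data points.
    return np.sum(np.log(densities.sum(axis=1)))

x = np.random.normal(size=100)
print(gmm_log_likelihood(x, pi=[0.5, 0.5], mu=[-1.0, 1.0], sigma=[1.0, 1.0]))
```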
However, maximizing this function directly is very difficult (the $\log \sum$ structure has no closed-form solution). We therefore turn to the **EM algorithm**.
The **EM algorithm for a Gaussian mixture model** works as follows. Given a Gaussian mixture model (or after setting one up yourself), the goal is to adjust the **means, covariances, and mixing coefficients of each component** so as to maximize the likelihood function.
This reminded me of updating the weight parameters in a neural network. There, the weights are updated using the gradient of the loss function so as to find the weights that minimize it; since computing that gradient analytically would require an enormous amount of calculation, it is computed by the backpropagation algorithm.
Similarly, the EM algorithm repeatedly computes and updates $\pi$, $\mu$, and $\Sigma$; when a convergence test is satisfied, the resulting parameters (approximately) maximize the likelihood function.
The EM algorithm for a Gaussian mixture model requires updating the responsibility $\gamma(z_{nk})$, the mixing coefficient $\pi_k$, the mean $\mu_k$, and the covariance matrix $\Sigma_k$. The update formulas are obtained by differentiating the log-likelihood with respect to each parameter, setting the derivative to zero, and solving. The following article explains how to solve them in great detail, so please take a look.
Thorough explanation of EM algorithm https://qiita.com/kenmatsu4/items/59ea3e5dfa3d4c161efb
As a result, each parameter can be expressed as:
\begin{align}
\gamma(z_{nk}) &=\frac {\pi_k\mathcal N(x_n|\mu_k, \Sigma_k)}{\sum_{l=1}^{K}\pi_l\mathcal N(x_n|\mu_l, \Sigma_l)}\\
N_k &= \sum_{n=1}^{N}\gamma(z_{nk})\\
\pi_k &= \frac {N_k}{N}\\
\mu_k &= \frac {1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\mathbf{x}_n\\
\Sigma_k &= \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(\mathbf{x}_n -\mu_k)(\mathbf{x}_n -\mu_k)^T
\end{align}
As you can see, $\pi_k$, $\mu_k$, and $\Sigma_k$ all depend on the responsibility $\gamma(z_{nk})$ (here $N_k$ is the effective number of points assigned to cluster $k$). In the case of a single Gaussian distribution, $\gamma(z_{nk})$ is simply 1, and these formulas reduce to the ordinary sample mean and sample covariance, which makes them easy to interpret.
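Putting these update formulas together, the following is a minimal EM loop for a one-dimensional Gaussian mixture (a sketch of my own, not the code in the repository linked below; the initialization and the convergence threshold are arbitrary choices).

```python
import numpy as np
from scipy import stats as st

def em_gmm_1d(x, K, n_iter=100, tol=1e-6, seed=0):
    """Fit a one-dimensional Gaussian mixture with the EM algorithm (minimal sketch)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialization (arbitrary choices): uniform weights, means picked from the data
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k]
        dens = np.array([pi[k] * st.norm.pdf(x, mu[k], sigma[k]) for k in range(K)]).T
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M-step: update pi, mu, sigma from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

        # Convergence test: stop when the log-likelihood no longer improves
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma

# Example: data drawn from two clusters, then the parameters are recovered
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1.5, 300)])
print(em_gmm_1d(x, K=2))
```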
The program is stored here. https://github.com/Fumio-eisan/EM_20200509/upload
In the resulting animation, the points are the estimated means and the solid lines are contours of the probability density. You can see the parameters gradually being optimized for each cluster.
This time we summarized the EM algorithm for Gaussian mixture models. I found that building a visual understanding first makes the mathematical understanding easier.
I am still weak at the formula manipulation and the implementation, so I will keep working through things by hand to deepen my understanding. I would also like to write an article about the EM algorithm in its general form.