Derivation of the EM algorithm and a calculation example with coin tosses

We first solve a coin-toss problem with the EM algorithm, showing both the mathematical formulas and Python code, and then derive the EM algorithm itself.

Example: Estimating the probability that a coin comes up heads

**Problem**: There are two coins, A and B, and coin A is known to come up heads more easily than coin B. A trial consists of choosing one of the coins and tossing it 20 times; this trial was repeated 10 times, and the number of heads in each trial was as follows. Estimate the probability of heads for each of coins A and B.

Number of heads: 15 4 6 13 3 5 4 4 7 2

**Answer**: Let $\theta_A, \theta_B$ be the probabilities that coins A and B come up heads. For the $i$-th of the $N$ trials, each consisting of $M$ coin tosses, let $x_i$ be the number of heads observed and $z_i \in \{A, B\}$ be the coin used.

The following procedure, called the EM algorithm, finds an estimate $\hat{\theta}$ of $\theta = (\theta_A, \theta_B)$.

  1. Choose an initial value $\theta^{(0)}$ for the estimate of $\theta$.
  2. For each trial $i$, calculate the posterior distribution $p(z_i|x_i,\theta^{(t)})$ of $z_i$ using the current estimate $\theta^{(t)}$.
  3. Find the $\theta$ that maximizes $\sum_i \sum_{z_i} p(z_i|x_i,\theta^{(t)}) \log p(x_i|z_i,\theta)$ and set it as $\theta^{(t+1)}$.
  4. Set $\theta^{(t)} \leftarrow \theta^{(t+1)}$ and return to step 2, repeating until $\theta^{(t)}$ converges.

The specific formula for each step and the Python code are shown below.

**Step 1**

Let $\theta_A^{(0)} = 0.5, \theta_B^{(0)} = 0.4$.

import numpy as np
import scipy.stats as st
from scipy.optimize import minimize

np.random.seed(0)

# Problem setting
N = 10
M = 20
theta_true = np.array([0.8, 0.2])
z_true = np.random.randint(2, size=N)
x = np.zeros(N, dtype=int)
for i in range(0, N):
    x[i] = st.binom.rvs(M, theta_true[z_true[i]])
print(x)

# Initial estimates
t = 0
t_max = 50
thetas = np.zeros([2, t_max])
thetas[:,t] = np.array([0.5, 0.4])

**Step 2**

The posterior distribution of $z_i$ given $\theta^{(t)}$ is

\begin{eqnarray}
p(z_i|x_i,\theta^{(t)}) &=&  \frac{p(x_i|z_i, \theta^{(t)}) p(z_i)}{\sum_{z_i} p(x_i|z_i,\theta^{(t)})p(z_i)}\\
 &=&  \frac{p(x_i|z_i, \theta^{(t)})}{\sum_{z_i} p(x_i|z_i,\theta^{(t)})}\\
 &=&  \frac{B(x_i|M,\theta_{z_i}^{(t)})}{\sum_{z_i} B(x_i|M,\theta_{z_i}^{(t)})} \ .
\end{eqnarray}

The first equality uses Bayes' theorem. The second follows because there is no information about which coin was used, so the prior is taken to be uniform, $p(z_i=A)=p(z_i=B)$. Here $B(x|n,p)$ denotes the binomial distribution: for two outcomes $U$ and $V$, where $U$ occurs with probability $p$, $B(x|n,p)$ is the probability that $U$ occurs exactly $x$ times in $n$ independent trials.

def get_post_z_i(z_i, x_i, M, theta):
    norm_c = 0 # Normalization term in denominator
    for j in range(0,2):
        p_j = st.binom.pmf(x_i, M, theta[j])
        norm_c = norm_c + p_j
        if z_i == j:
            ll = p_j
    return ll / norm_c
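As a quick check, the posterior for the first observation under the initial estimates can be computed as follows (the exact values depend on the data generated above):

# Posterior probability of each coin having produced x[0], under the initial estimates
for j in range(0, 2):
    print('p(z_0 = %d | x_0) =' % j, get_post_z_i(j, x[0], M, thetas[:, 0]))
# The two probabilities sum to 1 by construction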

**Step 3**

Using the posterior calculated in step 2, find the $\theta$ that maximizes the expectation numerically (here with scipy's SLSQP optimizer).

def neg_expect(theta_next, theta_cur, x):
    N = x.size
    e = 0
    for i in range(0, N): # for trial i
        for j in range(0, 2): # used coin j
            post_z = get_post_z_i(j, x[i], M, theta_cur)
            prob_x = st.binom.logpmf(x[i], M, theta_next[j])
            e = e + post_z * prob_x
    return -e

# Sample calculation
bnds = [(0,0.99), (0,0.99)]
res = minimize(neg_expect, thetas[:,t], args=(thetas[:,t], x),
          bounds=bnds, method='SLSQP', options={'maxiter': 1000})
print(res.x)
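For this particular binomial model, the maximization in step 3 can also be done in closed form: setting the derivative of the expected log-likelihood with respect to $\theta_j$ to zero gives $\theta_j^{(t+1)} = \sum_i p(z_i=j|x_i,\theta^{(t)}) x_i / (M \sum_i p(z_i=j|x_i,\theta^{(t)}))$. A minimal sketch of this update, as an optional cross-check of the numerical optimizer (the function m_step_closed_form is not part of the code above):

def m_step_closed_form(x, M, theta_cur):
    # theta_j = sum_i w_ij * x_i / (M * sum_i w_ij), where w_ij = p(z_i = j | x_i, theta_cur)
    theta_next = np.zeros(2)
    for j in range(0, 2):
        w = np.array([get_post_z_i(j, x[i], M, theta_cur) for i in range(x.size)])
        theta_next[j] = np.sum(w * x) / (M * np.sum(w))
    return theta_next

# Should agree with res.x above up to the optimizer tolerance
print(m_step_closed_form(x, M, thetas[:, t]))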

**Step 4**

Repeat the update until the estimate converges (here, simply for a fixed number of iterations t_max).

t = 0
while t < t_max-1:
    if t % 10 == 0:
        print(str(t) + "/" + str(t_max))
    res = minimize(neg_expect, thetas[:,t], args=(thetas[:,t], x),
         bounds=bnds, method='SLSQP', options={'maxiter': 1000})
    thetas[:,t+1] = res.x
    t = t + 1
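If an explicit convergence check is preferred over a fixed number of iterations, the loop above can be written as follows (a sketch; tol is an arbitrary tolerance chosen here for illustration):

t = 0
tol = 1e-6
while t < t_max - 1:
    res = minimize(neg_expect, thetas[:,t], args=(thetas[:,t], x),
                   bounds=bnds, method='SLSQP', options={'maxiter': 1000})
    thetas[:,t+1] = res.x
    if np.max(np.abs(thetas[:,t+1] - thetas[:,t])) < tol:
        # Fill the remaining columns so that the plot below stays well-defined
        thetas[:,t+2:] = thetas[:,t+1][:,None]
        break
    t = t + 1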

**Result**

import matplotlib.pyplot as plt
import seaborn as sns

# Estimated parameters at each iteration
plt.plot(np.arange(0, t_max), thetas[0,:], label='theta_A (estimate)')
plt.plot(np.arange(0, t_max), thetas[1,:], label='theta_B (estimate)')
# True values
plt.plot([0, t_max], np.ones(2) * theta_true[0], label='theta_A (true)')
plt.plot([0, t_max], np.ones(2) * theta_true[1], label='theta_B (true)')
plt.ylim([0, 1])
plt.xlabel('iteration')
plt.legend()
plt.show()

(Figure: the estimates of $\theta_A$ and $\theta_B$ at each iteration, together with the true values.)

Because the amount of data in this problem is small, the estimates are not particularly accurate.

Derivation of the EM algorithm

Let $x = \{x_1, \cdots, x_N\}$ be the sample of the random variable obtained in $N$ trials, let $z$ be the unobservable (hidden) random variable, and let $\theta$ be the parameter to be estimated. The log-likelihood function $l(\theta)$ of $\theta$ can then be written as follows. (Below we write $\sum_z$ because $z$ is a discrete variable in this problem; if $z$ is a continuous variable, the same argument holds with the sum replaced by an integral.)

l(\theta) := \log p(x|\theta) = \log \sum_z p(x,z|\theta) \ . (1)

In the actual calculation it is easier to work with

l(\theta) = \sum_i \log \sum_{z_i} p(x_i,z_i|\theta)\ ,

but this makes the formulas harder to read, so here we proceed with equation (1). Now introduce an arbitrary probability distribution $Q(z)$ over $z$ and apply Jensen's inequality.
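Since $\log$ is concave, Jensen's inequality states that for any probability distribution $Q(z)$ and any positive function $f(z)$,

\log \sum_z Q(z) f(z) \geq \sum_z Q(z) \log f(z) \ .

Applying this with $f(z) = p(x,z|\theta)/Q(z)$, we obtain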

l(\theta) = \log \sum_z Q(z) \frac{p(x,z|\theta)}{Q(z)} \geq \sum_z Q(z) \log \frac{p(x,z|\theta)}{Q(z)}\ ,

which establishes a lower bound on $l(\theta)$. Equality holds in this bound only when, for some constant $C$,

\frac{p(x,z|\theta)}{Q(z)} = C \ ,

for all $z$. Taking the normalization of $Q(z)$ into account, the $Q(z)$ that satisfies this condition is

\begin{eqnarray}
Q(z)&=&\frac{p(x,z|\theta)}{\sum_z p(x,z|\theta)} \\
&=&\frac{p(x,z|\theta)}{p(x|\theta)}\\
&=&p(z|x,\theta)  \ ,
\end{eqnarray}

i.e., the posterior distribution of $z$. That is, for a given $\theta^{(t)}$, the equality

\log \sum_z p(z|x,\theta^{(t)}) \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})} = \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})}\ ,

holds, while for an arbitrary $\theta^{(*)} \neq \theta^{(t)}$ only the inequality

\log \sum_z p(z|x,\theta^{(t)}) \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})} \geq \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})}\ ,  \ \ (2)

holds. Therefore, we choose as $\theta^{(t+1)}$ the $\theta^{(*)}$ that maximizes the right-hand side of (2), that is, the expectation with respect to $z$ of $\log\left(p(x,z|\theta^{(*)})/p(z|x,\theta^{(t)})\right)$ (this maximization of an expectation is what gives the EM algorithm its name):

\begin{eqnarray}
\theta^{(t+1)} &=& \mathrm{arg \ max}_{\theta^{(*)}}  \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})} \  \ (3)\\
&=& \mathrm{arg \ max}_{\theta^{(*)}}  \sum_z p(z|x,\theta^{(t)}) \log p(x,z|\theta^{(*)}) \ .  \  \ (4)
\end{eqnarray}

Equation (4) follows because the denominator $p(z|x,\theta^{(t)})$ does not depend on $\theta^{(*)}$. With this choice of $\theta^{(t+1)}$, the following chain of inequalities holds.

\begin{eqnarray}
l(\theta^{(t+1)})
&\geq &\sum_z p(z|x,\theta^{(t)})\log  \frac{p(x,z|\theta^{(t+1)})}{p(z|x,\theta^{(t)})} \\
&\geq& \sum_z p(z|x,\theta^{(t)})\log  \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})} \\
&=& l(\theta^{(t)}) \ .
\end{eqnarray}

Here, the first inequality follows from relation (2) with $\theta^{(*)} = \theta^{(t+1)}$, and the second from the definition of $\theta^{(t+1)}$ as the maximizer in (3).

This completes the derivation of the EM algorithm. To summarize: given a parameter $\theta^{(t)}$, the $\theta^{(t+1)}$ obtained by the following two-step calculation always satisfies $l(\theta^{(t)}) \leq l(\theta^{(t+1)})$ for the log-likelihood function $l(\theta)$ of $\theta$.

  1. Using $\theta^{(t)}$, calculate the posterior distribution $p(z|x,\theta^{(t)})$ of the hidden variable $z$.
  2. Find the $\theta$ that maximizes $\sum_z p(z|x,\theta^{(t)}) \log p(x,z|\theta)$ and set it as $\theta^{(t+1)}$.

In other words, choosing an initial value $\theta^{(0)}$ and repeating the above two steps yields a locally optimal estimate $\hat{\theta}$. This is the EM algorithm.
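Written as code, the abstract procedure has the following shape (a minimal sketch; e_step and m_step are hypothetical placeholders for the model-specific calculations, corresponding to get_post_z_i and the maximization of the expectation in the example above):

def em(x, theta0, e_step, m_step, n_iter=50):
    # Generic EM loop: alternate the posterior computation (E step)
    # with the maximization of the expected log-likelihood (M step)
    theta = theta0
    for t in range(n_iter):
        post_z = e_step(x, theta)   # p(z | x, theta^(t))
        theta = m_step(x, post_z)   # argmax over theta of the expected complete-data log-likelihood
    return theta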

