This article works through an example of solving a coin-toss problem with the EM algorithm, using mathematical formulas and Python code, and then derives the EM algorithm itself.
**Problem**: There are two coins, A and B, and we know that A is more likely to come up heads than B. We repeated the following trial 10 times: pick one of the two coins and toss it 20 times. The number of heads in each trial was as follows. Estimate the probability of heads for each of coins A and B.
| Number of heads | 15 | 4 | 6 | 13 | 3 | 5 | 4 | 4 | 7 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|
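As a rough check, with the true probabilities used in the simulation code below ($\theta_A = 0.8$, $\theta_B = 0.2$), a trial with coin A yields about $0.8 \times 20 = 16$ heads on average and a trial with coin B about $0.2 \times 20 = 4$, which matches the two clusters of values visible in the table.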
**Answer**: Let $\theta_A, \theta_B$ be the probabilities that coins A and B come up heads, let $x_i$ be the number of heads observed in the $i$-th of the $N$ trials of $M$ coin tosses each, and let $z_i \in \{A, B\}$ be the coin used in that trial.
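Under these definitions, the number of heads in a trial is modeled as binomial given the coin used (and, as noted in Step 2 below, each coin is assumed equally likely a priori):

p(x_i|z_i,\theta) = B(x_i|M,\theta_{z_i}) = \binom{M}{x_i}\,\theta_{z_i}^{x_i}(1-\theta_{z_i})^{M-x_i} \ .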
The following algorithm, called the EM algorithm, finds an estimate $\hat{\theta}$ of $\theta = (\theta_A, \theta_B)$.
The specific formulas for each step and the corresponding Python code are shown below.
**Step 1**
Let $\theta_A^{(0)} = 0.5, \theta_B^{(0)} = 0.4$.
import numpy as np
import scipy.stats as st
from scipy.optimize import minimize  # used in Steps 3 and 4

np.random.seed(0)

# Problem setting: N trials of M tosses each, generated with the true parameters
N = 10
M = 20
theta_true = np.array([0.8, 0.2])      # true probability of heads for coins A and B
z_true = np.random.randint(2, size=N)  # coin actually used in each trial
x = np.zeros(N, dtype=int)
for i in range(0, N):
    x[i] = st.binom.rvs(M, theta_true[z_true[i]])
print(x)

# Initial estimates
t = 0
t_max = 50
thetas = np.zeros([2, t_max])
thetas[:, t] = np.array([0.5, 0.4])
**Step 2**
The posterior distribution of $z_i$ given $\theta^{(t)}$ is
\begin{eqnarray}
p(z_i|x_i,\theta^{(t)})
&=& \frac{p(x_i|z_i, \theta^{(t)})\, p(z_i)}{\sum_{z_i} p(x_i|z_i,\theta^{(t)})\, p(z_i)}\\
&=& \frac{p(x_i|z_i, \theta^{(t)})}{\sum_{z_i} p(x_i|z_i,\theta^{(t)})}\\
&=& \frac{B(x_i|M,\theta_{z_i}^{(t)})}{\sum_{z_i} B(x_i|M,\theta_{z_i}^{(t)})} \ .
\end{eqnarray}
The first equality uses Bayes' theorem, and the second uses a uniform prior $p(z_i)$, since there is no information about which coin was used; $B(x|M,\theta)$ denotes the binomial distribution.
def get_post_z_i(z_i, x_i, M, theta):
    # Posterior probability that coin z_i was used, given x_i heads out of M tosses
    norm_c = 0  # normalization term in the denominator
    for j in range(0, 2):
        p_j = st.binom.pmf(x_i, M, theta[j])
        norm_c = norm_c + p_j
        if z_i == j:
            ll = p_j
    return ll / norm_c
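As a quick sanity check (this snippet is not in the original article), the posterior probability that coin A was used can be evaluated for every trial under the initial estimates; trials with many heads should lean towards coin A:

# Sanity check (added for illustration): posterior P(z_i = A | x_i) under theta^(0)
post_A = np.array([get_post_z_i(0, x[i], M, thetas[:, 0]) for i in range(N)])
print(np.round(post_A, 3))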
**Step 3**
Using the posterior computed in Step 2, find the $\theta$ that maximizes the expected complete-data log-likelihood; here the maximization is done numerically with a gradient-based optimizer (SLSQP via scipy.optimize.minimize, applied to the negative of the expectation).
def neg_expect(theta_next, theta_cur, x):
    # Negative expected complete-data log-likelihood, to be minimized
    N = x.size
    e = 0
    for i in range(0, N):        # trial i
        for j in range(0, 2):    # coin j
            post_z = get_post_z_i(j, x[i], M, theta_cur)
            prob_x = st.binom.logpmf(x[i], M, theta_next[j])
            e = e + post_z * prob_x
    return -e
# Sample calculation
bnds = [(0, 0.99), (0, 0.99)]
res = minimize(neg_expect, thetas[:, t], args=(thetas[:, t], x),
               bounds=bnds, method='SLSQP', options={'maxiter': 1000})
res.x
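As a side note (not part of the original code), for this binomial model the maximization in Step 3 actually has a closed form: a posterior-weighted maximum-likelihood estimate. A minimal sketch that could replace the numerical optimization:

# Closed-form M step (illustrative sketch, assuming the binomial model above)
def m_step_closed_form(theta_cur, x, M):
    theta_next = np.zeros(2)
    for j in range(0, 2):  # coin j
        w = np.array([get_post_z_i(j, x[i], M, theta_cur) for i in range(x.size)])
        theta_next[j] = np.sum(w * x) / (M * np.sum(w))
    return theta_next

print(m_step_closed_form(thetas[:, 0], x, M))  # should roughly agree with res.x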
**Step 4**
Repeat Steps 2 and 3 until convergence.
t = 0
while t < t_max - 1:
    if t % 10 == 0:
        print(str(t) + "/" + str(t_max))
    res = minimize(neg_expect, thetas[:, t], args=(thetas[:, t], x),
                   bounds=bnds, method='SLSQP', options={'maxiter': 1000})
    thetas[:, t+1] = res.x
    t = t + 1
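The loop above simply runs a fixed number of iterations; a simple convergence check (added for illustration) is to look at how much the estimate still changes in the final update:

# Convergence check (illustrative): change in the estimate at the final iteration
print(np.abs(thetas[:, t_max - 1] - thetas[:, t_max - 2]))
print(thetas[:, t_max - 1])  # final estimate of (theta_A, theta_B)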
**Result**
import matplotlib.pyplot as plt
import seaborn as sns

# Estimated parameters over the iterations
plt.plot(np.arange(0, t_max), thetas[0, :], label='estimate of theta_A')
plt.plot(np.arange(0, t_max), thetas[1, :], label='estimate of theta_B')
# True values
plt.plot([0, t_max], np.ones(2) * theta_true[0], label='true theta_A')
plt.plot([0, t_max], np.ones(2) * theta_true[1], label='true theta_B')
plt.ylim([0, 1])
plt.legend()
plt.show()
In this problem the amount of data is small, so the estimates are not very accurate.
**Derivation of the EM algorithm**
Let $x = \{x_1, \cdots, x_N\}$ be the sample of the random variable obtained in $N$ trials, let $z$ be the unobservable (latent) random variable, and let $\theta$ be the parameter to estimate. The log-likelihood function $l(\theta)$ of $\theta$ can be written as follows. (In line with this problem we treat $z$ as a discrete variable and write $\sum_z$, but if $z$ is continuous the argument is the same with the sum replaced by an integral.)
l(\theta) := \log p(x|\theta) = \log \sum_z p(x,z|\theta) \ . (1)
In the actual calculation it is easier to use
l(\theta) = \sum_i \log \sum_{z_i} p(x_i,z_i|\theta)\ ,
but this makes the formulas harder to read, so here we proceed with Eq. (1). Now introduce an arbitrary probability distribution $Q(z)$ over $z$ and apply Jensen's inequality:
l(\theta) = \log \sum_z Q(z) \frac{p(x,z|\theta)}{Q(z)} \geq \sum_z Q(z) \log \frac{p(x,z|\theta)}{Q(z)}\ ,
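For reference, this is Jensen's inequality for the concave $\log$ function: for any non-negative random variable $Y$ and distribution $Q$,
\log \mathbb{E}_{Q}[Y] \geq \mathbb{E}_{Q}[\log Y] \ ,
applied here with $Y = p(x,z|\theta)/Q(z)$.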
In this inequality, equality holds only when, for some constant $C$,
\frac{p(x,z|\theta)}{Q(z)} = C \ .
Taking the normalization of $Q(z)$ into account, the $Q(z)$ that satisfies this is
\begin{eqnarray}
Q(z)&=&\frac{p(x,z|\theta)}{\sum_z p(x,z|\theta)} \\
&=&\frac{p(x,z|\theta)}{p(x|\theta)}\\
&=&p(z|x,\theta) \ .
\end{eqnarray}
That is, given a certain $\theta^{(t)}$,
\log \sum_z p(z|x,\theta^{(t)}) \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})} = \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})}\ ,
holds with equality, while for any $\theta^{(*)} \neq \theta^{(t)}$,
\log \sum_z p(z|x,\theta^{(t)}) \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})} \geq \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})}\ , \ \ (2)
holds. Therefore, we take the $\theta^{(*)}$ that maximizes this lower bound as the next estimate, that is,
\begin{eqnarray}
\theta^{(t+1)} &=& \mathrm{arg \ max}_{\theta^{(*)}} \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})} \ \ (3)\\
&=& \mathrm{arg \ max}_{\theta^{(*)}} \sum_z p(z|x,\theta^{(t)}) \log p(x,z|\theta^{(*)}) \ . \ \ (4)
\end{eqnarray}
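In the standard EM terminology (not used explicitly in this article), the quantity maximized in Eq. (4) is the so-called Q function,
Q(\theta^{(*)}|\theta^{(t)}) := \sum_z p(z|x,\theta^{(t)}) \log p(x,z|\theta^{(*)}) \ ,
where computing the posterior $p(z|x,\theta^{(t)})$ is the E step and maximizing over $\theta^{(*)}$ is the M step.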
Then the following chain of inequalities shows that the log-likelihood never decreases:
\begin{eqnarray}
l(\theta^{(t+1)})
&\geq &\sum_z p(z|x,\theta^{(t)})\log \frac{p(x,z|\theta^{(t+1)})}{p(z|x,\theta^{(t)})} \\
&\geq& \sum_z p(z|x,\theta^{(t)})\log \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})} \\
&=& l(\theta^{(t)}) \ .
\end{eqnarray}
Here, we used the relation (2) for the first inequality and the relation (3) for the second inequality.
You have now derived the EM algorithm! To summarize: given a parameter $\theta^{(t)}$, the $\theta^{(t+1)}$ obtained by the two-step calculation above (compute the posterior $p(z|x,\theta^{(t)})$, then maximize the expectation in Eq. (4) with respect to $\theta^{(*)}$) always satisfies $l(\theta^{(t)}) \leq l(\theta^{(t+1)})$ for the log-likelihood function $l(\theta)$.
In other words, if you choose an initial value $\theta^{(0)}$ appropriately and repeat this procedure, you obtain a locally optimal estimate $\hat{\theta}$. This is the EM algorithm.