This article works through an example of solving a coin-toss problem with the EM algorithm, using mathematical formulas and Python code, and then derives the EM algorithm itself.
**Problem**: There are two coins, A and B, and we know that A is more likely to come up heads than B. We repeated the following trial 10 times: pick one of the two coins and toss it 20 times. The number of heads in each trial was as follows. Estimate the probability of heads for each of coins A and B.
| Number of heads | 15 | 4 | 6 | 13 | 3 | 5 | 4 | 4 | 7 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|
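As a rough check, with the true probabilities used in the simulation code below ($\theta_A = 0.8$, $\theta_B = 0.2$), a trial with coin A yields about $0.8 \times 20 = 16$ heads on average and a trial with coin B about $0.2 \times 20 = 4$, which matches the two clusters of values visible in the table.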
**Answer**: Let $\theta_A, \theta_B$ be the probabilities that coins A and B come up heads, let $x_i$ be the number of heads observed in the $i$-th of the $N$ trials of $M$ coin tosses each, and let $z_i \in \{A, B\}$ be the coin used in that trial.
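Under these definitions, the number of heads in a trial is modeled as binomial given the coin used (and, as noted in Step 2 below, each coin is assumed equally likely a priori):

p(x_i|z_i,\theta) = B(x_i|M,\theta_{z_i}) = \binom{M}{x_i}\,\theta_{z_i}^{x_i}(1-\theta_{z_i})^{M-x_i} \ .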
The following algorithm, called the EM algorithm, finds an estimate $\hat{\theta}$ of $\theta = (\theta_A, \theta_B)$.
The specific formulas for each step and the corresponding Python code are shown below.
**Step 1**
Let $\theta_A^{(0)} = 0.5, \theta_B^{(0)} = 0.4$.
import numpy as np
import scipy.stats as st
from scipy.optimize import minimize  # used in Steps 3 and 4

np.random.seed(0)

# Problem setting: N trials of M tosses each, generated with the true parameters
N = 10
M = 20
theta_true = np.array([0.8, 0.2])      # true probability of heads for coins A and B
z_true = np.random.randint(2, size=N)  # coin actually used in each trial
x = np.zeros(N, dtype=int)
for i in range(0, N):
    x[i] = st.binom.rvs(M, theta_true[z_true[i]])
print(x)

# Initial estimates
t = 0
t_max = 50
thetas = np.zeros([2, t_max])
thetas[:, t] = np.array([0.5, 0.4])
**Step 2**
The posterior distribution of $z_i$ given $\theta^{(t)}$ is
\begin{eqnarray}
p(z_i|x_i,\theta^{(t)})
&=& \frac{p(x_i|z_i, \theta^{(t)})\, p(z_i)}{\sum_{z_i} p(x_i|z_i,\theta^{(t)})\, p(z_i)}\\
&=& \frac{p(x_i|z_i, \theta^{(t)})}{\sum_{z_i} p(x_i|z_i,\theta^{(t)})}\\
&=& \frac{B(x_i|M,\theta_{z_i}^{(t)})}{\sum_{z_i} B(x_i|M,\theta_{z_i}^{(t)})} \ .
\end{eqnarray}
The first equality uses Bayes' theorem, and the second uses a uniform prior $p(z_i)$, since there is no information about which coin was used; $B(x|M,\theta)$ denotes the binomial distribution.
def get_post_z_i(z_i, x_i, M, theta):
    # Posterior probability that coin z_i was used, given x_i heads out of M tosses
    norm_c = 0  # normalization term in the denominator
    for j in range(0, 2):
        p_j = st.binom.pmf(x_i, M, theta[j])
        norm_c = norm_c + p_j
        if z_i == j:
            ll = p_j
    return ll / norm_c
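As a quick sanity check (this snippet is not in the original article), the posterior probability that coin A was used can be evaluated for every trial under the initial estimates; trials with many heads should lean towards coin A:

# Sanity check (added for illustration): posterior P(z_i = A | x_i) under theta^(0)
post_A = np.array([get_post_z_i(0, x[i], M, thetas[:, 0]) for i in range(N)])
print(np.round(post_A, 3))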
**Step 3**
Using the posterior computed in Step 2, find the $\theta$ that maximizes the expected complete-data log-likelihood; here the maximization is done numerically with a gradient-based optimizer (SLSQP via scipy.optimize.minimize, applied to the negative of the expectation).
def neg_expect(theta_next, theta_cur, x):
    # Negative expected complete-data log-likelihood, to be minimized
    N = x.size
    e = 0
    for i in range(0, N):        # trial i
        for j in range(0, 2):    # coin j
            post_z = get_post_z_i(j, x[i], M, theta_cur)
            prob_x = st.binom.logpmf(x[i], M, theta_next[j])
            e = e + post_z * prob_x
    return -e
# Sample calculation
bnds = [(0, 0.99), (0, 0.99)]
res = minimize(neg_expect, thetas[:, t], args=(thetas[:, t], x),
               bounds=bnds, method='SLSQP', options={'maxiter': 1000})
res.x
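As a side note (not part of the original code), for this binomial model the maximization in Step 3 actually has a closed form: a posterior-weighted maximum-likelihood estimate. A minimal sketch that could replace the numerical optimization:

# Closed-form M step (illustrative sketch, assuming the binomial model above)
def m_step_closed_form(theta_cur, x, M):
    theta_next = np.zeros(2)
    for j in range(0, 2):  # coin j
        w = np.array([get_post_z_i(j, x[i], M, theta_cur) for i in range(x.size)])
        theta_next[j] = np.sum(w * x) / (M * np.sum(w))
    return theta_next

print(m_step_closed_form(thetas[:, 0], x, M))  # should roughly agree with res.x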
**Step 4**
Repeat Steps 2 and 3 until convergence.
t = 0
while t < t_max - 1:
    if t % 10 == 0:
        print(str(t) + "/" + str(t_max))
    res = minimize(neg_expect, thetas[:, t], args=(thetas[:, t], x),
                   bounds=bnds, method='SLSQP', options={'maxiter': 1000})
    thetas[:, t+1] = res.x
    t = t + 1
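The loop above simply runs a fixed number of iterations; a simple convergence check (added for illustration) is to look at how much the estimate still changes in the final update:

# Convergence check (illustrative): change in the estimate at the final iteration
print(np.abs(thetas[:, t_max - 1] - thetas[:, t_max - 2]))
print(thetas[:, t_max - 1])  # final estimate of (theta_A, theta_B)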
**Result**
import matplotlib.pyplot as plt
import seaborn as sns

# Estimated parameters over the iterations
plt.plot(np.arange(0, t_max), thetas[0, :], label='estimate of theta_A')
plt.plot(np.arange(0, t_max), thetas[1, :], label='estimate of theta_B')
# True values
plt.plot([0, t_max], np.ones(2) * theta_true[0], label='true theta_A')
plt.plot([0, t_max], np.ones(2) * theta_true[1], label='true theta_B')
plt.ylim([0, 1])
plt.legend()
plt.show()
In this problem the amount of data is small, so the estimates are not very accurate.
**Derivation of the EM algorithm**
Let $x = \{x_1, \cdots, x_N\}$ be the sample of the random variable obtained in $N$ trials, let $z$ be the unobservable (latent) random variable, and let $\theta$ be the parameter to estimate. The log-likelihood function $l(\theta)$ of $\theta$ can be written as follows. (In line with this problem we treat $z$ as a discrete variable and write $\sum_z$, but if $z$ is continuous the argument is the same with the sum replaced by an integral.)
l(\theta) := \log p(x|\theta) = \log \sum_z p(x,z|\theta) \ . (1)
In the actual calculation it is easier to use
l(\theta) = \sum_i \log \sum_{z_i} p(x_i,z_i|\theta)\ ,
but this makes the formulas harder to read, so here we proceed with Eq. (1). Now introduce an arbitrary probability distribution $Q(z)$ over $z$ and apply Jensen's inequality:
l(\theta) = \log \sum_z Q(z) \frac{p(x,z|\theta)}{Q(z)} \geq \sum_z Q(z) \log \frac{p(x,z|\theta)}{Q(z)}\ ,
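For reference, this is Jensen's inequality for the concave $\log$ function: for any non-negative random variable $Y$ and distribution $Q$,
\log \mathbb{E}_{Q}[Y] \geq \mathbb{E}_{Q}[\log Y] \ ,
applied here with $Y = p(x,z|\theta)/Q(z)$.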
In this inequality, equality holds only when, for some constant $C$,
\frac{p(x,z|\theta)}{Q(z)} = C \ .
Taking the normalization of $Q(z)$ into account, the $Q(z)$ that satisfies this is
\begin{eqnarray}
Q(z)&=&\frac{p(x,z|\theta)}{\sum_z p(x,z|\theta)} \\
&=&\frac{p(x,z|\theta)}{p(x|\theta)}\\
&=&p(z|x,\theta) \ .
\end{eqnarray}
That is, given a certain $\theta^{(t)}$,
\log \sum_z p(z|x,\theta^{(t)}) \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})} = \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})}\ ,
holds with equality, while for any $\theta^{(*)} \neq \theta^{(t)}$,
\log \sum_z p(z|x,\theta^{(t)}) \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})} \geq \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})}\ , \ \ (2)
holds. Therefore, we take the $\theta^{(*)}$ that maximizes this lower bound as the next estimate, that is,
\begin{eqnarray}
\theta^{(t+1)} &=& \mathrm{arg \ max}_{\theta^{(*)}} \sum_z p(z|x,\theta^{(t)}) \log \frac{p(x,z|\theta^{(*)})}{p(z|x,\theta^{(t)})} \ \ (3)\\
&=& \mathrm{arg \ max}_{\theta^{(*)}} \sum_z p(z|x,\theta^{(t)}) \log p(x,z|\theta^{(*)}) \ . \ \ (4)
\end{eqnarray}
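In the standard EM terminology (not used explicitly in this article), the quantity maximized in Eq. (4) is the so-called Q function,
Q(\theta^{(*)}|\theta^{(t)}) := \sum_z p(z|x,\theta^{(t)}) \log p(x,z|\theta^{(*)}) \ ,
where computing the posterior $p(z|x,\theta^{(t)})$ is the E step and maximizing over $\theta^{(*)}$ is the M step.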
Then the following chain of inequalities shows that the log-likelihood never decreases:
\begin{eqnarray}
l(\theta^{(t+1)})
&\geq &\sum_z p(z|x,\theta^{(t)})\log \frac{p(x,z|\theta^{(t+1)})}{p(z|x,\theta^{(t)})} \\
&\geq& \sum_z p(z|x,\theta^{(t)})\log \frac{p(x,z|\theta^{(t)})}{p(z|x,\theta^{(t)})} \\
&=& l(\theta^{(t)}) \ .
\end{eqnarray}
Here, we used the relation (2) for the first inequality and the relation (3) for the second inequality.
You have now derived the EM algorithm! To summarize: given a parameter $\theta^{(t)}$, the $\theta^{(t+1)}$ obtained by the two-step calculation above (compute the posterior $p(z|x,\theta^{(t)})$, then maximize the expectation in Eq. (4) with respect to $\theta^{(*)}$) always satisfies $l(\theta^{(t)}) \leq l(\theta^{(t+1)})$ for the log-likelihood function $l(\theta)$.
In other words, if you choose an initial value $\theta^{(0)}$ appropriately and repeat this procedure, you obtain a locally optimal estimate $\hat{\theta}$. This is the EM algorithm.