I implemented maximum likelihood estimation of a topic model in Python. As a textbook, I used [Topic Models (Machine Learning Professional Series, Tomoharu Iwata)](https://www.amazon.co.jp/dp/4061529048).
The topic model assumes that a document has multiple topics.
**Example**
An article on "deliberation of a medical bill" => two topics, "medicine" and "politics"
An article on "economic effects of the Olympics" => two topics, "sports" and "economy"
To express the assumption that a document has multiple topics, a topic distribution $\boldsymbol\theta_d = (\theta_{d1}, \cdots, \theta_{dK})$ is defined for each document. $\theta_{dk} = p(k \ | \ \boldsymbol\theta_d)$ is the probability that topic $k$ is assigned to a word in document $d$. A topic $z_{dn}$ is assigned to each word in document $d$ according to the topic distribution $\boldsymbol\theta_d$, and the word is then generated according to the word distribution $\boldsymbol\phi_{z_{dn}}$ of the assigned topic. $\boldsymbol\phi_k = (\phi_{k1}, \cdots, \phi_{kV})$ collects the generation probabilities of each vocabulary item; $\phi_{kv}$ is the probability that topic $k$ generates vocabulary $v$.
The figure below shows the process of generating a specific document set.
A topic is assigned to each word, and the word is generated from the word distribution of that topic. For the document in the middle of the figure, the topics "sports" and "economy" have high probabilities of being assigned, so many words related to these two topics are observed. Looking at the word distribution of the "economy" topic, the words with the highest generation probabilities are "business conditions," "stock price," "economy," and "strong yen."
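As a small illustration of this generative process, the following sketch samples one document with NumPy. The Dirichlet draws are only a convenient way to get random probability vectors for the example; the variable names and sizes are assumptions for illustration, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N_d = 3, 6, 10                       # topics, vocabulary size, document length
theta_d = rng.dirichlet(np.ones(K))        # topic distribution of one document
phi = rng.dirichlet(np.ones(V), size=K)    # word distribution of each topic, shape (K, V)

# for each word position n: draw a topic z_dn from theta_d, then a word from phi[z_dn]
z_d = rng.choice(K, size=N_d, p=theta_d)
w_d = np.array([rng.choice(V, p=phi[z]) for z in z_d])
print(z_d)
print(w_d)
```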
The method that performs maximum likelihood estimation of the topic model is called **probabilistic latent semantic analysis** (**PLSA**). Incidentally, the method that performs Bayesian estimation of the topic model is called **latent Dirichlet allocation** (**LDA**).
PLSA performs maximum likelihood estimation of the topic model using the EM algorithm. We find the parameters $\boldsymbol\Theta = (\boldsymbol\theta_1, \cdots, \boldsymbol\theta_D)$ and $\boldsymbol\Phi = (\boldsymbol\phi_1, \cdots, \boldsymbol\phi_K)$ that maximize the following log-likelihood.
\begin{align}
L &= \sum^D_{d = 1} \sum^{N_d}_{n = 1} \log \sum^K_{k = 1} \theta_{dk}\phi_{kw_{dn}} \\
&= \sum^D_{d = 1} \sum^{N_d}_{n = 1} \log \sum^K_{k = 1} q_{dnk} \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}} \\
&\geq \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}} \\
&= \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} \equiv F \tag{1}
\end{align}
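For reference, the inequality in the third line of equation (1) is Jensen's inequality: since $\log$ is concave and $(q_{dn1}, \cdots, q_{dnK})$ is a probability distribution over topics,

\log \sum^K_{k = 1} q_{dnk} \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}} \geq \sum^K_{k = 1} q_{dnk} \log \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}}

with equality when $q_{dnk} \propto \theta_{dk}\phi_{kw_{dn}}$, which is exactly the distribution the E step below computes.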
The EM algorithm is used to maximize the lower bound $F$. Below, the derivation is divided into the **E step** and the **M step**.
Here is a summary of the notation used.
$D$: number of documents
$N_d$: number of words contained in document $d$ (document length)
$V$: number of distinct words (vocabulary size) appearing in all documents
$\boldsymbol W$: document set
$\boldsymbol w_d$: document $d$
$w_{dn}$: $n$-th word in document $d$
$K$: number of topics
$\theta_{dk}$: probability that topic $k$ is assigned to a word in document $d$
$\phi_{kv}$: probability that topic $k$ generates vocabulary $v$
$q_{dnk}$: probability (responsibility) that topic $k$ is assigned to the $n$-th word in document $d$
E step
To apply the method of Lagrange multipliers, we define $F_q$ as follows.
F_q \equiv \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} + \lambda \bigl(\sum^K_{k = 1} q_{dnk} - 1 \bigr) \tag{2}
Partially differentiate $F_q$ with respect to $q_{dnk}$.
\frac{\partial F_q}{\partial q_{dnk}} = \log \theta_{dk}\phi_{kw_{dn}} - \log q_{dnk} - 1 + \lambda = 0 \tag{3}
From $\frac{\partial F_q}{\partial q_{dnk}} = 0$,
q_{dnk} = \exp(\lambda - 1) \theta_{dk}\phi_{kw_{dn}} \tag{4}
From $\sum^K_{k = 1} q_{dnk} = 1$,
\begin{align}
& \sum^K_{k = 1} \exp(\lambda - 1) \theta_{dk}\phi_{kw_{dn}} = 1 \\
& \exp(\lambda - 1) = \frac{1}{\displaystyle \sum^K_{k = 1} \theta_{dk}\phi_{kw_{dn}}} \tag{5}
\end{align}
Substituting equation (5) into equation (4) gives $q_{dnk}$.
q_{dnk} = \frac{\theta_{dk}\phi_{kw_{dn}}}{\displaystyle \sum^K_{k' = 1} \theta_{dk'}\phi_{k'w_{dn}}} \tag{6}
$q_{dnk}$ is called the responsibility (burden rate) and represents the probability that topic $k$ is assigned to the $n$-th word in document $d$.
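As a sketch of how equation (6) maps to NumPy, the following computes the responsibilities of all words of one document at once. The names `theta_d`, `phi`, and `w_d` are assumptions for illustration: a length-$K$ topic-distribution vector, a $(K, V)$ word-distribution matrix, and the word indices of one document.

```python
import numpy as np

def e_step_document(theta_d, phi, w_d):
    """Responsibilities q[n, k] for one document, following equation (6)."""
    # numerator of (6): theta_dk * phi_{k, w_dn} for every position n and topic k
    numer = theta_d[np.newaxis, :] * phi[:, w_d].T   # shape (N_d, K)
    # denominator of (6): normalize over topics k
    return numer / numer.sum(axis=1, keepdims=True)
```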
M step
In the M step, the parameters $\boldsymbol\Theta = (\boldsymbol\theta_1, \cdots, \boldsymbol\theta_D)$ and $\boldsymbol\Phi = (\boldsymbol\phi_1, \cdots, \boldsymbol\phi_K)$ are updated separately. Both use the method of Lagrange multipliers, as in the E step.
To apply the method of Lagrange multipliers, we define $F_\theta$ as follows.
F_\theta \equiv \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} + \lambda \bigl(\sum^K_{k = 1} \theta_{dk} - 1 \bigr) \tag{7}
Partially differentiate $F_\theta$ with respect to $\theta_{dk}$.
\frac{\partial F_\theta}{\partial \theta_{dk}} = \sum^{N_d}_{n = 1} \frac{q_{dnk}}{\theta_{dk}} + \lambda \tag{8}
From $\frac{\partial F_\theta}{\partial \theta_{dk}} = 0$,
\theta_{dk} = - \sum^{N_d}_{n = 1} \frac{q_{dnk}}{\lambda} \tag{9}
From $\sum^K_{k = 1} \theta_{dk} = 1$,
\begin{align}
& - \sum^K_{k = 1} \sum^{N_d}_{n = 1} \frac{q_{dnk}}{\lambda} = 1 \\
& \lambda = - \sum^K_{k = 1} \sum^{N_d}_{n = 1} q_{dnk} \tag{10}
\end{align}
Substituting equation (10) into equation (9) gives $\theta_{dk}$.
\theta_{dk} = \frac{\displaystyle \sum^{N_d}_{n = 1} q_{dnk}}{\displaystyle \sum^K_{k' = 1} \sum^{N_d}_{n = 1} q_{dnk'}} \tag{11}
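Equation (11) just normalizes the responsibilities summed over the word positions of a document. A minimal sketch, assuming `q_d` is the $(N_d, K)$ responsibility matrix of one document from the E step:

```python
def m_step_theta_document(q_d):
    """Topic distribution theta_d for one document, following equation (11)."""
    counts = q_d.sum(axis=0)        # expected number of words assigned to each topic
    return counts / counts.sum()    # the denominator of (11) equals N_d
```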
To apply the method of Lagrange multipliers, we define $F_\phi$ as follows.
F_\phi \equiv \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} + \lambda \bigl(\sum^V_{v = 1} \phi_{kv} - 1 \bigr) \tag{12}
Partially differentiate $F_\phi$ with respect to $\phi_{kv}$.
\frac{\partial F_\phi}{\partial \phi_{kv}} = \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} \frac{q_{dnk}}{\phi_{kv}} + \lambda \tag{13}
From $\frac{\partial F_\phi}{\partial \phi_{kv}} = 0$,
\phi_{kv} = - \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} \frac{q_{dnk}}{\lambda} \tag{14}
From $\sum^V_{v = 1} \phi_{kv} = 1$,
\begin{align}
& - \sum^V_{v = 1} \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} \frac{q_{dnk}}{\lambda} = 1 \\
& \lambda = - \sum^V_{v = 1} \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} q_{dnk} \tag{15}
\end{align}
Substituting equation (15) into equation (14) gives $\phi_{kv}$.
\phi_{kv} = \frac{\displaystyle \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} q_{dnk}}{\displaystyle \sum^V_{v' = 1} \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v'} q_{dnk}}
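The numerator of this update is the expected number of times topic $k$ generates vocabulary $v$, summed over all documents. A minimal sketch, assuming `W` is a list of word-index arrays (one per document) and `Q` is a list of the corresponding $(N_d, K)$ responsibility matrices:

```python
import numpy as np

def m_step_phi(W, Q, K, V):
    """Word distributions phi of shape (K, V), following the update above."""
    counts = np.zeros((K, V))
    for w_d, q_d in zip(W, Q):
        for k in range(K):
            # add q[n, k] to counts[k, v] for every position n with w_dn = v
            counts[k] += np.bincount(w_d, weights=q_d[:, k], minlength=V)
    return counts / counts.sum(axis=1, keepdims=True)
```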
Below is a toy implementation of maximum likelihood estimation for the topic model in Python.
```python
import numpy as np

def normalize(ndarray, axis):
    return ndarray / ndarray.sum(axis = axis, keepdims = True)

def normalized_random_array(d0, d1):
    ndarray = np.random.rand(d0, d1)
    return normalize(ndarray, axis = 1)

if __name__ == "__main__":

    # initialize parameters
    D, K, V = 1000, 3, 6
    theta = normalized_random_array(D, K)
    phi = normalized_random_array(K, V)
    theta_est = normalized_random_array(D, K)
    phi_est = normalized_random_array(K, V)

    # cumulative distributions for generating documents
    _theta = np.array([theta[:, :k+1].sum(axis = 1) for k in range(K)]).T
    _phi = np.array([phi[:, :v+1].sum(axis = 1) for v in range(V)]).T

    # generate documents
    W, Z = [], []
    N = np.random.randint(100, 300, D)
    for (d, N_d) in enumerate(N):
        Z.append((np.random.rand(N_d, 1) < _theta[d, :]).argmax(axis = 1))
        W.append((np.random.rand(N_d, 1) < _phi[Z[-1], :]).argmax(axis = 1))

    # estimate parameters
    q = np.zeros((D, np.max(N), K))
    T = 300
    likelihood = np.zeros(T)
    for t in range(T):
        print(t)

        # E step
        for (d, N_d) in enumerate(N):
            q[d, :N_d, :] = normalize(theta_est[d, :] * phi_est[:, W[d]].T, axis = 1)

        # M step
        theta_est[:, :] = normalize(q.sum(axis = 1), axis = 1)

        q_sum = np.zeros((K, V))
        for (d, W_d) in enumerate(W):
            v, index, count = np.unique(W_d, return_index = True, return_counts = True)
            q_sum[:, v[:]] += count[:] * q[d, index[:], :].T
        phi_est[:, :] = normalize(q_sum, axis = 1)

        # likelihood
        for (d, W_d) in enumerate(W):
            likelihood[t] += np.log(theta_est[d, :].dot(phi_est[:, W_d])).sum()

    # perplexity
    perplexity = np.exp(-likelihood[:] / N.sum())
```
The results of running the code above are shown. The parameters are: $D = 1000$ documents, $K = 3$ topics, a vocabulary of $V = 6$ words, document lengths $N_d$ drawn uniformly between 100 and 300, and $T = 300$ EM iterations.
The true word distribution and the estimated word distribution are shown below.
```
phi
[[ 0.06837655  0.2294022   0.29048815  0.01610669  0.31437584  0.08125057]
 [ 0.19902324  0.19283392  0.0251427   0.23391767  0.10622823  0.24285424]
 [ 0.04165762  0.30678289  0.2579833   0.00882557  0.25580282  0.12894781]]

phi_est
[[ 0.04572686  0.20003413  0.33589344  0.00282668  0.36368718  0.0518317 ]
 [ 0.20342652  0.16884355  0.04403442  0.23473554  0.12226066  0.22669932]
 [ 0.05439782  0.37614278  0.18788228  0.01364525  0.18248741  0.18544446]]
```
A graph of log-likelihood is shown. The log-likelihood is calculated by the following formula.
\begin{align}
likelihood &= \sum^D_{d = 1} \log p(\boldsymbol w_d \ | \ M) \\
&= \sum^D_{d = 1} \log \prod^{N_d}_{n = 1} \sum^K_{k = 1} \theta_{dk} \phi_{kw_{dn}}
\end{align}
The graph of perplexity is shown.
perplexity = \exp\Biggl(- \frac{\displaystyle \sum^D_{d = 1} \log p(\boldsymbol w_d \ | \ M)}{\displaystyle \sum^D_{d = 1} N_d}\Biggr)
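For reference, both graphs can be drawn with matplotlib from the `likelihood` array computed in the script above. This is a minimal sketch that assumes it runs in the same session as that script, so `likelihood` and `N` are already defined.

```python
import numpy as np
import matplotlib.pyplot as plt

# log-likelihood and perplexity per EM iteration, from the script above
perplexity = np.exp(-likelihood / N.sum())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(likelihood)
ax1.set(xlabel="iteration", ylabel="log-likelihood")
ax2.plot(perplexity)
ax2.set(xlabel="iteration", ylabel="perplexity")
plt.tight_layout()
plt.show()
```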
I have implemented maximum likelihood estimation of the topic model. Next, I will try **LDA**.