I implemented maximum likelihood estimation of a topic model in Python. As a textbook, I used [Topic Models (Machine Learning Professional Series, Tomoharu Iwata)](https://www.amazon.co.jp/dp/4061529048).
The topic model assumes that a document has multiple topics.
**Example**
An article on "deliberation of a medical bill" => two topics, "medicine" and "politics"
An article on "economic effects of the Olympics" => two topics, "sports" and "economy"
To express the assumption that a document has multiple topics, a topic distribution $\boldsymbol\theta_d = (\theta_{d1}, \cdots, \theta_{dK})$ is defined for each document. $\theta_{dk} = p(k \ | \ \boldsymbol\theta_d)$ is the probability that topic $k$ is assigned to a word in document $d$. A topic $z_{dn}$ is assigned to each word in document $d$ according to the topic distribution $\boldsymbol\theta_d$, and the word is then generated according to the word distribution $\boldsymbol\phi_{z_{dn}}$ of the assigned topic. $\boldsymbol\phi_k = (\phi_{k1}, \cdots, \phi_{kV})$ collects the generation probabilities of each vocabulary item; $\phi_{kv}$ is the probability that topic $k$ generates vocabulary $v$.
The figure below shows the process of generating a specific document set.
A topic is assigned to each word, and the word is generated from the word distribution of that topic. For the document in the middle of the figure, the topics "sports" and "economy" have high probabilities of being assigned, so many words related to these two topics are observed. Looking at the word distribution of the "economy" topic, the words with the highest generation probabilities are "business conditions," "stock price," "economy," and "strong yen."
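As a small illustration of this generative process, the following sketch samples one document with NumPy. The Dirichlet draws are only a convenient way to get random probability vectors for the example; the variable names and sizes are assumptions for illustration, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N_d = 3, 6, 10                       # topics, vocabulary size, document length
theta_d = rng.dirichlet(np.ones(K))        # topic distribution of one document
phi = rng.dirichlet(np.ones(V), size=K)    # word distribution of each topic, shape (K, V)

# for each word position n: draw a topic z_dn from theta_d, then a word from phi[z_dn]
z_d = rng.choice(K, size=N_d, p=theta_d)
w_d = np.array([rng.choice(V, p=phi[z]) for z in z_d])
print(z_d)
print(w_d)
```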
The method that performs maximum likelihood estimation of the topic model is called **probabilistic latent semantic analysis** (**PLSA**). Incidentally, the method that performs Bayesian estimation of the topic model is called **latent Dirichlet allocation** (**LDA**).
PLSA performs maximum likelihood estimation of the topic model using the EM algorithm. We find the parameters $\boldsymbol\Theta = (\boldsymbol\theta_1, \cdots, \boldsymbol\theta_D)$ and $\boldsymbol\Phi = (\boldsymbol\phi_1, \cdots, \boldsymbol\phi_K)$ that maximize the following log-likelihood.
\begin{align}
L &= \sum^D_{d = 1} \sum^{N_d}_{n = 1} \log \sum^K_{k = 1} \theta_{dk}\phi_{kw_{dn}} \\
&= \sum^D_{d = 1} \sum^{N_d}_{n = 1} \log \sum^K_{k = 1} q_{dnk} \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}} \\
&\geq \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}} \\
&= \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} \equiv F \tag{1}
\end{align}
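For reference, the inequality in the third line of equation (1) is Jensen's inequality: since $\log$ is concave and $(q_{dn1}, \cdots, q_{dnK})$ is a probability distribution over topics,

\log \sum^K_{k = 1} q_{dnk} \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}} \geq \sum^K_{k = 1} q_{dnk} \log \frac{\theta_{dk}\phi_{kw_{dn}}}{q_{dnk}}

with equality when $q_{dnk} \propto \theta_{dk}\phi_{kw_{dn}}$, which is exactly the distribution the E step below computes.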
The EM algorithm is used to maximize the lower bound $F$. Below, the derivation is divided into the **E step** and the **M step**.
Here is a summary of the notation used.
$D$: number of documents
$N_d$: number of words contained in document $d$ (document length)
$V$: number of distinct words (vocabulary size) appearing in all documents
$\boldsymbol W$: document set
$\boldsymbol w_d$: document $d$
$w_{dn}$: $n$-th word in document $d$
$K$: number of topics
$\theta_{dk}$: probability that topic $k$ is assigned to a word in document $d$
$\phi_{kv}$: probability that topic $k$ generates vocabulary $v$
$q_{dnk}$: probability (responsibility) that topic $k$ is assigned to the $n$-th word in document $d$
E step
To apply the method of Lagrange multipliers, we define $F_q$ as follows.
F_q \equiv \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} + \lambda \bigl(\sum^K_{k = 1} q_{dnk} - 1 \bigr) \tag{2}
Partially differentiate $F_q$ with respect to $q_{dnk}$.
\frac{\partial F_q}{\partial q_{dnk}} = \log \theta_{dk}\phi_{kw_{dn}} - \log q_{dnk} - 1 + \lambda = 0 \tag{3}
From $\frac{\partial F_q}{\partial q_{dnk}} = 0$,
q_{dnk} = \exp(\lambda - 1) \theta_{dk}\phi_{kw_{dn}} \tag{4}
From $\sum^K_{k = 1} q_{dnk} = 1$,
\begin{align}
& \sum^K_{k = 1} \exp(\lambda - 1) \theta_{dk}\phi_{kw_{dn}} = 1 \\
& \exp(\lambda - 1) = \frac{1}{\displaystyle \sum^K_{k = 1} \theta_{dk}\phi_{kw_{dn}}} \tag{5}
\end{align}
Substituting equation (5) into equation (4) gives $q_{dnk}$.
q_{dnk} = \frac{\theta_{dk}\phi_{kw_{dn}}}{\displaystyle \sum^K_{k' = 1} \theta_{dk'}\phi_{k'w_{dn}}} \tag{6}
$q_{dnk}$ is called the responsibility (burden rate) and represents the probability that topic $k$ is assigned to the $n$-th word in document $d$.
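As a sketch of how equation (6) maps to NumPy, the following computes the responsibilities of all words of one document at once. The names `theta_d`, `phi`, and `w_d` are assumptions for illustration: a length-$K$ topic-distribution vector, a $(K, V)$ word-distribution matrix, and the word indices of one document.

```python
import numpy as np

def e_step_document(theta_d, phi, w_d):
    """Responsibilities q[n, k] for one document, following equation (6)."""
    # numerator of (6): theta_dk * phi_{k, w_dn} for every position n and topic k
    numer = theta_d[np.newaxis, :] * phi[:, w_d].T   # shape (N_d, K)
    # denominator of (6): normalize over topics k
    return numer / numer.sum(axis=1, keepdims=True)
```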
M step
In the M step, the parameters $\boldsymbol\Theta = (\boldsymbol\theta_1, \cdots, \boldsymbol\theta_D)$ and $\boldsymbol\Phi = (\boldsymbol\phi_1, \cdots, \boldsymbol\phi_K)$ are updated separately. Both use the method of Lagrange multipliers, as in the E step.
To apply the method of Lagrange multipliers, we define $F_\theta$ as follows.
F_\theta \equiv \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} + \lambda \bigl(\sum^K_{k = 1} \theta_{dk} - 1 \bigr) \tag{7}
Partially differentiate $F_\theta$ with respect to $\theta_{dk}$.
\frac{\partial F_\theta}{\partial \theta_{dk}} = \sum^{N_d}_{n = 1} \frac{q_{dnk}}{\theta_{dk}} + \lambda \tag{8}
From $\frac{\partial F_\theta}{\partial \theta_{dk}} = 0$,
\theta_{dk} = - \sum^{N_d}_{n = 1} \frac{q_{dnk}}{\lambda} \tag{9}
From $\sum^K_{k = 1} \theta_{dk} = 1$,
\begin{align}
& - \sum^K_{k = 1} \sum^{N_d}_{n = 1} \frac{q_{dnk}}{\lambda} = 1 \\
& \lambda = - \sum^K_{k = 1} \sum^{N_d}_{n = 1} q_{dnk} \tag{10}
\end{align}
Substituting equation (10) into equation (9) gives $\theta_{dk}$.
\theta_{dk} = \frac{\displaystyle \sum^{N_d}_{n = 1} q_{dnk}}{\displaystyle \sum^K_{k' = 1} \sum^{N_d}_{n = 1} q_{dnk'}} \tag{11}
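Equation (11) just normalizes the responsibilities summed over the word positions of a document. A minimal sketch, assuming `q_d` is the $(N_d, K)$ responsibility matrix of one document from the E step:

```python
def m_step_theta_document(q_d):
    """Topic distribution theta_d for one document, following equation (11)."""
    counts = q_d.sum(axis=0)        # expected number of words assigned to each topic
    return counts / counts.sum()    # the denominator of (11) equals N_d
```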
To apply the method of Lagrange multipliers, we define $F_\phi$ as follows.
F_\phi \equiv \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log \theta_{dk}\phi_{kw_{dn}} - \sum^D_{d = 1} \sum^{N_d}_{n = 1} \sum^K_{k = 1} q_{dnk} \log q_{dnk} + \lambda \bigl(\sum^V_{v = 1} \phi_{kv} - 1 \bigr) \tag{12}
Partially differentiate $F_\phi$ with respect to $\phi_{kv}$.
\frac{\partial F_\phi}{\partial \phi_{kv}} = \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} \frac{q_{dnk}}{\phi_{kv}} + \lambda \tag{13}
From $\frac{\partial F_\phi}{\partial \phi_{kv}} = 0$,
\phi_{kv} = - \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} \frac{q_{dnk}}{\lambda} \tag{14}
From $\sum^V_{v = 1} \phi_{kv} = 1$,
\begin{align}
& - \sum^V_{v = 1} \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} \frac{q_{dnk}}{\lambda} = 1 \\
& \lambda = - \sum^V_{v = 1} \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} q_{dnk} \tag{15}
\end{align}
Substituting equation (15) into equation (14) gives $\phi_{kv}$.
\phi_{kv} = \frac{\displaystyle \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v} q_{dnk}}{\displaystyle \sum^V_{v' = 1} \sum^D_{d = 1} \sum^{}_{n:w_{dn} = v'} q_{dnk}}
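The numerator of this update is the expected number of times topic $k$ generates vocabulary $v$, summed over all documents. A minimal sketch, assuming `W` is a list of word-index arrays (one per document) and `Q` is a list of the corresponding $(N_d, K)$ responsibility matrices:

```python
import numpy as np

def m_step_phi(W, Q, K, V):
    """Word distributions phi of shape (K, V), following the update above."""
    counts = np.zeros((K, V))
    for w_d, q_d in zip(W, Q):
        for k in range(K):
            # add q[n, k] to counts[k, v] for every position n with w_dn = v
            counts[k] += np.bincount(w_d, weights=q_d[:, k], minlength=V)
    return counts / counts.sum(axis=1, keepdims=True)
```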
Below is a toy implementation of maximum likelihood estimation for the topic model in Python.
```python
import numpy as np

def normalize(ndarray, axis):
    return ndarray / ndarray.sum(axis = axis, keepdims = True)

def normalized_random_array(d0, d1):
    ndarray = np.random.rand(d0, d1)
    return normalize(ndarray, axis = 1)

if __name__ == "__main__":

    # initialize parameters
    D, K, V = 1000, 3, 6
    theta = normalized_random_array(D, K)
    phi = normalized_random_array(K, V)
    theta_est = normalized_random_array(D, K)
    phi_est = normalized_random_array(K, V)

    # cumulative distributions for generating documents
    _theta = np.array([theta[:, :k+1].sum(axis = 1) for k in range(K)]).T
    _phi = np.array([phi[:, :v+1].sum(axis = 1) for v in range(V)]).T

    # generate documents
    W, Z = [], []
    N = np.random.randint(100, 300, D)
    for (d, N_d) in enumerate(N):
        Z.append((np.random.rand(N_d, 1) < _theta[d, :]).argmax(axis = 1))
        W.append((np.random.rand(N_d, 1) < _phi[Z[-1], :]).argmax(axis = 1))

    # estimate parameters
    q = np.zeros((D, np.max(N), K))
    T = 300
    likelihood = np.zeros(T)
    for t in range(T):
        print(t)

        # E step
        for (d, N_d) in enumerate(N):
            q[d, :N_d, :] = normalize(theta_est[d, :] * phi_est[:, W[d]].T, axis = 1)

        # M step
        theta_est[:, :] = normalize(q.sum(axis = 1), axis = 1)

        q_sum = np.zeros((K, V))
        for (d, W_d) in enumerate(W):
            v, index, count = np.unique(W_d, return_index = True, return_counts = True)
            q_sum[:, v[:]] += count[:] * q[d, index[:], :].T
        phi_est[:, :] = normalize(q_sum, axis = 1)

        # likelihood
        for (d, W_d) in enumerate(W):
            likelihood[t] += np.log(theta_est[d, :].dot(phi_est[:, W_d])).sum()

    # perplexity
    perplexity = np.exp(-likelihood[:] / N.sum())
```
The results of running the code above are shown. The parameters are: $D = 1000$ documents, $K = 3$ topics, a vocabulary of $V = 6$ words, document lengths $N_d$ drawn uniformly between 100 and 300, and $T = 300$ EM iterations.
The true word distribution and the estimated word distribution are shown below.
```
phi
[[ 0.06837655  0.2294022   0.29048815  0.01610669  0.31437584  0.08125057]
 [ 0.19902324  0.19283392  0.0251427   0.23391767  0.10622823  0.24285424]
 [ 0.04165762  0.30678289  0.2579833   0.00882557  0.25580282  0.12894781]]

phi_est
[[ 0.04572686  0.20003413  0.33589344  0.00282668  0.36368718  0.0518317 ]
 [ 0.20342652  0.16884355  0.04403442  0.23473554  0.12226066  0.22669932]
 [ 0.05439782  0.37614278  0.18788228  0.01364525  0.18248741  0.18544446]]
```
A graph of log-likelihood is shown. The log-likelihood is calculated by the following formula.
\begin{align}
likelihood &= \sum^D_{d = 1} \log p(\boldsymbol w_d \ | \ M) \\
&= \sum^D_{d = 1} \log \prod^{N_d}_{n = 1} \sum^K_{k = 1} \theta_{dk} \phi_{kw_{dn}}
\end{align}
The graph of perplexity is shown.
perplexity = \exp\Biggl(- \frac{\displaystyle \sum^D_{d = 1} \log p(\boldsymbol w_d \ | \ M)}{\displaystyle \sum^D_{d = 1} N_d}\Biggr)
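For reference, both graphs can be drawn with matplotlib from the `likelihood` array computed in the script above. This is a minimal sketch that assumes it runs in the same session as that script, so `likelihood` and `N` are already defined.

```python
import numpy as np
import matplotlib.pyplot as plt

# log-likelihood and perplexity per EM iteration, from the script above
perplexity = np.exp(-likelihood / N.sum())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(likelihood)
ax1.set(xlabel="iteration", ylabel="log-likelihood")
ax2.plot(perplexity)
ax2.set(xlabel="iteration", ylabel="perplexity")
plt.tight_layout()
plt.show()
```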
I have implemented maximum likelihood estimation of the topic model. Next, I will try **LDA**.