Manipulate topic models ~ Interactive Topic Model ~

Implementation of Interactive Topic Model and its results


In natural language processing, a topic model is a method for extracting the content (topics) of a set of documents.

Among topic models, the Interactive Topic Model is a method for intentionally manipulating which words appear in a topic.

In this article, we implement the Interactive Topic Model and verify its effect.


Topic model

A topic model is a method that estimates two things from a set of documents: the topic distribution $\theta$, which is the probability that each topic appears (for example, a newspaper article may contain topics such as politics or sports), and the word distribution $\phi$, which describes how likely each word is to appear within a topic.

For an accessible explanation of topic models, please refer to introductory materials on the subject.

Latent Dirichlet Allocation(LDA)

Of the various topic models, Latent Dirichlet Allocation (LDA) is the best known.

LDA assumes that a single document (a newspaper article) contains multiple topics (politics, news, etc.), and that each topic has its own word distribution.

The graphical model is shown in the figure below.

[Figure: LDA.png]

Here, $\theta$ is the topic distribution, $\phi$ is the word distribution, $z$ is the topic assigned to a word in the document, $v$ is a word in the document, $N$ is the number of words in the document, $D$ is the number of documents, $K$ is the number of topics, and $\alpha$ and $\beta$ are hyperparameters.
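To make the roles of $\theta$, $\phi$, and $z$ concrete, here is a minimal sketch of the generative process that the graphical model describes. It uses only the Python standard library; the function names `sample_dirichlet` and `generate_corpus` are hypothetical helpers introduced for illustration, and every document is assumed to have the same length `N` for simplicity.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(D, K, V, N, alpha, beta, seed=0):
    """Generate D documents of N word ids each under the LDA assumptions."""
    rng = random.Random(seed)
    # One word distribution phi_k per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, V, rng) for _ in range(K)]
    docs, thetas = [], []
    for _ in range(D):
        # One topic distribution theta_d per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, K, rng)
        words = []
        for _ in range(N):
            z = rng.choices(range(K), weights=theta)[0]   # topic of this word
            v = rng.choices(range(V), weights=phi[z])[0]  # word drawn from that topic
            words.append(v)
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi


docs, thetas, phi = generate_corpus(D=3, K=2, V=5, N=4, alpha=0.5, beta=0.5)
```

Estimating LDA means going in the opposite direction: only `docs` is observed, and $\theta$, $\phi$, and $z$ have to be inferred.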

LDA can be estimated with Gibbs sampling or variational Bayes, but collapsed Gibbs sampling (CGS) is probably the best-known approach.

The pseudocode for estimating LDA with collapsed Gibbs sampling is as follows.

N_dk = 0  # Number of words in document d assigned to topic k
N_kv = 0  # Number of times word v is assigned to topic k
N_k = 0   # Number of words assigned to topic k
d = 1, …, D  # Document index
k = 1, …, K  # Topic index
v = 1, …, V  # Vocabulary index

initialize(z_dn)  # Randomly initialize the topic of the n-th word in document d and add it to the counts

repeat
  for d = 1, …, D do
    for n = 1, …, N_d do  # N_d is the number of words in document d

      N_dk[d][z_dn] -= 1  # Remove the current assignment from the counts
      N_kv[z_dn][w_dn] -= 1
      N_k[z_dn] -= 1

      for k = 1, …, K do
        compute p(z_dn = k)  # Probability that topic k is assigned to the n-th word in document d

      z_dn ~ Categorical(p(z_dn))  # Sample a new topic for z_dn

      N_dk[d][z_dn] += 1  # Add the newly assigned topic to the counts
      N_kv[z_dn][w_dn] += 1
      N_k[z_dn] += 1

until the end condition is met

The probability $p(z_{dn} = k)$ in the pseudocode is calculated as follows.

$$
p(z_{dn}=k) \propto (N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V}
$$
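The pseudocode and the sampling formula can be turned into a small runnable sketch. The following is not the code used in this article, only a minimal standard-library illustration; documents are assumed to be lists of integer word ids in `range(V)`, and the function name `lda_collapsed_gibbs` is hypothetical.

```python
import random


def lda_collapsed_gibbs(docs, K, V, alpha, beta, iterations, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    N_dk = [[0] * K for _ in docs]      # words in document d assigned to topic k
    N_kv = [[0] * V for _ in range(K)]  # times word v is assigned to topic k
    N_k = [0] * K                       # words assigned to topic k

    # Random initialization of every z_dn, recorded in the counts.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            N_dk[d][k] += 1
            N_kv[k][w] += 1
            N_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k_old = z[d][n]
                N_dk[d][k_old] -= 1
                N_kv[k_old][w] -= 1
                N_k[k_old] -= 1

                # Unnormalized p(z_dn = k) for every topic k.
                weights = [
                    (N_dk[d][k] + alpha) * (N_kv[k][w] + beta) / (N_k[k] + beta * V)
                    for k in range(K)
                ]

                # Sample the new topic and add it back to the counts.
                k_new = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k_new
                N_dk[d][k_new] += 1
                N_kv[k_new][w] += 1
                N_k[k_new] += 1

    return z, N_dk, N_kv, N_k


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 0, 1]]
z, N_dk, N_kv, N_k = lda_collapsed_gibbs(toy_docs, K=2, V=4, alpha=0.1, beta=0.01, iterations=20)
```

Here the end condition is simply a fixed number of iterations. `random.choices` normalizes the weights internally, which is why the proportional form of the formula is sufficient.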

Interactive Topic Model(ITM)

When estimating topics with LDA, you sometimes want certain words to end up in the same topic.

You can address this by adding a constraint that word A and word B should come from the same topic.

That is the Interactive Topic Model (ITM).

Intuitively, ITM treats the constrained words as a single word and distributes the occurrence probability of that combined word evenly among them, which makes it easier for the constrained words to appear in the same topic.

[Figure: スクリーンショット 2017-02-16 16.12.51.png]

The calculation is simple: the formula for $p(z_{dn} = k)$ in LDA is rewritten as follows.

$$
p(z_{dn} = k) \propto \begin{cases}
(N_{dk}+\alpha)\frac{N_{kw_{dn}}+\beta}{N_k+\beta V} & (w_{dn} \notin \Omega)\\
(N_{dk}+\alpha)\frac{N_{k\Omega}+|\Omega|\beta}{N_k+\beta V}\frac{W_{k\Omega w_{dn}}+\eta}{W_{k \Omega} + |\Omega|\eta} & (w_{dn} \in \Omega)
\end{cases}
$$


Here, $\Omega$ is a constraint, $|\Omega|$ is the number of words contained in the constraint, $N_{k\Omega}$ is the number of times words in constraint $\Omega$ appear in topic $k$, and $W_{k\Omega w_{dn}}$ is the number of times the word $w_{dn}$ in constraint $\Omega$ appears when assigned to topic $k$. $W_{k\Omega}$ is presumably the sum of $W_{k\Omega w}$ over all words $w$ in $\Omega$.

In other words, if the word $w_{dn}$ is not included in the constraint $\Omega$, the same formula as in LDA is used; if it is included, $p(z_{dn} = k)$ is calculated with the second formula.
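The two cases of the ITM formula can be sketched as a single function. This is only an illustration under stated assumptions, not the code used in this article: there is a single constraint given as a set of word ids, $W_{k\Omega w}$ is taken to be the ordinary count `N_kv[k][w]` for words in the constraint, and $W_{k\Omega}$ is taken to equal $N_{k\Omega}$. The name `itm_topic_weights` is hypothetical.

```python
def itm_topic_weights(w, d, N_dk, N_kv, N_k, constraint, alpha, beta, eta, V):
    """Unnormalized p(z_dn = k) for word id w in document d under one constraint."""
    weights = []
    for k in range(len(N_k)):
        doc_term = N_dk[d][k] + alpha
        if w not in constraint:
            # Same as plain LDA.
            word_term = (N_kv[k][w] + beta) / (N_k[k] + beta * V)
        else:
            # N_k_omega: how often any constrained word is assigned to topic k.
            n_k_omega = sum(N_kv[k][u] for u in constraint)
            size = len(constraint)
            # Probability of the constraint as a whole under topic k ...
            constraint_term = (n_k_omega + size * beta) / (N_k[k] + beta * V)
            # ... times the probability of this word within the constraint.
            within_term = (N_kv[k][w] + eta) / (n_k_omega + size * eta)
            word_term = constraint_term * within_term
        weights.append(doc_term * word_term)
    return weights


# Toy counts: one document, two topics, three vocabulary words.
N_dk = [[2, 1]]
N_kv = [[3, 0, 1], [0, 2, 1]]
N_k = [4, 3]
constraint = {0, 1}  # words 0 and 1 should share a topic

w0 = itm_topic_weights(0, 0, N_dk, N_kv, N_k, constraint, 0.1, 0.01, 100, 3)
w1 = itm_topic_weights(1, 0, N_dk, N_kv, N_k, constraint, 0.1, 0.01, 100, 3)
```

With a large $\eta$ such as 100, the within-constraint term is close to $1/|\Omega|$ for every constrained word, so word 1 receives almost the same weight for topic 0 as word 0 even though it has never been assigned to that topic. That is the even distribution described above.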


In the experiment, the accuracy of ITM will be verified.

Dataset

The livedoor corpus was used as the dataset.


ITM code here

Experimental results

The number of topics was fixed at $K = 10$, with $\alpha = 0.1$, $\beta = 0.01$, and $\eta = 100$.

First, we ran 50 iterations without constraints (that is, the same as ordinary LDA).

The table below shows the top words that appear in each topic.

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| function | App | Female | the work | Japan |
| Release | powered by | golf | movies | update |
| update | screen | myself | 153 | Relation |
| article | smartphone | male | 181 | world |
| Use | Presentation | marriage | directed by | Popular |
| Digi | Correspondence | Many | Release | movies |
| Relation | Max Co., Ltd. | Opponent | 3 | http:// |
| smartphone | Android | jobs | 96 | myself |
| software | display | Christmas | 13 | topic |
| user | year 2012 | Girls | Book | Wow |

Even at this stage, some recognizable topics are visible.
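A table of top words like this one can be derived from the topic-word counts. The sketch below assumes a count matrix `N_kv` (one row per topic) and a list `vocabulary` mapping word ids to strings; both names, and the function `top_words`, are hypothetical and not taken from the article's code.

```python
def top_words(N_kv, vocabulary, n=10):
    """Return the n most frequently assigned words for each topic."""
    result = []
    for counts in N_kv:
        # Rank word ids by how often they are assigned to this topic.
        ranked = sorted(range(len(counts)), key=lambda v: counts[v], reverse=True)
        result.append([vocabulary[v] for v in ranked[:n]])
    return result


toy_counts = [[5, 1, 3], [0, 4, 2]]
toy_vocabulary = ["movie", "app", "release"]
print(top_words(toy_counts, toy_vocabulary, n=2))
```

Within a single topic, ranking by raw count gives the same order as ranking by the smoothed estimate $(N_{kv}+\beta)/(N_k+\beta V)$, because the smoothing adds the same constant to every word.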

Next, we constrain the words shown in blue in the table so that they belong to the same topic, and run another 50 iterations.

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| Release | App | Female | 153 | movies |
| Use | powered by | myself | 181 | Japan |
| function | smartphone | golf | 3 | the work |
| update | Presentation | male | 96 | Release |
| article | Correspondence | marriage | 13 | world |
| service | Ma | Many | 552 | directed by |
| Relation | Max Co., Ltd. | Opponent | 144 | Relation |
| Digi | Android | jobs | 310 | Special feature |
| software | display | Christmas | 98 | http:// |
| information | year 2012 | Good | Hero | Wow |

The constrained words shown in blue now appear together in Topic 5.


Using the Interactive Topic Model (ITM), we placed constraints on words and estimated the topic distributions.

At first glance the results posted here look good, but in reality they are the outcome of a lot of trial and error.

There is also a journal version of the ITM paper, so implementing that version may give more accurate results.
