tomotopy is a Python library, named after "TOpic MOdeling TOol", that mainly handles LDA (Latent Dirichlet Allocation) and its derived algorithms.
It is easier to use than gensim, a library with similar functionality, and it is faster because it is implemented in C++.
Installation is a single pip command:

```
pip install tomotopy
```
As an example, use the following dataset from the gensim tutorial.
```
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
```
Using LDA with tomotopy looks like the following.
We use the dataset after preprocessing (the preprocessing is the same as in [this gensim tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py)).
```python
import tomotopy as tp
from pprint import pprint

texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# Model initialization (k is the number of topics)
model = tp.LDAModel(k=2, seed=1)

# Creating the corpus
for text in texts:
    model.add_doc(text)

# Training
model.train(iter=100)

# Extracting the word distribution of each topic
for k in range(model.k):
    print(f"Topic {k}")
    pprint(model.get_topic_words(k, top_n=5))

"""output
Topic 0
[('system', 0.20972803235054016),
 ('user', 0.15742677450180054),
 ('human', 0.10512551665306091),
 ('interface', 0.10512551665306091),
 ('computer', 0.10512551665306091)]
Topic 1
[('trees', 0.2974308431148529),
 ('graph', 0.2974308431148529),
 ('survey', 0.1986166089773178),
 ('minors', 0.1986166089773178),
 ('system', 0.0009881423320621252)]
"""
```
Most of what you typically want to do with LDA can be done simply by setting arguments on the model constructor and the training function
(parallelization, TF-IDF term weighting, upper and lower limits on word frequency and document frequency, etc.), as in the sketch below.
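For instance, something like the following should cover those options (a sketch based on the constructor and `train` arguments in the tomotopy docs; a `min_df` argument for document-frequency limits also exists in recent versions):

```python
# Reusing `texts` from the example above.
# Term weighting and vocabulary pruning are constructor arguments:
model = tp.LDAModel(
    k=2,
    tw=tp.TermWeight.IDF,  # term weighting: ONE (default), IDF, or PMI
    min_cf=2,              # drop words with collection frequency < 2
    rm_top=1,              # remove the N most frequent words entirely
    seed=1,
)
for text in texts:
    model.add_doc(text)

# Parallelization is a train() argument; workers=0 means "all cores"
model.train(iter=100, workers=0)
```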
- The training algorithm is sampling (collapsed Gibbs sampling).
gensim uses variational inference, but sampling is said to be more accurate.
The drawback of sampling is that it takes time,
but since tomotopy is implemented in C++ and parallelizes easily, it is much faster than even MALLET.
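Because training is iterative sampling, one common pattern is to train in chunks and watch the per-word log likelihood to judge convergence. A sketch using the documented `ll_per_word` attribute, continuing with the `model` above:

```python
# Train in chunks and monitor convergence of the Gibbs sampler:
# when the per-word log likelihood plateaus, more iterations add little.
for i in range(0, 1000, 100):
    model.train(iter=100, workers=0)
    print(f"iter {i + 100}: ll per word = {model.ll_per_word:.4f}")
```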
- LDA derivatives are available.
The following models are provided (see the HDP sketch after this list):
  - LLDAModel
  - PLDAModel
  - SLDAModel
  - DMRModel
  - GDMRModel
  - HDPModel
  - HLDAModel
  - MGLDAModel
  - PAModel
  - HPAModel
  - CTModel
  - DTModel
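As one example, HDPModel infers the number of topics from the data instead of fixing k up front. A rough sketch using `initial_k`, `live_k`, and `is_live_topic` from the tomotopy docs, again reusing `texts`:

```python
# HDP grows/shrinks the topic count during sampling
hdp = tp.HDPModel(initial_k=2, seed=1)
for text in texts:
    hdp.add_doc(text)
hdp.train(iter=1000)

# hdp.k counts every topic slot ever allocated; only "live" ones are real
print(f"live topics: {hdp.live_k} / {hdp.k}")
for k in range(hdp.k):
    if hdp.is_live_topic(k):
        print(k, hdp.get_topic_words(k, top_n=5))
```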
- It doesn't always scratch where it itches.
Perhaps because tomotopy puts ease of use first, I sometimes find myself asking, "Wait, can't it do this?"
For example:
  - ~~A preprocessed corpus cannot be reused (you have to rebuild the corpus every time you train).~~ On closer inspection, this seems possible with the tomotopy.utils.Corpus class (see the sketch below the list). When I tried it, though, it was disappointingly expensive in both time and RAM.
  - There is no way to keep RAM usage down.

(Well, unless your dataset is on the order of 10 million records, neither of these bothers me much.)
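For reference, here is roughly what the tomotopy.utils.Corpus route looks like; `save`/`load` and the `corpus=` constructor argument are in the docs, but treat this as a sketch and check against your tomotopy version:

```python
import tomotopy as tp

corpus = tp.utils.Corpus()
for text in texts:           # `texts` as defined above
    corpus.add_doc(text)
corpus.save('corpus.cps')    # persist the preprocessed corpus

# Later: reload it and hand it to any model via the corpus= argument
loaded = tp.utils.Corpus.load('corpus.cps')
model = tp.LDAModel(k=2, seed=1, corpus=loaded)
model.train(iter=100)
```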
With tomotopy, you can train sampling-based LDA models very easily.
Honestly, I can't go back to gensim anymore.