Topic Analysis (LDA) in Java

About this page

I wanted to practice JavaFX and self-contained packaging, and I wanted to build a tool I would actually use, so I picked LDA as the subject. This page summarizes how to run LDA in Java. For JavaFX and self-contained packages, see the links at the end of this article.

What is LDA?

LDA is a general-purpose machine learning method used in natural language processing to estimate topics from a given set of documents. As of 2019 deep learning is the popular approach, but before that LDA was a go-to technique in the NLP area for improving accuracy. For those who want to read about the theory, the following page is easy to read and recommended.

Excerpt from Latent Dirichlet Allocation (LDA) Introduction to Yurufuwa

LDA is a type of language model that assumes that a document consists of multiple topics. In Japanese it is called the "latent Dirichlet allocation method". If words are what appear on the surface, topics are latent, because unlike words they do not appear on the surface. It is presumably called "latent Dirichlet allocation" because a Dirichlet distribution is assumed as the prior over the distribution of these latent elements. (Omitted) Roughly speaking, the Dirichlet distribution is a probability distribution over probability distributions. For example, with three topics, "sports", "economy", and "politics", it determines the probability of each probability distribution: say, 0.1 for the topic generation probabilities (sports, economy, politics) = (0.3, 0.2, 0.5), and 0.2 for (sports, economy, politics) = (0.1, 0.2, 0.7).
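To make the "distribution over distributions" idea in the excerpt a little more concrete, here is a minimal sketch that evaluates a Dirichlet density at the two topic mixes mentioned above. The class name DirichletDensity and the parameter choice alpha = (2, 2, 2) are assumptions made purely for illustration, and the sketch only accepts positive integer alpha values so that the gamma function reduces to a factorial. The results are densities, so they are not expected to match the illustrative probabilities quoted in the excerpt.

```java
class DirichletDensity {

    // Gamma(n) for a positive integer n, which equals (n - 1)!
    static double gammaOfInt(int n) {
        double result = 1.0;
        for (int i = 2; i < n; i++) {
            result *= i;
        }
        return result;
    }

    // Dirichlet density at theta (non-negative entries summing to 1)
    // for positive integer parameters alpha.
    static double density(double[] theta, int[] alpha) {
        int alphaSum = 0;
        double normalizer = 1.0; // product of Gamma(alpha_k)
        double kernel = 1.0;     // product of theta_k^(alpha_k - 1)
        for (int k = 0; k < theta.length; k++) {
            alphaSum += alpha[k];
            normalizer *= gammaOfInt(alpha[k]);
            kernel *= Math.pow(theta[k], alpha[k] - 1);
        }
        return gammaOfInt(alphaSum) / normalizer * kernel;
    }

    public static void main(String[] args) {
        int[] alpha = {2, 2, 2}; // assumed for illustration only
        // Topic order: sports, economy, politics
        System.out.println(density(new double[] {0.3, 0.2, 0.5}, alpha));
        System.out.println(density(new double[] {0.1, 0.2, 0.7}, alpha));
    }
}
```

With this assumed alpha, the first mix gets a density of about 3.6 and the second about 1.68, so the more balanced topic mix is considered more likely. A different alpha would change that ranking.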

Implementation policy

For implementing LDA, the Python library gensim seems to be the well-known choice. Reference: Introduction to gensim

To keep the later JavaFX app simple, I did not want to bridge Python and Java. A Java implementation of LDA was published on GitHub, so I decided to borrow it. Thanks.

Searching for LDA4j turned up two projects, and this time I adopted the module by hankcs. The choice was mostly a matter of taste. (The input format differs: breakbee/LDA4J takes a single file with one document per line, whereas hankcs/LDA4j takes multiple files with one document per file. I personally preferred the per-file format.)
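If your documents are already in the one-document-per-line format, a small converter can turn them into the one-file-per-document layout. The following is only a sketch: the class name CorpusSplitter, the docNNNNN.txt naming scheme, and the example paths are all hypothetical and are not part of either LDA4j project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

class CorpusSplitter {

    // Writes each non-blank line of `input` to its own file under `outputDir`
    // and returns the number of files written.
    static int split(Path input, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        List<String> lines = Files.readAllLines(input, StandardCharsets.UTF_8);
        int count = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            // Hypothetical naming scheme: doc00000.txt, doc00001.txt, ...
            Path out = outputDir.resolve(String.format("doc%05d.txt", count));
            Files.write(out, Collections.singletonList(line), StandardCharsets.UTF_8);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Both paths are placeholders; replace them with your own.
        int written = split(Paths.get("corpus.txt"), Paths.get("data/split"));
        System.out.println(written + " documents written");
    }
}
```

The resulting folder could then be given to a loader that expects one document per file.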

Let's run it

Environment              Service / version
Execution environment    Windows 10
Development environment  Eclipse 4.1.0
Development language     Java 8

Bring the module into Eclipse in whatever way suits you: fork and clone it, or download the ZIP from GitHub, and then import the project. After that, I created my own runner class (MainRunner.java).


The code follows the README as is.

MainRunner.java


package com.ketman.app;

import java.io.IOException;
import java.util.Map;

import com.hankcs.lda.Corpus;
import com.hankcs.lda.LdaGibbsSampler;
import com.hankcs.lda.LdaUtil;

public class MainRunner {
	public static void main(String[] args)
	{
		try {
			// 1. Load the corpus from disk (the documents stored under data/mini)
			Corpus corpus = Corpus.load("data/mini");
			// 2. Create an LDA sampler from the documents and the vocabulary size
			LdaGibbsSampler ldaGibbsSampler = new LdaGibbsSampler(corpus.getDocument(), corpus.getVocabularySize());
			// 3. Train it, estimating 10 topics
			ldaGibbsSampler.gibbs(10);
			// 4. The phi matrix is the LDA model; LdaUtil can be used to explain it.
			double[][] phi = ldaGibbsSampler.getPhi();
			// Keep the top 10 words of each topic and print them
			Map<String, Double>[] topicMap = LdaUtil.translate(phi, corpus.getVocabulary(), 10);
			LdaUtil.explain(topicMap);
		} catch (IOException e) {
			// The corpus could not be read; print the stack trace
			e.printStackTrace();
		}
	}
}
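For a rough picture of what a Gibbs sampler for LDA does during training, here is a minimal collapsed Gibbs sampling sketch. It is not the hankcs/LDA4j code: the class TinyLdaGibbs, its field names, and the alpha and beta smoothing parameters are assumptions for illustration, documents are given as arrays of word ids, and phi is read from the final state with no burn-in or sample averaging.

```java
import java.util.Random;

class TinyLdaGibbs {
    private final int[][] docs;      // docs[d][i] = id of the i-th word of document d
    private final int vocabSize;
    private final int numTopics;
    private final double alpha;      // document-topic smoothing (assumed value)
    private final double beta;       // topic-word smoothing (assumed value)
    private final int[][] z;         // z[d][i] = topic currently assigned to that word
    private final int[][] wordTopic; // wordTopic[w][k] = times word w is assigned to topic k
    private final int[][] docTopic;  // docTopic[d][k] = words of document d assigned to topic k
    private final int[] topicTotal;  // topicTotal[k] = all words assigned to topic k
    private final Random random;

    TinyLdaGibbs(int[][] docs, int vocabSize, int numTopics, double alpha, double beta, long seed) {
        this.docs = docs;
        this.vocabSize = vocabSize;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.z = new int[docs.length][];
        this.wordTopic = new int[vocabSize][numTopics];
        this.docTopic = new int[docs.length][numTopics];
        this.topicTotal = new int[numTopics];
        this.random = new Random(seed);
        // Start from a random topic for every word occurrence
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = random.nextInt(numTopics);
                update(d, i, +1);
            }
        }
    }

    // Adds or removes the current assignment of word i in document d from the counts
    private void update(int d, int i, int delta) {
        int w = docs[d][i];
        int k = z[d][i];
        wordTopic[w][k] += delta;
        docTopic[d][k] += delta;
        topicTotal[k] += delta;
    }

    void run(int iterations) {
        double[] p = new double[numTopics];
        for (int it = 0; it < iterations; it++) {
            for (int d = 0; d < docs.length; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    update(d, i, -1); // take this word out of the counts
                    int w = docs[d][i];
                    double total = 0.0;
                    for (int k = 0; k < numTopics; k++) {
                        // Unnormalized probability of topic k given all other assignments
                        p[k] = (wordTopic[w][k] + beta) / (topicTotal[k] + vocabSize * beta)
                                * (docTopic[d][k] + alpha);
                        total += p[k];
                    }
                    // Draw the new topic in proportion to p
                    double u = random.nextDouble() * total;
                    int chosen = 0;
                    double acc = p[0];
                    while (u >= acc && chosen < numTopics - 1) {
                        chosen++;
                        acc += p[chosen];
                    }
                    z[d][i] = chosen;
                    update(d, i, +1); // put the word back with its new topic
                }
            }
        }
    }

    // phi[k][w] = estimated probability of word w within topic k
    double[][] phi() {
        double[][] phi = new double[numTopics][vocabSize];
        for (int k = 0; k < numTopics; k++) {
            for (int w = 0; w < vocabSize; w++) {
                phi[k][w] = (wordTopic[w][k] + beta) / (topicTotal[k] + vocabSize * beta);
            }
        }
        return phi;
    }
}
```

Each row of the returned phi sums to 1, since it is a probability distribution over the vocabulary for one topic.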

Try running MainRunner from Run ⇒ Run Configurations ⇒ Java Application. You should see the following output on the console. It estimates the specified number (10) of topics for the set of documents stored in data/mini.

Sampling 1000 iterations with burn-in of 100 (B/S=20).
BBBBB|S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||S||
topic 0 :
China=0.0097164123524064
Market=0.007268178259268298
Business=0.006646897977003122
Company=0.006420165848545306
Exhibition=0.005931172520179485
Tourism=0.005517115761050293
Imminent=0.004144655174798414
Reporter=0.003896963247878764
Products=0.0038405773231741857
Service=0.0036131627315211285

topic 1 :
Beautiful country=0.007753386939328633
Japan=0.004271883755069139
Training=0.0039382838929572965
Systematic=0.0038821627109404673
Airplane=0.0037908977218186262
Troop=0.003713327985408122
Military=0.003662570207063461
Advance=0.003548971364140448
Creation=0.003465095755923189
Equipment=0.0033491792847693187

~ Omitted ~

topic 9 :
Hirai=0.00887335526016362
队员=0.003820752808354389
联赛=0.0034088636107220934
Ball=0.0030593385176732896
Club=0.002519739439727434
Crown=0.0025101075962186965
China=0.002314435002019442
Ball=0.0023066510579788685
赛=0.002282312176369107
Reporter=0.0022029528425211455
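Each topic in the output above is a list of words ordered by probability. Conceptually, such a list can be obtained by sorting one row of the phi matrix and keeping the highest entries. The sketch below illustrates that idea only; the class TopWords, its method, and the sample vocabulary and numbers are made up for illustration and are not LdaUtil's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopWords {

    // Returns the `limit` most probable words of one topic, highest probability first.
    static Map<String, Double> topWords(double[] topicRow, String[] vocabulary, int limit) {
        List<Integer> ids = new ArrayList<>();
        for (int w = 0; w < topicRow.length; w++) {
            ids.add(w);
        }
        // Sort word ids by descending probability
        ids.sort((a, b) -> Double.compare(topicRow[b], topicRow[a]));
        // LinkedHashMap keeps the sorted order when iterating
        Map<String, Double> result = new LinkedHashMap<>();
        for (int r = 0; r < Math.min(limit, ids.size()); r++) {
            int w = ids.get(r);
            result.put(vocabulary[w], topicRow[w]);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] vocabulary = {"market", "company", "league", "club"}; // made-up vocabulary
        double[] topicRow = {0.4, 0.3, 0.2, 0.1};                      // made-up probabilities
        topWords(topicRow, vocabulary, 2)
                .forEach((word, p) -> System.out.println(word + "=" + p));
    }
}
```

Running main prints market=0.4 followed by company=0.3, the same word=probability layout as the console output.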

How I actually used this as a tool
