Morphological analysis in Java with Kuromoji

About this page

This page walks through morphological analysis in Java. Because it is a prerequisite for several other articles, I will cover everything up to a basic operation check.

What is morphological analysis?

Morphological analysis is the process of splitting text into morphemes, the smallest meaningful units of a language (roughly, words). It is one of the most common first steps when getting a machine to process natural language.
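To make the idea of "splitting text into units" concrete, here is a minimal sketch of a greedy longest-match segmenter over a tiny hand-made dictionary. This is a toy for illustration only, not how Kuromoji works internally; the class name `ToySegmenter`, the romanized sample text, and the dictionary entries are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: greedy longest-match against a tiny dictionary.
// Real analyzers such as Kuromoji use a full dictionary and a cost model.
class ToySegmenter {
    private final Set<String> dictionary;
    private final int maxLen;

    ToySegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int longest = 1;
        for (String word : dictionary) {
            longest = Math.max(longest, word.length());
        }
        this.maxLen = longest;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Try the longest candidate first and shrink until the dictionary matches
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            // If nothing matched, this falls back to a single character
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Romanized stand-in for a Japanese phrase written without spaces
        Set<String> dict = new HashSet<>(Arrays.asList("keitaiso", "kaiseki", "wo", "suru"));
        System.out.println(new ToySegmenter(dict).segment("keitaisokaisekiwosuru"));
    }
}
```

A greedy matcher like this breaks down quickly on ambiguous input, which is one reason to rely on a dedicated library and dictionary instead.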

Several other terms come up in this article. I will walk through the operation check first and explain each term in the appendix.

Development policy

The plan is to add the Kuromoji library to a Spring Boot and Gradle project. If you are starting from environment setup, please refer to the following. ⇒ "Introduction to Spring Boot ... It's good, so I'm sure!"

Environment | Service / version
Execution environment | Windows 10
Development environment | Eclipse Oxygen.2 Release (4.7.2), Java version
Development language | Java 8
Framework | Spring Boot 2.1.3

Adding Kuromoji to the project

Kuromoji itself appears to be available on Maven Central, but this time I decided to fetch it from the codelibs repository.

Add the following to repositories and dependencies, then run a Gradle refresh to update the dependencies.

build.gradle


plugins {
	id 'org.springframework.boot' version '2.1.3.RELEASE'
	id 'java'
}

apply plugin: 'io.spring.dependency-management'

group = 'com.lab.app.ketman'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '1.8'

repositories {
	mavenCentral()
	// Added: from here
	maven {
		url "http://maven.codelibs.org"
	}
	// Added: up to here
}

dependencies {
	implementation 'org.springframework.boot:spring-boot-starter-thymeleaf'
	implementation 'org.springframework.boot:spring-boot-starter-web'
	implementation 'org.mybatis.spring.boot:mybatis-spring-boot-starter:2.0.0'

	// Added: from here
	implementation 'org.codelibs:lucene-analyzers-kuromoji-ipadic-neologd:7.6.0-20190325'
	// Added: up to here

	runtimeOnly 'org.springframework.boot:spring-boot-devtools'
	runtimeOnly 'org.postgresql:postgresql'

	testImplementation 'org.springframework.boot:spring-boot-starter-test'
}

Printing to the console first

The analysis results are exposed through Attribute objects. Declare a variable for each piece of information you want and read it from there.

Attribute | Overview
CharTermAttribute | Surface form of the token, exactly as it appears in the analyzed sentence
ReadingAttribute | Reading of the morpheme
OffsetAttribute | Character position at which the morpheme appears
PartOfSpeechAttribute | Part-of-speech information
BaseFormAttribute | Base (dictionary) form
InflectionAttribute | Inflection (conjugation) form and type

KuromojiSample.java


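import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
// The packages below assume the codelibs neologd build; adjust them if your artifact differs
import org.codelibs.neologd.ipadic.lucene.analysis.ja.JapaneseTokenizer;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.InflectionAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.codelibs.neologd.ipadic.lucene.analysis.ja.tokenattributes.ReadingAttribute;

public class KuromojiSample {
	// Returns a list of KuromojiEntity (still empty here; this sample only prints each token)
	public List<KuromojiEntity> kuromojineologd(String src){
		List<KuromojiEntity> keList = new ArrayList<KuromojiEntity>();
		// Arguments: no user dictionary, keep punctuation, NORMAL mode
		try(JapaneseTokenizer jt =
				new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.NORMAL)){
			// The same attribute instances are reused for every token, so fetch them once
			CharTermAttribute ct = jt.addAttribute(CharTermAttribute.class);
			ReadingAttribute ra = jt.addAttribute(ReadingAttribute.class);
			OffsetAttribute oa = jt.addAttribute(OffsetAttribute.class);
			PartOfSpeechAttribute posa = jt.addAttribute(PartOfSpeechAttribute.class);
			BaseFormAttribute bfa = jt.addAttribute(BaseFormAttribute.class);
			InflectionAttribute ifa = jt.addAttribute(InflectionAttribute.class);

			jt.setReader(new StringReader(src));
			jt.reset();
			while(jt.incrementToken()){
				// One line per token, fields separated by " | "
				System.out.println(
						ct.toString()
						+ " | " + ra.getReading()
						+ " | " + oa.startOffset()
						+ " | " + posa.getPartOfSpeech()
						+ " | " + bfa.getBaseForm()
						+ " | " + ifa.getInflectionForm()
						+ " | " + ifa.getInflectionType());
			}
			// Signal the end of the stream before the tokenizer is closed
			jt.end();
		} catch (IOException e) {
			e.printStackTrace();
		}
		return keList;
	}
}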
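The sample above returns a `List<KuromojiEntity>`, but the `KuromojiEntity` class itself is not shown on this page. As a rough sketch of what such a holder might look like, assuming it simply carries the seven values printed to the console, something like the following would do. Every field and method name here is hypothetical.

```java
// Hypothetical sketch: the article does not show KuromojiEntity,
// so all field and method names below are assumptions.
class KuromojiEntity {
    private final String surface;        // from CharTermAttribute
    private final String reading;        // from ReadingAttribute
    private final int startOffset;       // from OffsetAttribute
    private final String partOfSpeech;   // from PartOfSpeechAttribute
    private final String baseForm;       // from BaseFormAttribute
    private final String inflectionForm; // from InflectionAttribute
    private final String inflectionType; // from InflectionAttribute

    KuromojiEntity(String surface, String reading, int startOffset,
            String partOfSpeech, String baseForm,
            String inflectionForm, String inflectionType) {
        this.surface = surface;
        this.reading = reading;
        this.startOffset = startOffset;
        this.partOfSpeech = partOfSpeech;
        this.baseForm = baseForm;
        this.inflectionForm = inflectionForm;
        this.inflectionType = inflectionType;
    }

    String getSurface() { return surface; }
    String getReading() { return reading; }
    int getStartOffset() { return startOffset; }
    String getPartOfSpeech() { return partOfSpeech; }
    String getBaseForm() { return baseForm; }
    String getInflectionForm() { return inflectionForm; }
    String getInflectionType() { return inflectionType; }

    // Same " | " separated layout as the console output of the sample
    @Override
    public String toString() {
        return surface + " | " + reading + " | " + startOffset
                + " | " + partOfSpeech + " | " + baseForm
                + " | " + inflectionForm + " | " + inflectionType;
    }
}
```

With a class like this, the loop in the sample could call `keList.add(new KuromojiEntity(...))` with the attribute values for each token instead of only printing them.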

SampleKuromojiController.java


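import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class SampleKuromojiController {
	private final KuromojiSample ks = new KuromojiSample();

	// Accessing /kuromoji runs the analysis and prints the tokens to the console
	@RequestMapping("/kuromoji")
	public String index(Model model) {
		// Shown here in English translation; the original input was a Japanese sentence
		String sentence = "neologd can interpret Yuru-chara as a proper noun.";
		ks.kuromojineologd(sentence);
		return "index";
	}
}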

Result

With neologd's dictionary, the sentence is segmented as follows. The tokens, readings, and tags are shown in translated or romanized form; the original input was Japanese, which is why the offsets do not line up with the English sentence. Notably, the reading of "Yuru-chara" comes back as "Yuru Chara Grand Prix Jiccoui Inkai" (Yuru-chara Grand Prix Executive Committee), which is far more than the surface form.

neologd | Neologdy | 0 | noun-Proper noun-General | NEologd | null | null
You | Kun | 7 | noun-suffix-Personal name | null | null | null
Is | Ha | 8 | Particle-Binding particle | null | null | null
Yuru-chara | Yuru Chara Grand Prix Jiccoui Inkai | 9 | noun-Proper noun-Personal name-General | null | null | null
To | Wo | 14 | Particle-Case particle-General | null | null | null
Proper noun | Koyu Meishi | 15 | noun-General | null | null | null
As | Toshite | 19 | Particle-Case particle-Collocation | null | null | null
Interpretation | Kaishaku | 22 | noun-Suru-verb connection | null | null | null
Can | Deki | 24 | verb-Independent | Can do | Continuative form | Ichidan
Masu | Masu | 26 | Auxiliary verb | null | Base form | Special-masu
。 | 。 | 28 | symbol-Kuten | null | null | null
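The third column above is the value of `oa.startOffset()`. As a small sketch of how to read it: when tokens follow one another with nothing skipped in between, each start offset is simply the running total of the lengths of the preceding tokens. The token lengths used below are inferred from the gaps between consecutive offsets in the output above (the length of the final full stop is assumed to be 1), and the class name `OffsetSketch` is hypothetical.

```java
import java.util.Arrays;

// Sketch: start offsets as running totals of token lengths.
// Assumes the tokens are contiguous (no characters skipped between them).
class OffsetSketch {
    static int[] startOffsets(int[] tokenLengths) {
        int[] offsets = new int[tokenLengths.length];
        int position = 0;
        for (int i = 0; i < tokenLengths.length; i++) {
            offsets[i] = position;       // this token starts where the previous one ended
            position += tokenLengths[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Lengths inferred from the offset column of the console output
        int[] lengths = {7, 1, 1, 5, 1, 4, 3, 2, 2, 2, 1};
        System.out.println(Arrays.toString(startOffsets(lengths)));
    }
}
```

`OffsetAttribute` also provides `endOffset()`, so in real code the span of each token can be read directly rather than computed.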

Appendix

① What is Lucene Analyzer?

Excerpt from 1. Lucene Overview

Lucene is a 100% pure Java, index-based full-text search engine developed by the Jakarta Project. (An index is a lookup structure built for fast search.) Lucene itself is a library, not a complete program, but by using the API it provides you can easily build an easy-to-use full-text search program. Also, because it is written in Java, it can easily be adapted to web applications. Lucene itself cannot analyze Japanese, but Japanese search becomes possible by using a morphological analysis program.

② What is ipadic-neologd?

Natural language evolves day and night, so keeping the information (dictionary) given to machines up to date is an ongoing problem. The idea behind ipadic-neologd is to tackle this problem by crawling the Web. Partial excerpt from neologd / mecab-ipadic-neologd:

mecab-ipadic-NEologd is a system dictionary for MeCab customized by adding new words derived from many language resources on the Web. When analyzing documents on the Web, it is recommended to use this dictionary together with the standard system dictionary (ipadic).

(Omitted)

Advantages
・ Records approximately 3.12 million pairs (including duplicate entries) of surface form (notation) and furigana for words, such as named entities, that MeCab's standard system dictionary cannot segment correctly
・ The dictionary is updated automatically on the development server, at least twice a week (Monday and Thursday)
・ Because it uses language resources on the Web, new named entities can be recorded at update time. The resources currently in use are:
  ・ Dump data of Hatena keyword
  ・ Download zip code data
  …

(Omitted)

Disadvantages
・ Insufficient classification of named entities. For example, some personal names and product names are classified in the same named entity category
・ Words that are not named entities are also registered as named entities
…

③ About the analysis mode setting

In the sample code, JapaneseTokenizer.Mode.NORMAL was passed as an argument. There are also Search and Extended modes, each with the following characteristics.

Excerpt from About Kuromoji

Normal mode: The standard mode. After initialization, morphological analysis is performed in this mode by default.

Search mode: A compound made of several words, such as "Nihon Keizai Shimbun" (the Nikkei newspaper), is split into "Nihon | Keizai | Shimbun" (Japan | economy | newspaper). Combined with a full-text search engine, this is convenient because "Nihon Keizai Shimbun" can also be found by searching for "economy" or "newspaper".

Extended mode: In addition to the Search mode behavior, unknown words are treated as uni-grams. For example, "Mobage" (モバゲー) is split character by character into "Mo | ba | ge | ー". This should reduce the chance of a search missing an unknown word.
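To illustrate only the uni-gram part of that last mode, here is a minimal sketch that splits a word into one token per character. It is not Kuromoji's implementation, and it skips the step of deciding which words are unknown; the class name `UnigramSketch` is hypothetical, and the sample word is written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of uni-gram splitting: one token per character (code point).
// Not Kuromoji's code; it only shows what "treat as uni-gram" means.
class UnigramSketch {
    static List<String> toUnigrams(String word) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int codePoint = word.codePointAt(i);
            // Keep surrogate pairs together by stepping one code point at a time
            tokens.add(new String(Character.toChars(codePoint)));
            i += Character.charCount(codePoint);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Mobage" in katakana, written as Unicode escapes
        String mobage = "\u30E2\u30D0\u30B2\u30FC";
        System.out.println(toUnigrams(mobage));
    }
}
```

Because every character becomes its own token, a query for any part of the unknown word can still match, at the cost of less precise results.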

④ Gradle dependency notation

If you want to get Kuromoji from Maven Central instead, a declaration like the following should work. Note that this artifact provides its own Tokenizer API rather than the Lucene attribute API used above, so the sample code would need to be adapted. Home » com.atilika.kuromoji » kuromoji-ipadic » 0.9.0 (https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji-ipadic/0.9.0)

// https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji-ipadic
implementation group: 'com.atilika.kuromoji', name: 'kuromoji-ipadic', version: '0.9.0'
