[JAVA] Chinese morphological analysis like MeCab with FNLP

Overview

I want to analyze Chinese text the way MeCab performs morphological analysis on Japanese, so I use FNLP.

Environment

OS: Windows 7 64-bit
Language: Java 8
IDE: Eclipse 4.8.0

Purpose

The goal is the same as in "English morphological analysis like MeCab with OpenNLP".

Running a Japanese sentence through MeCab yields its morphemes, parts of speech, and base forms. I want to obtain the same information for Chinese. This article uses the open-source Fudan NLP (FNLP) to obtain the morphemes and parts of speech from Chinese text.

Table of contents

  1. Prior knowledge of Chinese
  2. Java implementation
     1. Preparation
     2. Word-separation
     3. Part of speech decomposition

1. Prior knowledge of Chinese

Two types of characters

Chinese is written with two kinds of characters, simplified and traditional. This article deals only with simplified Chinese sentences.

There is no tense

Chinese words do not change form to mark tense. This article therefore assumes that every Chinese morpheme is already in its base form.
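The assumption above can be sketched as a small result type. This is only an illustration: the class name Morpheme, its fields, and the sample word and label are hypothetical and are not part of FNLP. The point is that the base form is simply copied from the surface form.

```java
/** Sketch of a MeCab-style result entry for Chinese (hypothetical type, not part of FNLP). */
final class Morpheme {
    final String surface;   // the word as it appears in the sentence
    final String pos;       // part-of-speech label
    final String baseForm;  // dictionary form

    Morpheme(String surface, String pos) {
        this.surface = surface;
        this.pos = pos;
        // Assumption from this article: no inflection, so the base form equals the surface form
        this.baseForm = surface;
    }

    @Override
    public String toString() {
        // Tab between the surface and the features, roughly in the spirit of MeCab output
        return surface + "\t" + pos + "," + baseForm;
    }
}

final class MorphemeSketch {
    public static void main(String[] args) {
        // Hand-written sample values, not real analyzer output
        Morpheme m = new Morpheme("天气", "名词");
        System.out.println(m);
    }
}
```

If a later analysis step ever needs a different base form, only the constructor has to change.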

2. Java implementation

1. Preparation

Specifying fnlp-core from the Maven repository directly in pom.xml causes an error, so build FNLP from source once to create fnlp-core-2.1-SNAPSHOT.jar.

Create a Maven project and place the resulting fnlp-core-2.1-SNAPSHOT.jar in a dic folder directly under the project root.

Add the following dependencies to pom.xml:

<dependency>
    <groupId>net.sf.trove4j</groupId>
    <artifactId>trove4j</artifactId>
    <version>3.0.3</version>
</dependency>
<dependency>
    <groupId>commons-cli</groupId>
    <artifactId>commons-cli</artifactId>
    <version>1.2</version>
</dependency>
<!-- fnlp-core is declared once, pointing at the locally built jar in the dic folder -->
<dependency>
    <groupId>org.fnlp</groupId>
    <artifactId>core</artifactId>
    <version>2.1</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/dic/fnlp-core-2.1-SNAPSHOT.jar</systemPath>
</dependency>

Also, download the three model files (pos.m, seg.m, dep.m) published at https://github.com/xpqiu/fnlp/releases and place them in the dic folder.
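Before loading the models, it can help to check that all three files are really in place. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and the ./dic path is the folder used in this article.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class ModelFileCheck {
    // The three model files named in this article
    static final String[] MODEL_FILES = {"seg.m", "pos.m", "dep.m"};

    /** Returns the names of the model files that are missing from dir. */
    static List<String> missingModels(Path dir) {
        List<String> missing = new ArrayList<>();
        for (String name : MODEL_FILES) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingModels(Paths.get("./dic"));
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing model files: " + missing);
        }
    }
}
```

Running this once from the project root gives a clearer message than a model-loading failure later on.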

2. Word-separation

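import java.util.Arrays;
import org.fnlp.nlp.cn.CNFactory;
import org.fnlp.util.exception.LoadModelException;

CNFactory factory;
// Specify the folder that holds the model files and create the analyzer
try {
    factory = CNFactory.getInstance("./dic");
} catch (LoadModelException lme) {
    // Nothing can be analyzed without the models, so stop instead of continuing with null
    throw new IllegalStateException("Could not load the FNLP models from ./dic", lme);
}
String message = "今天天气真好啊!"; // "The weather is really nice today!"
// tag() returns two rows: tokens[0] holds the words, tokens[1] the parts of speech
String[][] tokens = factory.tag(message);
System.out.println(Arrays.asList(tokens[0]));
>> [今天, 天气, 真, 好, 啊, !]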
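Word-separated text is usually written out with a space between the words. A minimal sketch of that step follows. To keep it runnable without FNLP, the array is hand-written sample data in the same shape as the tag() result used above (row 0 holds the words, row 1 the labels); the class and method names are hypothetical.

```java
final class SpaceSeparatedSketch {
    /** Joins the word row (index 0) of a tag()-style result with single spaces. */
    static String toSpaceSeparated(String[][] tagged) {
        return String.join(" ", tagged[0]);
    }

    public static void main(String[] args) {
        // Hand-written sample data, not real analyzer output
        String[][] tagged = {
            {"今天", "天气", "真", "好", "啊", "!"},
            {"Time phrase", "Noun", "Adverb", "Predicate", "Interjection", "Punctuation"}
        };
        System.out.println(toSpaceSeparated(tagged)); // prints: 今天 天气 真 好 啊 !
    }
}
```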

3. Part of speech decomposition

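import java.util.Arrays;
import org.fnlp.nlp.cn.CNFactory;
import org.fnlp.util.exception.LoadModelException;

CNFactory factory;
// Specify the folder that holds the model files and create the analyzer
try {
    factory = CNFactory.getInstance("./dic");
} catch (LoadModelException lme) {
    // Nothing can be analyzed without the models, so stop instead of continuing with null
    throw new IllegalStateException("Could not load the FNLP models from ./dic", lme);
}
String message = "今天天气真好啊!"; // "The weather is really nice today!"
String[][] tokens = factory.tag(message);
// tokens[1] holds one part-of-speech label per word in tokens[0]
System.out.println(Arrays.asList(tokens[1]));
>> [Time phrase, Noun, Adverb, Predicate, Interjection, Punctuation]   (labels rendered in English here)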
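Because the words and their parts of speech come back as two parallel rows, it is often convenient to pair them up. The sketch below does that with the standard library only. The array is hand-written sample data in the shape of the tag() result (words in row 0, labels in row 1), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class WordTagPairs {
    /** Pairs each word in row 0 with the label at the same index in row 1, as "word/label". */
    static List<String> pair(String[][] tagged) {
        String[] words = tagged[0];
        String[] tags = tagged[1];
        if (words.length != tags.length) {
            throw new IllegalArgumentException("word and label rows differ in length");
        }
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            pairs.add(words[i] + "/" + tags[i]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hand-written sample data, not real analyzer output
        String[][] tagged = {
            {"今天", "天气", "真", "好", "啊", "!"},
            {"Time phrase", "Noun", "Adverb", "Predicate", "Interjection", "Punctuation"}
        };
        for (String p : pair(tagged)) {
            System.out.println(p);
        }
    }
}
```

The length check guards against pairing a word with the wrong label if the two rows ever get out of step.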
