Overview

I want to run the kind of morphological analysis that MeCab provides for Japanese on Chinese text, so I use FNLP.

Environment

OS: Windows 7 64-bit
Language: Java 8
IDE: Eclipse 4.8.0

Purpose

The goal is the same as in the article "English morphological analysis like MeCab with OpenNLP".

I want to obtain for Chinese the "morphemes", "parts of speech", and "base forms" that MeCab produces for Japanese sentences. This article uses the open-source Fudan NLP (FNLP) to extract "morphemes" and "parts of speech" from Chinese text.

1. Prior knowledge of Chinese
2. Java implementation
    1. Preparation
    2. Word-separation
    3. Part of speech decomposition

1. Prior knowledge of Chinese

Two types of characters

Simplified and Traditional Chinese

In this article, we will limit ourselves to simplified Chinese sentences.

Verbs are not inflected for tense

Past, present, and future are judged from context, typically from time words such as "yesterday" or "tomorrow", while the verb itself stays the same.

Past: "I went to Shanghai yesterday"
Present: "I go to Shanghai"
Future: "I am going to go to Shanghai tomorrow"

Because words do not change form with tense, this article assumes that every Chinese morpheme is already in its "base form".

2. Java implementation

1. Preparation

Declaring fnlp-core from the Maven repository directly in pom.xml causes an error, so build the source code once to create the fnlp-core-2.1-SNAPSHOT.jar file.

To create fnlp-core-2.1-SNAPSHOT.jar, follow the "Download" and "Build" steps in the article "Chinese morphological analysis with FNLP".

Create a Maven project and place the resulting fnlp-core-2.1-SNAPSHOT.jar file in a dic folder directly under the project root.

Add the following dependencies to pom.xml:

<dependency>
	<groupId>net.sf.trove4j</groupId>
	<artifactId>trove4j</artifactId>
	<version>3.0.3</version>
</dependency>
<dependency>
	<groupId>commons-cli</groupId>
	<artifactId>commons-cli</artifactId>
	<version>1.2</version>
</dependency>
<dependency>
	<groupId>org.fnlp</groupId>
	<artifactId>core</artifactId>
	<version>2.1</version>
	<scope>system</scope>
	<systemPath>${project.basedir}/dic/fnlp-core-2.1-SNAPSHOT.jar</systemPath>
</dependency>

Also, download the three model files (pos.m, seg.m, dep.m) published at https://github.com/xpqiu/fnlp/releases and place them in the dic folder.
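A missing model file is an easy mistake to make at this step, so it can help to check the dic folder before loading anything. The following is only a minimal sketch using the standard java.nio.file API; the class name ModelFileCheck and the method missingModels are hypothetical helpers invented for illustration and are not part of FNLP.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of FNLP): reports which model files are absent.
class ModelFileCheck {
    // File names taken from the list of model files mentioned in the text
    static final String[] MODEL_FILES = {"seg.m", "pos.m", "dep.m"};

    /** Returns the names of the model files that are missing from modelDir. */
    static List<String> missingModels(Path modelDir) {
        List<String> missing = new ArrayList<>();
        for (String name : MODEL_FILES) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumes the program is started from the project root
        List<String> missing = missingModels(Paths.get("./dic"));
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing model files: " + missing);
        }
    }
}
```

Running this from the project root before starting the analyzer shows right away whether a later load failure is caused by the folder contents.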

2. Word-separation

import java.util.Arrays;
import org.fnlp.nlp.cn.CNFactory;
import org.fnlp.util.exception.LoadModelException;

try {
    // Load the model files from the dic folder and create the analyzer
    CNFactory factory = CNFactory.getInstance("./dic");
    // "The weather is really nice today!"
    String message = "今天天气真好啊！";
    // tag() returns two parallel arrays: [0] = words, [1] = parts of speech
    String[][] tokens = factory.tag(message);
    System.out.println(Arrays.asList(tokens[0]));
} catch (LoadModelException lme) {
    lme.printStackTrace();
}
>> [今天, 天气, 真, 好, 啊, ！]

3. Part of speech decomposition

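import java.util.Arrays;
import org.fnlp.nlp.cn.CNFactory;
import org.fnlp.util.exception.LoadModelException;

try {
    // Load the model files from the dic folder and create the analyzer
    CNFactory factory = CNFactory.getInstance("./dic");
    // "The weather is really nice today!"
    String message = "今天天气真好啊！";
    // tag() returns two parallel arrays: [0] = words, [1] = parts of speech
    String[][] tokens = factory.tag(message);
    System.out.println(Arrays.asList(tokens[1]));
} catch (LoadModelException lme) {
    lme.printStackTrace();
}
>> [Time phrase, Noun, Adverb, Predicate, Interjection, Punctuation]

The tags above are shown in English translation; FNLP itself prints Chinese tag names.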
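To get closer to MeCab-style output, the word array and the tag array can be combined into one line per morpheme. The sketch below is only an illustration under this article's assumption that the base form equals the surface form. The class MorphemeFormatter, the method toMorphemeLines, and the output layout (word, tab, tag, comma, base form) are all hypothetical, and the sample arrays are hand-written stand-ins for the result of factory.tag(message), so the code runs without FNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of FNLP): formats words and tags as MeCab-like lines.
class MorphemeFormatter {
    /**
     * Pairs each word with its tag. The base form is set to the word itself,
     * following the article's assumption that Chinese morphemes are not inflected.
     */
    static List<String> toMorphemeLines(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // Layout: surface <TAB> part of speech , base form
            lines.add(words[i] + "\t" + tags[i] + "," + words[i]);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hand-written sample standing in for factory.tag(message):
        // tokens[0] holds the words, tokens[1] holds the tags
        String[][] tokens = {
            {"天气", "好"},
            {"Noun", "Predicate"}
        };
        for (String line : toMorphemeLines(tokens[0], tokens[1])) {
            System.out.println(line);
        }
    }
}
```

In a real run, the two rows of the sample array would be replaced by tokens[0] and tokens[1] from the analyzer.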

[JAVA] Chinese morphological analysis like Mecab with FNLP