Use different morphological analysis modules

By default, NLP4J (nlp4j-core) performs morphological analysis with the Japanese morphological analysis API of the Yahoo! Developer Network.

Text analysis: Japanese morphological analysis - Yahoo! Developer Network https://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html

The Yahoo! Developer Network API is convenient because it can be called over HTTP, but it has the drawback that the number of calls is limited. I therefore decided to create a library that uses kuromoji, which runs locally.

Creating an Annotator

This time, I created nlp4j-kuromoji as a submodule of the nlp4j project.

nlp4j-kuromoji https://github.com/oyahiroki/nlp4j/tree/master/nlp4j/nlp4j-kuromoji

To use kuromoji, I added the following dependencies to the Maven POM.

<!-- https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji -->
<dependency>
 <groupId>com.atilika.kuromoji</groupId>
 <artifactId>kuromoji</artifactId>
 <version>0.9.0</version>
 <type>pom</type>
</dependency>
<dependency>
 <groupId>com.atilika.kuromoji</groupId>
 <artifactId>kuromoji-ipadic</artifactId>
 <version>0.9.0</version>
</dependency>

Class Diagram

The class diagram looks like this. Both classes do the same job as morphological analysis engines, so they are siblings. Once the annotator is implemented, callers no longer need to care about the difference; the kuromoji implementation matters only here, while writing it.


@startuml
nlp4j.DocumentAnnotator <|-- YJpMaAnnotator
nlp4j.DocumentAnnotator <|-- KuromojiAnnotator 
@enduml
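
The sibling relationship in the diagram can be sketched in plain Java. This is only a minimal illustration under assumptions, not NLP4J's actual code: Annotator, WhitespaceAnnotator, and CharacterAnnotator are hypothetical stand-ins for nlp4j.DocumentAnnotator and its two implementations, and the tokenization they do is deliberately trivial. The point is that the calling code depends only on the interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for nlp4j.DocumentAnnotator (not the real interface).
interface Annotator {
    List<String> annotate(String text);
}

// Hypothetical sibling 1: splits the text on whitespace.
class WhitespaceAnnotator implements Annotator {
    @Override
    public List<String> annotate(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }
}

// Hypothetical sibling 2: emits one token per character.
class CharacterAnnotator implements Annotator {
    @Override
    public List<String> annotate(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }
}

public class SiblingAnnotatorSketch {
    // The caller sees only the interface, so either sibling can be passed in.
    static int countTokens(Annotator annotator, String text) {
        return annotator.annotate(text).size();
    }

    public static void main(String[] args) {
        String text = "a b c";
        System.out.println(countTokens(new WhitespaceAnnotator(), text)); // prints 3
        System.out.println(countTokens(new CharacterAnnotator(), text));  // prints 5
    }
}
```

Swapping the implementation changes the result, but countTokens itself never changes. That is the property the two annotators in the diagram are meant to share.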

Code

The class implements the nlp4j.DocumentAnnotator interface provided by NLP4J. Each token extracted by kuromoji is mapped to an NLP4J keyword.


package nlp4j.krmj.annotator;
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;
import nlp4j.AbstractDocumentAnnotator;
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.impl.DefaultKeyword;

/**
 * Kuromoji Annotator
 * @author Hiroki Oya
 * @since 1.2
 */
public class KuromojiAnnotator extends AbstractDocumentAnnotator implements DocumentAnnotator {
	static private final Logger logger = LogManager.getLogger(KuromojiAnnotator.class);
	@Override
	public void annotate(Document doc) throws Exception {
		Tokenizer tokenizer = new Tokenizer(); // kuromoji tokenizer (IPADIC dictionary)
		for (String target : targets) {
			Object obj = doc.getAttribute(target);
			// Skip attributes that are missing or are not text
			if (!(obj instanceof String)) {
				continue;
			}
			String text = (String) obj;
			List<Token> tokens = tokenizer.tokenize(text);
			int sequence = 1;
			for (Token token : tokens) {
				logger.debug(token.getAllFeatures());
				DefaultKeyword kwd = new DefaultKeyword(); // one NLP4J keyword per token
				kwd.setLex(token.getBaseForm()); // base (dictionary) form
				kwd.setStr(token.getSurface()); // surface form as it appears in the text
				kwd.setReading(token.getReading());
				kwd.setBegin(token.getPosition()); // start offset in the text
				kwd.setEnd(token.getPosition() + token.getSurface().length()); // end offset (exclusive)
				kwd.setFacet(token.getPartOfSpeechLevel1()); // top-level part of speech
				kwd.setSequence(sequence);
				doc.addKeyword(kwd);
				sequence++;
			}
			}
		}
	}
}

Note that the two libraries use slightly different terms for the same concept: the base (dictionary) form of a word is called baseForm in kuromoji and lex in NLP4J.
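
The field mapping and the begin/end calculation in annotate() can be tried without kuromoji on the classpath. The sketch below makes several assumptions: SketchToken and SketchKeyword are hypothetical stand-ins for kuromoji's Token and NLP4J's DefaultKeyword, and the tokens are hand-written English ones chosen only to keep the example ASCII-only. Only the mapping logic mirrors the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a kuromoji Token: surface, base form, start offset.
class SketchToken {
    final String surface;
    final String baseForm;
    final int position;

    SketchToken(String surface, String baseForm, int position) {
        this.surface = surface;
        this.baseForm = baseForm;
        this.position = position;
    }
}

// Hypothetical stand-in for NLP4J's DefaultKeyword.
class SketchKeyword {
    final String lex;
    final String str;
    final int begin;
    final int end;
    final int sequence;

    SketchKeyword(String lex, String str, int begin, int end, int sequence) {
        this.lex = lex;
        this.str = str;
        this.begin = begin;
        this.end = end;
        this.sequence = sequence;
    }
}

public class KeywordMappingSketch {
    // baseForm -> lex, surface -> str, position -> begin, position + length -> end
    static List<SketchKeyword> toKeywords(List<SketchToken> tokens) {
        List<SketchKeyword> keywords = new ArrayList<>();
        int sequence = 1;
        for (SketchToken t : tokens) {
            int begin = t.position;
            int end = begin + t.surface.length(); // exclusive end offset
            keywords.add(new SketchKeyword(t.baseForm, t.surface, begin, end, sequence));
            sequence++;
        }
        return keywords;
    }

    public static void main(String[] args) {
        // Hand-written tokens for the text "went home"
        List<SketchToken> tokens = Arrays.asList(
                new SketchToken("went", "go", 0),
                new SketchToken("home", "home", 5));
        for (SketchKeyword k : toKeywords(tokens)) {
            System.out.println(k.sequence + " " + k.lex + " " + k.str + " " + k.begin + "-" + k.end);
        }
        // prints "1 go went 0-4" and then "2 home home 5-9"
    }
}
```

Because end is computed as begin plus the surface length, text.substring(begin, end) on the original text returns the surface form again.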

How to use

Usage is the same as with the Yahoo! Developer Network annotator; only the Annotator class you instantiate changes. NLP4J wraps two separate natural language processing backends, kuromoji and the Yahoo! Developer Network, behind the same interface.

	public void testAnnotateDocument001() throws Exception {
		// Natural language text (Japanese for "I went to school.")
		String text = "私は学校に行きました。";
		Document doc = new DefaultDocument();
		doc.putAttribute("text", text);
		KuromojiAnnotator annotator = new KuromojiAnnotator(); // To switch analyzers, change only this line
		annotator.setProperty("target", "text");
		annotator.annotate(doc); // throws Exception
		System.err.println("Finished : annotation");
		for (Keyword kwd : doc.getKeywords()) {
			System.err.println(kwd);
		}
	}

Result

The result for the Japanese sentence 私は学校に行きました。 ("I went to school.") is as follows. I was able to use the analyzer without being aware of which natural language processing library is behind it. The facet values are kuromoji's Japanese part-of-speech names: 名詞 (noun), 助詞 (particle), 動詞 (verb), 助動詞 (auxiliary verb), 記号 (symbol).

Finished : annotation
私 [sequence=1, facet=名詞, lex=私, str=私, reading=ワタシ, count=-1, begin=0, end=1, correlation=0.0]
は [sequence=2, facet=助詞, lex=は, str=は, reading=ハ, count=-1, begin=1, end=2, correlation=0.0]
学校 [sequence=3, facet=名詞, lex=学校, str=学校, reading=ガッコウ, count=-1, begin=2, end=4, correlation=0.0]
に [sequence=4, facet=助詞, lex=に, str=に, reading=ニ, count=-1, begin=4, end=5, correlation=0.0]
行く [sequence=5, facet=動詞, lex=行く, str=行き, reading=イキ, count=-1, begin=5, end=7, correlation=0.0]
ます [sequence=6, facet=助動詞, lex=ます, str=まし, reading=マシ, count=-1, begin=7, end=9, correlation=0.0]
た [sequence=7, facet=助動詞, lex=た, str=た, reading=タ, count=-1, begin=9, end=10, correlation=0.0]
。 [sequence=8, facet=記号, lex=。, str=。, reading=。, count=-1, begin=10, end=11, correlation=0.0]

Summary

With NLP4J, you can easily process natural language in Java!

Project URL

https://www.nlp4j.org/
