By default, NLP4J (nlp4j-core) uses the Japanese morphological analysis API of the Yahoo! Developer Network.
Text analysis: Japanese morphological analysis-Yahoo! Developer Network https://developer.yahoo.co.jp/webapi/jlp/ma/v1/parse.html
The Yahoo! Developer Network API is convenient because it can be called over HTTP, but it has a weakness: the number of calls is limited. I therefore decided to create a library that uses kuromoji, which runs locally.
This time, I created nlp4j-kuromoji as a submodule of the nlp4j project.
nlp4j-kuromoji https://github.com/oyahiroki/nlp4j/tree/master/nlp4j/nlp4j-kuromoji
I added the following Maven dependencies to use kuromoji.
<!-- https://mvnrepository.com/artifact/com.atilika.kuromoji/kuromoji -->
<dependency>
    <groupId>com.atilika.kuromoji</groupId>
    <artifactId>kuromoji</artifactId>
    <version>0.9.0</version>
    <type>pom</type>
</dependency>
<dependency>
    <groupId>com.atilika.kuromoji</groupId>
    <artifactId>kuromoji-ipadic</artifactId>
    <version>0.9.0</version>
</dependency>
Class Diagram
The class diagram looks like this. Both classes do the same job as morphological analysis engines, so they are siblings. Once the annotator is implemented, users no longer need to care about the difference, so this is probably the only time you need to be aware of how kuromoji is used inside.
@startuml
nlp4j.DocumentAnnotator <|-- YJpMaAnnotator
nlp4j.DocumentAnnotator <|-- KuromojiAnnotator
@enduml
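The sibling relationship can be sketched in plain Java. The interface and classes below (SimpleAnnotator, WhitespaceAnnotator, CharacterAnnotator) are simplified, hypothetical stand-ins and not NLP4J's actual DocumentAnnotator API. The only point is that the calling code depends on the interface, so the engine behind it can be swapped.

```java
import java.util.ArrayList;
import java.util.List;

class AnnotatorDemo {
    // The caller depends only on the interface, not on a concrete engine.
    static int countTokens(SimpleAnnotator annotator, String text) {
        return annotator.annotate(text).size();
    }

    public static void main(String[] args) {
        String text = "I went to school";
        // Swapping the engine means changing only the constructor call.
        System.out.println(countTokens(new WhitespaceAnnotator(), text));
        System.out.println(countTokens(new CharacterAnnotator(), text));
    }
}

// Hypothetical, simplified stand-in for an annotator interface (not the NLP4J API).
interface SimpleAnnotator {
    List<String> annotate(String text);
}

// Stand-in for one engine: splits the text on whitespace.
class WhitespaceAnnotator implements SimpleAnnotator {
    @Override
    public List<String> annotate(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}

// Stand-in for another engine: one token per Unicode code point.
class CharacterAnnotator implements SimpleAnnotator {
    @Override
    public List<String> annotate(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }
}
```

The two stand-ins tokenize very differently, but countTokens does not need to know which one it received. YJpMaAnnotator and KuromojiAnnotator relate to nlp4j.DocumentAnnotator in the same way.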
Code
The class implements the nlp4j.DocumentAnnotator interface provided by NLP4J. The tokens extracted by kuromoji are mapped to the keywords defined by NLP4J.
package nlp4j.krmj.annotator;

import java.util.List;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

import nlp4j.AbstractDocumentAnnotator;
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.impl.DefaultKeyword;

/**
 * Kuromoji Annotator
 * @author Hiroki Oya
 * @since 1.2
 */
public class KuromojiAnnotator extends AbstractDocumentAnnotator implements DocumentAnnotator {

    private static final Logger logger = LogManager.getLogger(KuromojiAnnotator.class);

    @Override
    public void annotate(Document doc) throws Exception {
        Tokenizer tokenizer = new Tokenizer(); // kuromoji instance
        for (String target : targets) {
            Object obj = doc.getAttribute(target);
            // Skip attributes that are missing or not text
            if (!(obj instanceof String)) {
                continue;
            }
            String text = (String) obj;
            List<Token> tokens = tokenizer.tokenize(text);
            int sequence = 1; // 1-based order of the keyword within this attribute
            for (Token token : tokens) {
                logger.debug(token.getAllFeatures());
                // Map each kuromoji token to a new NLP4J keyword
                DefaultKeyword kwd = new DefaultKeyword();
                kwd.setLex(token.getBaseForm());
                kwd.setStr(token.getSurface());
                kwd.setReading(token.getReading());
                kwd.setBegin(token.getPosition());
                kwd.setEnd(token.getPosition() + token.getSurface().length());
                kwd.setFacet(token.getPartOfSpeechLevel1());
                kwd.setSequence(sequence);
                doc.addKeyword(kwd);
                sequence++;
            }
        }
    }
}
Note that the terminology differs slightly: both names refer to the same base (dictionary) form, but kuromoji calls it baseForm while NLP4J calls it lex.
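The begin, end, and sequence values set in annotate() can be illustrated without kuromoji on the classpath. The sketch below is a hypothetical stand-in: Span is not an NLP4J class, and it looks up each surface with String.indexOf instead of calling token.getPosition(), so it assumes the surfaces appear in the text in order. As in the annotator, end is begin plus the surface length, and sequence starts at 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OffsetDemo {

    // Hypothetical minimal keyword: 1-based sequence, begin (inclusive), end (exclusive).
    static final class Span {
        final int sequence;
        final int begin;
        final int end;
        final String str;

        Span(int sequence, int begin, int end, String str) {
            this.sequence = sequence;
            this.begin = begin;
            this.end = end;
            this.str = str;
        }

        @Override
        public String toString() {
            return str + "[sequence=" + sequence + ", begin=" + begin + ", end=" + end + "]";
        }
    }

    // Assigns offsets to surfaces that occur in the text in the given order.
    static List<Span> toSpans(String text, List<String> surfaces) {
        List<Span> spans = new ArrayList<>();
        int searchFrom = 0; // never search before the end of the previous token
        int sequence = 1;
        for (String surface : surfaces) {
            int begin = text.indexOf(surface, searchFrom);
            if (begin < 0) {
                throw new IllegalArgumentException("Surface not found in order: " + surface);
            }
            int end = begin + surface.length(); // same rule as in annotate()
            spans.add(new Span(sequence, begin, end, surface));
            searchFrom = end;
            sequence++;
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> surfaces = Arrays.asList("I", "went", "to", "school", ".");
        for (Span span : toSpans("I went to school.", surfaces)) {
            System.out.println(span);
        }
    }
}
```

Offsets here are counted in Java char units, which is also what String.length() returns in the annotator's end calculation.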
The usage is the same as with the Yahoo! Developer Network, except that a different Annotator class is specified. NLP4J wraps kuromoji and the Yahoo! Developer Network, which are separate natural language processing engines.
public void testAnnotateDocument001() throws Exception {
    // Natural language text ("I went to school." in Japanese)
    String text = "私は学校に行きました。";
    Document doc = new DefaultDocument();
    doc.putAttribute("text", text);
    KuromojiAnnotator annotator = new KuromojiAnnotator(); // The engine can be replaced by changing only this line
    annotator.setProperty("target", "text");
    annotator.annotate(doc); // throws Exception
    System.err.println("Finished : annotation");
    for (Keyword kwd : doc.getKeywords()) {
        System.err.println(kwd);
    }
}
The result for the Japanese sentence 私は学校に行きました。 ("I went to school.") is as follows. I was able to use kuromoji without being aware of which natural language processing library sits behind the annotator.
Finished : annotation
私[sequence=1, facet=名詞, lex=私, str=私, reading=ワタシ, count=-1, begin=0, end=1, correlation=0.0]
は[sequence=2, facet=助詞, lex=は, str=は, reading=ハ, count=-1, begin=1, end=2, correlation=0.0]
学校[sequence=3, facet=名詞, lex=学校, str=学校, reading=ガッコウ, count=-1, begin=2, end=4, correlation=0.0]
に[sequence=4, facet=助詞, lex=に, str=に, reading=ニ, count=-1, begin=4, end=5, correlation=0.0]
行く[sequence=5, facet=動詞, lex=行く, str=行き, reading=イキ, count=-1, begin=5, end=7, correlation=0.0]
ます[sequence=6, facet=助動詞, lex=ます, str=まし, reading=マシ, count=-1, begin=7, end=9, correlation=0.0]
た[sequence=7, facet=助動詞, lex=た, str=た, reading=タ, count=-1, begin=9, end=10, correlation=0.0]
。[sequence=8, facet=記号, lex=。, str=。, reading=。, count=-1, begin=10, end=11, correlation=0.0]
With NLP4J, you can easily process natural language in Java!
https://www.nlp4j.org/