Overview

I want to do in English the kind of morphological analysis that MeCab does for Japanese, so I use Apache OpenNLP.

Environment

OS: Windows 7 64-bit
Language: Java 8
IDE: Eclipse 4.6.1

Purpose

When MeCab is run on the command line, it prints information for each morpheme of the input. For the Japanese sentence meaning "It's nice weather today, isn't it." the output looks like this (romanized, with the part-of-speech fields translated):

kyo (today)       noun, adverbial, *, *, *, *, kyo, kyo, kyo
wa                particle, binding particle, *, *, *, *, ha, ha, wa
ii (good)         adjective, independent, *, *, adjective/ii, base form, ii, ii, ii
tenki (weather)   noun, general, *, *, *, *, tenki, tenki, tenki
desu (is)         auxiliary verb, *, *, *, special/desu, base form, desu, desu, desu
ne                particle, sentence-final particle, *, *, *, *, ne, ne, ne
.                 symbol, period, *, *, *, *, ., ., .

When the ipadic dictionary is used, the following fields are available for each morpheme: "part of speech, part-of-speech subcategory 1, part-of-speech subcategory 2, part-of-speech subcategory 3, conjugation type, conjugation form, base form, reading, pronunciation".

From this information, I take three items, the morpheme itself, the part of speech, and the base form, and use them for analysis.

I want to do the same thing in English, so I use OpenNLP to get the morphemes (tokens), parts of speech, and base forms (lemmas) from English sentences.
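As a rough illustration of picking those three items out of one MeCab output line, here is a minimal sketch. It assumes each line has the shape "surface<TAB>comma-separated fields" in the ipadic field order listed above. The class name MecabMorpheme is hypothetical, and the sample line is a romanized stand-in for real output.

```java
// Hypothetical helper: holds the three fields taken from one MeCab output line.
final class MecabMorpheme {
    final String surface;      // the morpheme as it appears in the text
    final String partOfSpeech; // 1st feature field
    final String baseForm;     // 7th feature field

    MecabMorpheme(String surface, String partOfSpeech, String baseForm) {
        this.surface = surface;
        this.partOfSpeech = partOfSpeech;
        this.baseForm = baseForm;
    }

    // Assumption: the line looks like "surface<TAB>f1,f2,...,f9" (ipadic field order).
    static MecabMorpheme parse(String line) {
        String[] parts = line.split("\t", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a morpheme line: " + line);
        }
        String[] features = parts[1].split(",", -1);
        if (features.length < 7) {
            throw new IllegalArgumentException("Too few feature fields: " + line);
        }
        return new MecabMorpheme(parts[0], features[0], features[6]);
    }

    public static void main(String[] args) {
        // Romanized stand-in for a real ipadic line.
        MecabMorpheme m = parse("tenki\tnoun,general,*,*,*,*,tenki,tenki,tenki");
        System.out.println(m.surface + " / " + m.partOfSpeech + " / " + m.baseForm);
    }
}
```

The English version built with OpenNLP aims to produce the same three values for each token.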

1. Functions provided by OpenNLP
2. Java implementation
   1. Preparation
   2. Tokenization
   3. Part-of-speech tagging
   4. Lemmatization

1. Functions provided by OpenNLP

OpenNLP supports multiple languages and provides the following functions.

Language Detector
Sentence Detector
Tokenizer
Name Finder / Named Entity Recognition
Part-of-Speech Tagger (assigns a part of speech to each word)
Lemmatizer (returns the base form of a word)
Parser (creates a syntax tree)
Chunker (creates a shallow syntax tree)
Document Categorizer
Coreference Resolution (finds what a referring expression points to)

I want to get the morpheme, part of speech, and base form, so this time I use the following three:

Tokenizer
Part-of-Speech Tagger (assigns a part of speech to each word)
Lemmatizer (returns the base form of a word)

2. Java implementation

1. Preparation

Create a Maven project and add the following to pom.xml:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.4</version>
</dependency>

Also, download the following files from the linked sites and place them in the project where the program can read them.

en-token.bin
  Binary model file used by the Tokenizer
  Download: http://opennlp.sourceforge.net/models-1.5/
en-pos-maxent.bin (or en-pos-perceptron.bin)
  Binary model file used by the Part-of-Speech Tagger
  Download: http://opennlp.sourceforge.net/models-1.5/
en-lemmatizer.txt
  Lemma dictionary used by the Lemmatizer
  Download: https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict
  en-lemmatizer.txt is the content of the linked file saved as a text file.

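2. Tokenization

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Tokenizer settings
// Note: Java does not expand "~"; replace it with the actual path to the model file.
InputStream modelIn = new FileInputStream("~/en-token.bin");
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);

// Split the sentence into tokens
String message = "It is a fine day today.";
String[] morphemes = tokenizer.tokenize(message);

System.out.println(Arrays.asList(morphemes));
>> [It, is, a, fine, day, today, .]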

3. Part-of-speech tagging

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Part-of-Speech Tagger settings
// Note: Java does not expand "~"; replace it with the actual path to the model file.
InputStream posModelIn = new FileInputStream("~/en-pos-maxent.bin");
POSModel posModel = new POSModel(posModelIn);
POSTaggerME posTagger = new POSTaggerME(posModel);

// Tag the tokens produced by the tokenizer
String[] tags = posTagger.tag(morphemes);
System.out.println(Arrays.asList(tags));
>> [PRP, VBZ, DT, JJ, NN, NN, .]

For the meaning of the part-of-speech tags that OpenNLP outputs (the Penn Treebank tag set), see the following site: http://dpdearing.com/posts/2011/12/opennlp-part-of-speech-pos-tags-penn-english-treebank/
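To make the tag output easier to read, the tags can be mapped to short descriptions. The following is only a sketch covering the tags that appear in the example output above. The class name PennTagNames and the fallback text are hypothetical, and the descriptions are my own short wording of the Penn Treebank tag meanings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table for the few Penn Treebank tags seen in the example.
final class PennTagNames {
    private static final Map<String, String> NAMES = new HashMap<>();

    static {
        NAMES.put("PRP", "personal pronoun");
        NAMES.put("VBZ", "verb, 3rd person singular present");
        NAMES.put("DT", "determiner");
        NAMES.put("JJ", "adjective");
        NAMES.put("NN", "noun, singular or mass");
        NAMES.put(".", "sentence-final punctuation");
    }

    // Returns a description, or a placeholder for tags this sketch does not cover.
    static String describe(String tag) {
        return NAMES.getOrDefault(tag, "(not listed in this sketch)");
    }

    public static void main(String[] args) {
        String[] tags = {"PRP", "VBZ", "DT", "JJ", "NN", "NN", "."};
        for (String tag : tags) {
            System.out.println(tag + "\t" + describe(tag));
        }
    }
}
```

A full table would need every tag from the linked page. This one lists only the six tags from the example sentence.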

4. Lemmatization

import opennlp.tools.lemmatizer.DictionaryLemmatizer;

// Lemmatizer settings
// Note: Java does not expand "~"; replace it with the actual path to the dictionary file.
InputStream dictLemmatizer = new FileInputStream("~/en-lemmatizer.txt");
DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);

// Use the tokens and their part-of-speech tags
String[] lemmas = lemmatizer.lemmatize(morphemes, tags);
System.out.println(Arrays.asList(lemmas));
>> [it, be, a, fine, day, today, O]

The "O" that appears in the lemmatization result is output when no base form can be found, for example because the part of speech was not tagged well or because the word is a proper noun that has no base form in the dictionary.

Since the lemmatization result is "O" more often than I expected, some adjustment is needed, such as replacing "O" with the original token.
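A rough sketch of why "O" shows up, assuming the dictionary file holds one tab-separated "word, tag, lemma" entry per line and the lookup key is the lowercased token together with its tag. SimpleDictionaryLookup is a hypothetical, simplified stand-in written for illustration. It is not OpenNLP's actual code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a dictionary-based lemmatizer (illustration only).
final class SimpleDictionaryLookup {
    private final Map<List<String>, String> lemmaByWordAndTag = new HashMap<>();

    // Assumption: each line is "word<TAB>postag<TAB>lemma".
    SimpleDictionaryLookup(List<String> dictionaryLines) {
        for (String line : dictionaryLines) {
            String[] fields = line.split("\t");
            if (fields.length >= 3) {
                lemmaByWordAndTag.put(Arrays.asList(fields[0], fields[1]), fields[2]);
            }
        }
    }

    // Returns the lemma for (token, tag), or "O" when the pair is not in the dictionary.
    String lemmatize(String token, String tag) {
        String lemma = lemmaByWordAndTag.get(Arrays.asList(token.toLowerCase(), tag));
        return lemma != null ? lemma : "O";
    }

    public static void main(String[] args) {
        SimpleDictionaryLookup dict = new SimpleDictionaryLookup(
                Arrays.asList("it\tPRP\tit", "is\tVBZ\tbe"));
        System.out.println(dict.lemmatize("is", "VBZ")); // found in the dictionary
        System.out.println(dict.lemmatize(".", "."));    // not found, so "O"
    }
}
```

Under this assumption, a wrong tag or a word missing from the dictionary both lead to a failed lookup, which matches the two causes described above.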
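One possible form of that adjustment is sketched below. It replaces every "O" with the token at the same position and prints MeCab-like lines of token, part of speech, and base form. The class and method names are hypothetical, and the arrays are the example results from this article, not fresh OpenNLP output.

```java
// Hypothetical helper: falls back to the original token when the lemma is "O".
final class LemmaFallback {

    // Returns a copy of lemmas in which each "O" is replaced by the token at the same index.
    static String[] replaceUnknown(String[] tokens, String[] lemmas) {
        if (tokens.length != lemmas.length) {
            throw new IllegalArgumentException("tokens and lemmas must have the same length");
        }
        String[] result = new String[lemmas.length];
        for (int i = 0; i < lemmas.length; i++) {
            result[i] = "O".equals(lemmas[i]) ? tokens[i] : lemmas[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Example values copied from the results shown in this article.
        String[] tokens = {"It", "is", "a", "fine", "day", "today", "."};
        String[] tags = {"PRP", "VBZ", "DT", "JJ", "NN", "NN", "."};
        String[] lemmas = {"it", "be", "a", "fine", "day", "today", "O"};

        String[] fixed = replaceUnknown(tokens, lemmas);
        for (int i = 0; i < tokens.length; i++) {
            // One line per token: surface form, then part of speech and base form.
            System.out.println(tokens[i] + "\t" + tags[i] + "," + fixed[i]);
        }
    }
}
```

Note that this sketch cannot tell a failed lookup apart from a word whose real lemma is the single letter "O".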

Reference links

Apache OpenNLP Developer Documentation
Survey on Apache OpenNLP
https://www.tutorialkart.com/opennlp/lemmatizer-example-in-apache-opennlp/

[JAVA] English morphological analysis like MeCab with OpenNLP