[JAVA] English morphological analysis like MeCab with OpenNLP

Overview

I want to do something like Japanese morphological analysis (MeCab) on English text, so I use Apache OpenNLP.

Environment

OS: Windows 7 64-bit
Language: Java 8
IDE: Eclipse 4.6.1

Purpose

When MeCab is run on the command line on a Japanese sentence (shown here in English translation):

It's nice weather today.
↓
Today "noun, adverbial, *, *, *, *, today, kyo, kyo"
Wa "particle, binding particle, *, *, *, *, ha, ha, wa"
Good "adjective, independent, *, *, adjective/good, base form, good, good, good"
Weather "noun, general, *, *, *, *, weather, tenki, tenki"
Desu "auxiliary verb, *, *, *, special/desu, base form, desu, desu, desu"
Ne "particle, final particle, *, *, *, *, ne, ne, ne"
. "symbol, period, *, *, *, *, ., ., ."

the morpheme information is displayed, one morpheme per line.

From this information, we take three items for each token: the morpheme (surface form), the part of speech, and the base form, and use them for analysis.
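As a rough illustration of pulling those three items out of one MeCab output line, here is a minimal sketch. It assumes the default IPADIC-style layout, where the surface form is followed by a tab and a comma-separated feature list with the part of speech in field 1 and the base form in field 7. The class and method names (MecabLineSketch, Morpheme, parse) are hypothetical, and the sample line is modeled on the translated output above rather than on real Japanese output.

```java
class MecabLineSketch {

    /** Surface form, top-level part of speech, and base form of one morpheme. */
    static final class Morpheme {
        final String surface;
        final String partOfSpeech;
        final String baseForm;

        Morpheme(String surface, String partOfSpeech, String baseForm) {
            this.surface = surface;
            this.partOfSpeech = partOfSpeech;
            this.baseForm = baseForm;
        }
    }

    /**
     * Parses one output line, assumed to look like
     * "surface<TAB>pos,sub1,sub2,sub3,conjType,conjForm,base,reading,pronunciation".
     */
    static Morpheme parse(String line) {
        String[] parts = line.split("\t", 2);
        String surface = parts[0];
        String[] features = parts.length > 1 ? parts[1].split(",") : new String[0];

        // Field 1 (index 0) is the top-level part of speech.
        String pos = features.length > 0 ? features[0] : "*";

        // Field 7 (index 6) is the base form; fall back to the surface form
        // when it is missing or "*".
        String base = (features.length > 6 && !features[6].equals("*"))
                ? features[6]
                : surface;

        return new Morpheme(surface, pos, base);
    }

    public static void main(String[] args) {
        Morpheme m = parse(
                "Good\tadjective,independent,*,*,adjective/good,base form,good,good,good");
        // Prints: Good / adjective / good
        System.out.println(m.surface + " / " + m.partOfSpeech + " / " + m.baseForm);
    }
}
```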

I want to do the same thing in English, so I use OpenNLP to get the morphemes, parts of speech, and base forms (lemmas) from English sentences.

Table of contents

  1. Functions provided by OpenNLP
  2. Java implementation
     1. Preparation
     2. Tokenization
     3. Part-of-speech tagging
     4. Lemmatization

1. Functions provided by OpenNLP

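OpenNLP itself supports multiple languages and provides several functions, including tokenization, part-of-speech tagging, and lemmatization.

I want to get the morpheme, part of speech, and base form, so this time I handle the Tokenizer, the POS Tagger, and the Lemmatizer.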

2. Java implementation

1. Preparation

Create a Maven project and add the following to pom.xml:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.4</version>
</dependency>

Also, download the files loaded in the code below (en-token.bin, en-pos-maxent.bin, and en-lemmatizer.txt) from the OpenNLP site and put them in the project so that the program can find them.

2. Tokenization

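import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

// Tokenizer settings (replace "~" with the directory that holds the model file; Java does not expand "~")
InputStream modelIn = new FileInputStream("~/en-token.bin");
TokenizerModel model = new TokenizerModel(modelIn);
Tokenizer tokenizer = new TokenizerME(model);

String message = "It is a fine day today.";
String[] morphemes = tokenizer.tokenize(message);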

System.out.println(Arrays.asList(morphemes));
>> [It, is, a, fine, day, today, .]
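One thing to watch in the snippet above: FileInputStream treats "~" as a literal directory name, because "~" expansion is done by the shell and not by Java. Assuming "~" is meant as the user's home directory (the article does not say so explicitly), a minimal sketch for resolving such a path could look like this. HomePathSketch and expandHome are hypothetical names, not part of OpenNLP.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class HomePathSketch {

    /**
     * Expands a leading "~" or "~/" to the user's home directory.
     * Any other path is returned unchanged.
     */
    static Path expandHome(String path) {
        String home = System.getProperty("user.home");
        if (path.equals("~")) {
            return Paths.get(home);
        }
        if (path.startsWith("~/")) {
            // Drop the "~/" prefix and resolve the rest against the home directory.
            return Paths.get(home, path.substring(2));
        }
        return Paths.get(path);
    }

    public static void main(String[] args) {
        // The printed location depends on the machine the code runs on.
        System.out.println(expandHome("~/en-token.bin"));
    }
}
```

The result can then be passed to the stream, for example new FileInputStream(expandHome("~/en-token.bin").toFile()).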

3. Part-of-speech tagging

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

// Part-of-speech tagger settings
InputStream posModelIn = new FileInputStream("~/en-pos-maxent.bin");
POSModel posModel = new POSModel(posModelIn);
POSTaggerME posTagger = new POSTaggerME(posModel);

// Tag the tokens produced by the tokenizer
String[] tags = posTagger.tag(morphemes);
System.out.println(Arrays.asList(tags));
>> [PRP, VBZ, DT, JJ, NN, NN, .]
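The tags in this output follow the Penn Treebank tag set, as far as I know. To make them easier to read, a small lookup table can help. The sketch below covers only the tags that appear in the output above; PosTagSketch and describe are hypothetical names, and a real table would need many more entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PosTagSketch {

    // Descriptions for the Penn Treebank tags seen in the sample output only.
    static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("PRP", "personal pronoun");
        DESCRIPTIONS.put("VBZ", "verb, 3rd person singular present");
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put(".", "sentence-final punctuation");
    }

    /** Returns a readable description, or "unknown tag" if the tag is not in the table. */
    static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        // Tokens and tags copied from the sample output above.
        String[] morphemes = {"It", "is", "a", "fine", "day", "today", "."};
        String[] tags = {"PRP", "VBZ", "DT", "JJ", "NN", "NN", "."};

        for (int i = 0; i < morphemes.length; i++) {
            System.out.println(morphemes[i] + "\t" + tags[i] + "\t" + describe(tags[i]));
        }
    }
}
```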

4. Lemmatization

import opennlp.tools.lemmatizer.DictionaryLemmatizer;

// Lemmatizer settings: load the lemma dictionary
InputStream dictLemmatizer = new FileInputStream("~/en-lemmatizer.txt");
DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);

// Lemmatize using the tokens and their part-of-speech tags
String[] lemmas = lemmatizer.lemmatize(morphemes, tags);
System.out.println(Arrays.asList(lemmas));
>> [it, be, a, fine, day, today, O]

The lemmatizer returns "O" (meaning no lemma was found in the dictionary) more often than I expected, so some adjustment is needed, such as replacing "O" with the original token.
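One possible form of that adjustment is sketched below. It needs nothing from OpenNLP beyond the two arrays already produced, and simply substitutes the original token wherever the lemma is "O". LemmaFallbackSketch and fillUnknownLemmas are hypothetical names, and keeping the token unchanged (rather than lower-casing it, for example) is an assumption about what is wanted.

```java
import java.util.Arrays;

class LemmaFallbackSketch {

    /**
     * Returns a copy of lemmas in which every "O" (no dictionary entry)
     * is replaced by the token at the same position.
     */
    static String[] fillUnknownLemmas(String[] tokens, String[] lemmas) {
        if (tokens.length != lemmas.length) {
            throw new IllegalArgumentException("tokens and lemmas must have the same length");
        }
        String[] result = new String[lemmas.length];
        for (int i = 0; i < lemmas.length; i++) {
            result[i] = "O".equals(lemmas[i]) ? tokens[i] : lemmas[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Values copied from the sample output above.
        String[] morphemes = {"It", "is", "a", "fine", "day", "today", "."};
        String[] lemmas = {"it", "be", "a", "fine", "day", "today", "O"};

        // Prints: [it, be, a, fine, day, today, .]
        System.out.println(Arrays.asList(fillUnknownLemmas(morphemes, lemmas)));
    }
}
```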
