I want to do for English what MeCab does for Japanese morphological analysis, so I use Apache OpenNLP.
OS: Windows 7 64-bit, Language: Java 8, IDE: Eclipse 4.6.1
Running MeCab on the command line looks like this:
It's nice weather today.
↓
Today     noun, adverbial, *, *, *, *, today, kyo, kyo
Wa        particle, particle, *, *, *, *, ha, ha, wa
Good      adjective, independent, *, *, adjective/good, base form, good, good, good
Weather   noun, general, *, *, *, *, weather, tenki, tenki
Desu      auxiliary verb, *, *, *, special/desu, base form, desu, desu, desu
Ne        particle, final particle, *, *, *, *, ne, ne, ne
.         symbol, period, *, *, *, *, ., ., .
The morpheme information for each token is displayed.
From this information, I take three items (the morpheme, the part of speech, and the base form) and use them for analysis.
I want to do the same thing in English, so I use OpenNLP to get the morphemes, parts of speech, and base forms from English sentences.
OpenNLP supports multiple languages and offers a range of functions.
Because I want the morpheme, the part of speech, and the base form, this article covers the Tokenizer, the Part-of-Speech Tagger, and the Lemmatizer.
Create a Maven project and add the following to pom.xml:
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.8.4</version>
</dependency>
Also download the following files and place them in the project at a location the code can read.
en-token.bin
Binary model file used by the Tokenizer
Download from http://opennlp.sourceforge.net/models-1.5/
en-pos-maxent.bin (or en-pos-perceptron.bin)
Binary model file used by the Part-of-Speech Tagger
Download from http://opennlp.sourceforge.net/models-1.5/
en-lemmatizer.txt
Dictionary of base forms (lemmas) used by the Lemmatizer
Download from https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict
en-lemmatizer.txt is the linked data saved as a plain text file.
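The OpenNLP manual describes the lemmatizer dictionary as a plain-text file with one entry per line: a word, its POS tag, and the lemma, separated by tabs. Before handing the downloaded file to OpenNLP, it may be worth checking that it survived the download in that shape. The following is only a minimal sketch: the class `LemmaDictCheck`, its methods, and the sample entries are hypothetical and not part of OpenNLP.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LemmaDictCheck {
    /** Splits one dictionary line into {word, tag, lemma}; returns null if it is malformed. */
    static String[] parseEntry(String line) {
        String[] cols = line.split("\t");
        return cols.length == 3 ? cols : null;
    }

    /** Counts non-empty lines that do not have exactly three tab-separated columns. */
    static int countMalformed(BufferedReader reader) throws IOException {
        int bad = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty() && parseEntry(line) == null) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data: one well-formed entry and one broken line
        String sample = "is\tVBZ\tbe\nbroken line without tabs\n";
        try (BufferedReader r = new BufferedReader(new StringReader(sample))) {
            System.out.println("malformed lines: " + countMalformed(r));
        }
    }
}
```

To check the real file, the `StringReader` could be swapped for a `FileReader` pointing at en-lemmatizer.txt.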
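// Tokenizer setup (classes from java.io, java.util.Arrays, opennlp.tools.tokenize)
// Replace "~" with the real directory: Java does not expand "~" the way a shell does.
String message = "It is a fine day today.";
String[] morphemes;
try (InputStream modelIn = new FileInputStream("~/en-token.bin")) {
    TokenizerModel model = new TokenizerModel(modelIn);
    Tokenizer tokenizer = new TokenizerME(model);
    // Split the sentence into tokens
    morphemes = tokenizer.tokenize(message);
}
System.out.println(Arrays.asList(morphemes));
>> [It, is, a, fine, day, today, .]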
// Part-of-Speech Tagger setup (classes from opennlp.tools.postag)
String[] tags;
try (InputStream posModelIn = new FileInputStream("~/en-pos-maxent.bin")) {
    POSModel posModel = new POSModel(posModelIn);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    // Tag the tokens produced by the tokenizer
    tags = posTagger.tag(morphemes);
}
System.out.println(Arrays.asList(tags));
>> [PRP, VBZ, DT, JJ, NN, NN, .]
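The tags printed above are Penn Treebank tags: PRP is a personal pronoun, VBZ a third-person singular present verb, DT a determiner, JJ an adjective, and NN a singular noun. For analysis it can be handier to collapse them into coarse categories closer to MeCab's top-level parts of speech. The following is a rough sketch under that assumption; the `CoarsePos` class and its category names are my own illustration, not part of OpenNLP, and the mapping covers only the most common tag prefixes.

```java
import java.util.Arrays;

public class CoarsePos {
    /** Maps a Penn Treebank tag to a coarse, human-readable category. */
    static String coarse(String tag) {
        if (tag.startsWith("NN")) return "noun";          // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return "verb";          // VB, VBD, VBG, VBN, VBP, VBZ
        if (tag.startsWith("JJ")) return "adjective";     // JJ, JJR, JJS
        if (tag.startsWith("RB")) return "adverb";        // RB, RBR, RBS
        if (tag.startsWith("PRP") || tag.startsWith("WP")) return "pronoun";
        if (tag.equals("DT")) return "determiner";
        if (tag.equals("IN")) return "preposition";       // also subordinating conjunctions
        // Punctuation and bracket tags such as ".", ",", "-LRB-" do not start with a letter
        if (tag.isEmpty() || !Character.isLetter(tag.charAt(0))) return "symbol";
        return "other";
    }

    public static void main(String[] args) {
        // Tags copied from the tagger output shown above
        String[] tags = {"PRP", "VBZ", "DT", "JJ", "NN", "NN", "."};
        String[] coarseTags = Arrays.stream(tags).map(CoarsePos::coarse).toArray(String[]::new);
        System.out.println(Arrays.asList(coarseTags));
        // prints [pronoun, verb, determiner, adjective, noun, noun, symbol]
    }
}
```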
// Lemmatizer setup (class from opennlp.tools.lemmatizer)
String[] lemmas;
try (InputStream dictLemmatizer = new FileInputStream("~/en-lemmatizer.txt")) {
    DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);
    // Look up each token together with its part-of-speech tag
    lemmas = lemmatizer.lemmatize(morphemes, tags);
}
System.out.println(Arrays.asList(lemmas));
>> [it, be, a, fine, day, today, O]
The lemmatizer returns "O" when it finds no base form for a token, and this happens more often than I expected, so some adjustment is needed, such as replacing "O" with the original morpheme.
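One way to make that adjustment is sketched below: every "O" is replaced with the token at the same position, and the three arrays are then printed side by side, one morpheme per line, roughly in the spirit of MeCab's output. The `LemmaFallback` class and the `withFallback` method are hypothetical helpers, not OpenNLP API, and keeping the token unchanged (rather than, say, lower-casing it) is just one possible choice.

```java
public class LemmaFallback {
    /** Returns a copy of lemmas in which each "O" is replaced by the token at the same index. */
    static String[] withFallback(String[] tokens, String[] lemmas) {
        String[] out = new String[lemmas.length];
        for (int i = 0; i < lemmas.length; i++) {
            out[i] = "O".equals(lemmas[i]) ? tokens[i] : lemmas[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Values copied from the OpenNLP outputs shown earlier in this article
        String[] morphemes = {"It", "is", "a", "fine", "day", "today", "."};
        String[] tags = {"PRP", "VBZ", "DT", "JJ", "NN", "NN", "."};
        String[] lemmas = {"it", "be", "a", "fine", "day", "today", "O"};

        String[] fixed = withFallback(morphemes, lemmas);
        // One line per morpheme: surface form, part of speech, base form
        for (int i = 0; i < morphemes.length; i++) {
            System.out.println(morphemes[i] + "\t" + tags[i] + "\t" + fixed[i]);
        }
    }
}
```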