From installing JUMAN++ to Japanese morphological analysis with Python

Introduction

Morphological analysis is almost always required when you work on natural language processing. Well-known morphological analyzers for Japanese include MeCab and [JUMAN++](http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++). In this article, we install JUMAN++ and run morphological analysis with it.

The contents of this article are as follows.

Installing JUMAN++
Using JUMAN++ from Python

What is natural language processing?

Natural language processing (NLP) is a set of technologies that let a computer process the natural language humans use every day, and it is a field of artificial intelligence and linguistics. [Natural language processing | Wikipedia](https://ja.wikipedia.org/wiki/自然言語処理)

**In a nutshell**: technology for processing, on a computer, the language that humans ordinarily use

What is morphological analysis?

Morphological analysis is the task of dividing natural-language text data (sentences) that carries no grammatical annotation into a sequence of morphemes (roughly speaking, the smallest units that carry meaning in a language), and of determining the part of speech and other attributes of each morpheme. It relies on the grammar of the target language and on a so-called dictionary that holds information such as the parts of speech of words. [Morphological analysis | Wikipedia](https://ja.wikipedia.org/wiki/形態素解析)

**In a nutshell**: the process of splitting a given sentence into its smallest meaningful units and attaching information such as the part of speech to each
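To make the definition concrete, here is a minimal sketch of dictionary-based segmentation using greedy longest match. It is only a toy: the dictionary entries, the part-of-speech labels, and the function name `segment` are made up for illustration, the example sentence is an assumed Japanese rendering of "I started studying morphological analysis", and real analyzers such as JUMAN++ use far more elaborate models than longest match.

```python
# Toy dictionary: surface form -> part of speech (hypothetical entries).
TOY_DICTIONARY = {
    "形態素": "noun",
    "解析": "noun",
    "の": "particle",
    "勉強": "noun",
    "を": "particle",
    "始め": "verb",
    "ました": "suffix",
    "。": "symbol",
}


def segment(sentence, dictionary=TOY_DICTIONARY):
    """Split `sentence` by greedy longest match and tag each piece."""
    longest = max(len(word) for word in dictionary)
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink it.
        for length in range(min(longest, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if piece in dictionary:
                tokens.append((piece, dictionary[piece]))
                i += length
                break
        else:
            # No dictionary entry starts here: emit one character as unknown.
            tokens.append((sentence[i], "unknown"))
            i += 1
    return tokens


for surface, pos in segment("形態素解析の勉強を始めました。"):
    print(surface, pos)
```

Each printed line pairs one segment with its label, which is the same kind of information (in a much simpler form) that a morphological analyzer reports.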

What is JUMAN++?

JUMAN++ is a high-performance morphological analysis system developed by the Kurohashi-Kawahara Laboratory at Kyoto University. It uses a recurrent neural network language model (RNNLM), so its analysis takes the semantic naturalness of the word sequence into account. Its basic accuracy is about the same as other analyzers, but because it also considers how well words connect, it reportedly achieves higher accuracy than MeCab in some respects. On the other hand, it appears to be slower than other analyzers, so MeCab may be the better choice if you need real-time performance.

**In a nutshell**: a high-performance morphological analyzer for Japanese that may be more accurate than MeCab
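The idea of "considering the naturalness of the word sequence" can be sketched as scoring several candidate segmentations with a language model and keeping the best one. The sketch below swaps the RNNLM for a tiny hand-written bigram table, so it is not how JUMAN++ is implemented; every score, the fallback value, and the name `sequence_score` are assumptions made up for illustration.

```python
# Hypothetical bigram log-probabilities; "<s>" marks the sentence start.
# A real system would obtain these scores from a trained model.
TOY_BIGRAM_LOGPROB = {
    ("<s>", "外国"): -2.0,
    ("外国", "人参"): -9.0,
    ("人参", "政権"): -9.5,
    ("<s>", "外国人"): -3.0,
    ("外国人", "参政権"): -2.5,
}
UNSEEN_LOGPROB = -12.0  # fallback for word pairs missing from the table


def sequence_score(words):
    """Sum the bigram log-probabilities along the word sequence."""
    score = 0.0
    previous = "<s>"
    for word in words:
        score += TOY_BIGRAM_LOGPROB.get((previous, word), UNSEEN_LOGPROB)
        previous = word
    return score


# Two ways to split the same string 外国人参政権.
candidates = [
    ["外国", "人参", "政権"],   # "foreign country / carrot / regime"
    ["外国人", "参政権"],       # "foreigner / suffrage"
]
best = max(candidates, key=sequence_score)
print(best)
```

With these made-up scores the second split wins, because its word pairs are rated as more natural than "carrot" following "foreign country".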

Operating environment

OS: Linux (confirmed operation on CentOS 6.7)
Required memory: 4GB or more
Disk capacity: 2GB or more

Installing JUMAN++

Now let's install JUMAN++. This time, we install it on Linux.

For Mac users, please refer to here.

These are the two sites I referred to.

First, install the two packages that JUMAN++ requires.

gcc (4.9 or later)
Boost C++ Libraries (1.57 or later)

Many people already have gcc installed, so that is rarely a concern, but be careful: an error occurs unless Boost is 1.57 or later.
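Before building, it can help to confirm that both prerequisites meet the minimum versions. The following is a rough sketch, not an official check: it assumes `gcc` is on the `PATH`, that the Boost headers live under `/usr/include/boost` (the location differs between systems), and the helper names are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Turn a string such as '4.9.2' or '1_57' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def version_at_least(found, required):
    """Compare two version strings component by component."""
    return parse_version(found) >= parse_version(required)


def gcc_version():
    """Return the output of `gcc -dumpversion`, or None if gcc is absent."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


def boost_version(header="/usr/include/boost/version.hpp"):
    """Read BOOST_LIB_VERSION (e.g. "1_57") from the header, if present."""
    try:
        with open(header, encoding="utf-8", errors="replace") as handle:
            content = handle.read()
    except OSError:
        return None
    match = re.search(r'#define\s+BOOST_LIB_VERSION\s+"([\d_]+)"', content)
    return match.group(1) if match else None


for name, found, required in [
    ("gcc", gcc_version(), "4.9"),
    ("Boost", boost_version(), "1.57"),
]:
    if found is None:
        print(name, "not found")
    else:
        status = "OK" if version_at_least(found, required) else "too old"
        print(name, found, status)
```

If Boost is installed under a different prefix, pass the path of its `version.hpp` to `boost_version`.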

Next, install JUMAN++ itself.

$ wget http://lotus.kuee.kyoto-u.ac.jp/nl-resource/jumanpp/jumanpp-1.01.tar.xz
$ tar xJvf jumanpp-1.01.tar.xz
$ cd jumanpp-1.01
$ ./configure
$ make
$ make install

JUMAN++ is now installed! By default, it is installed under /usr/local/. If you want to choose the installation destination, add the --prefix=/path option to ./configure.

Let's try it right away.

$ jumanpp
I started studying morphological analysis

Form Keitai Form Noun 6 Appellative 1* 0 * 0 "Representative notation:form/Keitai category:Shape / pattern"
Elementary noun 6 Appellative 1* 0 * 0 "Representative notation:Elementary/So kanji reading:Sound category:Abstract"
Analysis Kaiseki Analysis Noun 6 Sahen Noun 2* 0 * 0 "Representative notation:analysis/Kaiseki category:Abstract domain:Education / learning;Science and technology"
Nono particle 9 Conjunctive particle 3* 0 * 0 NIL
Study Benkyo Study Noun 6 Sahen Noun 2* 0 * 0 "Representative notation:study/Benkyo category:Abstract domain:Education / learning"
To the particle 9 case particle 1* 0 * 0 NIL
Begin Begin Begin Verb 2*0 Vowel verb 1 Basic continuous form 8"Representative notation:start/Beginning Attached verb candidate (basic) Self-transitive verb:Self:Start/Rebellion that begins:verb:Finish/Yeah"
Suffix 14 Verb Suffix 7 Verb Suffix Type 31 Ta Form 7"Representative notation:Masu/Masu"
.. .. .. Special 1 Kuten 1* 0 * 0 NIL
EOS

The JUMAN++ executable is jumanpp. In my environment, it was in the bin directory under the installation directory. Morphological analysis with JUMAN++ succeeded!
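Each output line describes one morpheme as space-separated columns, ending with a quoted block of semantic information or NIL, and EOS closes the sentence. A minimal parsing sketch is shown below. It assumes the usual 12-column layout (surface, reading, base form, part of speech and its ID, sub-category and its ID, conjugation type and its ID, conjugation form and its ID, semantic information). The sample lines are an assumed untranslated form of the output above, and the `Morpheme` tuple and `parse_line` function are hypothetical helpers, not part of JUMAN++.

```python
from collections import namedtuple

# Column names are borrowed from PyKNP's attribute naming for readability.
Morpheme = namedtuple(
    "Morpheme",
    "midasi yomi genkei hinsi hinsi_id bunrui bunrui_id "
    "katuyou1 katuyou1_id katuyou2 katuyou2_id imis",
)


def parse_line(line):
    """Parse one morpheme line; return None for EOS, '@' alternatives, or blanks."""
    line = line.rstrip("\n")
    if not line or line == "EOS" or line.startswith("@ "):
        return None
    # The last column may contain spaces, so split at most 11 times.
    fields = line.split(" ", 11)
    if len(fields) != 12:
        return None
    fields[11] = fields[11].strip('"')
    return Morpheme(*fields)


# Illustrative sample in the assumed layout.
SAMPLE_OUTPUT = """\
解析 かいせき 解析 名詞 6 サ変名詞 2 * 0 * 0 "代表表記:解析/かいせき カテゴリ:抽象物"
の の の 助詞 9 接続助詞 3 * 0 * 0 NIL
EOS
"""

for raw in SAMPLE_OUTPUT.splitlines():
    morpheme = parse_line(raw)
    if morpheme is not None:
        print(morpheme.midasi, morpheme.hinsi, morpheme.bunrui)
```

This sketch skips alternative candidates (lines starting with `@`) and does not handle a surface form that is itself a space, so treat it as a starting point only.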

Using JUMAN++ from Python

Next, we will use JUMAN++ from Python.

JUMAN++ can be used from Python through PyKNP. To use PyKNP, you need to install both JUMAN and KNP if they are not already present in your environment.

I referred to the following site: Use JUMAN++ from Python

PyKNP (Python binding of JUMAN and KNP)
JUMAN (morphological analyzer)
KNP (Parser)

Please see the reference site for how to install the three items above.

Finally, let's call JUMAN++ from Python!

`python_jumanpp.py`


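#-*- encoding: utf-8 -*-
# Python 2 script: note the print statement and the codecs wrappers below.
from pyknp import Jumanpp
import sys
import codecs

# Wrap stdin/stdout so that text is decoded and encoded as UTF-8.
sys.stdin = codecs.getreader('utf_8')(sys.stdin)
sys.stdout = codecs.getwriter('utf_8')(sys.stdout)

# Use Juman++ in subprocess mode
jumanpp = Jumanpp()
result = jumanpp.analysis(u"I started natural language processing.")
# Print the surface form (midasi) of every morpheme in the result.
for mrph in result.mrph_list():
    print u"Heading:%s" % (mrph.midasi)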

$ python python_jumanpp.py
Heading:Nature
Heading:language
Heading:processing
Heading:start
Heading:Was
Heading:。

You have successfully used JUMAN++ from Python!
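The "subprocess mode" comment in the script refers to running the jumanpp executable as a child process and exchanging text over its standard input and output. A minimal Python 3 sketch of that idea follows. It is not PyKNP's implementation: the function name `run_analyzer` is made up, and it assumes that `jumanpp` is on the `PATH` and reads sentences from standard input until end of input, as in the command-line session earlier.

```python
import subprocess
import sys


def run_analyzer(text, command=("jumanpp",)):
    """Send `text` to `command` on stdin and return its stdout as a string."""
    completed = subprocess.run(
        list(command),
        input=text + "\n",
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


# Stand-in command so that the sketch runs without jumanpp installed:
# a Python one-liner that simply echoes its standard input.
ECHO = (sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())")
print(run_analyzer("hello", command=ECHO), end="")
```

With JUMAN++ installed, calling `run_analyzer` with a Japanese sentence and the default `command` should return the same kind of raw analysis text as the command-line session, which you can then split into lines yourself.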

That's all.