The story of using Juman++ in server mode: a Python package that makes morphological analysis with Juman++ easy
Juman++ is a morphological analyzer developed in the Kurohashi laboratory at Kyoto University. You might ask, "How is it different from MeCab?" The key difference is that Juman++ uses an RNN (so-called deep learning) language model.
Introductory articles are gradually appearing on Qiita, and I look forward to its wider adoption in the future.
- I tried the new morphological analyzer JUMAN++, and with accuracy higher than I expected, I'm thinking of switching from MeCab
- Comparing multiple morphological analyzers
One concern is that installing the dependency libraries may force you to update gcc, which can break other code on the machine. In that case, the clean solution is to prepare a Docker environment.
Now, the problem is the second point: speed. According to this article:

> MeCab took about 10 seconds, while JUMAN++ took more than 10 hours

So concerns about speed are certainly justified.
I also made a measurement comparison in my environment.
```
$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | mecab
echo   0.00s user 0.00s system 26% cpu 0.005 total
mecab  0.00s user 0.00s system 49% cpu 0.007 total

$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | jumanpp
echo     0.00s user 0.00s system 31% cpu 0.004 total
jumanpp  0.14s user 0.35s system 53% cpu 0.931 total
```
Compared with MeCab (0.007s vs. 0.931s total), the numbers differ by more than two orders of magnitude.
This is reportedly not a flaw in the design itself, but the cost of loading the model on every invocation. In other words, it is simply the price of using the RNN language model.
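To make the per-invocation cost concrete, here is a small sketch that measures the average wall-clock time of launching a command as a fresh process. The stand-in command (`python -c "pass"`) is my own substitute so the snippet runs anywhere; swapping in `["jumanpp"]` would expose the model-loading overhead paid on every call.

```python
import subprocess
import sys
import time

def per_call_seconds(cmd, n=3):
    """Average wall-clock seconds to launch `cmd` as a fresh process.
    Any startup cost (e.g. loading an RNN model) is paid on every call."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(cmd, input=b"", stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / n

# Stand-in command so the sketch runs without Juman++ installed;
# replace with ["jumanpp"] to measure the model-loading overhead directly.
overhead = per_call_seconds([sys.executable, "-c", "pass"])
print("avg seconds per fresh launch: %.4f" % overhead)
```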
Then what should I do?
The solution is simple: use the bundled server script.
This is in fact properly documented in the ver. 1.0.1 manual; see page 5.
Use the __Ruby script__ included in the Juman++ tarball and leave it running in server mode.
According to the manual, start the server with:

```
$ ruby script/server.rb --cmd "jumanpp -B 5" --host host.name --port 1234
```

and call it as a client with:

```
$ echo "Eat cake" | ruby script/client.rb --host host.name --port 1234
```
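For readers who would rather not shell out to client.rb at all, here is a minimal sketch of a line-oriented TCP client in Python. The actual wire protocol spoken by server.rb is not shown in the manual excerpt above, so this assumes a simple "send a line, read the reply" exchange; a tiny echo-server thread stands in for server.rb so the snippet is self-contained and runnable.

```python
import socket
import threading

def tokenize_via_server(text, host="localhost", port=1234):
    """Send one line of UTF-8 text to a line-oriented TCP server and
    return everything it sends back. NOTE: the real framing used by
    Juman++'s server.rb may differ; this is only an illustrative sketch."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(text.encode("utf-8") + b"\n")
        sock.shutdown(socket.SHUT_WR)  # tell the server we are done sending
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")

# --- stand-in server so the sketch runs without Juman++ installed ---
srv = socket.socket()
srv.bind(("localhost", 0))      # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve_once():
    conn, _ = srv.accept()
    line = conn.makefile("rb").readline()
    conn.sendall(line)          # a real server would reply with morphemes
    conn.close()

threading.Thread(target=serve_once, daemon=True).start()
print(tokenize_via_server("Eat cake", port=port).strip())  # prints: Eat cake
```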
So how much time can you save by using server mode?
```
$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | ruby client.rb --host localhost
echo                            0.00s user 0.00s system 21% cpu 0.006 total
ruby client.rb --host localhost 0.04s user 0.01s system 47% cpu 0.092 total
```
It's about one tenth of the time! Amazing! Incidentally, what happens when Juman++ is accessed over the network? I started the Juman++ server on a machine on the local network and measured it.
```
$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | ruby client.rb --host sever.hogehoge
echo                                 0.00s user 0.00s system 22% cpu 0.005 total
ruby client.rb --host sever.hogehoge 0.03s user 0.01s system 26% cpu 0.167 total
```
.... Well, given network latency, this is about what you'd expect. In any case, we found that using server mode resolves the bottleneck.
__Everyone, let's use Juman++ in server mode__
The client script above is written in Ruby, so Ruby users can simply use it as is.
I'm a Python addict, though, so I want to call it from Python.
(If you want a Python version of client.rb, see the code attached at the bottom.)
Officially, a Python package called pyknp has been released, but it only wraps Juman++ via subprocess calls (as of pyknp-0.3), so it cannot benefit from server mode.
I maintain a Python package called JapaneseTokenizer, and I have added Juman++ server-mode support to it.
It works with both Python 2.x and Python 3.x.
- Get tokenization results as a list in one line
- One line for tokenization -> part-of-speech filtering -> stopword removal -> list
- MeCab, Juman, Juman++, and KyTea can all be called with the same interface
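The tokenize -> POS-filter -> stopword-removal pipeline in the list above can be sketched in plain Python. This toy version is not the package's actual implementation; the `(surface, pos)` tuples and all names here are illustrative only.

```python
def filter_tokens(tokens, pos_condition, stopwords=frozenset()):
    """Keep tokens whose POS tag is in pos_condition and whose surface
    form is not a stopword. `tokens` is a list of (surface, pos) pairs."""
    allowed = set(pos_condition)
    return [surface for surface, pos in tokens
            if pos in allowed and surface not in stopwords]

# Toy morphemes, loosely echoing the Tehran example used later in this post.
tokens = [("Tehran", "noun"), ("is", "particle"),
          ("the", "particle"), ("capital", "noun"), ("Iran", "noun")]
print(filter_tokens(tokens, pos_condition=["noun"], stopwords={"capital"}))
# -> ['Tehran', 'Iran']
```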
In server mode, the package talks to server.rb from Juman++. Installation is just:

```
pip install JapaneseTokenizer
```

That's it.
It only takes one line to call Juman ++ in server mode.
```python
>>> from JapaneseTokenizer import JumanppWrapper
>>> sentence = 'Tehran (Persian): تهران ; Tehrān Tehran.ogg pronunciation[help/File]/teɦˈrɔːn/,English:Tehran) is the capital of Iran, West Asia, and the capital of Tehran Province. Population 12,223,598 people. Metropolitan population is 13,413,Reach 348 people.'
>>> list_result = JumanppWrapper(server='localhost', port=12000).tokenize(sentence, return_list=True)
>>> print(list_result)
['Tehran', 'Persia', 'word', 'pronunciation', 'help', 'File', '英word', 'Tehran', 'West', 'Asia', 'Iran', 'capital', 'Tehran', 'State capital', 'population', '12,223,598', 'city', 'Category', 'population', '13,413,348']
```
To select morphemes by part of speech, pass the POS tags you want as a `List[Tuple[str]]`.
See this page for the part-of-speech tag set of Juman++.
```python
>>> from JapaneseTokenizer import JumanppWrapper
>>> sentence = 'Tehran (Persian): تهران ; Tehrān Tehran.ogg pronunciation[help/File]/teɦˈrɔːn/,English:Tehran) is the capital of Iran, West Asia, and the capital of Tehran Province. Population 12,223,598 people. Metropolitan population is 13,413,Reach 348 people.'
>>> pos_condition = [('noun', 'Place name')]
>>> JumanppWrapper(server='localhost', port=12000).tokenize(sentence, return_list=False).filter(pos_condition=pos_condition).convert_list_object()
['Tehran', 'Asia', 'Iran', 'Tehran']
```
You can also retrieve part-of-speech information, surface forms, and the other fields that Juman++ outputs.
See examples.py for more information.
- Added Juman++ support
- Fixed a bug that occurred in Juman server mode
- Added syntactic sugar to do part-of-speech filtering in one line