The story of using Juman++ in server mode: a Python package that makes morphological analysis with Juman++ easy
Juman++ is a morphological analyzer developed in the Kurohashi laboratory at Kyoto University. You might ask, "How is it different from MeCab?" The key difference is that Juman++ uses an RNN (so-called deep learning) language model.
Introductory articles are gradually appearing on Qiita, and I look forward to its wider adoption in the future.
- I tried the new morphological analyzer JUMAN++, and with accuracy higher than I expected, I'm thinking of switching from MeCab
- Comparing multiple morphological analyzers
One concern is that installing the dependency libraries may force you to update gcc, which can break other code on the machine. In that case, the clean solution is to prepare a Docker environment.
Now, the problem is the second point: speed. According to this article:

> MeCab took about 10 seconds, while JUMAN++ took more than 10 hours

So concerns about speed are certainly justified.
I also made a measurement comparison in my environment.
```
$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | mecab
echo   0.00s user 0.00s system 26% cpu 0.005 total
mecab  0.00s user 0.00s system 49% cpu 0.007 total

$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | jumanpp
echo     0.00s user 0.00s system 31% cpu 0.004 total
jumanpp  0.14s user 0.35s system 53% cpu 0.931 total
```
Compared with MeCab (0.007s vs. 0.931s total), the numbers differ by more than two orders of magnitude.
This is reportedly not a flaw in the design itself, but the cost of loading the model on every invocation. In other words, it is simply the price of using the RNN language model.
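To make the per-invocation cost concrete, here is a small sketch that measures the average wall-clock time of launching a command as a fresh process. The stand-in command (`python -c "pass"`) is my own substitute so the snippet runs anywhere; swapping in `["jumanpp"]` would expose the model-loading overhead paid on every call.

```python
import subprocess
import sys
import time

def per_call_seconds(cmd, n=3):
    """Average wall-clock seconds to launch `cmd` as a fresh process.
    Any startup cost (e.g. loading an RNN model) is paid on every call."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run(cmd, input=b"", stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
    return (time.perf_counter() - start) / n

# Stand-in command so the sketch runs without Juman++ installed;
# replace with ["jumanpp"] to measure the model-loading overhead directly.
overhead = per_call_seconds([sys.executable, "-c", "pass"])
print("avg seconds per fresh launch: %.4f" % overhead)
```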
Then what should I do?
The solution is simple: use the bundled server script.
This is in fact properly documented in the ver. 1.0.1 manual; see page 5.
Use the __Ruby script__ included in the Juman++ tarball and leave it running in server mode.
According to the manual, start the server with:

```
$ ruby script/server.rb --cmd "jumanpp -B 5" --host host.name --port 1234
```

and call it as a client with:

```
$ echo "Eat cake" | ruby script/client.rb --host host.name --port 1234
```
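For readers who would rather not shell out to client.rb at all, here is a minimal sketch of a line-oriented TCP client in Python. The actual wire protocol spoken by server.rb is not shown in the manual excerpt above, so this assumes a simple "send a line, read the reply" exchange; a tiny echo-server thread stands in for server.rb so the snippet is self-contained and runnable.

```python
import socket
import threading

def tokenize_via_server(text, host="localhost", port=1234):
    """Send one line of UTF-8 text to a line-oriented TCP server and
    return everything it sends back. NOTE: the real framing used by
    Juman++'s server.rb may differ; this is only an illustrative sketch."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(text.encode("utf-8") + b"\n")
        sock.shutdown(socket.SHUT_WR)  # tell the server we are done sending
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")

# --- stand-in server so the sketch runs without Juman++ installed ---
srv = socket.socket()
srv.bind(("localhost", 0))      # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]

def serve_once():
    conn, _ = srv.accept()
    line = conn.makefile("rb").readline()
    conn.sendall(line)          # a real server would reply with morphemes
    conn.close()

threading.Thread(target=serve_once, daemon=True).start()
print(tokenize_via_server("Eat cake", port=port).strip())  # prints: Eat cake
```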
So how much time can you save by using server mode?
```
$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | ruby client.rb --host localhost
echo                            0.00s user 0.00s system 21% cpu 0.006 total
ruby client.rb --host localhost 0.04s user 0.01s system 47% cpu 0.092 total
```
It's about one tenth of the time! Amazing! Incidentally, what happens when Juman++ is accessed over the network? I started the Juman++ server on a machine on the local network and measured it.
```
$ time echo "The right of foreigners to vote was approved. I am also the last day of Sunday." | ruby client.rb --host sever.hogehoge
echo                                 0.00s user 0.00s system 22% cpu 0.005 total
ruby client.rb --host sever.hogehoge 0.03s user 0.01s system 26% cpu 0.167 total
```
.... Well, given network latency, this is about what you'd expect. In any case, we found that using server mode resolves the bottleneck.
__Everyone, let's use Juman++ in server mode__
The client script above is written in Ruby, so Ruby users can simply use it as is.
I'm a Python addict, though, so I want to call it from Python.
(If you want a Python version of client.rb, see the code attached at the bottom.)
Officially, a Python package called pyknp has been released, but it only wraps Juman++ via subprocess calls (as of pyknp-0.3), so it cannot benefit from server mode.
I maintain a Python package called JapaneseTokenizer, and I have added Juman++ server-mode support to it.
It works with both Python 2.x and Python 3.x.
- Get tokenization results as a list in one line
- One line for tokenization -> part-of-speech filtering -> stopword removal -> list
- MeCab, Juman, Juman++, and KyTea can all be called with the same interface
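The tokenize -> POS-filter -> stopword-removal pipeline in the list above can be sketched in plain Python. This toy version is not the package's actual implementation; the `(surface, pos)` tuples and all names here are illustrative only.

```python
def filter_tokens(tokens, pos_condition, stopwords=frozenset()):
    """Keep tokens whose POS tag is in pos_condition and whose surface
    form is not a stopword. `tokens` is a list of (surface, pos) pairs."""
    allowed = set(pos_condition)
    return [surface for surface, pos in tokens
            if pos in allowed and surface not in stopwords]

# Toy morphemes, loosely echoing the Tehran example used later in this post.
tokens = [("Tehran", "noun"), ("is", "particle"),
          ("the", "particle"), ("capital", "noun"), ("Iran", "noun")]
print(filter_tokens(tokens, pos_condition=["noun"], stopwords={"capital"}))
# -> ['Tehran', 'Iran']
```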
In server mode, the package talks to server.rb from Juman++. Installation is just:

```
pip install JapaneseTokenizer
```

That's it.
It only takes one line to call Juman ++ in server mode.
```python
>>> from JapaneseTokenizer import JumanppWrapper
>>> sentence = 'Tehran (Persian): تهران ; Tehrān Tehran.ogg pronunciation[help/File]/teɦˈrɔːn/,English:Tehran) is the capital of Iran, West Asia, and the capital of Tehran Province. Population 12,223,598 people. Metropolitan population is 13,413,Reach 348 people.'
>>> list_result = JumanppWrapper(server='localhost', port=12000).tokenize(sentence, return_list=True)
>>> print(list_result)
['Tehran', 'Persia', 'word', 'pronunciation', 'help', 'File', '英word', 'Tehran', 'West', 'Asia', 'Iran', 'capital', 'Tehran', 'State capital', 'population', '12,223,598', 'city', 'Category', 'population', '13,413,348']
```
To select morphemes by part of speech, pass the POS tags you want as a `List[Tuple[str]]`.
See this page for the part-of-speech tag set of Juman++.
```python
>>> from JapaneseTokenizer import JumanppWrapper
>>> sentence = 'Tehran (Persian): تهران ; Tehrān Tehran.ogg pronunciation[help/File]/teɦˈrɔːn/,English:Tehran) is the capital of Iran, West Asia, and the capital of Tehran Province. Population 12,223,598 people. Metropolitan population is 13,413,Reach 348 people.'
>>> pos_condition = [('noun', 'Place name')]
>>> JumanppWrapper(server='localhost', port=12000).tokenize(sentence, return_list=False).filter(pos_condition=pos_condition).convert_list_object()
['Tehran', 'Asia', 'Iran', 'Tehran']
```
You can also retrieve part-of-speech information, surface forms, and the other fields that Juman++ outputs.
See examples.py for more information.
- Added Juman++ support
- Fixed a bug that occurred in Juman server mode
- Added syntactic sugar to do part-of-speech filtering in one line