The story of making a package that speeds up the operation of Juman (Juman++) & KNP

What this article covers

- The story of making a support package for KNP
- Processing speed improves slightly (1.6 to 2 times) [^1]
- With the combination of Juman++ & KNP, it is nearly 10 times faster

What is KNP?

It is a Japanese parser. It can do things like the following.

- Case analysis

Previous article for reference

Wouldn't an RNN do just as well?

In this era of the DNN boom, you may be wondering, "Are you still doing syntactic parsing?" In fact, it has already been shown that a well-designed RNN achieves the highest accuracy on the task of predicting case types. The source code is also publicly available, so I think that approach is a perfectly good choice.

However, [KNP still covers the widest range of analysis targets](http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/corpus/KyotoCorpus4.0/doc/rel_guideline.pdf), and it outputs not only case analysis but everything from morpheme information to syntax information, so it is easy to use as a tool.

A small problem with KNP

The analysis speed is not very fast. ~~Some would say that MeCab and CaboCha are simply too fast~~

It doesn't bother me much when there is only a small amount of input data, but as the amount of data grows, I want to find a way to speed things up, even a little.

However, I don't want to touch KNP's internals.

So the package introduced in this article is built on a simple idea: __"Just distribute the work."__

I made a package that distributes KNP calls

Installation

The package is registered on PyPI, so you can install it with pip.

pip install knp-utils

Features

- Faster than single-threaded processing (obviously)
- Memory consumption is low for parallel processing, so it is easy on the machine
- Provides a simple interface for obtaining KNP analysis results
- Comes with a command line interface and a web application, so people who don't use Python can use it too
- Works with both Python 2.x and 3.x
- If you want to use pyknp, it works [like this](https://github.com/Kensuke-Mitsuzawa/knp-utils-py/search?utf8=%E2%9C%93&q=pyknp_parsed_result+%3D+knp_obj.result%28input_str%3Dknp_parsed_obj.parsed_result%29)

Speed comparison (Juman & KNP)

It's faster. The difference tends to increase as the number of input documents increases. The numbers below are comparisons for 40 documents.

By the way, `pexpect` and `everytime` are the names of the modes that control how the Juman & KNP processes are handled inside the threads. `pexpect` keeps the Juman & KNP processes running. `everytime` launches Juman & KNP for every input text.
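The difference between the two modes can be sketched with the standard library alone. This is only an illustration, not the package's actual code: a tiny Python child process that upper-cases each line stands in for Juman & KNP, plain `subprocess` pipes stand in for the pexpect library, and the names `WORKER_CMD`, `analyze_everytime`, and `analyze_persistent` are hypothetical.

```python
import subprocess
import sys

# Stand-in for a Juman/KNP process (an assumption for illustration only):
# a child that reads one line at a time and echoes it back upper-cased.
WORKER_SOURCE = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)
WORKER_CMD = [sys.executable, "-c", WORKER_SOURCE]


def analyze_everytime(texts):
    """Launch a fresh child process for every input text."""
    results = []
    for text in texts:
        done = subprocess.run(
            WORKER_CMD, input=text + "\n",
            capture_output=True, text=True, check=True,
        )
        results.append(done.stdout.rstrip("\n"))
    return results


def analyze_persistent(texts):
    """Start one child process and reuse it for all input texts."""
    proc = subprocess.Popen(
        WORKER_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    results = []
    try:
        for text in texts:
            proc.stdin.write(text + "\n")
            proc.stdin.flush()  # make sure the child sees the line now
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        proc.stdin.close()  # end of input lets the child exit
        proc.wait()
    return results


if __name__ == "__main__":
    print(analyze_everytime(["abc", "de"]))
    print(analyze_persistent(["abc", "de"]))
```

Both functions return the same results. The only difference is how often the process startup cost is paid: once per text in the first, once in total in the second.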

pexpect mode, finished with :44.13979196548462[sec]
everytime mode, finished with :38.31942701339722[sec]
pyknp, finished with :64.74086809158325[sec]

Speed comparison (Juman++ & KNP)

This is the timing comparison for the combination of Juman++ & KNP. Juman++ (1.02) has a reputation for being slow, but that is because it takes time to load the model file when the process starts.

So if you leave the process running, it will be faster. It's a simple story.

pexpect mode, finished with :48.096940994262695[sec]
everytime mode, finished with :64.07872700691223[sec]
pyknp, finished with : 602.032340992232452[sec]
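To read the two measurements as speedup ratios, the elapsed times can simply be divided by the pyknp time. The snippet below is a small arithmetic sketch. The numbers are copied from the results above, and the names `timings` and `speedups` are made up for this example.

```python
# Elapsed times in seconds, copied from the two measurements above (40 documents).
timings = {
    "Juman & KNP": {
        "pexpect": 44.13979196548462,
        "everytime": 38.31942701339722,
        "pyknp": 64.74086809158325,
    },
    "Juman++ & KNP": {
        "pexpect": 48.096940994262695,
        "everytime": 64.07872700691223,
        "pyknp": 602.032340992232452,
    },
}


def speedups(elapsed, baseline="pyknp"):
    """Return baseline_time / mode_time for every mode except the baseline."""
    base = elapsed[baseline]
    return {mode: base / sec for mode, sec in elapsed.items() if mode != baseline}


if __name__ == "__main__":
    for setup, elapsed in timings.items():
        for mode, ratio in speedups(elapsed).items():
            print(f"{setup}: {mode} mode is {ratio:.2f}x faster than pyknp")
```

For this particular run, the ratios come out to roughly 1.5x and 1.7x for Juman & KNP, and roughly 12.5x and 9.4x for Juman++ & KNP.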

How it works

It just repeats the following process.

- Store the input documents in a sqlite3 DB
- Distribute the Juman & KNP calls across multiple threads [^2]
- Save the analysis results in the sqlite3 DB
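The three steps above can be sketched as follows. This is a minimal illustration under my own assumptions, not the package's implementation: `fake_parse` is a hypothetical stand-in for one Juman & KNP call, the table layout is invented, and all database access stays in the main thread to sidestep sqlite3's thread restrictions.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fake_parse(text):
    """Hypothetical stand-in for one Juman & KNP call."""
    return "parsed:" + text


def run_pipeline(documents, parse=fake_parse, n_threads=4):
    """Run the store / distribute / save cycle.

    documents: dict mapping a document id to its text.
    Returns a dict mapping each document id to its analysis result.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, text TEXT, result TEXT)")

    # 1. Store the input documents in the DB.
    conn.executemany("INSERT INTO docs (id, text) VALUES (?, ?)", documents.items())
    conn.commit()

    # 2. Distribute the parser calls over several threads.
    rows = conn.execute(
        "SELECT id, text FROM docs WHERE result IS NULL ORDER BY id"
    ).fetchall()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(parse, [text for _, text in rows]))

    # 3. Save the analysis results (still in the main thread).
    conn.executemany(
        "UPDATE docs SET result = ? WHERE id = ?",
        [(res, doc_id) for (doc_id, _), res in zip(rows, results)],
    )
    conn.commit()

    saved = dict(conn.execute("SELECT id, result FROM docs"))
    conn.close()
    return saved


if __name__ == "__main__":
    print(run_pipeline({"doc-1": "first text", "doc-2": "second text"}))
```

Keeping the texts and results in a database rather than in Python objects is one plausible way to keep memory use low while the threads work, although the sketch uses an in-memory database for simplicity.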

Finally

Have a quick, easy, and fun parsing life!

[^1]: The difference measured with 40 documents. The more documents you input, the larger the speedup.

[^2]: Multiprocessing would be faster, but I got an error and couldn't get it to work (´・ω・`)
