- The story of making a support package for KNP
- Processing speed improves slightly (1.6 to 2 times) [^1]
- With the Juman++ & KNP combination, it is nearly 10 times faster
KNP is a "Japanese parser". With it you can do things like:

- Case analysis
See the previous article for reference.
In this era of the DNN boom, you may be wondering, "Are you still doing parsing?" In fact, it has already been shown that a well-designed RNN achieves the highest accuracy on the task of predicting case types. The source code for that work is also publicly available, so going that route is a perfectly fine option.
However, [KNP is still the standard target of this kind of analysis](http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/corpus/KyotoCorpus4.0/doc/rel_guideline.pdf), and since KNP outputs not only case analysis but also syntactic information on top of morphological information, it is convenient to use as a tool.
Its analysis speed, however, is not very fast. ~~Or maybe MeCab and CaboCha are just too fast~~
This doesn't really matter when there is only a small amount of input data, but as the amount of data grows, I'd like to speed things up even a little.
That said, I don't want to touch the internals of KNP itself.
So the package introduced in this article is built on the simple idea of __"just distribute the work"__.
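That idea can be sketched with a thread pool that fans independent documents out to workers. This is only an illustrative sketch, not the package's actual implementation; `parse_one` here is a hypothetical stand-in for a Juman & KNP invocation.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_one(doc):
    # Hypothetical stand-in for a Juman & KNP call; a real worker
    # would pipe `doc` through the external parser processes.
    return doc.upper()

def parse_all(docs, n_workers=4):
    # Each document is independent, so they can be analyzed concurrently.
    # Threads are enough here because the heavy lifting happens in
    # external processes, so the GIL is released while waiting on I/O.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(parse_one, docs))

print(parse_all(["knp", "juman"]))  # → ['KNP', 'JUMAN']
```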
The package is registered on PyPI, so you can install it with pip:

```
pip install knp-utils
```
- Faster than a single thread (obviously)
- Low memory consumption even with parallel processing. Machine friendly.
- Provides a simple interface for obtaining KNP analysis results
- Comes with a command-line interface / web application, so it can be used by people outside the Python world as well
- Works with both Python 2.x and 3.x
- For those who want to use pyknp [this way](https://github.com/Kensuke-Mitsuzawa/knp-utils-py/search?utf8=%E2%9C%93&q=pyknp_parsed_result+%3D+knp_obj.result%28input_str%3Dknp_parsed_obj.parsed_result%29), that also works
It is faster. The difference tends to grow as the number of input documents increases. The numbers below are a comparison over 40 documents.
By the way, `pexpect` and `everytime` are the names of the modes that handle the Juman & KNP processes inside the worker threads.
In `pexpect` mode the Juman & KNP processes are kept running; in `everytime` mode Juman & KNP are launched anew for every input text.
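The difference between the two modes can be sketched with plain `subprocess` calls. This is an illustrative sketch, not the package's implementation; `cat` stands in for the Juman/KNP binaries, which simply echoes each line back.

```python
import subprocess

# "everytime" mode: launch a fresh process per input text, paying
# the process-startup cost every single time.
def everytime(texts):
    return [
        subprocess.run(["cat"], input=t + "\n",
                       capture_output=True, text=True).stdout.strip()
        for t in texts
    ]

# "pexpect"-style mode: keep one process alive and stream inputs to
# it, paying the startup cost only once.
def persistent(texts):
    proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True, bufsize=1)
    out = []
    for t in texts:
        proc.stdin.write(t + "\n")
        proc.stdin.flush()
        out.append(proc.stdout.readline().strip())
    proc.stdin.close()
    proc.wait()
    return out

print(everytime(["a", "b"]) == persistent(["a", "b"]))  # → True
```

Both modes produce identical results; they differ only in how often the external process is started.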
```
pexpect mode, finished with :44.13979196548462[sec]
everytime mode, finished with :38.31942701339722[sec]
pyknp, finished with :64.74086809158325[sec]
```
Next, a time comparison when combining Juman++ & KNP. Juman++ (1.02) is often said to be slow; this is because it takes time to load the model file when the process starts.
So if you keep the process running, it gets faster. It's that simple.
```
pexpect mode, finished with :48.096940994262695[sec]
everytime mode, finished with :64.07872700691223[sec]
pyknp, finished with : 602.032340992232452[sec]
```
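The shape of those numbers follows from simple arithmetic: a fixed startup cost paid once versus paid per document. The costs below are made-up illustrative values, not measurements.

```python
# Hypothetical costs (illustrative only, not measured):
startup_ms = 1000   # e.g. time to load a Juman++ model once
per_doc_ms = 50     # time to actually analyze one document
n_docs = 40

# Relaunching per document pays the startup cost n_docs times.
everytime_total = n_docs * (startup_ms + per_doc_ms)
# Keeping the process alive pays it exactly once.
persistent_total = startup_ms + n_docs * per_doc_ms

print(everytime_total, persistent_total)  # → 42000 3000
```

The larger the startup cost relative to the per-document cost, the bigger the win from keeping the process alive, which matches the Juman++ results above.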
It just repeats the following process.
- Store the input documents in a sqlite3 DB
- Distribute the Juman & KNP calls across multiple threads [^2]
- Save the analysis results in the sqlite3 DB
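The three steps above can be sketched as follows. This is a minimal sketch, not the package's actual code; `analyze` is a hypothetical stand-in for a Juman & KNP call, and results are written back from a single thread to stay within sqlite3's cross-thread connection rules.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def analyze(text):
    # Hypothetical stand-in for a Juman & KNP invocation.
    return text[::-1]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT)")
db.execute("CREATE TABLE results (doc_id INTEGER, parsed TEXT)")

# 1. Store the input documents in the sqlite3 DB.
docs = ["first doc", "second doc"]
db.executemany("INSERT INTO docs (text) VALUES (?)", [(d,) for d in docs])

# 2. Distribute the (stand-in) analysis calls over worker threads.
rows = db.execute("SELECT id, text FROM docs").fetchall()
with ThreadPoolExecutor(max_workers=4) as pool:
    parsed = list(pool.map(lambda r: (r[0], analyze(r[1])), rows))

# 3. Save the analysis results back into the sqlite3 DB.
db.executemany("INSERT INTO results (doc_id, parsed) VALUES (?, ?)", parsed)

print(db.execute("SELECT parsed FROM results ORDER BY doc_id").fetchall())
# → [('cod tsrif',), ('cod dnoces',)]
```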
Have a quick, easy, and fun parsing life!
[^1]: The difference with 40 input documents. The more documents you feed in, the larger the speedup.
[^2]: multiprocessing would be faster, but I got an error and couldn't make it work (´・ω・`)