When I tried to run morphological analysis on a Japanese sentence with Juman++ via PyKNP, I got this error...!
ValueError: invalid literal for int() with base 10: 'input'
(How to use JUMAN / Juman++ from Python is omitted here.)
File "/$HOME/.pyenv/versions/anaconda3-2019.03/lib/python3.6/site-packages/pyknp/juman/morpheme.py", line 143, in _parse_spec
self.hinsi_id = int(parts[4])
ValueError: invalid literal for int() with base 10: 'input'
For the time being, I searched on the error message and found these:
- Symbols to be careful of when using JUMAN from PyKNP
- Talk about the cause and countermeasures for ValueError when playing with JUMAN++ - EnsekiTT Blog
According to them, half-width spaces and half-width characters seem to be the problem.
So I replaced all half-width characters with full-width ones.
Even after that, the same error kept appearing. Apparently the cause was different from the situation in the articles above.
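For reference, the replacement step can be sketched with the standard library alone. This is only a minimal sketch: the name to_fullwidth is made up for illustration, it assumes that only ASCII characters and the half-width space need converting, and it is not necessarily how the linked articles do it.

```python
# Map printable ASCII (0x21-0x7E) onto the full-width forms block
# (U+FF01-U+FF5E), which sits at a fixed offset of 0xFEE0.
FULLWIDTH_TABLE = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
# The half-width space maps to the ideographic space instead.
FULLWIDTH_TABLE[0x20] = 0x3000


def to_fullwidth(text):
    """Return text with ASCII characters replaced by full-width ones.

    Hypothetical helper for illustration; half-width katakana and
    other non-ASCII characters are left untouched.
    """
    return text.translate(FULLWIDTH_TABLE)


print(to_fullwidth("BERT is #1!"))
```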
So I ran it under pdb and checked the contents of the variable parts at the time of the error ~~(should have done that from the start)~~.
(Pdb) parts
['InvalidParameter:', 'byte', 'size', 'of', 'input', 'string', '(4302)', 'is', 'greater', 'than', 'maximum', 'allowed', '(4096)']
(So by design, when an error occurs, the error message ends up in the list that normally holds the analysis result...)
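This also explains why the exception mentions 'input'. The snippet below is a small reproduction under two assumptions: the message string is rebuilt by joining the pdb list with spaces, and the line is split on whitespace the way that list suggests. It does not call PyKNP at all.

```python
# Error text rebuilt from the list shown in pdb (an assumption, not
# captured directly from Juman++).
message = ("InvalidParameter: byte size of input string (4302) "
           "is greater than maximum allowed (4096)")

# Splitting on whitespace gives the same list as `parts` in pdb.
parts = message.split()
print(parts[4])

# morpheme.py line 143 does int(parts[4]); with this input the fifth
# field is a word, not a part-of-speech ID, so int() raises.
try:
    hinsi_id = int(parts[4])
except ValueError as exc:
    error_text = str(exc)
    print(error_text)
```

The fifth field of a normal analysis line is a numeric ID, so the ValueError is just a side effect of the error message being parsed as if it were a result.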
Apparently **the input string was too large (in bytes)**. **The limit for the input string seems to be 4096 bytes in total**, so it is best to keep each input within that size.
I was in the middle of building a dataset to feed to BERT, so sentences that are too long can simply be skipped! ~~UTF-8 uses a different number of bytes depending on the character, so cutting the text is a hassle~~
Detect strings larger than 4096 bytes with the following condition and apply some workaround (split or skip).
It computes the number of UTF-8 bytes in the string text and compares it with the limit.
if len(text.encode('utf-8')) > 4096:
See here for how to get the number of bytes, rather than the number of characters, in a string:
- Python string length and number of bytes by encoding - Memoize2
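If skipping long inputs is not acceptable, the split workaround could look roughly like this. It is a sketch under assumptions: the names byte_len, split_chars and split_for_juman are hypothetical, sentences are assumed to end with "。", and the 4096-byte limit is taken from the error message above rather than from Juman++ documentation.

```python
import re

MAX_BYTES = 4096  # limit reported in the error message


def byte_len(text):
    """Number of bytes of text when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def split_chars(text, max_bytes):
    """Cut text into pieces of at most max_bytes bytes without
    splitting a character (assumes max_bytes >= 4)."""
    pieces, current, size = [], [], 0
    for ch in text:
        n = byte_len(ch)
        if current and size + n > max_bytes:
            pieces.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        pieces.append("".join(current))
    return pieces


def split_for_juman(text, max_bytes=MAX_BYTES):
    """Greedily pack sentences into chunks of at most max_bytes bytes."""
    # Each match is one sentence including its trailing "。",
    # or the unterminated remainder at the end of the text.
    sentences = re.findall(r"[^。]*。|[^。]+$", text)
    chunks, current = [], ""
    for sentence in sentences:
        if byte_len(current + sentence) <= max_bytes:
            current += sentence
            continue
        if current:
            chunks.append(current)
            current = ""
        if byte_len(sentence) <= max_bytes:
            current = sentence
        else:
            # A single sentence over the limit: fall back to cutting
            # at character boundaries.
            *full, current = split_chars(sentence, max_bytes)
            chunks.extend(full)
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the analyzer separately. Counting bytes per character is what keeps the cut from landing in the middle of a multi-byte UTF-8 character.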
Combined with the articles introduced above, the causes of this error when using Juman++ from PyKNP were:
- Half-width spaces
- Some half-width symbols
- **Input string larger than 4096 bytes**