I have been working on natural language processing for a few weeks, mostly with R and MeCab. I recently heard that GiNZA, a Python library, is an excellent NLP tool, so I am migrating to it. The first thing I needed was to add a word that is not in the dictionary, so here is a summary of the steps.
GiNZA can be installed easily with the pip command.
$ pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"
GiNZA appears to use SudachiPy for morphological analysis, so adding a word works basically the same way as adding a word to a Sudachi user dictionary.
First, prepare the dictionary as a CSV file. Each row lists the following fields, in this order:
Headword (for the trie), left connection ID, right connection ID, cost, headword (for display), part of speech 1, 2, 3, 4, part of speech (conjugation type), part of speech (conjugation form), reading, normalized form, dictionary form ID, split type, A-unit split information, B-unit split information, unused
$ vim add_term.csv
アナと雪の女王,4786,4786,5000,アナと雪の女王,名詞,固有名詞,一般,*,*,*,アナトユキノジョオウ,アナと雪の女王,*,*,*,*,*
Each field is described in detail in the Sudachi user dictionary documentation; this time I simply used the recommended connection IDs and cost. Enter "*" for fields you do not need. The documentation also lists example entries, so I think it is best to pick the one closest to the word you are adding.
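To make the column order easier to follow, here is a minimal sketch that builds one row from named fields. The column names, the `make_row` helper, and its default values are my own labels for illustration, not part of Sudachi; the headword and reading are what I assume the entry above looks like in Japanese.

```python
# Hypothetical helper: names the 18 columns so a row is easier to read and check.
COLUMNS = [
    "headword_trie", "left_id", "right_id", "cost", "headword_display",
    "pos1", "pos2", "pos3", "pos4", "conj_type", "conj_form",
    "reading", "normalized_form", "dictionary_form_id", "split_type",
    "a_unit_split", "b_unit_split", "unused",
]


def make_row(headword, reading, left_id=4786, right_id=4786, cost=5000,
             pos=("名詞", "固有名詞", "一般", "*", "*", "*")):
    """Return one user-dictionary row as a list of 18 strings."""
    fields = {name: "*" for name in COLUMNS}  # "*" marks an unneeded field
    fields.update(
        headword_trie=headword,
        left_id=str(left_id),
        right_id=str(right_id),
        cost=str(cost),
        headword_display=headword,
        reading=reading,
        normalized_form=headword,
    )
    # The six part-of-speech columns sit at positions 6 to 11 of the row.
    for name, value in zip(COLUMNS[5:11], pos):
        fields[name] = value
    return [fields[name] for name in COLUMNS]


row = make_row("アナと雪の女王", "アナトユキノジョオウ")
print(",".join(row))
```

The printed line can be pasted into add_term.csv. Counting the columns in code is a cheap way to catch a missing "*" before running the build.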
Since SudachiPy was installed along with GiNZA, use the sudachipy command to build the dictionary. The command is sudachipy ubuild -s [system dictionary path] [path of created csv file].
$ sudachipy ubuild \
-s .pyenv/versions/anaconda3-5.2.0/envs/ginza/lib/python3.6/site-packages/sudachidict/resources/system.dic \
add_term.csv
The path is long because I am building the environment with pyenv, but please change it according to your environment. When executed, the following message will be displayed.
reading the source file...2 words
writing the POS table...2 bytes
writing the connection matrix...4 bytes
building the trie...done
writing the trie...1028 bytes
writing the word-ID table...14 bytes
writing the word parameters...16 bytes
writing the word_infos...96 bytes
writing word_info offsets...8 bytes
If successful, "user.dic" will be added to the current directory.
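Because the system dictionary path differs per environment, it can help to look it up from Python instead of typing it by hand. The following is only a sketch: it assumes the dictionary lives at resources/system.dic inside a package named sudachidict, as in my environment above, and the `find_system_dic` helper is a name I made up. Other versions may use a different package name or layout, in which case it simply returns None.

```python
import importlib.util
from pathlib import Path


def find_system_dic(package="sudachidict"):
    """Return <package>/resources/system.dic as a Path, or None if not found."""
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    # A missing package gives None; a plain module has no search locations.
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / "resources" / "system.dic"
        if candidate.is_file():
            return candidate
    return None


# Prints the path to pass to "sudachipy ubuild -s", or None if it was not found.
print(find_system_dic())
```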
Next, add the path of the generated user.dic to the following location in the SudachiPy configuration file.
$ vim .pyenv/versions/anaconda3-5.2.0/envs/ginza/lib/python3.6/site-packages/sudachipy/resources/sudachi.json
{
"characterDefinitionFile" : "char.def",
"userDict" : ["user.dic path"],  # add the path to user.dic here (do not keep this comment in the real file)
"inputTextPlugin" : [
...
...
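The same change can also be made from Python, which avoids breaking the JSON by hand. This is a minimal sketch, assuming the configuration file is plain JSON with "userDict" holding a list of paths as shown above; `add_user_dict` is a hypothetical helper, and the demo writes to a temporary sample file rather than the real sudachi.json.

```python
import json
import tempfile
from pathlib import Path


def add_user_dict(config_path, user_dic_path):
    """Append user_dic_path to the "userDict" list in a sudachi.json-style file."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    user_dicts = config.setdefault("userDict", [])
    if str(user_dic_path) not in user_dicts:  # do not register the same file twice
        user_dicts.append(str(user_dic_path))
    config_path.write_text(
        json.dumps(config, ensure_ascii=False, indent=4) + "\n", encoding="utf-8"
    )
    return config


# Demo on a throwaway file; "/path/to/user.dic" is a placeholder path.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sudachi.json"
    sample.write_text('{"characterDefinitionFile": "char.def"}', encoding="utf-8")
    result = add_user_dict(sample, "/path/to/user.dic")
    reloaded = json.loads(sample.read_text(encoding="utf-8"))

print(result["userDict"])
```

It is probably wise to back up the real configuration file before pointing this at it.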
--Check from the command line
SudachiPy is installed when GiNZA is installed, so use it.
$ sudachipy
# Before adding the word
アナと雪の女王
アナ	名詞,普通名詞,一般,*,*,*	アナ
と	助詞,格助詞,*,*,*,*	と
雪	名詞,普通名詞,一般,*,*,*	雪
の	助詞,格助詞,*,*,*,*	の
女王	名詞,普通名詞,一般,*,*,*	女王
EOS
# After adding the word
アナと雪の女王
アナと雪の女王	名詞,固有名詞,一般,*,*,*	アナと雪の女王
EOS
--Use with GiNZA in python
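$ python3
import spacy

# Load the GiNZA model (it tokenizes with SudachiPy, so the user dictionary applies)
nlp = spacy.load('ja_ginza')
doc = nlp('アナと雪の女王を見に行きたい')  # "I want to go see Frozen"
for sent in doc.sents:
    for token in sent:
        print(token.orth_)  # surface form of each token
Execution result
アナと雪の女王
を
見
に
行き
たい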
I was able to divide it into morphemes correctly!
Since GiNZA uses SudachiPy for morphological analysis, I found this easier than adding words in MeCab. I would appreciate it if you could point out any mistakes.