I recently started doing text analysis with Python and MeCab for my research, but I had a hard time adding words to the user dictionary, so I wrote the procedure down for my own reference.
Create the dictionary source as a CSV file. Each row lists the fields in this order: surface form, left context ID, right context ID, cost, part of speech, part-of-speech subcategory 1, part-of-speech subcategory 2, part-of-speech subcategory 3, conjugation type, conjugation form, base form, reading, pronunciation.
vim add_term.csv
アナと雪の女王,,,1,名詞,一般,*,*,*,*,アナと雪の女王,アナトユキノジョオウ,アナトユキノジョオー
If you leave the left context ID and right context ID blank, they are filled in automatically. The cost controls how readily the word is chosen: the smaller the value, the more likely the word is to appear in the output. There seems to be a method for estimating the cost, but this time I simply set it to 1. Fields you do not need can be set to "*".
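If you have more than a handful of words, it may be easier to generate the CSV than to type it by hand. The following is only a minimal sketch: `make_row` and `rows_to_csv` are hypothetical helper names, and the part-of-speech values 名詞 (noun) and 一般 (general) assume an IPAdic-style system dictionary. The example entry is アナと雪の女王, the Japanese title of Frozen.

```python
import csv
import io

# Field order described above (13 columns per entry).
FIELDS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
]


def make_row(surface, reading, pronunciation, cost=1, pos="名詞", pos_sub1="一般"):
    """Return one dictionary entry as a list of 13 strings (hypothetical helper)."""
    return [
        surface, "", "", str(cost),  # blank context IDs are filled in automatically
        pos, pos_sub1, "*", "*",     # unused subcategories are "*"
        "*", "*", surface, reading, pronunciation,
    ]


def rows_to_csv(rows):
    """Serialize entries to CSV text with Unix line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


row = make_row("アナと雪の女王", "アナトユキノジョオウ", "アナトユキノジョオー")
csv_text = rows_to_csv([row])

# To save it for mecab-dict-index, write it out as UTF-8:
#   with open("add_term.csv", "w", encoding="utf-8", newline="") as f:
#       f.write(csv_text)
```

Keeping the cost and part of speech as keyword arguments makes it easy to adjust individual entries later without retyping the whole row.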
Build the user dictionary from the CSV file. For this, use mecab-dict-index, which is installed together with MeCab.
# Create the directory where the user dictionary will be saved
mkdir /usr/local/lib/mecab/dic/userdic
# Build the dictionary
sudo /usr/lib/mecab/mecab-dict-index \
-d /usr/local/mecab/dic/ipadic \
-u /usr/local/lib/mecab/dic/userdic/add.dic \
-f utf-8 \
-t utf-8 \
add_term.csv
The options are:
-d  directory containing the system dictionary
-u  path where the user dictionary is written
-f  character encoding of the input CSV file
-t  character encoding of the generated user dictionary
Run mecab-dict-index by its full path, and specify UTF-8 as the character encoding for both -f and -t.
reading add_term.csv ... 1
emitting double-array: 100% |###########################################|
done!
If output like the above is displayed, the build succeeded.
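If you rebuild the dictionary often, the same command can be assembled from Python. This is a sketch only: `build_dict_index_cmd` is a hypothetical helper, and the paths are the ones from my environment above, so adjust them to yours. The actual run is left as a comment because it needs MeCab installed and write permission on the destination.

```python
import shlex


def build_dict_index_cmd(index_bin, sysdic_dir, out_dic, csv_path, charset="utf-8"):
    """Return the mecab-dict-index argument list (hypothetical helper)."""
    return [
        index_bin,
        "-d", sysdic_dir,  # directory containing the system dictionary
        "-u", out_dic,     # where the user dictionary is written
        "-f", charset,     # encoding of the input CSV
        "-t", charset,     # encoding of the generated dictionary
        csv_path,
    ]


cmd = build_dict_index_cmd(
    "/usr/lib/mecab/mecab-dict-index",
    "/usr/local/mecab/dic/ipadic",
    "/usr/local/lib/mecab/dic/userdic/add.dic",
    "add_term.csv",
)
print(shlex.join(cmd))

# To actually run it:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path ever contains spaces.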
Add the following line to the MeCab configuration file.
sudo vim /etc/mecabrc
userdic = /usr/local/lib/mecab/dic/userdic/add.dic
The official website says to add this line to either /usr/local/lib/mecab/dic/ipadic/dicrc or /usr/local/etc/mecabrc, but that did not work in my environment. Since a mecabrc existed at the location above (/etc/mecabrc), adding the line there worked correctly. If you want to register multiple dictionaries, separate them with commas:
userdic = AAA.dic,BBB.dic
Written this way, I was able to register them.
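The edit to mecabrc can also be scripted. The sketch below works on the file contents as a string and either replaces an existing `userdic` line or appends one. `set_userdic` is a hypothetical helper, and the sample mecabrc text is an assumption for illustration, not the contents of any particular installation.

```python
def set_userdic(rc_text, dic_paths):
    """Return rc_text with exactly one 'userdic = ...' line (hypothetical helper)."""
    new_line = "userdic = " + ",".join(dic_paths)
    out = []
    replaced = False
    for line in rc_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == "userdic":
            if not replaced:
                out.append(new_line)  # replace the first active userdic line
                replaced = True
            # any further userdic lines are dropped
        else:
            out.append(line)          # comments and other settings are kept
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Assumed sample contents; a line starting with ';' is a comment.
sample_rc = "dicdir = /var/lib/mecab/dic/debian\n; userdic = /home/foo/bar/user.dic\n"
updated = set_userdic(sample_rc, ["AAA.dic", "BBB.dic"])
print(updated)
```

Working on a string first makes it easy to inspect the result before writing it back to /etc/mecabrc with the necessary permissions.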
Check from the command line
# Before adding the entry
mecab
アナと雪の女王
アナ	名詞,一般,*,*,*,*,アナ,アナ,アナ
と	助詞,並立助詞,*,*,*,*,と,ト,ト
雪	名詞,一般,*,*,*,*,雪,ユキ,ユキ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
女王	名詞,一般,*,*,*,*,女王,ジョオウ,ジョオー
EOS
# After adding the entry
アナと雪の女王
アナと雪の女王	名詞,一般,*,*,*,*,アナと雪の女王,アナトユキノジョオウ,アナトユキノジョオー
EOS
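To check this automatically rather than by eye, the default mecab output can be parsed in Python. This is a rough sketch assuming the default output format, where each token line is the surface form, a tab, and then comma-separated features, with EOS closing each sentence. `parse_mecab_output` and `is_single_token` are hypothetical helper names, and the sample text stands in for real mecab output.

```python
def parse_mecab_output(text):
    """Split default-format mecab output into (surface, features) pairs."""
    tokens = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the sentence terminator
        surface, _, feature = line.partition("\t")
        tokens.append((surface, feature.split(",")))
    return tokens


def is_single_token(tokens, word):
    """True if the whole input came back as exactly one token equal to word."""
    return [surface for surface, _ in tokens] == [word]


# Stand-in for the output of the mecab command after the entry was added.
sample = (
    "アナと雪の女王\t名詞,一般,*,*,*,*,アナと雪の女王,アナトユキノジョオウ,アナトユキノジョオー\n"
    "EOS\n"
)
tokens = parse_mecab_output(sample)
print(is_single_token(tokens, "アナと雪の女王"))
```

If the word is still split into several tokens, the user dictionary is probably not being loaded, or the cost is too high compared with the competing entries.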
Use with MeCab in Python
python3
>>> import MeCab
>>> m_t = MeCab.Tagger('-Ochasen \
-u /usr/local/lib/mecab/dic/userdic/add.dic')
>>> txt = "Let's go see Anna and the Snow Queen."
>>> print(m_t.parse(txt))
Let's go see Anna and the Snow Queen.
If you want to use it together with an installed mecab-ipadic-neologd, also pass the system dictionary with -d:
python3
>>> import MeCab
>>> m_t = MeCab.Tagger('-Ochasen \
-d /usr/lib/mecab/dic/mecab-ipadic-neologd \
-u /usr/local/lib/mecab/dic/userdic/add.dic')
With this change, the neologd system dictionary and the user dictionary are loaded together.
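Rather than editing a long option string by hand each time, the arguments can be assembled by a small function. This is only a sketch: `tagger_args` is a hypothetical helper, the paths are the ones from my environment, and the comma-joined form for several user dictionaries mirrors the mecabrc syntax, on the assumption that the -u option accepts it in the same way. Paths containing spaces are not handled.

```python
def tagger_args(user_dics, sys_dic=None, output_format="chasen"):
    """Build the option string passed to MeCab.Tagger (hypothetical helper)."""
    parts = ["-O" + output_format]
    if sys_dic:
        parts += ["-d", sys_dic]              # system dictionary, e.g. neologd
    if user_dics:
        parts += ["-u", ",".join(user_dics)]  # one or more user dictionaries
    return " ".join(parts)


args = tagger_args(
    ["/usr/local/lib/mecab/dic/userdic/add.dic"],
    sys_dic="/usr/lib/mecab/dic/mecab-ipadic-neologd",
)
print(args)

# With MeCab installed and the dictionaries in place:
#   import MeCab
#   m_t = MeCab.Tagger(args)
#   print(m_t.parse(txt))
```

Omitting `sys_dic` falls back to whichever system dictionary mecabrc points at, which matches the first example above.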
After some trial and error, I confirmed that this works from Python. I would appreciate it if you could point out any mistakes.
References:
How to add words
Adding words to MeCab user dictionary