SAMPLE
私 名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
姉 名詞,一般,*,*,*,*,姉,アネ,アネ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
芥川龍之介 名詞,固有名詞,作家,*,*,*,芥川龍之介,アクタガワリュウノスケ,アクタガワリュウノスケ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
本 名詞,一般,*,*,*,*,本,ホン,ホン
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
よく 副詞,一般,*,*,*,*,よく,ヨク,ヨク
読ん 動詞,自立,*,*,五段・マ行,連用タ接続,読む,ヨン,ヨン
で 助詞,接続助詞,*,*,*,*,で,デ,デ
いる 動詞,非自立,*,*,一段,基本形,いる,イル,イル
。 記号,句点,*,*,*,*,。,。,。
BOS/EOS,*,*,*,*,*,*,*,*
REFERENCE
How to add vocabulary to MeCab dictionary [Windows 10, Ubuntu 18.04]
Prepare the dictionary entries in a csv file encoded as utf-8.
Directory: C:\Users\username\Desktop\MeCabUserDic
File name: test_dic.csv
芥川龍之介,,,5543,名詞,固有名詞,作家,*,*,*,芥川龍之介,アクタガワリュウノスケ,アクタガワリュウノスケ
太宰治,,,5543,名詞,固有名詞,作家,*,*,*,太宰治,ダザイオサム,ダザイオサム
The fields of each row are, in order: surface form, left context ID, right context ID, cost, part of speech, part-of-speech subcategory 1, part-of-speech subcategory 2, part-of-speech subcategory 3, inflection type, inflection form, base form, reading, pronunciation.
The left context ID and right context ID are the internal IDs the word takes when it is viewed from the left and from the right, respectively. Leaving them empty is supposed to be fine because they are then assigned automatically, but I got an error (and garbled characters), so I assigned an appropriate value.
Give the word the same cost as existing words that appear with similar frequency. The lower the cost, the more readily the word is selected during analysis.
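As a rough sketch of the row layout described above, the csv can also be generated from Python, which makes it easier to keep the 13 columns in order and the file in utf-8. The helper names make_row and write_user_dic_csv are hypothetical (they are not part of MeCab), the cost 5543 and the category 作家 simply mirror the example rows, and the katakana readings are assumed values.

```python
import csv

def make_row(surface, reading, pronunciation, cost=5543, category="作家"):
    """Build one user-dictionary row (hypothetical helper, not a MeCab API)."""
    # Columns: surface, left ID, right ID, cost, then the 9 feature columns.
    # The context IDs are left empty here, as in the example rows.
    return [surface, "", "", str(cost), "名詞", "固有名詞", category,
            "*", "*", "*", surface, reading, pronunciation]

def write_user_dic_csv(path, rows):
    """Write the rows as utf-8 so the file matches the -f utf-8 option."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        # Plain "\n" line endings avoid a stray "\r" in the last column.
        csv.writer(f, lineterminator="\n").writerows(rows)

rows = [
    make_row("芥川龍之介", "アクタガワリュウノスケ", "アクタガワリュウノスケ"),
    make_row("太宰治", "ダザイオサム", "ダザイオサム"),
]
```

Calling write_user_dic_csv("test_dic.csv", rows) then produces a file in the same layout as the two rows shown earlier.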
Run MeCab\dic\ipadic\mecab-dict-index. Running it from a normal command prompt fails with "permission denied", so start a command prompt with administrator privileges using the following command.
powershell start-process cmd -verb runas
Create a new dic file from the prepared csv file with the following command.
mecab-dict-index -f utf-8 -t utf-8 -d "<MeCab dictionary directory path>" -u <Path of the new dic file to create> <Path of the prepared dictionary csv file>
A concrete example of the command above:
mecab-dict-index -f utf-8 -t utf-8 -d "C:\Program Files\MeCab\dic\ipadic" -u C:\Users\yuri.kinoshita\Desktop\MeCabUserDic\test.dic C:\Users\yuri.kinoshita\Desktop\MeCabUserDic\test_dic.csv
This is the execution result.
reading C:\Users\yuri.kinoshita\Desktop\MeCabUserDic\test_dic.csv ... 2
emitting double-array: 100% |###########################################|
done!
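If the dictionary has to be rebuilt often, the same command can be assembled from Python. This is only a sketch: it assumes mecab-dict-index is on the PATH and that the process has the required permissions. The function names build_index_command and build_user_dic are hypothetical; the flags are the ones used in the command above.

```python
import subprocess

def build_index_command(sys_dic_dir, user_dic_path, csv_path, charset="utf-8"):
    """Return the argument list for mecab-dict-index (hypothetical helper)."""
    return [
        "mecab-dict-index",
        "-f", charset,        # charset of the input csv
        "-t", charset,        # charset of the compiled dictionary
        "-d", sys_dic_dir,    # system dictionary directory (e.g. ipadic)
        "-u", user_dic_path,  # user dictionary file to create
        csv_path,             # csv file with the new entries
    ]

def build_user_dic(sys_dic_dir, user_dic_path, csv_path):
    """Run the command; raises CalledProcessError if the tool reports an error."""
    cmd = build_index_command(sys_dic_dir, user_dic_path, csv_path)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, such as C:\Program Files.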
HOW TO USE
import MeCab
# -u points MeCab at the user dictionary built in the previous step
mecab = MeCab.Tagger(r"-Ochasen -u C:\Users\yuri.kinoshita\Desktop\MeCabUserDic\test.dic")
text = "私の姉は芥川龍之介の本をよく読んでいる。"
node = mecab.parseToNode(text)
while True:
    node = node.next  # advance first, so the BOS node is never printed
    if not node:
        break
    print(node.surface, node.feature)
Execution example.
私 名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
姉 名詞,一般,*,*,*,*,姉,アネ,アネ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
芥川龍之介 名詞,固有名詞,作家,*,*,*,芥川龍之介,アクタガワリュウノスケ,アクタガワリュウノスケ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
本 名詞,一般,*,*,*,*,本,ホン,ホン
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
よく 副詞,一般,*,*,*,*,よく,ヨク,ヨク
読ん 動詞,自立,*,*,五段・マ行,連用タ接続,読む,ヨン,ヨン
で 助詞,接続助詞,*,*,*,*,で,デ,デ
いる 動詞,非自立,*,*,一段,基本形,いる,イル,イル
。 記号,句点,*,*,*,*,。,。,。
BOS/EOS,*,*,*,*,*,*,*,*
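The feature string printed for each node is a comma-separated list in the same column order as the dictionary csv (from part of speech onward), so it can be split into named fields. The following is a minimal sketch; the field names and the function parse_feature are my own labels, not MeCab identifiers, and the sample feature string is an assumed value for the added entry.

```python
FEATURE_NAMES = [
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "inflection_type", "inflection_form",
    "base_form", "reading", "pronunciation",
]

def parse_feature(feature):
    """Map a comma-separated feature string to named fields (hypothetical helper)."""
    values = feature.split(",")
    # Unknown words may come back with fewer than nine columns; pad with "*".
    values += ["*"] * (len(FEATURE_NAMES) - len(values))
    return dict(zip(FEATURE_NAMES, values))

info = parse_feature(
    "名詞,固有名詞,作家,*,*,*,芥川龍之介,アクタガワリュウノスケ,アクタガワリュウノスケ"
)
print(info["pos"], info["pos_sub2"], info["base_form"])
```

With this, checking whether a token came from the user dictionary is a matter of testing info["pos_sub2"] against the category given in the csv.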