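Many morphological analysis tools are available, and it pays to understand the characteristics of each before choosing one.
Here I compare three morphological analysis tools that can be used from Python. Each one is run on the same Japanese sentence, which means "Apples have proven to have a very positive effect on the human body".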
MeCab
- Parameters are estimated with CRFs (Conditional Random Fields)
- Both its accuracy and its execution speed are high, so MeCab is the obvious choice for ordinary use. The library is a little heavy, however.
In[1]: import MeCab
In[2]: mecab = MeCab.Tagger()
In[3]: %time print(mecab.parse("Apples have proven to have a very positive effect on the human body"))
ringo	Noun,General,*,*,*,*,ringo,ringo,ringo
wa	Particle,Binding particle,*,*,*,*,wa,ha,wa
ningen	Noun,General,*,*,*,*,ningen,ningen,ningen
no	Particle,Attributive,*,*,*,*,no,no,no
shintai	Noun,General,*,*,*,*,shintai,shintai,shintai
nitotte	Particle,Case particle,Collocation,*,*,*,nitotte,nitotte,nitotte
taihen	Noun,Adjectival noun stem,*,*,*,*,taihen,taihen,taihen
yoi	Adjective,Independent,*,*,Adjective-auo row,Base form,yoi,yoi,yoi
kouka	Noun,General,*,*,*,*,kouka,kouka,kooka
ga	Particle,Case particle,General,*,*,*,ga,ga,ga
aru	Verb,Independent,*,*,Godan-ra row,Base form,aru,aru,aru
koto	Noun,Non-independent,General,*,*,*,koto,koto,koto
ga	Particle,Case particle,General,*,*,*,ga,ga,ga
risshou	Noun,Sahen connection,*,*,*,*,risshou,risshou,risshoo
sa	Verb,Independent,*,*,Sahen-suru,Irrealis reru-connection,suru,sa,sa
re	Verb,Suffix,*,*,Ichidan,Continuative form,reru,re,re
te	Particle,Conjunctive particle,*,*,*,*,te,te,te
i	Verb,Non-independent,*,*,Ichidan,Continuative form,iru,i,i
masu	Auxiliary verb,*,*,*,Special-masu,Base form,masu,masu,masu
EOS
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 240 µs
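To work with this output in code, each line can be split into its surface form and its comma-separated features. The following is a minimal sketch that assumes the IPADIC layout (part of speech first, base form at index 6) and does not need MeCab installed. The sample text is hand-written, with the Japanese tags a real run prints (名詞 = noun, 助詞 = particle, 形容詞 = adjective), and the names `Morpheme` and `parse_mecab_output` are my own.

```python
# -*- coding: utf-8 -*-
from collections import namedtuple

Morpheme = namedtuple("Morpheme", ["surface", "pos", "base", "features"])

# Hand-written sample in the IPADIC layout, not captured from a real run.
SAMPLE_OUTPUT = (
    "りんご\t名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "良い\t形容詞,自立,*,*,形容詞・アウオ段,基本形,良い,ヨイ,ヨイ\n"
    "EOS\n"
)


def parse_mecab_output(text):
    """Split MeCab's default text output into Morpheme tuples."""
    morphemes = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        # The surface form and the feature string are separated by a tab.
        surface, _, feature_str = line.partition("\t")
        features = feature_str.split(",")
        # Assumption: index 0 is the part of speech and index 6 the base form.
        # Lines with fewer fields fall back to the surface form.
        base = features[6] if len(features) > 6 else surface
        morphemes.append(Morpheme(surface, features[0], base, features))
    return morphemes


morphemes = parse_mecab_output(SAMPLE_OUTPUT)
# Keep the base forms of nouns and adjectives only.
content_words = [m.base for m in morphemes if m.pos in ("名詞", "形容詞")]
print(content_words)
```

With a real tagger, the string returned by `mecab.parse(...)` would take the place of `SAMPLE_OUTPUT`. Other dictionaries may order the features differently, so the indices should be checked against the dictionary in use.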
Juman
- Morphemes are identified using heuristics
- Its accuracy is high and it outputs the __representative notation__ of each morpheme, which makes it well suited to text with a lot of orthographic variation, such as Twitter posts.
In[1]: import cJuman
In[2]: cJuman.init(['-B', '-e2'])
In[3]: %time print(cJuman.parse_opt(["Apples have proven to have a very positive effect on the human body"], cJuman.SKIP_NO_RESULT))
ringo ringo ringo Noun 6 Common noun 1 * 0 * 0 "Representative notation:apple/ringo Category:plant;artifact-food Domain:cooking/meals"
wa wa wa Particle 9 Adverbial particle 2 * 0 * 0 NIL
ningen ningen ningen Noun 6 Common noun 1 * 0 * 0 "Representative notation:human/ningen Category:person"
no no no Particle 9 Conjunctive particle 3 * 0 * 0 NIL
shintai shintai shintai Noun 6 Common noun 1 * 0 * 0 "Representative notation:body/shintai Category:animal"
ni ni ni Particle 9 Case particle 1 * 0 * 0 NIL
totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:take/toru Auxiliary verb candidate (basic)"
@ totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:take/toru Domain:politics Transitivity pair:intransitive:can be taken/toreru"
@ totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:take/toru Transitivity pair:intransitive:be caught/toreru"
@ totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:take/toru Transitivity pair:intransitive:can be harvested/toreru"
@ totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:take/toru Domain:cooking/meals Transitivity pair:intransitive:can be taken/toreru"
@ totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:take/toru Domain:culture/arts Transitivity pair:intransitive:can be taken/toreru"
@ totte totte toru Verb 2 * 0 Consonant verb ra row 10 Ta-series continuative te form 14 "Representative notation:steal/toru"
taihen taihen taihen Adverb 8 * 0 * 0 * 0 "Representative notation:very/taihen"
@ taihen taihen taihenda Adjective 3 * 0 Na-adjective 21 Stem 1 "Representative notation:hard/taihenda"
yoi yoi yoi Adjective 3 * 0 I-adjective auo row 18 Base form 2 "Representative notation:good/yoi Antonym:adjective:bad/warui"
kouka kouka kouka Noun 6 Common noun 1 * 0 * 0 "Representative notation:effect/kouka Category:abstract"
ga ga ga Particle 9 Case particle 1 * 0 * 0 NIL
aru aru aru Verb 2 * 0 Consonant verb ra row 10 Base form 2 "Representative notation:exist/aru Complement clause:to Antonym:adjective:absent/nai"
koto koto koto Noun 6 Formal noun 8 * 0 * 0 NIL
ga ga ga Particle 9 Case particle 1 * 0 * 0 NIL
risshou risshou risshou Noun 6 Sahen noun 2 * 0 * 0 "Representative notation:proof/risshou Category:abstract Domain:politics"
sa sa suru Verb 2 * 0 Sahen verb 16 Irrealis form 3 "Representative notation:do/suru Auxiliary verb candidate (basic) Transitivity pair:intransitive:become/naru"
rete rete reru Suffix 14 Verbal suffix 7 Vowel verb 1 Ta-series continuative te form 14 "Representative notation:reru/reru"
i i iru Suffix 14 Verbal suffix 7 Vowel verb 1 Basic continuative form 8 "Representative notation:iru/iru"
masu masu masu Suffix 14 Verbal suffix 7 Verbal suffix masu type 31 Base form 2 "Representative notation:masu/masu"
EOS
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 976 µs
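The representative notation is what makes Juman useful for normalizing spelling variants, so here is a rough sketch of pulling it out of the output. It assumes the usual Juman line layout (surface, reading, base form, part of speech and so on, with the semantic information as the twelfth field), that alternative analyses start with `@`, and that the label in real output is 代表表記. The sample lines are hand-written, and `parse_juman_output` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
import re

# Captures the "kanji/reading" pair that follows the 代表表記 label.
REP_NOTATION = re.compile(r'代表表記:([^\s"]+)')

# Hand-written sample, not captured from a real run.
SAMPLE_OUTPUT = (
    'りんご りんご りんご 名詞 6 普通名詞 1 * 0 * 0 "代表表記:林檎/りんご カテゴリ:植物"\n'
    'は は は 助詞 9 副助詞 2 * 0 * 0 NIL\n'
    'とって とって とる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:取る/とる"\n'
    '@ とって とって とる 動詞 2 * 0 子音動詞ラ行 10 タ系連用テ形 14 "代表表記:盗る/とる"\n'
    'EOS\n'
)


def parse_line(line):
    """Parse one analysis line into a dict (assumes a well-formed line)."""
    # Split at most 11 times so the semantic field keeps its inner spaces.
    fields = line.split(" ", 11)
    semantic = fields[11] if len(fields) > 11 else ""
    match = REP_NOTATION.search(semantic)
    return {
        "surface": fields[0],
        "base": fields[2],
        "pos": fields[3],
        "representative": match.group(1) if match else None,
        "alternatives": [],
    }


def parse_juman_output(text):
    """Group output lines into morphemes, attaching '@' alternatives."""
    morphemes = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        if line.startswith("@"):
            # An alternative analysis of the previous morpheme.
            entry = parse_line(line[1:].lstrip())
            if morphemes:
                morphemes[-1]["alternatives"].append(entry)
            continue
        morphemes.append(parse_line(line))
    return morphemes


morphemes = parse_juman_output(SAMPLE_OUTPUT)
# Use the representative notation when present, else the base form.
normalized = [m["representative"] or m["base"] for m in morphemes]
print(normalized)
```

Keying word counts on the normalized value rather than the surface form groups different spellings of the same word together, which is the benefit described above for noisy text.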
CaboCha
- Performs dependency parsing using SVMs (Support Vector Machines)
- If you feed its output as training data to a Markov-chain sentence generator, you can probably do more interesting things than with an ordinary morphological analysis tool such as MeCab (this is just a guess).
In[1]: import CaboCha
In[2]: cabocha = CaboCha.Parser()
In[3]: %time print(cabocha.parseToString("Apples have proven to have a very positive effect on the human body"))
ringo wa----------------------D
    ningen no-D               |
  shintai nitotte---------D   |
               taihen-D   |   |
                    yoi-D |   |
                 kouka ga-D   |
                        aru-D |
                      koto ga-D
            risshou sareteimasu
EOS
CPU times: user 882 µs, sys: 84 µs, total: 966 µs
Wall time: 917 µs
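The Markov-chain idea mentioned above can be sketched without CaboCha itself, by treating each chunk (the units shown in the tree) as one state instead of each morpheme. This is only an illustration under assumptions: the chunk lists below are hand-written and hypothetical, whereas in practice they would come from CaboCha's output, and `build_chain`, `generate`, and the `BEGIN`/`END` markers are names I made up.

```python
# -*- coding: utf-8 -*-
import random
from collections import defaultdict

BEGIN, END = "<s>", "</s>"

# Hypothetical training data: each sentence is already split into chunks.
CORPUS = [
    ["りんごは", "身体に", "良い"],
    ["みかんは", "身体に", "良い"],
    ["りんごは", "とても", "甘い"],
]


def build_chain(sentences):
    """Map each chunk to the list of chunks observed right after it."""
    chain = defaultdict(list)
    for chunks in sentences:
        states = [BEGIN] + list(chunks) + [END]
        for current, following in zip(states, states[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_chunks=30):
    """Walk the chain from BEGIN until END or until max_chunks is reached."""
    state, output = BEGIN, []
    while len(output) < max_chunks:
        # Duplicates in the list make frequent transitions more likely.
        state = rng.choice(chain[state])
        if state == END:
            break
        output.append(state)
    return output


chain = build_chain(CORPUS)
generated = generate(chain, random.Random(0))
print("".join(generated))
```

Because the states are whole chunks, a content word always stays attached to its particle, which is one reason chunk-level generation might read more naturally than morpheme-level generation.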
CaboCha can also emit the lattice format shown below, which makes the dependency results easy to process from Python code. It is slower, however.
In[4]: print(cabocha.parse("Apples have proven to have a very positive effect on the human body").toString(CaboCha.FORMAT_LATTICE))
* 0 8D 0/1 -2.111879
ringo	Noun,General,*,*,*,*,ringo,ringo,ringo
wa	Particle,Binding particle,*,*,*,*,wa,ha,wa
* 1 2D 0/1 1.635242
ningen	Noun,General,*,*,*,*,ningen,ningen,ningen
no	Particle,Attributive,*,*,*,*,no,no,no
* 2 6D 0/1 1.318492
shintai	Noun,General,*,*,*,*,shintai,shintai,shintai
nitotte	Particle,Case particle,Collocation,*,*,*,nitotte,nitotte,nitotte
* 3 4D 0/0 0.781377
taihen	Noun,Adjectival noun stem,*,*,*,*,taihen,taihen,taihen
* 4 5D 0/0 1.810798
yoi	Adjective,Independent,*,*,Adjective-auo row,Base form,yoi,yoi,yoi
* 5 6D 0/1 2.448702
kouka	Noun,General,*,*,*,*,kouka,kouka,kooka
ga	Particle,Case particle,General,*,*,*,ga,ga,ga
* 6 7D 0/0 2.151727
aru	Verb,Independent,*,*,Godan-ra row,Base form,aru,aru,aru
* 7 8D 0/1 -2.111879
koto	Noun,Non-independent,General,*,*,*,koto,koto,koto
ga	Particle,Case particle,General,*,*,*,ga,ga,ga
* 8 -1D 1/5 0.000000
risshou	Noun,Sahen connection,*,*,*,*,risshou,risshou,risshoo
sa	Verb,Independent,*,*,Sahen-suru,Irrealis reru-connection,suru,sa,sa
re	Verb,Suffix,*,*,Ichidan,Continuative form,reru,re,re
te	Particle,Conjunctive particle,*,*,*,*,te,te,te
i	Verb,Non-independent,*,*,Ichidan,Continuative form,iru,i,i
masu	Auxiliary verb,*,*,*,Special-masu,Base form,masu,masu,masu
EOS
CPU times: user 1.29 ms, sys: 101 µs, total: 1.39 ms
Wall time: 1.91 ms
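As a rough illustration of processing the lattice format, the sketch below reads the chunk header lines (`* id headD head/func score`) and returns which chunk depends on which. It assumes the layout shown in the transcript above, with a head of -1 marking the root. The sample lattice, including its scores, is hand-written rather than real output, and `parse_lattice` and `dependency_pairs` are hypothetical names.

```python
# -*- coding: utf-8 -*-
import re

# Chunk header: "* <chunk id> <head id>D <head pos>/<func pos> <score>"
CHUNK_HEADER = re.compile(r"^\* (\d+) (-?\d+)D (\d+)/(\d+) (\S+)$")

# Hand-written sample; the scores are made up.
SAMPLE_LATTICE = (
    "* 0 2D 0/1 -1.250000\n"
    "りんご\t名詞,一般,*,*,*,*,りんご,リンゴ,リンゴ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "* 1 2D 0/1 -1.250000\n"
    "身体\t名詞,一般,*,*,*,*,身体,シンタイ,シンタイ\n"
    "に\t助詞,格助詞,一般,*,*,*,に,ニ,ニ\n"
    "* 2 -1D 0/0 0.000000\n"
    "良い\t形容詞,自立,*,*,形容詞・アウオ段,基本形,良い,ヨイ,ヨイ\n"
    "EOS\n"
)


def parse_lattice(text):
    """Return one dict per chunk with its id, head id, score and tokens."""
    chunks = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        header = CHUNK_HEADER.match(line)
        if header:
            chunks.append({
                "id": int(header.group(1)),
                "head": int(header.group(2)),
                "score": float(header.group(5)),
                "tokens": [],
            })
        elif chunks:
            # Token line: keep only the surface form before the tab.
            chunks[-1]["tokens"].append(line.split("\t", 1)[0])
    return chunks


def dependency_pairs(chunks):
    """List (dependent text, head text) pairs, skipping the root chunk."""
    texts = {c["id"]: "".join(c["tokens"]) for c in chunks}
    return [(texts[c["id"]], texts[c["head"]])
            for c in chunks if c["head"] in texts]


chunks = parse_lattice(SAMPLE_LATTICE)
pairs = dependency_pairs(chunks)
for dependent, head in pairs:
    print(dependent + " -> " + head)
```

In a real session the string produced by `toString(CaboCha.FORMAT_LATTICE)` would replace `SAMPLE_LATTICE`.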
Many other morphological analysis tools can also be used from Python, including KyTea, igo-python, ChaSen, and KAKASI. I hope you will get to know the characteristics of each and pick the right one for each case.