MeCab's constrained analysis (also called partial analysis) is used when you already know some of a sentence's morpheme boundaries or morpheme information in advance. natto-py, a Python binding for MeCab, provides three ways to perform constrained parsing.
--partial / -p option
Specify the --partial or -p option when creating the MeCab instance. The input text passed to parse describes the constraints in the following format.
Sentence fragment: a plain fragment of the sentence. It is analyzed normally, as if there were no constraint, except that no morpheme is allowed to straddle a fragment boundary. Be sure to end the fragment with \n (a line feed).
Morpheme fragment: a fragment whose analysis is fixed, written in the format surface\tfeature pattern\n.
Finally, add a \n to the end of the input as a whole.
from natto import MeCab

text = """庭\tほげ
に
はにわ\tほげ
にわとり\tほげ
がいる。
"""

with MeCab("--partial") as nm:
    print(nm.parse(text))
庭	ほげ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
はにわ	ほげ
にわとり	ほげ
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
いる	動詞,自立,*,*,一段,基本形,いる,イル,イル
。	記号,句点,*,*,*,*,。,。,。
EOS
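If you build this kind of constrained input often, it can help to generate the string programmatically rather than hand-writing the tabs and newlines. The following is a minimal sketch; build_partial_input is a hypothetical helper written for this post, not part of natto-py, and it simply emits the format described above.

from natto import MeCab

def build_partial_input(fragments):
    # Each fragment is either a plain string (sentence fragment) or a
    # (surface, feature) tuple (morpheme fragment). A trailing newline
    # terminates the input, as the --partial format requires.
    lines = []
    for frag in fragments:
        if isinstance(frag, tuple):
            lines.append("{}\t{}".format(frag[0], frag[1]))
        else:
            lines.append(frag)
    return "\n".join(lines) + "\n"

text = build_partial_input(
    [("庭", "ほげ"), "に", ("はにわ", "ほげ"), ("にわとり", "ほげ"), "がいる。"]
)

with MeCab("--partial") as nm:
    print(nm.parse(text))  # same output as above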
The example above simply prints the analysis result to standard output. For finer-grained constraints, use the morpheme boundary constraint (boundary_constraints) or part-of-speech constraint (feature_constraints) features described next.
If you know the word boundaries in advance, you can pass them to the boundary_constraints keyword argument as a compiled regular expression or a pattern string. Each span of the input that matches the pattern is treated as a single morpheme during analysis.
text = "There is a chicken in the haniwa."
patt = "Chicken|Haniwa|garden"
with MeCab() as nm:
#Get information for each MeCabNode by specifying a morpheme boundary constraint
for n in nm.parse(text, boundary_constraints=patt, as_nodes=True):
if not (n.is_bos() or n.is_eos()):
print("{}:\t{}". format(n.surface, n.feature))
# BOS/EOS nodes are omitted
庭:	名詞,一般,*,*,*,*,*
に:	助詞,格助詞,一般,*,*,*,に,ニ,ニ
はにわ:	名詞,一般,*,*,*,*,はにわ,ハニワ,ハニワ
にわとり:	名詞,一般,*,*,*,*,にわとり,ニワトリ,ニワトリ
が:	助詞,格助詞,一般,*,*,*,が,ガ,ガ
いる:	動詞,自立,*,*,一段,基本形,いる,イル,イル
。:	記号,句点,*,*,*,*,。,。,。
For details, see the Python documentation for the re module (re — Regular expression operations), in particular re.finditer (https://docs.python.org/3/library/re.html#re.finditer).
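Since boundary_constraints also accepts a compiled regular expression, the example above can be written with re.compile as well; here is a minimal sketch of that variant:

import re

from natto import MeCab

text = "庭にはにわにわとりがいる。"
# Compile the boundary pattern once; each non-overlapping match is treated
# as a single morpheme, just as with the string form of the pattern.
patt = re.compile("にわとり|はにわ|庭")

with MeCab() as nm:
    for n in nm.parse(text, boundary_constraints=patt, as_nodes=True):
        if not (n.is_bos() or n.is_eos()):
            print("{}:\t{}".format(n.surface, n.feature))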
The feature_constraints keyword argument lets you specify the feature (for example, a part-of-speech classification) to assign to particular morphemes. Build a tuple of (morpheme, feature) pairs and pass it to the parse method as follows:
feat = (("Chicken","Hoge"), ("Haniwa","HogeHoge"), ("garden","更にHoge"))
with MeCab() as nm:
#Get information for each MeCabNode by specifying part-speech constraints for some morphemes
for n in nm.parse(text, feature_constraints=feat, as_nodes=True):
if not (n.is_bos() or n.is_eos()):
print("{}:\t{}". format(n.surface, n.feature))
# BOS/EOS nodes are omitted
庭:	更にほげ
に:	助詞,格助詞,一般,*,*,*,に,ニ,ニ
はにわ:	ほげほげ
にわとり:	ほげ
が:	助詞,格助詞,一般,*,*,*,が,ガ,ガ
いる:	動詞,自立,*,*,一段,基本形,いる,イル,イル
。:	記号,句点,*,*,*,*,。,。,。
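The node loop does not have to print, of course; the same parse call can feed any downstream processing. Below is a small sketch that collects (surface, feature) pairs into a list, reusing the same text and feat values as above:

from natto import MeCab

text = "庭にはにわにわとりがいる。"
feat = (("にわとり", "ほげ"), ("はにわ", "ほげほげ"), ("庭", "更にほげ"))

with MeCab() as nm:
    # Collect (surface, feature) pairs, skipping the BOS/EOS nodes.
    results = [
        (n.surface, n.feature)
        for n in nm.parse(text, feature_constraints=feat, as_nodes=True)
        if not (n.is_bos() or n.is_eos())
    ]

print(results[0])  # ('庭', '更にほげ'), matching the output shown above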
That's all.