MeCab is an open-source morphological analysis engine. It can be used to segment Japanese sentences into words as a preprocessing step for machine learning. The goal of this article is to install MeCab and make it usable from Python.
I referred to this article.
$ sudo apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8
(I'm not sure if I need both mecab-ipadic and mecab-ipadic-utf8, but it seems to work for now)
You can see the result of morphological analysis by running the mecab command and typing in a Japanese sentence. For example, entering 「安倍晋三首相」 (Prime Minister Shinzo Abe) gives the following.
$ mecab
安倍晋三首相
安倍	名詞,固有名詞,人名,姓,*,*,安倍,アベ,アベ
晋	名詞,固有名詞,人名,名,*,*,晋,ススム,ススム
三	名詞,数,*,*,*,*,三,サン,サン
首相	名詞,一般,*,*,*,*,首相,シュショウ,シュショウ
EOS
The given name "晋三" (Shinzo) has been split into "晋" and "三", so it has not been analyzed correctly.
The default IPA dictionary seems to be weak at parsing proper nouns, so install mecab-ipadic-NEologd, a dictionary that greatly expands coverage of proper nouns and new words.
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ ./bin/install-mecab-ipadic-neologd -n -a
To make this the default dictionary, edit /etc/mecabrc and set
dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd
See the official documentation (https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md) for more information.
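To double-check the dicdir setting after editing, the file can be read back with a few lines of Python. This is only a minimal sketch: read_mecabrc is a hypothetical helper, not part of MeCab, and it assumes the simple "key = value" layout with ";" comment lines that the stock mecabrc uses. The sample text stands in for the real file so that the example runs anywhere.

```python
def read_mecabrc(text):
    """Collect 'key = value' pairs from mecabrc-style text (simplified reader)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines, which start with ';'
        if not line or line.startswith(";"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Stand-in for the contents of /etc/mecabrc (assumed layout)
sample_rc = """\
; Configuration file of MeCab
dicdir = /usr/lib/mecab/dic/mecab-ipadic-neologd
; userdic = /home/foo/bar/user.dic
"""

settings = read_mecabrc(sample_rc)
print(settings["dicdir"])
```

To check the real file, pass the result of open("/etc/mecabrc").read() instead of sample_rc.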
Let's analyze 「安倍晋三首相」 (Prime Minister Shinzo Abe) again in the same way.
$ mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd
安倍晋三首相
安倍晋三首相	名詞,固有名詞,一般,*,*,*,安倍晋三,アベシンゾウシュショウ,アベシンゾーシュショー
EOS
This time it is correctly recognized as a proper noun.
mecab-python3
Install the MeCab bindings for Python 3.
$ pip install mecab-python3
That is all that is needed.
mecab-test.py
import MeCab
# -Ochasen selects the ChaSen-compatible output format
m = MeCab.Tagger("-Ochasen")
print(m.parse("安倍晋三首相は、国会で施政方針演説を行った。"))
Running it gives the following.
$ python mecab-test.py
安倍晋三首相	アベシンゾウシュショウ	安倍晋三首相	名詞-固有名詞-一般
は	ハ	は	助詞-係助詞
、	、	、	記号-読点
国会	コッカイ	国会	名詞-一般
で	デ	で	助詞-格助詞-一般
施政方針演説	シセイホウシンエンゼツ	施政方針演説	名詞-固有名詞-一般
を	ヲ	を	助詞-格助詞-一般
行っ	オコナッ	行う	動詞-自立	五段・ワ行促音便	連用タ接続
た	タ	た	助動詞	特殊・タ	基本形
。	。	。	記号-句点
EOS
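If you want to work with this output in code rather than just print it, each line can be split into its columns. The following is a rough sketch, assuming the -Ochasen columns are tab-separated and appear in the order surface form, reading, base form, part of speech, conjugation type, conjugation form. ChasenToken and parse_chasen are names made up for this example, and the sample string is hand-written so that the code runs without MeCab installed.

```python
from typing import List, NamedTuple


class ChasenToken(NamedTuple):
    surface: str
    reading: str
    base_form: str
    pos: str
    conj_type: str
    conj_form: str


def parse_chasen(output: str) -> List[ChasenToken]:
    """Turn -Ochasen style text into a list of tokens (assumed column order)."""
    tokens = []
    for line in output.splitlines():
        # Skip blank lines and the end-of-sentence marker
        if not line or line == "EOS":
            continue
        fields = line.split("\t")
        # Pad missing trailing columns with empty strings, drop any extras
        fields = (fields + [""] * 6)[:6]
        tokens.append(ChasenToken(*fields))
    return tokens


# Hand-written sample in the assumed format (not produced by running MeCab here)
sample = (
    "行っ\tオコナッ\t行う\t動詞-自立\t五段・ワ行促音便\t連用タ接続\n"
    "た\tタ\tた\t助動詞\t特殊・タ\t基本形\n"
    "EOS\n"
)

tokens = parse_chasen(sample)
for token in tokens:
    print(token.surface, token.base_form, token.pos)
```

With MeCab installed, the string returned by m.parse(...) can be passed to parse_chasen in place of sample.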
If you only want the sentence split into words, create the tagger with -Owakati:
m = MeCab.Tagger("-Owakati")
mecab-wakati-test.py
import MeCab
m = MeCab.Tagger("-Owakati")
# parse() returns the words as one string, separated by spaces
items = m.parse("安倍晋三首相は、国会で施政方針演説を行った。")
print(items)
print(type(items))
Running it gives the following.
$ python mecab-wakati-test.py
安倍晋三首相 は 、 国会 で 施政方針演説 を 行っ た 。
<class 'str'>
The result is returned as a string, so call split() on it if you want a list.
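As a small illustration of that last step, the snippet below splits a string shaped like the -Owakati result. The sample string is written by hand here (an assumption about what parse() returns, including the trailing space and newline) so that the example runs without MeCab.

```python
# Hand-written stand-in for the string returned by m.parse() with -Owakati
wakati_output = "安倍晋三首相 は 、 国会 で 施政方針演説 を 行っ た 。 \n"

# split() with no arguments splits on any whitespace and ignores
# the trailing space and newline
tokens = wakati_output.split()
print(tokens)
print(len(tokens))
```

Calling split() without arguments is preferable to split(" ") here, because the latter would leave the trailing newline in the list as a separate element.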