--Use MeCab for morphological analysis - http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
--Use a new word dictionary - https://github.com/neologd/mecab-ipadic-neologd/
--Use in combination with other modules in Python scripts
Python 2.7
Use Conda.
$ conda create -n py27con python=2.7 anaconda
$ conda info -e
$ source ~/.pyenv/versions/miniconda3-3.16.0/envs/py27con/bin/activate py27con
mecab-ipadic
I will use mecab-ipadic-neologd later, so I build this dictionary as UTF-8.
$ cd ~/path/to/mecab-ipadic-2.7.0-20070801/
$ make clean
$ ./configure --with-charset=utf8
$ make
$ make install
mecab-ipadic-neologd
$ cd ~/path/to/mecab-ipadic-neologd/
$ bin/install-mecab-ipadic-neologd
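To double-check where the dictionary ended up, something like the following can help. It is only a sketch: it assumes `mecab-config` is on the PATH and that the installer used its default location, a `mecab-ipadic-neologd` directory under the path printed by `mecab-config --dicdir`. The function name `find_neologd_dir` is made up for this example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import subprocess


def find_neologd_dir(name="mecab-ipadic-neologd"):
    """Return the neologd dictionary directory, or None if it cannot be found."""
    try:
        # `mecab-config --dicdir` prints MeCab's base dictionary directory.
        out = subprocess.check_output(["mecab-config", "--dicdir"])
    except (OSError, subprocess.CalledProcessError):
        return None  # mecab-config is missing or failed
    # Assumption: the dictionary sits directly under that directory.
    path = os.path.join(out.decode("utf-8").strip(), name)
    return path if os.path.isdir(path) else None


print(find_neologd_dir() or "mecab-ipadic-neologd not found")
```

If this prints a path, that path is what goes after `-d` when creating the tagger.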
mecab-python
Python bindings for MeCab
$ pip install https://mecab.googlecode.com/files/mecab-python-0.996.tar.gz
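Because Conda Python and the system Python can get mixed up, it may be worth confirming that the binding landed in the interpreter you are actually running. A minimal sketch (the helper name `mecab_binding_available` is hypothetical):

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import importlib
import sys


def mecab_binding_available():
    """Return True if `import MeCab` succeeds in the current interpreter."""
    try:
        importlib.import_module("MeCab")
    except ImportError:
        return False
    return True


# sys.executable shows which Python is running (Conda env or system).
print("interpreter:", sys.executable)
print("MeCab binding importable:", mecab_binding_available())
```

If the interpreter path is not inside the `py27con` environment, the environment is probably not activated.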
test.py
# -*- coding: utf-8 -*-
import MeCab

# -d selects the dictionary directory; here, the neologd dictionary installed above.
m = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')

# Sample sentence. Gloss: "The Idolmaster Cinderella Girls" is a social game for mobile
# devices, developed and operated by BANDAI NAMCO Entertainment and Cygames.
text = '''
『アイドルマスター シンデレラガールズ』（THE IDOLM@STER CINDERELLA GIRLS）は、バンダイナムコエンターテインメント（旧バンダイナムコゲームス）とCygamesが開発・運営する『THE IDOLM@STER』の世界観をモチーフとする携帯端末専用のソーシャルゲーム。
'''
# parse() returns one token per line, terminated by EOS.
print(m.parse(text))
The text is from [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E3%83%9E%E3%82%B9%E3%82%BF%E3%83%BC_%E3%82%B7%E3%83%B3%E3%83%87%E3%83%AC%E3%83%A9%E3%82%AC%E3%83%BC%E3%83%AB%E3%82%BA).
$ python test.py
『	記号,括弧開,*,*,*,*,『,『,『
アイドルマスター シンデレラガールズ	名詞,固有名詞,一般,*,*,*,アイドルマスター シンデレラガールズ,アイドルマスターシンデレラガールズ,アイドルマスターシンデレラガールズ
』	記号,括弧閉,*,*,*,*,』,』,』
（	記号,括弧開,*,*,*,*,（,（,（
THE IDOLM@STER CINDERELLA GIRLS	名詞,固有名詞,一般,*,*,*,THE IDOLM@STER CINDERELLA GIRLS,アイドルマスターシンデレラガールズ,アイドルマスターシンデレラガールズ
）	記号,括弧閉,*,*,*,*,）,）,）
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
、	記号,読点,*,*,*,*,、,、,、
バンダイナムコエンターテインメント	名詞,固有名詞,一般,*,*,*,バンダイナムコエンターテインメント,バンダイナムコエンターテインメント,バンダイナムコエンターテインメント
（	記号,括弧開,*,*,*,*,（,（,（
旧	接頭詞,名詞接続,*,*,*,*,旧,キュウ,キュー
バンダイナムコゲームス	名詞,固有名詞,一般,*,*,*,バンダイナムコゲームス,バンダイナムコゲームス,バンダイナムコゲームス
）	記号,括弧閉,*,*,*,*,）,）,）
と	助詞,並立助詞,*,*,*,*,と,ト,ト
Cygames	名詞,固有名詞,一般,*,*,*,Cygames,サイゲームス,サイゲームス
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
開発	名詞,サ変接続,*,*,*,*,開発,カイハツ,カイハツ
・	記号,一般,*,*,*,*,・,・,・
運営	名詞,サ変接続,*,*,*,*,運営,ウンエイ,ウンエイ
する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
『	記号,括弧開,*,*,*,*,『,『,『
THE IDOLM@STER	名詞,固有名詞,一般,*,*,*,THE IDOLM@STER,アイドルマスター,アイドルマスター
』	記号,括弧閉,*,*,*,*,』,』,』
の	助詞,連体化,*,*,*,*,の,ノ,ノ
世界観	名詞,固有名詞,一般,*,*,*,世界観,セカイカン,セカイカン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
モチーフ	名詞,一般,*,*,*,*,モチーフ,モチーフ,モチーフ
と	助詞,格助詞,一般,*,*,*,と,ト,ト
する	動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
携帯端末	名詞,固有名詞,一般,*,*,*,携帯端末,ケイタイタンマツ,ケイタイタンマツ
専用	名詞,サ変接続,*,*,*,*,専用,センヨウ,センヨー
の	助詞,連体化,*,*,*,*,の,ノ,ノ
ソーシャルゲーム	名詞,固有名詞,一般,*,*,*,ソーシャルゲーム,ソーシャルゲーム,ソーシャルゲーム
。	記号,句点,*,*,*,*,。,。,。
EOS
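Each line of MeCab's default output is the surface form, a tab, and then comma-separated features (for the IPA dictionary: part of speech and its subcategories, conjugation type and form, base form, reading, pronunciation), with `EOS` at the end. To use the result from other Python modules, it can be split into pairs roughly like this. This is a sketch, not part of the binding: `parse_mecab_output` is a name I made up, the sample string is a hand-written excerpt, and the naive `split(",")` would break on features that themselves contain commas.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def parse_mecab_output(output):
    """Split MeCab's default output into (surface, feature_list) pairs."""
    tokens = []
    for line in output.splitlines():
        # Skip the EOS marker and anything without a tab separator.
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        tokens.append((surface, features.split(",")))
    return tokens


# Hand-written excerpt in the default "surface<TAB>features" format.
sample = (
    "開発\t名詞,サ変接続,*,*,*,*,開発,カイハツ,カイハツ\n"
    "・\t記号,一般,*,*,*,*,・,・,・\n"
    "運営\t名詞,サ変接続,*,*,*,*,運営,ウンエイ,ウンエイ\n"
    "EOS\n"
)
for surface, features in parse_mecab_output(sample):
    # features[0] is the part of speech, features[1] its first subcategory.
    print(surface, features[0], features[1])
```

With the real binding, the string returned by `m.parse(text)` would be passed in instead of `sample` (decoded from UTF-8 first on Python 2.7).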
By the way, if you omit `-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd` and compare the two outputs, you can see that the new word dictionary is doing its job nicely (mainly for proper nouns).
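One way to make that comparison less eyeball-dependent is to diff the token lists. The sketch below uses the standard library's `difflib`; the function names and the two sample outputs are hypothetical and hand-written (features shortened), so it only illustrates the idea and does not show real dictionary output.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import difflib


def surfaces(output):
    """Return the surface forms (text before the tab) from MeCab output."""
    return [line.split("\t", 1)[0]
            for line in output.splitlines()
            if "\t" in line]


def show_token_diff(default_out, neologd_out):
    """Return 'old -> new' strings for spans where the tokenizations differ."""
    a, b = surfaces(default_out), surfaces(neologd_out)
    matcher = difflib.SequenceMatcher(None, a, b)
    diff = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diff.append("{0} -> {1}".format(" / ".join(a[i1:i2]),
                                            " / ".join(b[j1:j2])))
    return diff


# Hypothetical outputs for illustration only, not captured from MeCab.
default_out = "携帯\t名詞\n端末\t名詞\n専用\t名詞\nEOS\n"
neologd_out = "携帯端末\t名詞\n専用\t名詞\nEOS\n"
print("\n".join(show_token_diff(default_out, neologd_out)))
```

In practice the two strings would come from `MeCab.Tagger()` and `MeCab.Tagger('-d ...')` parsing the same text.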
List of frequent problems:
--The output is garbled
  --You probably just need to build and use the UTF-8 dictionary properly
--Various differences and conflicts between Conda Python and the system Python
  --Example: the shell crashes when running `source activate`
python
    --This can be avoided by specifying the full path to `activate`
--Making the setup script and sample scripts of the downloaded Python binding compatible with Python 3.5
--Making the binding itself (generated with SWIG) compatible with Python 3.5
--Even after that, I still get Unicode-related errors
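The garbled-output item above usually comes down to a charset mismatch: as far as I know, mecab-ipadic is built as EUC-JP unless `--with-charset=utf8` is given, while the terminal and the script expect UTF-8. The following stand-alone sketch (no MeCab required, the sample word is arbitrary) shows what that mismatch does to a string:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

text = "形態素解析"

# The same text as a EUC-JP dictionary and a UTF-8 dictionary would emit it.
euc_bytes = text.encode("euc_jp")
utf8_bytes = text.encode("utf-8")

# Decoding with the matching codec round-trips cleanly.
roundtrip = utf8_bytes.decode("utf-8")

# Reading EUC-JP bytes as UTF-8 is effectively what a garbled terminal shows;
# "replace" substitutes U+FFFD for every byte sequence that is not valid UTF-8.
garbled = euc_bytes.decode("utf-8", "replace")

print(roundtrip == text)
print(garbled == text)
print(repr(garbled))
```

If the output of `test.py` looks like the `garbled` value here, rebuilding the dictionary with the UTF-8 option is the first thing I would try.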
I wanted to do this with Python 3.5 if possible, but I got stuck and could not find a way out, so I went with 2.7 for the time being.