I needed to build a Japanese corpus, so here are my notes on using MeCab.
MeCab is an open source morphological analysis engine. Roughly speaking, morphological analysis means "breaking text down into its smallest word units". In English, the words in a sentence like "This is a pen." are separated by spaces, but in Japanese they are written without any separators, so you have to analyze the text to split it apart. Without that step you cannot process the text word by word.
Official URL: http://taku910.github.io/mecab/
License:
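To make the difference concrete, here is a small sketch that uses only Python's built-in `str.split()`. The two sample sentences are my own illustration, not taken from the corpus.

```python
# English words are delimited by spaces, so a plain split is already a rough tokenizer.
english = "This is a pen."
# The Japanese sentence has no spaces between words.
japanese = "これはペンです。"

english_words = english.split()
japanese_words = japanese.split()

print(english_words)   # four space-separated pieces
print(japanese_words)  # the whole sentence comes back as a single piece
```

Splitting on whitespace leaves the Japanese sentence in one piece, which is why a morphological analyzer such as MeCab is needed.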
I followed this site for the whole installation: https://gist.github.com/YoshihitoAso/9048005 Thank you very much. m(_ _)m
The procedure is:
$ sudo apt-get install mecab libmecab-dev mecab-ipadic
$ sudo aptitude install mecab-ipadic-utf8
$ sudo apt-get install python-mecab
The first command installs the MeCab core, the second installs the UTF-8 version of the IPA dictionary, and the last installs the library that is called from Python.
This time I wanted word segmentation (wakati-gaki), so I created the following sample source. The result of running it is like this.
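A minimal sketch of word segmentation through the Python binding might look like the following. It assumes the binding is importable as `MeCab` and that a dictionary is installed. The helper names `make_tagger` and `wakati` and the sample sentence are hypothetical, chosen only for illustration.

```python
def make_tagger():
    # Imported lazily so that wakati() can be tried out even where MeCab is missing.
    import MeCab
    # "-Owakati" selects the output format that separates words with spaces.
    return MeCab.Tagger("-Owakati")


def wakati(text, tagger):
    """Return the words of `text` as a list, using a tagger in -Owakati mode."""
    # parse() returns a single string: words joined by spaces, ending in a newline.
    return tagger.parse(text).split()


if __name__ == "__main__":
    try:
        tagger = make_tagger()
    except (ImportError, RuntimeError) as exc:
        # ImportError: binding not installed. RuntimeError: e.g. no dictionary found.
        print("MeCab is not available:", exc)
    else:
        # Hypothetical sample sentence.
        print(wakati("すもももももももものうち", tagger))
```

Keeping the tagger as a parameter means one `Tagger` instance can be reused for every line of a corpus instead of being rebuilt each time.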
The following site has a clear description of the MeCab options, thanks. In my case I only wanted word segmentation, so "-Owakati" was all I needed, but I may use the other options later. http://www.mwsoft.jp/programming/munou/mecab_command.html
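If the part-of-speech information is needed later, MeCab's default output (no "-O" option) can be parsed instead. With the IPA dictionary, each line is the surface form, a tab, and a comma-separated feature list, and the sentence ends with a line reading EOS. The sketch below assumes that format. `parse_mecab_output` is a hypothetical helper, and `SAMPLE` is a hand-written string in that format, not captured output.

```python
# Hand-written sample in the default IPA-dictionary format (not captured output).
SAMPLE = (
    "これ\t名詞,代名詞,一般,*,*,*,これ,コレ,コレ\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "EOS\n"
)


def parse_mecab_output(raw):
    """Turn default-format MeCab output into (surface, features) pairs."""
    tokens = []
    for line in raw.splitlines():
        # Skip the end-of-sentence marker and any blank lines.
        if not line or line == "EOS":
            continue
        # The surface form and the feature list are separated by one tab.
        surface, _, feature = line.partition("\t")
        tokens.append((surface, feature.split(",")))
    return tokens


for surface, features in parse_mecab_output(SAMPLE):
    # features[0] is the top-level part of speech in the IPA dictionary layout.
    print(surface, features[0])
```

With a real tagger, the string returned by `parse()` on a `MeCab.Tagger()` created without options could be passed to the same function.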