Let's try text mining with Python (Python 3 series). We will proceed in the following steps.
① Morphological analysis (this article)
② Visualization with Word Cloud (next time)
Morphological analysis is needed to split a Japanese sentence into words. A well-known and easy-to-understand example is splitting "すもももももももものうち" (sumomo mo momo mo momo no uchi, roughly "plums and peaches are both kinds of peach") into "すもも / も / もも / も / もも / の / うち".
Unlike English, Japanese has no clear word boundaries, so splitting sentences into words is very difficult, and it is not realistic to do it with your own code.
Therefore, we use an open-source library called "MeCab". (It is probably the most widely used tool for Japanese morphological analysis. It seems to be read "mekabu".)
To use MeCab from Python, the following are required:
・Installation of MeCab itself
・Installation of a dictionary
・Installation of the Python bindings
However, the binary package for Windows includes a dictionary, so you do not need to install one separately. This article describes the procedure assuming installation on Windows.
First, download the following from the download site listed on the official site:
・mecab-0.996.exe
・mecab-python-0.996.tar.gz
Next, run mecab-0.996.exe to install MeCab itself. You will be asked to select the dictionary's character encoding along the way; I chose the default, Shift-JIS. (I'm a little worried about whether it is OK not to use UTF-8 ...)
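The encoding choice matters because, as far as I understand, MeCab expects its input in the same character encoding as the dictionary. The sketch below is plain Python with no MeCab involved; it only illustrates that the same sentence becomes different byte sequences in Shift-JIS and UTF-8, so text in one encoding cannot simply be read as the other.

```python
# The example sentence, written as its seven morphemes for readability.
text = "すもも" + "も" + "もも" + "も" + "もも" + "の" + "うち"

sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

print(len(text), "characters")
print(len(sjis_bytes), "bytes in Shift-JIS")  # 2 bytes per hiragana
print(len(utf8_bytes), "bytes in UTF-8")      # 3 bytes per hiragana

# Decoding with the wrong codec fails (or, in other cases, yields mojibake).
try:
    print(sjis_bytes.decode("utf-8"))
except UnicodeDecodeError as exc:
    print("Shift-JIS bytes are not valid UTF-8:", exc.reason)
```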
You should be able to use the mecab command at this point, but the installer does not seem to add it to PATH. Manually add the bin folder of the installation directory to PATH.
Try running mecab on the command line, with the usual "すもももももももものうち" as input.
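One way to confirm that the PATH change took effect is to ask Python where it finds the command. This is a minimal sketch using only the standard library; the helper name find_command is made up for illustration.

```python
import shutil


def find_command(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_command("mecab")
    if path:
        print("mecab found at", path)
    else:
        print("mecab is not on PATH; add the bin folder of the install directory")
```

If this reports that mecab is not on PATH right after editing the environment variable, opening a new command prompt is usually needed for the change to be picked up.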
>mecab↓
すもももももももものうち↓
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
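Each line of the output above is one morpheme: the surface form, a tab, then comma-separated features (with the bundled dictionary: part of speech, sub-categories, conjugation information, base form, reading, pronunciation), and EOS marks the end of the sentence. As a rough sketch of how this text could be handled later, the function below splits it into (surface, features) pairs. The name parse_mecab_output is hypothetical, and the plain split on commas is a simplification that would break on features that themselves contain commas.

```python
def parse_mecab_output(output):
    """Parse default-format MeCab output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        line = line.rstrip("\r")
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, feature_text = line.partition("\t")
        tokens.append((surface, feature_text.split(",")))
    return tokens


# Sample text in the same format as the command-line output.
sample = "\n".join([
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ",
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ",
    "EOS",
])

for surface, features in parse_mecab_output(sample):
    # features[0] is the part of speech, features[6] the base form
    print(surface, features[0], features[6])
```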
Next, extract mecab-python-0.996.tar.gz into a suitable directory. Move into the extracted directory and run build and install as described in the README. The result is shown below.
>python setup.py build
'mecab-config' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "setup.py", line 13, in <module>
    version = cmd1("mecab-config --version"),
  File "setup.py", line 7, in cmd1
    return os.popen(str).readlines()[0][:-1]
IndexError: list index out of range
The build fails right away. It seems that the mecab-config command called from setup.py does not exist. PATH is set, but I can't find any executable with that name under bin.
From some googling, it seems that getting the Python bindings to work on Windows is quite a hassle. It could probably be done with enough effort, but I stopped here because the goal is text mining, not running MeCab on Windows. I decided to install it in a separate Linux environment instead.
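The IndexError itself follows from the traceback: when the command cannot be found, os.popen(...).readlines() returns an empty list, so indexing it with [0] fails. A guarded variant might look like the sketch below. This is not the actual setup.py code; the function name first_line_of and the use of subprocess are illustrative choices.

```python
import subprocess


def first_line_of(command):
    """Run a shell command and return the first line of its stdout, or None."""
    result = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    # A missing command produces no stdout, which is the case that
    # made readlines()[0] raise IndexError in the traceback.
    return lines[0] if lines else None


if __name__ == "__main__":
    version = first_line_of("mecab-config --version")
    if version is None:
        print("mecab-config produced no output; is it installed and on PATH?")
    else:
        print("mecab-config reports version", version)
```

This only makes the failure easier to read; it does not provide the missing mecab-config command.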
Reference site
Building an environment using MeCab with R and Python (Windows, Mac)