When I tried to use MeCab with Python on 64-bit Windows 10, I stumbled mainly on the following five points and got frustrated enough to want to smash my screen, so I have summarized the whole process systematically.
Problem 1: MeCab cannot be installed with pip install alone
Problem 2: It installs, but morphological analysis does not work
Problem 3: The NEologd dictionary seems to make named-entity extraction work well, but it is difficult to install on Windows
Problem 4: The installation instructions say to "add it to PATH", but I don't understand the concept of PATH
Problem 5: The DOS commands do not work
① Install MeCab from the unofficial 64-bit .exe
② Install the library for handling MeCab from Python
③ For more precise morphological analysis, clone NEologd with git and compile it from the command prompt
Reference: https://qiita.com/wanko5296/items/eeb7865ee71a7b9f1a3a
Officially, only a 32-bit version is provided, so it is better to install the 64-bit version built by volunteers.
The installer is published in the following GitHub repository. https://github.com/ikegami-yukino/mecab/releases/tag/v0.996
During installation you are asked to choose a character encoding. Choose the one that matches the text files you want to analyze. If you have no particular preference, choose UTF-8. (* The default is SHIFT-JIS.)
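If you are unsure which encoding your target text files use, a quick check like the following may help. It is only a sketch: the function name `guess_encoding` is made up for illustration, and it merely distinguishes UTF-8 from Shift-JIS (Python's `cp932` codec) by trying to decode the bytes.

```python
def guess_encoding(data: bytes) -> str:
    """Return 'utf-8' or 'cp932' depending on which codec decodes the bytes.

    Hypothetical helper. UTF-8 is tried first because Shift-JIS bytes for
    Japanese text rarely happen to form valid UTF-8.
    """
    for codec in ("utf-8", "cp932"):
        try:
            data.decode(codec)
            return codec
        except UnicodeDecodeError:
            continue
    raise ValueError("neither UTF-8 nor Shift-JIS (cp932)")


if __name__ == "__main__":
    sample = "形態素解析".encode("cp932")  # stand-in for bytes read from a file
    print(guess_encoding(sample))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function. Pure ASCII text decodes under both codecs, so it is reported as UTF-8.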
Reference: https://qiita.com/yukinoi/items/990b6933d9f21ba0fb43
In cmd or the Anaconda Prompt, run
pip install MeCab
If you installed the 64-bit version of MeCab above, this pip command works as is.
Then, in Jupyter Notebook or similar, run
import MeCab
to verify that the library was installed.
If no error occurs, morphological analysis is already possible at this stage. To give it a try, run:
import MeCab
# -Ochasen selects the ChaSen-style output format
m = MeCab.Tagger("-Ochasen")
# Classic test sentence: "plums and peaches are both kinds of peach"
print(m.parse("すもももももももものうち"))
You can see that the morphological analysis is done.
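If you want to work with the result in code rather than just print it, the output can be split into tokens. The following is a minimal sketch, not part of the MeCab library: `parse_chasen` is a hypothetical helper, and it assumes the `-Ochasen` output is tab-separated with the surface form in the first column, the part of speech in the fourth, and a final `EOS` line.

```python
from typing import List, Tuple


def parse_chasen(output: str) -> List[Tuple[str, str]]:
    """Turn -Ochasen output into (surface, part_of_speech) pairs.

    Assumes tab-separated columns with the surface form first and the
    part of speech fourth, terminated by an "EOS" line.
    """
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        cols = line.split("\t")
        if len(cols) >= 4:
            tokens.append((cols[0], cols[3]))
    return tokens


if __name__ == "__main__":
    try:
        import MeCab  # only importable if the installation succeeded
    except ImportError:
        print("MeCab is not installed; skipping the live demo")
    else:
        tagger = MeCab.Tagger("-Ochasen")
        for surface, pos in parse_chasen(tagger.parse("すもももももももものうち")):
            print(surface, pos)
```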
However, recent words (e.g. My Number, Keyakizaka46) are split into pieces such as My / Number and Keyaki / Saka / 46.
To prevent this, install the NEologd dictionary, which contains a list of recent keywords.
Reference: https://qiita.com/zincjp/items/c61c441426b9482b5a48 (Basically, this section rewrites the above article for those who do not understand PATH and DOS commands.)
Install 64-bit git and 7-Zip if you do not already have them. The installation steps are omitted here.
**・git** Reference: https://eng-entrance.com/git-install
**・7-zip** Official site: https://sevenzip.osdn.jp/
You need to add the 7-Zip installation folder to the PATH environment variable.
C:\Program Files\7-Zip
Let me briefly explain this environment variable. PATH is a setting that lets you run an application from cmd by its name alone, and adding a folder to it is often called "putting it on the PATH".
To set it, search for "environment variable" in the Control Panel or similar, and the settings screen will appear.
Select Edit environment variables and a list of variables appears. Select the entry named Path and choose **Edit > New**. Add the 7-Zip installation folder shown below and select OK. To repeat, the installation folder differs from person to person; the default is as follows.
C:\Program Files\7-Zip
With this, 7-Zip is on the PATH.
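One way to confirm that the setting took effect is to ask Python where a command is found. The sketch below uses the standard library's `shutil.which`; the helper name `find_on_path` is made up for illustration.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_on_path(commands: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each command name to its full path, or None if PATH lacks it."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, location in find_on_path(["7z", "git"]).items():
        print(f"{name}: {location or 'not found on PATH'}")
```

Changes to environment variables only apply to newly started programs, so open a new command prompt (or restart Jupyter) before checking.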
From here on, install the NEologd dictionary.
Launch a command prompt with **administrator privileges** and run
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
This downloads the necessary dictionary files. Next, move to the seed directory of the downloaded repository and check its contents with dir. Everything is fine if dir shows the neologd-related files. If the seed folder cannot be found and you get an error, move to **C:\Users\(user name)\mecab-ipadic-neologd\seed** using its full path.
cd mecab-ipadic-neologd\seed
dir
For reference, cd changes the current directory to mecab-ipadic-neologd\seed, and dir lists the files in it.
The files need to be decompressed with 7-Zip, so run the following command. It means: extract every .xz file with 7-Zip.
7z x *.xz
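The extracted seed file carries a date in its name (for example mecab-user-dict-seed.20191024.csv), and that date is needed in the next step. As a rough sketch, it can be read off with Python; `find_seed_date` and `SEED_PATTERN` are hypothetical names, and the file name pattern is assumed from the example used in this article.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme: mecab-user-dict-seed.<8-digit date>.csv
SEED_PATTERN = re.compile(r"^mecab-user-dict-seed\.(\d{8})\.csv$")


def find_seed_date(seed_dir: str) -> Optional[str]:
    """Return the 8-digit date from the seed CSV file name, or None."""
    for path in sorted(Path(seed_dir).iterdir()):
        match = SEED_PATTERN.match(path.name)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    # Run from inside mecab-ipadic-neologd\seed after extracting the .xz files.
    print(find_seed_date("."))
```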
Then compile the dictionary with the following commands, which convert it to a format MeCab can read. There are some caveats, however.
**① NEologd is updated daily, so replace every 20191024 below with the date in the file names you obtained when you cloned**
**② Make sure C:\Program Files\MeCab\bin\mecab-dict-index matches where your MeCab is installed**
**③ This article chose UTF-8 when installing MeCab; if you installed with SHIFT-JIS, change "-t utf-8" to "-t shift-jis"**
"C:\Program Files\MeCab\bin\mecab-dict-index" -d "C:\Program Files\MeCab\dic\ipadic" -u NEologd.20191024.dic -f utf-8 -t utf-8 mecab-user-dict-seed.20191024.csv
mkdir "C:\Program Files\MeCab\dic\NEologd"
move NEologd.20191024.dic "C:\Program Files\MeCab\dic\NEologd"
For reference, this runs mecab-dict-index.exe located in C:\Program Files\MeCab\bin and compiles mecab-user-dict-seed.20191024.csv (a UTF-8 file in the current directory you moved to with cd) into a dictionary named NEologd.20191024.dic. It then creates a NEologd folder in C:\Program Files\MeCab\dic and moves the compiled dictionary into it.
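Because the date, the MeCab location, and the output encoding all vary between machines, it can help to assemble the command from those three values. The following sketch only builds the argument list for the compile command used in this article; `build_dict_index_cmd` is a hypothetical helper, not part of MeCab.

```python
from pathlib import PureWindowsPath
from typing import List


def build_dict_index_cmd(mecab_dir: str, date: str, charset: str = "utf-8") -> List[str]:
    """Assemble the mecab-dict-index arguments used in this article.

    mecab_dir: folder where MeCab is installed
    date:      date in the seed file name, such as "20191024"
    charset:   encoding of the compiled dictionary ("utf-8" or "shift-jis")
    """
    root = PureWindowsPath(mecab_dir)
    return [
        str(root / "bin" / "mecab-dict-index"),
        "-d", str(root / "dic" / "ipadic"),  # system dictionary folder
        "-u", f"NEologd.{date}.dic",         # compiled dictionary to write
        "-f", "utf-8",                       # encoding of the seed CSV
        "-t", charset,                       # encoding of the output
        f"mecab-user-dict-seed.{date}.csv",  # seed CSV to read
    ]


if __name__ == "__main__":
    cmd = build_dict_index_cmd(r"C:\Program Files\MeCab", "20191024")
    print(cmd)
    # To actually run it from the seed directory on Windows:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids having to quote the space in "Program Files" by hand.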
You are almost done. **Open mecabrc**, located in C:\Program Files\MeCab\etc, with Notepad, change the userdic = line so that it reads userdic = C:\Program Files\MeCab\dic\NEologd\NEologd.20191024.dic, and save the file. Depending on your permissions you may not be able to overwrite it directly; in that case, save mecabrc to another folder first and then move it back to the original location. If Notepad added a .txt extension, don't forget to remove it.
To check that NEologd is actually applied, run morphological analysis in Jupyter or similar; Keyakizaka46 (欅坂46) should now be recognized as a single proper noun.
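The same edit can be sketched in Python if you prefer not to use Notepad. This is an illustration only: `set_userdic` is a hypothetical helper that works on the file's text, and it assumes that mecabrc holds one "key = value" setting per line and that a line starting with ";" is a comment.

```python
def set_userdic(mecabrc_text: str, dic_path: str) -> str:
    """Return mecabrc text whose userdic entry points at dic_path.

    Assumption: settings are "key = value" lines, and a leading ";" marks
    a comment, so "; userdic = ..." is treated as an inactive entry.
    """
    new_line = f"userdic = {dic_path}"
    lines = mecabrc_text.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip("; \t").startswith("userdic"):
            lines[i] = new_line  # replace the first userdic entry
            break
    else:
        lines.append(new_line)  # no userdic entry found, so add one
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = "dicdir = somewhere\n; userdic = old.dic\n"  # made-up content
    print(set_userdic(example, r"C:\Program Files\MeCab\dic\NEologd\NEologd.20191024.dic"))
```

Writing the result back into C:\Program Files is subject to the same permission limits as saving from Notepad.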
import MeCab
m = MeCab.Tagger("-Ochasen")
# Japanese for "Keyakizaka46 is eating Akai Kitsune (Red Fox)."
print(m.parse("欅坂46が赤いきつねを食べている。"))
To improve the accuracy of morphological analysis further, you can load a publicly available Japanese stop word list and register words specific to your target text in a user dictionary. Accuracy should improve as you steadily exclude unnecessary words.
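As a minimal sketch of the stop word idea, the tokens produced by MeCab can be filtered against a set of words. The stop words and tokens below are made up for illustration; in practice you would load a published list from a file.

```python
from typing import Iterable, List, Set


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    # Hypothetical stop words and tokens, for illustration only.
    stop_words = {"の", "は", "が", "を", "に"}
    tokens = ["欅坂46", "が", "赤いきつね", "を", "食べ", "て", "いる"]
    print(remove_stop_words(tokens, stop_words))
```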