This article is for people who think, "I don't want to write a machine learning program — I want to build a service that uses machine learning." With almost no machine-learning-specific programming and reasonable accuracy, we aim not only to "classify" text but also to explain how to actually incorporate the classifier into a service.
The source is on GitHub, and a pretrained sample model is included, so you can try it immediately by following the steps in this article: https://github.com/shuukei-imas-cg/imas_cg_words
We will build, step by step, a subset of the author's service "Cinderella Girls Dialogue Judgment." As an application of text classification, the full service learns the lines of the 183 idols appearing in "THE IDOLM@STER CINDERELLA GIRLS" and, given an arbitrary sentence, determines which idol seems most likely to have said it.
This time, we will build the part that reads text from standard input on a local PC and displays the judgment result.
This article is intended for:

- Python beginners to intermediate users
- People who want to get started with machine learning
- People who don't want to write machine learning or deep learning code!
Jubatus is a distributed processing framework for online machine learning developed by PFN and NTT. It is both fast and scalable, and it can keep learning online (while the service is running). Reduced to its essentials, Jubatus is used in only two ways:
- Learning: give it (correct label, data)
- Classification: give it the data you want to classify → it returns the estimated label
We will use Jubatus (more precisely, the Classifier among Jubatus's many features).
MeCab is a morphological analysis engine that is a de facto standard in Japanese natural language processing. It runs very fast and has bindings for many languages. "Morphological analysis" is the process of dividing an input sentence into "morphemes" — the smallest meaningful units of the language, smaller than words and not further divisible — and estimating the part of speech of each morpheme.
mecab-ipadic-NEologd is a system dictionary for MeCab augmented with new words harvested from text on the web. It continuously crawls web language resources and is updated twice a week (Monday and Thursday), which is very effective when you want new words handled as single tokens instead of being split oddly. It also contains many entries that are simply missing from the IPA dictionary bundled with MeCab.
To use MeCab together with mecab-ipadic-NEologd, specify the mecab-ipadic-NEologd directory as MeCab's system dictionary. For example, from the command line:
$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
New Nintendo Switch software announced
(Output, translated from the original Japanese analysis; each row is a morpheme's surface form followed by its part-of-speech and reading information.)

Nintendo Switch   noun, proper noun, general, *, *, *, Nintendo Switch, ...
software          noun, general, *, *, *, *, ...
new work          noun, suru-verb connection, *, *, *, *, ...
announcement      noun, suru-verb connection, *, *, *, *, ...
EOS
You can see that "Nintendo Switch" is treated as a single token.
First, let's run the sample program. The following procedure assumes CentOS 6.9; the installation steps for Ubuntu are also covered (in Japanese) in each middleware's documentation, so adapt as needed.
# Register the Jubatus yum repository with your system
sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-2.el6.x86_64.rpm
# Install the jubatus and jubatus-client packages
sudo yum install jubatus jubatus-client
# Install Jubatus's MeCab plugin
sudo yum install jubatus-plugin-mecab
# Install Jubatus's Python client library
sudo pip install jubatus
# If you use a virtualenv, enter that environment first, then run pip install jubatus
Install MeCab as well, following the standard mecab-ipadic-NEologd installation procedure.
#Install MeCab
sudo rpm -ivh http://packages.groonga.org/centos/groonga-release-1.1.0-1.noarch.rpm
sudo yum install mecab mecab-devel mecab-ipadic git make curl xz
cd (Appropriate folder)
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n
If the last command succeeds, you will be asked whether to install the NEologd dictionary; answer yes. The dictionary's installation destination is then displayed, so make a note of it.
git clone https://github.com/shuukei-imas-cg/imas_cg_words.git
cd imas_cg_words/jubatus
If the NEologd dictionary was installed anywhere other than /usr/local/lib/mecab/dic/mecab-ipadic-neologd, correct the path in serif.json.
serif.json (excerpt)

Before:          "arg": "-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd",
After (example): "arg": "-d /usr/lib64/mecab/dic/mecab-ipadic-neologd",
After fixing or confirming the path, continue as follows. (Postscript, 2017-04-21: when a model file is loaded with the -m option, the settings saved at training time take priority over the config file, to avoid mismatches between the two, so a NEologd path specified in serif.json will not take effect. Therefore, when using the bundled model file, either create a symbolic link at /usr/local/lib/…, or wait for jubakit 0.4.3, to be released soon, and use its bundled tool to rewrite the settings inside the model file.)
# Start jubaclassifier as a resident process
# -f loads the config file, -m loads the trained model file
jubaclassifier -f serif.json -m model/sample.jubatus &
cd ..
cd localservice/
python classify.py
Once classify.py is running, try typing in words and lines of dialogue. The bundled model file was trained on 290 lines each from three idols appearing in THE IDOLM@STER CINDERELLA GIRLS — Hinako Kita, Aiumi Munakata, and Nanami Asari — 870 lines in total (the model stores learned features, not information from which the original lines can be restored). For each input, the three idols are displayed in descending order of score, i.e., of how likely each is to have said the line.
Next, let's enter lines that are not included in the training data. (This may be hard to judge without knowledge of the source material, but) the results look generally correct.
For evaluation, we split the prepared data into ten parts, trained on 90%, tested on the remaining 10%, and repeated this ten times, averaging the results (10-fold cross-validation). Leaving detailed explanations of the terminology to other articles, the accuracy was about 93% and the average F-measure was 0.94.
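For reference, the 10-fold split described above can be sketched in a few lines of plain Python. This is only an illustration of the splitting scheme, not the evaluation code the author actually used:

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them round-robin into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# With 870 examples and 10 folds, each fold holds 87 examples;
# each fold serves as the 10% test set exactly once, and the
# remaining nine folds (90%) are used for training.
folds = kfold_indices(870, k=10)
```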
The code written to get this far totals 65 lines, plus a single configuration file.
Load the CSV file and process it line by line. Here, line[0] contains the correct label (the idol's name) and line[1] contains the text (the line of dialogue). We pair the correct label with the text, wrapping the text in Datum format so that Jubatus can handle it, and append the pair to a list. A Datum is a key-value data structure in which each key is a feature name and each value is a feature. The key 'serif' will be explained later.
Once everything has been added to the list, shuffle it and train with client.train.
train.py (excerpt)

train_data = []
for line in reader:
    train_data.append((line[0], Datum({'serif': line[1]})))

random.shuffle(train_data)
for data in train_data:
    client.train([data])
All processing related to machine learning (text classification) is specified in this configuration file.
serif.json
{
  "method": "CW",
  "converter": {
    "string_filter_types": {},
    "string_filter_rules": [],
    "string_types": {
      "mecab": {
        "method": "dynamic",
        "path": "libmecab_splitter.so",
        "function": "create",
        "arg": "-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd",
        "ngram": "1",
        "base": "false",
        "include_features": "*",
        "exclude_features": ""
      }
    },
    "string_rules": [
      {"key": "serif", "type": "mecab", "sample_weight": "bin", "global_weight": "bin"}
    ]
  },
  "parameter": {
    "regularization_weight": 1.0
  }
}
The string_rules entry specifies that the "mecab" preprocessing is applied to the feature key "serif" seen earlier. The content of that "mecab" processing is defined in string_types: split the text using Jubatus's MeCab plugin with the mecab-ipadic-NEologd dictionary, and use each individual morpheme as a feature ("ngram": "1", "base": "false", and so on).
Also, the top-level "method" selects the CW algorithm from among those implemented in Jubatus; depending on the task, another algorithm may work better. Various other preprocessing steps (such as HTML tag removal) can also be specified in this configuration file. [^1]
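Switching the algorithm is a one-line change to the config. As an untested illustration, the fragment below swaps CW for AROW, another algorithm implemented in Jubatus that also takes a regularization_weight parameter (tune the value for your own task):

```json
{
  "method": "AROW",
  "parameter": {
    "regularization_weight": 1.0
  }
}
```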
[^1]: Jubatus — Data conversion
It's just about 20 lines of code.
classify.py
# -*- coding:utf-8 -*-
from __future__ import absolute_import, print_function, unicode_literals
import jubatus
from config import host, port, name
from jubatus.common import Datum


def predict():
    # Create the client once, outside the input loop
    client = jubatus.Classifier(host, port, name)
    while True:
        words = raw_input().decode("utf-8")
        datum = Datum({'serif': words})
        res = client.classify([datum])
        sorted_res = sorted(res[0], key=lambda x: -x.score)
        for result in sorted_res:
            print("label:{0} score:{1}".format(result.label, result.score))


if __name__ == '__main__':
    predict()
Using the feature name 'serif' as the key, the console input obtained by raw_input() becomes the feature value and is converted to Datum format, then classified with client.classify. Since the result comes back as a list of lists, we sort the inner list by descending score using sorted with a lambda expression, then display it.
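The sort-by-descending-score step can be seen in isolation with plain Python objects standing in for the classifier's per-datum results (the labels and scores here are placeholders, not output from the real model):

```python
from collections import namedtuple

# Stand-in for the (label, score) pairs returned per datum by classify().
Result = namedtuple("Result", ["label", "score"])

res = [Result("label_b", 0.21), Result("label_a", 1.38), Result("label_c", -0.07)]
sorted_res = sorted(res, key=lambda x: -x.score)  # highest score first
top = sorted_res[0].label
```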
You can use all of this without knowing what happens inside, but here is a minimal explanation.
In advance, we manually prepared 870 (290 × 3) lines of text together with the idol names that serve as correct labels. This is converted to CSV format and trained with train.py. Each text is split into morphemes by MeCab, and each morpheme is fed in as a "feature" of the correct label. [^2]
[^2]: This time we use an algorithm called Confidence Weighted Learning (CW). For details, see Jubatus — Algorithms.
As a result of training, each morpheme is assigned a "weight" for each idol (roughly, the number of real values is the number of morphemes × the number of correct labels).
This time, the trained model file (model/sample.jubatus) was specified when starting jubaclassifier, so it is loaded and used directly.
At classification time, the input text is likewise split into morphemes. The score for each label is the sum of that label's weights over the input's morphemes.
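As a toy illustration of that scoring rule — with made-up morphemes, labels, and weights, not values from the real model — the score of a label is just the sum of the per-morpheme weights learned for that label:

```python
# Hypothetical (morpheme, label) -> weight table produced by training.
weights = {
    ("delusion", "label_a"): 1.2,
    ("head", "label_a"): 0.3,
    ("delusion", "label_b"): -0.4,
    ("fish", "label_b"): 1.5,
}

def score(morphemes, label):
    """Sum the label's weights over the input morphemes (0.0 if unseen)."""
    return sum(weights.get((m, label), 0.0) for m in morphemes)

tokens = ["delusion", "head"]  # pretend MeCab output for one input line
scores = {lbl: score(tokens, lbl) for lbl in ("label_a", "label_b")}
```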
Normally, when classifying text with machine learning, overly frequent and uninformative characters and words are removed in advance as "stop words." This time, however, we are classifying lines of dialogue, which depend on fine nuances such as particles, so (task permitting) the input is fed in as-is — and Jubatus handles it well with the default settings.
- Input: "I was too delusional... I have to rest my head..."
- Output: the same sentence split into morphemes

In this case, the input is characterized by 12 morphemes: "delusion," "too much," ..., "head," "rest," "not," and so on.
If you prepare data in the following CSV format, you can train on it using the train.py provided in the repository.
Name (label), Serif (text)
label_a, text 1
label_a, text 2
label_b, text 1
label_b, text 2
... and so on
# Stop the jubaclassifier started earlier
pkill jubaclassifier
jubaclassifier -f serif.json &
python train.py (specify the CSV file prepared above)
This completes the learning and classification functions that form the core of a machine-learning service. Next time, we will wrap the classification function in a web server so that it can be used as a Web API from outside.
If you liked this article, please vote for Hinako Kita in the "7th Cinderella Girl General Election," which will be held in THE IDOLM@STER CINDERELLA GIRLS around April 2018.