When I started learning machine learning, GitHub didn't exist yet. I would find source code somewhere on the net, compile it, and conclude "Hmm, it doesn't work for some reason..." or "Hmm, it works for some reason...". Beyond that, I had no good way to judge whether code I had written myself was implemented correctly, so I just kept experimenting with it (decidedly inefficient computer science).
These days, though, implementations of even fairly complex theory can usually be found on GitHub. And publication on GitHub generally means the usage is clearly documented and anyone can use it, through a plain, simple interface.
Jubatus is exactly that kind of framework: a dream-like tool that lets you do machine learning without knowing the complicated theory behind it. http://jubat.us/ja/
So much for the flattery.
This article is about trying out Jubatus. The first thing I wanted to build was something that takes the name of a city, town, or village and tells you which prefecture that place name belongs to. Since the purpose was just to "try using it", the target could honestly have been anything. Address data including prefecture names is provided as a CSV file on the site [here] 1, so I used that.
I'll use the data downloaded above, but I don't want the file in Shift_JIS, so I convert the character encoding to UTF-8.
wget http://jusyo.jp/downloads/new/csv/csv_zenkoku.zip
unzip csv_zenkoku.zip
nkf -w zenkoku.csv > zenkoku_utf-8.csv
The data should now be readable as Japanese. Incidentally, this conversion isn't needed in a Windows environment, but this article assumes Linux (CentOS). (Getting Jubatus running on Windows wouldn't be straightforward in the first place, so the point is probably moot.)
Next, I delete the first row (the column descriptions), since I don't want it in the data.
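A small sketch of that step in Python (hypothetical helper; any shell one-liner would work just as well, and the filenames follow the commands above):

```python
def drop_header(in_path, out_path):
    """Copy a CSV file, skipping the first (column-description) row."""
    with open(in_path, encoding="utf-8") as src:
        lines = src.readlines()
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines[1:])  # everything except the header line
```

Called as, e.g., `drop_header("zenkoku_utf-8.csv", "zenkoku_noheader.csv")`.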
With this, the data to feed Jubatus is ready, but as it stands the rows are ordered far too regularly. Left like that, all we would end up with is a "Hokkaido lover" that answers "Hokkaido" no matter what you ask, so shuffle the lines in advance.
shuf zenkoku_utf-8.csv > shuffled_zenkoku.csv
Save this in a directory called data.
Now that the data is in place, it's time to write the configuration JSON to feed Jubatus. https://github.com/chase0213/address_classifier/blob/master/adrs_clf.json
AROW is used as the learning algorithm. There is no particular reason.
Since the input here is a vector whose elements are character strings, the string_rules section describes how those strings should be handled. I'm not trying to build anything practical, so for now I simply split the strings into unigrams and count characters.
"string_rules": [
{ "key": "*", "type": "unigram", "sample_weight": "bin", "global_weight": "bin" }
]
Of course, if you want to build something practical, you need to think this part through properly. (There is nothing practical here to begin with, since almost no preprocessing is done.)
See the [Jubatus official page] 2 for details of the configuration format.
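For reference, a full classifier config of this shape looks roughly like the following. This is a sketch based on the linked adrs_clf.json and the Jubatus documentation; the exact fields (and the `regularization_weight` value) may differ between Jubatus versions:

```json
{
  "method": "AROW",
  "converter": {
    "string_filter_types": {},
    "string_filter_rules": [],
    "num_filter_types": {},
    "num_filter_rules": [],
    "string_types": {},
    "string_rules": [
      { "key": "*", "type": "unigram", "sample_weight": "bin", "global_weight": "bin" }
    ],
    "num_types": {},
    "num_rules": []
  },
  "parameter": { "regularization_weight": 1.0 }
}
```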
After completing the settings, start the jubatus server.
$ jubaclassifier --configpath adrs_clf.json
If there is no error, it is running.
With the configuration done, we finally enter the learning phase. This is so-called training. https://github.com/chase0213/address_classifier/blob/master/train.py
Training on all of the data timed out, so for now I fed it about 50,000 rows.
tnum = 50000
Normally you would keep the data for training and the data for classification separate. This time I skipped that out of laziness.
I haven't done anything particularly difficult, so if you've read this far, the code should speak for itself. I'll omit the explanation.
The only important point is this:
# training data must be shuffled on online learning!
random.shuffle(train_data)
That's it. Since I reused the sample code as-is, the comment was already there, but it matters: if you pass the training data without shuffling, the ordering of the data gets reflected in the model. I don't understand the algorithm well enough to say anything precise, but I believe the data fed in last ends up having a stronger influence. In this case the data was already shuffled beforehand, so skipping the shuffle here wouldn't hurt performance much; but if you forget it when reusing this code, that's the end of it.
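As a sketch, building and shuffling the labeled pairs from the CSV might look like this. The column indices are hypothetical (adjust them to the actual layout of zenkoku.csv), the pairs here are plain strings rather than Jubatus Datum objects, and the linked train.py is the authoritative version:

```python
import csv
import random

def load_train_data(path, n=50000, pref_col=0, city_col=1):
    """Build (prefecture, city) pairs from at most n rows of the CSV.

    pref_col / city_col are hypothetical column indices; adjust them
    to the real layout of the address CSV. The real train.py wraps the
    city string in a Jubatus Datum before calling client.train.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= n:
                break
            pairs.append((row[pref_col], row[city_col]))
    # training data must be shuffled for online learning!
    random.shuffle(pairs)
    return pairs
```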
After shuffling, you can start learning.
# run train
client.train(train_data)
The classification side isn't particularly difficult either, so take a look at the code. https://github.com/chase0213/address_classifier/blob/master/detect.py
This time, I gave it three place names, "Isesaki", "Takasaki", and "Kamakura", and asked it which prefecture each belongs to.
Here are the results.
$ python detect.py
Isesaki: Gunma Prefecture
Takasaki: Gunma Prefecture
Kamakura: Kanagawa Prefecture
Oh!! Correct!! Amazing!!!
......
Well: you could just store the 50,000 "prefecture-city" pairs in a Python dict and try answering "which prefecture is this?" by simple lookup. The full dataset is about 160,000 rows, so you would hit about 1/3 of the time by memorization alone.
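That naive lookup baseline is easy to sketch (a hypothetical snippet, just to make the comparison concrete):

```python
def make_lookup_classifier(pairs):
    """pairs: iterable of (prefecture, city) tuples.

    Returns a function that answers only for cities seen in `pairs`;
    unlike the trained classifier, it has nothing to say about
    unknown place names.
    """
    table = {city: pref for pref, city in pairs}

    def classify(city):
        return table.get(city)  # None for unknown place names

    return classify

# Tiny illustrative sample, not the real 50,000-row data.
clf = make_lookup_classifier([("Gunma", "Isesaki"), ("Kanagawa", "Kamakura")])
```

Here `clf("Takasaki")` returns None, which is exactly the limitation the next paragraph is about.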
I knew before starting that this example isn't clever, but it still has one genuine strength: the ability to classify unknown data.
A classifier (or machine learning in general) is given known data in order to predict unknown data, so even for place names that were never in the training data, it can make a prediction (return some answer, at least). Doing that with plain Python alone would be quite difficult.
That's it for trying out jubaclassifier.