I tried to get started with Jubatus.

Assumptions & goals

CentOS 6.6
A large number of categorized blog articles (body text and so on) are stored in MySQL
I want Jubatus to learn the correspondence between the characteristic words in each blog article and its category, and then estimate the most likely category for any text I feed it.

Jubatus installation

Install from the package according to the instructions on the official website.

$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-1.el6.x86_64.rpm
$ sudo yum install jubatus jubatus-client

Get a sample

There is a sample repository called jubatus-example, so get this.

$ git clone https://github.com/jubatus/jubatus-example.git

The repository is well documented, including a Japanese README, so I think it is an easy place to start.

Modify the sample

The `twitter_streaming_location` sample fits this purpose. It works as follows.

Learning
  Collect tweets geotagged in Tokyo / Hokkaido / Kyushu from Twitter's public stream.
  Train on the tweet body to learn which region each tweet came from.
Classification
  Given a sentence, estimate the region in which it was tweeted.

Copy the whole `twitter_streaming_location` directory under a suitable name and modify the copy.

In the learning step, the classifier learns the correspondence between blog categories and body text. In the classification step, I give the classifier some text and have it guess the category.

Preparation for learning process

Preparation of training data

Write a suitable SQL query and dump the list of blog categories and body text to a text file. With the mysql CLI, feeding the query in through a redirect as follows produces tab-delimited output, and -N suppresses the column-name header.

$ mysql -uuser -p -N db < blog.sql > blog.txt
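Before training, it can help to sanity-check the export. The following is a minimal sketch, assuming that blog.sql selects exactly two columns (category, then body), which this article does not show. In batch mode the mysql client should write tabs and newlines inside values as escape sequences, so each row is expected to stay on one line. The helper names `parse_row` and `summarize` are made up for illustration.

```python
from collections import Counter


def parse_row(line):
    """Split one exported line into (category, body); return None if malformed."""
    # Split only on the first tab so a body containing tabs stays intact.
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        return None
    return parts[0], parts[1]


def summarize(lines):
    """Count articles per category and the number of malformed lines."""
    counts = Counter()
    skipped = 0
    for line in lines:
        row = parse_row(line)
        if row is None:
            skipped += 1
        else:
            counts[row[0]] += 1
    return counts, skipped


if __name__ == "__main__":
    # Assumes the export from the command above was written to blog.txt.
    with open("blog.txt") as f:
        counts, skipped = summarize(f)
    for category, n in counts.most_common():
        print("%s\t%d" % (category, n))
    print("malformed lines: %d" % skipped)
```

A category with very few articles, or a large number of malformed lines, is worth fixing in the SQL before feeding anything to the classifier.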

Modified train.py

The original train.py is cluttered with code that parses tweet geotags. I rewrote it slightly so that it learns from tab-delimited data fed through standard input instead of tweets fetched from the network.

`train.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus Configuration
host = "127.0.0.1"
port = 9199
instance_name = "" # required only when using distributed mode

def print_color(color, msg, end):
    sys.stdout.write('\033[' + str(color) + 'm' + str(msg) + '\033[0m' + str(end))

def print_red(msg, end="\n"):
    print_color(31, msg, end)

def print_green(msg, end="\n"):
    print_color(32, msg, end)

def train():
    classifier = client.Classifier(host, port, instance_name)
    for line in sys.stdin:
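        # Drop the trailing newline and split on the first tab only,
        # so a body that itself contains tabs does not break unpacking
        category_name, body = line.rstrip("\n").split("\t", 1)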
        d = Datum({'text': body})
        classifier.train([(category_name, d)])

        # Print trained entry
        print_green(category_name, ' ')
        print(body)

    # To save the trained model after learning, uncomment the following line
    # classifier.save("foo")

if __name__ == '__main__':
    try:
        train()
    except KeyboardInterrupt:
        print("Stopped.")
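The script above makes one train() call per line. Since train() takes a list of (label, datum) pairs, the lines could also be grouped and sent several at a time. The following is only a sketch of that grouping, with no Jubatus dependency: `batched`, `train_in_batches`, the `send` callback and the batch size of 100 are all my own assumptions, and I have not measured whether batching makes a difference here.

```python
def batched(items, size):
    """Yield lists of up to `size` consecutive items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def train_in_batches(lines, send, size=100):
    """Parse tab-delimited lines and hand them to `send` one batch at a time.

    Lines without a tab are skipped. Returns the number of rows sent.
    """
    total = 0
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if "\t" in line)
    for batch in batched(pairs, size):
        send([(category, body) for category, body in batch])
        total += len(batch)
    return total


# In train.py, `send` could look like this (assumption, not run against a live server):
#   lambda batch: classifier.train([(c, Datum({'text': b})) for c, b in batch])
```

Passing the sender as a callback keeps the parsing testable without a running jubaclassifier.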

Preparation for classification process

Modified classify.py

This script needs almost no changes, but I limited the output to the top three estimated categories.

`classify.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus configuration
host = "127.0.0.1"
port = 9199
instance_name = "" # required only when using distributed mode

def estimate_blog_category_for(text):
    classifier = client.Classifier(host, port, instance_name)

    # Create datum for Jubatus
    d = Datum({'text': text})

    # Send estimation query to Jubatus
    result = classifier.classify([d])

    if len(result[0]) > 0:
        # Sort results by score
        est = sorted(result[0], key=lambda e: e.score, reverse=True)

        # Print the top three results
        print("Estimated Category for %s:" % text)
        for e in est[:3]:
            print("  " + e.label + " (" + str(e.score) + ")")
    else:
        # No estimation results; maybe we haven't trained enough
        print("No estimation results available.")
        print("Train more data or try using another text.")

if __name__ == '__main__':
    if len(sys.argv) == 2:
        estimate_blog_category_for(sys.argv[1])
    else:
        print("Usage: %s data" % sys.argv[0])
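The "top three" selection can be tried without a running server. The sketch below uses a namedtuple called `Estimate` as a hypothetical stand-in for the result objects that classify() returns. It only mimics the `label` and `score` attributes used in the script above. `top_n` and `format_estimates` are illustrative names as well.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for the objects returned by classifier.classify()
Estimate = namedtuple("Estimate", ["label", "score"])


def top_n(estimates, n=3):
    """Return the n highest-scoring estimates, best first."""
    return heapq.nlargest(n, estimates, key=lambda e: e.score)


def format_estimates(estimates):
    """Render estimates in the same "  label (score)" form as classify.py."""
    return ["  %s (%s)" % (e.label, e.score) for e in estimates]


if __name__ == "__main__":
    # Made-up scores, for illustration only
    results = [Estimate("diary", 0.08), Estimate("Notice", 0.06),
               Estimate("Self-introduction", 0.23), Estimate("food", 0.01)]
    for line in format_estimates(top_n(results)):
        print(line)
```

For three items out of a handful of categories, sorting and slicing as in the script is just as good. `heapq.nlargest` only starts to matter when the list of labels is long.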

Start jubatus server

I wanted the text to be split into words with MeCab rather than into character bigrams, so I rewrote the configuration a little.

`blog_category.json`


{
  "method": "NHERD",
  "parameter": {
    "regularization_weight": 0.001
  },
  "converter": {
    "num_filter_types": {
    },
    "num_filter_rules": [
    ],
    "string_filter_types": {
    },
    "string_filter_rules": [
    ],
    "num_types": {
    },
    "num_rules": [
    ],
    "string_types": {
        "bigram":  { "method": "ngram", "char_num": "2" },
        "mecab": {
          "method": "dynamic",
          "path": "libmecab_splitter.so",
          "function": "create"
        }
    },
    "string_rules": [
        { "key": "*", "type": "mecab", "sample_weight": "bin", "global_weight": "idf" }
    ]
  }
}
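To make the difference between the two string types a little more concrete, here is a rough sketch of what I understand the `bigram` type and the `bin` sample weight to do: every run of two consecutive characters becomes a feature, and each feature counts as 1 however often it occurs. This is my own illustration, not Jubatus code. The `mecab` type would instead produce one feature per word found by MeCab, and the `idf` global weight is left out here.

```python
def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def binary_features(tokens):
    """Map each distinct token to 1.0, ignoring how often it occurs."""
    return {token: 1.0 for token in tokens}


if __name__ == "__main__":
    tokens = char_ngrams("blog")
    print(tokens)
    print(sorted(binary_features(tokens)))
```

Character bigrams need no dictionary, but they cut across word boundaries, which is why I preferred word-level splitting for category estimation.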

Start the server, passing this JSON file as its configuration.

$ jubaclassifier -f blog_category.json -t 0
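Each entry in `string_rules` refers to a string type by name, so a quick check that every referenced name is actually defined can catch typos before the server is started. The following is a sketch under assumptions: `undefined_string_types` is a hypothetical helper, and the `builtin` default of `str` and `space` is my understanding of the types Jubatus predefines, so adjust it if the documentation says otherwise.

```python
import json


def undefined_string_types(config, builtin=("str", "space")):
    """Return string_rules types that are neither defined in string_types nor builtin.

    `builtin` lists type names assumed to be predefined by Jubatus.
    """
    converter = config.get("converter", {})
    defined = set(converter.get("string_types", {}))
    used = [rule["type"] for rule in converter.get("string_rules", [])]
    return [t for t in used if t not in defined and t not in builtin]


if __name__ == "__main__":
    # Assumes the configuration above was saved as blog_category.json
    with open("blog_category.json") as f:
        missing = undefined_string_types(json.load(f))
    if missing:
        print("Undefined string types: " + ", ".join(missing))
    else:
        print("All string_rules types are defined.")
```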

Operation test

Learning

Feed the prepared training data to train.py.

$ cat blog.txt | ./train.py

Classification

Feed in some text and have the classifier guess the category.

$ ./classify.py "Nice to meet you. My name is Tanaka."
Estimated Category for Nice to meet you. My name is Tanaka.:
  Self-introduction (0.231856495142)
  diary (0.0823381990194)
  Notice (0.0661180838943)
