Learn the trends of feature words in texts with Jubatus and categorize input texts

I tried to get started with Jubatus.

Assumptions & goals

Jubatus installation

Install from the package according to the instructions on the official website.

$ sudo rpm -Uvh http://download.jubat.us/yum/rhel/6/stable/x86_64/jubatus-release-6-1.el6.x86_64.rpm
$ sudo yum install jubatus jubatus-client

Get a sample

There is a sample repository called jubatus-example, so clone it.

$ git clone https://github.com/jubatus/jubatus-example.git

The repository has plenty of documentation, including a Japanese README, so I think it is an easy place to start.

Modify the sample

For this purpose, you can use the `twitter_streaming_location` sample.

Copy the whole `twitter_streaming_location` directory under a suitable name and modify it.

In the training step, the classifier learns the correspondence between blog categories and body text. In the classification step, we give the classifier some text and have it guess the category.

Preparing the training step

Preparing the training data

Write a suitable SQL query that outputs the category and body text of each blog post. From the command line, you can get tab-delimited output as follows.

$ mysql -uuser -p -N db < blog.sql > blog.txt
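
When `mysql` reads its query from a file and writes to a file like this, it runs in batch mode, which according to the MySQL client documentation escapes tabs, newlines and backslashes inside column values so that each row stays on one line. A minimal sketch of reading such a line back, assuming a two-column result (category, body). `unescape` and `parse_row` are hypothetical helper names, not part of Jubatus or the sample.

```python
def unescape(value):
    """Undo the backslash escapes that mysql applies in batch mode."""
    mapping = {"n": "\n", "t": "\t", "0": "\0", "\\": "\\"}
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown escapes are kept as-is rather than guessed at
            out.append(mapping.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def parse_row(line):
    """Split one 'category<TAB>body' line into its two fields."""
    category, body = line.rstrip("\n").split("\t", 1)
    return unescape(category), unescape(body)
```

Whether you want the unescaped body (with real newlines) or the one-line escaped form as the training text is a judgment call; either way the split on the first tab stays the same.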

Modifying train.py

The original train.py does a lot of work parsing the geotags of tweets, which makes it fairly cluttered. I rewrote it slightly so that it learns from tab-delimited data fed through standard input instead of tweets fetched from the network.

train.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus Configuration
host = "127.0.0.1"
port = 9199
instance_name = "" # required only when using distributed mode

def print_color(color, msg, end):
    sys.stdout.write('\033[' + str(color) + 'm' + str(msg) + '\033[0m' + str(end))

def print_red(msg, end="\n"):
    print_color(31, msg, end)

def print_green(msg, end="\n"):
    print_color(32, msg, end)

def train():
    classifier = client.Classifier(host, port, instance_name)
    for line in sys.stdin:
        # Strip the trailing newline and split on the first tab only
        category_name, body = line.rstrip("\n").split("\t", 1)
        d = Datum({'text': body})
        classifier.train([(category_name, d)])

        # Print trained entry
        print_green(category_name, ' ')
        print(body)

    # To back up the trained model after training, enable the following
    # classifier.save("foo")

if __name__ == '__main__':
    try:
        train()
    except KeyboardInterrupt:
        print("Stopped.")
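
The `train` call above sends a list containing a single (label, datum) pair per request. Since the argument is a list, it should be possible to send several pairs at once and cut down on round trips, though that is an assumption worth checking against your client version. A sketch of the grouping step only, with `chunked` as a hypothetical helper:

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical use inside train(), assuming `rows` yields
# (category_name, Datum) pairs built from standard input:
# for batch in chunked(rows, 100):
#     classifier.train(batch)
```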

Preparing the classification step

Modifying classify.py

This file needs almost no changes, but I changed the output to show only the top three estimated categories.

classify.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys

from jubatus.classifier import client
from jubatus.common import Datum

# Jubatus configuration
host = "127.0.0.1"
port = 9199
instance_name = "" # required only when using distributed mode

def estimate_blog_category_for(text):
    classifier = client.Classifier(host, port, instance_name)

    # Create datum for Jubatus
    d = Datum({'text': text})

    # Send estimation query to Jubatus
    result = classifier.classify([d])

    if len(result[0]) > 0:
        # Sort results by score
        est = sorted(result[0], key=lambda e: e.score, reverse=True)

        # Print the top three results
        print("Estimated Category for %s:" % text)
        for e in est[:3]:
            print("  " + e.label + " (" + str(e.score) + ")")
    else:
        # No estimation results; maybe we haven't trained enough
        print("No estimation results available.")
        print("Train more data or try using another text.")

if __name__ == '__main__':
    if len(sys.argv) == 2:
        estimate_blog_category_for(sys.argv[1])
    else:
        print("Usage: %s data" % sys.argv[0])
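
Sorting the whole result list just to show three entries is fine at this scale. For reference, the same top-three selection can be sketched with `heapq.nlargest` from the standard library. `Estimate` below is a stand-in namedtuple invented for this example so that it runs without a server; it only mimics the `label` and `score` attributes used above, and the sample values are made up.

```python
import heapq
from collections import namedtuple

# Stand-in for the objects returned by classify(); hypothetical, for illustration only
Estimate = namedtuple("Estimate", ["label", "score"])


def top_k(estimates, k=3):
    """Return the k highest-scoring estimates, best first."""
    return heapq.nlargest(k, estimates, key=lambda e: e.score)


sample = [
    Estimate("diary", 0.08),
    Estimate("Notice", 0.07),
    Estimate("Self-introduction", 0.23),
    Estimate("travel", 0.01),
]
for e in top_k(sample):
    print("  %s (%s)" % (e.label, e.score))
```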

Starting the Jubatus server

I wanted the text to be tokenized with MeCab rather than split into bigrams, so I rewrote the configuration a bit.

blog_category.json


{
  "method": "NHERD",
  "parameter": {
    "regularization_weight": 0.001
  },
  "converter": {
    "num_filter_types": {
    },
    "num_filter_rules": [
    ],
    "string_filter_types": {
    },
    "string_filter_rules": [
    ],
    "num_types": {
    },
    "num_rules": [
    ],
    "string_types": {
        "bigram":  { "method": "ngram", "char_num": "2" },
        "mecab": {
          "method": "dynamic",
          "path": "libmecab_splitter.so",
          "function": "create"
        }
    },
    "string_rules": [
        { "key": "*", "type": "mecab", "sample_weight": "bin", "global_weight": "idf" }
    ]
  }
}
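
To see why the choice of splitter matters, here is a rough illustration of what character bigrams look like compared with word-level tokens. This is only a sketch of the idea, not Jubatus's implementation: `char_ngrams` is a hypothetical helper, and whitespace splitting stands in for MeCab, which performs real morphological analysis on Japanese text.

```python
def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


text = "nice day"
print(char_ngrams(text))  # character bigrams, including ones that span the space
print(text.split())       # word-level tokens
```

With bigrams the features are fragments that cross word boundaries, while a word-level splitter yields one feature per token, which is closer to "feature words" in the sense of the title.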

Start the server, specifying this JSON file.

$ jubaclassifier -f blog_category.json -t 0
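
A typo in the config is easy to make when editing it by hand. Before starting the server, a small check like the following can confirm that the file parses as JSON and that every `type` referenced in `string_rules` is defined in `string_types`. This is a hypothetical helper, not a Jubatus tool, and it only understands the structure shown above. The set of built-in type names (`str`, `space`) is my assumption from the Jubatus documentation, so adjust it if your version differs.

```python
import json

# Built-in string type names; assumed from the Jubatus docs, adjust as needed
BUILTIN_STRING_TYPES = {"str", "space"}


def undefined_string_types(config):
    """Return rule types that are neither defined in string_types nor built in."""
    converter = config.get("converter", {})
    defined = set(converter.get("string_types", {})) | BUILTIN_STRING_TYPES
    used = [rule["type"] for rule in converter.get("string_rules", [])]
    return [t for t in used if t not in defined]


def check_config_file(path):
    """Load a config file and return its undefined string types (empty if none)."""
    with open(path) as f:
        return undefined_string_types(json.load(f))
```

For example, `check_config_file("blog_category.json")` should return an empty list for the config shown above, since `mecab` is defined under `string_types`.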

Trying it out

Training

Feed the prepared training data to train.py.

$ cat blog.txt | ./train.py

Classification

Let's feed in some text and have the classifier guess its category.

$ ./classify.py "Nice to meet you. My name is Tanaka."
Estimated Category for Nice to meet you. My name is Tanaka.:
  Self-introduction (0.231856495142)
  diary (0.0823381990194)
  Notice (0.0661180838943)
