This article is related to part ❷ of Collection and classification of machine learning related information (concept).
It has been several months since the actual investigation, so it may be slightly different from the current situation. Also, please note in advance that the results are not satisfactory.
I'm new to Qiita and Python, so there may be many odd parts; I'd appreciate comments pointing them out.
The process explained in this article is as follows.
❶ Site crawl
Place the crawled documents under the bookmarks.crawled directory.
↓
❷ Turn each article into a Python object
Create a Python object for each article.
↓
❸ Turn the corpus into a Python object
Convert the entire set of documents into a Python object as a corpus.
↓
❹ Classification by topic model
Use this corpus to attempt classification with a topic model.
The details of the thesaurus appear somewhat out of order, but otherwise I will explain each step as much as possible.
In ❷ of Collection and classification of machine learning related information (concept), the scenario was to feed in the results collected by FESS directly. This time, however, the content downloaded by crawl.rb of [Shortcut Directory and Plain Text Conversion](http://qiita.com/suchowan/items/6556756d2e816c7255b7#5-%E3%83%97%E3%83%AC%E3%82%A4%E3%83%B3%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%82%AF%E3%83%AD%E3%83%BC%E3%83%AB%E3%83%89%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%84%E3%83%87%E3%82%A3%E3%83%AC%E3%82%AF%E3%83%88%E3%83%AA) is placed under the bookmarks.crawled directory and used as input.
This is because, if the results collected by FESS were input directly,

・old documents would expire
・duplicate articles and less important articles that had been excluded during manual classification would come back
The HTML files under the bookmarks.crawled directory are read and stored in objects of the Python Article class.
Article class

attributes
・path: HTML file path
・contents: the text of the HTML file with HTML tags removed
・tokens: list of nouns appearing in contents (list of string)
The library survey in Extract text from HTML of blogs with Python 2015 is a helpful reference.
For a full-fledged implementation, Webstemmer would be the tool to use, but it requires generating a template for each blog site in advance, which is cumbersome, so I did not use it this time.
The implemented Article class is based on the regular expressions in extractcontent.
(1) janome
I tried janome, a pure-Python Japanese morphological analysis library. Its dictionary has almost the same structure as MeCab's, and alphabetic words are defined in full-width characters, so I used the half-width/full-width conversion library mojimoji for preprocessing.
article_janome.py

```python
import codecs
import re
import mojimoji
from janome.tokenizer import Tokenizer

class Article:

    # Candidate encodings, tried in order when reading an HTML file
    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    # janome tokenizer with a user dictionary in "simpledic" format
    tokenizer = Tokenizer("user_dic.csv", udic_type="simpledic", udic_enc="utf8")

    def __init__(self, path):
        print(path)
        self.path = path
        self.contents = self.preprocess(self.get_contents(path))
        # Keep custom nouns and proper/general/verbal nouns only
        # (in the original Japanese code this regex matches janome's POS labels,
        #  i.e. カスタム名詞|名詞,(固有|一般|サ変))
        self.tokens = [token.surface for token in self.tokenizer.tokenize(self.contents)
                       if re.match("Custom noun|noun,(Unique|General|Sa strange)", token.part_of_speech)]

    def get_contents(self, path):
        exceptions = []
        for encoding in self.encodings:
            try:
                all = codecs.open(path, 'r', encoding).read()
                # Split off everything before the first <body> or <frame> tag
                parts = re.split("(?i)<(body|frame)[^>]*>", all, 1)
                if len(parts) == 3:
                    head, void, body = parts
                else:
                    print('Cannot split ' + path)
                    body = all
                # Remove script/style/select/noscript blocks, then strip the remaining tags
                return re.sub("<[^>]+?>", "",
                              re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>", "", body))
            except UnicodeDecodeError as e:
                exceptions.append(e)
                continue
        print('Cannot detect encoding of ' + path)
        print(exceptions)
        return None

    def get_title(self, path):
        return re.split('\/', path)[-1]

    def preprocess(self, text):
        text = re.sub("&[^;]+;", " ", text)            # drop HTML entities
        text = mojimoji.han_to_zen(text, digit=False)  # half-width -> full-width (digits excluded)
        text = re.sub('(\s|　|#)+', " ", text)         # collapse runs of whitespace and '#'
        return text
```
In the default IPA dictionary, a term like "人工知能" (artificial intelligence) is decomposed into two words, "artificial" and "intelligence". Therefore, I registered the terms I wanted treated as single words in user_dic.csv and had janome use it.
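For illustration, here is a minimal sketch of that registration, assuming janome's "simpledic" user-dictionary format (surface, part of speech, reading); the dictionary line and sample sentence are my own examples, not taken from the original user_dic.csv.

```python
# Hypothetical user_dic.csv line in janome's "simpledic" format:
#   人工知能,カスタム名詞,ジンコウチノウ
# With such an entry, 人工知能 should come out as a single token
# tagged with the custom part of speech instead of being split in two.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer("user_dic.csv", udic_type="simpledic", udic_enc="utf8")
for token in tokenizer.tokenize("人工知能の研究が進む"):
    print(token.surface, token.part_of_speech)
```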
Later I also came across mecab-ipadic-NEologd: Neologism dictionary for MeCab, "Added Wikipedia and Hatena words to MeCab's dictionary on Ubuntu 14.04", and "Generate and use a user dictionary from Wikipedia and Hatena keywords for morphological analysis", but I have not tried them yet, since by then I had already switched to the thesaurus.csv approach described below.
(3) thesaurus
As described later, when tokens were extracted with a Japanese morphological analysis library, perplexity did not stay within an acceptable range and topic extraction did not work. Therefore, I registered about 350 terms that frequently appear in artificial intelligence articles by hand in thesaurus.csv in advance.
thesaurus.csv (example)

```
Natural language processing,NLP,Natural Language Processing,natural language processing
Question answering
voice recognition
AlphaGo,Alphago
…
```
The process of looking up this file and cutting out only the words that hit it as tokens is implemented in thesaurus.py below,
thesaurus.py

```python
import re
import mojimoji

class Thesaurus:

    def __init__(self, path):
        map = dict()
        with open(path, 'r') as thesaurus:
            for line in thesaurus.readlines():
                # Each line lists synonyms; normalize them to full-width like the article text
                words = [mojimoji.han_to_zen(word, digit=False)
                         for word in re.split(',', line.strip())]
                for word in words:
                    if word in map:
                        print('Word duplicated: ' + word)
                        raise Exception('Word duplicated: ' + word)
                    # Every synonym maps to the first (canonical) word on its line
                    map[word] = words[0]
        self.words = map
        # Longer entries first, so that the longest match wins
        self.re = re.compile("|".join(sorted(map.keys(), key=lambda x: -len(x))))

    def tokenize(self, sentence):
        for token in re.finditer(self.re, sentence):
            yield(Token(self.words[token.group()]))

class Token:

    def __init__(self, surface):
        self.surface = surface
        # Matches the part-of-speech filter used in article.py below
        self.part_of_speech = "Custom noun"
```
and it replaced the Japanese morphological analysis library [^1].
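For reference, a minimal usage sketch of the class above; the sample sentence is an illustrative assumption, and the input is full-width converted the same way Article.preprocess does.

```python
import mojimoji
from thesaurus import Thesaurus

tokenizer = Thesaurus('thesaurus.csv')
# Article.preprocess converts the article text to full-width, so do the same here
text = mojimoji.han_to_zen('An article about AlphaGo and natural language processing', digit=False)
print([token.surface for token in tokenizer.tokenize(text)])
# Each hit is reported as the first (canonical) word on its thesaurus.csv line
```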
article.py

```python
import codecs
import re
import mojimoji
from thesaurus import Thesaurus

class Article:

    # Candidate encodings, tried in order when reading an HTML file
    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    # Thesaurus-based tokenizer replacing janome (same tokenize() API)
    tokenizer = Thesaurus('thesaurus.csv')

    def __init__(self, path):
        print(path)
        self.path = path
        self.contents = self.preprocess(self.get_contents(path))
        # The Thesaurus tokenizer only yields "Custom noun" tokens,
        # so this filter passes everything it produces
        self.tokens = [token.surface for token in self.tokenizer.tokenize(self.contents)
                       if re.match("Custom noun|noun,(Unique|General|Sa strange)", token.part_of_speech)]

    def get_contents(self, path):
        exceptions = []
        for encoding in self.encodings:
            try:
                all = codecs.open(path, 'r', encoding).read()
                parts = re.split("(?i)<(body|frame)[^>]*>", all, 1)
                if len(parts) == 3:
                    head, void, body = parts
                else:
                    print('Cannot split ' + path)
                    body = all
                return re.sub("<[^>]+?>", "",
                              re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>", "", body))
            except UnicodeDecodeError as e:
                exceptions.append(e)
                continue
        print('Cannot detect encoding of ' + path)
        print(exceptions)
        return None

    def get_title(self, path):
        return re.split('\/', path)[-1]

    def preprocess(self, text):
        text = re.sub("&[^;]+;", " ", text)            # drop HTML entities
        text = mojimoji.han_to_zen(text, digit=False)  # half-width -> full-width (digits excluded)
        return text
```
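A minimal usage sketch (the path is hypothetical):

```python
from article import Article

a = Article('bookmarks.crawled/example/article.html')  # hypothetical crawled file
print(a.contents[:100])  # tag-stripped, full-width normalized text
print(a.tokens[:10])     # canonical thesaurus terms found in the article
```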
In a topic model, documents are handled as BOW (Bag of Words: a list of (word ID, number of occurrences); a short illustration follows the attribute list below). Therefore, the following classes are defined.
Corpus class

attributes
・articles: OrderedDict of (HTML file path: Article object)
・keys: list of HTML file paths (list of string)
・size: number of Article objects
・texts: the tokens that make up the corpus (list of (list of string))
・corpus: texts converted to a list of BOW

It has class methods save/load, so objects can be saved to and restored from files.

Corpora class

attributes
・training: Corpus object for training
・test: Corpus object for test
・dictionary: gensim.corpora.Dictionary object shared by training and test
(keeps the correspondence between word IDs (integers) and their surface forms (strings))
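As a quick illustration of this BOW representation, here is a standalone sketch with made-up token lists, independent of the classes below:

```python
from gensim import corpora

texts = [['machine learning', 'python', 'machine learning'],
         ['python', 'deep learning']]          # made-up token lists
dictionary = corpora.Dictionary(texts)         # assigns an integer ID to each word
print([dictionary.doc2bow(text) for text in texts])
# e.g. [[(0, 2), (1, 1)], [(1, 1), (2, 1)]] -- each document as (word ID, number of occurrences)
```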
corpus.py

```python
import pickle
from collections import defaultdict
from gensim import corpora

class Corpora:

    def __init__(self, training, test, dictionary):
        self.training = training
        self.test = test
        self.dictionary = dictionary

    def save(self, title):
        self.training.save(title+'_training')
        self.test.save(title+'_test')
        self.dictionary.save(title+".dict")

    @classmethod
    def load(cls, title):
        training = Corpus.load(title+'_training')
        test = Corpus.load(title+'_test')
        dictionary = corpora.Dictionary.load(title+".dict")
        return cls(training, test, dictionary)

    @classmethod
    def generate(cls, training, test):
        training_corpus = Corpus.generate(training)
        test_corpus = Corpus.generate(test)
        all_texts = training_corpus.texts + test_corpus.texts
        frequency = defaultdict(int)
        for text in all_texts:
            for token in text:
                frequency[token] += 1
        all_texts = [[token for token in text if frequency[token] > 1] for text in all_texts]
        dictionary = corpora.Dictionary(all_texts)
        training_corpus.mm(dictionary)
        test_corpus.mm(dictionary)
        return cls(training_corpus, test_corpus, dictionary)

class Corpus:

    def __init__(self, articles):
        self.articles = articles
        self.keys = list(articles.keys())
        self.size = len(articles.keys())

    def article(self, index):
        return self.articles[self.keys[index]]

    def mm(self, dictionary):
        values_set = set(dictionary.values())
        self.texts = [[token for token in text if token in values_set] for text in self.texts]
        # print(self.texts[0])
        self.corpus = [dictionary.doc2bow(text) for text in self.texts]

    def save(self, title):
        with open(title+".pickle", 'wb') as f:
            pickle.dump(self.articles, f)
        corpora.MmCorpus.serialize(title+".mm", self.corpus)

    @classmethod
    def load(cls, title):
        with open(title+".pickle", 'rb') as f:
            articles = pickle.load(f)
        corpus = cls(articles)
        corpus.corpus = corpora.MmCorpus(title+".mm")
        return corpus

    @classmethod
    def generate(cls, articles):
        corpus = cls(articles)
        corpus.texts = [articles[key].tokens for key in articles.keys()]
        return corpus
```
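A minimal sketch of how these classes fit together; training_articles and test_articles are assumed to be OrderedDicts of Article objects, built as in the script further below:

```python
from corpus import Corpora

# Build both corpora plus the shared dictionary, save them, and load them back
pair = Corpora.generate(training_articles, test_articles)
pair.save('article_contents')     # writes article_contents_training.pickle/.mm,
                                  # article_contents_test.pickle/.mm and article_contents.dict
pair = Corpora.load('article_contents')
print(pair.training.size, pair.test.size, len(pair.dictionary))
```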
Everything up to this point is groundwork that is needed regardless of which on-premises tool is used afterwards.
With these tools in place, I performed the classification with a topic model, referring to Creating an application using the topic model (*1).
test_view_LDA.py

```python
import logging
import glob
import numpy as np
import matplotlib.pylab as plt
from collections import OrderedDict
from gensim import corpora, models, similarities
from pprint import pprint # pretty-printer
from corpus import Corpus, Corpora
from article import Article

#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

topic_range = range(10, 11)
training_percent = 90
test_percent = 10
path_pattern = '/home/samba/suchowan/links/bookmarks.crawled/**/*.html'

def corpus_pair(path, training_range, test_range):
    # Deterministically split the crawled files into training and test sets
    all_paths = glob.glob(path, recursive=True)
    training_paths = [v for i, v in enumerate(all_paths) if ((i * 2017) % 100) in training_range]
    test_paths     = [v for i, v in enumerate(all_paths) if ((i * 2017) % 100) in test_range]
    training_articles = OrderedDict([(path, Article(path)) for path in training_paths])
    test_articles = OrderedDict([(path, Article(path)) for path in test_paths])
    return Corpora.generate(training_articles, test_articles)

def calc_perplexity(m, c):
    return np.exp(-m.log_perplexity(c))

def search_model(pair):
    # Keep the model with the lowest perplexity on the test corpus
    most = [1.0e15, None]
    print("dataset: training/test = {0}/{1}".format(pair.training.size, pair.test.size))
    for t in topic_range:
        m = models.LdaModel(corpus=pair.training.corpus, id2word=pair.dictionary,
                            num_topics=t, iterations=500, passes=10)
        p1 = calc_perplexity(m, pair.training.corpus)
        p2 = calc_perplexity(m, pair.test.corpus)
        print("{0}: perplexity is {1}/{2}".format(t, p1, p2))
        if p2 < most[0]:
            most[0] = p2
            most[1] = m
    return most[0], most[1]

pair = corpus_pair(path_pattern, range(0, training_percent+1),
                   range(training_percent, training_percent+test_percent+1))
pair.save('article_contents')
perplexity, model = search_model(pair)
print("Best model: topics={0}, perplexity={1}".format(model.num_topics, perplexity))

def show_document_topics(c, m, r):
    # make document/topics matrix
    t_documents = OrderedDict()
    for s in r:
        # ts = m.__getitem__(c[s], -1)
        ts = m[c[s]]
        max_topic = max(ts, key=lambda x: x[1])
        if max_topic[0] not in t_documents:
            t_documents[max_topic[0]] = []
        t_documents[max_topic[0]] += [(s, max_topic[1])]
    return t_documents

topic_documents = show_document_topics(pair.test.corpus, model, range(0, pair.test.size))

for topic in topic_documents.keys():
    print("Topic #{0}".format(topic))
    for article in topic_documents[topic]:
        print(article[0], pair.test.article(article[0]).path)

pprint(model.show_topics())
```
The library used was gensim. I also referred to Similarity calculation for Twitter users using tfidf, lsi, lda and [Try natural language processing with Python_topic model](http://esu-ko.hatenablog.com/entry/2016/03/24/Python%E3%81%A7%E8%87%AA%E7%84%B6%E8%A8%80%E8%AA%9E%E5%87%A6%E7%90%86%E3%82%92%E3%81%97%E3%81%A6%E3%81%BF%E3%82%8B_%E3%83%88%E3%83%94%E3%83%83%E3%82%AF%E3%83%A2%E3%83%87%E3%83%AB).
★ training

・input: training corpus, a list of (list of (word ID, number of occurrences)), and the number of topics
　each list of (word ID, number of occurrences) is the word counts of an individual article (order of appearance is not considered)
・output: LDA model (gensim.models.ldamodel)
　in effect, a formula that computes topic fit probabilities from a list of (word ID, number of occurrences)

★ test

・input: test corpus, a list of (list of (word ID, number of occurrences))
　each list of (word ID, number of occurrences) is the word counts of an individual article (order of appearance is not considered)
・output: list of (list of fit probabilities)
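For a single test document, the fit probabilities can be read off like this (a sketch assuming the model and pair objects from test_view_LDA.py above):

```python
# Topic distribution of the first test document: list of (topic ID, fit probability);
# gensim only lists topics above a small probability threshold
doc_bow = pair.test.corpus[0]
print(model[doc_bow])
```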
I first tried building the corpus and classifying with the topic model simply by extracting words with janome and filtering by part of speech alone, but perplexity became an astronomical value and the result was meaningless.
Some required preprocessing may have been omitted, but the underlying reason is clear:

number of word types >> number of documents

A topic model has roughly (number of word types + α) adjustable variables, so forcing it to converge under the condition "number of word types >> number of documents" inevitably leads to overfitting. The words have to be narrowed down so that

number of word types << number of documents
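A quick way to check this condition is to compare the dictionary size with the document count, as in the sketch below (assuming the corpora saved as 'article_contents' by the script above):

```python
from corpus import Corpora

pair = Corpora.load('article_contents')
num_word_types = len(pair.dictionary)                  # vocabulary size
num_documents = pair.training.size + pair.test.size
print(num_word_types, num_documents)                   # the former should be well below the latter
```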
The following is the result of manually registering about 350 words that frequently appear in artificial intelligence articles in thesaurus.csv and building the corpus from those words only.
The number of topics is an input to training, but the decision can be automated by searching for the number of topics that minimizes perplexity. For this run it had been confirmed in advance that perplexity is minimized at 10 topics.
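In test_view_LDA.py this only requires widening the search range; a sketch, with arbitrary bounds:

```python
# search_model() then keeps the model with the lowest test-corpus perplexity
topic_range = range(2, 31)
```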
According to (*1):

"The reciprocal of perplexity indicates how well a word in a document can be predicted, so the best value is 1, and the worse the model, the larger the value. Two digits is fine, the low three-digit range is acceptable, anything beyond that is poor, and if it comes out as a single digit you should check the model and the perplexity calculation for mistakes."
In this run, 1,920 articles (90%) were used for training and 210 articles (10%) for test [^2], and the perplexity of the test corpus was 68.4.
The resulting formulas for computing the topic fit probabilities were as follows:
[(0,
'0.268*image+ 0.124*Dell + 0.049*CNN + 0.043*Deep learning+ 0.038*neural network+ '
'0.026*Machine learning+ 0.025*Chainer + 0.024*GPU + 0.023*article+ 0.022*Image recognition'),
(1,
'0.135*Machine learning+ 0.121*Python + 0.102*article+ 0.055*Chainer + 0.052*Dell + '
'0.037*Deep learning+ 0.033*numpy + 0.023*Framework+ 0.019*neural network+ 0.019*Spark'),
(2,
'0.111*article+ 0.097*Forecast+ 0.090*Ranking+ 0.071*University+ 0.055*Search+ 0.033*Artificial intelligence+ '
'0.032*Yahoo + 0.032*Dell + 0.029*Database+ 0.026*Patent'),
(3,
'0.121*Ruby + 0.100*game+ 0.090*AlphaGo + 0.085*Go+ 0.077*article+ 0.076*Artificial intelligence+ '
'0.053*Google + 0.052*Microsoft + 0.047*Tay + 0.034*Twitter'),
(4,
'0.113*TensorFlow + 0.103*LSTM + 0.070*Dell + 0.068*CNN + 0.063*line + '
'0.058*Theano + 0.043*SPARQL + 0.038*Keras + 0.037*Python + 0.035*MNIST'),
(5,
'0.130*Cloud+ 0.096*Security+ 0.079*AWS + 0.079*Amazon + 0.075*article+ 0.057*IoT '
'+ 0.042*big data+ 0.031*Books+ 0.023*attack+ 0.022*IBM'),
(6,
'0.177*Google + 0.137*API + 0.100*Search+ 0.071*article+ 0.055*Facebook + '
'0.031*Watson + 0.030*IBM + 0.026*Bluemix + 0.026*Machine learning+ 0.025*Twitter'),
(7,
'0.351*Artificial intelligence+ 0.093*robot+ 0.064*Deep learning+ 0.049*article+ 0.032*University+ 0.029*Machine learning+ '
'0.020*University of Tokyo+ 0.019*Facebook + 0.019*movies+ 0.019*Google'),
(8,
'0.188*bot + 0.180*Microsoft + 0.057*Azure + 0.056*Elasticsearch + '
'0.042*word2vec + 0.038*Machine learning+ 0.033*line + 0.030*Search+ 0.027*Kibana + '
'0.022*Natural language processing'),
(9,
'0.102*article+ 0.094*Twitter + 0.079*robot+ 0.060*IoT + 0.058*Sony+ 0.041*Reinforcement learning'
'+ 0.038*TensorFlow + 0.029*Java + 0.028*Deep\u3000Q−Network + 0.027*Ranking')]
A perplexity of 68.4 does not seem that bad, but looking at these formulas, it seems quite difficult for a human to read off the meaning of each topic.
Going back to (*1), the tokens extracted from its source articles were descriptions like

Large salon with 15 seats or more / Parking lot available / Reception OK after 19:00 / Open all year round / Within 3 minutes walk from the nearest station / Hair set / Nail / Reception even before 10 am / Drink service available / Card payment OK / Many female staff / Private room available / No smoking / Semi-private room available

split on '/'. These are closer to explicit features than to tokens extracted from natural language. Even with such favorable tokens, the perplexity for two topics was 17.1, so I do not think my experiment was handled particularly clumsily. Conversely, with a dataset of the scale and content used here, it may simply be difficult to obtain a striking unsupervised classification with a topic model.
If improvements are to be made, the following points come to mind.

・Extract the true body text using Webstemmer
・Tune thesaurus.csv

However, for the latter, maintaining the synonyms by hand raises the question of what the automation is for in the first place. Moreover, every time a new player enters the machine learning field, someone has to decide whether to add it to thesaurus.csv.
As far as I have read, the recently announced JUMAN++ may be effective for this problem, but trying it is left as future work.
[^1]: Iterators etc. are implemented so that it provides the same API.
[^2]: At the time of this survey there were about 2,000 articles; it has since grown to about 5,000.