This article is related to part ❷ of Collection and classification of machine learning related information (concept).
It has been several months since the actual investigation, so it may be slightly different from the current situation. Also, please note in advance that the results are not satisfactory.
I'm new to Qiita and Python, so there may be many odd parts; I'd appreciate comments pointing them out.
The process explained in this article is as follows.
❶ Site crawl
Place the crawled documents under the bookmarks.crawled directory.
↓
❷ Turn each article into a Python object
Create a Python object for each article.
↓
❸ Turn the corpus into a Python object
Convert the entire set of documents into a Python object as a corpus.
↓
❹ Classification by topic model
Use this corpus to attempt classification with a topic model.
The details of the thesaurus appear somewhat out of order, but otherwise I will explain each step as much as possible.
In ❷ of Collection and classification of machine learning related information (concept), the scenario was to feed in the results collected by FESS directly. This time, however, the content downloaded by crawl.rb of [Shortcut Directory and Plain Text Conversion](http://qiita.com/suchowan/items/6556756d2e816c7255b7#5-%E3%83%97%E3%83%AC%E3%82%A4%E3%83%B3%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%82%AF%E3%83%AD%E3%83%BC%E3%83%AB%E3%83%89%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%84%E3%83%87%E3%82%A3%E3%83%AC%E3%82%AF%E3%83%88%E3%83%AA) is placed under the bookmarks.crawled directory and used as input.
This is because, if the results collected by FESS were input directly,

・old documents would expire
・duplicate articles and less important articles that had been excluded during manual classification would come back
The HTML files under the bookmarks.crawled directory are read and stored in objects of the Python Article class.
Article class

attributes
・path: HTML file path
・contents: the text of the HTML file with HTML tags removed
・tokens: list of nouns appearing in contents (list of string)
The library survey in Extract text from HTML of blogs with Python 2015 is a helpful reference.
For a full-fledged implementation, Webstemmer would be the tool to use, but it requires generating a template for each blog site in advance, which is cumbersome, so I did not use it this time.
The implemented Article class is based on the regular expressions in extractcontent.
(1) janome
I tried janome, a pure-Python Japanese morphological analysis library. Its dictionary has almost the same structure as MeCab's, and alphabetic words are defined in full-width characters, so I used the half-width/full-width conversion library mojimoji for preprocessing.
article_janome.py

```python
import codecs
import re
import mojimoji
from janome.tokenizer import Tokenizer

class Article:

    # Candidate encodings, tried in order when reading an HTML file
    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    # janome tokenizer with a user dictionary in "simpledic" format
    tokenizer = Tokenizer("user_dic.csv", udic_type="simpledic", udic_enc="utf8")

    def __init__(self, path):
        print(path)
        self.path = path
        self.contents = self.preprocess(self.get_contents(path))
        # Keep custom nouns and proper/general/verbal nouns only
        # (in the original Japanese code this regex matches janome's POS labels,
        #  i.e. カスタム名詞|名詞,(固有|一般|サ変))
        self.tokens = [token.surface for token in self.tokenizer.tokenize(self.contents)
                       if re.match("Custom noun|noun,(Unique|General|Sa strange)", token.part_of_speech)]

    def get_contents(self, path):
        exceptions = []
        for encoding in self.encodings:
            try:
                all = codecs.open(path, 'r', encoding).read()
                # Split off everything before the first <body> or <frame> tag
                parts = re.split("(?i)<(body|frame)[^>]*>", all, 1)
                if len(parts) == 3:
                    head, void, body = parts
                else:
                    print('Cannot split ' + path)
                    body = all
                # Remove script/style/select/noscript blocks, then strip the remaining tags
                return re.sub("<[^>]+?>", "",
                              re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>", "", body))
            except UnicodeDecodeError as e:
                exceptions.append(e)
                continue
        print('Cannot detect encoding of ' + path)
        print(exceptions)
        return None

    def get_title(self, path):
        return re.split('\/', path)[-1]

    def preprocess(self, text):
        text = re.sub("&[^;]+;", " ", text)            # drop HTML entities
        text = mojimoji.han_to_zen(text, digit=False)  # half-width -> full-width (digits excluded)
        text = re.sub('(\s|　|#)+', " ", text)         # collapse runs of whitespace and '#'
        return text
```
In the default IPA dictionary, a term like "人工知能" (artificial intelligence) is decomposed into two words, "artificial" and "intelligence". Therefore, I registered the terms I wanted treated as single words in user_dic.csv and had janome use it.
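For illustration, here is a minimal sketch of that registration, assuming janome's "simpledic" user-dictionary format (surface, part of speech, reading); the dictionary line and sample sentence are my own examples, not taken from the original user_dic.csv.

```python
# Hypothetical user_dic.csv line in janome's "simpledic" format:
#   人工知能,カスタム名詞,ジンコウチノウ
# With such an entry, 人工知能 should come out as a single token
# tagged with the custom part of speech instead of being split in two.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer("user_dic.csv", udic_type="simpledic", udic_enc="utf8")
for token in tokenizer.tokenize("人工知能の研究が進む"):
    print(token.surface, token.part_of_speech)
```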
Later I also came across mecab-ipadic-NEologd: Neologism dictionary for MeCab, "Added Wikipedia and Hatena words to MeCab's dictionary on Ubuntu 14.04", and "Generate and use a user dictionary from Wikipedia and Hatena keywords for morphological analysis", but I have not tried them yet, since by then I had already switched to the thesaurus.csv approach described below.
(3) thesaurus
As described later, when tokens were extracted with a Japanese morphological analysis library, perplexity did not stay within an acceptable range and topic extraction did not work. Therefore, I registered about 350 terms that frequently appear in artificial intelligence articles by hand in thesaurus.csv in advance.
thesaurus.csv (example)

```
Natural language processing,NLP,Natural Language Processing,natural language processing
Question answering
voice recognition
AlphaGo,Alphago
…
```
The process of looking up this file and cutting out only the words that hit it as tokens is implemented in thesaurus.py below,
thesaurus.py

```python
import re
import mojimoji

class Thesaurus:

    def __init__(self, path):
        map = dict()
        with open(path, 'r') as thesaurus:
            for line in thesaurus.readlines():
                # Each line lists synonyms; normalize them to full-width like the article text
                words = [mojimoji.han_to_zen(word, digit=False)
                         for word in re.split(',', line.strip())]
                for word in words:
                    if word in map:
                        print('Word duplicated: ' + word)
                        raise Exception('Word duplicated: ' + word)
                    # Every synonym maps to the first (canonical) word on its line
                    map[word] = words[0]
        self.words = map
        # Longer entries first, so that the longest match wins
        self.re = re.compile("|".join(sorted(map.keys(), key=lambda x: -len(x))))

    def tokenize(self, sentence):
        for token in re.finditer(self.re, sentence):
            yield(Token(self.words[token.group()]))

class Token:

    def __init__(self, surface):
        self.surface = surface
        # Matches the part-of-speech filter used in article.py below
        self.part_of_speech = "Custom noun"
```
and it replaced the Japanese morphological analysis library [^1].
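For reference, a minimal usage sketch of the class above; the sample sentence is an illustrative assumption, and the input is full-width converted the same way Article.preprocess does.

```python
import mojimoji
from thesaurus import Thesaurus

tokenizer = Thesaurus('thesaurus.csv')
# Article.preprocess converts the article text to full-width, so do the same here
text = mojimoji.han_to_zen('An article about AlphaGo and natural language processing', digit=False)
print([token.surface for token in tokenizer.tokenize(text)])
# Each hit is reported as the first (canonical) word on its thesaurus.csv line
```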
article.py

```python
import codecs
import re
import mojimoji
from thesaurus import Thesaurus

class Article:

    # Candidate encodings, tried in order when reading an HTML file
    encodings = [
        "utf-8",
        "cp932",
        "euc-jp",
        "iso-2022-jp",
        "latin_1"
    ]

    # Thesaurus-based tokenizer replacing janome (same tokenize() API)
    tokenizer = Thesaurus('thesaurus.csv')

    def __init__(self, path):
        print(path)
        self.path = path
        self.contents = self.preprocess(self.get_contents(path))
        # The Thesaurus tokenizer only yields "Custom noun" tokens,
        # so this filter passes everything it produces
        self.tokens = [token.surface for token in self.tokenizer.tokenize(self.contents)
                       if re.match("Custom noun|noun,(Unique|General|Sa strange)", token.part_of_speech)]

    def get_contents(self, path):
        exceptions = []
        for encoding in self.encodings:
            try:
                all = codecs.open(path, 'r', encoding).read()
                parts = re.split("(?i)<(body|frame)[^>]*>", all, 1)
                if len(parts) == 3:
                    head, void, body = parts
                else:
                    print('Cannot split ' + path)
                    body = all
                return re.sub("<[^>]+?>", "",
                              re.sub(r"(?is)<(script|style|select|noscript)[^>]*>.*?</\1\s*>", "", body))
            except UnicodeDecodeError as e:
                exceptions.append(e)
                continue
        print('Cannot detect encoding of ' + path)
        print(exceptions)
        return None

    def get_title(self, path):
        return re.split('\/', path)[-1]

    def preprocess(self, text):
        text = re.sub("&[^;]+;", " ", text)            # drop HTML entities
        text = mojimoji.han_to_zen(text, digit=False)  # half-width -> full-width (digits excluded)
        return text
```
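A minimal usage sketch (the path is hypothetical):

```python
from article import Article

a = Article('bookmarks.crawled/example/article.html')  # hypothetical crawled file
print(a.contents[:100])  # tag-stripped, full-width normalized text
print(a.tokens[:10])     # canonical thesaurus terms found in the article
```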
In a topic model, documents are handled as BOW (Bag of Words: a list of (word ID, number of occurrences); a short illustration follows the attribute list below). Therefore, the following classes are defined.
Corpus class

attributes
・articles: OrderedDict of (HTML file path: Article object)
・keys: list of HTML file paths (list of string)
・size: number of Article objects
・texts: the tokens that make up the corpus (list of (list of string))
・corpus: texts converted to a list of BOW

It has class methods save/load, so objects can be saved to and restored from files.

Corpora class

attributes
・training: Corpus object for training
・test: Corpus object for test
・dictionary: gensim.corpora.Dictionary object shared by training and test
(keeps the correspondence between word IDs (integers) and their surface forms (strings))
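As a quick illustration of this BOW representation, here is a standalone sketch with made-up token lists, independent of the classes below:

```python
from gensim import corpora

texts = [['machine learning', 'python', 'machine learning'],
         ['python', 'deep learning']]          # made-up token lists
dictionary = corpora.Dictionary(texts)         # assigns an integer ID to each word
print([dictionary.doc2bow(text) for text in texts])
# e.g. [[(0, 2), (1, 1)], [(1, 1), (2, 1)]] -- each document as (word ID, number of occurrences)
```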
corpus.py

```python
import pickle
from collections import defaultdict
from gensim import corpora

class Corpora:

    def __init__(self, training, test, dictionary):
        self.training = training
        self.test = test
        self.dictionary = dictionary

    def save(self, title):
        self.training.save(title+'_training')
        self.test.save(title+'_test')
        self.dictionary.save(title+".dict")

    @classmethod
    def load(cls, title):
        training = Corpus.load(title+'_training')
        test = Corpus.load(title+'_test')
        dictionary = corpora.Dictionary.load(title+".dict")
        return cls(training, test, dictionary)

    @classmethod
    def generate(cls, training, test):
        training_corpus = Corpus.generate(training)
        test_corpus = Corpus.generate(test)
        all_texts = training_corpus.texts + test_corpus.texts
        frequency = defaultdict(int)
        for text in all_texts:
            for token in text:
                frequency[token] += 1
        all_texts = [[token for token in text if frequency[token] > 1] for text in all_texts]
        dictionary = corpora.Dictionary(all_texts)
        training_corpus.mm(dictionary)
        test_corpus.mm(dictionary)
        return cls(training_corpus, test_corpus, dictionary)

class Corpus:

    def __init__(self, articles):
        self.articles = articles
        self.keys = list(articles.keys())
        self.size = len(articles.keys())

    def article(self, index):
        return self.articles[self.keys[index]]

    def mm(self, dictionary):
        values_set = set(dictionary.values())
        self.texts = [[token for token in text if token in values_set] for text in self.texts]
        # print(self.texts[0])
        self.corpus = [dictionary.doc2bow(text) for text in self.texts]

    def save(self, title):
        with open(title+".pickle", 'wb') as f:
            pickle.dump(self.articles, f)
        corpora.MmCorpus.serialize(title+".mm", self.corpus)

    @classmethod
    def load(cls, title):
        with open(title+".pickle", 'rb') as f:
            articles = pickle.load(f)
        corpus = cls(articles)
        corpus.corpus = corpora.MmCorpus(title+".mm")
        return corpus

    @classmethod
    def generate(cls, articles):
        corpus = cls(articles)
        corpus.texts = [articles[key].tokens for key in articles.keys()]
        return corpus
```
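A minimal sketch of how these classes fit together; training_articles and test_articles are assumed to be OrderedDicts of Article objects, built as in the script further below:

```python
from corpus import Corpora

# Build both corpora plus the shared dictionary, save them, and load them back
pair = Corpora.generate(training_articles, test_articles)
pair.save('article_contents')     # writes article_contents_training.pickle/.mm,
                                  # article_contents_test.pickle/.mm and article_contents.dict
pair = Corpora.load('article_contents')
print(pair.training.size, pair.test.size, len(pair.dictionary))
```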
Everything up to this point is groundwork that is needed regardless of which on-premises tool is used afterwards.
With these tools in place, I performed the classification with a topic model, referring to Creating an application using the topic model (*1).
test_view_LDA.py

```python
import logging
import glob
import numpy as np
import matplotlib.pylab as plt
from collections import OrderedDict
from gensim import corpora, models, similarities
from pprint import pprint # pretty-printer
from corpus import Corpus, Corpora
from article import Article

#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

topic_range = range(10, 11)
training_percent = 90
test_percent = 10
path_pattern = '/home/samba/suchowan/links/bookmarks.crawled/**/*.html'

def corpus_pair(path, training_range, test_range):
    # Deterministically split the crawled files into training and test sets
    all_paths = glob.glob(path, recursive=True)
    training_paths = [v for i, v in enumerate(all_paths) if ((i * 2017) % 100) in training_range]
    test_paths     = [v for i, v in enumerate(all_paths) if ((i * 2017) % 100) in test_range]
    training_articles = OrderedDict([(path, Article(path)) for path in training_paths])
    test_articles = OrderedDict([(path, Article(path)) for path in test_paths])
    return Corpora.generate(training_articles, test_articles)

def calc_perplexity(m, c):
    return np.exp(-m.log_perplexity(c))

def search_model(pair):
    # Keep the model with the lowest perplexity on the test corpus
    most = [1.0e15, None]
    print("dataset: training/test = {0}/{1}".format(pair.training.size, pair.test.size))
    for t in topic_range:
        m = models.LdaModel(corpus=pair.training.corpus, id2word=pair.dictionary,
                            num_topics=t, iterations=500, passes=10)
        p1 = calc_perplexity(m, pair.training.corpus)
        p2 = calc_perplexity(m, pair.test.corpus)
        print("{0}: perplexity is {1}/{2}".format(t, p1, p2))
        if p2 < most[0]:
            most[0] = p2
            most[1] = m
    return most[0], most[1]

pair = corpus_pair(path_pattern, range(0, training_percent+1),
                   range(training_percent, training_percent+test_percent+1))
pair.save('article_contents')
perplexity, model = search_model(pair)
print("Best model: topics={0}, perplexity={1}".format(model.num_topics, perplexity))

def show_document_topics(c, m, r):
    # make document/topics matrix
    t_documents = OrderedDict()
    for s in r:
        # ts = m.__getitem__(c[s], -1)
        ts = m[c[s]]
        max_topic = max(ts, key=lambda x: x[1])
        if max_topic[0] not in t_documents:
            t_documents[max_topic[0]] = []
        t_documents[max_topic[0]] += [(s, max_topic[1])]
    return t_documents

topic_documents = show_document_topics(pair.test.corpus, model, range(0, pair.test.size))

for topic in topic_documents.keys():
    print("Topic #{0}".format(topic))
    for article in topic_documents[topic]:
        print(article[0], pair.test.article(article[0]).path)

pprint(model.show_topics())
```
The library used was gensim. I also referred to Similarity calculation for Twitter users using tfidf, lsi, lda and [Try natural language processing with Python_topic model](http://esu-ko.hatenablog.com/entry/2016/03/24/Python%E3%81%A7%E8%87%AA%E7%84%B6%E8%A8%80%E8%AA%9E%E5%87%A6%E7%90%86%E3%82%92%E3%81%97%E3%81%A6%E3%81%BF%E3%82%8B_%E3%83%88%E3%83%94%E3%83%83%E3%82%AF%E3%83%A2%E3%83%87%E3%83%AB).
★ training

・input: training corpus, a list of (list of (word ID, number of occurrences)), and the number of topics
　each list of (word ID, number of occurrences) is the word counts of an individual article (order of appearance is not considered)
・output: LDA model (gensim.models.ldamodel)
　in effect, a formula that computes topic fit probabilities from a list of (word ID, number of occurrences)

★ test

・input: test corpus, a list of (list of (word ID, number of occurrences))
　each list of (word ID, number of occurrences) is the word counts of an individual article (order of appearance is not considered)
・output: list of (list of fit probabilities)
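For a single test document, the fit probabilities can be read off like this (a sketch assuming the model and pair objects from test_view_LDA.py above):

```python
# Topic distribution of the first test document: list of (topic ID, fit probability);
# gensim only lists topics above a small probability threshold
doc_bow = pair.test.corpus[0]
print(model[doc_bow])
```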
I first tried building the corpus and classifying with the topic model simply by extracting words with janome and filtering by part of speech alone, but perplexity became an astronomical value and the result was meaningless.
Some required preprocessing may have been omitted, but the underlying reason is clear:

number of word types >> number of documents

A topic model has roughly (number of word types + α) adjustable variables, so forcing it to converge under the condition "number of word types >> number of documents" inevitably leads to overfitting. The words have to be narrowed down so that

number of word types << number of documents
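A quick way to check this condition is to compare the dictionary size with the document count, as in the sketch below (assuming the corpora saved as 'article_contents' by the script above):

```python
from corpus import Corpora

pair = Corpora.load('article_contents')
num_word_types = len(pair.dictionary)                  # vocabulary size
num_documents = pair.training.size + pair.test.size
print(num_word_types, num_documents)                   # the former should be well below the latter
```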
The following is the result of manually registering about 350 words that frequently appear in artificial intelligence articles in thesaurus.csv and building the corpus from those words only.
The number of topics is an input to training, but the decision can be automated by searching for the number of topics that minimizes perplexity. For this run it had been confirmed in advance that perplexity is minimized at 10 topics.
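In test_view_LDA.py this only requires widening the search range; a sketch, with arbitrary bounds:

```python
# search_model() then keeps the model with the lowest test-corpus perplexity
topic_range = range(2, 31)
```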
According to (*1):

"The reciprocal of perplexity indicates how well a word in a document can be predicted, so the best value is 1, and the worse the model, the larger the value. Two digits is fine, the low three-digit range is acceptable, anything beyond that is poor, and if it comes out as a single digit you should check the model and the perplexity calculation for mistakes."
In this run, 1,920 articles (90%) were used for training and 210 articles (10%) for test [^2], and the perplexity of the test corpus was 68.4.
The resulting formulas for computing the topic fit probabilities were as follows:
[(0,
'0.268*image+ 0.124*Dell + 0.049*CNN + 0.043*Deep learning+ 0.038*neural network+ '
'0.026*Machine learning+ 0.025*Chainer + 0.024*GPU + 0.023*article+ 0.022*Image recognition'),
(1,
'0.135*Machine learning+ 0.121*Python + 0.102*article+ 0.055*Chainer + 0.052*Dell + '
'0.037*Deep learning+ 0.033*numpy + 0.023*Framework+ 0.019*neural network+ 0.019*Spark'),
(2,
'0.111*article+ 0.097*Forecast+ 0.090*Ranking+ 0.071*University+ 0.055*Search+ 0.033*Artificial intelligence+ '
'0.032*Yahoo + 0.032*Dell + 0.029*Database+ 0.026*Patent'),
(3,
'0.121*Ruby + 0.100*game+ 0.090*AlphaGo + 0.085*Go+ 0.077*article+ 0.076*Artificial intelligence+ '
'0.053*Google + 0.052*Microsoft + 0.047*Tay + 0.034*Twitter'),
(4,
'0.113*TensorFlow + 0.103*LSTM + 0.070*Dell + 0.068*CNN + 0.063*line + '
'0.058*Theano + 0.043*SPARQL + 0.038*Keras + 0.037*Python + 0.035*MNIST'),
(5,
'0.130*Cloud+ 0.096*Security+ 0.079*AWS + 0.079*Amazon + 0.075*article+ 0.057*IoT '
'+ 0.042*big data+ 0.031*Books+ 0.023*attack+ 0.022*IBM'),
(6,
'0.177*Google + 0.137*API + 0.100*Search+ 0.071*article+ 0.055*Facebook + '
'0.031*Watson + 0.030*IBM + 0.026*Bluemix + 0.026*Machine learning+ 0.025*Twitter'),
(7,
'0.351*Artificial intelligence+ 0.093*robot+ 0.064*Deep learning+ 0.049*article+ 0.032*University+ 0.029*Machine learning+ '
'0.020*University of Tokyo+ 0.019*Facebook + 0.019*movies+ 0.019*Google'),
(8,
'0.188*bot + 0.180*Microsoft + 0.057*Azure + 0.056*Elasticsearch + '
'0.042*word2vec + 0.038*Machine learning+ 0.033*line + 0.030*Search+ 0.027*Kibana + '
'0.022*Natural language processing'),
(9,
'0.102*article+ 0.094*Twitter + 0.079*robot+ 0.060*IoT + 0.058*Sony+ 0.041*Reinforcement learning'
'+ 0.038*TensorFlow + 0.029*Java + 0.028*Deep\u3000Q−Network + 0.027*Ranking')]
A perplexity of 68.4 does not seem that bad, but looking at these formulas, it seems quite difficult for a human to read off the meaning of each topic.
Going back to (*1), the tokens extracted from its source articles were descriptions like

Large salon with 15 seats or more / Parking lot available / Reception OK after 19:00 / Open all year round / Within 3 minutes walk from the nearest station / Hair set / Nail / Reception even before 10 am / Drink service available / Card payment OK / Many female staff / Private room available / No smoking / Semi-private room available

split on '/'. These are closer to explicit features than to tokens extracted from natural language. Even with such favorable tokens, the perplexity for two topics was 17.1, so I do not think my experiment was handled particularly clumsily. Conversely, with a dataset of the scale and content used here, it may simply be difficult to obtain a striking unsupervised classification with a topic model.
If improvements are to be made, the following points come to mind.

・Extract the true body text using Webstemmer
・Tune thesaurus.csv

However, for the latter, maintaining the synonyms by hand raises the question of what the automation is for in the first place. Moreover, every time a new player enters the machine learning field, someone has to decide whether to add it to thesaurus.csv.
As far as I have read, the recently announced JUMAN++ may be effective for this problem, but trying it is left as future work.
[^1]: Iterators etc. are implemented so that it provides the same API.
[^2]: At the time of this survey there were about 2,000 articles; it has since grown to about 5,000.