gensim is a Python library for topic modeling. Officially, it only supports Python versions from 2.5 up to (but not including) 3.0.
However, samantp has released gensimPy3, a fork of gensim for Python 3.3.
This time, I used gensimPy3 to see whether I could reproduce Shoto's article on analyzing the Shōsetsuka ni Narō ("Let's Become a Novelist") ranking with a topic model (gensim).
Clone the source code from https://github.com/samantp/gensimPy3.
git clone git@github.com:samantp/gensimPy3.git
Then, when I ran
python setup.py test
I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3973: invalid start byte
That made me uneasy, but I ignored it and installed anyway.
python setup.py install
For some reason a SyntaxError was reported, but the installation itself succeeded.
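The cause of the UnicodeDecodeError above was not tracked down, but the message itself is easy to reproduce. As a minimal sketch (the byte string is made up and is not taken from gensimPy3): byte 0xa3 is "£" in Latin-1, while in UTF-8 it can only appear as a continuation byte, so text saved as Latin-1 and read as UTF-8 fails with the same kind of message. Whether that is what happened in the gensimPy3 tests is only an assumption.

```python
# Hypothetical bytes: "price £ 5" encoded as Latin-1 (not from gensimPy3).
raw = b"price \xa3 5"

try:
    raw.decode("utf-8")
    message = None
except UnicodeDecodeError as err:
    # 0xa3 has the bit pattern of a UTF-8 continuation byte,
    # so it cannot be the first byte of a character.
    message = str(err)

print(message)

# Decoding with the encoding the bytes were actually written in works.
print(raw.decode("latin-1"))
```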
The differences from Shoto's article http://sucrose.hatenablog.com/entry/2013/04/27/225218 are described below.
- Python 3.3.1 on pyenv
- gensimPy3 instead of gensim
- pyquery instead of BeautifulSoup
- requests instead of urllib2
Since pyquery accepts the same selector notation as jQuery, it is easy to work out selectors by trial and error in the console of the Chrome developer tools. Convenient.
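The script below selects every element with class "s" and then calls findall('a') on each one. As a rough illustration that needs no third-party packages, here is the same idea with the standard library's xml.etree.ElementTree, whose findall behaves like the one on the lxml elements that pyquery yields. This is a stand-in, not pyquery itself: the HTML fragment is made up, the real page's markup may differ, and the XPath here matches the class attribute exactly rather than as a CSS class selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the ranking page.
SAMPLE = """
<div>
  <div class="s"><a>Fantasy</a><a>magic</a><span><a>nested</a></span></div>
  <div class="s"><a>VRMMO</a></div>
</div>
""".strip()


def collect_tags(root):
    """Return one list of tag names per element whose class is exactly 's'."""
    tags = []
    for novel in root.findall(".//*[@class='s']"):
        # findall('a') matches only direct <a> children, not deeper descendants.
        tags.append([a.text for a in novel.findall("a")])
    return tags


tags = collect_tags(ET.fromstring(SAMPLE))
print(tags)
```

Note that the link nested inside the span is skipped, because findall('a') does not descend below the direct children.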
topic_model_in_narou.py
# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
import gensim


def fetch_narou_ranking_html():
    """Download the overall ranking page and return it as text."""
    r = requests.get('http://yomou.syosetu.com/rank/list/type/total_total/')
    r.encoding = 'utf-8'
    return r.text


def collect_tags(d):
    """Return one list of tag names per element matched by '.s'."""
    d_novels = d('.s')
    tags = []
    for d_novel in d_novels:
        d_tag_name_list = d_novel.findall('a')
        tags_in_a_novel = [d_tag_name.text for d_tag_name in d_tag_name_list]
        tags.append(tags_in_a_novel)
    return tags


if __name__ == "__main__":
    html = fetch_narou_ranking_html()
    d = pq(html.encode('utf-8'))
    tags = collect_tags(d)
    # Treat each novel's tag list as one document.
    dictionary = gensim.corpora.Dictionary(tags)
    dictionary.filter_extremes(3)  # no_below=3; the other arguments keep their defaults
    corpus = [dictionary.doc2bow(text) for text in tags]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
    # Show the top 5 tags of every topic.
    for x in lda.show_topics(-1, 5):
        print(x)
I made a model with this code and was able to display topics. The results are as follows.
0.115*Fantasy + 0.068*magic + 0.041*cheat + 0.034*Harem + 0.028*Reincarnation
0.173*magic + 0.095*Fantasy + 0.039*dark + 0.033*Trip + 0.027*Reincarnation
0.106*Reincarnation + 0.087*Harem + 0.074*The strongest hero + 0.063*cheat + 0.052*Fantasy
0.079*Fantasy + 0.069*cheat + 0.062*love + 0.059*Reincarnation + 0.041*Another world trip
0.088*Fantasy + 0.063*Another world trip + 0.051*Harem + 0.039*adventure + 0.039*OVL Bunko Grand Prize entry
0.105*cheat + 0.103*Fantasy + 0.062*The strongest hero + 0.058*magic + 0.044*Reincarnation
0.099*Fantasy + 0.089*Reincarnation + 0.045*The strongest hero + 0.045*strongest + 0.034*magic
0.051*magic + 0.051*Upstart + 0.051*monster + 0.039*VRMMO + 0.039*serious
0.140*Fantasy + 0.077*cheat + 0.054*Reincarnation + 0.043*magic + 0.038*adventure
0.168*Fantasy + 0.073*magic + 0.052*love + 0.026*adventure + 0.026*war
Yeah, it's all fantasy... The novels are so concentrated in one genre that they do not separate into distinct topics. It might be better to get data from somewhere like pixiv novels instead.
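To put a rough number on "all fantasy", the topic lines above can be parsed and the tags counted by how many topics list them. This is only a sketch: it works on the printed strings copied from the output above (with spacing normalised), and the helper name top_tags is made up for illustration.

```python
import re
from collections import Counter

# The ten topic lines printed above, copied as strings.
TOPICS = [
    "0.115*Fantasy + 0.068*magic + 0.041*cheat + 0.034*Harem + 0.028*Reincarnation",
    "0.173*magic + 0.095*Fantasy + 0.039*dark + 0.033*Trip + 0.027*Reincarnation",
    "0.106*Reincarnation + 0.087*Harem + 0.074*The strongest hero + 0.063*cheat + 0.052*Fantasy",
    "0.079*Fantasy + 0.069*cheat + 0.062*love + 0.059*Reincarnation + 0.041*Another world trip",
    "0.088*Fantasy + 0.063*Another world trip + 0.051*Harem + 0.039*adventure + 0.039*OVL Bunko Grand Prize entry",
    "0.105*cheat + 0.103*Fantasy + 0.062*The strongest hero + 0.058*magic + 0.044*Reincarnation",
    "0.099*Fantasy + 0.089*Reincarnation + 0.045*The strongest hero + 0.045*strongest + 0.034*magic",
    "0.051*magic + 0.051*Upstart + 0.051*monster + 0.039*VRMMO + 0.039*serious",
    "0.140*Fantasy + 0.077*cheat + 0.054*Reincarnation + 0.043*magic + 0.038*adventure",
    "0.168*Fantasy + 0.073*magic + 0.052*love + 0.026*adventure + 0.026*war",
]


def top_tags(line):
    """Split '0.115*Fantasy + 0.068*magic' into ['Fantasy', 'magic']."""
    terms = re.split(r"\s*\+\s*", line)
    return [term.split("*", 1)[1].strip() for term in terms]


# For each tag, count the number of topics whose top 5 contain it.
topics_per_tag = Counter(tag for line in TOPICS for tag in set(top_tags(line)))
print(topics_per_tag.most_common(3))
```

With these ten lines, Fantasy is among the top five tags of nine topics, which matches the impression that the topics barely differ.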
I thought I might be able to escape the fantasy-only situation by changing the arguments of dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000) and filtering differently, so I modified the script.
topic_model_in_narou.py
# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
import gensim


def fetch_narou_ranking_html():
    """Download the overall ranking page and return it as text."""
    r = requests.get('http://yomou.syosetu.com/rank/list/type/total_total/')
    r.encoding = 'utf-8'
    return r.text


def collect_tags(d):
    """Return one list of tag names per element matched by '.s'."""
    d_novels = d('.s')
    tags = []
    for d_novel in d_novels:
        d_tag_name_list = d_novel.findall('a')
        tags_in_a_novel = [d_tag_name.text for d_tag_name in d_tag_name_list]
        tags.append(tags_in_a_novel)
    return tags


if __name__ == "__main__":
    html = fetch_narou_ranking_html()
    d = pq(html.encode('utf-8'))
    tags = collect_tags(d)
    # Treat each novel's tag list as one document.
    dictionary = gensim.corpora.Dictionary(tags)
    dictionary.filter_extremes(no_below=5, no_above=0.05, keep_n=10000)  # Change
    corpus = [dictionary.doc2bow(text) for text in tags]
    # num_topics is also 20 here, up from 10 in the first version.
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=20, id2word=dictionary)
    # Show the top 5 tags of every topic.
    for x in lda.show_topics(-1, 5):
        print(x)
I set no_above to 0.05, so tags that appear in more than 5% of all novels are no longer counted.
Reference (official gensim documentation): http://radimrehurek.com/gensim/corpora/dictionary.html
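To make the effect of no_above concrete, here is a plain-Python sketch of the filtering rules as the documentation describes them: keep tags found in at least no_below documents and in no more than a fraction no_above of all documents, then keep at most keep_n of the most frequent survivors. It is not gensim's implementation (rounding at the boundary may differ), the function name filter_extremes_sketch is made up, and the corpus is hypothetical.

```python
from collections import Counter


def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tags kept under the documented filtering rules."""
    # Document frequency: in how many documents does each tag occur?
    df = Counter(tag for doc in docs for tag in set(doc))
    max_docs = no_above * len(docs)
    kept = [tag for tag, n in df.items() if no_below <= n <= max_docs]
    # Of the survivors, keep only the keep_n most frequent.
    kept.sort(key=lambda tag: df[tag], reverse=True)
    return set(kept[:keep_n])


# Hypothetical corpus of 200 novels: 90 tagged "Fantasy", 40 "magic",
# 6 "spirit" and 2 "rare".
docs = [[] for _ in range(200)]
for tag, count in [("Fantasy", 90), ("magic", 40), ("spirit", 6), ("rare", 2)]:
    for doc in docs[:count]:
        doc.append(tag)

loose = filter_extremes_sketch(docs, no_below=5, no_above=0.5)
strict = filter_extremes_sketch(docs, no_below=5, no_above=0.05)
print(sorted(loose))
print(sorted(strict))
```

In this made-up corpus, no_above=0.5 keeps the dominant tags, while no_above=0.05 leaves only tags used by a handful of novels, which is the effect the change above is after.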
Here are the results.
0.166*growth + 0.133*comedy + 0.100*battle + 0.100*Upstart + 0.067*Narokon Grand Prize
0.106*SF + 0.054*Trip + 0.054*VRMMO + 0.054*Narokon Grand Prize + 0.054*Upstart
0.142*Dragon + 0.142*dark + 0.073*slave + 0.073*Aristocrat + 0.073*battle
0.120*Wizard/witch + 0.081*spirit + 0.081*Elf + 0.081*Beastman + 0.081*Upstart
0.136*Brave + 0.136*comedy + 0.092*Labyrinth + 0.092*strongest + 0.092*VRMMO
0.140*Summon another world + 0.106*Nation/People + 0.071*VRMMO + 0.053*doting + 0.036*slave
0.153*Knight + 0.078*Misunderstanding + 0.078*middle Ages + 0.078*Wizard/witch + 0.078*Summon another world
0.125*Summon + 0.125*Adventurer + 0.125*Beastman + 0.125*strongest + 0.125*Convenience
0.151*Brave + 0.091*monster + 0.091*Beautiful + 0.061*war + 0.061*doting
0.163*monster + 0.122*friendship + 0.082*Aristocrat + 0.082*Upstart + 0.082*strongest
0.260*spirit + 0.054*Aristocrat + 0.054*comedy + 0.054*serious + 0.054*Nation/People
0.143*Adventurer + 0.096*battle + 0.096*serious + 0.096*Domestic affairs + 0.049*Upstart
0.189*comedy + 0.143*Misunderstanding + 0.096*VRMMORPG + 0.096*comedy + 0.049*High school student
0.147*Dragon + 0.118*strongest + 0.060*Elf + 0.060*battle + 0.060*war
0.173*slave + 0.088*Trip + 0.088*Magic + 0.045*growth + 0.045*monster
0.173*Domestic affairs + 0.088*Brave + 0.088*Trip + 0.045*strongest + 0.045*slave
0.130*serious + 0.088*battle + 0.088*High school student + 0.088*Senki + 0.088*Transfer to another world
0.143*skill + 0.107*Template + 0.072*war + 0.072*Upstart + 0.072*Magic
0.206*war + 0.070*Domestic affairs + 0.070*middle Ages + 0.070*Summon + 0.070*Nation/People
0.130*slave + 0.130*guild + 0.088*Aristocrat + 0.088*Summon + 0.045*war
It was all fantasy after all...
But if you look closely, topics such as "serious, battle, High school student, Senki, Transfer to another world", "SF, Trip, VRMMO, Narokon Grand Prize, Upstart", and "war, Domestic affairs, middle Ages, Summon, Nation/People" do look like somewhat different genres, so this is an improvement over leaving no_above in dictionary.filter_extremes() at its default.