Topic-model analysis of Shōsetsuka ni Narō ("Let's Become a Novelist") with gensimPy3

gensim is a Python library for topic modeling. Officially, it supports only Python 2.5 or later, up to but not including 3.0.

However, Samantp has released a library called gensimPy3. It's a fork of gensim for Python 3.3.

This time I used gensimPy3 to see whether I could reproduce Shoto's article "Analyzing the Shōsetsuka ni Narō ranking with a topic model (gensim)".

Reference for how to use gensim http://yuku-tech.hatenablog.com/entry/20110623/1308810518

Installation of GensimPy3

Clone the source code from https://github.com/samantp/gensimPy3:

git clone git@github.com:samantp/gensimPy3.git

Next, I ran the tests:

python setup.py test

which failed with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 3973: invalid start byte

The error made me uneasy, but I ignored it and went ahead with the installation.

python setup.py install

A SyntaxError was reported for some reason, but the installation itself succeeded.

Differences in environment

The differences from Shoto's article http://sucrose.hatenablog.com/entry/2013/04/27/225218 are described below.

- Python 3.3.1 on pyenv
- gensimPy3 in place of gensim
- pyquery instead of BeautifulSoup
- requests instead of urllib2

pyquery accepts the same selector notation as jQuery, so a selector can be worked out by trial and error in the Chrome developer tools console and then reused as-is. Convenient.

Source code

`topic_model_in_narou.py`


# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
import gensim


def fetch_narou_ranking_html():
    r = requests.get('http://yomou.syosetu.com/rank/list/type/total_total/')
    r.encoding = 'utf-8'
    return r.text


def collect_tags(d):
    d_novels = d('.s')
    tags = []
    for d_novel in d_novels:
        d_tag_name_list = d_novel.findall('a')
        tags_in_a_novel = [d_tag_name.text for d_tag_name in d_tag_name_list]
        tags.append(tags_in_a_novel)
    return tags


if __name__ == "__main__":
    html = fetch_narou_ranking_html()
    d = pq(html.encode('utf-8'))
    tags = collect_tags(d)

    dictionary = gensim.corpora.Dictionary(tags)
    dictionary.filter_extremes(no_below=3)  # drop tags that appear in fewer than 3 novels
    corpus = [dictionary.doc2bow(text) for text in tags]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
    for x in lda.show_topics(-1, 5):
        print(x)

I made a model with this code and was able to display topics. The results are as follows.

0.115*Fantasy+ 0.068*magic+ 0.041*cheat+ 0.034*Harem+ 0.028*Reincarnation
0.173*magic+ 0.095*Fantasy+ 0.039*dark+ 0.033*Trip+ 0.027*Reincarnation
0.106*Reincarnation+ 0.087*Harem+ 0.074*The strongest hero+ 0.063*cheat+ 0.052*Fantasy
0.079*Fantasy+ 0.069*cheat+ 0.062*love+ 0.059*Reincarnation+ 0.041*Another world trip
0.088*Fantasy+ 0.063*Another world trip+ 0.051*Harem+ 0.039*adventure+ 0.039*OVL Bunko Grand Prize entry
0.105*cheat+ 0.103*Fantasy+ 0.062*The strongest hero+ 0.058*magic+ 0.044*Reincarnation
0.099*Fantasy+ 0.089*Reincarnation+ 0.045*The strongest hero+ 0.045*strongest+ 0.034*magic
0.051*magic+ 0.051*Upstart+ 0.051*monster+ 0.039*VRMMO + 0.039*serious
0.140*Fantasy+ 0.077*cheat+ 0.054*Reincarnation+ 0.043*magic+ 0.038*adventure
0.168*Fantasy+ 0.073*magic+ 0.052*love+ 0.026*adventure+ 0.026*war

Yes, it's all fantasy... The ranking is dominated by a single genre, so the tags barely separate into distinct topics. It would probably be better to take the data from somewhere like pixiv novels.
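To put a number on the "all fantasy" impression, the topic lines printed above can be parsed back into (weight, tag) pairs. This is only a rough sketch: `parse_topic` is a helper name I made up for illustration, and it assumes that no tag name contains `+` or `*`.

```python
def parse_topic(line):
    """Split one printed topic line into (weight, tag) pairs."""
    pairs = []
    for term in line.split("+"):
        # Each term looks like "0.115*Fantasy"; split on the first "*" only.
        weight, tag = term.strip().split("*", 1)
        pairs.append((float(weight), tag.strip()))
    return pairs


# The ten topic lines printed above, copied verbatim.
topic_lines = [
    "0.115*Fantasy+ 0.068*magic+ 0.041*cheat+ 0.034*Harem+ 0.028*Reincarnation",
    "0.173*magic+ 0.095*Fantasy+ 0.039*dark+ 0.033*Trip+ 0.027*Reincarnation",
    "0.106*Reincarnation+ 0.087*Harem+ 0.074*The strongest hero+ 0.063*cheat+ 0.052*Fantasy",
    "0.079*Fantasy+ 0.069*cheat+ 0.062*love+ 0.059*Reincarnation+ 0.041*Another world trip",
    "0.088*Fantasy+ 0.063*Another world trip+ 0.051*Harem+ 0.039*adventure+ 0.039*OVL Bunko Grand Prize entry",
    "0.105*cheat+ 0.103*Fantasy+ 0.062*The strongest hero+ 0.058*magic+ 0.044*Reincarnation",
    "0.099*Fantasy+ 0.089*Reincarnation+ 0.045*The strongest hero+ 0.045*strongest+ 0.034*magic",
    "0.051*magic+ 0.051*Upstart+ 0.051*monster+ 0.039*VRMMO + 0.039*serious",
    "0.140*Fantasy+ 0.077*cheat+ 0.054*Reincarnation+ 0.043*magic+ 0.038*adventure",
    "0.168*Fantasy+ 0.073*magic+ 0.052*love+ 0.026*adventure+ 0.026*war",
]

topics = [parse_topic(line) for line in topic_lines]

# Count the topics whose top-5 tags include "Fantasy".
with_fantasy = sum(any(tag == "Fantasy" for _, tag in topic) for topic in topics)
print(f"{with_fantasy} of {len(topics)} topics list Fantasy among their top tags")
```

For the output above this reports 9 of 10, which matches the feeling that the topics are nearly indistinguishable.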

Postscript

I thought the fantasy-only situation might change if I adjusted the arguments of dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000) (the default values are shown) to filter more aggressively, so I modified the code.

`topic_model_in_narou.py`


# -*- coding: utf-8 -*-
import requests
from pyquery import PyQuery as pq
import gensim


def fetch_narou_ranking_html():
    r = requests.get('http://yomou.syosetu.com/rank/list/type/total_total/')
    r.encoding = 'utf-8'
    return r.text


def collect_tags(d):
    d_novels = d('.s')
    tags = []
    for d_novel in d_novels:
        d_tag_name_list = d_novel.findall('a')
        tags_in_a_novel = [d_tag_name.text for d_tag_name in d_tag_name_list]
        tags.append(tags_in_a_novel)
    return tags


if __name__ == "__main__":
    html = fetch_narou_ranking_html()
    d = pq(html.encode('utf-8'))
    tags = collect_tags(d)

    dictionary = gensim.corpora.Dictionary(tags)
    dictionary.filter_extremes(no_below=5, no_above=0.05, keep_n=10000)  # changed
    corpus = [dictionary.doc2bow(text) for text in tags]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=20, id2word=dictionary)
    for x in lda.show_topics(-1, 5):
        print(x)

I set no_above to 0.05, so tags that appear in more than 5% of all the novels are dropped from the dictionary.
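The effect of no_below and no_above can be illustrated in plain Python. The sketch below is my own rough re-implementation of the document-frequency rule, not gensim's actual code: `filter_tags` is a hypothetical helper, the toy data is made up, and keep_n is ignored.

```python
from collections import Counter


def filter_tags(docs, no_below=5, no_above=0.5):
    """Return the tags whose document frequency lies within the bounds.

    Illustration of the rule only: a tag is kept when it occurs in at least
    `no_below` documents and in at most `no_above` (a fraction) of them.
    """
    # Count in how many documents (novels) each tag occurs at least once.
    doc_freq = Counter(tag for doc in docs for tag in set(doc))
    max_docs = no_above * len(docs)
    return {tag for tag, df in doc_freq.items() if no_below <= df <= max_docs}


# Hypothetical toy data: 40 novels, all tagged "Fantasy",
# 12 of them also tagged "magic", and one also tagged "Dragon".
novels = (
    [["Fantasy", "magic", "Dragon"]]
    + [["Fantasy", "magic"]] * 11
    + [["Fantasy"]] * 28
)

# no_above=0.5 removes only "Fantasy" (present in 100% of the novels).
print(sorted(filter_tags(novels, no_below=1, no_above=0.5)))   # ['Dragon', 'magic']

# no_above=0.05 also removes "magic" (12 of 40 novels = 30%).
print(sorted(filter_tags(novels, no_below=1, no_above=0.05)))  # ['Dragon']
```

In this toy data, lowering no_above pushes out the dominant tags and leaves only the rare ones, which is the effect I was hoping for on the real ranking data.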

gensim official website http://radimrehurek.com/gensim/corpora/dictionary.html

Here are the results.

0.166*growth+ 0.133*comedy+ 0.100*battle+ 0.100*Upstart+ 0.067*Narokon Grand Prize
0.106*SF + 0.054*Trip+ 0.054*ＶＲＭＭＯ + 0.054*Narokon Grand Prize+ 0.054*Upstart
0.142*Dragon+ 0.142*dark+ 0.073*slave+ 0.073*Aristocrat+ 0.073*battle
0.120*Wizard/witch+ 0.081*spirit+ 0.081*Elf+ 0.081*Beastman+ 0.081*Upstart
0.136*Brave+ 0.136*comedy+ 0.092*Labyrinth+ 0.092*strongest+ 0.092*ＶＲＭＭＯ
0.140*Summon another world+ 0.106*Nation/People+ 0.071*ＶＲＭＭＯ + 0.053*doting+ 0.036*slave
0.153*Knight+ 0.078*Misunderstanding+ 0.078*middle Ages+ 0.078*Wizard/witch+ 0.078*Summon another world
0.125*Summon+ 0.125*Adventurer+ 0.125*Beastman+ 0.125*strongest+ 0.125*Convenience
0.151*Brave+ 0.091*monster+ 0.091*Beautiful+ 0.061*war+ 0.061*doting
0.163*monster+ 0.122*friendship+ 0.082*Aristocrat+ 0.082*Upstart+ 0.082*strongest
0.260*spirit+ 0.054*Aristocrat+ 0.054*comedy+ 0.054*serious+ 0.054*Nation/People
0.143*Adventurer+ 0.096*battle+ 0.096*serious+ 0.096*Domestic affairs+ 0.049*Upstart
0.189*comedy+ 0.143*Misunderstanding+ 0.096*VRMMORPG + 0.096*comedy+ 0.049*High school student
0.147*Dragon+ 0.118*strongest+ 0.060*Elf+ 0.060*battle+ 0.060*war
0.173*slave+ 0.088*Trip+ 0.088*Magic+ 0.045*growth+ 0.045*monster
0.173*Domestic affairs+ 0.088*Brave+ 0.088*Trip+ 0.045*strongest+ 0.045*slave
0.130*serious+ 0.088*battle+ 0.088*High school student+ 0.088*Senki+ 0.088*Transfer to another world
0.143*skill+ 0.107*Template+ 0.072*war+ 0.072*Upstart+ 0.072*Magic
0.206*war+ 0.070*Domestic affairs+ 0.070*middle Ages+ 0.070*Summon+ 0.070*Nation/People
0.130*slave+ 0.130*guild+ 0.088*Aristocrat+ 0.088*Summon+ 0.045*war

In the end it is still all fantasy...

But on a closer look, topics such as "serious, battle, High school student, Senki, Transfer to another world", "SF, Trip, VRMMO, Narokon Grand Prize, Upstart", and "war, Domestic affairs, middle Ages, Summon, Nation/People" do seem to form slightly different genres, so the result is better than with the default no_above in dictionary.filter_extremes().