I tried to visualize the text of the novel "Weathering with You" with WordCloud

Introduction

This article is for day 26 of the Tokyo City University Advent Calendar 2019!

Yesterday's article was Lapee's story of running Windows 98 on the holy day.

Overview

This time I visualized the text of the novel version of "Weathering with You" with the Python library wordcloud, so I will summarize the simple procedure. wordcloud is a tool that picks out the words that appear most frequently in a text and draws (visualizes) them at a size proportional to how often they appear. It's faster to show an image than to explain it in words, so here is an example. ↓ (Incidentally, this one seems to be generated from the wiki page for "Gekiko Pun Pun Maru".) You may have seen people doing something similar with their tweets. This time I would like to do it with Weathering with You.

(Example word cloud. Quoted from: https://www.pc-koubou.jp/magazine/2646)
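Before getting into the novel itself, here is a minimal sketch of the basic wordcloud API, just to make the idea concrete (the sample text and output filename are placeholders for illustration):

```python
from wordcloud import WordCloud

# Words that repeat more often are drawn larger
sample_text = "rain rain rain sky sky ferry weather girl girl"
wc = WordCloud(background_color='white', width=800, height=600).generate(sample_text)
wc.to_file('sample_wordcloud.png')
```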

(Novel cover. Quoted from: [Weathering with You (Kadokawa Bunko) - Amazon](https://www.amazon.co.jp/s?k=%E5%A4%A9%E6%B0%97%E3%81%AE%E5%AD%90+%E5%B0%8F%E8%AA%AC&__mk_ja_JP=%E3%82%AB%E3%82%BF%E3%82%AB%E3%83%8A&ref=nb_sb_noss))

Rough flow

1. Create a corpus (text data)
2. Install the required libraries (MeCab, NEologd)
3. Run morphological analysis and extract nouns
4. Visualize with wordcloud

Create a corpus (text data)

First, create a corpus of the text of the original novel. I had an acquaintance help me: we copied one sentence at a time from the Kindle edition and pasted it into Excel. (Screenshot: the sentences pasted into Excel.) Also, on this site you can download free text data for more than 13,000 modern literary works, such as "Kokoro" and "No Longer Human", so I recommend trying one of those instead. If you try it on a modern classic, you will find some delicious words, which is also interesting. Recent novels such as "Weathering with You" cannot be downloaded there, so please bear with me. ~~(If you want to do it, please build your own corpus.)~~
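Once the sentences are in Excel, export them as CSV and check that they load cleanly with pandas; a minimal sketch, assuming the same path and 'text' column used by the code later in this article:

```python
import pandas as pd

# One sentence of the novel per row, in a 'text' column
df = pd.read_csv('../data/tenkinoko.csv', encoding='SHIFT-JIS')
print(df.shape)
print(df['text'].head())
```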

Install the required libraries (MeCab, NEologd)

Install the libraries required for morphological analysis. This time I will use MeCab. Reference: Morphological analysis with Python and MeCab

Also, in order to handle new words from the Web, we install NEologd, a system dictionary for MeCab. Reference: https://qiita.com/spiderx_jp/items/7f8cbfd762c9abab660b

I will omit the installation method.
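That said, once everything is installed, a quick way to confirm that MeCab can see the NEologd dictionary is to parse something; a minimal sketch (the dictionary path below is the common default and may differ on your machine):

```python
import MeCab

# Point MeCab at the NEologd system dictionary and parse a test string
tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
print(tagger.parse('天気の子'))
```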

Try morphological analysis

First, since I suspect many readers are wondering

What is morphological analysis?

let me explain: it is the most basic task in natural language processing, **the process of splitting a sentence into words and determining the part of speech of each word**.

For example

"The whistle that announces the departure of the ferry echoes for a long time in the rainy sky in March" </ font>

Suppose we have this sentence. If we split it into words, we get:

'March', 'of', 'rainy sky', 'to', ',', 'ferry', 'of', 'departure', 'o', 'announce', 'whistle', 'ga', 'long', 'echo'

It can be divided up like this. This is called **"wakachi-gaki" (word segmentation)**. And if we then identify the part of speech of each of these words:

March: noun
of: particle
rainy sky: noun
to: particle
,: symbol
ferry: noun
of: particle
departure: noun
o: particle
announce: verb
whistle: noun
ga: particle
long: adjective
echo: verb
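MeCab automates exactly this kind of analysis. Here is a minimal sketch (the Japanese sentence is my rough back-translation of the example above, so the exact tokens may differ):

```python
import MeCab

# Each output line is: surface form <TAB> part of speech,subtype,... details
tagger = MeCab.Tagger()  # default dictionary
print(tagger.parse('三月の雨空に、フェリーの出発を知らせる汽笛が長く鳴り響く。'))
```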

The process up to this point is morphological analysis. For the visualization, I expect the character of Weathering with You to show up mainly in the **nouns** of the text, so I will extract only **nouns** from the morphologically analyzed words. The code looks like this:

import numpy as np
import pandas as pd
import MeCab

# Apply the NEologd dictionary to MeCab
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
tagger.parse('')  # workaround for a known bug in the Python bindings

def tokenize_ja(text, lower):
    # Walk the parsed nodes and yield only nouns (名詞),
    # the part of speech we want to keep
    node = tagger.parseToNode(str(text))
    while node:
        if node.feature.split(',')[0] == '名詞':
            yield node.surface.lower() if lower else node.surface
        node = node.next

def tokenize(content, token_min_len, token_max_len, lower):
    return [
        str(token) for token in tokenize_ja(content, lower)
        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
    ]

# Read the corpus (one sentence of the novel per row)
path = '../data/tenkinoko.csv'
df_tenki = pd.read_csv(path, encoding='SHIFT-JIS')

# Tokenize every sentence into a list of nouns
wakati_tenkinoko_text = []
for i in df_tenki['text']:
    txt = tokenize(i, 1, 10000, True)
    wakati_tenkinoko_text.append(txt)
np.savetxt('../work/tenki_corpus.txt', wakati_tenkinoko_text, fmt='%s', delimiter=',')

df_tenki['wakati_tenkinoko'] = wakati_tenkinoko_text

The result looks like this ↓ (Screenshot: the dataframe now has a wakati_tenkinoko column holding each sentence's nouns.) Now we can extract only the nouns from the text!
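Before moving on to the visualization, it can help to eyeball the raw counts; a minimal sketch using collections.Counter on the column built above:

```python
from collections import Counter

# Flatten the per-sentence noun lists and show the 20 most common nouns
all_nouns = [t for tokens in df_tenki['wakati_tenkinoko'] for t in tokens]
print(Counter(all_nouns).most_common(20))
```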

Visualize with wordcloud

Once the morphological analysis is done, it's finally time for wordcloud! That said, words like "this" and "over there" don't make for much of a word cloud ~~(I just wanted to say that)~~ and some other meaningless words come out too, so I removed them as stop_words. The code looks like this:

from wordcloud import WordCloud

# Flatten the per-sentence noun lists into one flat list
tenki_wordlist = df_tenki['wakati_tenkinoko'].values.tolist()
word_cloud_list = []
for i in tenki_wordlist:
    for j in i:
        word_cloud_list.append(j)

# WordCloud.generate() expects a single string
result = ','.join(word_cloud_list)

# Path to a Japanese font (needed to render Japanese glyphs)
fpath = "../data/ipaexg.ttf"

# Fillers, particles and honorific fragments to exclude
stop_words = ["の", "ん", "なに", "さ", "!?", "さん", "よ", "など", "こと", "それ",
              "そう", "ちゃん", "何", "みたい", "まま", "くん", "もの", "!?」",
              "そこ", "どこ", "ところで", "これ", "ぱい", "なん", "ここ"]

wordcloud = WordCloud(background_color='white',
    font_path=fpath, width=800, height=600, stopwords=set(stop_words)).generate(result)

# Save the image
wordcloud.to_file('./wordcloud.png')
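If you are working in a notebook, you can also display the result inline instead of only saving it; a small sketch with matplotlib:

```python
import matplotlib.pyplot as plt

# Render the generated word cloud inline
plt.figure(figsize=(10, 7.5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```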

Execution result

Here are the results! Ta-da!

(The finished word cloud.) **It really looks the part!!!!** As expected, the character names assert themselves strongly, which makes sense for a novel. Besides those, words that bring scenes to mind appear too: "Senpai", "sunshine girl", "rooftop", and so on. I can't help but feel that scenes from the movie come to mind just from looking at this image. Yes. ~~(I don't know why, but I was a little moved.)~~

Summary

I think the result turned out pretty well. Playing around with data is fun. Next, I plan to analyze the text with word2vec. See you~
