I tried to visualize the text of the novel "Weathering with You" with WordCloud

Introduction

This article is for day 26 of the Tokyo City University Advent Calendar 2019!

Yesterday's article was Lapee's story of running Windows 98 on the holy day.

Overview

This time I visualized the text of the novel version of "Weathering with You" with the Python library wordcloud, so I will summarize the simple procedure. wordcloud is a tool that picks out the words that appear most frequently in a text and draws (visualizes) them at a size proportional to how often they appear. It's faster to show an image than to explain it in words, so here is an example. ↓ (Incidentally, this one seems to be generated from the wiki page for "Gekiko Pun Pun Maru".) You may have seen people doing something similar with their tweets. This time I would like to do it with Weathering with You.

(Example word cloud. Quoted from: https://www.pc-koubou.jp/magazine/2646)
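Before getting into the novel itself, here is a minimal sketch of the basic wordcloud API, just to make the idea concrete (the sample text and output filename are placeholders for illustration):

```python
from wordcloud import WordCloud

# Words that repeat more often are drawn larger
sample_text = "rain rain rain sky sky ferry weather girl girl"
wc = WordCloud(background_color='white', width=800, height=600).generate(sample_text)
wc.to_file('sample_wordcloud.png')
```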

(Novel cover. Quoted from: [Weathering with You (Kadokawa Bunko) - Amazon](https://www.amazon.co.jp/s?k=%E5%A4%A9%E6%B0%97%E3%81%AE%E5%AD%90+%E5%B0%8F%E8%AA%AC&__mk_ja_JP=%E3%82%AB%E3%82%BF%E3%82%AB%E3%83%8A&ref=nb_sb_noss))

Rough flow

1. Create a corpus (text data)
2. Install the required libraries (MeCab, NEologd)
3. Run morphological analysis and extract nouns
4. Visualize with wordcloud

Create a corpus (text data)

First, create a corpus of the text of the original novel. I had an acquaintance help me: we copied one sentence at a time from the Kindle edition and pasted it into Excel. (Screenshot: the sentences pasted into Excel.) Also, on this site you can download free text data for more than 13,000 modern literary works, such as "Kokoro" and "No Longer Human", so I recommend trying one of those instead. If you try it on a modern classic, you will find some delicious words, which is also interesting. Recent novels such as "Weathering with You" cannot be downloaded there, so please bear with me. ~~(If you want to do it, please build your own corpus.)~~
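Once the sentences are in Excel, export them as CSV and check that they load cleanly with pandas; a minimal sketch, assuming the same path and 'text' column used by the code later in this article:

```python
import pandas as pd

# One sentence of the novel per row, in a 'text' column
df = pd.read_csv('../data/tenkinoko.csv', encoding='SHIFT-JIS')
print(df.shape)
print(df['text'].head())
```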

Install the required libraries (MeCab, NEologd)

Install the libraries required for morphological analysis. This time I will use MeCab. Reference: Morphological analysis with Python and MeCab

Also, in order to handle new words from the Web, we install NEologd, a system dictionary for MeCab. Reference: https://qiita.com/spiderx_jp/items/7f8cbfd762c9abab660b

I will omit the installation method.
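That said, once everything is installed, a quick way to confirm that MeCab can see the NEologd dictionary is to parse something; a minimal sketch (the dictionary path below is the common default and may differ on your machine):

```python
import MeCab

# Point MeCab at the NEologd system dictionary and parse a test string
tagger = MeCab.Tagger('-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
print(tagger.parse('天気の子'))
```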

Try morphological analysis

First, since I suspect many readers are wondering

What is morphological analysis?

let me explain: it is the most basic task in natural language processing, **the process of splitting a sentence into words and determining the part of speech of each word**.

For example

"The whistle that announces the departure of the ferry echoes for a long time in the rainy sky in March" </ font>

Suppose we have this sentence. If we split it into words, we get:

'March', 'of', 'rainy sky', 'to', ',', 'ferry', 'of', 'departure', 'o', 'announce', 'whistle', 'ga', 'long', 'echo'

It can be divided up like this. This is called **"wakachi-gaki" (word segmentation)**. And if we then identify the part of speech of each of these words:

March: noun
of: particle
rainy sky: noun
to: particle
,: symbol
ferry: noun
of: particle
departure: noun
o: particle
announce: verb
whistle: noun
ga: particle
long: adjective
echo: verb
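MeCab automates exactly this kind of analysis. Here is a minimal sketch (the Japanese sentence is my rough back-translation of the example above, so the exact tokens may differ):

```python
import MeCab

# Each output line is: surface form <TAB> part of speech,subtype,... details
tagger = MeCab.Tagger()  # default dictionary
print(tagger.parse('三月の雨空に、フェリーの出発を知らせる汽笛が長く鳴り響く。'))
```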

The process up to this point is morphological analysis. For the visualization, I expect the character of Weathering with You to show up mainly in the **nouns** of the text, so I will extract only **nouns** from the morphologically analyzed words. The code looks like this:

import numpy as np
import pandas as pd
import MeCab

# Apply the NEologd dictionary to MeCab
tagger = MeCab.Tagger('-Owakati -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
tagger.parse('')  # workaround for a known bug in the Python bindings

def tokenize_ja(text, lower):
    # Walk the parsed nodes and yield only nouns (名詞),
    # the part of speech we want to keep
    node = tagger.parseToNode(str(text))
    while node:
        if node.feature.split(',')[0] == '名詞':
            yield node.surface.lower() if lower else node.surface
        node = node.next

def tokenize(content, token_min_len, token_max_len, lower):
    return [
        str(token) for token in tokenize_ja(content, lower)
        if token_min_len <= len(token) <= token_max_len and not token.startswith('_')
    ]

# Read the corpus (one sentence of the novel per row)
path = '../data/tenkinoko.csv'
df_tenki = pd.read_csv(path, encoding='SHIFT-JIS')

# Tokenize every sentence into a list of nouns
wakati_tenkinoko_text = []
for i in df_tenki['text']:
    txt = tokenize(i, 1, 10000, True)
    wakati_tenkinoko_text.append(txt)
np.savetxt('../work/tenki_corpus.txt', wakati_tenkinoko_text, fmt='%s', delimiter=',')

df_tenki['wakati_tenkinoko'] = wakati_tenkinoko_text

The result looks like this ↓ (Screenshot: the dataframe now has a wakati_tenkinoko column holding each sentence's nouns.) Now we can extract only the nouns from the text!
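Before moving on to the visualization, it can help to eyeball the raw counts; a minimal sketch using collections.Counter on the column built above:

```python
from collections import Counter

# Flatten the per-sentence noun lists and show the 20 most common nouns
all_nouns = [t for tokens in df_tenki['wakati_tenkinoko'] for t in tokens]
print(Counter(all_nouns).most_common(20))
```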

Visualize with wordcloud

Once the morphological analysis is done, it's finally time for wordcloud! That said, words like "this" and "over there" don't make for much of a word cloud ~~(I just wanted to say that)~~ and some other meaningless words come out too, so I removed them as stop_words. The code looks like this:

from wordcloud import WordCloud

# Flatten the per-sentence noun lists into one flat list
tenki_wordlist = df_tenki['wakati_tenkinoko'].values.tolist()
word_cloud_list = []
for i in tenki_wordlist:
    for j in i:
        word_cloud_list.append(j)

# WordCloud.generate() expects a single string
result = ','.join(word_cloud_list)

# Path to a Japanese font (needed to render Japanese glyphs)
fpath = "../data/ipaexg.ttf"

# Fillers, particles and honorific fragments to exclude
stop_words = ["の", "ん", "なに", "さ", "!?", "さん", "よ", "など", "こと", "それ",
              "そう", "ちゃん", "何", "みたい", "まま", "くん", "もの", "!?」",
              "そこ", "どこ", "ところで", "これ", "ぱい", "なん", "ここ"]

wordcloud = WordCloud(background_color='white',
    font_path=fpath, width=800, height=600, stopwords=set(stop_words)).generate(result)

# Save the image
wordcloud.to_file('./wordcloud.png')
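If you are working in a notebook, you can also display the result inline instead of only saving it; a small sketch with matplotlib:

```python
import matplotlib.pyplot as plt

# Render the generated word cloud inline
plt.figure(figsize=(10, 7.5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```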

Execution result

Here are the results! Ta-da!

(The finished word cloud.) **It really looks the part!!!!** As expected, the character names assert themselves strongly, which makes sense for a novel. Besides those, words that bring scenes to mind appear too: "Senpai", "sunshine girl", "rooftop", and so on. I can't help but feel that scenes from the movie come to mind just from looking at this image. Yes. ~~(I don't know why, but I was a little moved.)~~

Summary

I think the result turned out pretty well. Playing around with data is fun. Next, I plan to analyze the text with word2vec. See you~
