To mark the 20th anniversary of their formation, I visualized Perfume's lyrics with WordCloud

This article is the day 6 entry of the estie Advent Calendar 2019. I'm an engineer at estie.inc, a real estate venture.

Introduction

Recently, this article became a hot topic: "[Python] I visualized Arashi's lyrics with WordCloud and tried to unravel what they wanted to convey to fans in their 20th year since formation".

It makes me really happy when my favorite idols and artists stay active and loved for many years. As a fan, I completely understand wanting to look closely at their words and work out what they wanted to convey.

As it happens, there is another artist who has also celebrated the 20th anniversary of their formation.

That's right, everyone loves Perfume.

As you know, Perfume has a strong affinity with technology and keeps delivering cutting-edge performances, such as live staging that uses Google's machine learning and live streaming over 5G. Well done, Rhizomatiks.

So, as a fan who has been going to Perfume's concerts for about 10 years, I will try morphological analysis and WordCloud visualization of Perfume's lyrics.

Environment

macOS Mojave
Python 3.8

Method

Following those who did this before me, I will proceed in three steps: acquiring the lyrics → morphological analysis → WordCloud. For details, please see the reference sites listed at the end of this article.

Morphological analysis tool

I haven't done much text mining, so I assumed morphological analysis simply meant MeCab. Looking into it, though, there turn out to be various morphological analysis tools.

From among them, this time I would like to try these three: MeCab, Janome, and Nagisa.

MeCab

This is the standard morphological analysis tool, developed by a current developer of Google Japanese Input. It runs in a wide range of environments, but analysis requires a separately installed dictionary. This time I used the officially recommended IPA dictionary plus a new-word dictionary.

`mecab_.py`


import MeCab

# Read the lyrics file
with open("perfume.txt", encoding="utf-8") as f:
    text = f.read()

# Morphological analysis
mecab = MeCab.Tagger("-Ochasen")
node = mecab.parseToNode(text)

perfume_list = []
# Parts of speech to keep: noun, verb, adverb, adjective, adjectival verb
tags = ["名詞", "動詞", "副詞", "形容詞", "形容動詞"]

while node:
    # Surface form of the token
    word = node.surface
    # Top-level part of speech (first field of the feature string)
    word_class = node.feature.split(",")[0]

    # Keep only the selected parts of speech
    if word_class in tags:
        perfume_list.append(word)

    node = node.next

print(perfume_list)
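
Note that the MeCab script collects surface forms, while the Janome script later in this article collects base forms. Below is a minimal sketch of how the base form could be read from a MeCab feature string. It assumes the IPA dictionary layout, in which the seventh comma-separated field holds the base form. `base_form_or_surface` is a hypothetical helper name, not part of MeCab, and the feature strings are illustrative rather than output from the lyrics.

```python
def base_form_or_surface(surface: str, feature: str) -> str:
    """Return the base form from an IPA-dictionary style feature string.

    Assumed field order (IPADIC): POS, POS sub1, POS sub2, POS sub3,
    conjugation type, conjugation form, base form, reading, pronunciation.
    Falls back to the surface form when the base form is missing or "*".
    """
    fields = feature.split(",")
    if len(fields) > 6 and fields[6] != "*":
        return fields[6]
    return surface


# Illustrative feature strings in the assumed layout
print(base_form_or_surface("踊っ", "動詞,自立,*,*,五段・ラ行,連用タ接続,踊る,オドッ,オドッ"))
print(base_form_or_surface("Perfume", "名詞,固有名詞,組織,*,*,*,*"))
```

In the loop above, `word = node.surface` could then become `word = base_form_or_surface(node.surface, node.feature)`, which should make the MeCab word list easier to compare with the Janome one.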

Janome

This is probably the second most popular analysis tool after MeCab. It runs slower than MeCab, but the dictionary is bundled and it has few dependencies, so installation is finished with just `pip install janome`, which makes it attractive. It seems to be often used for trying things out before moving on to MeCab.

`janome_.py`


from janome.tokenizer import Tokenizer

# Read the lyrics file
with open("perfume.txt", encoding="utf-8") as f:
    text = f.read()

# Morphological analysis
t = Tokenizer()
tokens = t.tokenize(text)

perfume_list = []
# Parts of speech to keep: noun, verb, adverb, adjective, adjectival verb
tags = ["名詞", "動詞", "副詞", "形容詞", "形容動詞"]

for token in tokens:
    # Use the base form; fall back to the surface form when it is missing
    if token.base_form == '*':
        word = token.surface
    else:
        word = token.base_form

    # Top-level part of speech (first field of part_of_speech)
    ps = token.part_of_speech
    word_class = ps.split(',')[0]

    # Keep only the selected parts of speech
    if word_class in tags:
        perfume_list.append(word)

print(perfume_list)

Nagisa

This is a relatively new tool. As with Janome, the environment is easy to set up: installation is finished with just `pip install nagisa`. Since the input here is lyrics I can't take advantage of it, but it is said to be robust when analyzing emoticons and URLs. It also provides a method for filtering the output words by part of speech, so extraction is easy.

`nagisa_.py`


import nagisa

# Read the lyrics file
with open("perfume.txt", encoding="utf-8") as f:
    text = f.read()

# Morphological analysis and word extraction by part of speech
# Parts of speech to keep: noun, verb, adverb, adjective, adjectival verb
tags = ["名詞", "動詞", "副詞", "形容詞", "形容動詞"]
perfume_list = nagisa.extract(text, extract_postags=tags).words

print(perfume_list)
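
The scripts in this article stop at printing the word list. Below is a minimal sketch of the remaining WordCloud step. It assumes the `wordcloud` package is installed and that a font with Japanese glyphs is available. `word_frequencies`, `render_wordcloud`, the output filename, and the sample list are hypothetical names and data chosen for illustration. This is not necessarily how the images in this article were produced.

```python
from collections import Counter


def word_frequencies(words):
    """Count how often each extracted word occurs, skipping blank strings."""
    return Counter(w for w in words if w.strip())


def render_wordcloud(frequencies, font_path, out_path="perfume_wordcloud.png"):
    """Draw a word cloud from a {word: count} mapping and save it as an image.

    font_path must point to a font that contains Japanese glyphs
    (an assumption: the reader supplies it).
    """
    # Imported here so the counting helper also works without wordcloud installed
    from wordcloud import WordCloud

    cloud = WordCloud(
        font_path=font_path,
        width=800,
        height=600,
        background_color="white",
    )
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(out_path)


# Illustrative word list standing in for perfume_list
sample = ["ディスコ", "ディスコ", "love", "夢", "love", "ディスコ"]
freqs = word_frequencies(sample)
print(freqs.most_common(2))
```

With a real word list, the call would look like `render_wordcloud(word_frequencies(perfume_list), font_path="/path/to/japanese-font.ttf")`, where the font path is a placeholder. Passing frequencies directly avoids WordCloud's own tokenization, which is designed for space-separated text.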

Results

MeCab (word cloud image)
Janome (word cloud image)
Nagisa (word cloud image)

MeCab and Janome, which use the same dictionary, gave similar results.
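
"Similar" here is a visual impression. One way it could be put into numbers is to compare the most frequent words from two tools. The sketch below uses Jaccard similarity, a technique chosen for illustration rather than something the article measured. `top_words`, `jaccard`, and the two sample lists are hypothetical stand-ins for the real outputs.

```python
from collections import Counter


def top_words(words, n=50):
    """Return the set of the n most frequent words in a word list."""
    return {word for word, _ in Counter(words).most_common(n)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative lists standing in for the MeCab and Janome outputs
mecab_words = ["ディスコ", "ディスコ", "love", "夢", "世界"]
janome_words = ["ディスコ", "love", "love", "夢", "光"]

similarity = jaccard(top_words(mecab_words), top_words(janome_words))
print(f"top-word overlap: {similarity:.2f}")
```

A value close to 1 would mean the two tools surface almost the same top words. Because the MeCab script keeps surface forms and the Janome script keeps base forms, conjugated words can lower the score even when the segmentation agrees.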

In conclusion

"Pa Pa", "loving you", "disco disco"... Many of their songs repeat the song title in the lyrics, and that influence shows up in the results!

There are plenty of text mining tools and they are easy to use, so I'm happy that this kind of visualization is so easy to do. Why not try it with your favorite artist?

By the way, estie, where I currently work, offers a variety of real estate x technology services built around visualizing office data. If you are thinking of relocating your office, please use estie! We also provide estie pro, a real estate data platform.

estie is also looking for web engineers on Wantedly. Please feel free to come visit us at the office!

Reference sites

- Lyrics obtained from uta-net
- [Python] I visualized Arashi's lyrics with WordCloud and tried to unravel what they wanted to convey to fans in their 20th year since formation
- [Python] I tried to visualize Night on the Galactic Railroad with WordCloud!
- nagisa: Japanese word segmentation and part-of-speech tagging tool based on RNNs