This article is the Day 6 entry of the estie Advent Calendar 2019. I'm an engineer at estie.inc, a real estate startup.
Recently, this article became a hot topic: "[Python] I visualized Arashi's lyrics with WordCloud and tried to unravel what they wanted to convey to fans in the 20th year since their formation".
It makes me really happy when my favorite idols and artists stay active and loved for many years. As a fan, I really understand the desire to look closely at their words and confirm what they wanted to convey.
As it happens, there is another artist that has also celebrated the 20th anniversary of its formation.
That's right, everyone loves Perfume.
As you know, Perfume has a high affinity with technology, and they continue to deliver cutting-edge expression such as [live production using Google's machine learning](https://cloud.google.com/blog/ja/products/gcp/nhk-perfume-technology-reframe-your-photo-google-tensorflow) and live distribution over 5G. Hats off to Rhizomatiks.
So, as a fan who has been going to Perfume's concerts for about 10 years, I will try morphological analysis plus WordCloud visualization of Perfume's lyrics.
Following the earlier articles, I will proceed in this order: acquisition of lyrics → morphological analysis → WordCloud. For details, please see the [Reference Site](#Reference Site).
I haven't done much text mining, so I assumed that morphological analysis meant MeCab. When I looked into it, though, there turned out to be a variety of morphological analysis tools.
This time, from among them, I would like to try this trio: MeCab, Janome, and Nagisa.
MeCab
This is a standard morphological analysis tool developed by a current developer of Google Japanese Input. It runs in almost any environment, but a separate dictionary is required for analysis. This time I used the officially recommended IPA dictionary plus a new-word dictionary.
mecab_.py
import MeCab

# Read the lyrics file
with open("perfume.txt", encoding="utf-8") as f:
    text = f.read()

# Morphological analysis
mecab = MeCab.Tagger("-Ochasen")
node = mecab.parseToNode(text)

perfume_list = []
# Part-of-speech labels: noun, verb, adverb, adjective, adjectival verb
tags = ["名詞", "動詞", "副詞", "形容詞", "形容動詞"]
while node:
    # Surface form of the word
    word = node.surface
    # Top-level part of speech (first field of the feature string)
    word_class = node.feature.split(",")[0]
    # Keep only the selected parts of speech
    if word_class in tags:
        perfume_list.append(word)
    node = node.next

print(perfume_list)
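With the IPA dictionary, `node.feature` is a comma-separated string whose first field is the part of speech and whose seventh field is the base form (`*` when the dictionary has none). The following is only a minimal sketch, not part of my original script, of how that string could be split so that the MeCab version also collects base forms. The helper name `pick_word` and the sample feature strings are assumptions made up for illustration.

```python
# Sketch only: pick_word is a hypothetical helper, and the feature strings
# below are hand-written samples in the IPA dictionary layout:
# POS, POS detail 1-3, conjugation type, conjugation form, base form, reading, pronunciation
TAGS = {"名詞", "動詞", "副詞", "形容詞", "形容動詞"}

BASE_FORM_INDEX = 6


def pick_word(surface, feature, tags=TAGS):
    """Return the base form (or the surface form) if the POS is wanted, else None."""
    fields = feature.split(",")
    if fields[0] not in tags:
        return None
    # Fall back to the surface form when the base form is missing or "*"
    if len(fields) > BASE_FORM_INDEX and fields[BASE_FORM_INDEX] != "*":
        return fields[BASE_FORM_INDEX]
    return surface


samples = [
    ("踊っ", "動詞,自立,*,*,五段・ラ行,連用タ接続,踊る,オドッ,オドッ"),
    ("て", "助詞,接続助詞,*,*,*,*,て,テ,テ"),
    ("Perfume", "名詞,一般,*,*,*,*,*"),
]
words = [w for w in (pick_word(s, f) for s, f in samples) if w is not None]
print(words)
```

Inside the `while node:` loop, a call such as `pick_word(node.surface, node.feature)` could stand in for the part-of-speech check if base forms are preferred over surface forms.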
Janome
This is probably the second most popular analysis tool after MeCab.
Execution is slower than MeCab, but the dictionary is bundled and there are few dependent libraries, so a single command completes the installation:
pip install janome
That ease of installation is its appeal.
It seems to be often used for preliminary verification before moving on to MeCab.
janome_.py
from janome.tokenizer import Tokenizer

# Read the lyrics file
with open("perfume.txt", encoding="utf-8") as f:
    text = f.read()

# Morphological analysis
t = Tokenizer()
tokens = t.tokenize(text)

perfume_list = []
# Part-of-speech labels: noun, verb, adverb, adjective, adjectival verb
tags = ["名詞", "動詞", "副詞", "形容詞", "形容動詞"]
for token in tokens:
    # Use the base form when available, otherwise the surface form
    if token.base_form == "*":
        word = token.surface
    else:
        word = token.base_form
    # Top-level part of speech
    word_class = token.part_of_speech.split(",")[0]
    # Keep only the selected parts of speech
    if word_class in tags:
        perfume_list.append(word)

print(perfume_list)
Nagisa
This is a relatively new tool. As with Janome, the environment is easy to set up, and a single command completes the installation:
pip install nagisa
Since the input this time is lyrics, I can't take advantage of it, but it is said to analyze emoticons and URLs robustly.
It also has a method that filters the output words by part of speech, so extraction is easy.
nagisa_.py
import nagisa

# Read the lyrics file
with open("perfume.txt", encoding="utf-8") as f:
    text = f.read()

# Morphological analysis and word extraction by part of speech in one step
# Part-of-speech labels: noun, verb, adverb, adjective, adjectival verb
tags = ["名詞", "動詞", "副詞", "形容詞", "形容動詞"]
perfume_list = nagisa.extract(text, extract_postags=tags).words

print(perfume_list)
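Whichever tool produced `perfume_list`, the remaining step is to hand the words to WordCloud. The following is a rough sketch of that step, not the exact script I ran: it assumes the `wordcloud` package is installed, and the font path, image size, and output file name are placeholders. The helper names `to_wordcloud_text` and `save_wordcloud` are made up for this example.

```python
# Sketch only: the font path and output file name are assumptions, and
# `wordcloud` must be installed separately (pip install wordcloud).


def to_wordcloud_text(words):
    """Join extracted words with spaces so that WordCloud can tell them apart."""
    return " ".join(w for w in words if w.strip())


def save_wordcloud(words, font_path, out_path="perfume_wordcloud.png"):
    """Render the words into an image file and return its path."""
    # Imported here so that to_wordcloud_text works even without the package
    from wordcloud import WordCloud

    cloud = WordCloud(
        font_path=font_path,       # a font with Japanese glyphs is required
        background_color="white",
        width=800,
        height=600,
        collocations=False,        # do not add frequent word pairs as bigrams
    )
    cloud.generate(to_wordcloud_text(words))
    cloud.to_file(out_path)
    return out_path


if __name__ == "__main__":
    # Hypothetical sample; in practice pass the perfume_list built by one of the tools
    sample = ["ディスコ", "ディスコ", "チョコレイト", "ディスコ"]
    print(to_wordcloud_text(sample))
    # save_wordcloud(sample, font_path="/path/to/japanese-font.ttf")
```

Japanese text has no spaces between words, which is why the morphological analysis step comes first and the words are joined with spaces before being passed to `generate`.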
Results (one WordCloud image per tool): MeCab, Janome, Nagisa
MeCab and Janome, which use the same dictionary, gave similar results.
The words that stand out are things like "Pa Pa", "loving you", and "disco disco", I suppose. Many songs repeat their own titles over and over, and that influence shows up in the results!
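WordCloud draws a word larger the more often it appears, so a quick frequency count makes it easy to see why repeated words dominate. Here is a minimal sketch using only the standard library; the word list is a made-up sample standing in for `perfume_list`, not real analysis output.

```python
from collections import Counter

# Made-up sample standing in for perfume_list; not actual analysis output
sample_words = ["ディスコ", "loving", "ディスコ", "今日", "loving", "ディスコ"]

counts = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs, highest first
for word, count in counts.most_common(2):
    print(word, count)
```

Running the same count on the real word list would show which words end up largest in each image.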
Text mining tools are plentiful and easy to use, and I'm happy that this kind of visualization can be done so easily. Why don't you try it with your favorite artist?
By the way, estie, the company I currently work at, offers a variety of real estate x technology services built on visualizing office data. If you are thinking of moving your office, please use estie! We also provide estie pro, a real estate data platform.
Also, estie is looking for web engineers on Wantedly. Please feel free to come visit us at the office!
Reference Site
- Lyrics obtained from uta-net
- [Python] I visualized Arashi's lyrics with WordCloud and tried to unravel what they wanted to convey to fans in the 20th year since their formation
- [Python] I tried to visualize Night on the Galactic Railroad with WordCloud!
- nagisa: Japanese word segmentation and part-of-speech tagging tool using RNN