https://www.news-postseven.com/archives/20201203_1617538.html/2
In the article, it is written as follows.
Nakajima sings "I" and Yumin sings "you". Nakajima sings "night," "crying," and "lie." Yumin sings "morning," "love," and "like."
I can see that. I don't know their music well, so it sounds plausible to me. Apart from the awkward double quotes in the original article, it reads smoothly.
But I think it's a little too catchy. Let's do a little analysis and come up with a catchphrase of our own to see whether it really holds.
Even I, huddled in a corner of the IT industry, should be able to manage this much.
Thanks to the people who build such nice libraries.
For scraping I use Beautiful Soup, which I have always relied on. Scraping the lyrics is quick, but one thing worries me a little: copyright.
https://dailytextmining.hatenablog.com/entry/2018/08/02/065500
Hmm, it seems there is no problem as long as the lyrics are used only for data analysis. However, it is a problem if the lyrics themselves leak out. I push to git a lot these days, and if I carelessly upload the raw data after scraping, I am likely to end up as an entry in the calendar below, so be careful. https://qiita.com/advent-calendar/2020/yarakashi-production
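One way to guard against uploading the raw data by accident is to write scraped text only into a directory that git ignores. The following is a minimal sketch under that assumption; the function name `save_raw_lyrics`, the `data/raw` directory, and the file layout are hypothetical and not taken from the repository.

```python
from pathlib import Path


def save_raw_lyrics(repo_root: Path, name: str, text: str,
                    raw_dir: str = "data/raw") -> Path:
    """Write scraped text under raw_dir and make sure git ignores that directory.

    raw_dir and the one-file-per-song layout are assumptions for illustration.
    """
    target_dir = repo_root / raw_dir
    target_dir.mkdir(parents=True, exist_ok=True)

    # Register the raw-data directory in .gitignore before writing any file.
    gitignore = repo_root / ".gitignore"
    entry = f"/{raw_dir}/"
    if gitignore.exists():
        existing = gitignore.read_text(encoding="utf-8").splitlines()
    else:
        existing = []
    if entry not in existing:
        existing.append(entry)
        gitignore.write_text("\n".join(existing) + "\n", encoding="utf-8")

    path = target_dir / f"{name}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

Calling it twice adds the ignore entry only once, so it is safe to use inside a scraping loop.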
Until now I have mostly used MeCab, but it always tripped me up here and there. This time I will try GiNZA right away.
Reference: "Japanese NLP Library GiNZA Recommendation"
Since I mostly use this at work, I long for something like this...
Me: "Making a user dictionary by hand is painful. Santa Maria."
One evening, in its hut, the elephant looked up at the tenth-day moon while eating three bundles of straw and said, "It's painful. Santa Maria."
Source: Aozora Bunko, "Obbel and the Elephant" by Kenji Miyazawa
The elephant's heartfelt words can be visualized like this.
what's this? 3 Do you want to go?
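import spacy
from spacy import displacy

nlp = spacy.load('ja_ginza')  # Japanese model; the sentence below is shown in English translation
doc = nlp("One evening, an elephant was in an elephant hut, eating three straws, looking up at the moon on the tenth day, and saying, \"It's painful. Santa Maria.\"")
displacy.serve(doc, style='dep')  # serves the dependency tree in the browser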
As a salaryman, I sympathize deeply with the elephant, but I will wipe away my tears and get on with the work in GiNZA. What I want from morphological analysis is the part of speech and the base form (lemma) of each word. Since the source material is lyrics, I first thought nouns alone would be enough, but the result felt too sparse. So I keep nouns, adjectives, and verbs, and convert each extracted word to its base form.
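def make_words_list(text: str) -> list:
    """Return the lemmas of the nouns, adjectives, and verbs in text."""
    rs = []
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            # GiNZA tags look like '名詞-普通名詞-一般'; keep only the top-level part of speech
            tag = token.tag_.split('-')[0]
            if tag in ['名詞', '形容詞', '動詞']:  # noun, adjective, verb
            # if tag in ['名詞']:  # nouns only
                rs.append(token.lemma_)
    return rs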
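To see what the filter does without installing GiNZA, here is a small sketch that runs the same tag check on stand-in tokens. `FakeToken` and `keep_content_lemmas` are hypothetical names, and the sample tags assume that GiNZA's `tag_` values are hyphen-joined Japanese labels such as `名詞-普通名詞-一般` (noun).

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (only the two attributes used here)."""
    tag_: str
    lemma_: str


KEEP = {"名詞", "形容詞", "動詞"}  # noun, adjective, verb


def keep_content_lemmas(tokens, keep=KEEP):
    """Return lemmas of tokens whose top-level part of speech is in keep."""
    return [t.lemma_ for t in tokens if t.tag_.split("-")[0] in keep]


tokens = [
    FakeToken("名詞-普通名詞-一般", "象"),   # noun, kept
    FakeToken("助詞-格助詞", "が"),          # particle, dropped
    FakeToken("形容詞-一般", "苦しい"),      # adjective, kept
    FakeToken("動詞-一般", "食べる"),        # verb, kept
    FakeToken("助動詞", "た"),               # auxiliary verb, dropped
]
print(keep_content_lemmas(tokens))  # ['象', '苦しい', '食べる']
```

Comparing the whole first segment, rather than searching for a substring, is what keeps auxiliary verbs (`助動詞`) from being mistaken for verbs (`動詞`).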
spaCy itself is wonderful, and I am also grateful to the GiNZA developers for the Japanese support they have built on top of it.
The key point is the pandas DataFrame. After this I use nlplot to visualize the data nicely, and it is very comfortable because **a DataFrame column (Series) can be passed in directly**.
I omit the details here, but the titles and lyrics have already been obtained by scraping.
title | lyrics | words |
---|---|---|
For kindness... | Good poem | [Word 1,Word 2,Word 3] |
Airplane... | Good poem | [Word 4,Word 5,Word 6] |
This time I will use nlplot. I have been interested in it for a long time but never had a chance to use it, so I am taking this opportunity.
Items 3-5 in particular are things I have never done before.
N-gram bar chart
Oh, that's good! It's clean. It is also interactive, because it is rendered in the browser with Plotly.
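What the bar chart displays is, in essence, a tally of n-grams over the word lists. The following sketch does that tally with the standard library only; `count_ngrams` is a hypothetical helper, the word lists are placeholders, and nlplot's own counting may differ in its details.

```python
from collections import Counter


def count_ngrams(word_lists, n=1):
    """Count n-grams (as tuples) over an iterable of per-song word lists."""
    counts = Counter()
    for words in word_lists:
        # Slide a window of length n over each song separately,
        # so that no n-gram spans two songs.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


songs = [
    ["word1", "word2", "word3", "word1", "word2"],
    ["word2", "word3", "word4"],
]
unigrams = count_ngrams(songs, n=1)
bigrams = count_ngrams(songs, n=2)
print(unigrams[("word2",)])          # 3
print(bigrams[("word1", "word2")])   # 2
```

Sorting these counts in descending order and plotting the top entries gives the same kind of view as the chart.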
N-gram tree map
It is flashier than the bar chart. It works well when you want a rough overall impression rather than exact figures. In a presentation, it might be good as an interlude or as a chapter title slide.
wordcloud
Here is how it looks as a word cloud. A word cloud does not look good unless there are a fair number of longer words.
Yumi Matsutoya
Miyuki Nakajima
co-occurrence network
This is also rendered in the browser with Plotly, so it is interactive too. A co-occurrence network is interesting when you want to look at the relationships between words, as here. Above all, I'm glad that it is so easy to make.
Yumi Matsutoya
Miyuki Nakajima
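The edges of a co-occurrence network come from counting how often two words appear together. Below is a minimal sketch of one simple definition, counting each unordered pair once per song; `cooccurrence_counts` is a hypothetical helper with toy data, and nlplot may define and weight co-occurrence differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(word_lists):
    """Count, for each unordered word pair, the number of songs containing both words."""
    pairs = Counter()
    for words in word_lists:
        # Deduplicate and sort so each pair is counted once per song
        # and always appears in the same order.
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return pairs


toy_songs = [
    ["sky", "sea", "night"],
    ["sky", "sea"],
    ["night", "sky", "sky"],
]
pairs = cooccurrence_counts(toy_songs)
print(pairs[("sea", "sky")])    # 2
print(pairs[("night", "sea")])  # 1
```

Each pair becomes an edge, and its count can serve as the edge weight.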
sunburst chart
This one is also impressive; the output is very clean. It is read the same way as the previous charts. A stronger message would have been nice, but that is my fault: I should have set stop words...
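A common way to choose stop words is to treat the most frequent words across the whole corpus as uninformative and drop them before plotting. The sketch below shows that idea with the standard library; both helper names and the `top_n` cutoff are assumptions for illustration, not what the analysis above actually did.

```python
from collections import Counter


def frequent_words_as_stopwords(word_lists, top_n):
    """Return the top_n most frequent words across all songs as a stop-word set."""
    counts = Counter(w for words in word_lists for w in words)
    return {w for w, _ in counts.most_common(top_n)}


def remove_stopwords(word_lists, stopwords):
    """Drop every stop word from each per-song word list."""
    return [[w for w in words if w not in stopwords] for words in word_lists]


toy_lists = [["する", "空", "する"], ["する", "海"]]
stopwords = frequent_words_as_stopwords(toy_lists, top_n=1)
print(stopwords)                               # {'する'}
print(remove_stopwords(toy_lists, stopwords))  # [['空'], ['海']]
```

With generic verbs such as `する` ("to do") removed, the remaining words carry more of each artist's character.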
**"Yumin sings time, Miyuki Nakajima sings place."** I was surprised when I ran the analysis: the top words are almost the same for both, aren't they?
That suggests the less frequent words may be more distinctive, so let's look at those. Yumin tends to use many verbs, and Miyuki Nakajima tends to use many nouns. I also think Miyuki Nakajima has many words related to nature, such as "sky" and "sea", while Yumin has many words referring to people, such as "two people" and "you".
I am a little younger than the Yumin generation, and I cannot tell "Michopa" and "Yuki Poyo" apart.
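When the top words overlap, one way to surface the differences is to rank words by how much larger their share is in one artist's lyrics than in the other's. This is a minimal sketch of that comparison; `distinctive_words` is a hypothetical helper, and the two word lists are made-up toy data, not results from the actual lyrics.

```python
from collections import Counter


def distinctive_words(own_lists, other_lists, top_n=3):
    """Rank words by (share in own corpus) - (share in other corpus), largest first."""
    own = Counter(w for words in own_lists for w in words)
    other = Counter(w for words in other_lists for w in words)
    own_total = sum(own.values()) or 1
    other_total = sum(other.values()) or 1
    # A Counter returns 0 for words the other corpus never uses.
    diff = {w: own[w] / own_total - other[w] / other_total for w in own}
    return sorted(diff, key=diff.get, reverse=True)[:top_n]


# Toy data only: both artists share "love" and "you", and differ in one word.
artist_a = [["love", "you", "time", "time"]]
artist_b = [["love", "you", "place", "place"]]
print(distinctive_words(artist_a, artist_b, top_n=1))  # ['time']
print(distinctive_words(artist_b, artist_a, top_n=1))  # ['place']
```

Shared words cancel out, so only the words that set one corpus apart rise to the top.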
About Yumi Matsutoya: I like "The Wind Crossing the Pier" and "The Refrain Is Screaming". It is said that "The Wind Crossing the Pier" is tuned to 450 Hz, which is higher than the standard pitch. That refreshing feel may be something to get at through audio analysis rather than natural language analysis.
About Miyuki Nakajima: I like "Fight!" and "Light Sleep". She has also written songs for many other artists.
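To put the 450 Hz figure in perspective, the gap to the standard pitch can be expressed in cents (100 cents = one semitone). The sketch below assumes that "standard pitch" means A4 = 440 Hz; the 450 Hz value is the one quoted above, taken as given.

```python
import math


def cents_above(freq_hz: float, reference_hz: float = 440.0) -> float:
    """Interval from reference_hz up to freq_hz in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(freq_hz / reference_hz)


print(round(cents_above(450.0), 1))  # 38.9
```

Under that assumption the tuning sits roughly 39 cents sharp, a little more than a third of a semitone.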
GitHub: https://github.com/Katsutoshi-Inuga/qiita_2020_advent_cal_lyrics_nlp