Using **word2vec** on the lyrics of every Hinatazaka46 song, I'd like to convert **natural language ⇒ numbers** and play with the results.
**The magic of converting words into vectors**
**How do words turn into vectors?**
The weights W and W′ are learned using the words surrounding the input word (**this time, within a distance of 1**) as the teacher data. The learned weight matrix W holds the vector of each word.
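As a rough sketch of this mechanism (toy sizes and random weights, purely for illustration, not the actual training code): feeding in a word's one-hot vector simply selects one row of W, and that row is the word's vector.

```python
import numpy as np

vocab_size, dim = 5, 3                     # toy sizes, for illustration only
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, dim))     # input-side weights: one row per word
W2 = rng.normal(size=(dim, vocab_size))    # output-side weights W'

one_hot = np.eye(vocab_size)[2]            # one-hot vector of the word with ID 2
hidden = one_hot @ W                       # equals W[2]: this row is the word vector
scores = hidden @ W2                       # scores for predicting the surrounding words
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(hidden)                              # the 3-dimensional vector of word 2
```

Training nudges W and W′ so that these probabilities match the words actually observed inside the window; afterwards only W is kept as the table of word vectors.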
Collect data according to the task you want to solve
Remove meaningless noise such as HTML tags: **Beautiful Soup**, the standard library **re** module
```python
#1. Scraping
import requests
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

target_url = "https://www.uta-net.com/search/?Aselect=1&Keyword=%E6%97%A5%E5%90%91%E5%9D%82&Bselect=3&x=0&y=0"
r = requests.get(target_url)
soup = BeautifulSoup(r.text, "html.parser")
music_list = soup.find_all('td', class_='side td1')

# Extract the URL of each song from the song list and put it in a list
url_list = []
for elem in music_list:
    a = elem.find("a")
    b = a.attrs['href']
    url_list.append(b)

# The matched HTML looks like this:
#<td class="side td1">
# <a href="/song/291307/"> Azato Kawaii </a>
#</td>
#<td class="side td1">
# <a href="/song/250797/"> Uncomfortable and grown up </a>
#</td>
```
```python
# Send a request for each song and extract the lyrics
hinatazaka_kashi = ""
base_url = "https://www.uta-net.com"
for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    r = requests.get(target_url)
    soup = BeautifulSoup(r.text, "html.parser")
    div_list = soup.find_all("div", id="kashi_area")
    for div in div_list:
        hinatazaka_kashi += div.text

# The lyrics block looks like this:
#<div id="kashi_area" itemprop="text">
#I was caught (Yeah at a glance, Yeah, Yeah)
#<br>
#I fell in love without permission
#<br>
#It's not your fault
```
```python
#2. Preprocessing (remove English and symbols with regular expressions)
import re

kashi = re.sub("[a-zA-Z0-9_]", "", hinatazaka_kashi)  # Delete alphanumeric characters
kashi = re.sub("[!-/:-@[-`{-~]", "", kashi)           # Remove ASCII symbols
kashi = re.sub("\n\n", "\n", kashi)                   # Collapse consecutive line breaks
kashi = re.sub("\r", "", kashi)                       # Remove carriage returns
kashi = re.sub("\u3000", "", kashi)                   # Remove full-width spaces

# Remove remaining spaces and full-width punctuation
for ch in [' ', '?', '。', '…', '!', '!', '「', '」', '“', '”', '、', '・']:
    kashi = kashi.replace(ch, '')

with open("hinatazaka_kashi_1.txt", mode="w", encoding="utf-8") as fw:
    fw.write(kashi)
```
Unify half-width and full-width characters, upper and lower case, and so on.
・**Okurigana**: 「行う」 vs. 「行なう」 ("do"), 「受付」 vs. 「受け付け」 ("reception")
・**Character type**: 「りんご」 vs. 「リンゴ」 ("apple"), 「いぬ」, 「イヌ」, 「犬」 ("dog")
・**Upper and lower case**: "Apple" vs. "apple"
**※ Ignored this time**
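Although it is skipped here, much of this normalization is a one-liner with the standard library's unicodedata; a minimal sketch (NFKC folds full-width/half-width variants, `lower()` unifies case):

```python
import unicodedata

def normalize(text: str) -> str:
    # NFKC folds full-width ASCII and half-width kana into canonical forms
    return unicodedata.normalize("NFKC", text).lower()

print(normalize("Apple"))  # -> "apple"
print(normalize("アップル"))   # half-width katakana -> "アップル"
```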
Split sentences into words: **MeCab**, **Janome**, **JUMAN++**
Unify words to their stem (the part that does not inflect), e.g. 学ん → 学ぶ ("learn"). Recent implementations sometimes do not convert to the base form.
path="hinatazaka_kashi_1.txt"
f = open(path,encoding="utf-8")
data = f.read() #Returns the data read all the way to the end of the file
f.close()
```python
#3. Morphological analysis
import MeCab

text = data
m = MeCab.Tagger("-Ochasen")  # Tagger instance; ChaSen format: surface, reading, base form, POS, ...

# Keep the surface form of content words: nouns, verbs, adjectives and
# adjectival nouns (proper nouns are a subcategory of 名詞)
nouns = []
for line in m.parse(text).splitlines():
    cols = line.split('\t')
    if len(cols) >= 4 and cols[3].startswith(("名詞", "動詞", "形容詞", "形容動詞")):
        nouns.append(cols[0])
```
Remove useless words, for example words that appear too often (stop words). Recent implementations sometimes do not remove them.
my_stop_word=["To do","Teru","Become","Is","thing","of","Hmm","y","one","Sa","so","To be","Good","is there","Yo","もof","Absent","End up",
"Be","Give me","From","I wonder","That","but","Only","Tsu","hand","Until","Tsuhand","See you","Want","If","Cod","Without","Be","As it is","Taku"]
nouns_new=[]
for i in nouns:
if i in my_stop_word:
continue
else:
nouns_new.append(i)
import codecs
with codecs.open("hinatazaka_kashi_2.txt", "w", "utf-8") as f:
f.write("\n".join(nouns_new))
Convert the strings into numbers so that machine learning can handle them.
Choose a model, from classic machine learning to neural networks, according to the task. With this flow in mind, let's check which steps correspond to preprocessing.
```python
from gensim.models import word2vec

corpus = word2vec.LineSentence("hinatazaka_kashi_2.txt")
model = word2vec.Word2Vec(corpus,
                          vector_size=100,  # called size in gensim < 4
                          min_count=3,
                          window=5,
                          epochs=30)        # called iter in gensim < 4
model.save("hinatazaka.model")

model = word2vec.Word2Vec.load("hinatazaka.model")
```
```python
# Look up the words most similar to a given word (queries are the Japanese
# tokens the model was trained on)
print('Top 10 words related to 好き (like)')
similar_words = model.wv.most_similar(positive=[u"好き"], topn=10)
for key, value in similar_words:
    print('{}\t\t{:.2f}'.format(key, value))
print('-----')

# Calculate the similarity between two words
similarity = model.wv.similarity(w1=u"笑顔", w2=u"夏")
print('Similarity between "smile" and "summer" => ' + str(similarity))

similarity = model.wv.similarity(w1=u"友達", w2=u"夏")
print('Similarity between "friend" and "summer" => ' + str(similarity))

similarity = model.wv.similarity(w1=u"女の子", w2=u"男")
print('Similarity between "girl" and "man" => ' + str(similarity))
```
The similarity shown here is the **cosine similarity**. Put simply, cosine similarity is a number expressing how much two vectors point in the same direction (how similar they are). A cosine similarity of 0 indicates low similarity, and a cosine similarity of 1 indicates high similarity. Cosine similarity is expressed by the following formula.
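$$
\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$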
*** "Papa Duwa Duwa Duwa Duwa Duwa Duwa Duwa Papa Papa" *** What is it? If you get the data from the member's blog instead of the lyrics, you can see how close the members are. (Let's try next time ...)