Every song Kenshi Yonezu writes becomes a hit. The lyrics he spins out seem to have the power to captivate people. This time, I decided to have deep learning learn that charm.
This article covers everything up to "data preprocessing". The general steps are as follows:
Generally, "preprocessing" refers to data processing that improves accuracy, such as normalization. Here, however, "preprocessing" means shaping the data so that it can serve as the input and output of the deep learning model.
Framework: PyTorch
Model: seq2seq with Attention
seq2seq is one of the methods used for machine translation. The following is an image of seq2seq.
Quoted article: [Encoder-decoder model and Teacher Forcing, Scheduled Sampling, Professor Forcing](https://satopirka.com/2018/02/encoder-decoder%E3%83%A2%E3%83%87%E3%83%AB%E3%81%A8teacher-forcingscheduled-samplingprofessor-forcing/)
This makes it possible for the Decoder to generate sentences based on the information encoded by the Encoder, but there is a problem: the Decoder's input can only be a fixed-length vector. The Encoder's output is the hidden state $h$, whose size is fixed. As a result, when the input sequence is too long, the information cannot be properly compressed into $h$, and when the input sequence is too short, $h$ ends up encoding wasteful information. This is why you would want to use **not only the final hidden state of the Encoder, but also the intermediate hidden states**.
This is the background that led to the invention of Attention.
Attention is a method for focusing on the important parts of the past (hence "attention") when dealing with time-series data. In this article we predict the "next passage" from a given "passage of lyrics", so the question becomes: **which parts of the previous passage should we attend to in order to predict the next one?** Below is an image of Attention.
source: Effective Approaches to Attention-based Neural Machine Translation
According to the reference paper, this is more precisely called the Global Attention model. By collecting all of the Encoder's hidden states as vectors and taking their inner product with the Decoder's output, we obtain the **similarity between each Encoder hidden state and the Decoder output**. Measuring similarity with the inner product in this way is why the method is called Attention: it "focuses on the important factors".
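To make this concrete, here is a minimal sketch of the dot-product score in PyTorch. The tensor sizes (5 encoder steps, hidden size 4) are hypothetical stand-ins; a real model computes this for every decoder step, and usually in batches.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes: 5 encoder time steps, hidden size 4
encoder_states = torch.randn(5, 4)  # all hidden states h_1 ... h_5 of the Encoder
decoder_state = torch.randn(4)      # one hidden state output by the Decoder

# Inner product of the decoder state with each encoder state = similarity score
scores = encoder_states @ decoder_state  # shape: (5,)

# Softmax turns the scores into attention weights that sum to 1
weights = F.softmax(scores, dim=0)

# Context vector: weighted sum of all encoder hidden states
context = weights @ encoder_states  # shape: (4,)
```

The encoder states with the largest inner product receive the largest weights, which is exactly the sense in which the model "focuses on the important factors".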
After uploading the required self-made modules to Google Colab, copy and run the main.py described later.
**Required self-made modules**
As shown below, we predict the "next passage" from "one passage" of the Kenshi Yonezu songs released so far.
|Input text|Output text|
|----------|-----------|
|I'm really happy to see you|_All of them are sad as a matter of course|
|All of them are sad as a matter of course|_I have painfully happy memories now|
|I have painfully happy memories now|_Raise and walk the farewell that will come someday|
|Raise and walk the farewell that will come someday|_It's already enough to take someone's place|
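The pairing above, where each passage becomes the input and the next passage, prefixed with "_", becomes the output, can be sketched like this (with hypothetical placeholder passages):

```python
# Hypothetical passages standing in for actual lyric lines
lines = ["passage A", "passage B", "passage C"]

# Pair each passage with the next one, prefixing the output side with "_"
pairs = [(lines[i], "_" + lines[i + 1]) for i in range(len(lines) - 1)]
print(pairs)  # [('passage A', '_passage B'), ('passage B', '_passage C')]
```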
This dataset was created by scraping the lyrics site Uta-Net.
The lyrics are obtained by scraping with the code below. Note that these scripts are run on Google Colab.
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.select import Select
import requests
from bs4 import BeautifulSoup
import re
import time

# Settings
# Chrome options so Selenium can launch in any environment
options = Options()
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
options.add_argument('--proxy-server="direct://"')
options.add_argument('--proxy-bypass-list=*')
options.add_argument('--start-maximized')
options.add_argument('--headless')


class DriverControl():
    def __init__(self, driver):
        self.driver = driver

    def get(self, url):
        self.driver.get(url)

    def get_text(self, selector):
        element = self.driver.find_element_by_css_selector(selector)
        return element.text

    def get_text_by_attribute(self, selector, attribute='value'):
        element = self.driver.find_element_by_css_selector(selector)
        return element.get_attribute(attribute)

    def input_text(self, selector, text):
        element = self.driver.find_element_by_css_selector(selector)
        element.clear()
        element.send_keys(text)

    def select_option(self, selector, text):
        element = self.driver.find_element_by_css_selector(selector)
        Select(element).select_by_visible_text(text)

    def click(self, selector):
        element = self.driver.find_element_by_css_selector(selector)
        element.click()

    def get_lyric(self, url):
        self.get(url)
        time.sleep(2)
        element = self.driver.find_element_by_css_selector('#kashi_area')
        lyric = element.text
        return lyric

    def get_url(self):
        return self.driver.current_url

    def quit(self):
        self.driver.quit()


BASE_URL = 'https://www.uta-net.com/'
search_word = 'Kenshi Yonezu'
search_jenre = 'Lyricist name'

driver = webdriver.Chrome(chrome_options=options)
dc = DriverControl(driver)
dc.get(BASE_URL)  # Access the top page

# Search
dc.input_text('#search_form > div:nth-child(1) > input.search_input', search_word)
dc.select_option('#search_form > div:nth-child(2) > select', search_jenre)
dc.click('#search_form > div:nth-child(1) > input.search_submit')
time.sleep(2)

# Fetch the result page at once with requests
response = requests.get(dc.get_url())
response.encoding = response.apparent_encoding  # Guard against garbled characters
soup = BeautifulSoup(response.text, "html.parser")
side_td1s = soup.find_all(class_="side td1")  # All td elements with class "side td1"
# From each td, get the href of the <a> tag whose href contains 'song'
lyric_urls = [side_td1.find('a', href=re.compile('song')).get('href') for side_td1 in side_td1s]
# Get all song titles
music_names = [side_td1.find('a', href=re.compile('song')).text for side_td1 in side_td1s]

# Get the lyrics and append them to lyric_lis
lyric_lis = list()
for lyric_url in lyric_urls:
    lyric_lis.append(dc.get_lyric(BASE_URL + lyric_url))
with open(search_word + '_lyrics.txt', 'wt') as f_lyric, open(search_word + '_musics.txt', 'wt') as f_music:
    for lyric, music in zip(lyric_lis, music_names):
        f_lyric.write(lyric + '\n\n')
        f_music.write(music + '\n')
```
**Excerpt from the acquired lyrics**
I'm really happy to see you
All of them are sad as a matter of course
I have painfully happy memories now
Raise and walk the farewell that will come someday
It's enough to take someone's place and live
I wish I could be a stone
If so, there is no misunderstanding or confusion
That way without even knowing you
...
At this point, the data is still far from the form shown in [Problem setting], so next we "format the data". In other words, we do the following.
The data is formatted with the code below. The code is somewhat convoluted, but with this the preprocessing is complete.
```python
from datasets import LyricDataset
import torch
import torch.optim as optim
from modules import *
from device import device
from utils import *
from dataloaders import SeqDataLoader
import math
import os

# ==========================================
# Data preparation
# ==========================================
# Path to Kenshi Yonezu_lyrics.txt
file_path = "lyric/Kenshi Yonezu_lyrics.txt"
edited_file_path = "lyric/Kenshi Yonezu_lyrics_edit.txt"

yonedu_dataset = LyricDataset(file_path, edited_file_path)
yonedu_dataset.prepare()
# Check
print(yonedu_dataset[0])

# Split into train and test at 8:2
train_rate = 0.8
data_num = len(yonedu_dataset)
train_set = yonedu_dataset[:math.floor(data_num * train_rate)]
test_set = yonedu_dataset[math.floor(data_num * train_rate):]
```
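The 8:2 split above is plain index slicing; the same arithmetic on a stand-in list looks like this:

```python
import math

data = list(range(10))  # stand-in for the dataset
train_rate = 0.8
cut = math.floor(len(data) * train_rate)  # cut == 8
train_set, test_set = data[:cut], data[cut:]
print(len(train_set), len(test_set))  # 8 2
```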
```python
from sklearn.model_selection import train_test_split
from janome.tokenizer import Tokenizer
import torch
from utils import *


class LyricDataset(torch.utils.data.Dataset):
    def __init__(self, file_path, edited_file_path, transform=None):
        self.file_path = file_path
        self.edited_file_path = edited_file_path
        self.tokenizer = Tokenizer(wakati=True)

        self.input_lines = []   # Input array for the NN (each element is text)
        self.output_lines = []  # Correct-answer array for the NN (each element is text)
        self.word2id = {}       # e.g. {'word0': 0, 'word1': 1, ...}

        self.input_data = []    # A passage of lyrics with each word converted to an ID
        self.output_data = []   # The next passage with each word converted to an ID

        self.word_num_max = None
        self.transform = transform

        self._no_brank()

    def prepare(self):
        # Build the arrays of NN inputs (text) and correct answers (text)
        self.get_text_lines()

        # Assign an ID to every word that appears in the text
        for line in self.input_lines + self.output_lines:  # First passages and following passages
            self.get_word2id(line)

        # Find the maximum number of words in a passage
        self.get_word_num_max()

        # Make sure the padding token " " has an ID
        if " " not in self.word2id:
            self.word2id[" "] = len(self.word2id)

        # Build the arrays of NN inputs (IDs) and correct answers (IDs), padded to the maximum length
        for input_line, output_line in zip(self.input_lines, self.output_lines):
            self.input_data.append(
                [self.word2id[word] for word in self.line2words(input_line)]
                + [self.word2id[" "] for _ in range(self.word_num_max - len(self.line2words(input_line)))])
            self.output_data.append(
                [self.word2id[word] for word in self.line2words(output_line)]
                + [self.word2id[" "] for _ in range(self.word_num_max - len(self.line2words(output_line)))])

    def _no_brank(self):
        # Remove blank lines between passages
        with open(self.file_path, "r") as fr, open(self.edited_file_path, "w") as fw:
            for line in fr.readlines():
                if isAlpha(line) or line == "\n":
                    continue  # Skip alphabetic lines and blank lines
                fw.write(line)

    def get_text_lines(self, to_file=True):
        """
        Takes the path of the lyrics file with blank lines removed and builds the input/output arrays.
        """
        # Read the lyrics file line by line and split it into "lyric passage" (input) and "next passage" (output)
        with open(self.edited_file_path, "r") as f:
            line_list = f.readlines()  # One line = one passage of lyrics
        line_num = len(line_list)
        for i, line in enumerate(line_list):
            if i == line_num - 1:
                continue  # The last passage has no "next passage"
            self.input_lines.append(line.replace("\n", ""))
            self.output_lines.append("_" + line_list[i + 1].replace("\n", ""))
        if to_file:
            with open(self.edited_file_path, "w") as f:
                for input_line, output_line in zip(self.input_lines, self.output_lines):
                    f.write(input_line + " " + output_line + "\n")

    def line2words(self, line: str) -> list:
        word_list = [token for token in self.tokenizer.tokenize(line)]
        return word_list

    def get_word2id(self, line: str) -> dict:
        word_list = self.line2words(line)
        for word in word_list:
            if word not in self.word2id.keys():
                self.word2id[word] = len(self.word2id)

    def get_word_num_max(self):
        # Find the longest passage
        word_num_list = []
        for line in self.input_lines + self.output_lines:
            word_num_list.append(len([self.word2id[word] for word in self.line2words(line)]))
        self.word_num_max = max(word_num_list)

    def __len__(self):
        return len(self.input_data)

    def __getitem__(self, idx):
        out_data = self.input_data[idx]
        out_label = self.output_data[idx]
        if self.transform:
            out_data = self.transform(out_data)
        return out_data, out_label
```
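To make the ID assignment and padding in `prepare()` concrete, here is a toy sketch with a hypothetical two-line corpus (the real input would be janome-tokenized Japanese):

```python
# Toy vocabulary: IDs are assigned in order of first appearance,
# exactly as get_word2id() does; " " plays the role of the padding token
word2id = {" ": 0}
tokenized_lines = [["hello", "world"], ["hi"]]
for line in tokenized_lines:
    for word in line:
        word2id.setdefault(word, len(word2id))

# Pad every line to the longest length, as prepare() does
word_num_max = max(len(line) for line in tokenized_lines)
data = [[word2id[w] for w in line] + [word2id[" "]] * (word_num_max - len(line))
        for line in tokenized_lines]
print(data)  # [[1, 2], [3, 0]]
```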
The code turned out to be longer than I expected, so this article stops at "data preprocessing".