I wanted to analyze song lyrics, but collecting them by hand was difficult, so I tried web scraping for the first time. To be honest, I was a little worried because I had never written HTML properly, but I managed to do what I set out to do, so I would like to summarize it here. I would appreciate any advice or corrections.
This is the article I referred to this time.
[I tried to find out where I want to go by using word2vec and lyrics for "Kenshi Yonezu's theory that I can't go anywhere"](https://qiita.com/k_eita/items/456895942c3dda4dc059#%E6%AD%8C%E8%A9%9E%E3%81%AE%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0)
That article saves the lyrics as text files. This time, I rewrote its code for my own purposes.
This time, I will collect the following five items for each song:
- Song title
- Artist
- Lyricist
- Composer
- Lyrics
The output format is CSV.
The site scraped this time is "Uta-Net: Lyrics Search Service".
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Get the web page and return it as text
def load(url):
    # The timeout keeps the script from hanging if the server does not respond
    res = requests.get(url, timeout=10)
    # raise_for_status() raises HTTPError if the request returned a failed status code
    res.raise_for_status()
    # Get the response body as text
    return res.text

# Get all HTML tags with the given name
def get_tag(html, find_tag):
    soup = BeautifulSoup(str(html), 'html.parser')
    tag = soup.find_all(find_tag)
    return tag
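# Convert to a data structure that can be handled by the program
def parse(html):
    soup = BeautifulSoup(str(html), 'html.parser')
    # Remove the HTML tags
    simple_row = soup.get_text()
    simple_row = simple_row.replace('\n', '')
    simple_row = simple_row.replace(' ', '')
    # Delete alphanumeric characters (if needed)
    # simple_row = re.sub(r'[a-zA-Z0-9]', '', simple_row)
    # Delete symbols: the listed characters plus the ASCII punctuation ranges.
    # The quote characters were garbled in this copy; curly quotes are assumed here.
    simple_row = re.sub(r'[<>♪‘’“”・…_!?!-/:-@\[-`{-~]', '', simple_row)
    # Delete the notice. The pattern is assumed to be the Japanese text "注意：" ("Note:")
    # as it appears on the page; adjust it if the wording differs.
    simple_row = re.sub(r'注意：.+', '', simple_row)
    return simple_row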
# Helper: cleaned text of the first <tag_name> element whose markup contains marker
# (returns '' when nothing matches, so every song row keeps the same five columns)
def find_text(html, tag_name, marker):
    for tag in get_tag(html, tag_name):
        # Cast to str once so the attribute can be searched as plain text
        tag = str(tag)
        if marker in tag:
            # Remove unnecessary data
            return parse(tag)
    return ''

# Get the information for each song
def get_info(url):
    base_url = 'https://www.uta-net.com/'
    html = load(url)
    # Store the URL of each song
    song_url = []
    # Store the information of all songs, one list per song
    songs_info = []

    # Get the song URLs from the links inside the <td> cells
    for td in get_tag(html, 'td'):
        # Get the <a> elements
        for a in get_tag(td, 'a'):
            href = a.get('href')
            # Keep links whose href contains 'song'; skip <a> tags without an href
            if href and 'song' in href:
                # base_url already ends with '/', so drop a leading '/' from href
                song_url.append(base_url + href.lstrip('/'))

    # Get the song information
    for i, page in enumerate(song_url):
        print('Song {}: {}'.format(i + 1, page))
        html = load(page)
        # Song_Title: the <h2 class="prev_pad"> heading, falling back to the first <h2>
        title = find_text(html, 'h2', 'class="prev_pad"') or find_text(html, 'h2', '')
        song_info = [
            title,
            find_text(html, 'h3', 'itemprop="byArtist"'),  # Artist
            find_text(html, 'h4', 'itemprop="lyricist"'),  # Lyricist
            find_text(html, 'h4', 'itemprop="composer"'),  # Composer
            find_text(html, 'div', 'itemprop="text"'),     # Lyric
        ]
        songs_info.append(song_info)
        # Wait 1 second between requests (reduces server load)
        time.sleep(1)
    return songs_info
def create_df(file_name, url):
    # Create a data frame with one row per song
    df = pd.DataFrame(get_info(url))
    df = df.rename(columns={0:'Song_Title', 1:'Artist', 2:'Lyricist', 3:'Composer', 4:'Lyric'})
    # CSV file output (the csv/ directory must already exist)
    df.to_csv("csv/{}.csv".format(file_name))
    # to_csv returns None when given a path, so return the data frame instead
    return df
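As a side note, the code above finds each field by searching the tag's markup as a string. A minimal sketch of an alternative is to let BeautifulSoup match the attributes directly. The sample markup, `SAMPLE_HTML`, and `extract_fields` are hypothetical names made up for illustration, and the real Uta-Net pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real pages may be structured differently.
SAMPLE_HTML = """
<div id="wrapper">
  <h2 class="prev_pad">Sample Song</h2>
  <h3 itemprop="byArtist">Sample Artist</h3>
  <h4 itemprop="lyricist">Sample Lyricist</h4>
  <h4 itemprop="composer">Sample Composer</h4>
  <div itemprop="text">first line<br>second line</div>
</div>
"""


def extract_fields(html):
    """Return the five fields as a dict, using '' for anything missing."""
    soup = BeautifulSoup(html, 'html.parser')

    def text_of(tag):
        # find() returns None when nothing matches, so guard before get_text()
        if tag is None:
            return ''
        return tag.get_text(separator='\n', strip=True)

    return {
        'Song_Title': text_of(soup.find('h2', class_='prev_pad')),
        'Artist': text_of(soup.find(attrs={'itemprop': 'byArtist'})),
        'Lyricist': text_of(soup.find(attrs={'itemprop': 'lyricist'})),
        'Composer': text_of(soup.find(attrs={'itemprop': 'composer'})),
        'Lyric': text_of(soup.find(attrs={'itemprop': 'text'})),
    }


fields = extract_fields(SAMPLE_HTML)
print(fields['Song_Title'], '/', fields['Artist'])
```

Matching on the attribute selects only the element that carries it. A string search over `str(div)` also matches any enclosing `div`, because the markup of a parent contains the markup of its children.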
Running the code above defines everything needed for scraping. Running the code below actually fetches the lyrics and the other information. This time, I collected the songs of Minami. The file name and URL are kept in variables so that they are easy to change.
file_name = 'sample'
url = 'https://www.uta-net.com/artist/26099/'
# Other pages can be scraped the same way; uncomment one line to use it instead
# url = 'https://www.uta-net.com/user/ranking/daily.html'
# url = 'https://www.uta-net.com/user/ranking/monthly.html'
create_df(file_name, url)
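One practical point: `DataFrame.to_csv` does not create missing directories, so writing to `csv/sample.csv` fails unless the `csv/` folder already exists. The following is a minimal sketch of one way to prepare the output path and keep the candidate URLs together. `TARGETS` and `csv_path` are hypothetical names, not part of the code above.

```python
from pathlib import Path

# Hypothetical labels for the pages used in this article
TARGETS = {
    'artist': 'https://www.uta-net.com/artist/26099/',
    'daily': 'https://www.uta-net.com/user/ranking/daily.html',
    'monthly': 'https://www.uta-net.com/user/ranking/monthly.html',
}


def csv_path(file_name, out_dir='csv'):
    """Create out_dir if it does not exist and return the CSV path inside it."""
    directory = Path(out_dir)
    # parents=True also creates intermediate folders; exist_ok=True makes reruns safe
    directory.mkdir(parents=True, exist_ok=True)
    return directory / '{}.csv'.format(file_name)
```

Calling `csv_path('sample')` once before `create_df` is enough to make sure the folder is there, and `TARGETS['daily']` can then be passed as the `url` argument when a different page is wanted.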
Here is the data of the music acquired this time. Now you can analyze as many songs as you like.
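As a first step toward analysis, the saved file can be read back with pandas. This is only a sketch, assuming the CSV has the five columns named above and was written with the default row index. `lyric_lengths` is a hypothetical helper name.

```python
import pandas as pd


def lyric_lengths(csv_file):
    """Load a CSV with the five columns above and rank the songs by lyric length."""
    # index_col=0 reads back the row index that to_csv wrote as the first column;
    # dtype=str keeps every field as text
    df = pd.read_csv(csv_file, index_col=0, dtype=str)
    # Empty lyrics are read back as NaN, so treat them as empty strings
    df['Lyric'] = df['Lyric'].fillna('')
    # Number of characters in each cleaned lyric
    df['Lyric_Length'] = df['Lyric'].str.len()
    ranked = df.sort_values('Lyric_Length', ascending=False)
    return ranked[['Song_Title', 'Lyric_Length']]
```

For example, `lyric_lengths('csv/sample.csv')` returns the song titles ordered from the longest lyric to the shortest.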
I found it fun to build something that works the way I intended. This article turned out to be mostly for my own satisfaction, so I would like to update it later (for now, the only explanation of the code is in its comments). I also want to settle on a consistent writing style for my Qiita articles. Next, I plan to try natural language processing.