This article walks through the internal code of "Today, do you have a good prediction?", a boat race trifecta prediction site that I built and published on the web. This time I will cover web scraping.
Since I want to build a trifecta prediction site for boat races using machine learning, I first need to obtain past race results as training data. The minimum information I want is:
--The boat race stadium (kyoteiba)
Beyond that, other information I would like includes:
--Which race of the day (race number)
--The weather that day
--Motor information
and so on. The current official Boat Race website maintains its data well, and past race results can also be looked up there.
This time, I will fetch the race results that serve as the source of the training data from there!
As background knowledge about boat races: races are held essentially 365 days a year at some subset of the 24 stadiums nationwide. So, after working out the URL structure, I decided to fetch race information for (the desired number of days) x (24 stadiums). (If no races were held at a stadium that day, the process simply skips it.)
Having worked out the URL structure, I prepared a list holding the URLs as follows. The code below only fetches data for 2020/6/22, but the idea is that by extending the year, month, and day lists you can generate URLs for other dates as well.
import pandas as pd   # not used yet; for the later DataFrame step
import numpy as np    # not used yet; for the later DataFrame step

url_list = []  # renamed from "list" so the built-in is not shadowed
year = ['2020']
month = ['06']
day = ['22']
site = ['01','02','03','04','05','06','07','08','09','10','11','12',
        '13','14','15','16','17','18','19','20','21','22','23','24']

for i in year:
    for j in month:
        for k in day:
            for l in site:
                url_list.append("url name is described here")  # the URL is built from i, j, k, l
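As a side note, the same list can be built more compactly with itertools.product from the standard library instead of four nested loops. This is just a sketch: the real URL format is still omitted, and build_url below is a hypothetical helper standing in for it.

from itertools import product

# Hypothetical helper standing in for the real URL format (omitted above).
def build_url(y, m, d, s):
    return f"url for {y}/{m}/{d} at stadium {s}"

# product() iterates over every (year, month, day, site) combination.
url_list = [build_url(y, m, d, s)
            for y, m, d, s in product(year, month, day, site)]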
Here is the fetching code. When scraping, be sure to insert a wait between requests so that you do not overload the target web server.
import requests
from time import sleep
from bs4 import BeautifulSoup

path_w = 'bs_2020_0622.txt'
list_errorlog = []  # note the stadiums that had no races that day, for the time being

for m in range(len(url_list)):
    try:
        res = requests.get(url_list[m])
        res.encoding = res.apparent_encoding  # guess the encoding from the response body
        res.raise_for_status()                # raise on 404 etc. (e.g. no races that day)
        with open(path_w, mode='a', encoding='utf-8') as f:
            txt = res.text
            soup = BeautifulSoup(txt, 'html.parser')
            f.write(soup.get_text())          # append the plain text of the page
        sleep(5)  # Do not remove! (wait so as not to overload the server)
    except Exception:
        sleep(5)  # Do not remove!
        list_errorlog.append(url_list[m] + " does not exist")

print(list_errorlog)
In this code, I am doing the following:
--Fetch each URL with requests
--Extract the plain text of the page with BeautifulSoup's get_text()
--Append the result to a text file
This is fine because the pages referenced here have a fairly simple structure, but for more elaborate pages I think you would need to parse the HTML tags properly.
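For reference, here is a minimal sketch of what tag-level scraping could look like with BeautifulSoup. The tag and class names below are hypothetical and would need to match the actual page structure.

import requests
from bs4 import BeautifulSoup

res = requests.get(url_list[0])
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'html.parser')

# Hypothetical selectors: the real tag/class names depend on the page.
for row in soup.find_all('tr', class_='result-row'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells)

The advantage over a plain get_text() dump is that each value lands in a known position, which makes the later DataFrame conversion much easier.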
The output looks good. Next time, I would like to convert this text data into a DataFrame format that machine learning can work with. Scraping really is impressive. (Even if I am a bit late to the trend...)
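As a small preview of that next step, here is a minimal sketch that just loads the saved text back in. How each line is actually split into columns is left as a placeholder for the next article.

import pandas as pd

with open('bs_2020_0622.txt', encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]

# Placeholder: in reality each line would be parsed into proper columns
# according to the layout of the race result text.
df = pd.DataFrame({'raw_text': lines})
print(df.head())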