This topic has surely been covered many times before, but here I will describe how to pull information from a horse racing site into CSV format. The language is Python 3.6 and the environment is Jupyter Notebook.
I am new to Python, so some of the code may be verbose and there are probably smarter techniques, but the goal this time is not beautiful code; please treat that as a future improvement.
We will extract information from the following site. Site name: netkeiba.com (https://www.netkeiba.com/?rf=logo)
On netkeiba.com there is a web page for each race, and the URL of that page follows this rule:
https://race.netkeiba.com/race/result.html?race_id=[year][racecourse code][meeting number][day number][race number]&rf=race_list
As an example, let's take the 1st race on the 4th day of the 2nd Tokyo meeting, held at Tokyo Racecourse on May 3, 2020.

- Year: 2020
- Racetrack code: 05 (Tokyo)
- Meeting number: 02 (2nd meeting)
- Day number: 04 (4th day)
- Race number: 01 (1st race)
Racetrack name | Racetrack code |
---|---|
Sapporo | 01 |
Hakodate | 02 |
Fukushima | 03 |
Niigata | 04 |
Tokyo | 05 |
Nakayama | 06 |
Chukyo | 07 |
Kyoto | 08 |
Hanshin | 09 |
Kokura | 10 |
Applying the rule above gives the following URL: https://race.netkeiba.com/race/result.html?race_id=202005020401&rf=race_list
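To make the rule concrete, here is a minimal sketch of a helper that assembles a race_id from its parts. The function build_race_id is my own illustration and does not appear in the original code:

```python
# Minimal sketch (not from the original article): assemble a 12-digit race_id
# from year, racecourse code, meeting number, day number and race number.
def build_race_id(year, course_code, meeting, day, race_no):
    return "%04d%02d%02d%02d%02d" % (year, course_code, meeting, day, race_no)

race_id = build_race_id(2020, 5, 2, 4, 1)
print("https://race.netkeiba.com/race/result.html?race_id=" + race_id + "&rf=race_list")
# -> https://race.netkeiba.com/race/result.html?race_id=202005020401&rf=race_list
```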
Here is the code, from generating the URL to fetching the page.
web1.ipynb
```python
# -*- coding: utf-8 -*-
import csv
import requests
import codecs
import time
from datetime import datetime as dt
from collections import Counter
from bs4 import BeautifulSoup
import re
import pandas

# Components of the race_id: year, racecourse code, meeting number, day number, race number
race_date = "2020"
race_course_num = "06"
race_info = "03"
race_count = "05"
race_no = "01"

url = "https://race.netkeiba.com/race/result.html?race_id=" + race_date + race_course_num + race_info + race_count + race_no + "&rf=race_list"

# Fetch the page at the URL as HTML
race_html = requests.get(url)
race_html.encoding = race_html.apparent_encoding
race_soup = BeautifulSoup(race_html.text, 'html.parser')

print(url)
```
Running the above prints the generated URL.
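Before parsing, it can also be worth checking that the request actually succeeded. A small sanity check of my own, using the standard status_code attribute of the requests response:

```python
# Optional check (my addition): make sure the page was fetched successfully
# before handing the HTML to BeautifulSoup.
if race_html.status_code != 200:
    raise RuntimeError("Failed to fetch the race page: HTTP %d" % race_html.status_code)
```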
Next is the code that extracts the race result table from the HTML we obtained. (Append this to the code above.)
web1.ipynb
```python
# Get only the rows of the race result table
HorseList = race_soup.find_all("tr", class_="HorseList")

# --- Shape the race result table ---
# List to hold the formatted rows
Race_lists = []

# Number of columns in the table = 15
# (Finish position, bracket, horse number, horse name, sex/age, jockey weight, jockey,
#  time, margin, popularity, win odds, last 3F, corner passing order, stable, horse weight (change))
Race_row = 15

# Number of runners
uma_num = len(HorseList)

# Strip the unnecessary strings from each row and store it in the list
for i in range(uma_num):
    Race_lists.append(HorseList[i])
    Race_lists[i] = re.sub(r"\n", "", str(Race_lists[i]))
    Race_lists[i] = re.sub(r" ", "", str(Race_lists[i]))
    Race_lists[i] = re.sub(r"</td>", ",", str(Race_lists[i]))
    Race_lists[i] = re.sub(r"<[^>]*?>", "", str(Race_lists[i]))
    Race_lists[i] = re.sub(r"\[", "", str(Race_lists[i]))
    Race_lists[i] = re.sub(r"\]", "", str(Race_lists[i]))
    print(Race_lists[i])
```
When the above is executed, the output looks like the following (one line per runner):

```
1,1,1,Red Calm,Female 3,54.0,Shu Ishibashi,1:25.7,,3,4.6,37.1,,Takeshi Miho Okumura,512(-4),
2,6,12,Sanky West,Female 3,54.0,Iwabe,1:25.7,Hana,2,3.2,36.5,,Miho Kayano,442(-8),
(Omitted below)
```
Other tables on the page can be obtained in a similar way, with minor differences.
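As an aside, since pandas is already imported, another quick way to grab any table on the page is pandas.read_html, which returns every table as a DataFrame. This is my own suggestion rather than the approach used in this article, and it needs lxml or html5lib installed as a parser backend:

```python
# Alternative sketch (my addition): let pandas parse every <table> on the page.
# Requires lxml or html5lib to be installed.
tables = pandas.read_html(race_html.text)
print(len(tables))       # number of tables found on the page
print(tables[0].head())  # first table as a DataFrame
```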
Now that we have the information we want, let's save it as a CSV file. (Append this to the code above.)
web1.ipynb
```python
# Open the CSV file
out = codecs.open("./this_race_table" + race_date + race_course_num + race_info + race_count + race_no + ".csv", "w")

# Write the column names as a header row for readability (strictly speaking this is not required)
out.write("Finish position,Bracket,Horse number,Horse name,Sex/age,Jockey weight,Jockey,Time,Margin,Popularity,Win odds,Last 3F,Corner passing order,Stable,Horse weight (change)\n")

# Write each row of the race table list to the CSV
for i in range(uma_num):
    out.write(str(Race_lists[i] + "\n"))

out.close()
```
When you run the above, the CSV file is created in the same folder as the notebook.
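If you want CSVs for more than one race, the steps above can be wrapped in a loop. A rough sketch of my own, assuming the same page structure for every race; the time.sleep call is there so the requests do not hammer the server:

```python
# Rough sketch (my addition): fetch races 1-12 of the same meeting day,
# pausing between requests so as not to overload the server.
for no in range(1, 13):
    race_no = str(no).zfill(2)
    race_id = race_date + race_course_num + race_info + race_count + race_no
    url = "https://race.netkeiba.com/race/result.html?race_id=" + race_id + "&rf=race_list"
    race_html = requests.get(url)
    race_html.encoding = race_html.apparent_encoding
    race_soup = BeautifulSoup(race_html.text, 'html.parser')
    # ... repeat the table extraction and CSV writing shown above ...
    time.sleep(1)  # be polite to the server
```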
That is all for the scraping part.
Please note that web scraping (crawling) can get you into legal trouble, as the Librahack case shows, so be careful how and where you use it.