As part of a horse racing × machine learning project, I am currently working on scraping horse racing information from web pages.
I want to incorporate each racehorse's past results into the model input. To do that, I decided to take the URL of each horse's page from the race result table, access it, and retrieve that horse's past results as of the race day.
The approach comes down to the following three points.
・ Use a unified acquisition method based on CSS selectors.
・ Get the URL of each horse from the race result table and retrieve the past results posted at that URL.
・ Although 12 horses ran this race, use variable-length arguments so that other races (and their horses) can be handled as well.
In an earlier article I introduced extracting the racing environment from a horse racing site with Python scraping. There I used Beautiful Soup for the race environment, but CSS selectors feel better suited to unified processing, so I rewrote that part.
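As a rough illustration of the "unified processing" point, the sketch below pulls the same div#main span elements (the selector used later in this article) with both libraries. The BeautifulSoup half is only an assumed rewrite of the earlier article's approach, not its actual code:
import requests
import lxml.html
from bs4 import BeautifulSoup

url = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
r = requests.get(url)
r.encoding = r.apparent_encoding  # prevent garbled characters

# BeautifulSoup: navigate with find()/find_all() (assumed style of the old code)
soup = BeautifulSoup(r.text, "html.parser")
spans_bs = [s.get_text() for s in soup.find("div", id="main").find_all("span")]

# lxml + cssselect: the whole location is expressed in one selector string
html = lxml.html.fromstring(r.text)
spans_css = [s.text_content() for s in html.cssselect("div#main span")]

print(spans_bs[:3])
print(spans_css[:3])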
scraping_of_race_and_past_horse_result.py
import requests
import lxml.html
import csv

rlt = []        # extraction results
horse_URL = []  # URLs of each raced horse's page
# Get text by scraping with a CSS selector
def get_scarping_data(key_page, css_select_str, *URL):
    # Process every URL that was passed in
    for i in range(len(URL)):
        # Fetch the page
        r = requests.get(URL[i])             # request the URL
        r.encoding = r.apparent_encoding     # prevent garbled characters
        html = lxml.html.fromstring(r.text)  # parse the retrieved HTML
        # Walk through the elements matched by the CSS selector
        for element in html.cssselect(css_select_str):
            # Race environment on the race result page
            if key_page == "condition_in_race_page":
                text = element.text_content()
                # The environment line always contains the weather label
                # ("天候" on the Japanese page), so use that as the filter
                if "天候" in text:
                    # Split on the "\xa0/\xa0" separators and store the row
                    rlt.append(text.split('\xa0/\xa0'))
            # Race results on the race result page
            if key_page == "result_in_race_page":
                # Split the row text on newlines ("\n")
                cells = element.text_content().split("\n")
                # List comprehension: drop empty strings
                css_id_ = [tag for tag in cells if tag != '']
                # The 1st-place row has no time record, so a 0 has to be
                # forcibly inserted at index 8 to keep every row at 13 columns
                if len(css_id_) != 13:
                    css_id_.insert(8, 0)
                # Append the row to the result list
                rlt.append(css_id_)
            # Past results on the raced horse's own page
            if key_page == "horse_race_data":
                # Split the row text on newlines ("\n")
                cells = element.text_content().split("\n")
                # List comprehension: drop empty strings, "\xa0" and the
                # non-data columns (映像 "Video", 厩舎コメント "Stable comment",
                # 備考 "Remarks")
                css_id_ = [tag for tag in cells
                           if tag not in ('', '\xa0', '映像', '厩舎コメント', '備考')]
                # Append the row to the result list
                rlt.append(css_id_)
    # Return the extraction results
    return rlt
# Get the URL of each horse that ran in the race
def get_scarping_past_horse_date(URL):
    response = requests.get(URL)
    root = lxml.html.fromstring(response.content)
    # Get the link for each of the 12 runners (table rows 2 to 13)
    for i in range(2, 14):  # 13 - 2 + 1 = 12
        css_select_str = "div#race_main tr:nth-child({}) > td:nth-child(4) > a".format(i)
        # Store the href of each raced horse's page
        for a in root.cssselect(css_select_str):
            horse_URL.append(a.get('href'))
    # Return the collected URLs
    return horse_URL
# First, work from the original race result page
URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
# Get the racing environment
rlt = get_scarping_data("condition_in_race_page", "div#main span", URL)
# Get the race results
rlt = get_scarping_data("result_in_race_page", "#race_main > div > table > tr", URL)
# Get the URLs of the horses that ran
horse_URL = get_scarping_past_horse_date(URL)
print(len(horse_URL))  # prints 12 for the 12 runners
# Get past results from the horse URLs just collected
# Header row (the column names, one line) from the first horse's page
rlt = get_scarping_data("horse_race_data", "#contents > div.db_main_race.fc > div > table > thead > tr", horse_URL[0])
# Data rows (everything except the header) from every horse's page
rlt = get_scarping_data("horse_race_data", "#contents > div.db_main_race.fc > div > table > tbody > tr", *horse_URL)
# Save to a CSV file
with open("scraping_of_race_and_past_horse_result.csv", 'w', newline='') as f:
    wrt = csv.writer(f)
    wrt.writerows(rlt)  # write the extraction results
・ I got the URL of each horse's page from the race result and was able to retrieve its past results.
・ Past results can now be obtained, but the latest races are also listed, so the results need to be narrowed down to those from before the date of the race being predicted.
・ The acquired information has not yet been fed into the neural network.
・ The day-of-race data should be acquired automatically rather than checked by hand every time. One approach is to exploit the regularity of the race id (the 12 digits at the end of the URL) and increment it, as sketched below.
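A minimal sketch of that incrementing idea (the helper name and starting id are illustrative, and not every generated id corresponds to an existing race page, so each URL would still need to be validated):
# Sketch: generate candidate race URLs by incrementing the 12-digit race id
BASE_URL = "https://nar.netkeiba.com/?pid=race&id=p{:012d}"

def candidate_race_urls(start_id=201942100701, count=5):
    # Build URLs for start_id, start_id + 1, ..., start_id + count - 1
    return [BASE_URL.format(start_id + i) for i in range(count)]

for url in candidate_race_urls():
    print(url)  # check by hand (or with requests) which ids actually exist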
To scrape a page like this, you need to master the following selector string: '#contents > div.db_main_race.fc > div > table > tbody > tr' ↑ Below is a summary of how to get this part.
how_to_search_css_sector.py
for h in html.cssselect('#contents > div.db_main_race.fc > div > table > tbody > tr'):#Specify the scraping location with the CSS selector
At first I just copied or wrote code that looked about right, but it didn't work. Looking into it, CSS selectors turned out to be the key. To find out which CSS selector identifies the item you want, use Chrome's Copy Css Selector extension.
Once it is installed, a Copy Css Selector entry appears in the right-click menu. Run Copy Css Selector on the item you want and paste the result into a text editor to check the selector.
Alternatively, with the browser's developer tools you can right-click within the range of the desired information and choose Copy → Copy selector; this sometimes works as well.
With these two methods, you should be able to get the CSS selector you need.
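Once a selector has been copied, it is worth sanity-checking it in Python before wiring it into the scraper. A small sketch using the race URL and one of the selectors from this article:
import requests
import lxml.html

url = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
selector = "#race_main > div > table > tr"  # pasted from Copy Selector

r = requests.get(url)
r.encoding = r.apparent_encoding
html = lxml.html.fromstring(r.text)

rows = html.cssselect(selector)
print(len(rows))  # 0 means the selector did not match anything
for row in rows[:3]:
    print(row.text_content().strip()[:80])  # peek at the first few matches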