Getting runners' past performance from a horse racing site with Python scraping

Background

I am currently scraping horse racing information from web pages for a horse racing × machine learning project.

I want to include each racehorse's past results in the model input. So I decided to get the URL of each horse's page from the race result table, access it, and collect the horse's past results as of race day.

Concept

(Image: ss図1.png)

Sample code

The approach rests on three points:
・ Use CSS selectors as the single method of extraction.
・ Get the URL of each horse from the race result table, then get the past results listed at that URL.
・ There are 12 runners in this race, but the URLs are passed as variable-length arguments so that other numbers of horses can be handled.

In a previous article, "Python scraping Extract racing environment from horse racing site", I used Beautiful Soup to extract the race environment. Using CSS selectors throughout makes the processing more uniform, so I rewrote that part here.

scraping_of_race_and_past_horse_result.py


import requests
import lxml.html 
import csv

rlt = []  # Accumulated result rows (shared across calls)
horse_URL = []  # URLs of each horse's page

# Scrape text from each URL using a CSS selector
def get_scarping_data(key_page,css_select_str,*URL):
    # Process every URL that was passed in
    for i in range(len(URL)):
        # Fetch the page
        r = requests.get(URL[i])
        r.encoding = r.apparent_encoding  # Prevent garbled characters
        html = lxml.html.fromstring(r.text)  # Parse the HTML text
        # Iterate over the elements matched by the selector
        for css_id in html.cssselect(css_select_str):        
            # Race conditions on the race result page
            if key_page == "condition_in_race_page" :
                # Text content of the element
                css_id = css_id.text_content()
                # Only the element containing "weather" holds the race conditions, so skip the others
                # (the literal must match the text as it appears on the page)
                if "weather" not in css_id:
                    continue
                css_id_ = css_id.split('\xa0/\xa0')
                # Append the row to the result list
                rlt.append(css_id_)
            # Result table on the race result page
            if key_page == "result_in_race_page" :
                # Text content of the row
                css_id = css_id.text_content()
                # Split on newlines
                css_id = css_id.split("\n")
                # Drop empty strings
                css_id_ = [tag for tag in css_id if tag != '']
                # The 1st-place row is missing one value, so insert 0 at index 8 to keep 13 columns
                if len(css_id_) != 13 : css_id_.insert(8,0)
                # Append the row to the result list
                rlt.append(css_id_)
            # Past-performance table on each horse's page
            if key_page == "horse_race_data" :
                # Text content of the row
                css_id = css_id.text_content()
                # Split on newlines
                css_id = css_id.split("\n")
                # Drop empty strings, "\xa0", and cells that carry no data ("Video" etc.)
                css_id_ = [tag for tag in css_id
                           if tag not in ('', "\xa0", "Video", "Stable comment", "Remarks")]
                # Append the row to the result list
                rlt.append(css_id_)
    # Return the accumulated rows
    return rlt

# Get the URL of each horse's page from the race result table
def get_scarping_past_horse_date(URL):
    response = requests.get(URL)
    root = lxml.html.fromstring(response.content)
    # Rows 2 to 13 of the result table: up to 12 horses (13 - 2 + 1 = 12)
    for i in range(2,14):
        css_select_str = "div#race_main tr:nth-child({}) > td:nth-child(4) > a".format(i)
        # Collect the link in the 4th column of that row
        for a in root.cssselect(css_select_str):
            horse_URL.append(a.get('href'))
    # Return the collected URLs
    return horse_URL

# Start from the race result page
URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
# Get the race conditions
rlt = get_scarping_data("condition_in_race_page","div#main span",URL)
# Get the race results
rlt = get_scarping_data("result_in_race_page","#race_main > div > table > tr",URL)
# Get the URL of each horse's page
horse_URL = get_scarping_past_horse_date(URL)
print(len(horse_URL))  # Prints 12 for a 12-horse race
# Get past results from the horse URLs
# Header row (fetched once, from the first horse)
rlt = get_scarping_data("horse_race_data", "#contents > div.db_main_race.fc > div > table > thead > tr",horse_URL[0])
# Data rows for every horse
rlt = get_scarping_data("horse_race_data", "#contents > div.db_main_race.fc > div > table > tbody > tr",*horse_URL)
# Save to a CSV file
with open("scraping_of_race_and_past_horse_result.csv", 'w', newline='') as f:
    wrt = csv.writer(f)
    wrt.writerows(rlt)  # Write the accumulated rows
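
The text cleanup done inside get_scarping_data can also be checked on its own, without network access. The following is a minimal sketch under assumptions: the helper names clean_row and pad_row are hypothetical, and the sample rows are made up with 5 columns instead of the 13 on the real page.

```python
def clean_row(text, skip=()):
    """Split a row's text on newlines and drop empty or unwanted cells."""
    return [cell for cell in text.split("\n") if cell != "" and cell not in skip]


def pad_row(cells, width, index, filler=0):
    """Return the row with filler inserted at index if it is shorter than width."""
    if len(cells) < width:
        return cells[:index] + [filler] + cells[index:]
    return cells


# Made-up rows: rank, number, name, gap to the horse ahead, popularity
winner_text = "\n1\n3\nHorse A\n\n2\n"       # the winner has no gap value
second_text = "\n2\n1\nHorse B\n1/2\n1\n"
past_text = "\n2019/10/15\n\xa0\nVideo\nTrack Y\n4\n"

winner = pad_row(clean_row(winner_text), width=5, index=3)
second = pad_row(clean_row(second_text), width=5, index=3)
past = clean_row(past_text, skip=("\xa0", "Video"))
print(winner, second, past)
```

Keeping the cleanup in small functions like these would make it possible to test the column handling before running the scraper against the site.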

Execution result

(Image: sss.png)

Reflections

・ I obtained the URL of each horse's page from the race results and was able to get the horse's past results.
・ However, the horse pages also list results more recent than the race, so the results still need to be limited to those before the race date.
・ The acquired data has not yet been fed into a neural network.
・ I want race pages to be acquired automatically instead of checking for them each time. One approach is to increment the race ID (the 12 digits at the end of the URL) according to its pattern.
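
Limiting the results to those before the race date could look like the following. This is a minimal sketch under assumptions, not something I have run against the site: the function name results_before is hypothetical, the date is assumed to be in the first column in YYYY/MM/DD form, and the sample rows and race date are made up.

```python
from datetime import date, datetime


def results_before(rows, race_day, date_index=0, date_format="%Y/%m/%d"):
    """Keep only the rows whose date is strictly earlier than race_day."""
    kept = []
    for row in rows:
        try:
            row_day = datetime.strptime(row[date_index], date_format).date()
        except (ValueError, IndexError):
            # Header rows and rows without a parsable date are skipped
            continue
        if row_day < race_day:
            kept.append(row)
    return kept


# Made-up rows in the assumed layout: date, track, rank
rows = [
    ["Date", "Track", "Rank"],
    ["2019/12/20", "Track X", "2"],
    ["2019/11/29", "Track X", "1"],
    ["2019/10/15", "Track Y", "4"],
]
past = results_before(rows, date(2019, 11, 29))
print(past)
```

The comparison is strict, so the race itself is excluded along with anything later. The real column position and date format would have to be checked on the horse page.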

How to find the CSS selector

To scrape, you need to work out a selector string like the following: '#contents > div.db_main_race.fc > div > table > tbody > tr'. This section summarizes how to obtain it.

how_to_search_css_sector.py


# Specify the scraping location with the CSS selector (html is the parsed page, as in the script above)
for h in html.cssselect('#contents > div.db_main_race.fc > div > table > tbody > tr'):
    print(h.text_content())

At first I just copied code or wrote something that looked similar, but it didn't work. When I looked into it, it turned out that CSS selectors are involved. To find the CSS selector for the item you want, use the Copy Css Selector tool for Chrome.

(Image: 無題4.png)

Once it is installed, a Copy Css Selector item appears in the right-click menu. Run it on the item you want and paste the result into a text editor to see the CSS selector.

(Image: 無題4.png)

Alternatively, in the developer tools you can select the element you want and copy its CSS selector with Copy → Copy selector. This sometimes works as well.

(Image: 無題5.png)

With these two methods, you should be able to get the CSS selector.
