Horse Racing Data Scraping Flow

First of all, a surprising number of people read and liked the previous article. I am very grateful, and trembling a little. Starting with this one, I will briefly describe the kind of code I wrote.

Horse racing data can be broadly divided into information about the race as a whole and information about each horse entered in it. Rather than cutting out only the target value right away, first grab a larger chunk that contains it. At this stage, do not apply .text yet.

For example, take the following page (from netkeiba.com).

The course type, distance, and track condition are written in the red frame, so that is what I want to acquire. Using Beautiful Soup, it looks like this:

`scr1.py`


from urllib import request

from bs4 import BeautifulSoup

race_id = '201806010101'  # ID of the race you want to acquire data for
url = 'https://db.netkeiba.com/race/%s/' % race_id
response = request.urlopen(url)
bs = BeautifulSoup(response, 'html.parser')

# Grab the whole <span> first; do not apply .text yet
raceinfo = bs.select("span")[6]
print(raceinfo)
#<span>Da right 1200m/the weather:Fine/dirt:Good/Start: 09:55</span>

Only now, for the first time, do we apply .text and .split:

`scr2.py`


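import re

# Split the span text on whitespace once and reuse the tokens
tokens = raceinfo.text.split()
racetype = tokens[0][:1]               # first character of the first token: course type
length = re.sub(r"\D", "", tokens[0])  # keep only the digits: distance in meters
conde = tokens[8]                      # ninth token: track condition
print(racetype, length, conde)
#Da 1200 good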

This gave me the information I wanted: course type, distance, and track condition.
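Picking fields by position works, but it breaks quietly if the layout shifts. As a minimal sketch of an alternative, the span text can be cut at each "/" and each "key : value" pair read by its position among the fields. The function name `parse_race_info` is made up for illustration, and the sample string is my assumption of what the untranslated span text looks like, not something copied from the page.

```python
import re


def parse_race_info(text):
    """Parse a race-info string into a small dict.

    Assumes the layout "<course> / <key> : <weather> / <key> : <condition> / ...".
    """
    # Normalise non-breaking spaces, then cut the string at each "/".
    fields = [part.strip() for part in text.replace("\xa0", " ").split("/")]
    course = fields[0]

    def value_of(field):
        # Return whatever follows the first ":" in a "key : value" field.
        return field.partition(":")[2].strip()

    return {
        "racetype": course[:1],                    # first character: course type
        "length": int(re.sub(r"\D", "", course)),  # digits only: distance in meters
        "weather": value_of(fields[1]),
        "conde": value_of(fields[2]),
    }


# Assumed sample text (hypothetical, for illustration only).
sample = "ダ右1200m / 天候 : 晴 / ダート : 良 / 発走 : 09:55"
info = parse_race_info(sample)
print(info["racetype"], info["length"], info["conde"])
```

Compared with a bare index such as `[8]`, this depends only on the order of the "/"-separated fields, not on how many spaces sit inside each one.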

The advantage of doing it this way shows up when you loop: if you store the big chunk in a variable at the top of the loop, the remaining values can be pulled out of it smoothly, and it is easy to adjust the list indices. Information on other races and on each horse should be obtained in the same way.
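Here is a rough sketch of that loop structure, not the exact code I ran. `collect_races` and `fetch_raceinfo_text` are hypothetical names; the second race ID and both sample strings are made up so the sketch runs without touching the network. In real use, the callable would wrap the urlopen and BeautifulSoup steps from scr1.py.

```python
import re


def collect_races(race_ids, fetch_raceinfo_text):
    """Loop over race IDs and build one row (dict) per race.

    fetch_raceinfo_text is a hypothetical callable: race ID in, span text out.
    """
    rows = []
    for race_id in race_ids:
        # Grab the big chunk once at the top of the loop ...
        tokens = fetch_raceinfo_text(race_id).split()
        # ... then pull each field out of it by index.
        rows.append({
            "race_id": race_id,
            "racetype": tokens[0][:1],
            "length": int(re.sub(r"\D", "", tokens[0])),
            "conde": tokens[8],
        })
    return rows


# Made-up stand-in for the network call, so the sketch runs offline.
fake_pages = {
    "201806010101": "ダ右1200m / 天候 : 晴 / ダート : 良 / 発走 : 09:55",
    "201806010102": "芝右2000m / 天候 : 曇 / 芝 : 稍重 / 発走 : 10:25",
}
rows = collect_races(list(fake_pages), fake_pages.__getitem__)
for row in rows:
    print(row)
```

Because every field comes from the same `tokens` variable, changing which value you take is just a matter of changing one index.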

Also, it is best not to scrape the last 10 years of data in one go. Split the job into several runs, and once you have the data, join the pieces with .concat or .append. One year per run feels about right. (If you start a long run before going to bed or leaving for work, it will usually time out ...)
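A minimal sketch of the year-by-year idea, assuming the data ends up in pandas DataFrames. `scrape_year` is a hypothetical placeholder that returns one fake row per year; a real version would fetch every race of that year. Note that `DataFrame.append` was removed in pandas 2.0, so the sketch uses `pd.concat` only.

```python
import pandas as pd


def scrape_year(year):
    """Hypothetical stand-in for one year's scraping run; returns a DataFrame."""
    # Placeholder row only; a real version would fetch that year's races.
    return pd.DataFrame([{"year": year, "racetype": "ダ", "length": 1200}])


yearly_frames = []
for year in range(2016, 2019):
    # One run per year: if a run times out, only that year has to be redone.
    yearly_frames.append(scrape_year(year))

# Join the yearly pieces into a single table with a fresh 0..n-1 index.
all_races = pd.concat(yearly_frames, ignore_index=True)
print(all_races)
```

In practice it may also be worth writing each year's DataFrame to its own file as soon as it is ready, so that a later failure does not cost the earlier years.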

Also, while fetching you will be tempted to save values with some calculations already applied, but leave that for later. It is a time-consuming job as it is, so ...

The race and horse data were acquired separately according to the above flow.

This one is short, but I am only fetching information and have not done anything special, so that is about it. Next, I will write about how to organize the data and how to evaluate races and horses. From the next article on there will be a lot of horse racing terms, but I will explain them as much as possible.