Horse racing has a game called POG (Paper Owner Game). Players act as virtual owners and compete on how well the horses they picked perform. The usual yardstick is the prize money a horse earns between its debut and the Derby.
I have been playing POG for about five years now. Until now, I have managed to pick reasonably good horses by relying on Aomoto, the guidebook that comes out around April every year.
However, the results have been dismal: as of the end of this year, only 2 of my 10 horses have won a race.
To get out of this rut, I decided to take data analysis seriously. That said, I do not have the skills to freely wield the machine learning that is so popular these days. So the immediate goal is to find a causal relationship between basic horse information (stable, breeder, pedigree) and the prize money earned during the POG period.
The key quantity here is the "prize money won during the POG period". As far as I know, no website publishes this figure. Some services may offer it for a fee, but I do not want to pay while it is unclear whether it is worth the cost.
So I decided to fetch each horse's basic information and race history from netkeiba and compute the POG-period prize money from the race history myself.
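The prize calculation itself is simple once the race history is in hand. Here is a minimal sketch of the idea, written for Python 3 and independent of netkeiba. The function name `pog_prize` and the sample races are made up for illustration, and the cut-off dates (1 January and 1 July of the horse's three-year-old year) are an assumption that mirrors the script in this post.

```python
from datetime import datetime


def pog_prize(birth_year, races):
    """Return (half_period, full_period) prize totals for one horse.

    races: iterable of (date string 'YYYY/MM/DD', prize) pairs.
    """
    # Assumed cut-offs: 1 Jan and 1 Jul of the three-year-old year.
    deadline_half = datetime(birth_year + 3, 1, 1)
    deadline_full = datetime(birth_year + 3, 7, 1)
    half = full = 0.0
    for date_str, prize in races:
        r_date = datetime.strptime(date_str, "%Y/%m/%d")
        if r_date < deadline_full:
            full += prize
        if r_date < deadline_half:
            half += prize
    return half, full


# Hypothetical race history for a horse born in 2010.
races = [("2012/08/05", 700.0), ("2013/02/10", 1000.0), ("2013/07/14", 1500.0)]
print(pog_prize(2010, races))  # (700.0, 1700.0)
```

The race in July 2013 falls after both cut-offs, so it is not counted in either total.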
The language is Python, simply because it is what I am used to.
I collect data on horses born in the four years from 2010 to 2013. The four years are fetched at the same time, one worker process per year, using four cores. ~~It depends on the machine specs, but on my MBA the collection finished in about an hour.~~
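As a rough check on the scale of this crawl, the sketch below (Python 3) builds the candidate horse IDs for one birth year and estimates a lower bound on the running time from the pause between requests alone. The ID range and the 10-second pause are taken from the script; the helper names `horse_ids` and `min_runtime_hours` are made up here, and download and parsing time are ignored.

```python
def horse_ids(year, idx_from=100000, idx_to=111000):
    """Yield the candidate IDs probed for one birth year."""
    for idx in range(idx_from, idx_to + 1):
        yield str(year) + str(idx).zfill(6)


def min_runtime_hours(n_requests, sleep_s=10):
    """Lower bound on running time from the sleep between requests."""
    return n_requests * sleep_s / 3600.0


ids = list(horse_ids(2010))
print(ids[0], ids[-1], len(ids))             # 2010100000 2010111000 11001
print(round(min_runtime_hours(len(ids)), 1))  # 30.6
```

Because each year runs in its own process, the four years share this lower bound rather than adding up.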
MakeUmaDB_151229.py
#!/usr/bin/env python
# encoding: utf-8
import urllib2 as ul
import pandas as pd
import os
import time
import datetime
from lxml import html
import multiprocessing as mp
__PROC__ = 4
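def MakeDir(dname):
    # Four worker processes call this at almost the same time, so another
    # process may create the directory between a check and os.mkdir().
    try:
        os.mkdir(dname)
        print 'Make directory:%s' % dname
    except OSError:
        if not os.path.isdir(dname):
            raise
        print '%s already exists' % dname
    return 0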
def subMakeHorseDB(year):
    # output directory
    o_dname = 'horse_db'
    MakeDir(o_dname)
    # profile fields, in the column order of the output CSV
    prof_keys = [
        u'Horse name',
        u'Birthday',
        u'Trainer',
        u'Horse owner',
        u'Producer',
        u'Origin',
        u'Auction transaction price',
        u'father',
        u'mother',
        u'Mother father',
        u'POG period prize_half period',
        u'POG period prize_Year-round'
    ]
    # fetch horse data from the web site
    base_url = 'http://db.netkeiba.com/horse/'
    idx_from = 100000
    idx_to = 111000
    masta_d = {}
    for idx in range(idx_from, idx_to + 1):
        try:
            # download the page, pausing between requests
            time.sleep(10)
            s_idx = str(year) + str(idx).zfill(6)
            url = base_url + s_idx
            src_html = ul.urlopen(url).read()
            root = html.fromstring(src_html)
            # show progress
            print 'idx: %s, (%d, %d/%d)' % (s_idx, year, idx, idx_to)
            # skip IDs that have no horse page (title starts with '|')
            if root.xpath('//title')[0].text.startswith(u'|'):
                continue
            # parse the profile
            masta_d[s_idx] = {}
            for prof in prof_keys:
                if prof == u'Horse name':
                    horse_name = root.xpath('//div[@class="horse_title"]')[0].text_content().split('\n')[1]
                    masta_d[s_idx][prof] = horse_name
                elif prof == u'father':
                    masta_d[s_idx][prof] = root.xpath('//td[@rowspan="2"][@class="b_ml"]')[0].text_content().split('\n')[1]
                elif prof == u'mother':
                    masta_d[s_idx][prof] = root.xpath('//td[@rowspan="2"][@class="b_fml"]')[0].text_content().split('\n')[1]
                elif prof == u'Mother father':
                    masta_d[s_idx][prof] = root.xpath('//td[@class="b_ml"]')[2].text_content().split('\n')[1]
                elif prof == u'POG period prize_half period' or prof == u'POG period prize_Year-round':
                    # computed from the race history below
                    continue
                elif prof == u'Birthday':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[0].text_content()
                elif prof == u'Trainer':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[1].text_content()
                elif prof == u'Horse owner':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[2].text_content()
                elif prof == u'Producer':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[3].text_content()
                elif prof == u'Origin':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[4].text_content()
                elif prof == u'Auction transaction price':
                    masta_d[s_idx][prof] = root.xpath('//table[@class="db_prof_table"]/tr/td')[5].text_content()
            # calc POG prize: sum the prizes of races run before each deadline
            prize_all = 0.0
            prize_half = 0.0
            deadline_all = datetime.datetime.strptime('%d-07-01' % (year + 3), '%Y-%m-%d')
            deadline_half = datetime.datetime.strptime('%d-01-01' % (year + 3), '%Y-%m-%d')
            r_hist = root.xpath('//table[@class="db_h_race_results nk_tb_common"]')
            if len(r_hist) == 0:
                # no race history: the horse earned nothing
                masta_d[s_idx][u'POG period prize_half period'] = '%d' % prize_half
                masta_d[s_idx][u'POG period prize_Year-round'] = '%d' % prize_all
            else:
                r_hist_l = root.xpath('//table[@class="db_h_race_results nk_tb_common"]/tbody/tr')
                for race in r_hist_l:
                    r_date = datetime.datetime.strptime(race.text_content().split('\n')[1], '%Y/%m/%d')
                    try:
                        prize = float(race.text_content().split('\n')[-2].replace(',', ''))
                    except (ValueError, IndexError):
                        # an empty prize cell means no prize money
                        prize = 0.0
                    if r_date < deadline_all:
                        prize_all += prize
                    if r_date < deadline_half:
                        prize_half += prize
                masta_d[s_idx][u'POG period prize_half period'] = '%.2f' % prize_half
                masta_d[s_idx][u'POG period prize_Year-round'] = '%.2f' % prize_all
        except Exception:
            # skip pages that fail to download or parse (Ctrl-C still works)
            pass
    # make data frame: one row per horse, columns in prof_keys order
    df = pd.DataFrame(masta_d).T
    o_df = df[prof_keys]
    o_df.index.name = 'Index'
    o_fname = 'horse_prof_%d.csv' % year
    o_fpath = os.path.join(o_dname, o_fname)
    o_df.to_csv(o_fpath, encoding='utf-8')
def main():
    year_l = [2010, 2011, 2012, 2013]
    pool = mp.Pool(__PROC__)
    pool.map(subMakeHorseDB, year_l)
if __name__ == '__main__':
    main()
    raw_input('Press Enter to Exit\n')
Next, I want to dig into the CSV files this script produces and look for the patterns behind winning at POG.
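As a first step in that direction, one simple idea is to average the POG-period prize money per sire (or per trainer, or per breeder). Below is a minimal Python 3 sketch using only the standard library. The helper `mean_prize_by` and the sample rows are made up for illustration; only the column names follow the script's output.

```python
import csv
import io
from collections import defaultdict


def mean_prize_by(rows, key, prize_col="POG period prize_Year-round"):
    """Average the prize column over rows that share the same `key` value."""
    totals = defaultdict(lambda: [0.0, 0])  # key value -> [sum, count]
    for row in rows:
        totals[row[key]][0] += float(row[prize_col])
        totals[row[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


# Made-up rows in the same layout as horse_prof_<year>.csv (columns trimmed).
sample = io.StringIO(
    "Index,father,POG period prize_Year-round\n"
    "2010100001,Sire A,1200.00\n"
    "2010100002,Sire A,0.00\n"
    "2010100003,Sire B,300.00\n"
)
result = mean_prize_by(csv.DictReader(sample), "father")
print(result)  # {'Sire A': 600.0, 'Sire B': 300.0}
```

With pandas, reading the real files with `pd.read_csv` and calling `groupby('father')` on the prize column would give the same averages in fewer lines.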