Do you like horse racing? I'm a beginner who only started this year, but collecting all kinds of information and making predictions is really fun!
At first, predicting alone was enough, but soon the feeling of **"I don't want to lose money"** took over.
So, while browsing the internet wondering whether there was an easy way to win, I came across the idea of predicting horse racing with machine learning. It looked interesting, so I decided to study up and give it a try.
The payout rate of horse racing is said to be around 70 to 80%, so if you bet naively, your recovery rate should converge to about that level.
For now, then, I'd like to aim for a **recovery rate of 100% or more** using only data available before the race!
There are many ways to approach horse racing prediction, such as simply predicting the finishing order, or optimizing the betting strategy by taking the odds into account. There are also many types of betting tickets to choose from.
This time, I'd like to split the finishing order into three groups (top 3, middle, and bottom) and perform **multi-class classification**.
I will then buy a **win** ticket on the horse the model ranks first. The reason is that the payout rate for win bets is set higher than for bet types that easily produce expensive tickets, such as the trifecta. (Reference: Buena's Horse Racing Blog - Knowledge for not losing with betting tickets)
Also, I will not use popularity or odds as features. Using information that isn't settled until just before the race seemed questionable, and I also thought it wouldn't be interesting to simply buy the popular horses. (Horse weight, which is announced about 50 minutes before post time, is used as a feature.)
This time, I will focus on races at Tokyo Racecourse and proceed as follows. I narrowed it down to one racetrack because scraping the race data takes a long time, due to my inefficient algorithm and a 1-second time.sleep between requests. (Collecting the 2008 to 2019 data took about 50 hours ...)
It's a hassle, but given enough time you can collect data for every racetrack the same way.
I collected the data by scraping this site (netkeiba.com). As far as I could tell from robots.txt (which didn't exist in the first place) and the terms of use, scraping seemed to be allowed, so I just made sure not to overload the server. I referred to the following articles for the scraping method.

- Python3 Crawling & Scraping
- Scraping with Python and Beautiful Soup
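As a rough illustration of the kind of polite scraping loop this implies (the URL pattern, race-ID format, and page encoding are my assumptions, not taken from the article):

```python
import time
import requests
from bs4 import BeautifulSoup

# Hypothetical netkeiba race-result page ids (the format is an assumption)
race_ids = ['200805010101']

for race_id in race_ids:
    res = requests.get(f'https://db.netkeiba.com/race/{race_id}')
    res.encoding = 'EUC-JP'  # the pages are not UTF-8 (assumption)
    soup = BeautifulSoup(res.text, 'html.parser')
    # ... parse the result table from `soup` here ...
    time.sleep(1)  # the 1-second wait mentioned above, to avoid overloading the site
```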
The collected data looks like this.
When scraping, I dropped horses that do not have data for their three most recent races. This is because I don't think you can predict the future for something with no past information.
Horses that raced at local or overseas tracks may also be missing values such as the time index; those gaps are filled with the column's average value.
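For example, a missing index can be filled with the column average like this (a minimal pandas sketch; the column names are the past-race features from the tables below):

```python
# Fill missing past-race indexes (e.g. from local or overseas runs)
# with the average of the values that are present in each column.
for col in ['p_time_num', 'p_condi_num']:
    df1[col] = df1[col].fillna(df1[col].mean())
```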
This time, the following items were used as features.

**Race-day data**
Variable name | Contents |
---|---|
kai | Which meeting of the year |
day | Which day of the meeting |
race_num | Race number |
field | Turf or dirt |
dist | Distance |
turn | Direction (left- or right-handed) |
weather | Weather |
field_cond | Going (track condition) |
~~place~~ | ~~Venue~~ |
sum_num | Number of runners |
prize | Winning prize money |
horse_num | Horse number |
sex | Sex |
age | Age |
weight_carry | Carried weight |
horse_weight | Horse weight |
weight_change | Change in horse weight |
l_days | Days since the previous race |
**Data from the past three races**

Variable name | Contents |
---|---|
p_place | Venue |
p_weather | Weather |
p_race_num | Race number |
p_sum_num | Number of runners |
p_horse_num | Horse number |
p_rank | Finishing position |
p_field | Turf or dirt |
p_dist | Distance |
p_condi | Going (track condition) |
p_condi_num | Going index |
p_time_num | Time index |
Race times were simply converted to seconds, and the categorical variables were label-encoded. Below is the code that label-encodes the weather, as an example.
encode.py
```python
# Label-encode the weather column: map each weather string to an
# integer code, with anything unexpected falling back to 0.
weather_map = {
    'Fine': 6,
    'rain': 1,
    'light rain': 2,
    'light snow': 3,
    'Cloudy': 4,
    'snow': 5,
}
df1['weather'] = df1['weather'].map(weather_map).fillna(0).astype('int64')
```
Each categorical variable is label-encoded in this way.
It would have been easier with scikit-learn's LabelEncoder, but I didn't use it, because it seemed impossible to keep the mapping between categories and codes consistent across multiple data files.
Also, LightGBM, the machine learning framework used this time, uses decision trees as its weak learners, so there is no need to standardize the features. (Reference: Introduction to LightGBM)
I built the model with LightGBM, a gradient boosting framework. I chose it because it's fast and is (reportedly) the strongest approach outside of deep learning.
As for the prediction target, I framed it as multi-class classification into one of three groups: 3rd place or better, the middle of the field excluding the top 3, and the bottom third. For example, with 15 runners, 1st to 3rd place is group 0, 4th to 8th is group 1, and 9th to 15th is group 2.
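As a minimal sketch, a labeling function consistent with that 15-runner example might look like this (the exact boundary rule between groups 1 and 2 is my assumption):

```python
def rank_to_group(rank, n_horses):
    """Class label for a finishing position: 0 = top 3,
    1 = roughly the next third of the field, 2 = the rest.
    With 15 runners this yields 1-3 / 4-8 / 9-15."""
    if rank <= 3:
        return 0
    if rank <= 3 + round(n_horses / 3):
        return 1
    return 2
```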
I referred to the following site for how to use it. (Reference: [[For beginners] LightGBM (multi-class classification) [Python] [Machine learning]](https://mathmatical22.xyz/2020/04/11/%E3%80%90%E5%88%9D%E5%AD%A6%E8%80%85%E5%90%91%E3%81%91%E3%80%91lightgbm-%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E4%BD%BF%E3%81%84%E6%96%B9-%E5%A4%9A%E3%82%AF%E3%83%A9%E3%82%B9%E5%88%86%E9%A1%9E%E7%B7%A8/))
The training and test data are split as follows.

Training / validation data | Test data |
---|---|
Tokyo_2008_2018 | Tokyo_2019 |

The training data is further split into training data and model-evaluation data with train_test_split.
I did not do any particular parameter tuning.
train.py
```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(keiba_data, test_size=0.2, random_state=0)

# Split the training data into explanatory variables (X_train) and the target (y_train)
X_train = train_set.drop('rank', axis=1)
y_train = train_set['rank']

# Split the model-evaluation data into explanatory variables (X_test) and the target (y_test)
X_test = test_set.drop('rank', axis=1)
y_test = test_set['rank']

# Wrap the data in LightGBM Dataset objects
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'multiclassova',  # one-vs-all multi-class classification
    'num_class': 3,
    'metric': {'multi_error'},
}

model = lgb.train(params,
                  train_set=lgb_train,  # training data
                  valid_sets=lgb_eval,  # validation data
                  verbose_eval=10)      # print the metric every 10 rounds
```
The accuracy is about 54%, so it gets more than half of the classifications right. This value didn't change much even when I fiddled with the parameters, so I left them as they are this time.
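For reference, this accuracy can be computed on the evaluation split roughly like this (a small sketch using the variables from train.py above):

```python
import numpy as np

# For multi-class models, predict() returns one probability per class;
# the predicted group is the class with the highest probability.
y_pred = np.argmax(model.predict(X_test), axis=1)
accuracy = (y_pred == y_test.values).mean()
print(f'accuracy: {accuracy:.3f}')  # about 0.54 in this experiment
```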
Here are the results of verifying the model on one full year (2019) of races at Tokyo Racecourse.
The betting conditions were:

- **Buy one 100-yen win ticket per race. (Hit: odds × 100 − 100; miss: −100)**
- **Do not buy if fewer than half of the runners in a race have usable data. (±0)**

The second condition exists to exclude races that are almost certain to miss, such as 2-year-old races in which only one horse has data for three or more past races.
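A minimal sketch of this payoff simulation might look as follows (the race-record field names are hypothetical, not from the article):

```python
def simulate(races, stake=100):
    """Simulate flat win bets under the two conditions above.
    Each race is a dict with hypothetical keys: 'n_horses',
    'n_with_data', 'pick_finish' (the picked horse's finishing
    position) and 'pick_odds' (its win odds)."""
    profit, n_bets, n_hits = 0, 0, 0
    for r in races:
        # Condition 2: skip races where fewer than half the runners have data
        if r['n_with_data'] < r['n_horses'] / 2:
            continue
        n_bets += 1
        if r['pick_finish'] == 1:  # a win bet pays out only on 1st place
            profit += r['pick_odds'] * stake - stake
            n_hits += 1
        else:
            profit -= stake
    # Recovery rate = total amount returned / total amount staked
    return n_hits / n_bets, (profit + n_bets * stake) / (n_bets * stake)
```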
Below is the resulting graph.
The hit rate is about 26%, which is a decent hit rate.
Because of the second condition there were about 100 races I couldn't bet on, but since I still took part in roughly 80% of the races, I have no complaints about this recovery rate.
Pretty good ...! Some bet types even show a recovery rate close to 200%, which is a wonderful result. That said, bet types that trade a low hit rate for large payouts swing wildly in recovery rate, so I treat this figure only as a reference.
My one dissatisfaction is that the recovery rate for double win bets is below 100%, even though the model is trained precisely to judge whether a horse finishes in the top 3. I'd like to do something about that.
So far, the only purchase condition has been the number of horses with data, but in practical use I think you would decide based on expected value rather than betting on every race.
So I'd like to add the following new condition.

**・Buy only when the difference between the highest and second-highest predicted values for group 0 is 0.1 or more.**

In other words, bet only when the model clearly favors its top pick, and skip races where the top two predictions are close.
The reason for this condition is to restrict betting to races where the model is confident, as sketched below.
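The filter could be written like this (a sketch, assuming `model` is the trained booster from train.py and `X_race` holds one feature row per runner in a race):

```python
import numpy as np

probs = model.predict(X_race)       # shape (n_horses, 3): probabilities per group
group0 = probs[:, 0]                # predicted probability of finishing top 3
top_two = np.sort(group0)[::-1][:2]
if top_two[0] - top_two[1] >= 0.1:  # bet only when the model has a clear favorite
    pick = int(np.argmax(group0))   # index of the horse to buy a win ticket on
```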
Below are the results of verification under these conditions.
**: relaxed: It feels really good: relaxed: **
The hit rate improved significantly, from 26% to 39%, and the recovery rate jumped from 130% to 168%.
The number of target races dropped by about 250, leaving only around 100 races a year, but considering I still take part in about a quarter of the races, I think this recovery rate is good.
For good measure, let's also try the other bet types.
Very good! Notably, the double win hit rate exceeds 70% and its recovery rate exceeds 100%. Given that the most popular horse's double win hit rate is only about 60 to 65% (Reference: Developer Blog | AlphaImpact Co., Ltd.), this looks very good.
Let's also look at the feature importances of the trained model.
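LightGBM can plot these directly (a small sketch using the trained `model` from above):

```python
import lightgbm as lgb
import matplotlib.pyplot as plt

# Plot the features the booster split on most often
lgb.plot_importance(model, max_num_features=15)
plt.tight_layout()
plt.show()
```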
You can see that the time index is treated as quite an important feature. Naturally, horses that have run fast in past races have a high probability of doing well.
What surprised me was that the number of days since the previous race (l_days) was treated as being about as important as the final-stretch time and horse weight. I don't see many horse racing tipsters who emphasize rotation, so this was unexpected. It's also interesting because it matches Almond Eye, who lost after being run back on a tight two-week rotation. Well, I can't tell whether the indexes simply get worse when the layoff is short, lol.
These days horse racing AI seems to be steadily gaining momentum, with some sites operating as services and Dwango's cyber award. Against that backdrop, getting hands-on practice with machine-learning-based horse racing prediction was a lot of fun.
However, the future can't be predicted perfectly, so using this model doesn't mean you will definitely win. It may well lose over this year's and next year's races.
I don't think you should expect too much from this kind of thing, but since some people do make money from horse racing software, there's a dream in it.
Since it was pointed out that the link could be taken as guidance for financial gain, I deleted the URL. :bow: