Introduction

This article is the continuation of:

・ I wrote a code that exceeds 100% recovery rate in horse racing prediction using LightGBM (1)

In Part 1 I wrote, somewhat hastily, about the model. In Part 2 I report the results of actually predicting future races and, finally, publish the code.

Predictions after model creation

In fact, I have been publishing predictions on note since July. However, the prediction code kept being improved while those notes were being issued, and the code released this time is only the basic part, so the predictions in the notes below do not necessarily match those produced by the published code.

・ [Horse Racing Forecast] July 25, 2020
・ [Horse Racing Forecast] July 26, 2020
・ [Horse Racing Forecast] August 01, 2020
・ [Horse Racing Forecast] August 08, 2020 (https://note.com/km_takao/n/n9d2acf507e60)
・ [Horse Racing Forecast] August 09, 2020
・ [Horse Racing Forecast] August 15, 2020
・ [Horse Racing Forecast] August 22, 2020

(No predictions were made for August 2, 16, and 23, 2020, owing to other commitments.)

The recovery rates when these predictions are bet as place bets (fukusho, literally "double win") are as follows. For the stake, following Mr. Ushi's method explained in Part 1, I use "total budget x 0.01 / odds 30 minutes before the race", with the total budget set to 100,000 yen.
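The stake formula above can be sketched in a few lines. This is a minimal sketch, not the published code: the function name `stake_yen`, rounding down to the 100 yen betting unit, and treating a stake below 100 yen as "no bet" are assumptions made for illustration.

```python
def stake_yen(total_budget: int, odds: float, k: float = 0.01, unit: int = 100) -> int:
    """Stake = total_budget * k / odds, rounded down to a multiple of `unit`.

    Returns 0 when the raw stake is below one unit (assumed here: no bet is placed).
    """
    if odds <= 0:
        raise ValueError("odds must be positive")
    raw = total_budget * k / odds
    # Assumed rounding rule: round down to the 100 yen betting unit.
    return int(raw // unit) * unit


# Hypothetical odds, with the 100,000 yen budget used in this article.
for odds in (1.5, 2.4, 5.0, 12.0):
    print(f"odds {odds}: stake {stake_yen(100_000, odds)} yen")
```

Under these assumptions, low odds receive a larger stake, and a horse at 12.0 gets no bet at all with a 100,000 yen budget.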

Race date	Total bet	Payout	Recovery rate
July 25, 2020	7,500 yen	9,440 yen	125%
July 26, 2020	6,700 yen	7,350 yen	109%
August 01, 2020	10,100 yen	10,110 yen	100%
August 08, 2020	23,700 yen	23,200 yen	98%
August 09, 2020	14,900 yen	15,210 yen	102%
August 15, 2020	23,200 yen	26,260 yen	113%
August 22, 2020	31,000 yen	30,540 yen	99%
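The daily figures in the table above can also be combined into one bet-weighted recovery rate for the whole period. The sketch below only re-adds the numbers already listed; the variable and function names are chosen for illustration.

```python
# (race date, total bet in yen, payout in yen), copied from the table above.
place_results = [
    ("2020-07-25", 7_500, 9_440),
    ("2020-07-26", 6_700, 7_350),
    ("2020-08-01", 10_100, 10_110),
    ("2020-08-08", 23_700, 23_200),
    ("2020-08-09", 14_900, 15_210),
    ("2020-08-15", 23_200, 26_260),
    ("2020-08-22", 31_000, 30_540),
]


def recovery_rate(total_bet: int, total_payout: int) -> float:
    """Recovery rate in percent: payout divided by the amount bet."""
    return 100.0 * total_payout / total_bet


total_bet = sum(bet for _, bet, _ in place_results)
total_payout = sum(payout for _, _, payout in place_results)
overall = recovery_rate(total_bet, total_payout)
print(f"{total_bet:,} yen bet, {total_payout:,} yen returned, {overall:.1f}%")
```

Run as written, this prints `117,100 yen bet, 122,110 yen returned, 104.3%`. Because the later days carry larger stakes, they weigh more heavily in this figure than in a simple average of the daily rates.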

Improving the code increased the number of bets, but the recovery rate got worse (as a side note, races have been held at three racecourses since the 15th). I am still considering whether such races simply happened to fall in this period or whether further improvement is needed.

Similarly, the recovery rates for win bets (tansho) are as follows.

Race date	Total bet	Payout	Recovery rate
July 25, 2020	2,800 yen	4,390 yen	156%
July 26, 2020	1,900 yen	1,580 yen	83%
August 01, 2020	4,700 yen	4,410 yen	93%
August 08, 2020	9,800 yen	7,600 yen	78%
August 09, 2020	4,500 yen	3,380 yen	75%
August 15, 2020	9,600 yen	15,060 yen	157%
August 22, 2020	12,700 yen	13,900 yen	109%

The number of bets increases here as well, but there are days when the recovery rate drops. Incidentally, here are the win-bet results when, instead of Mr. Ushi's betting method, a flat 100 yen is always staked regardless of the budget.

Race date	Total bet	Payout	Recovery rate
July 25, 2020	1,600 yen	1,850 yen	115%
July 26, 2020	1,100 yen	1,080 yen	98%
August 01, 2020	1,600 yen	3,500 yen	218%
August 08, 2020	3,900 yen	11,610 yen	297%
August 09, 2020	2,400 yen	8,190 yen	341%
August 15, 2020	3,700 yen	4,530 yen	122%
August 22, 2020	4,800 yen	5,980 yen	125%

In other words, the model can predict wins by longshots (anauma), but with Mr. Ushi's betting method the highest odds at which a bet can still be placed is capped by the size of the budget, so longshots cannot be bet on. As a result, only popular horses with low odds are bet on, which appears to be one factor lowering the recovery rate. Conversely, in races without an upset, weighting the stakes as Mr. Ushi's method does raises the recovery rate. With a budget of 100,000 yen, the effect is small when the odds are low, as with place ("double win") bets, where odds are at most around 5x. For win bets with odds of about 10x or more, however, the stake runs into the 100 yen minimum, so the effect is especially strong. The constant in the stake formula (0.01 here), your own budget, and the model's predicted values all need to be considered together.
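The cap on bettable odds follows directly from the stake formula: a bet is possible only while budget x 0.01 / odds is at least the 100 yen minimum, that is, while odds <= budget x 0.01 / 100. A minimal sketch of that arithmetic is below; the function name and the budgets other than 100,000 yen are illustrative assumptions.

```python
def max_bettable_odds(total_budget: int, k: float = 0.01, min_stake: int = 100) -> float:
    """Highest odds at which total_budget * k / odds still reaches min_stake."""
    return total_budget * k / min_stake


# 100,000 yen is the budget used in this article; the others are hypothetical.
for budget in (100_000, 300_000, 1_000_000):
    print(f"budget {budget:,} yen: bets possible up to odds of about {max_bettable_odds(budget):.1f}")
```

With a 100,000 yen budget the cap is about 10x, which matches the point at which win bets start to be affected. Raising the budget or the constant 0.01 moves the cap up and lets longshots back in, at the price of larger stakes everywhere.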

Publishing the code

I am publishing the code on note. Detailed explanations are given in the note article and in the comments in the notebook. Here I explain only the overall flow.

Scraping past results from the database

Past race results, odds, and so on are scraped from netkeiba's database to build the model. As I wrote in Part 1, the scraping is based on "How to scrape horse racing data using pandas read_html". The scraped features include information on each runner, such as finishing order and jockey name; information on the race itself, such as distance, track condition, and weather; and each horse's odds before the start of the race.

[Screenshot 2020-08-21 22.33.09.png]

Feature creation

The code I am publishing is the foundation of the code I am still improving, and I think accuracy can be raised further if, for example, you create your own features or ensemble it with other algorithms. Of course, even the published code creates new features from the scraped ones by aggregating past results.

One example of the created features is an aggregate of each horse's past results. The aggregation must not include results that were still in the future at the time of each race, so race_id is sorted in ascending order and only earlier races are aggregated. For example, the aggregated results for Almond Eye look like this:

[Screenshot 2020-08-21 22.27.54.png]

As shown, only past data is aggregated and added as a new feature. (Note that only data from 2018 to 2020 is used here, in order to check the published code.)
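The idea of aggregating only earlier races can be illustrated without any scraped data. The following is a minimal sketch in plain Python, not the notebook's code: the rows, the horse names, and the choice of "mean finishing position" as the feature are hypothetical, and it assumes, as the text does, that ascending race_id corresponds to chronological order.

```python
from collections import defaultdict

# Hypothetical rows: (race_id, horse_id, finishing position).
results = [
    ("201805020811", "horse_a", 1),
    ("201806050811", "horse_b", 3),
    ("201905020811", "horse_a", 2),
    ("202005020811", "horse_a", 1),
    ("202005020811", "horse_b", 5),
]


def add_past_average(rows):
    """Append each horse's mean finishing position over EARLIER races only."""
    total = defaultdict(int)  # running sum of finishing positions per horse
    count = defaultdict(int)  # number of earlier races per horse
    out = []
    # Sort by race_id ascending (assumed to be chronological order).
    for race_id, horse_id, finish in sorted(rows, key=lambda r: r[0]):
        past_avg = total[horse_id] / count[horse_id] if count[horse_id] else None
        out.append((race_id, horse_id, finish, past_avg))
        # Update the running totals only after recording the feature,
        # so the current race never leaks into its own feature value.
        total[horse_id] += finish
        count[horse_id] += 1
    return out


for row in add_past_average(results):
    print(row)
```

A horse's first race gets `None` because there is nothing to aggregate yet. In pandas the same idea is usually written as a grouped shift followed by an expanding mean.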

Modeling

As the title suggests, the model is built with LightGBM. The parameters are tuned automatically with Optuna. I think accuracy could be improved further by adding ensembling and cross-validation at this stage.

Scraping pre-race information and displaying predicted values

The pre-race information fed to the model is taken not from the netkeiba database but from the race information page (/top/?rf=navi). The basic code is almost the same as for scraping past results.

Predicted values are displayed for each race. A bet amount calculated in the same way as Mr. 卍's method, using the odds at the time of scraping, is also shown in a column.

For example, displaying the Niigata 4R from Saturday, August 8, 2020 gives:

[Screenshot 2020-08-26 9.07.58.png]

In this case, you would bet on the horses whose predict value exceeds a certain threshold, or on the three horses with the largest values.
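The two selection rules can be sketched as follows. This is an illustrative example only: the predict values, the horse numbers, and the 0.4 threshold are hypothetical and do not come from the race shown above.

```python
def pick_by_threshold(predictions: dict, threshold: float) -> list:
    """Horse numbers whose predicted value exceeds the threshold."""
    return [horse for horse, value in predictions.items() if value > threshold]


def pick_top_n(predictions: dict, n: int = 3) -> list:
    """Horse numbers of the n largest predicted values, largest first."""
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return ranked[:n]


# Hypothetical predict values keyed by horse number.
predictions = {1: 0.12, 2: 0.55, 3: 0.31, 4: 0.08, 5: 0.47}

print(pick_by_threshold(predictions, 0.4))  # horses above the threshold
print(pick_top_n(predictions, 3))           # three largest values
```

The threshold rule can select no horse at all in a race where the model is unsure, while the top-3 rule always bets, so the two can behave quite differently over a full day of races.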

Full code

The full code is available in the note article below, which also provides more detailed explanations.

I wrote a code that exceeds 100% recovery rate in horse racing prediction using LightGBM (Part 2)