Introduction

Do you like horse racing? I'm a beginner who started this year, but it's really fun to collect various information, predict and guess!

At first it was fun just to make predictions, but before long the feeling of **"I don't want to lose"** took over.

So I started searching the internet for a good way to win. Predicting horse racing with machine learning looked interesting, so I studied it and decided to give it a try.

Target

The payout rate in horse racing is said to be about 70 to 80%, so if you buy tickets without any particular strategy, your recovery rate (total payout divided by total stake) will likely converge to that level.

So, for now, I would like to aim for a **recovery rate of 100% or more** using only data available before the race!

Decide the setting

I think there are various ways to predict horse racing, such as simply predicting the ranking or optimizing the betting method by considering the odds. There are also various types of betting tickets to buy.

This time, I will split finishing positions into three groups (3rd place or better, middle, and lower) and perform **multi-class classification**.

I will then buy a **win** ticket on the horse ranked first in the predictions. The reason is that the payout rate for win tickets is set higher than for ticket types that tend to produce large dividends, such as three-horse tickets. (Reference: Buena's Horse Racing Blog-Knowledge for not losing with betting tickets)

Also, I will not use popularity or odds as features. I was unsure about relying on information that is not fixed until just before the race, and I also thought it would not be interesting to simply buy the popular horse. (Horse weight, which is determined about 50 minutes before the start of the race, is used as a feature.)

This time, I will focus on races at Tokyo Racecourse and proceed as follows. I narrowed it down to one racecourse because scraping the race data takes a long time, both because my algorithm is inefficient and because of the time.sleep of 1 second. (It took about 50 hours to collect the data from 2008 to 2019 ...)

It's a hassle, but if you take the time, you can collect data on all racetracks.

Procedure

1. Collect race data by scraping from this site (netkeiba.com).
2. Preprocess the data.
3. Train a model with LightGBM.
4. Using the trained model, check the recovery rate over one year.

Scraping

The data was collected by scraping this site (netkeiba.com). I checked robots.txt (which did not exist in the first place) and the terms of use, and scraping seemed to be allowed, so I went ahead while taking care not to overload the server. I referred to the following articles for the scraping method.

- python3 Crawling & Scraping
- Scraping with Python and Beautiful Soup
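As a rough illustration of the "do not overload the server" part, here is a minimal sketch of a fetch loop that pauses between requests. The function name `fetch_all` and its parameters are hypothetical and not taken from my actual scraper; only the 1-second wait matches what I did.

```python
import time
from urllib.request import urlopen


def fetch_all(urls, fetch=None, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to limit server load."""
    if fetch is None:
        # Default fetcher: plain HTTP GET returning the raw response bytes
        def fetch(url):
            with urlopen(url) as response:
                return response.read()

    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages
```

The `fetch` and `sleep` arguments exist only so the loop can be tried without touching a real site; the returned bytes would then be parsed with something like Beautiful Soup.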

The result of collecting the data is like this.

[Image: screenshot of the collected data, 2020-09-02 20.06.59.png]

When scraping, I excluded horses that do not have results for their last three races, because I do not think the future can be predicted for horses with no past information.

In addition, horses that ran in regional (local) races or overseas may be missing the time index and similar values; those gaps are filled with the mean.
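The mean fill described above can be sketched in plain Python as follows. This is only an illustration under the assumption that missing values are represented as `None`; the helper name `fill_missing_with_mean` is made up for this example.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average, leave the column untouched
    average = mean(present)
    return [average if v is None else v for v in values]
```

With a pandas DataFrame the same idea is usually written as `df[col].fillna(df[col].mean())`.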

Features

This time, the following items were used as features.

Data of the day

| Variable name | Contents |
| --- | --- |
| kai | Meeting number |
| day | Day of the meeting |
| race_num | Race number (R) |
| field | Turf or dirt |
| dist | Distance |
| turn | Course direction |
| weather | Weather |
| field_cond | Track condition (going) |
| ~~place~~ | ~~Venue~~ |
| sum_num | Number of runners |
| prize | Prize money for the winner |
| horse_num | Horse number |
| sex | Sex |
| age | Age |
| weight_carry | Weight carried |
| horse_weight | Horse weight |
| weight_change | Change in horse weight |
| l_days | Days since the previous race |

Data of the past 3 races (01 → previous race, 02 → 2 races before, 03 → 3 races before)

| Variable name | Contents |
| --- | --- |
| p_place | Venue |
| p_weather | Weather |
| p_race_num | Race number (R) |
| p_sum_num | Number of runners |
| p_horse_num | Horse number |
| p_rank | Finishing position |
| p_field | Turf or dirt |
| p_dist | Distance |
| p_condi | Track condition (going) |
| p_condi_num | Track index (baba index) |
| p_time_num | Time index |

Preprocessing

I only converted times to seconds and label-encoded the categorical variables. Below is the code that label-encodes the weather, as an example.

`encode.py`


# Fixed mapping from weather label to integer code.
# 'Koyuki' means light snow. Any label not listed here becomes 0.
weather_codes = {
    'Fine': 6,
    'rain': 1,
    'light rain': 2,
    'Koyuki': 3,
    'Cloudy': 4,
    'snow': 5,
}

# Apply the mapping to the whole column at once
df1['weather'] = df1['weather'].map(weather_codes).fillna(0).astype('int64')

Label-encode each categorical variable in this way.
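The other preprocessing step, converting times to seconds, can be sketched like this. I am assuming here that the raw time is a string such as `'1:34.5'` (minutes, colon, seconds); the actual format of the scraped data and the function name `time_to_seconds` are assumptions for illustration.

```python
def time_to_seconds(race_time):
    """Convert a 'M:SS.s' string such as '1:34.5' to seconds as a float."""
    minutes, _, seconds = race_time.rpartition(":")
    # rpartition leaves `minutes` empty when there is no colon (e.g. '58.9')
    return int(minutes or 0) * 60 + float(seconds)
```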

I thought LabelEncoder would be easier, but I did not use it because it seemed difficult to guarantee that the same category gets the same number across multiple data files.

Also, LightGBM, the machine learning framework used this time, uses decision trees as its weak learners, so there is no need to standardize the features. (Reference: Introduction to LightGBM)

Predictive model

I build the model with LightGBM, a gradient boosting framework. I chose it because it is fast and is (probably) the strongest method outside of deep learning.

As for the prediction method, this time I used multi-class classification, assigning each horse to one of three groups: 3rd place or better, a middle group, and a lower group drawn from the rest of the field. For example, with 15 runners, 1st to 3rd place is group 0, 4th to 8th place is group 1, and 9th to 15th place is group 2.
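To make the grouping concrete, here is a minimal sketch of turning a finishing position into a class label. Note the assumption: I place the boundary between the middle and lower groups at half the field, rounded up, because that reproduces the 15-runner example above. The exact rule is not spelled out in this article, and `rank_to_group` is a name invented for this sketch.

```python
import math


def rank_to_group(rank, num_runners):
    """Map a finishing position to class 0 (top 3), 1 (middle) or 2 (lower)."""
    if rank <= 3:
        return 0
    # ASSUMPTION: the middle group ends at the field midpoint, rounded up.
    # This matches the 15-runner example but is not stated in the article.
    middle_cutoff = math.ceil(num_runners / 2)
    return 1 if rank <= middle_cutoff else 2
```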

I referred to the following site for how to use it. (Reference: [\[For beginners\] LightGBM (Multi-class classification) \[Python\] \[Machine learning\]](https://mathmatical22.xyz/2020/04/11/%E3%80%90%E5%88%9D%E5%AD%A6%E8%80%85%E5%90%91%E3%81%91%E3%80%91lightgbm-%E5%9F%BA%E6%9C%AC%E7%9A%84%E3%81%AA%E4%BD%BF%E3%81%84%E6%96%B9-%E5%A4%9A%E3%82%AF%E3%83%A9%E3%82%B9%E5%88%86%E9%A1%9E%E7%B7%A8/))

The training, validation, and test data are as follows.

| Training data / validation data | Test data |
| --- | --- |
| Tokyo_2008_2018 | Tokyo_2019 |

The 2008 to 2018 data is split into training data and model evaluation data with train_test_split. The parameters are not specifically tuned.

`train.py`


import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for model evaluation
train_set, test_set = train_test_split(keiba_data, test_size=0.2, random_state=0)

# Split the training data into explanatory variables (X_train) and the target (y_train)
X_train = train_set.drop('rank', axis=1)
y_train = train_set['rank']

# Split the evaluation data into explanatory variables (X_test) and the target (y_test)
X_test = test_set.drop('rank', axis=1)
y_test = test_set['rank']

# Set up the datasets used for training
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclassova',
        'num_class': 3,
        'metric': {'multi_error'},
}

model = lgb.train(params,
        train_set=lgb_train,  # training data
        valid_sets=lgb_eval,  # validation data
        # Log the metric every 10 rounds (replaces verbose_eval=10,
        # which newer LightGBM releases no longer accept)
        callbacks=[lgb.log_evaluation(10)]
)

### Running it

I actually ran it.

![model_pic](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/86df2b27-05bc-cc96-fc27-cc34733ac103.png)

The accuracy is about 54%, so the model gets more than half of its predictions right. This value barely changed when I adjusted the parameters, so I will leave it as it is this time.
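For reference, the `multi_error` metric in the training parameters is the misclassification rate, so the accuracy is 1 minus that value. Below is a minimal sketch of computing the accuracy by hand from per-class scores, assuming one row of three scores per horse (the shape I would expect from the model's predictions). The function name is hypothetical.

```python
def accuracy_from_probabilities(probabilities, labels):
    """Share of rows whose highest-scoring class equals the true label."""
    correct = 0
    for row, label in zip(probabilities, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += predicted == label
    return correct / len(labels)
```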

Verification

Here are the results of verifying the model on one year of races at Tokyo Racecourse in 2019.

The conditions are as follows.

- **Buy a win ticket for 100 yen per race. (Hit = odds x 100 - 100, miss = -100)**
- **Do not buy if the number of horses with remaining data in a race is less than half the number of runners. (±0)**

The reason for the second condition is to exclude races that would almost certainly miss, such as 2-year-old races where only one horse has data for three or more past races.
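The bookkeeping behind these conditions can be sketched as follows. This is a simplified illustration, not my actual verification script: each race is represented by the win odds if the bet hit, `0` if it missed, or `None` if the race was skipped, and `summarize_bets` is a name invented for this example.

```python
def summarize_bets(results, stake=100):
    """Return (recovery_rate_percent, hit_rate_percent) for win bets.

    `results` holds one entry per race: the win odds if the bet hit,
    0 if it missed, or None if the race was skipped.
    """
    placed = [r for r in results if r is not None]
    if not placed:
        return 0.0, 0.0
    total_staked = stake * len(placed)
    total_returned = sum(odds * stake for odds in placed)
    hits = sum(1 for odds in placed if odds > 0)
    return 100 * total_returned / total_staked, 100 * hits / len(placed)
```

Skipped races add nothing to either the stake or the payout, which corresponds to the ±0 in the second condition.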

Below is the resulting graph.

**:relaxed: It feels good :relaxed:**
To be honest, I did not think the recovery rate would exceed 100% so easily.

The hit rate is about 26%, which is a decent rate.

Because of the second condition, there were about 100 races I could not bet on, but having taken part in about 80% of the races, I have no complaints about this recovery rate.

While I'm at it, I would like to verify other ticket types as well. Note that only the trifecta assumes a 3-horse box (6 combinations).

![fukusho.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/0a27b45c-5aa9-2c7b-844a-410e7d077f0c.png)

![umaren00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/6c12ed8d-dd26-c9dc-d707-0d1f3cc4d877.png)

![umatan00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/7a6a1813-35ba-876b-d776-1e8fe57ab257.png)

![renpuku00.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/613b1601-51e4-6b1c-534a-36dc666aa98a.png)

![3rentan_0.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/666677/7f5151f8-4e9a-6e9c-c912-3dfca7635443.png)

It's pretty good ...! The recovery rate for one of the ticket types is close to 200%, which is a wonderful result. However, ticket types that trade a low hit rate for large payouts have highly volatile recovery rates, so I will treat these figures as reference only.

What I am not satisfied with is that the recovery rate for place tickets is below 100%, even though the model classifies whether a horse finishes in the top three. I would like to do something about this.
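As a small aside on the 3-horse box mentioned above: a box covers every finishing order of the chosen horses, which is where the 6 combinations come from. A minimal sketch (the function name and horse numbers are made up):

```python
from itertools import permutations


def box_tickets(horse_numbers):
    """List every 1st-2nd-3rd ordering of the given horses."""
    return list(permutations(horse_numbers, 3))
```

For three horses this yields 3 x 2 x 1 = 6 tickets, so at 100 yen each one race costs 600 yen.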

Let's add more conditions

So far, the only condition for buying a ticket has been the number of horses, but in practice I think you would decide whether to bet based on how good the expected value is, rather than buying every race.
Therefore, I would like to add the following new condition.

**・Buy only when the difference between the highest and second-highest predicted values for group 0 is 0.1 or more.**

In other words, buy in a case like this:

[Image: Screenshot 2020-09-02 21.36.00.png]

and do not buy in a case like this:

[Image: Screenshot 2020-09-02 21.36.34.png]

The reasons for this condition are:

- The predicted value for finishing within 3rd place is hard to judge on its own, because it tends to be large when there are many strong horses or the field is small.
- If there is a gap between the predicted values, the top horse can be expected to be considerably stronger than the others in that race.
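The gap condition can be sketched as a small filter. This is an illustration under assumptions: the input is a dict from horse number to the predicted value for group 0, and races with fewer than two scored horses are skipped, which is a choice made only for this sketch. `pick_horse` is a hypothetical name.

```python
def pick_horse(group0_scores, min_gap=0.1):
    """Return the horse number to back, or None when the race is skipped.

    `group0_scores` maps horse number -> predicted value for group 0.
    """
    if len(group0_scores) < 2:
        return None  # ASSUMPTION: no runner-up to compare against, so skip
    ranked = sorted(group0_scores.items(), key=lambda item: item[1], reverse=True)
    (best_horse, best), (_, runner_up) = ranked[0], ranked[1]
    return best_horse if best - runner_up >= min_gap else None
```

For example, scores of 0.62 and 0.40 for the top two horses would lead to a bet, while 0.45 and 0.42 would not.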

Below are the results of verification under these conditions.

**:relaxed: It feels really good :relaxed:**

The hit rate improved significantly from 26% to 39%, and the recovery rate from 130% to 168%. The number of target races fell by about 250 and was narrowed down to 100 races a year, but considering that I am still betting on about a quarter of the races, I think this recovery rate is good.
I will try the other ticket types as well.

It is good! It is worth noting that the hit rate for place tickets exceeds 70% and the recovery rate exceeds 100%. The place rate of the most popular horse is about 60 to 65% (Reference: Developer Blog | AlphaImpact Co., Ltd.), so this seems very good.

About features

Let's also look at the importance of features when creating a model.

You can see that the time index is treated as a fairly important feature. Naturally, horses that ran well in past races have a high probability of winning.

What surprised me was that the number of days since the previous race was treated as being as important as the closing (final-stretch) time and horse weight. It was surprising because I rarely see people who predict horse racing put much weight on race spacing (rotation). It is also interesting because it matches the case of Almond Eye, who lost after being pushed into a schedule with only two weeks between races. Well, I am not sure whether the indices actually get worse when the gap is short, though lol
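For anyone who wants to list the importances instead of reading them off a plot, here is a minimal sketch. I am assuming the names and scores come from the trained model via `model.feature_name()` and `model.feature_importance()`; the helper `rank_features` itself is made up for this example.

```python
def rank_features(names, importances, top_n=10):
    """Pair feature names with importance scores, highest first."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]
```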

In conclusion

Horse racing AI seems to be steadily gaining momentum these days, with some sites operating it as a service and events such as Dwango's Cyber Award. Against that background, I was able to try horse racing prediction with machine learning myself, which was a lot of fun.

However, the future cannot be predicted perfectly, so using this model does not guarantee that you will win. It may well lose on this year's and next year's races.

I do not think it is wise to expect too much from this kind of thing, but since some people do make money with horse racing programs, I think there is something to dream about.

Since it was pointed out that this looked like an article steering readers somewhere for monetary gain, I deleted the URL. :bow:

A story about achieving a horse racing recovery rate of over 100% through machine learning