[Verification] Having deep learning does not mean you can easily exceed a 100% recovery rate in horse racing

Time to start unraveling the mystery.

Original article: With deep learning, you can exceed a 100% recovery rate in horse racing

First, let's run it

I purchased the program right away and tried it. As described in the article, it basically worked with copy and paste, but the following two places did not work as-is, so I fixed them:

- Parsing the date data
- Column names of the features used for training / inference
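
Since the exact fixes depend on the purchased code, here is only a hypothetical illustration of the first one; the column name and the date format string are assumptions:

```python
import pandas as pd

# Hypothetical fix: parse the race-date column with an explicit format string.
# 'date' and '%Y/%m/%d' are assumptions, not taken from the purchased program.
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
```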

The results are not exactly the same because random elements are involved, but the graph behaves in much the same way, so the article appears to be reproduced. As stated there, **"part of the range where the 3rd place index is 60 or more and the odds are not too high (around 55-60)"** exceeded 100% in my run as well. The number of races and records seems to have grown because last week's races (Queen Elizabeth Cup week) were added.

Results from my run

| item | result |
|---|---|
| Number of target races (*) | 3672 |
| Number of target records | 42299 |
| Purchase number | 74 |
| Hit number | 13 |
| Hit rate | 17.57% |
| Recovery rate | 172.97% |

Results in the original article

| item | result |
|---|---|
| Number of target races (*) | 3639 |
| Number of target records | 41871 |
| Purchase number | 98 |
| Hit number | 20 |
| Hit rate | 20.4% |
| Recovery rate | 213.3% |

What kind of horses are being predicted?

Which horses did it buy to exceed 100%? Is it targeting conditional races rather than main races? Any horse racing fan would be curious. However, the validation DataFrame did not contain horse names: it only held preprocessed values such as horse number and popularity, and the odds were not raw values either, so it was very hard to tell. I could not find out without building separate data.

Try changing from deep learning to another model

Here is the code I modified. Let's swap the simple neural network for an even simpler logistic regression.

```python
from sklearn.linear_model import LogisticRegression

# penalty='l1' needs a solver that supports it (liblinear or saga)
model = LogisticRegression(C=2.0, penalty='l1', solver='liblinear',
                           random_state=42, multi_class="auto")
model.fit(df_train, train_labels)
```

This time, using the output `model.predict_proba(df)[:, 1]` of this model as the 3rd place index, I simulated buying under the same condition as before: "the 3rd place index is 60 or more and the odds are not too high (around 55-60)".
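
As a minimal sketch of that step (`df_valid` and `feature_columns` are placeholder names of mine, not from the original code):

```python
# Use the predicted probability of a top-3 finish as the "3rd place index".
# df_valid / feature_columns are placeholders for the validation frame and
# the feature columns used during training.
df_valid['win3_pred'] = model.predict_proba(df_valid[feature_columns])[:, 1]
```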

| item | result |
|---|---|
| Purchase number | 175 |
| Hit number | 21 |
| Hit rate | 12.0% |
| Recovery rate | 127.26% |

**Amazing. Logistic regression also exceeded 100%!** By the way, Random Forest reached 196%!
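
For reference, a minimal sketch of the Random Forest variant (the hyperparameters here are my assumptions, not the settings actually used):

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters are illustrative assumptions, not the original settings.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(df_train, train_labels)
win3_pred = model.predict_proba(df_valid[feature_columns])[:, 1]
```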

Is the data wrong?

I suspect the cause is some bias in the data, not deep learning. The part that decides the purchase rule is the following.

```python
# Buy (set 1) if the 3rd place index is 0.60 or higher and the standardized
# odds are in the 0.01-0.08 range; otherwise don't buy (set 0)
if win3_pred >= 0.60 and 0.01 <= odds < 0.08:
    return 1
else:
    return 0
```

`win3_pred` is the 3rd place index before being multiplied by 100. It bothers me that the odds are still in standardized form (0.01-0.08 here corresponds to ordinary win odds of 55-60), but for now I'll rewrite the condition as follows.

```python
# Buy whenever the standardized odds are in the 0.01-0.08 range,
# ignoring the 3rd place index entirely
if 0.01 <= odds < 0.08:
    return 1
else:
    return 0
```

This simulates buying a place (double win) ticket whenever the win odds are 55 to 60, without using the 3rd place index at all.
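
For reference, a minimal sketch of how the hit rate and recovery rate in these tables can be tallied; the column names `buy` and `fukusho` are assumptions (`fukusho` follows the pivot-table example later in this article):

```python
# 'buy' is the 0/1 purchase flag returned above; 'fukusho' is assumed to be the
# place-ticket payout per 100 yen staked, 0 when the horse misses the paid places.
bought = df[df['buy'] == 1]
hit_rate = (bought['fukusho'] > 0).mean() * 100
recovery_rate = bought['fukusho'].mean()  # average payout per 100 yen, i.e. % recovered
print(f"purchases={len(bought)}, hit rate={hit_rate:.1f}%, recovery rate={recovery_rate:.2f}%")
```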

| item | result |
|---|---|
| Purchase number | 778 |
| Hit number | 68 |
| Hit rate | 8.74% |
| Recovery rate | 90.67% |

Since the takeout rate for place (double win) tickets is 20%, it is natural for the recovery rate to land somewhere around 80% even when you narrow the range down purely by odds. With long-shot tickets a single hit pays a lot, so there is some variance, but 90% still feels a little high. Perhaps there is a problem with the validation data itself. Let's check where the preprocessing and data manipulation happen.

Horses without results for all of their last 5 runs are discarded

This is the first place that caught my attention.

```python
# Delete rows that are missing any of the past-5-run times
df = df.dropna(subset=[
    'past_time_sec1', 'past_time_sec2', 'past_time_sec3',
    'past_time_sec4', 'past_time_sec5'
]).reset_index(drop=True)
```

past_time_sec1 through past_time_sec5 hold the horse's times for its last 5 runs, which means any horse that does not have all 5 of those times is thrown away here. The number of starters with complete records varies a lot, especially in races for 2- and 3-year-olds. For example, last week (November 10, 2019), Fukushima 10R Fukushima 2-year-old S had 14 starters (https://race.netkeiba.com/?pid=race_old&id=c201903030410), but only three of them had times for all of their past 5 runs, and indeed only three remained in the DataFrame after the deletion. This dropna reduces the number of records from **471,500 to 252,885**. **Nearly half of the data is discarded.** The discarded data is mostly 2- and 3-year-old horses, horses coming from local (NAR) racing (no data is collected for local racing), and 2010 data (past-run information cannot be obtained because there is no 2009 data). It does not seem appropriate, but it is not a fatal mistake, because the same rule can be applied to exclude horses at inference time.

Up to which place do double win (place) bets pay out?

Place (double win) tickets pay out down to 3rd place when there are 8 or more runners, down to 2nd place with 5 to 7 runners, and are not sold at all with 4 or fewer runners.
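
As a small sketch of that rule (a helper of my own, not from the original code):

```python
def num_paid_places(num_runners: int) -> int:
    """Return how many finishing positions a place (double win) bet pays;
    0 means place bets are not sold for that race."""
    if num_runners >= 8:
        return 3
    if num_runners >= 5:
        return 2
    return 0
```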

The following processing was performed **on the validation data**:

```python
# Keep only races with 2 or more horses inside the paid place range
# and 5 or more records in total
win3_sums = df.groupby('race_id')['win3'].sum()
win3_races = win3_sums[win3_sums >= 2]
win3_races_indexs = win3_races.index.tolist()

win3_counts = df.groupby('race_id')['win3'].count()
win3_races2 = win3_counts[win3_counts >= 5]
win3_races_indexs2 = win3_races2.index.tolist()

race_id_list = list(set(win3_races_indexs) & set(win3_races_indexs2))
```

This process reduces the number of records from **48,555 to 42,999**, discarding 11.4% of the data. What we really want to throw away is races where no place bets are refunded, but this throws away far too much. In fact, there were no JRA races with 4 or fewer starters in 2018-2019 (at least in my keibadb), so this processing is a problem.
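
What was presumably intended is closer to the following sketch: count starters per race before any rows are dropped, and remove only races where place bets are not sold at all (the names here are assumptions):

```python
# Count starters per race on the raw, unfiltered data and keep only races with
# 5 or more starters, i.e. races where place bets are actually sold.
# 'df_raw' and 'horse_number' are assumed names, not from the original code.
starters = df_raw.groupby('race_id')['horse_number'].count()
races_with_place_bets = starters[starters >= 5].index
df = df[df['race_id'].isin(races_with_place_bets)].reset_index(drop=True)
```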

What's going on

So what is wrong? win3 is the target variable indicating whether a horse finished inside the paid place range, so the processing above keeps only races in which 2 or more horses finished inside the place range and 5 or more records remain. But remember: **horses without times for all of their last 5 runs have already been thrown away**, so **horses that should not have been removed have already disappeared**.

It is a little confusing, so let's look at a concrete example: Kyoto 12R on November 2, 2019. https://race.netkeiba.com/?pid=race&id=p201908050112&mode=shutuba Five horses, No. 4 Boccherini, No. 7 Sunray Pocket, No. 9 Narita Blue, No. 10 Theo Amazon, and No. 12 Metropole, had already been removed from the records because their last 5 run times were not all available, so the race is treated as an 8-horse race. The actual finishing order was 4-3-7, and since it was a 13-horse race, the paid places went to No. 4, No. 3, and No. 7. However, because **No. 4 and No. 7 had already been deleted from the DataFrame, only one horse in this race counts as being inside the place range**, so the race is excluded from the validation data. This operation cannot be applied to future races, since we obviously do not know which horses will finish inside the place range before the race is run.

Incidentally, No. 5 Nihon Pillow Halo in this race had win odds of around 60 and lost, and because the race was excluded, the validation data no longer has to "buy" this losing 55-60 odds horse. Things are becoming clearer little by little: regardless of the training model, horses have disappeared from the validation data in a way that appears to bias it.

Let's verify with correct data

Let's run the inference without the improper narrowing. There are no races without place-bet refunds in 2018-2019, so the filtering above is unnecessary. Let's comment it out and simulate the purchases against the un-narrowed validation data.

| item | result |
|---|---|
| Number of target races (*) | 5384 |
| Number of target records | 48555 |
| Purchase number | 88 |
| Hit number | 13 |
| Hit rate | 14.77% |
| Recovery rate | 145.45% |

This 145% recovery rate is achieved under the following three conditions:

- the 3rd place index (the deep learning model's predicted value)
- excluding horses that do not have times for all of their last 5 runs
- win odds of 55 to 60

Both the hit rate and the recovery rate have dropped, but on paper this is still profitable. Is this the power of deep learning?

What kind of horses does the model rate highly?

When does the 3rd place index get higher? I trained a decision tree using, as the target, the class indicating whether the 3rd place index is above or below 0.5.

```python
from sklearn.tree import DecisionTreeClassifier

# Fit a shallow tree to the deep learning model's own predictions to see
# which features drive the 3rd place index
clf = DecisionTreeClassifier(max_depth=3)
clf = clf.fit(df[all_columns], df.win3_pred > 0.5)
```

Let's visualize the resulting tree.

(Figure: H.png, the fitted decision tree)
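
One way such a visualization can be produced is with sklearn's built-in `plot_tree` (the original article may have used graphviz instead):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the shallow tree fitted above; all_columns holds the feature names.
plt.figure(figsize=(16, 8))
plot_tree(clf, feature_names=all_columns, filled=True)
plt.show()
```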

Apparently past_odds1, that is, **the win odds of the previous run**, together with the finishing position of the previous run, are what matter.

So this time, following that rule, let's simply replace the 3rd place index in the purchase condition with "the win odds of the previous run are 10 or less".

```python
# if win3_pred >= 0.60 and 0.01 <= odds < 0.08:
if raw_past_odds1 <= 10 and 55 <= raw_odds <= 60:
    return 1
```

| item | result |
|---|---|
| Purchase number | 115 |
| Hit number | 15 |
| Hit rate | 13.04% |
| Recovery rate | 147.22% |

Instead of using the output of the deep learning model, **a single rule gave about the same recovery rate.**

Review all data again

Let's look at the horses with win odds of 55-60 not just in 2018-2019 but going all the way back to 2010.

Horses with win odds of 55-60, across all 471,500 records

```python
# Aggregate place payouts per year for horses with win odds of 55-60:
# mean = recovery rate (%), count = number of purchases, sum = total refund
pivot_df = df2[(df2['odds'] >= 55) & (df2['odds'] <= 60)].groupby('year') \
          .fukusho.agg(["mean", "count", "sum"])
pivot_df.columns = ["Recovery rate", "Purchase number", "Refund"]
pivot_df['Income and expenditure'] = pivot_df['Purchase number'] * (-100) + pivot_df['Refund']
pivot_df.style.background_gradient()
```

(Figure: A.png, recovery rate by year)

The recovery rate in 2015 is relatively high, but still only about 80%. The 2018-2019 validation period is also below 80%, nothing special.

Horses with win odds of 55-60, after excluding horses without times for all of their last 5 runs

(Figure: C.png, recovery rate by year after the exclusion)

This is the state of the validation data. Compared with the previous breakdown, the recovery rate decreased only in 2016 and increased in every other year. It also increased for the 2018-2019 validation period, exceeding 80%.

Horses with win odds of 55-60 whose previous-run win odds were 10 or less

(Figures: D.png and E.png, recovery rate by year; left: all data, right: after excluding horses without all 5 past-run times)

The left figure is for all data, and the right is after narrowing down to horses with times for all 5 past runs. The condition that matched the deep learning result earlier (147.22%) corresponds to the combined 2018 and 2019 rows in the right-hand table. **You can see that even buying under the same conditions, the recovery rate falls below 60% in some years.**

Conclusion

The rule that replaced the 3rd place index, previous-run win odds of 10 or less, hit 147.22% only because the period was **2018-2019**; it seems unlikely that this rule will keep achieving a recovery rate above 100% in the future. So what about the 3rd place index itself? If you plot the relationship between the 3rd place index and the previous-run odds...

(Figure: ダウンロード (6).png, 3rd place index vs. previous-run odds)

Other features are of course also used, but **you can see that the lower the previous-run odds, the higher the 3rd place index.** That is exactly what the decision tree showed.
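
A minimal sketch of how such a plot can be drawn (the column names are assumptions):

```python
import matplotlib.pyplot as plt

# Scatter the 3rd place index against the raw win odds of the previous run;
# 'raw_past_odds1' and 'win3_pred' are assumed column names.
plt.scatter(df['raw_past_odds1'], df['win3_pred'], s=2, alpha=0.3)
plt.xlabel('win odds of previous run')
plt.ylabel('3rd place index (win3_pred)')
plt.show()
```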

The original article was titled **"With deep learning, you can exceed a 100% recovery rate in horse racing"**, but in reality I think the content amounts to: **"if you exclude horses that do not have times for all of their last 5 runs and buy place (double win) tickets on horses with win odds of 55 to 60 from November 2018 to November 2019, the recovery rate happens to exceed 100%, by a rule rather than by deep learning."** **It is not as if having deep learning makes it easy to exceed a 100% recovery rate in horse racing.**

(I did the analysis and wrote this all in one go, so I may have made mistakes. Please let me know if you find any.)

Lack of love for data

What I want to say throughout this article is: look at your data more carefully. I enjoyed this verification and the feeling of a mystery gradually being solved. **Fundamentally, I think data is interesting.** There are many perspectives and many discoveries to be made. Even the Titanic survival dataset is interesting just to look at. The more properly you engage with the data, the more you see, and that is how you build a good model. **No matter how much deep learning or how good an algorithm you use, a model built without caring about the data will not become a good AI.** If you want to get to know the data but don't know where to start, go to the racetrack!! It's fun even if you don't make money!!!
