Predicting Kyotei (Boat Races) with TensorFlow

Motivation

At an event I had the chance to try TensorFlow, and someone there asked me whether boat races could be predicted with machine learning. It sounded interesting, so I gave it a try.

Environment

Ubuntu 16.04 + Python 2.7.12 + TensorFlow 0.7.1

1. Problem setting

In a boat race, six boats compete for finishing position. People who buy betting tickets predict the finishing order from each racer's past record. This time I take on the "nirentan" (exacta) bet, which predicts the 1st and 2nd place finishers in the correct order.
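Because this bet names an ordered pair of two different boats, there are 6 × 5 = 30 possible outcomes per race. The sketch below shows one way to enumerate them as class labels. The encoding (`OUTCOME_TO_LABEL`, `to_label`) is an assumption for illustration and not necessarily the one used in the original program.

```python
from itertools import permutations

# Boats are numbered 1..6; an outcome is an ordered pair
# (first, second) of two different boats.
BOATS = range(1, 7)
OUTCOMES = list(permutations(BOATS, 2))  # 6 * 5 = 30 ordered pairs

# Hypothetical encoding: map each outcome to a class index 0..29.
OUTCOME_TO_LABEL = {pair: i for i, pair in enumerate(OUTCOMES)}


def to_label(first, second):
    """Return the class index for a (first, second) finish."""
    return OUTCOME_TO_LABEL[(first, second)]
```

With 30 outcomes, picking one uniformly at random would hit about 3.3% of the time, which gives a baseline for judging any hit rate.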

2. Create past race database

Past race results are published as text files on the following site: http://www1.mbrace.or.jp/od2/K/pindex.html I downloaded all results from 2014 to the present (November 2016) in bulk and loaded them into a database using Python and SQLite3.
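As a rough sketch of the database step, the results could be stored one row per boat per race. The table name, column names, and helper functions below are hypothetical, since the original schema is not given, and parsing of the downloaded text files is left out.

```python
import sqlite3

# Hypothetical schema: one row per boat per race. These names are
# assumptions for illustration, not the original database layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    race_date TEXT    NOT NULL,
    venue     INTEGER NOT NULL,
    race_no   INTEGER NOT NULL,
    frame     INTEGER NOT NULL,
    racer_id  INTEGER NOT NULL,
    finish    INTEGER,
    PRIMARY KEY (race_date, venue, race_no, frame)
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_results(conn, rows):
    """Insert (race_date, venue, race_no, frame, racer_id, finish) tuples.

    Re-inserting the same race and frame replaces the earlier row, so a
    batch can be loaded twice without creating duplicates."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?, ?, ?)", rows
        )
```

Keying on date, venue, race number, and frame keeps a repeated bulk download from duplicating rows.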

3. Create input features

I calculated the features to feed into training. The features used are:

- Race venue
- Whether the course entry is fixed
- Each racer's distribution of entry courses
- Each racer's distribution of finishing positions by frame
- Each racer's distribution of start timings
- Each racer's distribution of kimarite (winning techniques)

The racer features were computed over the preceding one and a half months. Races that include a racer with extremely little history are excluded from prediction. Some bettors reportedly also take the racers' boat motors into account, but motors were not used this time.
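To make the idea of a per-racer distribution concrete, here is a minimal sketch for one feature, the finishing-position distribution over a recent window. The 45-day window is my reading of "one and a half months", and `MIN_HISTORY` is a hypothetical cutoff, because the original threshold for "extremely little history" is not stated.

```python
from collections import Counter
from datetime import date, timedelta

WINDOW_DAYS = 45  # assumption: "one and a half months" taken as 45 days
MIN_HISTORY = 5   # hypothetical cutoff for "extremely little history"


def finish_distribution(history, race_day, window_days=WINDOW_DAYS):
    """Relative frequency of finishing positions 1..6 for one racer.

    history  -- list of (date, finish_position) tuples
    race_day -- date of the race being predicted
    Only results inside the window before race_day are counted.
    Returns None when there are too few results to be meaningful."""
    start = race_day - timedelta(days=window_days)
    recent = [pos for day, pos in history if start <= day < race_day]
    if len(recent) < MIN_HISTORY:
        return None
    counts = Counter(recent)
    total = float(len(recent))
    return [counts[p] / total for p in range(1, 7)]
```

A race would be dropped from prediction whenever this returns `None` for any of its six racers.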

I implemented the network with reference to the following article: [Machine learning (TensorFlow) + Lotto 6] http://qiita.com/yai/items/a128727ffdd334a4bc57

4. Training

Training used 97,200 races from January 2014 to March 2016, with 300 training steps. The resulting hit rate on the training data was about 20%. Predicting boat races does seem to be difficult.

5. Simulation

I tested on six months of races, from May 2016 to October 2016. For each race, the simulation assumes a 100 yen ticket is bought on the outcome with the highest output value (the predicted result). Because of how the program was written, races in which five or fewer boats finished, due to fouls or racers falling overboard, are excluded from the test cases. The simulation also does not account for the drop in odds that buying a ticket would cause. Please note, therefore, that the results shown below, such as the hit rate, may be somewhat better than what would actually be achieved.
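The betting rule described above can be sketched as follows. The input format (a list of per-outcome scores, the index of the actual outcome, and the payout per 100 yen ticket) is an assumption for illustration, and, as in the text, the effect of one's own purchase on the odds is ignored.

```python
STAKE = 100  # yen bet on each race


def simulate(races):
    """Bet STAKE yen on the top-scoring outcome of every race.

    races -- iterable of (scores, actual_label, payout), where
             scores is a list of model outputs (one per outcome),
             actual_label is the index of the outcome that occurred,
             payout is the yen returned per 100 yen winning ticket.
    Returns (races_bet, hits, balance_in_yen)."""
    bets = hits = balance = 0
    for scores, actual_label, payout in races:
        # Index of the outcome with the highest model output.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        bets += 1
        balance -= STAKE
        if predicted == actual_label:
            hits += 1
            balance += payout
    return bets, hits, balance
```

The hit rate is then `hits / float(bets)`, and the balance corresponds to the last column of the result tables.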

6. Simulation (1): Betting on every race

First, I bet on every race in the period for which a prediction was made.

Period	Races predicted	Races hit	Hit rate	Balance (yen)
2016/5	4178	856	0.204	-63,010
2016/6	3589	723	0.201	-54,460
2016/7	3940	752	0.190	-75,450
2016/8	4336	816	0.188	-61,120
2016/9	3598	672	0.186	-64,610
2016/10	3750	688	0.183	-74,940
Total	23391	4507		-393,590

The result is disappointing. The hit rate is low and only low-odds races are hit, so the balance is heavily negative.

7. Simulation (2): Betting only on selected races

Next, I bet only on races where the highest output value exceeds a threshold (0.45 this time). The idea is to concentrate on the races the model is confident about.
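A minimal sketch of this selection rule follows. The function name `confident_pick` is hypothetical, and I read "exceeds" as a strict comparison.

```python
THRESHOLD = 0.45  # value used in this article


def confident_pick(scores, threshold=THRESHOLD):
    """Return the index of the top-scoring outcome, or None to skip the race.

    scores -- list of model outputs, one per outcome.
    The race is bet on only if the top score is strictly above threshold."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None
```

Raising the threshold trades the number of races bet on for a higher expected hit rate, so it is a parameter worth tuning on data that is not used for the final test.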

Period	Races predicted	Races hit	Hit rate	Balance (yen)
2016/5	55	28	0.509	+190
2016/6	53	24	0.452	+1,050
2016/7	63	29	0.460	+790
2016/8	47	24	0.510	+530
2016/9	30	13	0.433	-170
2016/10	30	14	0.466	+450
Total	278	132		+2,840

The hit rate is above 40%, and the balance is slightly positive in five of the six months. Once again only low-odds races seem to be hit, but the high hit rate makes up for it. Given that the average recovery rate for boat races is 75%, this looks like a decent result.
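To compare the two simulations with the 75% average, the recovery rate (payout divided by stake) can be derived from the "Total" rows of the two tables, assuming 100 yen was staked on every race counted. The helper name `recovery_rate` is hypothetical.

```python
STAKE = 100  # yen bet on each race


def recovery_rate(races_bet, balance):
    """Payout divided by stake, given the number of bets and the
    net balance in yen (payout minus stake)."""
    staked = races_bet * STAKE
    return (staked + balance) / float(staked)


all_races = recovery_rate(23391, -393590)  # totals of simulation (1)
selected = recovery_rate(278, 2840)        # totals of simulation (2)
print("all races: %.1f%%, selected: %.1f%%"
      % (all_races * 100, selected * 100))
```

This gives roughly 83% for betting on every race and roughly 110% for the selected races, so both figures are above the 75% average, although only the second is profitable.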

Summary

Because a boat race is a contest between people, it has many irregular elements, and predicting the finishing order itself with machine learning seems difficult. Part of the reason is also that I am an amateur in both machine learning and boat racing. The approach may still be useful for picking out, from a large number of races, the so-called "solid" races in which there is an overwhelming difference in ability between racers. As mentioned earlier, this simulation was run under conditions that differ from reality, so I cannot say whether it would work in actual betting.