I am a student majoring in information systems at a certain T university. While browsing articles on Qiita, I came across this one:

- If you have deep learning, you can exceed a 100% recovery rate in horse racing

Regarding the 100% recovery rate achieved in that article, the number of betting tickets in the purchase simulation is small, so it is unclear whether the result would hold over other periods. The source code is also behind a paywall, so the details of the method are unknown. Still, I thought it would be interesting to try predicting horse racing myself, so I gave it a go as a learning exercise.

Since I would be doing everything myself, from data collection to analysis and prediction, there would be plenty to learn.

Part of me hoped it might make money, but horse racing has a high takeout rate, so I did not expect much. The main reason was simply that deep learning has been a hot topic lately and I wanted to try it.
Other reasons for choosing horse racing:

- The outcome of a race is not affected by the people betting on it
- With enough explanatory variables, it seems possible to make predictions with reasonable accuracy
Stocks also seemed like a good theme, but prices move because of the decisions of many people, so it is apparently hard to predict them accurately unless you also incorporate the news and other information that traders watch. In addition, many institutional investors place orders automatically with algorithms, and prices likely depend on those as well. For these reasons I felt it would not be easy with current technology, and that horse racing was a better fit for deep learning.
The number of runners in horse racing varies from race to race, whereas in boat racing the number of participants is fixed, so boat racing might be even easier for machine learning if detailed data can be obtained.
"Horse racing (horse racing) is a race in which horses with horses compete, and a gambling that predicts the order of arrival" (quote: [Horse Racing-Wikipedia](https: //) ja.wikipedia.org/wiki/horse racing)).
I knew very little about horse racing before doing this analysis, so let me summarize the knowledge I think you need to read this article.

First, the types of betting tickets are basic knowledge worth knowing. It is enough to understand the win (tansho) and show (fukusho) bets. Reference: [Types of betting tickets: JRA for first-time users](https://www.jra.go.jp/kouza/beginner/baken/)
For other terms, see the following:

- Odds: how many times your stake you receive if the bet wins
- Agari: the closing stage of a race or workout
- Horse number (umaban): a number uniquely assigned to each runner
- Frame number (wakuban): numbers 1 to 8; one number is shared by every two starting gates
- Order of finish: the order in which the horses reach the goal
- Central horse racing: races held by the Japan Racing Association (JRA) at 10 racecourses: Sapporo, Hakodate, Fukushima, Niigata, Nakayama, Tokyo, Chukyo, Kyoto, Hanshin, and Kokura
- Local horse racing: races hosted by local governments, as opposed to central horse racing
Reference: Horse Racing Glossary JRA
I am not very familiar with horse racing, so please let me know if I have made a mistake.

Domain knowledge is said to be important in machine learning, so becoming familiar with horse racing will be necessary to improve prediction accuracy.
Even when the goal is just "predicting horse racing", there is a lot to think about and do. The procedure can be roughly divided into data collection, data formatting and analysis, and model creation and evaluation.

The first hurdle for anyone who wants to predict horse racing is collecting and shaping the data. In competitions like Kaggle the dataset is given to you from the start, which makes things easy, but this time we have to start by gathering the data ourselves.

Model creation is also hard because so many methods are possible. These days libraries make it easy to use gradient boosting, deep learning, and so on, but you still need to try various approaches to improve prediction accuracy.
- Basic knowledge of HTML, CSS, etc.
- Basic usage of Selenium
- Basic usage of BeautifulSoup
- Basic usage of pandas
- Basic usage of Keras
Data used:

- Training data: January 2008 to July 23, 2017
- Validation data: July 23, 2017 to November 2019

Results:

- Win (tansho) accuracy: 0.2450
- Show (fukusho) accuracy: 0.5434

As a horse racing beginner, I managed to build a model that predicts better than I do.
You cannot do machine learning without data, so let's start with crawling and scraping.

First, get information on past race results and horses from the target site. The data obtained here should be kept as close to the raw data as possible; it will be formatted for training later.
netkeiba is the largest horse racing information site in Japan. From past race data to pedigree information, you can get fairly detailed data for free. Even more detailed data is available to paid members, which would help if you want to improve the model's accuracy.

This time I decided to collect data focusing on race results at the central (JRA) racecourses, which have a large amount of information and a unified format. There is a lot of data available, so collecting and using more of it should produce a better model; however, gathering pedigree information and data on owners and trainers is quite a chore, so I skipped it this time. Adding such data would likely improve prediction accuracy.
From the detailed race search screen on the site, use Selenium to get the URLs of all the race result pages.

The reason for not using requests and BeautifulSoup, which are commonly used for crawling and scraping in Python, is that both the search page and the search results live at the same URL, [https://db.netkeiba.com/?pid=race_search_detail](https://db.netkeiba.com/?pid=race_search_detail), which does not change. When a page is generated dynamically with JavaScript or PHP, simply downloading the HTML does not give you the data you want.

With Selenium, screen transitions are performed through actual browser operations, so you can crawl even sites whose display changes when you click a button, or sites that require login. (Note that many sites requiring login prohibit crawling in their terms of service.)
```python
import time

from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Run Chrome in headless mode
driver = webdriver.Chrome(chrome_options=options)
wait = WebDriverWait(driver, 10)
```
Fill in the required fields on the form, submit it, and wait until the search results are displayed.
URL = "https://db.netkeiba.com/?pid=race_search_detail"
driver.get(URL)
time.sleep(1)
wait.until(EC.presence_of_all_elements_located)
#Search by month
year = 2019
month = 1
#Select a period
start_year_element = driver.find_element_by_name('start_year')
start_year_select = Select(start_year_element)
start_year_select.select_by_value(str(year))
start_mon_element = driver.find_element_by_name('start_mon')
start_mon_select = Select(start_mon_element)
start_mon_select.select_by_value(str(month))
end_year_element = driver.find_element_by_name('end_year')
end_year_select = Select(end_year_element)
end_year_select.select_by_value(str(year))
end_mon_element = driver.find_element_by_name('end_mon')
end_mon_select = Select(end_mon_element)
end_mon_select.select_by_value(str(month))
#Check out the Central Racecourse
for i in range(1,11):
terms = driver.find_element_by_id("check_Jyo_"+ str(i).zfill(2))
terms.click()
#Select the number to be displayed(20,50,From 100 to the maximum 100)
list_element = driver.find_element_by_name('list')
list_select = Select(list_element)
list_select.select_by_value("100")
#Submit form
frm = driver.find_element_by_css_selector("#db_search_detail_form > form")
frm.submit()
time.sleep(5)
wait.until(EC.presence_of_all_elements_located)
For simplicity, the code above only gets the URLs for January 2019. If you want data over a wider range, do one of the following:

- Do not fill in the year/month form at all
- Loop over each year and month and get the URLs for each (a sketch of this is shown below)
- Widen the range of selected years

(In the code on GitHub, race data that has not yet been acquired is collected going back to 2008.)
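As a rough illustration of the looping approach, here is a minimal sketch. It assumes the form-filling and URL-collecting steps shown above are wrapped into hypothetical helper functions `fill_search_form` and `collect_race_urls`; those names are my own, not part of the original code.

```python
# Hypothetical sketch: loop over every month in a range of years.
# fill_search_form() and collect_race_urls() are assumed helpers wrapping the
# form-filling and URL-saving code shown above; they are not in the original article.
for year in range(2008, 2020):
    for month in range(1, 13):
        driver.get(URL)
        time.sleep(1)
        fill_search_form(driver, year, month)             # period, racecourses, page size, submit
        collect_race_urls(driver, f"{year}-{month}.txt")  # page through results and save URLs
```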
If you do not select any racecourses, races held overseas will also be included, so be sure to check the 10 central racecourses. I decided not to use data from outside the central racecourses this time, because those races may have few runners or incomplete data.

Click the "Next" button with Selenium and save the URLs, 100 at a time.
```python
with open(str(year) + "-" + str(month) + ".txt", mode='w') as f:
    while True:
        time.sleep(5)
        wait.until(EC.presence_of_all_elements_located)
        all_rows = driver.find_element_by_class_name('race_table_01').find_elements_by_tag_name("tr")
        for row in range(1, len(all_rows)):
            race_href = all_rows[row].find_elements_by_tag_name("td")[4].find_element_by_tag_name("a").get_attribute("href")
            f.write(race_href + "\n")
        try:
            target = driver.find_elements_by_link_text("Next")[0]
            driver.execute_script("arguments[0].click();", target)  # Click via JavaScript
        except IndexError:
            break
```
The file is opened and each URL obtained is written on its own line. The race URL is in the 5th column of the table, and since Python indices start at 0, it is selected with `find_elements_by_tag_name("td")[4]`.

Paging is done in a while loop. On the last page there is no "Next" link to click, so the resulting exception is caught with `try`.

The `driver.execute_script("arguments[0].click();", target)` part inside the `try` could be written as a simple `target.click()`, but that raised an `ElementClickInterceptedException` in headless mode. Apparently Selenium decided the element was overlapped by another and could not be clicked. I found a workaround, and clicking via JavaScript as above worked fine.
The pages collected above do not seem to rely much on PHP or JavaScript for display, so from here on I finally use requests. I fetch the HTML for each URL collected above and save it, but since each page takes a few seconds, this step takes a long time.
```python
import os
import requests

save_dir = "html" + "/" + str(year) + "/" + str(month)
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

with open(str(year) + "-" + str(month) + ".txt", "r") as f:
    urls = f.read().splitlines()
    for url in urls:
        parts = url.split("/")          # avoid shadowing the built-in list
        race_id = parts[-2]
        save_file_path = save_dir + "/" + race_id + '.html'
        response = requests.get(url)
        response.encoding = response.apparent_encoding  # avoid garbled Japanese text
        html = response.text
        time.sleep(5)                   # wait so as not to hammer the server
        with open(save_file_path, 'w') as file:
            file.write(html)
```
Depending on the character encoding, naively fetching the page can leave you with garbled characters. Setting `response.encoding = response.apparent_encoding` fixed it for me.

Reference: Correcting garbled characters when handling Japanese with Requests
The race details and the information on each runner are saved to CSV files with the following layout.

Race details:

- Race ID
- Round number
- Race title
- Course information

Horse details (one row per runner):

- Race ID
- Finishing position
- Horse ID
- Horse number
- Frame number
- Sex and age
- Carried weight
- Body weight and weight change
- Time
- Margin
- Agari (closing) time
- Odds
- Popularity

Other information is also available; paid members can apparently even get what is called a speed index.
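The actual `race_data_columns` and `horse_data_columns` used below are omitted in the article. As a rough illustration only, and based on the fields listed above plus the columns referenced later in the formatting code, they might look something like this (the exact names are my assumption, not the author's):

```python
# Hypothetical column names matching the fields listed above;
# the article omits the real race_data_columns / horse_data_columns.
race_data_columns = [
    "race_id", "race_round", "race_title", "course_info", "date", "time",
]
horse_data_columns = [
    "race_id", "rank", "horse_id", "horse_number", "frame_number",
    "sex_age", "burden_weight", "horse_weight", "weight_diff",
    "time", "margin", "last_3f", "odds", "popularity",
]
```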
```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

CSV_DIR = "csv"
if not os.path.isdir(CSV_DIR):
    os.makedirs(CSV_DIR)
save_race_csv = CSV_DIR + "/race-" + str(year) + "-" + str(month) + ".csv"
horse_race_csv = CSV_DIR + "/horse-" + str(year) + "-" + str(month) + ".csv"

# race_data_columns and horse_data_columns are omitted here because they are long
race_df = pd.DataFrame(columns=race_data_columns)
horse_df = pd.DataFrame(columns=horse_data_columns)

html_dir = "html" + "/" + str(year) + "/" + str(month)
if os.path.isdir(html_dir):
    file_list = os.listdir(html_dir)
    for file_name in file_list:
        with open(html_dir + "/" + file_name, "r") as f:
            html = f.read()
            parts = file_name.split(".")   # avoid shadowing the built-in list
            race_id = parts[-2]
            race_list, horse_list_list = get_rade_and_horse_data_by_html(race_id, html)  # omitted because it is long
            for horse_list in horse_list_list:
                horse_se = pd.Series(horse_list, index=horse_df.columns)
                horse_df = horse_df.append(horse_se, ignore_index=True)
            race_se = pd.Series(race_list, index=race_df.columns)
            race_df = race_df.append(race_se, ignore_index=True)

race_df.to_csv(save_race_csv, header=True, index=False)
horse_df.to_csv(horse_race_csv, header=True, index=False)
```
For each race, the race details and the information on each runner are collected into lists and appended to the pandas DataFrames one row at a time.

The `get_rade_and_horse_data_by_html` function, `race_data_columns`, and `horse_data_columns` are long, so they are not included here. Briefly, `get_rade_and_horse_data_by_html` uses BeautifulSoup to extract the desired data from the HTML and return it as lists, and `race_data_columns` / `horse_data_columns` are the column names of the data to be acquired.
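To give a rough idea of what such a parsing function could look like, here is my own minimal sketch; it is not the author's omitted implementation, and the selectors and field extraction are assumptions that would need adapting to netkeiba's actual markup (only the `race_table_01` class is taken from the Selenium code above):

```python
from bs4 import BeautifulSoup

def get_race_and_horse_data_by_html_sketch(race_id, html):
    """Hypothetical sketch: parse one saved race page into
    (race_list, horse_list_list). Selectors are assumptions."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("h1")
    race_title = title_tag.get_text(strip=True) if title_tag else ""
    race_list = [race_id, race_title]  # plus round, course info, date, ... in the real version

    horse_list_list = []
    table = soup.find("table", class_="race_table_01")
    if table is not None:
        for tr in table.find_all("tr")[1:]:           # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                horse_list_list.append([race_id] + cells)
    return race_list, horse_list_list
```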
When crawling, be sure to leave enough time between requests so that you do not put excessive load on the server. Others have written detailed summaries of the legal precautions; if you actually do this, please refer to articles such as "List of precautions for web scraping" on Qiita.
Now that the data is in CSV form, let's clean it up so that it is easy to handle. Then, while looking at the data, think about what kind of model to build, and create the training data to match.

First, format the data so it is easy to work with. For example, convert dates and numbers stored as strings into datetime objects and ints. It also helps later if each column holds a single simple value, so the sex-and-age column is split into two columns, and so on; there is a lot of this kind of work. It might have been better to do this during scraping, but the scraping code looked like it would get complicated, so I did it separately this time.

Some examples are shown below.
```python
# Extract the start time, combine it with the date, and convert to datetime.
# The raw values are Japanese text such as "発走 : 15:30" and "2019年1月5日".
race_df["time"] = race_df["time"].str.replace(r'発走 : (\d\d):(\d\d)(.|\n)*', r'\1時\2分', regex=True)
race_df["date"] = race_df["date"] + race_df["time"]
race_df["date"] = pd.to_datetime(race_df['date'], format='%Y年%m月%d日%H時%M分')

# The original time column is no longer needed, so drop it
race_df.drop(['time'], axis=1, inplace=True)

# Strip the extra "R", spaces, and newlines from the round column
race_df['race_round'] = race_df['race_round'].str.strip('R \n')
```
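The article mentions splitting the sex-and-age column into two but does not show that code. A minimal sketch (the column names are my assumptions) could look like this:

```python
# Hypothetical sketch: split a combined sex/age string such as "牡3" into two columns.
# "sex_age" and the new column names are assumptions, not the article's.
horse_df["sex"] = horse_df["sex_age"].str[0]               # first character: sex
horse_df["age"] = horse_df["sex_age"].str[1:].astype(int)  # remaining characters: age
horse_df.drop(["sex_age"], axis=1, inplace=True)
```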
Next, analyze the formatted data and get a rough picture of its distributions. When building a model, the training data should be as unbiased as possible, so this also matters for deciding how to frame the problem.

Data analysis is also important when thinking about features. With deep learning you apparently do not have to obsess over feature engineering quite as much, but with ordinary (non-deep) machine learning, such as gradient boosting with LightGBM, you need to think carefully about what the features should be. Even on Kaggle, finding a good feature seems to raise your chances of placing highly.
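For example, a quick look at the distributions with pandas could be as simple as the following sketch (the column names are assumptions):

```python
# Hypothetical quick checks of the formatted data; column names are assumptions.
print(horse_df["rank"].value_counts().sort_index())         # distribution of finishing positions
print(horse_df[["odds", "burden_weight"]].describe())        # basic statistics of numeric columns
print(race_df["date"].dt.year.value_counts().sort_index())   # number of races per year
```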
After deciding what kind of model to build based on the analysis above, create the training data. Roughly speaking, the input data is:

- Information about the race you want to predict
- Horse number
- Frame number
- Age
- Carried weight
- Body weight
- Weight change from the previous race
- Carried weight / body weight

The odds for the race you want to predict keep fluctuating until just before the start, so they are not included in the input.
First, an overview: this time I do deep learning with Keras. Using the data for a single horse as input, I built two models:

- a model that predicts the probability that the horse finishes first
- a model that predicts the probability that the horse finishes in the top three
You have to decide whether to treat this as a classification problem or a regression problem. As a regression problem, you would predict something like the finishing position (allowing values such as 1.2) or the time. As a classification problem, you would predict the finishing position as a class (a natural number from 1 to 16, say), or whether the horse finishes first, or whether it finishes near the top.

Times and speeds vary greatly by racecourse and course, so they are hard to predict unless handled separately. This time I simply predict "whether the horse finishes near the top" as a classification problem.
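As a small illustration of this framing, the binary labels can be built from the finishing position roughly as follows (the column names are my assumptions):

```python
# Hypothetical sketch: build binary classification labels from the finishing position.
# "rank" is assumed to be the finishing-position column after formatting.
horse_df["is_first"] = (horse_df["rank"] == 1).astype(int)   # finished first or not
horse_df["is_top3"] = (horse_df["rank"] <= 3).astype(int)    # finished in the top three or not
```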
Here I will write about the various things I tried while building the model.

Note that whatever model you build, it is essential both to take measures against overfitting and to verify whether overfitting is actually happening. Even if a model gets good results on your data, it may not be able to predict other data with good accuracy.
First, the basics: there is no point in building a model if you cannot evaluate whether it is good. 80% of the collected and formatted data was used as training data and 20% as test data, i.e.:

- Training data: January 2008 to July 23, 2017
- Test data: July 23, 2017 to November 2019

The accuracy figures quoted at the beginning were computed on this test data. During training, the training data was further split into a train part and a validation part.
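Since this is time series data, the split is by date rather than at random. A minimal sketch of such a split (assuming a `date` column and the hypothetical label above) might be:

```python
import pandas as pd

# Hypothetical sketch: chronological 80/20 split instead of a random split.
# "df" is assumed to hold one row per runner with a "date" column and an "is_top3" label.
df = df.sort_values("date").reset_index(drop=True)
split_idx = int(len(df) * 0.8)

train_df, test_df = df.iloc[:split_idx], df.iloc[split_idx:]
X_train, Y_train = train_df.drop(columns=["is_top3", "date"]), train_df["is_top3"]
X_test, Y_test = test_df.drop(columns=["is_top3", "date"]), test_df["is_top3"]
```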
Keras makes it easy to use weight regularization and dropout as ways to suppress overfitting. Weight regularization adds a cost that depends on the weights to the network's loss function; dropout randomly drops (zeroes out) features in a layer during training. For weight regularization I used L2 regularization.

Reference: Explore overfitting and underfitting | TensorFlow Core
```python
import tensorflow as tf

# df_columns_len is the number of input features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          activation=tf.nn.relu, input_dim=df_columns_len),  # L2-regularized layer
    tf.keras.layers.Dropout(0.2),  # dropout
    tf.keras.layers.Dense(100, kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          activation=tf.nn.relu),  # L2-regularized layer
    tf.keras.layers.Dropout(0.2),  # dropout
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
```
A simple holdout validation that uses only one specific period might just happen to look good because of overfitting to that period. As is done in competitions like Kaggle, let's verify the model with cross-validation on the data at hand.

The catch is that this is time series data, so you cannot simply split it with KFold. If future information ends up in the training folds and past information in the validation fold, the results can look better than they really are. In fact, I made this mistake at first and trained with future data mixed in, and the show (fukusho) prediction accuracy exceeded 70%.

So this time I used a split designed for cross-validating time series data, scikit-learn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html). Roughly speaking, as shown in the figure below, the dataset is split so that each training set grows forward in time, with the following slice used as validation data. With this figure you train three times. Some training data goes unused, though, so if you have little data, a simple holdout may be better.
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train_index, val_index in tscv.split(X_train, Y_train):
    train_data = X_train[train_index]
    train_label = Y_train[train_index]
    val_data = X_train[val_index]
    val_label = Y_train[val_index]
    model = train_model(train_data, train_label, val_data, val_label, target_name)
```
Hyperparameters matter in machine learning. In deep learning, for example, the larger the hidden layers, the more intermediate parameters there are, and the easier it is to overfit when training data is scarce. Make them too small, on the other hand, and the model may not be flexible enough to learn properly even with plenty of data. How to tune them is endlessly debatable and seems to vary from person to person.

This time I used a library called hyperas, which automates hyperparameter tuning for Keras; it was relatively intuitive and easy to use. In the simplest usage, you pass a data preparation function and a function that trains a model and returns the value you want to minimize to `optim.minimize`. You specify the search range with `choice` for discrete values and `uniform` for real numbers.

For details, see: https://github.com/maxpumperla/hyperas
```python
import keras
from keras.callbacks import EarlyStopping
from keras.callbacks import CSVLogger
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform


def prepare_data_is_hukusyo():
    """
    Prepare the data here (omitted)
    """
    return X_train, Y_train, X_test, Y_test


def create_model_is_hukusyo(X_train, Y_train, X_test, Y_test):
    train_size = int(len(Y_train) * 0.8)
    train_data = X_train[0:train_size]
    train_label = Y_train[0:train_size]
    val_data = X_train[train_size:len(Y_train)]
    val_label = Y_train[train_size:len(Y_train)]

    callbacks = []
    callbacks.append(EarlyStopping(monitor='val_loss', patience=2))

    model = Sequential()
    model.add(Dense({{choice([512, 1024])}}, kernel_regularizer=keras.regularizers.l2(0.001), activation="relu", input_dim=train_data.shape[1]))
    model.add(Dropout({{uniform(0, 0.3)}}))
    model.add(Dense({{choice([128, 256, 512])}}, kernel_regularizer=keras.regularizers.l2(0.001), activation="relu"))
    model.add(Dropout({{uniform(0, 0.5)}}))

    if {{choice(['three', 'four'])}} == 'three':
        pass
    elif {{choice(['three', 'four'])}} == 'four':
        model.add(Dense(8, kernel_regularizer=keras.regularizers.l2(0.001), activation="relu"))
        model.add(Dropout({{uniform(0, 0.5)}}))

    model.add(Dense(1, activation="sigmoid"))

    model.compile(
        loss='binary_crossentropy',
        optimizer=keras.optimizers.Adam(),
        metrics=['accuracy'])

    history = model.fit(train_data,
                        train_label,
                        validation_data=(val_data, val_label),
                        epochs=30,
                        batch_size=256,
                        callbacks=callbacks)

    val_loss, val_acc = model.evaluate(X_test, Y_test, verbose=0)
    print('Best validation loss of epoch:', val_loss)
    return {'loss': val_loss, 'status': STATUS_OK, 'model': model}


# Run the actual tuning with hyperas
best_run, best_model = optim.minimize(model=create_model_is_hukusyo,
                                      data=prepare_data_is_hukusyo,
                                      algo=tpe.suggest,
                                      max_evals=15,
                                      trials=Trials())
```
You may be able to make more accurate predictions by blending the outputs of different models. By averaging the predictions of the first-place model and the top-three model, I got slightly better values than either prediction alone.

The characteristics of horses likely to finish first and of horses likely to finish near the top are presumably a little different, so mixing the two is thought to give a more accurate prediction. For example, a horse that can win but gets eased off when a race is going badly, and a horse that consistently finishes near the top, probably have somewhat different traits.
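A minimal sketch of this kind of blending (the model names are my assumptions) could be:

```python
# Hypothetical sketch: average the outputs of the two models.
# model_is_first and model_is_top3 are assumed to be the two trained Keras models.
pred_first = model_is_first.predict(X_test).flatten()
pred_top3 = model_is_top3.predict(X_test).flatten()
pred_blend = (pred_first + pred_top3) / 2.0   # simple average of the two probabilities
```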
In the end, I built a model that predicts better than I do as a horse racing beginner.

- Win (tansho) accuracy: 0.2450
- Show (fukusho) accuracy: 0.5434

There is still plenty of information that looks important for horse racing, so there seems to be room for improvement.
The balance if I kept buying win tickets on the horse predicted to finish first is shown below (plotted quickly with pandas).

For show bets it came out as follows.

Both are deeply in the red. Things get a little better if you buy only the horses with high predicted probability, or avoid the ones with very low odds.
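As an illustration of this kind of simulation, here is a minimal sketch under my own assumptions: hypothetical columns `pred`, `rank`, and `odds` for the predicted probability, actual finishing position, and win odds, and a flat 100-yen bet per race.

```python
import matplotlib.pyplot as plt

# Hypothetical sketch: bet 100 yen to win on the horse with the highest predicted
# probability in each race, then plot the cumulative balance with pandas.
# "race_id", "pred", "rank", "odds", and "date" are assumed column names.
bets = test_df.loc[test_df.groupby("race_id")["pred"].idxmax()].sort_values("date")
bets["payout"] = (bets["rank"] == 1) * bets["odds"] * 100 - 100   # profit or loss per bet
bets["balance"] = bets["payout"].cumsum()
bets.plot(x="date", y="balance")
plt.show()
```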
Finally, I will note a few things I tried while doing this prediction that have nothing to do with the main topic.

My GCP free credits were about to expire around the end of November, so using them up was a secondary goal. Being able to launch a job before going to bed and check it after waking up was convenient. Note that free-tier instances do not have enough memory for the CSV creation and deep learning steps, so be careful if you use GCP.

On GCP I also set things up so that LINE Notify sends me a message when a program finishes or an error occurs. Being able to see the result right away and start the next job made things much easier.
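The article does not show the notification code; a minimal sketch of sending a message through LINE Notify could look like the following (the access token is a placeholder, and `train_model_somehow` stands in for the long-running job):

```python
import requests

# Hypothetical sketch: send a completion/error message via LINE Notify.
# LINE_NOTIFY_TOKEN is a placeholder for your own access token.
def send_line_notify(message, token="LINE_NOTIFY_TOKEN"):
    requests.post(
        "https://notify-api.line.me/api/notify",
        headers={"Authorization": "Bearer " + token},
        data={"message": message},
    )

try:
    train_model_somehow()              # placeholder for the long-running job
    send_line_notify("Training finished")
except Exception as e:
    send_line_notify("Error: " + str(e))
```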
This is just a casual side project by a student, so people familiar with the field will no doubt find plenty to improve. Mistakes are part of learning, so I would be grateful if you could kindly point them out in the comments or on Twitter.

Twitter ID (I do not tweet much): @634kami

The code is published on GitHub. I prioritized getting something that works, so it is not really in a presentable state, but feel free to look if you do not mind that. The code on Qiita has been partially modified to make it easier to read.
- Missing input values are currently filled with 0, so predict using only complete data
- Add pedigree data
- Add jockey data
- Try gradient boosting such as LightGBM
- Scraping of some tie-up (special) races failed, so complete that data
- If you have deep learning, you can exceed a 100% recovery rate in horse racing
- Horse Racing Prediction with Deep Learning
- Story of winning the Teio Sho by machine learning at Oi Horse Racing
- I tried to predict horse racing
- 7th: Method and Evaluation for Solving Horse Racing Prediction by Machine Learning
- Various ways to cut validation (summary of sklearn functions) [kaggle Advent Calendar Day 4]
Addendum: I did the following, so I am adding the results.

- Removed races with 7 or fewer runners from the training data
- Removed obstacle (jump) races from the training data
- Calculated the accuracy for each field size
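As a rough illustration of the per-field-size evaluation, a sketch under the same hypothetical column names used above might be:

```python
# Hypothetical sketch: win-prediction accuracy broken down by field size.
# "race_id", "pred", and "rank" are assumed column names in the test-set DataFrame.
test_df["field_size"] = test_df.groupby("race_id")["race_id"].transform("count")
picks = test_df.loc[test_df.groupby("race_id")["pred"].idxmax()]   # predicted winner per race
acc_by_size = picks.groupby("field_size").apply(lambda g: (g["rank"] == 1).mean())
print(acc_by_size)
```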
The results are below, with the accuracy of picking a horse completely at random shown for comparison.
| Runners | Random win acc. | Random show acc. | Model win acc. | Model show acc. |
|---|---|---|---|---|
| 8 | 0.1250 | 0.3750 | 0.3498 | 0.7044 |
| 9 | 0.1111 | 0.3333 | 0.2694 | 0.6568 |
| 10 | 0.1000 | 0.3000 | 0.3056 | 0.6408 |
| 11 | 0.0909 | 0.2727 | 0.2582 | 0.5468 |
| 12 | 0.0833 | 0.2500 | 0.2601 | 0.5827 |
| 13 | 0.0769 | 0.2308 | 0.2895 | 0.5855 |
| 14 | 0.0714 | 0.2143 | 0.2301 | 0.5381 |
| 15 | 0.0667 | 0.2000 | 0.2525 | 0.5327 |
In every case, the accuracy was better than picking completely at random.