I am a student majoring in information systems at a certain T university. While browsing articles on Qiita, I came across this one:

- If you have deep learning, you can exceed a 100% recovery rate in horse racing

Regarding the 100% recovery rate achieved in that article, the number of betting tickets in the purchase simulation is small, so it is unclear whether the result would hold over other periods. The source code is also behind a paywall, so the details of the method are unknown. Still, I thought it would be interesting to try predicting horse racing myself, so I gave it a go as a learning exercise.

Since I would be doing everything myself, from data collection to analysis and prediction, there would be plenty to learn.

Part of me hoped it might make money, but horse racing has a high takeout rate, so I did not expect much. The main reason was simply that deep learning has been a hot topic lately and I wanted to try it.
Other reasons for choosing horse racing:

- The outcome of a race is not affected by the people betting on it
- With enough explanatory variables, it seems possible to make predictions with reasonable accuracy
Stocks also seemed like a good theme, but prices move because of the decisions of many people, so it is apparently hard to predict them accurately unless you also incorporate the news and other information that traders watch. In addition, many institutional investors place orders automatically with algorithms, and prices likely depend on those as well. For these reasons I felt it would not be easy with current technology, and that horse racing was a better fit for deep learning.
The number of runners in horse racing varies from race to race, whereas in boat racing the number of participants is fixed, so boat racing might be even easier for machine learning if detailed data can be obtained.
"Horse racing (horse racing) is a race in which horses with horses compete, and a gambling that predicts the order of arrival" (quote: [Horse Racing-Wikipedia](https: //) ja.wikipedia.org/wiki/horse racing)).
I knew very little about horse racing before doing this analysis, so let me summarize the knowledge I think you need to read this article.

First, the types of betting tickets are basic knowledge worth knowing. It is enough to understand the win (tansho) and show (fukusho) bets. Reference: [Types of betting tickets: JRA for first-time users](https://www.jra.go.jp/kouza/beginner/baken/)
For other terms, see the following:

- Odds: how many times your stake you receive if the bet wins
- Agari: the closing stage of a race or workout
- Horse number (umaban): a number uniquely assigned to each runner
- Frame number (wakuban): numbers 1 to 8; one number is shared by every two starting gates
- Order of finish: the order in which the horses reach the goal
- Central horse racing: races held by the Japan Racing Association (JRA) at 10 racecourses: Sapporo, Hakodate, Fukushima, Niigata, Nakayama, Tokyo, Chukyo, Kyoto, Hanshin, and Kokura
- Local horse racing: races hosted by local governments, as opposed to central horse racing
Reference: Horse Racing Glossary JRA
I am not very familiar with horse racing, so please let me know if I have made a mistake.

Domain knowledge is said to be important in machine learning, so becoming familiar with horse racing will be necessary to improve prediction accuracy.
Even when the goal is just "predicting horse racing", there is a lot to think about and do. The procedure can be roughly divided into data collection, data formatting and analysis, and model creation and evaluation.

The first hurdle for anyone who wants to predict horse racing is collecting and shaping the data. In competitions like Kaggle the dataset is given to you from the start, which makes things easy, but this time we have to start by gathering the data ourselves.

Model creation is also hard because so many methods are possible. These days libraries make it easy to use gradient boosting, deep learning, and so on, but you still need to try various approaches to improve prediction accuracy.
- Basic knowledge of HTML, CSS, etc.
- Basic usage of Selenium
- Basic usage of BeautifulSoup
- Basic usage of pandas
- Basic usage of Keras
Data used:

- Training data: January 2008 to July 23, 2017
- Validation data: July 23, 2017 to November 2019

Results:

- Win (tansho) accuracy: 0.2450
- Show (fukusho) accuracy: 0.5434

As a horse racing beginner, I managed to build a model that predicts better than I do.
You cannot do machine learning without data, so let's start with crawling and scraping.

First, get information on past race results and horses from the target site. The data obtained here should be kept as close to the raw data as possible; it will be formatted for training later.
netkeiba is the largest horse racing information site in Japan. From past race data to pedigree information, you can get fairly detailed data for free. Even more detailed data is available to paid members, which would help if you want to improve the model's accuracy.

This time I decided to collect data focusing on race results at the central (JRA) racecourses, which have a large amount of information and a unified format. There is a lot of data available, so collecting and using more of it should produce a better model; however, gathering pedigree information and data on owners and trainers is quite a chore, so I skipped it this time. Adding such data would likely improve prediction accuracy.
From the detailed race search screen on the site, use Selenium to get the URLs of all the race result pages.

The reason for not using requests and BeautifulSoup, which are commonly used for crawling and scraping in Python, is that both the search page and the search results live at the same URL, [https://db.netkeiba.com/?pid=race_search_detail](https://db.netkeiba.com/?pid=race_search_detail), which does not change. When a page is generated dynamically with JavaScript or PHP, simply downloading the HTML does not give you the data you want.

With Selenium, screen transitions are performed through actual browser operations, so you can crawl even sites whose display changes when you click a button, or sites that require login. (Note that many sites requiring login prohibit crawling in their terms of service.)
```python
import time

from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Run Chrome in headless mode
driver = webdriver.Chrome(chrome_options=options)
wait = WebDriverWait(driver, 10)
```
Fill in the required fields on the form, submit it, and wait until the search results are displayed.
URL = "https://db.netkeiba.com/?pid=race_search_detail"
driver.get(URL)
time.sleep(1)
wait.until(EC.presence_of_all_elements_located)
#Search by month
year = 2019
month = 1
#Select a period
start_year_element = driver.find_element_by_name('start_year')
start_year_select = Select(start_year_element)
start_year_select.select_by_value(str(year))
start_mon_element = driver.find_element_by_name('start_mon')
start_mon_select = Select(start_mon_element)
start_mon_select.select_by_value(str(month))
end_year_element = driver.find_element_by_name('end_year')
end_year_select = Select(end_year_element)
end_year_select.select_by_value(str(year))
end_mon_element = driver.find_element_by_name('end_mon')
end_mon_select = Select(end_mon_element)
end_mon_select.select_by_value(str(month))
#Check out the Central Racecourse
for i in range(1,11):
terms = driver.find_element_by_id("check_Jyo_"+ str(i).zfill(2))
terms.click()
#Select the number to be displayed(20,50,From 100 to the maximum 100)
list_element = driver.find_element_by_name('list')
list_select = Select(list_element)
list_select.select_by_value("100")
#Submit form
frm = driver.find_element_by_css_selector("#db_search_detail_form > form")
frm.submit()
time.sleep(5)
wait.until(EC.presence_of_all_elements_located)
For simplicity, the code above only gets the URLs for January 2019. If you want data over a wider range, do one of the following:

- Do not fill in the year/month form at all
- Loop over each year and month and get the URLs for each (a sketch of this is shown below)
- Widen the range of selected years

(In the code on GitHub, race data that has not yet been acquired is collected going back to 2008.)
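As a rough illustration of the looping approach, here is a minimal sketch. It assumes the form-filling and URL-collecting steps shown above are wrapped into hypothetical helper functions `fill_search_form` and `collect_race_urls`; those names are my own, not part of the original code.

```python
# Hypothetical sketch: loop over every month in a range of years.
# fill_search_form() and collect_race_urls() are assumed helpers wrapping the
# form-filling and URL-saving code shown above; they are not in the original article.
for year in range(2008, 2020):
    for month in range(1, 13):
        driver.get(URL)
        time.sleep(1)
        fill_search_form(driver, year, month)             # period, racecourses, page size, submit
        collect_race_urls(driver, f"{year}-{month}.txt")  # page through results and save URLs
```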
If you do not select any racecourses, races held overseas will also be included, so be sure to check the 10 central racecourses. I decided not to use data from outside the central racecourses this time, because those races may have few runners or incomplete data.

Click the "Next" button with Selenium and save the URLs, 100 at a time.
```python
with open(str(year) + "-" + str(month) + ".txt", mode='w') as f:
    while True:
        time.sleep(5)
        wait.until(EC.presence_of_all_elements_located)
        all_rows = driver.find_element_by_class_name('race_table_01').find_elements_by_tag_name("tr")
        for row in range(1, len(all_rows)):
            race_href = all_rows[row].find_elements_by_tag_name("td")[4].find_element_by_tag_name("a").get_attribute("href")
            f.write(race_href + "\n")
        try:
            target = driver.find_elements_by_link_text("Next")[0]
            driver.execute_script("arguments[0].click();", target)  # Click via JavaScript
        except IndexError:
            break
```
The file is opened and each URL obtained is written on its own line. The race URL is in the 5th column of the table, and since Python indices start at 0, it is selected with `find_elements_by_tag_name("td")[4]`.

Paging is done in a while loop. On the last page there is no "Next" link to click, so the resulting exception is caught with `try`.

The `driver.execute_script("arguments[0].click();", target)` part inside the `try` could be written as a simple `target.click()`, but that raised an `ElementClickInterceptedException` in headless mode. Apparently Selenium decided the element was overlapped by another and could not be clicked. I found a workaround, and clicking via JavaScript as above worked fine.
The pages collected above do not seem to rely much on PHP or JavaScript for display, so from here on I finally use requests. I fetch the HTML for each URL collected above and save it, but since each page takes a few seconds, this step takes a long time.
```python
import os
import requests

save_dir = "html" + "/" + str(year) + "/" + str(month)
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)

with open(str(year) + "-" + str(month) + ".txt", "r") as f:
    urls = f.read().splitlines()
    for url in urls:
        parts = url.split("/")          # avoid shadowing the built-in list
        race_id = parts[-2]
        save_file_path = save_dir + "/" + race_id + '.html'
        response = requests.get(url)
        response.encoding = response.apparent_encoding  # avoid garbled Japanese text
        html = response.text
        time.sleep(5)                   # wait so as not to hammer the server
        with open(save_file_path, 'w') as file:
            file.write(html)
```
Depending on the character encoding, naively fetching the page can leave you with garbled characters. Setting `response.encoding = response.apparent_encoding` fixed it for me.

Reference: Correcting garbled characters when handling Japanese with Requests
The race details and the information on each runner are saved to CSV files with the following layout.

Race details:

- Race ID
- Round number
- Race title
- Course information

Horse details (one row per runner):

- Race ID
- Finishing position
- Horse ID
- Horse number
- Frame number
- Sex and age
- Carried weight
- Body weight and weight change
- Time
- Margin
- Agari (closing) time
- Odds
- Popularity

Other information is also available; paid members can apparently even get what is called a speed index.
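The actual `race_data_columns` and `horse_data_columns` used below are omitted in the article. As a rough illustration only, and based on the fields listed above plus the columns referenced later in the formatting code, they might look something like this (the exact names are my assumption, not the author's):

```python
# Hypothetical column names matching the fields listed above;
# the article omits the real race_data_columns / horse_data_columns.
race_data_columns = [
    "race_id", "race_round", "race_title", "course_info", "date", "time",
]
horse_data_columns = [
    "race_id", "rank", "horse_id", "horse_number", "frame_number",
    "sex_age", "burden_weight", "horse_weight", "weight_diff",
    "time", "margin", "last_3f", "odds", "popularity",
]
```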
```python
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

CSV_DIR = "csv"
if not os.path.isdir(CSV_DIR):
    os.makedirs(CSV_DIR)
save_race_csv = CSV_DIR + "/race-" + str(year) + "-" + str(month) + ".csv"
horse_race_csv = CSV_DIR + "/horse-" + str(year) + "-" + str(month) + ".csv"

# race_data_columns and horse_data_columns are omitted here because they are long
race_df = pd.DataFrame(columns=race_data_columns)
horse_df = pd.DataFrame(columns=horse_data_columns)

html_dir = "html" + "/" + str(year) + "/" + str(month)
if os.path.isdir(html_dir):
    file_list = os.listdir(html_dir)
    for file_name in file_list:
        with open(html_dir + "/" + file_name, "r") as f:
            html = f.read()
            parts = file_name.split(".")   # avoid shadowing the built-in list
            race_id = parts[-2]
            race_list, horse_list_list = get_rade_and_horse_data_by_html(race_id, html)  # omitted because it is long
            for horse_list in horse_list_list:
                horse_se = pd.Series(horse_list, index=horse_df.columns)
                horse_df = horse_df.append(horse_se, ignore_index=True)
            race_se = pd.Series(race_list, index=race_df.columns)
            race_df = race_df.append(race_se, ignore_index=True)

race_df.to_csv(save_race_csv, header=True, index=False)
horse_df.to_csv(horse_race_csv, header=True, index=False)
```
For each race, the race details and the information on each runner are collected into lists and appended to the pandas DataFrames one row at a time.

The `get_rade_and_horse_data_by_html` function, `race_data_columns`, and `horse_data_columns` are long, so they are not included here. Briefly, `get_rade_and_horse_data_by_html` uses BeautifulSoup to extract the desired data from the HTML and return it as lists, and `race_data_columns` / `horse_data_columns` are the column names of the data to be acquired.
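To give a rough idea of what such a parsing function could look like, here is my own minimal sketch; it is not the author's omitted implementation, and the selectors and field extraction are assumptions that would need adapting to netkeiba's actual markup (only the `race_table_01` class is taken from the Selenium code above):

```python
from bs4 import BeautifulSoup

def get_race_and_horse_data_by_html_sketch(race_id, html):
    """Hypothetical sketch: parse one saved race page into
    (race_list, horse_list_list). Selectors are assumptions."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("h1")
    race_title = title_tag.get_text(strip=True) if title_tag else ""
    race_list = [race_id, race_title]  # plus round, course info, date, ... in the real version

    horse_list_list = []
    table = soup.find("table", class_="race_table_01")
    if table is not None:
        for tr in table.find_all("tr")[1:]:           # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                horse_list_list.append([race_id] + cells)
    return race_list, horse_list_list
```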
When crawling, be sure to leave enough time between requests so that you do not put excessive load on the server. Others have written detailed summaries of the legal precautions; if you actually do this, please refer to articles such as "List of precautions for web scraping" on Qiita.
Now that the data is in CSV form, let's clean it up so that it is easy to handle. Then, while looking at the data, think about what kind of model to build, and create the training data to match.

First, format the data so it is easy to work with. For example, convert dates and numbers stored as strings into datetime objects and ints. It also helps later if each column holds a single simple value, so the sex-and-age column is split into two columns, and so on; there is a lot of this kind of work. It might have been better to do this during scraping, but the scraping code looked like it would get complicated, so I did it separately this time.

Some examples are shown below.
```python
# Extract the start time, combine it with the date, and convert to datetime.
# The raw values are Japanese text such as "発走 : 15:30" and "2019年1月5日".
race_df["time"] = race_df["time"].str.replace(r'発走 : (\d\d):(\d\d)(.|\n)*', r'\1時\2分', regex=True)
race_df["date"] = race_df["date"] + race_df["time"]
race_df["date"] = pd.to_datetime(race_df['date'], format='%Y年%m月%d日%H時%M分')

# The original time column is no longer needed, so drop it
race_df.drop(['time'], axis=1, inplace=True)

# Strip the extra "R", spaces, and newlines from the round column
race_df['race_round'] = race_df['race_round'].str.strip('R \n')
```
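The article mentions splitting the sex-and-age column into two but does not show that code. A minimal sketch (the column names are my assumptions) could look like this:

```python
# Hypothetical sketch: split a combined sex/age string such as "牡3" into two columns.
# "sex_age" and the new column names are assumptions, not the article's.
horse_df["sex"] = horse_df["sex_age"].str[0]               # first character: sex
horse_df["age"] = horse_df["sex_age"].str[1:].astype(int)  # remaining characters: age
horse_df.drop(["sex_age"], axis=1, inplace=True)
```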
Next, analyze the formatted data and get a rough picture of its distributions. When building a model, the training data should be as unbiased as possible, so this also matters for deciding how to frame the problem.

Data analysis is also important when thinking about features. With deep learning you apparently do not have to obsess over feature engineering quite as much, but with ordinary (non-deep) machine learning, such as gradient boosting with LightGBM, you need to think carefully about what the features should be. Even on Kaggle, finding a good feature seems to raise your chances of placing highly.
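For example, a quick look at the distributions with pandas could be as simple as the following sketch (the column names are assumptions):

```python
# Hypothetical quick checks of the formatted data; column names are assumptions.
print(horse_df["rank"].value_counts().sort_index())         # distribution of finishing positions
print(horse_df[["odds", "burden_weight"]].describe())        # basic statistics of numeric columns
print(race_df["date"].dt.year.value_counts().sort_index())   # number of races per year
```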
After deciding what kind of model to build based on the analysis above, create the training data. Roughly speaking, the input data is:

- Information about the race you want to predict
- Horse number
- Frame number
- Age
- Carried weight
- Body weight
- Weight change from the previous race
- Carried weight / body weight

The odds for the race you want to predict keep fluctuating until just before the start, so they are not included in the input.
First, an overview: this time I do deep learning with Keras. Using the data for a single horse as input, I built two models:

- a model that predicts the probability that the horse finishes first
- a model that predicts the probability that the horse finishes in the top three
You have to decide whether to treat this as a classification problem or a regression problem. As a regression problem, you would predict something like the finishing position (allowing values such as 1.2) or the time. As a classification problem, you would predict the finishing position as a class (a natural number from 1 to 16, say), or whether the horse finishes first, or whether it finishes near the top.

Times and speeds vary greatly by racecourse and course, so they are hard to predict unless handled separately. This time I simply predict "whether the horse finishes near the top" as a classification problem.
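As a small illustration of this framing, the binary labels can be built from the finishing position roughly as follows (the column names are my assumptions):

```python
# Hypothetical sketch: build binary classification labels from the finishing position.
# "rank" is assumed to be the finishing-position column after formatting.
horse_df["is_first"] = (horse_df["rank"] == 1).astype(int)   # finished first or not
horse_df["is_top3"] = (horse_df["rank"] <= 3).astype(int)    # finished in the top three or not
```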
Here I will write about the various things I tried while building the model.

Note that whatever model you build, it is essential both to take measures against overfitting and to verify whether overfitting is actually happening. Even if a model gets good results on your data, it may not be able to predict other data with good accuracy.
First, the basics: there is no point in building a model if you cannot evaluate whether it is good. 80% of the collected and formatted data was used as training data and 20% as test data, i.e.:

- Training data: January 2008 to July 23, 2017
- Test data: July 23, 2017 to November 2019

The accuracy figures quoted at the beginning were computed on this test data. During training, the training data was further split into a train part and a validation part.
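Since this is time series data, the split is by date rather than at random. A minimal sketch of such a split (assuming a `date` column and the hypothetical label above) might be:

```python
import pandas as pd

# Hypothetical sketch: chronological 80/20 split instead of a random split.
# "df" is assumed to hold one row per runner with a "date" column and an "is_top3" label.
df = df.sort_values("date").reset_index(drop=True)
split_idx = int(len(df) * 0.8)

train_df, test_df = df.iloc[:split_idx], df.iloc[split_idx:]
X_train, Y_train = train_df.drop(columns=["is_top3", "date"]), train_df["is_top3"]
X_test, Y_test = test_df.drop(columns=["is_top3", "date"]), test_df["is_top3"]
```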
Keras makes it easy to use weight regularization and dropout as ways to suppress overfitting. Weight regularization adds a cost that depends on the weights to the network's loss function; dropout randomly drops (zeroes out) features in a layer during training. For weight regularization I used L2 regularization.

Reference: Explore overfitting and underfitting | TensorFlow Core
```python
import tensorflow as tf

# df_columns_len is the number of input features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(300, kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          activation=tf.nn.relu, input_dim=df_columns_len),  # L2-regularized layer
    tf.keras.layers.Dropout(0.2),  # dropout
    tf.keras.layers.Dense(100, kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          activation=tf.nn.relu),  # L2-regularized layer
    tf.keras.layers.Dropout(0.2),  # dropout
    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
```
A simple holdout validation that uses only one specific period might just happen to look good because of overfitting to that period. As is done in competitions like Kaggle, let's verify the model with cross-validation on the data at hand.

The catch is that this is time series data, so you cannot simply split it with KFold. If future information ends up in the training folds and past information in the validation fold, the results can look better than they really are. In fact, I made this mistake at first and trained with future data mixed in, and the show (fukusho) prediction accuracy exceeded 70%.

So this time I used a split designed for cross-validating time series data, scikit-learn's [TimeSeriesSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html). Roughly speaking, as shown in the figure below, the dataset is split so that each training set grows forward in time, with the following slice used as validation data. With this figure you train three times. Some training data goes unused, though, so if you have little data, a simple holdout may be better.
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train_index, val_index in tscv.split(X_train, Y_train):
    train_data = X_train[train_index]
    train_label = Y_train[train_index]
    val_data = X_train[val_index]
    val_label = Y_train[val_index]
    model = train_model(train_data, train_label, val_data, val_label, target_name)
```
Hyperparameters matter in machine learning. In deep learning, for example, the larger the hidden layers, the more intermediate parameters there are, and the easier it is to overfit when training data is scarce. Make them too small, on the other hand, and the model may not be flexible enough to learn properly even with plenty of data. How to tune them is endlessly debatable and seems to vary from person to person.

This time I used a library called hyperas, which automates hyperparameter tuning for Keras; it was relatively intuitive and easy to use. In the simplest usage, you pass a data preparation function and a function that trains a model and returns the value you want to minimize to `optim.minimize`. You specify the search range with `choice` for discrete values and `uniform` for real numbers.

For details, see: https://github.com/maxpumperla/hyperas
```python
import keras
from keras.callbacks import EarlyStopping
from keras.callbacks import CSVLogger
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform


def prepare_data_is_hukusyo():
    """
    Prepare the data here (omitted)
    """
    return X_train, Y_train, X_test, Y_test


def create_model_is_hukusyo(X_train, Y_train, X_test, Y_test):
    train_size = int(len(Y_train) * 0.8)
    train_data = X_train[0:train_size]
    train_label = Y_train[0:train_size]
    val_data = X_train[train_size:len(Y_train)]
    val_label = Y_train[train_size:len(Y_train)]

    callbacks = []
    callbacks.append(EarlyStopping(monitor='val_loss', patience=2))

    model = Sequential()
    model.add(Dense({{choice([512, 1024])}}, kernel_regularizer=keras.regularizers.l2(0.001), activation="relu", input_dim=train_data.shape[1]))
    model.add(Dropout({{uniform(0, 0.3)}}))
    model.add(Dense({{choice([128, 256, 512])}}, kernel_regularizer=keras.regularizers.l2(0.001), activation="relu"))
    model.add(Dropout({{uniform(0, 0.5)}}))

    if {{choice(['three', 'four'])}} == 'three':
        pass
    elif {{choice(['three', 'four'])}} == 'four':
        model.add(Dense(8, kernel_regularizer=keras.regularizers.l2(0.001), activation="relu"))
        model.add(Dropout({{uniform(0, 0.5)}}))

    model.add(Dense(1, activation="sigmoid"))

    model.compile(
        loss='binary_crossentropy',
        optimizer=keras.optimizers.Adam(),
        metrics=['accuracy'])

    history = model.fit(train_data,
                        train_label,
                        validation_data=(val_data, val_label),
                        epochs=30,
                        batch_size=256,
                        callbacks=callbacks)

    val_loss, val_acc = model.evaluate(X_test, Y_test, verbose=0)
    print('Best validation loss of epoch:', val_loss)
    return {'loss': val_loss, 'status': STATUS_OK, 'model': model}


# Run the actual tuning with hyperas
best_run, best_model = optim.minimize(model=create_model_is_hukusyo,
                                      data=prepare_data_is_hukusyo,
                                      algo=tpe.suggest,
                                      max_evals=15,
                                      trials=Trials())
```
You may be able to make more accurate predictions by blending the outputs of different models. By averaging the predictions of the first-place model and the top-three model, I got slightly better values than either prediction alone.

The characteristics of horses likely to finish first and of horses likely to finish near the top are presumably a little different, so mixing the two is thought to give a more accurate prediction. For example, a horse that can win but gets eased off when a race is going badly, and a horse that consistently finishes near the top, probably have somewhat different traits.
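A minimal sketch of this kind of blending (the model names are my assumptions) could be:

```python
# Hypothetical sketch: average the outputs of the two models.
# model_is_first and model_is_top3 are assumed to be the two trained Keras models.
pred_first = model_is_first.predict(X_test).flatten()
pred_top3 = model_is_top3.predict(X_test).flatten()
pred_blend = (pred_first + pred_top3) / 2.0   # simple average of the two probabilities
```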
In the end, I built a model that predicts better than I do as a horse racing beginner.

- Win (tansho) accuracy: 0.2450
- Show (fukusho) accuracy: 0.5434

There is still plenty of information that looks important for horse racing, so there seems to be room for improvement.
The balance if I kept buying win tickets on the horse predicted to finish first is shown below (plotted quickly with pandas).

For show bets it came out as follows.

Both are deeply in the red. Things get a little better if you buy only the horses with high predicted probability, or avoid the ones with very low odds.
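As an illustration of this kind of simulation, here is a minimal sketch under my own assumptions: hypothetical columns `pred`, `rank`, and `odds` for the predicted probability, actual finishing position, and win odds, and a flat 100-yen bet per race.

```python
import matplotlib.pyplot as plt

# Hypothetical sketch: bet 100 yen to win on the horse with the highest predicted
# probability in each race, then plot the cumulative balance with pandas.
# "race_id", "pred", "rank", "odds", and "date" are assumed column names.
bets = test_df.loc[test_df.groupby("race_id")["pred"].idxmax()].sort_values("date")
bets["payout"] = (bets["rank"] == 1) * bets["odds"] * 100 - 100   # profit or loss per bet
bets["balance"] = bets["payout"].cumsum()
bets.plot(x="date", y="balance")
plt.show()
```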
Finally, I will note a few things I tried while doing this prediction that have nothing to do with the main topic.

My GCP free credits were about to expire around the end of November, so using them up was a secondary goal. Being able to launch a job before going to bed and check it after waking up was convenient. Note that free-tier instances do not have enough memory for the CSV creation and deep learning steps, so be careful if you use GCP.

On GCP I also set things up so that LINE Notify sends me a message when a program finishes or an error occurs. Being able to see the result right away and start the next job made things much easier.
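The article does not show the notification code; a minimal sketch of sending a message through LINE Notify could look like the following (the access token is a placeholder, and `train_model_somehow` stands in for the long-running job):

```python
import requests

# Hypothetical sketch: send a completion/error message via LINE Notify.
# LINE_NOTIFY_TOKEN is a placeholder for your own access token.
def send_line_notify(message, token="LINE_NOTIFY_TOKEN"):
    requests.post(
        "https://notify-api.line.me/api/notify",
        headers={"Authorization": "Bearer " + token},
        data={"message": message},
    )

try:
    train_model_somehow()              # placeholder for the long-running job
    send_line_notify("Training finished")
except Exception as e:
    send_line_notify("Error: " + str(e))
```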
This is just a casual side project by a student, so people familiar with the field will no doubt find plenty to improve. Mistakes are part of learning, so I would be grateful if you could kindly point them out in the comments or on Twitter.

Twitter ID (I do not tweet much): @634kami

The code is published on GitHub. I prioritized getting something that works, so it is not really in a presentable state, but feel free to look if you do not mind that. The code on Qiita has been partially modified to make it easier to read.
- Missing input values are currently filled with 0, so predict using only complete data
- Add pedigree data
- Add jockey data
- Try gradient boosting such as LightGBM
- Scraping of some tie-up (special) races failed, so complete that data
- If you have deep learning, you can exceed a 100% recovery rate in horse racing
- Horse Racing Prediction with Deep Learning
- Story of winning the Teio Sho by machine learning at Oi Horse Racing
- I tried to predict horse racing
- 7th: Method and Evaluation for Solving Horse Racing Prediction by Machine Learning
- Various ways to cut validation (summary of sklearn functions) [kaggle Advent Calendar Day 4]
Addendum: I did the following, so I am adding the results.

- Removed races with 7 or fewer runners from the training data
- Removed obstacle (jump) races from the training data
- Calculated the accuracy for each field size
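As a rough illustration of the per-field-size evaluation, a sketch under the same hypothetical column names used above might be:

```python
# Hypothetical sketch: win-prediction accuracy broken down by field size.
# "race_id", "pred", and "rank" are assumed column names in the test-set DataFrame.
test_df["field_size"] = test_df.groupby("race_id")["race_id"].transform("count")
picks = test_df.loc[test_df.groupby("race_id")["pred"].idxmax()]   # predicted winner per race
acc_by_size = picks.groupby("field_size").apply(lambda g: (g["rank"] == 1).mean())
print(acc_by_size)
```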
The results are below, with the accuracy of picking a horse completely at random shown for comparison.
| Runners | Random win acc. | Random show acc. | Model win acc. | Model show acc. |
|---|---|---|---|---|
| 8 | 0.1250 | 0.3750 | 0.3498 | 0.7044 |
| 9 | 0.1111 | 0.3333 | 0.2694 | 0.6568 |
| 10 | 0.1000 | 0.3000 | 0.3056 | 0.6408 |
| 11 | 0.0909 | 0.2727 | 0.2582 | 0.5468 |
| 12 | 0.0833 | 0.2500 | 0.2601 | 0.5827 |
| 13 | 0.0769 | 0.2308 | 0.2895 | 0.5855 |
| 14 | 0.0714 | 0.2143 | 0.2301 | 0.5381 |
| 15 | 0.0667 | 0.2000 | 0.2525 | 0.5327 |
In every case, the accuracy was better than picking completely at random.