Purpose

Predict horse racing with machine learning and aim for a recovery rate of 100%

What to do this time

This article is a continuation of the following articles:

- Scraping race result data using pandas read_html
- Scraping detailed race information using Beautiful Soup
- Predicting the horses that will finish in the top 3 with LightGBM
- [Add past performance data of horses to features](https://qiita.com/dijzpeb/items/63cb783c7d45cb91d262)

This time, I simulate how much I would win or lose if I actually used this model to place bets of the place type (fukusho, 複勝).

Source code

First, scrape the payout table. If you scrape it as-is, the multiple entries inside the place (複勝) and wide (ワイド) cells are run together instead of being separated, so first convert the <br /> line break tag into the plain string 'br'.

f = urlopen(url)
html = f.read()
html = html.replace(b'<br />', b'br')
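
To see why the marker is needed, here is a minimal sketch that uses the standard library's HTMLParser as a stand-in for pd.read_html (an assumption made for illustration; the toy HTML row and the CellText helper are hypothetical). Like a table scraper, it keeps only the text of each cell, so the <br /> tags vanish unless they are first turned into text.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td> cell, ignoring tags inside the cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def cell_texts(html_bytes):
    parser = CellText()
    parser.feed(html_bytes.decode("ascii"))
    return parser.cells


# Toy payout row: three horse numbers separated by <br /> inside one cell
html = b"<table><tr><td>3<br />7<br />1</td></tr></table>"

raw = cell_texts(html)                               # tags dropped, numbers run together
marked = cell_texts(html.replace(b"<br />", b"br"))  # marker survives as text
horses = marked[0].split("br")                       # values can be separated again
```

Without the replacement the cell reads "371"; with it the cell reads "3br7br1", which can be split back into the individual horse numbers.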


As in the previous article, create a function that takes a list of race_id values and scrapes the payout data for each race, then run it and convert the result into a single DataFrame.

import pandas as pd
import time
from tqdm.notebook import tqdm
from urllib.request import urlopen

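def scrape_return_tables(race_id_list, pre_return_tables=None):
    # Start from previously scraped tables, if any (avoids a shared mutable default)
    return_tables = dict(pre_return_tables) if pre_return_tables else {}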
    for race_id in tqdm(race_id_list):
        if race_id in return_tables.keys():
            continue
        try:
            url = "https://db.netkeiba.com/race/" + race_id
            f = urlopen(url)
            html = f.read()
            html = html.replace(b'<br />', b'br')
            dfs = pd.read_html(html)
            return_tables[race_id] = pd.concat([dfs[1], dfs[2]])
            time.sleep(1)
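        except IndexError:
            # The page has fewer tables than expected; skip this race
            continue
        except Exception as e:
            # Any other failure (e.g. a network error): report it and return what we have
            print(race_id, e)
            break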
    return return_tables

return_tables = scrape_return_tables(race_id_list)
for key in return_tables:
    return_tables[key].index = [key] * len(return_tables[key])
return_tables = pd.concat([return_tables[key] for key in return_tables])
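
The last three lines label every row with its race_id and stack the per-race tables into one DataFrame. A minimal sketch of that step on toy data (the race ids and payout values below are hypothetical, and the bet-type labels are written as they appear on netkeiba):

```python
import pandas as pd

# Hypothetical race ids and payout rows, shaped like the scraped tables
tables = {
    "202001010101": pd.DataFrame([["単勝", "3", "250"],
                                  ["複勝", "3br7br1", "110br150br300"]]),
    "202001010102": pd.DataFrame([["単勝", "5", "480"],
                                  ["複勝", "5br2br8", "180br120br210"]]),
}

# Label every row with its race id, then stack the tables
for race_id, table in tables.items():
    table.index = [race_id] * len(table)
combined = pd.concat(list(tables.values()))

# All bet types of one race can now be looked up by race id
one_race = combined.loc["202001010101"]
```

Using race_id as the index is what later allows the payouts to be merged with the predictions, assuming the feature table is indexed by race_id in the same way.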

Next, create a Return class that processes the place (複勝) payout data into a usable form.

class Return:
    def __init__(self, return_tables):
        self.return_tables = return_tables
    
    @property
    def fukusho(self):
        # The table scraped from netkeiba labels place bets in Japanese ('複勝')
        fukusho = self.return_tables[self.return_tables[0]=='複勝'][[1,2]]
        wins = fukusho[1].str.split('br', expand=True).drop([3], axis=1)
        wins.columns = ['win_0', 'win_1', 'win_2']
        returns = fukusho[2].str.split('br', expand=True).drop([3], axis=1)
        returns.columns = ['return_0', 'return_1', 'return_2']
        
        df = pd.concat([wins, returns], axis=1)
        for column in df.columns:
            df[column] = df[column].str.replace(',', '')
        return df.fillna(0).astype(int)

rt = Return(return_tables)
rt.fukusho
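
The fukusho property splits each 'br'-separated cell into one column per placed horse. Note that drop([3], axis=1) assumes the split produces a fourth column. The following is a minimal sketch of the same idea on toy data that instead keeps the first three columns; the split_cells helper and the sample rows are hypothetical, not part of the original code.

```python
import pandas as pd

# Hypothetical place rows: column 1 = horse numbers, column 2 = payouts per 100 yen
fukusho = pd.DataFrame(
    {1: ["3br7br1", "5br2"], 2: ["150br1,230br300", "110br180"]},
    index=["race_a", "race_b"],
)


def split_cells(cells, prefix, n=3):
    # One column per 'br'-separated value; keep the first n columns, dropping any extra ones
    parts = cells.str.split("br", expand=True).iloc[:, :n]
    parts.columns = [f"{prefix}_{i}" for i in range(n)]
    return parts


df = pd.concat(
    [split_cells(fukusho[1], "win"), split_cells(fukusho[2], "return")], axis=1
)
# Strip thousands separators, treat missing slots as 0, and convert to int
df = df.apply(lambda col: col.str.replace(",", "")).fillna("0").astype(int)
```

A race that pays only two places ends up with 0 in win_2 and return_2, which matches how the original class fills missing values.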

Next, create a ModelEvaluator class that takes the trained LightGBM model and the payout data you just scraped, and evaluates the model by calculating the AUC score and the betting balance.

from sklearn.metrics import roc_auc_score

class ModelEvaluator:
    def __init__(self, model, return_tables):
        self.model = model
        self.fukusho = Return(return_tables).fukusho
    
    def predict_proba(self, X):
        return self.model.predict_proba(X)[:, 1]
    
    def predict(self, X, threshold=0.5):
        y_pred = self.predict_proba(X)
        return [0 if p<threshold else 1 for p in y_pred]
    
    def score(self, y_true, X):
        return roc_auc_score(y_true, self.predict_proba(X))
    
    def feature_importance(self, X, n_display=20):
        importances = pd.DataFrame({"features": X.columns, 
                                    "importance": self.model.feature_importances_})
        return importances.sort_values("importance", ascending=False)[:n_display]
    
    def pred_table(self, X, threshold=0.5, bet_only=True):
        pred_table = X.copy()[['Horse number']]
        pred_table['pred'] = self.predict(X, threshold)
        if bet_only:
            return pred_table[pred_table['pred']==1]['Horse number']
        else:
            return pred_table
        
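    def calculate_return(self, X, threshold=0.5):
        # Horse numbers the model bets on, indexed by race_id
        pred_table = self.pred_table(X, threshold)
        # Stake 100 yen on every bet
        money = -100 * len(pred_table)
        df = self.fukusho.copy()
        # Attach each race's placed horses and payouts to every bet in that race
        df = df.merge(pred_table, left_index=True, right_index=True, how='right')
        for i in range(3):
            # Add the payout whenever the bet horse matches the i-th placed horse
            money += df[df['win_{}'.format(i)]==df['Horse number']]['return_{}'.format(i)].sum()
        return money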
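
The arithmetic inside calculate_return can be checked by hand. The sketch below reproduces it in plain Python on hypothetical data (the race ids, payouts, bets, and the balance helper are all made up for illustration): every bet costs 100 yen, and a bet on a placed horse gets back that horse's listed payout.

```python
# Hypothetical results: race id -> {placed horse number: payout per 100 yen}
place_payouts = {
    "race_a": {3: 150, 7: 1230, 1: 300},
    "race_b": {5: 110, 2: 180},
}
# Hypothetical bets chosen by the model: (race id, horse number)
bets = [("race_a", 7), ("race_a", 4), ("race_b", 2), ("race_b", 6)]


def balance(bets, place_payouts):
    money = -100 * len(bets)  # pay the 100 yen stake for every bet
    for race_id, horse in bets:
        # A hit pays the listed amount; a miss pays nothing
        money += place_payouts.get(race_id, {}).get(horse, 0)
    return money


result = balance(bets, place_payouts)
```

Four bets cost 400 yen, and the two hits return 1230 + 180 yen, so the balance here is +1010 yen.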

Running the simulation for thresholds from 0 to 0.99 and plotting the balance at each threshold gives:

me = ModelEvaluator(lgb_clf, return_tables)

gain = {}
n_samples = 100
for i in tqdm(range(n_samples)):
    threshold = i / n_samples
    gain[threshold] = me.calculate_return(X_test, threshold)
pd.Series(gain).plot()

The balance is clearly negative, so the model still needs improvement.

A detailed explanation is given in the video: Data analysis / machine learning starting with horse racing prediction

A concrete method of predicting horse racing by machine learning and simulating the recovery rate
