Predict horse racing with machine learning and aim for a recovery rate of 100%.
Using all of the 2019 race result data obtained in the previous article, we predict which horses will finish in the top 3 with LightGBM.
First, the preprocessing:
```python
import datetime

import pandas as pd


def preprocessing(results):
    df = results.copy()
    # Remove rows whose finishing order (着順) contains non-numeric
    # strings (scratched or disqualified horses), then cast to int
    df = df[~(df["着順"].astype(str).str.contains(r"\D"))]
    df["着順"] = df["着順"].astype(int)
    # Split sex/age (性齢) into sex and age
    df["sex"] = df["性齢"].map(lambda x: str(x)[0])
    df["age"] = df["性齢"].map(lambda x: str(x)[1:]).astype(int)
    # Split horse weight (馬体重) into weight and weight change
    df["weight"] = df["馬体重"].str.split("(", expand=True)[0].astype(int)
    df["weight_change"] = df["馬体重"].str.split("(", expand=True)[1].str[:-1].astype(int)
    # Convert win odds (単勝) to float
    df["単勝"] = df["単勝"].astype(float)
    # Drop columns that are not used
    df.drop(["タイム", "着差", "調教師", "性齢", "馬体重"], axis=1, inplace=True)
    # Parse the Japanese-formatted date string into a datetime
    df["date"] = pd.to_datetime(df["date"], format="%Y年%m月%d日")
    return df


results_p = preprocessing(results)
```
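As a quick sanity check (not part of the original steps), it can help to confirm that the remaining columns came out with the expected types:

```python
# Illustrative check: most remaining columns should now be numeric,
# with a proper datetime column for the race date.
print(results_p.dtypes)
print(results_p.head())
```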
I want to split this into training data and test data, but train_test_split cannot be used here, because the training data must be older than the test data. Instead, we create a function that splits the data chronologically using the 'date' column, which is now a datetime type.
```python
def split_data(df, test_size=0.3):
    # Sort the race ids (index) by date so that every race in the
    # test set is later than every race in the training set
    sorted_id_list = df.sort_values("date").index.unique()
    train_id_list = sorted_id_list[: round(len(sorted_id_list) * (1 - test_size))]
    test_id_list = sorted_id_list[round(len(sorted_id_list) * (1 - test_size)) :]
    train = df.loc[train_id_list].drop(["date"], axis=1)
    test = df.loc[test_id_list].drop(["date"], axis=1)
    return train, test
```
Categorical variables are converted into dummy variables, but the horse name has far too many categories, so it is dropped this time.
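To see why, a quick hypothetical check (run before dropping the column, and not part of the original article) illustrates the problem: every distinct horse name would become its own dummy column.

```python
# Hypothetical check, assuming the horse-name column is 馬名 as above:
# each unique name would turn into one dummy column after pd.get_dummies.
print(results_p["馬名"].nunique())
```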
```python
results_p.drop(["馬名"], axis=1, inplace=True)
results_d = pd.get_dummies(results_p)
```
If the finishing order is 3rd or better, label it 1; otherwise label it 0, and treat this as the objective variable.
results_d["rank"] = results_d["Order of arrival"].map(lambda x: 1 if x < 4 else 0)
results_d.drop(['Order of arrival'], axis=1, inplace=True)
Using the split_data function created above, split the data into training and test sets and train a model with LightGBM.
```python
import lightgbm as lgb

train, test = split_data(results_d, 0.3)
X_train = train.drop(["rank"], axis=1)
y_train = train["rank"]
X_test = test.drop(["rank"], axis=1)
y_test = test["rank"]

params = {
    "num_leaves": 4,
    "n_estimators": 80,
    # Compensate for the imbalance: only about 3 horses per race are labeled 1
    "class_weight": "balanced",
    "random_state": 100,
}

lgb_clf = lgb.LGBMClassifier(**params)
lgb_clf.fit(X_train.values, y_train.values)
```
Evaluate with the AUC score.
```python
from sklearn.metrics import roc_auc_score

y_pred_train = lgb_clf.predict_proba(X_train)[:, 1]
y_pred = lgb_clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_train, y_pred_train))
print(roc_auc_score(y_test, y_pred))
```
The results were: training data: 0.819, test data: 0.812. I think that is a reasonable score considering that no features have been engineered yet. Looking at the feature importances,
```python
importances = pd.DataFrame(
    {"features": X_train.columns, "importance": lgb_clf.feature_importances_}
)
importances.sort_values("importance", ascending=False)[:20]
```
Looking at this, the model depends almost entirely on the win odds; in other words, it has become a model that simply bets on low odds, so I want to improve it by engineering features and so on.
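One way to check how much the model leans on the odds (a sketch I am adding here, not part of the original article, and assuming the win-odds column is named 単勝 as above) is to retrain the same classifier with that column removed and compare the AUC; a large drop would support the observation above.

```python
# Sketch: retrain without the win-odds column (単勝) to see how much
# of the test AUC comes from the odds alone.
X_train_no_odds = X_train.drop(["単勝"], axis=1)
X_test_no_odds = X_test.drop(["単勝"], axis=1)

lgb_clf_no_odds = lgb.LGBMClassifier(**params)
lgb_clf_no_odds.fit(X_train_no_odds.values, y_train.values)

y_pred_no_odds = lgb_clf_no_odds.predict_proba(X_test_no_odds.values)[:, 1]
print(roc_auc_score(y_test, y_pred_no_odds))
```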
A detailed explanation is also available in the video: Data Analysis and Machine Learning Starting with Horse Racing Prediction.