Predict horse racing with machine learning and aim for a recovery rate of 100%.
Using all of the 2019 race result data obtained in the previous article, we predict which horses will finish in the top 3 with LightGBM.
First, the preprocessing:
```python
import datetime

import pandas as pd


def preprocessing(results):
    df = results.copy()
    # Remove rows whose finishing order (着順) contains non-numeric
    # strings (scratched or disqualified horses), then cast to int
    df = df[~(df["着順"].astype(str).str.contains(r"\D"))]
    df["着順"] = df["着順"].astype(int)
    # Split sex/age (性齢) into sex and age
    df["sex"] = df["性齢"].map(lambda x: str(x)[0])
    df["age"] = df["性齢"].map(lambda x: str(x)[1:]).astype(int)
    # Split horse weight (馬体重) into weight and weight change
    df["weight"] = df["馬体重"].str.split("(", expand=True)[0].astype(int)
    df["weight_change"] = df["馬体重"].str.split("(", expand=True)[1].str[:-1].astype(int)
    # Convert win odds (単勝) to float
    df["単勝"] = df["単勝"].astype(float)
    # Drop columns that are not used
    df.drop(["タイム", "着差", "調教師", "性齢", "馬体重"], axis=1, inplace=True)
    # Parse the Japanese-formatted date string into a datetime
    df["date"] = pd.to_datetime(df["date"], format="%Y年%m月%d日")
    return df


results_p = preprocessing(results)
```
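As a quick sanity check (not part of the original steps), it can help to confirm that the remaining columns came out with the expected types:

```python
# Illustrative check: most remaining columns should now be numeric,
# with a proper datetime column for the race date.
print(results_p.dtypes)
print(results_p.head())
```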
I want to split this into training data and test data, but train_test_split cannot be used here, because the training data must be older than the test data. Instead, we create a function that splits the data chronologically using the 'date' column, which is now a datetime type.
```python
def split_data(df, test_size=0.3):
    # Sort the race ids (index) by date so that every race in the
    # test set is later than every race in the training set
    sorted_id_list = df.sort_values("date").index.unique()
    train_id_list = sorted_id_list[: round(len(sorted_id_list) * (1 - test_size))]
    test_id_list = sorted_id_list[round(len(sorted_id_list) * (1 - test_size)) :]
    train = df.loc[train_id_list].drop(["date"], axis=1)
    test = df.loc[test_id_list].drop(["date"], axis=1)
    return train, test
```
Categorical variables are converted into dummy variables, but the horse name has far too many categories, so it is dropped this time.
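To see why, a quick hypothetical check (run before dropping the column, and not part of the original article) illustrates the problem: every distinct horse name would become its own dummy column.

```python
# Hypothetical check, assuming the horse-name column is 馬名 as above:
# each unique name would turn into one dummy column after pd.get_dummies.
print(results_p["馬名"].nunique())
```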
```python
results_p.drop(["馬名"], axis=1, inplace=True)
results_d = pd.get_dummies(results_p)
```
If the finishing order is 3rd or better, label it 1; otherwise label it 0, and treat this as the objective variable.
results_d["rank"] = results_d["Order of arrival"].map(lambda x: 1 if x < 4 else 0)
results_d.drop(['Order of arrival'], axis=1, inplace=True)
Using the split_data function created above, split the data into training and test sets and train a model with LightGBM.
```python
import lightgbm as lgb

train, test = split_data(results_d, 0.3)
X_train = train.drop(["rank"], axis=1)
y_train = train["rank"]
X_test = test.drop(["rank"], axis=1)
y_test = test["rank"]

params = {
    "num_leaves": 4,
    "n_estimators": 80,
    # Compensate for the imbalance: only about 3 horses per race are labeled 1
    "class_weight": "balanced",
    "random_state": 100,
}

lgb_clf = lgb.LGBMClassifier(**params)
lgb_clf.fit(X_train.values, y_train.values)
```
Evaluate with the AUC score.
```python
from sklearn.metrics import roc_auc_score

y_pred_train = lgb_clf.predict_proba(X_train)[:, 1]
y_pred = lgb_clf.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_train, y_pred_train))
print(roc_auc_score(y_test, y_pred))
```
The results were: training data: 0.819, test data: 0.812. I think that is a reasonable score considering that no features have been engineered yet. Looking at the feature importances,
```python
importances = pd.DataFrame(
    {"features": X_train.columns, "importance": lgb_clf.feature_importances_}
)
importances.sort_values("importance", ascending=False)[:20]
```
Looking at this, the model depends almost entirely on the win odds; in other words, it has become a model that simply bets on low odds, so I want to improve it by engineering features and so on.
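One way to check how much the model leans on the odds (a sketch I am adding here, not part of the original article, and assuming the win-odds column is named 単勝 as above) is to retrain the same classifier with that column removed and compare the AUC; a large drop would support the observation above.

```python
# Sketch: retrain without the win-odds column (単勝) to see how much
# of the test AUC comes from the odds alone.
X_train_no_odds = X_train.drop(["単勝"], axis=1)
X_test_no_odds = X_test.drop(["単勝"], axis=1)

lgb_clf_no_odds = lgb.LGBMClassifier(**params)
lgb_clf_no_odds.fit(X_train_no_odds.values, y_train.values)

y_pred_no_odds = lgb_clf_no_odds.predict_proba(X_test_no_odds.values)[:, 1]
print(roc_auc_score(y_test, y_pred_no_odds))
```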
A detailed explanation is also available in the video: Data Analysis and Machine Learning Starting with Horse Racing Prediction.