Recently, more and more services incorporate machine learning, and I sometimes direct their development myself.

However, I cannot deny that my work amounts to blindly using the learning models built by data scientists and ML engineers. To raise my own understanding of machine learning, I (a beginner) have summarized the steps I took until I was able to build a simple learning model myself.

Goal of this article

We will start by building an environment with Python, and then build a classification model using logistic regression, which seems to be the quickest route. As the subject, I will take on a horse racing prediction model, for both hobby and profit.

Horse racing terminology appears throughout, so please look up any terms that are unclear.

Environment

Premise

The environment used is as follows.

Python: 3.7.7
pip: 20.2.2

pipenv installation

I will build the Python execution environment using pipenv.

$ pip install pipenv

Create a virtual environment to run Python.

$ export PIPENV_VENV_IN_PROJECT=true
$ cd <project_dir>
$ pipenv --python 3.7

PIPENV_VENV_IN_PROJECT is a setting to build a virtual environment under the project directory (./.venv/).

Library installation

Here, we will install the minimum required libraries.

$ pipenv install pandas
$ pipenv install scikit-learn
$ pipenv install matplotlib
$ pipenv install jupyter

After installation, Pipfile and Pipfile.lock in the current directory are updated. These 4 libraries are essential, so just install them without thinking too hard about it.

Library	Use
pandas	Data storage and preprocessing (cleansing, integration, transformation, etc.)
scikit-learn (imported as sklearn)	Training and prediction using various machine learning algorithms
matplotlib	Data visualization by drawing graphs
jupyter	Interactive programming in the browser

How to start jupyter notebook

$ cd <project_dir>
$ pipenv run jupyter notebook
...
    To access the notebook, open this file in a browser:
        file:///Users/katayamk/Library/Jupyter/runtime/nbserver-4261-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=f809cb2bcb716ba5726912d43738dd51992d3d7f20942d71
     or http://127.0.0.1:8888/?token=f809cb2bcb716ba5726912d43738dd51992d3d7f20942d71

Open the localhost URL printed in the terminal to use Jupyter Notebook on the local server.

This completes the environment construction.

Model building

There are various types of machine learning, such as supervised learning, unsupervised learning, reinforcement learning, and deep learning. This time, as mentioned at the beginning, the aim is to be able to create a simple learning model, so we will build a classification model with supervised learning.

Machine learning workflow

The AWS article "What is the workflow of machine learning? Explaining AWS machine learning services with Grareco" is easy to understand, so I recommend referring to it. The workflow can be roughly summarized into the five steps below, so we will build the learning model in this order.

1. Data acquisition

To build a horse racing prediction model, we first need past horse racing data. Scraping horse racing information sites on the Internet is one option, but with future operation in mind, we will purchase the official JRA data. Data source: JRA-VAN Data Lab

You can write your own program to fetch the data, but you can also use existing free horse racing software to export the data to files. (Since this is not the main topic, I will omit the details.)

This time, I got the following two types of data files. The target period of the data is 5 years from 2015 to 2019.

File name	Type of data	Data description
syutsuba_data.csv	Race table data	Race card data listing the racehorses entered in each race
seiseki_data.csv	Grade data	Result data recording the finishing order and other details of the races that were held

2. Data preprocessing

What is data preprocessing?

This is the most important step in machine learning. Perform the following kinds of processing according to the acquired data.

Data cleansing

Remove noisy data, or fill in missing values with substitute values.

Data integration

The data required for training is rarely gathered in one place from the start, so distributed data is integrated to produce a consistent dataset.

Data conversion

Convert the data into a specified format to improve the quality of the model. Examples include standardizing numerical data so that it has mean 0 and standard deviation 1 (or scaling it into a range such as -1 to 1), and converting categorical data, where a value is either dog or cat, into dummy variables so that it becomes numerical.
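The two conversions mentioned above can be sketched on a tiny made-up table. This is only an illustration: the column names `weight` and `animal` and all values are assumptions and have nothing to do with the horse racing data.

```python
import pandas as pd

# Toy data (hypothetical values, not from the horse racing dataset)
df_toy = pd.DataFrame({
    'weight': [2.0, 4.0, 6.0, 8.0],
    'animal': ['dog', 'cat', 'dog', 'cat'],
})

# Standardization: subtract the mean and divide by the (population) standard deviation
mean = df_toy['weight'].mean()
std = df_toy['weight'].std(ddof=0)
df_toy['weight_std'] = (df_toy['weight'] - mean) / std

# Dummy variables: the 'animal' column is replaced by one flag column per category
df_toy = pd.get_dummies(df_toy, columns=['animal'])
dummy_cols = [c for c in df_toy.columns if c.startswith('animal_')]

print(df_toy)
```

After standardization the column has mean 0 and standard deviation 1, but individual values can still fall outside the range -1 to 1.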

Preprocessing of horse racing data

From here, we will actually implement the preprocessing of the horse racing data. With the Jupyter Notebook started earlier, you can program while checking the state of the data interactively.

First, we load the acquired horse racing data into pandas DataFrames. After preprocessing, the data will end up in the following structure.

Data item	Use	Data description
race_index	Index	ID that identifies the race
This prize	Explanatory variable	Total prize money the racehorse has earned
Jockey name	Explanatory variable	Jockey name, converted into dummy variables
Within 3	Objective variable	1 if the racehorse finished within 3rd place, 0 if it finished 4th or lower

This time, we will use the total amount of prize money that each horse has won so far as a feature to measure the ability of the racehorse. We also adopted the jockey name, considering that there is a big difference depending on the skill of the jockey. Let's try to see how accurate the prediction can be with these two explanatory variables alone.

`build.ipynb`


import pandas as pd

# Race table data (entries for each race)
syutsuba_path = './data/sample/syutsuba_data.csv'
df_syutsuba = pd.read_csv(syutsuba_path, encoding='shift-jis')
df_syutsuba = df_syutsuba[['Race ID', 'This prize', 'Jockey name']]

# Grade data (race results)
seiseki_path = './data/sample/seiseki_data.csv'
df_seiseki = pd.read_csv(seiseki_path, encoding='shift-jis')
df_seiseki = df_seiseki[['Race ID', 'Confirmed order of arrival']]

In the DataFrames, the data is organized as follows.

[Screenshots: contents of the two DataFrames]

Reference) Race ID data format

Subscript (range)	Data length	Item description
0〜3	4 bytes	Year
4〜5	2 bytes	Month
6〜7	2 bytes	Day
8〜9	2 bytes	Racetrack code
10〜11	2 bytes	Meeting number
12〜13	2 bytes	Day of the meeting
14〜15	2 bytes	Race number
16〜17	2 bytes	Horse number
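To make the layout above concrete, here is a small sketch that slices a Race ID string into its fields. The field names and the sample ID are made up for illustration; only the character positions come from the table.

```python
# (name, start, end) with end exclusive, following the subscript ranges in the table
RACE_ID_FIELDS = [
    ('year', 0, 4), ('month', 4, 6), ('day', 6, 8), ('racetrack_code', 8, 10),
    ('meeting', 10, 12), ('meeting_day', 12, 14),
    ('race_number', 14, 16), ('horse_number', 16, 18),
]

def parse_race_id(race_id):
    """Split an 18-character Race ID into a dict of string fields."""
    race_id = str(race_id)
    if len(race_id) != 18:
        raise ValueError('expected 18 characters, got %d' % len(race_id))
    return {name: race_id[start:end] for name, start, end in RACE_ID_FIELDS}

sample_id = '201901050601011103'  # hypothetical ID, not taken from the real data
fields = parse_race_id(sample_id)

# The first 16 characters identify the race itself, without the horse number
race_key = sample_id[0:16]

print(fields)
print(race_key)
```

Horses that run in the same race share the same first 16 characters, which is why that prefix can serve as a race-level key.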

Next, we will integrate the acquired data and perform data cleansing and conversion.

`build.ipynb`


# Merge race table data and grade data
df = pd.merge(df_syutsuba, df_seiseki, on='Race ID')

# Remove records with missing values
df.dropna(how='any', inplace=True)

# Add a column indicating whether the horse finished within 3rd place
f_ranking = lambda x: 1 if x in [1, 2, 3] else 0
df['Within 3'] = df['Confirmed order of arrival'].map(f_ranking)

# Generate dummy variables
df = pd.get_dummies(df, columns=['Jockey name'])

# Set the index (the first 16 characters identify the race, excluding the horse number)
df['race_index'] = df['Race ID'].astype(str).str[0:16]
df.set_index('race_index', inplace=True)

# Delete unnecessary columns
df.drop(['Race ID', 'Confirmed order of arrival'], axis=1, inplace=True)

If you check the DataFrame, you can see that the column converted into dummy variables has been replaced by one new column per category, each holding a 0 or 1 flag.

[Screenshot: DataFrame after dummy variable conversion]

Making the jockey name a dummy variable has increased the number of columns to 295. Note that converting a column with many categories into dummy variables can cause overfitting.
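To see the whole preprocessing flow without the real CSV files, the same steps can be run on a few made-up rows. The column names match the ones used in this article, but the Race IDs, prize values, and jockey names are invented, and `isin` is used here in place of the lambda purely for brevity.

```python
import pandas as pd

ids = ['201901050601011101', '201901050601011102',
       '201901050601011103', '201901050601011104']

# Made-up race table and result rows for a single race with four horses
toy_syutsuba = pd.DataFrame({
    'Race ID': ids,
    'This prize': [1200, 400, 3000, 0],
    'Jockey name': ['A', 'B', 'A', 'C'],
})
toy_seiseki = pd.DataFrame({
    'Race ID': ids,
    'Confirmed order of arrival': [2, 5, 1, 4],
})

# Integrate, label, convert to dummy variables, and index by race
toy = pd.merge(toy_syutsuba, toy_seiseki, on='Race ID')
toy['Within 3'] = toy['Confirmed order of arrival'].isin([1, 2, 3]).astype(int)
toy = pd.get_dummies(toy, columns=['Jockey name'])
toy['race_index'] = toy['Race ID'].str[0:16]
toy = toy.set_index('race_index').drop(columns=['Race ID', 'Confirmed order of arrival'])

print(toy)
```

Three distinct jockeys produce three dummy columns here; with hundreds of jockeys the same step produces hundreds of columns.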

3. Model learning

Next, let's train the model. First, the data is split into explanatory variables and the objective variable, and each is divided into training data and evaluation data.

`build.ipynb`


from sklearn.model_selection import train_test_split

#Store explanatory variables in dataX
dataX = df.drop(['Within 3'], axis=1)

#Store objective variable in dataY
dataY = df['Within 3']

# Split the data (training data 0.8, evaluation data 0.2)
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.2, stratify=dataY)

In short, it is divided into the following four types of data.

Variable name	type of data	Use
X_train	Explanatory variable	Training data
X_test	Explanatory variable	Evaluation data
y_train	Objective variable	Training data
y_test	Objective variable	Evaluation data

This time, train_test_split is used to split the training and evaluation data easily. For data with a time-series nature such as horse racing, however, accuracy would likely improve if the data were split so that the order is **(past) -> training data -> evaluation data -> (present)**.
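A time-ordered split of that kind could look roughly like the sketch below. It assumes the frame is indexed by race_index with the year in the first four characters; the toy rows and the cutoff year are made up.

```python
import pandas as pd

# Made-up frame indexed by race_index (the first 4 characters are the year)
toy = pd.DataFrame(
    {'This prize': [100, 200, 300, 400, 500, 600],
     'Within 3': [0, 1, 0, 1, 0, 1]},
    index=pd.Index(['2017010506010111', '2017060506010105', '2018010506010111',
                    '2018060506010102', '2019010506010111', '2019060506010103'],
                   name='race_index'),
)

# Races before the cutoff year become training data, later races evaluation data
cutoff_year = '2019'  # assumed cutoff, chosen only for this example
is_train = toy.index.str[:4] < cutoff_year

X_train_ts = toy.loc[is_train].drop(columns=['Within 3'])
y_train_ts = toy.loc[is_train, 'Within 3']
X_test_ts = toy.loc[~is_train].drop(columns=['Within 3'])
y_test_ts = toy.loc[~is_train, 'Within 3']

print(len(X_train_ts), len(X_test_ts))
```

Unlike a random split, this guarantees that every evaluation race took place after every training race.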

Next, we will train the model on the prepared data. The basic algorithms are included in sklearn, and this time we will use **logistic regression**.

`build.ipynb`


from sklearn.linear_model import LogisticRegression

#Create a classifier (logistic regression)
clf = LogisticRegression()

#Learning
clf.fit(X_train, y_train)

That's it. It's very easy.
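For intuition about what the fitted classifier does, a logistic regression model turns a weighted sum of the explanatory variables into a probability with the sigmoid function and predicts 1 when that probability is at least 0.5. The sketch below shows this with plain Python; the weights, bias, and input values are hypothetical and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(x, weights, bias):
    """Probability of class 1 for a single row of explanatory variables."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical coefficients for two explanatory variables
weights = [0.8, -0.5]
bias = -1.0

p = predict_proba_one([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0

print(p, label)
```

In scikit-learn the corresponding values are available as clf.coef_ and clf.intercept_ after fitting, and clf.predict_proba returns the probabilities.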

4. Model evaluation

First, let's predict the evaluation data and check the correct answer rate based on the result.

`build.ipynb`


#Forecast
y_pred = clf.predict(X_test)

#Display correct answer rate
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7874043003746538

The correct answer rate (accuracy) is 0.7874043003746538, which means that about 79% of the predictions are correct. At first glance, you might be tempted to say, "Awesome! This is really profitable!", but be careful with accuracy_score. Try running the following code.

`build.ipynb`


#Show confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[  339 10031]
 [  410 38332]]

This two-dimensional array, called the confusion matrix, represents the following:

	Prediction: Within 3rd place	Prediction: 4th or lower
Actual: Within 3rd place	339	10031
Actual: 4th or lower	410	38332

The correct answer rate is the sum of the two cells where the prediction matches the actual result, **Prediction: Within 3rd place x Actual: Within 3rd place** and **Prediction: 4th or lower x Actual: 4th or lower**, divided by the total number of cases.

**Correct answer rate**: 0.787 = (339 + 38332) / (339 + 38332 + 410 + 10031)

From this result, it can be seen that the number of cases predicted to be within 3rd place is too small in the first place, and the correct answer rate is boosted by predicting that most of them are 4th place or less.
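As a quick check of this point, the sketch below recomputes the correct answer rate from the four counts in the confusion matrix and compares it with a hypothetical model that predicts "4th or lower" for every horse. The variable names are chosen for illustration.

```python
# Counts from the confusion matrix above
tp, fn = 339, 10031   # actual: within 3rd place (predicted within 3rd / predicted 4th or lower)
fp, tn = 410, 38332   # actual: 4th or lower   (predicted within 3rd / predicted 4th or lower)

total = tp + fn + fp + tn

# Correct answer rate of the trained model
accuracy = (tp + tn) / total

# A model that always predicts "4th or lower" is right for every actual "4th or lower" horse
baseline_accuracy = (fp + tn) / total

print(accuracy)
print(baseline_accuracy)
```

The always-negative baseline scores slightly higher than the trained model, which is why the correct answer rate alone says little on imbalanced data.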

Now that you know that you need to be careful about the accuracy rate, what should be used to evaluate the accuracy of the model? One way to utilize this confusion matrix is to check the F value.

What is F value?

It is the harmonic mean of the following two values.

1. The percentage of horses predicted to finish within 3rd place that actually did so (called precision)
2. The percentage of horses that actually finished within 3rd place that were predicted correctly (called recall)

**Precision**: 0.45 = 339 / (339 + 410)
**Recall**: 0.03 = 339 / (339 + 10031)
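The same numbers can be worked through by hand. This sketch computes precision, recall, and the F value (their harmonic mean) from the counts in the confusion matrix; the variable names are chosen for illustration.

```python
# Counts from the confusion matrix above
tp = 339     # predicted within 3rd place, actually within 3rd place
fn = 10031   # predicted 4th or lower, actually within 3rd place
fp = 410     # predicted within 3rd place, actually 4th or lower

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F value: harmonic mean of precision and recall
f_value = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), f_value)
```

Because the harmonic mean is pulled toward the smaller of the two values, the very low recall drags the F value down even though precision is moderate.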

`build.ipynb`


#Display F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.06097670653835776

When I checked the F value this time, it was 0.06097670653835776. If predictions were assigned to 0 and 1 at random on data with balanced classes, the F value would converge to about 0.5, and even with the class imbalance here random guessing would score around 0.3, so you can see that 0.06 is an extremely low value.
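As a rough reference point for what random guessing would score, the sketch below estimates the F value of a coin-flip classifier from expected counts. The function name is hypothetical, and plugging in expected counts is an approximation rather than an exact expectation of the F value.

```python
def expected_f_random(n_pos, n_neg, p_pred=0.5):
    """Approximate F value when each case is predicted positive with probability p_pred."""
    tp = n_pos * p_pred          # expected true positives
    fp = n_neg * p_pred          # expected false positives
    fn = n_pos * (1 - p_pred)    # expected false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Balanced classes: random guessing gives about 0.5
f_balanced = expected_f_random(100, 100)

# Class sizes of the evaluation data in this article (339 + 10031 and 410 + 38332)
f_article = expected_f_random(10370, 38742)

print(f_balanced, f_article)
```

On the evaluation data in this article the coin-flip reference is roughly 0.3 rather than 0.5, and 0.06 is far below either figure.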

Correct data imbalance

`build.ipynb`


print(df['Within 3'].value_counts())
0    193711
1     51848

The ratio of the objective variable between within 3rd place and 4th or lower is roughly 1:4, so the data is imbalanced. Let's correct this a little.

First, install the following libraries additionally.

$ pipenv install imbalanced-learn

Undersample the training data so that the ratio of within 3rd place to 4th or lower becomes 1:2. Undersampling means randomly reducing the number of records in the larger class to bring it closer to the size of the smaller class.

`build.ipynb`


from imblearn.under_sampling import RandomUnderSampler

# Undersample the training data (4th or lower : within 3rd place = 2 : 1)
f_count = y_train.value_counts()[1] * 2
t_count = y_train.value_counts()[1]
rus = RandomUnderSampler(sampling_strategy={0:f_count, 1:t_count})
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
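To show what random undersampling does under the hood, here is a minimal sketch in plain Python. It is not how imbalanced-learn is implemented; the function name, the toy labels, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import Counter

def random_undersample(labels, majority_per_minority=2, seed=0):
    """Return row positions: every label-1 row plus a random subset of label-0 rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), len(minority) * majority_per_minority)
    return sorted(minority + rng.sample(majority, n_keep))

# Toy labels with a 1:4 ratio of within 3rd place (1) to 4th or lower (0)
labels = [1] * 20 + [0] * 80

kept = random_undersample(labels)
counts = Counter(labels[i] for i in kept)

print(counts)
```

Only the training data is resampled; the evaluation data keeps its original ratio so that the scores reflect real races.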

Now that we've corrected some of the data imbalances, we'll train and evaluate the model again.

`build.ipynb`


#Learning
clf.fit(X_train_rus, y_train_rus)

#Forecast
y_pred = clf.predict(X_test)

#Display correct answer rate
print(accuracy_score(y_test, y_pred))
0.7767958950969214

#Show confusion matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 1111  9259]
 [ 1703 37039]]

#Display F value
print(f1_score(y_test, y_pred))
0.1685376213592233

The F value is 0.1685376213592233, which is a considerable improvement.

Standardize explanatory variables

There are two explanatory variables, the prize money and the jockey name, but the jockey name has a value of 0 or 1 due to dummy variable conversion, while the prize money has the following distribution of features.

`build.ipynb`


import matplotlib.pyplot as plt
plt.xlabel('prize')
plt.ylabel('freq')
plt.hist(dataX['This prize'], range=(0, 20000), bins=20)

[Screenshot: histogram of prize money]

Since the scales of the values are so different, it is highly likely that the prize money and the jockey name cannot be compared on an equal footing, so each feature needs to be scaled to the same range. One method for this is standardization.

`build.ipynb`


from sklearn.preprocessing import StandardScaler

#Standardize explanatory variables
sc = StandardScaler()
X_train_rus_std = pd.DataFrame(sc.fit_transform(X_train_rus), columns=X_train_rus.columns)
X_test_std = pd.DataFrame(sc.transform(X_test), columns=X_test.columns)

[Screenshot: DataFrame after standardization]

Standardization has converted the values of all explanatory variables to a common scale, so we train and evaluate the model again.
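One detail worth spelling out is that the mean and standard deviation are computed from the training data only and then reused for the evaluation data, which is what calling fit_transform on one and transform on the other does. The sketch below mimics this with plain Python; the prize values are made up.

```python
import statistics

# Hypothetical prize money values
train_prize = [100.0, 300.0, 500.0, 700.0, 900.0]
test_prize = [500.0, 1300.0]

# "fit": learn the mean and (population) standard deviation from the training data only
mean = statistics.mean(train_prize)
std = statistics.pstdev(train_prize)

# "transform": apply the same mean and standard deviation to both sets
train_std = [(x - mean) / std for x in train_prize]
test_std = [(x - mean) / std for x in test_prize]

print(train_std)
print(test_std)
```

The standardized training values have mean 0 and standard deviation 1, while the evaluation values are simply expressed on the same scale and may fall well outside that range.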

`build.ipynb`


#Learning
clf.fit(X_train_rus_std, y_train_rus)

#Forecast
y_pred = clf.predict(X_test_std)

#Display correct answer rate
print(accuracy_score(y_test, y_pred))
0.7777732529727969

#Show confusion matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 2510  7860]
 [ 3054 35688]]

#Display F value
print(f1_score(y_test, y_pred))
0.3150495795155014

The F value became 0.3150495795155014, a further improvement over the previous result, reaching the 0.3 level. In addition, the precision is 0.45 and the recall is 0.24, which is a reasonable prediction result for horse racing.

Check the weight of the regression coefficient

Finally, check the regression coefficient to see which value of the explanatory variable has a strong influence on the horse racing prediction.

`build.ipynb`


pd.options.display.max_rows = X_train_rus_std.columns.size
print(pd.Series(clf.coef_[0], index=X_train_rus_std.columns).sort_values())

Jockey name_Lower principle -0.092015
Jockey name_Seiji Sakai -0.088886
Jockey name_Teruo Eda -0.081689
Jockey name_Hayabusa Mitsuya -0.078886
Jockey name_Toshiya Yamamoto -0.075083
Jockey name_Norifumi Mikamoto -0.073361
Jockey name_Keita Ban -0.072113
Jockey name_Junji Iwabe -0.070202
Jockey name_Bushizawa Tomo -0.069766
Jockey name_Mitsuyuki Miyazaki -0.068009
... (omitted)
Jockey name_Yasunari Iwata 0.065899
Jockey name_Hironobu Tanabe 0.072882
Jockey name_Moreira 0.073010
Jockey name_Taketoyo 0.084130
Jockey name_Yuichi Fukunaga 0.107660
Jockey name_Yuga Kawada 0.123749
Jockey name_Keita Tosaki 0.127755
Jockey name_M. Dem 0.129514
Jockey name_Lemaire 0.185976
This prize 0.443854

You can see that the prize money has the strongest positive influence on the prediction of finishing within 3rd place, followed by the leading jockeys.
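One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below does this for three of the coefficients listed above. Since the explanatory variables were standardized, one unit here corresponds to one standard deviation of that variable, and the interpretation assumes the other variables are held fixed.

```python
import math

# A few coefficients copied from the output above
coefs = {
    'This prize': 0.443854,
    'Jockey name_Lemaire': 0.185976,
    'Jockey name_Seiji Sakai': -0.088886,
}

# exp(coefficient) is the factor by which the odds of finishing within 3rd place
# are multiplied when the standardized variable increases by 1
odds_ratios = {name: math.exp(c) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    print(name, round(ratio, 3))
```

A ratio above 1 raises the odds of a top-three finish and a ratio below 1 lowers them.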

5. Model operation

With the work so far, we have managed to build the model. Next, let's consider actual operation. Horse races are held every week, and I hope to get rich by predicting which horses will finish in the top three of each race.

So, do we run the whole machine learning workflow from the beginning every week? "1. Data acquisition" has to be performed every time to obtain the latest race table data, but "2. Data preprocessing" and "3. Model learning" do not need to be repeated every time. You should be able to reuse the model you built once (although the model does need to be updated periodically). So let's put that into practice.

`build.ipynb`


import pickle

# Serialize the trained model to a file
filename = 'model_sample.pickle'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

In this way, by using a library called pickle, you can serialize the built model and save it in a file.

And here is how to restore the saved model.

`restore.ipynb`


import pickle

# Restore the saved model
filename = 'model_sample.pickle'
with open(filename, 'rb') as f:
    clf = pickle.load(f)

# Predict
# X_new is a placeholder: explanatory variable data for the races to be predicted
y_pred = clf.predict(X_new)

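You can easily restore the model and use it to predict future races. This enables efficient operation without re-running the preprocessing of past data or the model training each time. Note that the data for the races to be predicted must still be converted into the same columns and scaling that were used for training.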
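Because the model was trained on dummy-encoded and standardized data, it may be convenient to save the fitted scaler and the training column list together with the model. The following is only a sketch under that assumption: the bundle layout, the file name, and the toy data are hypothetical and not part of the original notebooks.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy training data (made-up values) with one numeric column and one dummy column
X_train = pd.DataFrame({
    'This prize': [100.0, 2000.0, 300.0, 5000.0, 150.0, 4200.0],
    'Jockey name_A': [0, 1, 0, 1, 0, 1],
})
y_train = pd.Series([0, 1, 0, 1, 0, 1])

sc = StandardScaler()
clf = LogisticRegression()
clf.fit(sc.fit_transform(X_train), y_train)

# Save everything needed at prediction time in a single file
bundle = {'model': clf, 'scaler': sc, 'columns': list(X_train.columns)}
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pickle')
with open(path, 'wb') as f:
    pickle.dump(bundle, f)

# Later: restore the bundle and predict a new race entry
with open(path, 'rb') as f:
    restored = pickle.load(f)

# A jockey unseen during training ('B') produces a column the model does not know;
# reindex drops it and fills the missing training columns with 0
X_new = pd.DataFrame({'This prize': [4500.0], 'Jockey name_B': [1]})
X_new = X_new.reindex(columns=restored['columns'], fill_value=0)
y_pred = restored['model'].predict(restored['scaler'].transform(X_new))

print(y_pred)
```

Keeping the three objects in one file avoids predicting with a model and a scaler that came from different training runs.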

In closing

With the above, we have worked through the whole process step by step, from environment construction to model building. It is a clumsy explanation by a beginner, but I hope it will be helpful for people in similar circumstances.

Next time, I would like to try another algorithm and compare it with the model created this time, and to go one step further by building a mechanism that examines not only prediction accuracy but also what the actual betting balance would be.

Machine learning beginners tried to make a horse racing prediction model with python
