Recently, more and more services are being developed that incorporate machine learning, and I sometimes direct such development myself.
However, I cannot deny that I have simply been using, without really understanding them, the learning models built by data scientists and ML engineers. To raise my own (beginner-level) knowledge of machine learning, I have summarized the process I went through until I was able to build a simple learning model myself.
We will start by building an environment with Python, and then build a classification model with logistic regression, which seems to be the quickest route. As the subject, I will take on a horse racing prediction model, for both hobby and profit.
The environment implemented is as follows.
I will build the execution environment of python using pipenv.
$ pip install pipenv
Build a virtual environment to run python.
$ export PIPENV_VENV_IN_PROJECT=true
$ cd <project_dir>
$ pipenv --python 3.7
PIPENV_VENV_IN_PROJECT is a setting that creates the virtual environment under the project directory (./.venv/).
Here, we will install the minimum required libraries.
$ pipenv install pandas
$ pipenv install scikit-learn
$ pipenv install matplotlib
$ pipenv install jupyter
After installation, Pipfile and Pipfile.lock in the current directory will have been updated. These four libraries are essential, so just go ahead and install them.
Library | Use |
---|---|
pandas | Data storage and preprocessing (cleansing, integration, transformation, etc.) |
scikit-learn (imported as sklearn) | Learning and prediction using various machine learning algorithms |
matplotlib | Data visualization by graph drawing |
jupyter | Interactive programming on the browser |
$ cd <project_dir>
$ pipenv run jupyter notebook
...
To access the notebook, open this file in a browser:
file:///Users/katayamk/Library/Jupyter/runtime/nbserver-4261-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=f809cb2bcb716ba5726912d43738dd51992d3d7f20942d71
or http://127.0.0.1:8888/?token=f809cb2bcb716ba5726912d43738dd51992d3d7f20942d71
By accessing the localhost URL output to the terminal, you will be able to browse the jupyter notebook on the local server.
This completes the environment construction.
There are various types of machine learning, such as supervised learning, unsupervised learning, reinforcement learning, and deep learning. This time, as mentioned at the beginning, the goal is to be able to create a simple learning model, so we will build a classification model with supervised learning.
The following AWS article explains the machine learning workflow clearly, so I recommend referring to it: "What is the workflow of machine learning? Explaining AWS machine learning services with Grareco". Roughly summarized, the flow is data acquisition, data preprocessing, model training, and model evaluation, so we will build the learning model in that order.
To build a horse racing prediction model, we first need past horse racing data. Scraping horse racing information sites on the Internet is one option, but with future operation in mind, we will purchase official JRA data. Data source: JRA-VAN Data Lab
You can create a program to get the data yourself, but you can also use the free horse racing software provided in advance to output the data to a file. (Since it is not the main story, I will omit the details.)
This time, I got the following two types of data files. The target period of the data is 5 years from 2015 to 2019.
File name | Type of data | Data description |
---|---|---|
syutsuba_data.csv | Race table data | Race card data listing the racehorses entered in each race |
seiseki_data.csv | Grade data | Result data recording the finishing order, etc. of the races that were held |
This is the most important step in machine learning. Perform the following processing as the acquired data requires.
Data cleansing: remove noisy data, or fill in missing values with substitute values.
Data integration: the data needed for training is rarely gathered in one place from the start, so data from separate sources is combined into one consistent data set.
Data transformation: convert the data into a specified format to improve the quality of the model. For example, scale numerical data so that it falls in a common range such as -1 to 1, or convert categorical data (such as a choice between dog and cat) into dummy variables so that it becomes numerical.
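As a small illustration of the transformation step, the sketch below scales a numeric column and turns a category column into dummy variables. The toy DataFrame and its column names (`weight`, `pet`) are made up for this example and are not part of the horse racing data. The scaling shown is standardization (mean 0, standard deviation 1), so the values are centered on 0 but not strictly bounded.

```python
import pandas as pd

# Toy data (hypothetical): one numeric column and one category column
toy = pd.DataFrame({
    'weight': [4.0, 6.0, 8.0, 10.0],
    'pet': ['dog', 'cat', 'dog', 'cat'],
})

# Standardization: subtract the mean and divide by the standard deviation,
# so the new column has mean 0 and standard deviation 1
mean = toy['weight'].mean()
std = toy['weight'].std(ddof=0)  # population standard deviation
toy['weight_std'] = (toy['weight'] - mean) / std

# Dummy variables: one 0/1 column per category of 'pet'
toy = pd.get_dummies(toy, columns=['pet'], dtype=int)

print(toy)
```

The category column disappears and is replaced by one flag column per category, which is why a column with many categories can widen the table considerably.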
From here we will implement the preprocessing of the horse racing data. Using the Jupyter notebook we launched, you can program while interactively checking the state of the data.
First, we load the acquired horse racing data into pandas DataFrames. After preprocessing, the data will end up in the following structure.
data item | Use | Data description |
---|---|---|
race_index | index | Identification ID that identifies the race to be held |
This prize | Explanatory variable | Total prize money the racehorse has earned so far |
Jockey name | Explanatory variable | Jockey name, converted to dummy variables |
Within 3 | Objective variable | Finishing order converted to 1 if the horse finished within 3rd place and 0 if 4th or lower |
This time, we will use the total amount of prize money that each horse has won so far as a feature to measure the ability of the racehorse. We also adopted the jockey name, considering that there is a big difference depending on the skill of the jockey. Let's try to see how accurate the prediction can be with these two explanatory variables alone.
build.ipynb
import os
import pandas as pd
#Race table data
syutsuba_path = './data/sample/syutsuba_data.csv'
df_syutsuba = pd.read_csv(syutsuba_path, encoding='shift-jis')
df_syutsuba = df_syutsuba[['Race ID', 'This prize', 'Jockey name']]
#Grade data
seiseki_path = './data/sample/seiseki_data.csv'
df_seiseki = pd.read_csv(seiseki_path, encoding='shift-jis')
df_seiseki = df_seiseki[['Race ID', 'Confirmed order of arrival']]
The Race ID in the DataFrame is an 18-byte string organized as follows.
Index (range) | Data length | Item description |
---|---|---|
0〜3 | 4 bytes | Year |
4〜5 | 2 bytes | Month |
6〜7 | 2 bytes | Day |
8〜9 | 2 bytes | Racetrack code |
10〜11 | 2 bytes | Meeting number |
12〜13 | 2 bytes | Day of the meeting |
14〜15 | 2 bytes | Race number |
16〜17 | 2 bytes | Horse number |
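To make the layout concrete, here is a minimal sketch that slices a Race ID into its fields using the offsets from the table. The function name, the dictionary keys, and the sample ID are all made up for illustration; real IDs come from the JRA-VAN data.

```python
# Field layout of the 18-character Race ID, following the table above:
# (name, start, end) with end exclusive. The key names are chosen for this sketch.
RACE_ID_FIELDS = [
    ('year', 0, 4),
    ('month', 4, 6),
    ('day', 6, 8),
    ('racetrack_code', 8, 10),
    ('meeting_number', 10, 12),
    ('meeting_day', 12, 14),
    ('race_number', 14, 16),
    ('horse_number', 16, 18),
]


def parse_race_id(race_id):
    """Split a Race ID into its fields (values are kept as strings)."""
    race_id = str(race_id)
    if len(race_id) != 18:
        raise ValueError(f'expected 18 characters, got {len(race_id)}')
    return {name: race_id[start:end] for name, start, end in RACE_ID_FIELDS}


# Made-up ID, for illustration only
sample_id = '201912280609081607'
example = parse_race_id(sample_id)
print(example)

# Dropping the last two characters (the horse number) leaves a key
# that identifies the race itself
race_key = sample_id[0:16]
print(race_key)
```

Because the first 16 characters identify the race and the last two identify the horse, every horse in the same race shares the same 16-character prefix.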
Next, we will integrate the acquired data and perform data cleansing and conversion.
build.ipynb
#Merge runner table data and grade data
df = pd.merge(df_syutsuba, df_seiseki, on = 'Race ID')
#Records with missing values are removed
df.dropna(how='any', inplace=True)
#Add a column to see if the order of arrival is within 3
f_ranking = lambda x: 1 if x in [1, 2, 3] else 0
df['Within 3'] = df['Confirmed order of arrival'].map(f_ranking)
#Generate dummy variable
df = pd.get_dummies(df, columns=['Jockey name'])
#Set index (use up to 16th byte to specify race only)
df['race_index'] = df['Race ID'].astype(str).str[0:16]
df.set_index('race_index', inplace=True)
#Delete unnecessary columns
df.drop(['Race ID', 'Confirmed order of arrival'], axis=1, inplace=True)
If you check the DataFrame, you can see that the column converted to dummy variables has been replaced by one new column per category, each holding a 0 or 1 flag. Converting the jockey name to dummy variables has increased the number of columns to 295; note that converting a column with many categories into dummy variables can cause overfitting.
Next, let's train the model. First, the data is split into explanatory variables and the objective variable, and each is divided into training data and evaluation data.
build.ipynb
from sklearn.model_selection import train_test_split
#Store explanatory variables in dataX
dataX = df.drop(['Within 3'], axis=1)
#Store objective variable in dataY
dataY = df['Within 3']
#Split the data (training data 0.8, evaluation data 0.2)
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.2, stratify=dataY)
In short, it is divided into the following four types of data.
Variable name | type of data | Use |
---|---|---|
X_train | Explanatory variable | Training data |
X_test | Explanatory variable | Evaluation data |
y_train | Objective variable | Training data |
y_test | Objective variable | Evaluation data |
This time, train_test_split is used to split the training data and the evaluation data easily. For data with a time-series nature such as horse racing, however, it is probably better to split the data in chronological order: **(past) -> training data -> evaluation data -> (present)**.
Next, we will train on the prepared data. The basic algorithms are included in sklearn, and this time we will use **logistic regression**.
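A chronological split could look roughly like the sketch below. It is not what the article's code does; it assumes the index begins with the race date as YYYYMMDD (as `race_index` does), so that sorting the index as strings also sorts the rows by date. The function name `time_ordered_split` and the toy rows are hypothetical.

```python
import pandas as pd


def time_ordered_split(df, target_col, test_size=0.2):
    """Use the oldest rows as training data and the newest rows as evaluation data.

    Assumes the index starts with the race date as YYYYMMDD, so sorting
    the index as strings puts the rows in chronological order.
    """
    df_sorted = df.sort_index(kind='stable')
    n_test = max(1, round(len(df_sorted) * test_size))
    train = df_sorted.iloc[:-n_test]
    test = df_sorted.iloc[-n_test:]
    return (train.drop(columns=[target_col]), test.drop(columns=[target_col]),
            train[target_col], test[target_col])


# Toy frame with made-up race_index values (date followed by the other ID fields)
toy = pd.DataFrame(
    {'This prize': [100, 900, 400, 50, 700], 'Within 3': [0, 1, 1, 0, 1]},
    index=['2019010506010101', '2015030706010101', '2017052806010101',
           '2016111306010101', '2018102106010101'],
)
X_train, X_test, y_train, y_test = time_ordered_split(toy, 'Within 3')
print(X_train.index.tolist())
print(X_test.index.tolist())
```

Unlike a random split, this never evaluates the model on races that took place before the races it was trained on. A refinement left out here would be to place the cut on a race boundary so that horses from one race are not divided between the two sets.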
build.ipynb
from sklearn.linear_model import LogisticRegression
#Create a classifier (logistic regression)
clf = LogisticRegression()
#Learning
clf.fit(X_train, y_train)
That's it. It's very easy.
First, let's predict the evaluation data and check the correct answer rate based on the result.
build.ipynb
#Forecast
y_pred = clf.predict(X_test)
#Display correct answer rate
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7874043003746538
The correct answer rate (accuracy) is 0.7874043003746538, which means that about 78% of the predictions are correct.
At first glance, you might happily say, "Awesome! This is really going to be profitable!", but be careful with this accuracy_score. Try running the following code.
build.ipynb
#Show confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 339 10031]
[ 410 38332]]
This two-dimensional array, called the confusion matrix, represents the following:
Actual vs. prediction | Prediction: within 3rd place | Prediction: 4th or lower |
---|---|---|
Actual: within 3rd place | 339 | 10031 |
Actual: 4th or lower | 410 | 38332 |
Of these, the correct answer rate is the share of the total taken up by **Prediction: within 3rd place x Actual: within 3rd place** and **Prediction: 4th or lower x Actual: 4th or lower**.
**Correct answer rate**: 0.78 = (339 + 38332) / (339 + 38332 + 410 + 10031)
From this result, you can see that the number of cases predicted to be within 3rd place is far too small to begin with, and that the correct answer rate is inflated because the model predicts 4th place or lower for most horses.
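The arithmetic can be checked with a few lines of Python. The sketch below recomputes the accuracy from the four counts in the confusion matrix and compares it with a trivial baseline that always predicts "4th or lower"; the baseline is added here purely for comparison and is not part of the original notebook.

```python
# Counts taken from the confusion matrix above
tp = 339     # predicted within 3rd place, actually within 3rd place
fn = 10031   # predicted 4th or lower, actually within 3rd place
fp = 410     # predicted within 3rd place, actually 4th or lower
tn = 38332   # predicted 4th or lower, actually 4th or lower

total = tp + fn + fp + tn

# Accuracy of the trained model: correct predictions over all predictions
accuracy = (tp + tn) / total

# Accuracy of a baseline that always answers "4th or lower":
# it is right for every row whose actual result is 4th or lower
baseline_accuracy = (fp + tn) / total

print(f'model accuracy:    {accuracy:.4f}')
print(f'baseline accuracy: {baseline_accuracy:.4f}')
```

The always-negative baseline scores slightly higher than the model, which shows how little an accuracy of about 78% says when the classes are this unevenly distributed.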
Now that we know the correct answer rate must be treated with care, what should we use to evaluate the model? One way to make use of the confusion matrix is to check the F value (F1 score).
It combines 1 and 2 below (it is their harmonic mean).
1. **Precision**: 0.45 = 339 / (339 + 410)
2. **Recall**: 0.03 = 339 / (339 + 10031)
build.ipynb
#Display F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.06097670653835776
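When I checked the F value this time, it was 0.06097670653835776. Note that the F value of random guessing depends on the class balance: assigning 0 and 1 at random with equal probability gives a precision equal to the share of positives and a recall of 0.5, so the F value converges to 0.5 only when the classes are balanced. With about 21% positives in this evaluation data, random guessing would give roughly 0.3, so the 0.06 obtained here is an extremely low value.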
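The F value printed by f1_score can be reproduced by hand from the first confusion matrix, which may help make the definition concrete. The sketch also estimates the F value a coin-flip classifier would reach on the same data in the large-sample limit; that comparison is an addition for illustration.

```python
# Counts from the first confusion matrix
tp, fn, fp, tn = 339, 10031, 410, 38332

# Precision: of the horses predicted within 3rd place, how many really were
precision = tp / (tp + fp)
# Recall: of the horses that really finished within 3rd place, how many were found
recall = tp / (tp + fn)
# F value: harmonic mean of precision and recall
f_value = 2 * precision * recall / (precision + recall)

# Coin-flip classifier (predicts 1 with probability 0.5): its expected precision
# equals the share of positives and its expected recall is 0.5
positive_share = (tp + fn) / (tp + fn + fp + tn)
random_f = 2 * positive_share * 0.5 / (positive_share + 0.5)

print(f'precision: {precision:.2f}')
print(f'recall:    {recall:.2f}')
print(f'F value:   {f_value:.4f}')
print(f'coin-flip F value: {random_f:.2f}')
```

Because the harmonic mean is dominated by the smaller of its two inputs, the very low recall drags the F value down even though the precision is moderate.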
build.ipynb
print(df['Within 3'].value_counts())
0 193711
1 51848
The ratio of within 3rd place to 4th place or lower in the objective variable is roughly 1:4, so the data is imbalanced. Let's correct this a little.
First, install the following libraries additionally.
$ pipenv install imbalanced-learn
Undersample the training data so that the ratio of within 3rd place to 4th place or lower becomes 1:2. Undersampling means randomly reducing the rows of the majority class so that their number comes closer to that of the minority class.
build.ipynb
from imblearn.under_sampling import RandomUnderSampler
#Undersampling training data
f_count = y_train.value_counts()[1] * 2
t_count = y_train.value_counts()[1]
rus = RandomUnderSampler(sampling_strategy={0:f_count, 1:t_count})
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
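To show the idea behind random undersampling without the library, here is a hand-written sketch using NumPy and pandas. It is only an approximation of the concept, not the implementation inside imbalanced-learn; the function name `undersample`, its parameters, and the toy data are made up for this example. Rows are selected by position because the real `race_index` values repeat for every horse in a race.

```python
import numpy as np
import pandas as pd


def undersample(X, y, ratio=2, seed=0):
    """Keep every class-1 row and a random sample of class-0 rows (ratio times as many)."""
    rng = np.random.default_rng(seed)
    y_values = y.to_numpy()
    pos = np.flatnonzero(y_values == 1)   # positions of the minority class
    neg = np.flatnonzero(y_values == 0)   # positions of the majority class
    n_neg = min(len(neg), len(pos) * ratio)
    kept_neg = rng.choice(neg, size=n_neg, replace=False)
    kept = np.sort(np.concatenate([pos, kept_neg]))
    return X.iloc[kept], y.iloc[kept]


# Toy data: 2 rows within 3rd place (1) and 8 rows 4th or lower (0)
X_toy = pd.DataFrame({'This prize': range(10)})
y_toy = pd.Series([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])

X_small, y_small = undersample(X_toy, y_toy)
print(y_small.value_counts())
```

Only the training data should be resampled in this way; the evaluation data keeps its original class ratio so that the scores reflect realistic conditions.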
Now that we've corrected some of the data imbalances, we'll train and evaluate the model again.
build.ipynb
#Learning
clf.fit(X_train_rus, y_train_rus)
#Forecast
y_pred = clf.predict(X_test)
#Display correct answer rate
print(accuracy_score(y_test, y_pred))
0.7767958950969214
#Show confusion matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 1111 9259]
[ 1703 37039]]
#Display F value
print(f1_score(y_test, y_pred))
0.1685376213592233
The F value is 0.1685376213592233, which is a considerable improvement.
There are two explanatory variables, the prize money and the jockey name, but the jockey name has a value of 0 or 1 due to dummy variable conversion, while the prize money has the following distribution of features.
build.ipynb
import matplotlib.pyplot as plt
plt.xlabel('prize')
plt.ylabel('freq')
plt.hist(dataX['This prize'], range=(0, 20000), bins=20)
Since the scales differ so much, the prize money and the jockey name are unlikely to be compared on an equal footing, so each feature needs to be scaled to a comparable range. One way to do this is standardization.
build.ipynb
from sklearn.preprocessing import StandardScaler
#Standardize explanatory variables
sc = StandardScaler()
X_train_rus_std = pd.DataFrame(sc.fit_transform(X_train_rus), columns=X_train_rus.columns)
X_test_std = pd.DataFrame(sc.transform(X_test), columns=X_test.columns)
By standardizing, the values of all explanatory variables have been converted to a comparable scale (mean 0, standard deviation 1), so the model is trained and evaluated again.
build.ipynb
#Learning
clf.fit(X_train_rus_std, y_train_rus)
#Forecast
y_pred = clf.predict(X_test_std)
#Display correct answer rate
print(accuracy_score(y_test, y_pred))
0.7777732529727969
#Show confusion matrix
print(confusion_matrix(y_test, y_pred, labels=[1, 0]))
[[ 2510 7860]
[ 3054 35688]]
#Display F value
print(f1_score(y_test, y_pred))
0.3150495795155014
The F value became 0.3150495795155014, a further improvement on the previous run that brings it to the 0.3 level. In addition, the precision is 0.45 and the recall is 0.24, which is a reasonable prediction result for horse racing.
Finally, check the regression coefficient to see which value of the explanatory variable has a strong influence on the horse racing prediction.
build.ipynb
pd.options.display.max_rows = X_train_rus_std.columns.size
print(pd.Series(clf.coef_[0], index=X_train_rus_std.columns).sort_values())
Jockey name_Lower principle -0.092015
Jockey name_Seiji Sakai -0.088886
Jockey name_Teruo Eda -0.081689
Jockey name_Hayabusa Mitsuya -0.078886
Jockey name_Toshiya Yamamoto -0.075083
Jockey name_Norifumi Mikamoto -0.073361
Jockey name_Keita Ban -0.072113
Jockey name_Junji Iwabe -0.070202
Jockey name_Bushizawa Tomo -0.069766
Jockey name_Mitsuyuki Miyazaki -0.068009
...(abridgement)
Jockey name_Yasunari Iwata 0.065899
Jockey name_Hironobu Tanabe 0.072882
Jockey name_Moreira 0.073010
Jockey name_Taketoyo 0.084130
Jockey name_Yuichi Fukunaga 0.107660
Jockey name_Yuga Kawada 0.123749
Jockey name_Keita Tosaki 0.127755
Jockey name_M. Dem 0.129514
Jockey name_Lemaire 0.185976
This prize 0.443854
You can see that the prize money has the strongest positive influence on the prediction of a finish within 3rd place, followed by the leading jockeys.
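One way to read these numbers: in logistic regression, a coefficient is the change in the log-odds of the positive class for a one-unit increase in the feature, so its exponential is the factor by which the odds are multiplied. Because the features were standardized, one unit here means one standard deviation. The sketch below applies this to three coefficients copied from the output; the helper name `odds_ratio` is made up.

```python
import math

# A few coefficients copied from the output above
coefficients = {
    'This prize': 0.443854,
    'Jockey name_Lemaire': 0.185976,
    'Jockey name_Lower principle': -0.092015,
}


def odds_ratio(coef):
    """Factor by which the odds of a top-three finish change per +1 standardized unit."""
    return math.exp(coef)


for name, coef in coefficients.items():
    print(f'{name}: coefficient {coef:+.6f}, odds ratio {odds_ratio(coef):.3f}')
```

A ratio above 1 raises the odds of a finish within 3rd place and a ratio below 1 lowers them. For the standardized dummy columns, one unit does not correspond to "this jockey rides or not", so the jockey values are better read as a ranking than as exact effect sizes.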
With the work so far, we have managed to build the model. Next, let's consider actual operation. Horse races are held every week, and I hope to get rich by predicting which horses will finish in the top three of each race.
So, do we run the machine learning workflow from the beginning every week? "1. Data acquisition" has to be performed every time to get the latest race card data, but "2. Data preprocessing" of the training data and "3. Model training" do not need to be repeated every time; you should be able to reuse the model once it has been built (although regular model updates are required). So let's put that into practice.
build.ipynb
import pickle
filename = 'model_sample.pickle'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)
In this way, by using a library called pickle, you can serialize the built model and save it in a file.
And here is how to restore the saved model.
restore.ipynb
import pickle
filename = 'model_sample.pickle'
with open(filename, 'rb') as f:
    clf = pickle.load(f)
#Forecast
#X_new (placeholder name): explanatory variable data for the races to be predicted
y_pred = clf.predict(X_new)
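You can easily restore the model and use it to predict future races. This makes operation efficient because the model does not have to be retrained each time. Note that the data for the races to be predicted still has to be converted in the same way as the training data (the same dummy variable columns and the same standardization) before it is passed to predict.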
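Since the final model was trained on standardized data, one possible approach is to save the fitted scaler and the column order together with the classifier. The sketch below is an assumption about how this could be done, not the article's code: the tiny training set, the jockey columns `Jockey name_A` and `Jockey name_B`, and the `bundle` dictionary are all made up, and `pickle.dumps` / `pickle.loads` are used in memory only to keep the example self-contained.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set: one numeric feature and two dummy columns
columns = ['This prize', 'Jockey name_A', 'Jockey name_B']
X_train = np.array([[100, 1, 0], [5000, 0, 1], [200, 1, 0],
                    [8000, 0, 1], [300, 0, 1], [9000, 1, 0]], dtype=float)
y_train = np.array([0, 1, 0, 1, 0, 1])

# Fit the scaler and the classifier as in the article
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Serialize everything the prediction depends on, not only the classifier
bundle = {'model': model, 'scaler': scaler, 'columns': columns}
payload = pickle.dumps(bundle)

# Later (for example in restore.ipynb): restore the bundle and predict on raw data
restored = pickle.loads(payload)
X_new = np.array([[7000, 1, 0]], dtype=float)  # same column order as `columns`
y_pred = restored['model'].predict(restored['scaler'].transform(X_new))
print(y_pred)
```

Keeping the column list in the bundle makes it possible to reorder or fill the dummy columns of new data so that they match what the model saw during training. As with any pickle file, only load files that you created yourself.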
With the above, we have carried out the whole series of steps, from environment construction to model construction. This was a beginner's rough explanation, but I hope it helps people in similar circumstances.
Next time, I would like to try another algorithm and compare it with the model created this time, and go one step further by building a mechanism that checks not only the prediction accuracy but also what the actual balance of winnings and losses turns out to be.