A machine learning beginner's project in Python ... the second article in a series.
In the previous article, "A machine learning beginner built a horse racing prediction model with Python", I summarized the process of building a simple learning model with horse racing as the theme.
Since this is the practical edition, I would like to actually predict the Arima Kinen, the race that settles the accounts for 2020. As I am in charge of the final slot of the Advent calendar, the timing could hardly be better!
The Arima Kinen is one of the GI races of central horse racing in Japan and the big race that concludes the year. It is also called the Grand Prix, and the runners are selected by fan voting.
The race outline is as follows.

- **Date: 2020/12/27 (Sun)**
- **Racecourse: Nakayama Racecourse**
- **Course: Turf 2500m**

Because it is a long-distance race at Nakayama Racecourse, which has tight turns, I personally feel it is a race that is prone to upsets. In fact, unpopular horses often finish in the top three, which makes it a difficult race to predict.
Last time, I used all of the data from the past five years. This time the purpose is clear: I want to hit the Arima Kinen! So let's consider a dataset that fits that goal.
Here is a summary of the number of data samples for each condition.
Condition | Number of races | Number of racehorse records |
---|---|---|
All races | 20,677 | 293,120 |
Turf races at Nakayama Racecourse | 1,276 | 18,109 |
Turf races of 2000m or more at Nakayama Racecourse | 479 | 6,621 |
Turf 2500m races at Nakayama Racecourse | 58 | 711 |
This time, rather than building a general-purpose model, I will focus on building a model that fits the characteristics of the Arima Kinen. However, since a certain number of samples is still needed, I decided to use the dataset of "**turf races of 2000m or more at Nakayama Racecourse**".
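As a reference, filtering such a subset out of the full race data might look like the following minimal sketch. The file name 'races.csv' and the column names 'Racecourse', 'Track', and 'Distance' are assumptions for illustration; the actual CSV layout may differ.

import pandas as pd

# Load the full dataset (hypothetical file name)
df_all = pd.read_csv('races.csv')

# Keep only turf races of 2000m or more at Nakayama Racecourse
cond = (
    (df_all['Racecourse'] == 'Nakayama')
    & (df_all['Track'] == 'Turf')
    & (df_all['Distance'] >= 2000)
)
df = df_all[cond]
print(len(df))  # number of racehorse records in the subset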
As last time, we carry out preprocessing of the data.
In the end, the following data is loaded into a pandas DataFrame.
Data item | Role | Description |
---|---|---|
race_index | index | Identification ID that identifies the race to be held |
Horse number | Explanatory variable | Racehorse's horse number |
Race class | Explanatory variable | The class of the race converted to a numerical value (*1) |
Time index | Explanatory variable | Median of the time index (*2) over the racehorse's last three races |
Passing order 4 corners | Explanatory variable | Median of the racehorse's position at the fourth (final) corner over its last three races |
Jockey name | Explanatory variable | Use the jockey name as a dummy variable |
Stallion name | Explanatory variable | Use the stallion name as a dummy variable |
Within 3 | Objective variable | The finish order of the racehorse converted to 1 if it is within 3rd place and 0 if it is 4th or lower |
(*1) The race class is converted using the following rules.
Race class | Converted number |
---|---|
New horse | 250 |
Not won | 250 |
1 win class/5 million | 500 |
2 win class/Ten million | 1000 |
3 win class/16 million | 1500 |
OP | 2000 |
G3 | 3000 |
G2 | 4500 |
G1 | 7000 |
(*2) The time index is an index of running time in past races, provided by the data acquisition source.
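As an illustration only, the race class conversion in table (*1) and the "median over the last three races" features might be computed roughly as in the sketch below. The class labels in the mapping follow the table above, while the 'Horse ID' and 'Date' columns are hypothetical names used only for this example.

# Convert the race class to a numerical value, following table (*1)
race_class_map = {
    'New horse': 250, 'Not won': 250,
    '1 win class': 500, '2 win class': 1000, '3 win class': 1500,
    'OP': 2000, 'G3': 3000, 'G2': 4500, 'G1': 7000,
}
df['Race class'] = df['Race class'].map(race_class_map)

# Median time index over each horse's previous three races
# ('Horse ID' and 'Date' are hypothetical column names)
df = df.sort_values('Date')
df['Time index'] = (
    df.groupby('Horse ID')['Time index']
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).median())
)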
The original data this time is this group of CSV files.
These files are cleansed, integrated, and transformed, and then loaded into the DataFrame as shown below.
After that, as last time, the objective variable is generated and the explanatory variables are converted into dummy variables.
sample.ipynb
(Omission)
# Add a column indicating whether the finish order is within 3rd place
f_ranking = lambda x: 1 if x in [1, 2, 3] else 0
df['Within 3'] = df['Confirmed order of arrival'].map(f_ranking)
# Generate dummy variables
df = pd.get_dummies(df, columns=['Jockey name'])
df = pd.get_dummies(df, columns=['Stallion name'])
# Set the index (use the first 16 characters of the race ID, which identify the race itself)
df['race_index'] = df['Race ID'].astype(str).str[0:16]
df.set_index('race_index', inplace=True)
# Delete columns that are no longer needed
df.drop(['Race ID', 'Confirmed order of arrival'], axis=1, inplace=True)
This completes the data preprocessing.
Next, we train the model. This time I would like to compare the following classification algorithms, including the logistic regression used last time.
algorithm | Overview |
---|---|
Logistic regression | A method that classifies using a two-class prediction returned as a probability between 0 and 1 |
Support vector machine | A method that classifies by drawing the boundary that separates the classes with the maximum margin |
K-nearest neighbor method | A method that classifies by majority vote among the data points nearest to the point being predicted |
Random forest | A method that builds many decision trees (yes/no branching conditions) and classifies by majority vote |
The following article organizes these methods in an easy-to-understand way. Reference: Roughly organize machine learning information centered on methods
The above algorithms are all included in sklearn, and apart from the step that creates each classifier class, they can be used with exactly the same implementation.
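To illustrate this shared interface, here is a rough sketch of how the same fit/predict/evaluate flow could be written as a single loop. It reuses the classifier settings that appear in the individual sections below and assumes the X_train/X_test/y_train/y_test split prepared in the next steps.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score

classifiers = {
    'Logistic regression': LogisticRegression(max_iter=10000),
    'Support vector machine': SVC(kernel='rbf', gamma=0.1, probability=True),
    'K-nearest neighbor': KNeighborsClassifier(n_neighbors=9),
    'Random forest': RandomForestClassifier(random_state=100, n_estimators=50, min_samples_split=100),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)     # learning
    y_pred = clf.predict(X_test)  # prediction
    print(name,
          accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred),
          f1_score(y_test, y_pred))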
We will carry out the following processing in the same way as last time.
The data is divided into training data and evaluation data, separately for the explanatory variables and the objective variable. This time, to save effort, the explanatory variables are standardized before the split.
sample.ipynb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#Store explanatory variables in dataX
dataX = df.drop(['Within 3'], axis=1)
#Store objective variable in dataY
dataY = df['Within 3']
#Standardize the explanatory variables at this stage
sc = StandardScaler()
dataX_std = pd.DataFrame(sc.fit_transform(dataX), columns=dataX.columns, index=dataX.index)
# Divide the data (training data 0.8, evaluation data 0.2)
X_train, X_test, y_train, y_test = train_test_split(dataX_std, dataY, test_size=0.2, stratify=dataY)
Variable name | Type of data | Role |
---|---|---|
X_train | Explanatory variable | Training data |
X_test | Explanatory variable | Evaluation data |
y_train | Objective variable | Training data |
y_test | Objective variable | Evaluation data |
sample.ipynb
from imblearn.under_sampling import RandomUnderSampler
# Undersample the training data (class 0 is reduced to twice the number of class 1 samples)
f_count = y_train.value_counts()[1] * 2
t_count = y_train.value_counts()[1]
rus = RandomUnderSampler(sampling_strategy={0: f_count, 1: t_count})
X_train, y_train = rus.fit_resample(X_train, y_train)
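To confirm that the undersampling produced the intended roughly 2:1 class ratio, a quick check like the following sketch can be run.

import pandas as pd

# Class distribution of the objective variable after undersampling
print(pd.Series(y_train).value_counts())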
From here, we will train and evaluate the model using each algorithm. The first is logistic regression.
sample.ipynb
from sklearn.linear_model import LogisticRegression
#Create a classifier (logistic regression)
clf = LogisticRegression(max_iter=10000)
#Learning
clf.fit(X_train, y_train)
#Forecast
y_pred = clf.predict(X_test)
#Display correct answer rate
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7488372093023256
#Show precision
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))
0.4158878504672897
#Display F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.39732142857142855
Next, let's verify the support vector machine.
sample.ipynb
from sklearn.svm import SVC
#Create a classifier (support vector machine)
clf = SVC(kernel='rbf', gamma=0.1, probability=True)
#Learning
clf.fit(X_train, y_train)
#Forecast
y_pred = clf.predict(X_test)
#Display correct answer rate
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7581395348837209
#Show precision
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))
0.42168674698795183
#Display F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.35000000000000003
You can see that the implementation is the same as for logistic regression, except for the part that creates the classifier class. In addition, tuning the classifier's parameters can improve accuracy and prevent overfitting.
See the reference below for details. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
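As one way to tune these parameters, a grid search over C and gamma could look like the following sketch. The candidate values in the grid are arbitrary examples, not the settings used in this article.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search for the combination of C and gamma with the best F value (5-fold cross-validation)
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
}
gs = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='f1', cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)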
Next, let's verify the K-nearest neighbor method.
sample.ipynb
from sklearn.neighbors import KNeighborsClassifier
#Create a classifier (K-nearest neighbor method)
clf = KNeighborsClassifier(n_neighbors=9)
#Learning
clf.fit(X_train, y_train)
#Forecast
y_pred = clf.predict(X_test)
#Display correct answer rate
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.68
#Show precision
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))
0.31543624161073824
#Display F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.3533834586466166
This is also the same implementation as logistic regression, except for the part that creates the classifier class. Among the classifier's parameters, it is important to set n_neighbors (the number of neighboring data points used in the majority vote).
See the reference below for details. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
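For example, the effect of n_neighbors could be checked with a simple sweep like this sketch; the candidate values are arbitrary.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Compare the F value on the evaluation data for several values of n_neighbors
for k in [3, 5, 7, 9, 11, 15]:
    clf_k = KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(X_train, y_train)
    print(k, f1_score(y_test, clf_k.predict(X_test)))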
Finally, let's examine Random Forest.
sample.ipynb
from sklearn.ensemble import RandomForestClassifier
#Create a classifier (random forest)
clf = RandomForestClassifier(
random_state=100,
n_estimators=50,
min_samples_split=100
)
#Learning
clf.fit(X_train, y_train)
#Forecast
y_pred = clf.predict(X_test)
#Display correct answer rate
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7851162790697674
#Show precision
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))
0.5121951219512195
#Display F value
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))
0.35294117647058826
This is also the same implementation as logistic regression, except for the part that creates the classifier class. Be sure to tune and optimize the classifier's parameters as well.
See the reference below for details. Reference: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
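As one way to tune the random forest parameters, a grid search similar to the one sketched for the support vector machine could be used; again the candidate values below are arbitrary examples.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Search over the main random forest parameters (5-fold cross-validation, F value)
param_grid = {
    'n_estimators': [50, 100, 200],
    'min_samples_split': [20, 50, 100],
    'max_depth': [None, 5, 10],
}
gs = GridSearchCV(RandomForestClassifier(random_state=100), param_grid, scoring='f1', cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)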
Overfitting (also called over-learning) is, as the name implies, a state in which the model fits only the data it was trained on.
As an easy way to check whether a model tends to overfit, you can run predictions on both the training data and the evaluation data and compare the difference in accuracy.
sample.ipynb
#Create a classifier (random forest) * Try it without parameters
clf = RandomForestClassifier()
#Learning
clf.fit(X_train, y_train)
#Display the correct answer rate using evaluation data for prediction (normal evaluation flow)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.7609302325581395
#Use training data for prediction to display accuracy rate
y_pred_for_train = clf.predict(X_train)
print(accuracy_score(y_train, y_pred_for_train))
0.9992892679459844
I got a striking result when I ran the random forest without any parameters, so I reproduced it above. When a model is overfitting, the accuracy on the training data tends to be excessively higher than the accuracy on the evaluation data, as shown above.
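A somewhat more systematic way to gauge generalization is cross-validation on the training data; a minimal sketch:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation accuracy of the parameterless random forest on the training data
scores = cross_val_score(RandomForestClassifier(), X_train, y_train, cv=5)
print(scores.mean(), scores.std())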
Next, I would like to use data from past runnings of the Arima Kinen to check the accuracy of the predictions. After training and evaluating with each algorithm as above, the following processing is performed.
sample.ipynb
#Arima Kinen race_index list
target_race_indexes = [
'2015122706050810',
'2016122506050910',
'2017122406050811',
'2018122306050811',
'2019122206050811'
]
for idx in target_race_indexes:
    # Get the Arima Kinen explanatory variables (X_target) and objective variable (y_target)
    X_target = dataX_std[dataX_std.index == idx]
    y_target = dataY[idx]
    # Prediction
    y_pred = clf.predict(X_target)
    # Display the results
    print('y=', idx[0:4], 'pred=', y_pred, 'result=', y_target.values, 'precision_score=', precision_score(y_target, y_pred))
The output results of each algorithm are as follows.
sample.ipynb
#For logistic regression
y= 2015 pred= [0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0] result= [0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0] precision_score= 0.0
y= 2016 pred= [0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0] result= [1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0] precision_score= 0.3333333333333333
y= 2017 pred= [0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0] result= [0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0] precision_score= 0.6666666666666666
y= 2018 pred= [0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0] result= [0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0] precision_score= 0.5
y= 2019 pred= [0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0] result= [0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0] precision_score= 0.3333333333333333
#Support vector machine
y= 2015 pred= [0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0] result= [0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0] precision_score= 1.0
y= 2016 pred= [1 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0] result= [1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0] precision_score= 0.6
y= 2017 pred= [0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0] result= [0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0] precision_score= 0.6666666666666666
y= 2018 pred= [0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1] result= [0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0] precision_score= 0.5
y= 2019 pred= [0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0] result= [0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0] precision_score= 0.75
#K-nearest neighbor method
y= 2015 pred= [0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1] result= [0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0] precision_score= 0.2
y= 2016 pred= [1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 1] result= [1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0] precision_score= 0.5
y= 2017 pred= [0 1 1 0 1 1 1 0 0 1 1 1 0 0 0 0] result= [0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0] precision_score= 0.375
y= 2018 pred= [1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 1] result= [0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0] precision_score= 0.3333333333333333
y= 2019 pred= [0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1] result= [0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0] precision_score= 0.25
#For random forest
y= 2015 pred= [0 0 0 1 0 0 1 0 1 1 0 1 0 0 0 0] result= [0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0] precision_score= 0.4
y= 2016 pred= [1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0] result= [1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0] precision_score= 1.0
y= 2017 pred= [0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0] result= [0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0] precision_score= 0.6
y= 2018 pred= [1 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0] result= [0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0] precision_score= 0.3333333333333333
y= 2019 pred= [0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0] result= [0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0] precision_score= 0.3333333333333333
As I expected, the model predicts multiple horses as **1 (within 3rd place)** in each race. Ideally, I would narrow down the horses to buy by comparing the horses running in the same race against one another, but with the current implementation each horse is judged against the dataset as a whole, independently of its rivals. I would like to treat this as an issue for the future.
The accuracy figures for each algorithm are summarized in the table below.
Algorithm | Overall precision | Overall F value | Arima Kinen precision (5 years) |
---|---|---|---|
Logistic regression | 0.42 | 0.40 | 0.0, 0.33, 0.66, 0.5, 0.33 |
Support vector machine | 0.42 | 0.35 | 1.0, 0.6, 0.66, 0.5, 0.75 |
K-nearest neighbor method | 0.32 | 0.35 | 0.2, 0.5, 0.38, 0.33, 0.25 |
Random forest | 0.51 | 0.35 | 0.4, 1.0, 0.6, 0.33, 0.33 |
Which algorithm is best depends on the characteristics of the dataset, the number of samples, and even the timing of the run. With that in mind, based on the above results, we will use either the support vector machine or the random forest (although the support vector machine may be overfitting).
Because the model predicts multiple horses as **1 (within 3rd place)**, the raw classification is hard to use as it is. So I would like to narrow down the purchase targets by ranking the racehorses by their predicted probability of finishing **within 3rd place**.
sample.ipynb
#Arima Kinen race_index list
target_race_indexes = [
'2015122706050810',
'2016122506050910',
'2017122406050811',
'2018122306050811',
'2019122206050811'
]
for idx in target_race_indexes:
    # Get the Arima Kinen explanatory variables (X_target) and objective variable (y_target)
    X_target = dataX_std[dataX_std.index == idx]
    y_target = dataY[idx]
    # Prediction (predict the probability of being classified as 0 or 1)
    y_pred_proba = clf.predict_proba(X_target)
    # Convert to a dictionary (key: horse number, value: probability of being 1)
    keys = list(range(1, y_pred_proba[:, 1].size + 1))
    values = y_pred_proba[:, 1]
    pred_dict = dict(zip(keys, values))
    # Display the results in descending order of probability
    print('y=', idx[0:4])
    print(dict(sorted(pred_dict.items(), key=lambda x: x[1], reverse=True)))
The output is a bit verbose, but it looks like the following. The key of each dictionary is the horse number, and the value is the probability of being classified within 3rd place (each dictionary is sorted in descending order of probability).
sample.ipynb
y= 2015
{7: 0.5696133455536686, 9: 0.4905907696112562, 11: 0.49035299894918755, 13: 0.35007505837022596, 12: 0.34220680265218334, 3: 0.31354320341453473, 4: 0.30980352572486725, 6: 0.30215860817620876, 10: 0.28490440087889995, 16: 0.27909507104899467, 1: 0.27533238657398446, 8: 0.24462710225495993, 2: 0.24459098148537395, 14: 0.24457566067758357, 5: 0.2445741121569982, 15: 0.23657499952423014}
y= 2016
{1: 0.6170252668074172, 2: 0.6051853981429345, 11: 0.5713617761448656, 9: 0.477082991798865, 6: 0.46056067001143736, 12: 0.30720442615574284, 3: 0.30215860817620876, 13: 0.30215860817620876, 8: 0.3007077278874594, 16: 0.2824267715516374, 7: 0.24464207649468928, 10: 0.24460750167495196, 4: 0.24459032539440356, 5: 0.2445880535923202, 14: 0.24458580009313594, 15: 0.24457449358955205}
y= 2017
{10: 0.6170803259427108, 2: 0.617026799448752, 5: 0.4606653690190285, 11: 0.3979634800224914, 14: 0.34913956740973595, 15: 0.3483806159861276, 12: 0.30215860817620876, 4: 0.3021535584865604, 13: 0.30024466402472444, 9: 0.2922074543922137, 1: 0.28743844415935654, 8: 0.2835192845558853, 6: 0.24461953217712495, 7: 0.2445971287209923, 16: 0.24458997746828753, 3: 0.2398748266004306}
y= 2018
{15: 0.5931962545935543, 12: 0.5631034477026525, 13: 0.46364861217784636, 16: 0.4423252760260589, 10: 0.3453931564376497, 3: 0.31157557743661457, 14: 0.30392079440550224, 8: 0.303732258765211, 6: 0.30219848678824074, 2: 0.3021586072259061, 7: 0.302143337075652, 1: 0.2981084912586054, 4: 0.27316635690234087, 5: 0.2445861267179151, 11: 0.2445764568939144, 9: 0.2445733900887549}
y= 2019
{6: 0.6170145067552477, 7: 0.5872900780905845, 10: 0.4904861419159532, 14: 0.43700495515775173, 12: 0.3512586575980933, 2: 0.3087214186649427, 9: 0.30553764130552913, 15: 0.3021220272592637, 16: 0.24776137832454997, 11: 0.2446323520236049, 5: 0.2446088059727512, 13: 0.24459614207316613, 8: 0.24459434296808064, 1: 0.24458784939997164, 4: 0.24457367329291685, 3: 0.24452744515587446}
By using the **predict_proba** method, you can obtain the probability of a sample being classified as 0 or 1 instead of the hard 0/1 classification result. Here we use the probability of being classified as 1.
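Note that the columns returned by predict_proba are ordered according to clf.classes_; a quick check like this sketch confirms that column index 1 really corresponds to class 1.

# The column order of predict_proba follows clf.classes_
print(clf.classes_)            # expected to be [0 1]
print(y_pred_proba[:, 1][:3])  # probabilities of class 1 for the first three horses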
Now let's move on to the main subject: predicting this year's Arima Kinen.
sample.ipynb
# Arima Kinen race_index
target_race_index = '2020122706050811'
# Get the Arima Kinen explanatory variables (X_target)
X_target = dataX_std[dataX_std.index == target_race_index]
# Prediction
y_pred_proba = clf.predict_proba(X_target)
# Convert to a dictionary (key: horse number, value: probability of being 1)
keys = list(range(1, y_pred_proba[:, 1].size + 1))
values = y_pred_proba[:, 1]
pred_dict = dict(zip(keys, values))
# Display the results in descending order of probability
print(dict(sorted(pred_dict.items(), key=lambda x: x[1], reverse=True)))
With the above implementation, I ran the prediction three times each with the support vector machine and the random forest, and each time extracted the three horses with the highest probability of finishing within 3rd place. The results are shown below.
Run | Horse number | Horse name | Probability |
---|---|---|---|
First time | 5 | World premiere | 0.58 |
First time | 13 | Fierement | 0.48 |
First time | 15 | Ocean Great | 0.42 |
Run | Horse number | Horse name | Probability |
---|---|---|---|
Second time | 5 | World premiere | 0.58 |
Second time | 13 | Fierement | 0.58 |
Second time | 14 | Salacia | 0.41 |
Run | Horse number | Horse name | Probability |
---|---|---|---|
Third time | 13 | Fierement | 0.55 |
Third time | 5 | World premiere | 0.53 |
Third time | 10 | Curren Bouquetdore | 0.42 |
Run | Horse number | Horse name | Probability |
---|---|---|---|
First time | 13 | Fierement | 0.63 |
First time | 5 | World premiere | 0.52 |
First time | 4 | Loves Only You | 0.49 |
Run | Horse number | Horse name | Probability |
---|---|---|---|
Second time | 13 | Fierement | 0.64 |
Second time | 5 | World premiere | 0.55 |
Second time | 9 | Chrono Genesis | 0.54 |
Run | Horse number | Horse name | Probability |
---|---|---|---|
Third time | 13 | Fierement | 0.60 |
Third time | 4 | Loves Only You | 0.57 |
Third time | 5 | World premiere | 0.56 |
The result is that #5 World Premiere and #13 Fierement stand clearly ahead, with the third spot contested. However, it is a little troubling that the horse in third place changes on every run. Since the number of data samples is not that large and train_test_split is used to split the training and evaluation data randomly each time, the split itself seems to be influencing the result.
This time I will leave it as it is and treat it as an issue for the future.
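As a note for that future work, one minimal way to at least make the split (and hence the prediction) reproducible would be to fix the random seed of train_test_split, for example:

# Fixing random_state makes the train/test split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    dataX_std, dataY, test_size=0.2, stratify=dataY, random_state=0
)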
This time I tried out the learning model I had built in a practical setting, and, as expected, practice and the real thing turned out to be different, which gave me a number of insights. Next time, I would like to work on the issues I postponed and on building a more general-purpose model.
I actually wanted to buy a betting ticket, upload a captured image, and close with that, but the sale has not started yet... sorry. I will add it once I actually make the purchase.
**Added on December 26, 2020:** Referring to the prediction results, I purchased a Wide 5-13 ticket for one point.