This article explains the internal code of "Today, do you have a good prediction?", a boat race trifecta prediction site I built and released on the Web. This time, I summarize the machine learning model. Data acquisition and shaping are covered in the separate article below, so please refer to it.
Create a data frame from the acquired boat race text data
I also referred heavily to the following article:
・ Horse Racing Prediction AI Again -Part 1- ~ Lambda Rank ~
Learning to rank is a method for learning relative order relationships. As in the article linked above, I thought it would be well suited to learning the relative strength of multiple competitors, as in horse racing and boat racing.
I am still working through the paper (laughs), but I will just try it first. The library used is lightgbm.
This time, the training data covers January through April 2020, and the data from May 2020 is used as validation data.
One distinctive input in learning to rank is the query data. The query data records how many training rows belong to each race. In boat racing, a race is normally held with 6 boats, so the query data should be

query data = [6, 6, 6, ..., 6]

that is, a list of 6s with one entry per race (assuming no absentees).
So we build this query data list with the following code.
%%time  # measure cell execution time

import pandas as pd

target_cols = ["Position"]
feature_cols = ["Name", "Lane", "Round", "Month", "Place"]

train = pd.read_csv(train_file)
train_shuffle = pd.DataFrame([], columns=train.columns)
train_group = []

for i, k in enumerate(train["Round"]):
    if i == 0:
        temp = k    # Round value of the current race
        temp2 = i   # index where the current race starts
    else:
        if k == temp:
            pass
        else:
            # the Round changed: record the size of the finished race
            train_group.append(i - temp2)
            # shuffle the rows within the race with .sample
            # (DataFrame.append was removed in pandas 2.0; use pd.concat there)
            train_shuffle = train_shuffle.append(train[temp2:i].sample(frac=1))
            temp = k
            temp2 = i

# the last race is never closed inside the loop, so append it here
train_group.append(i + 1 - temp2)
train_shuffle = train_shuffle.append(train[temp2:i + 1].sample(frac=1))

train_y = train_shuffle[target_cols].astype(int)
train = train_shuffle[feature_cols]
print(train.shape)
The file read by read_csv is the one produced in the article Create a data frame from the acquired boat race text data.
Rows sharing the same Round are counted, and the count is stored in the train_group list. The reference article warns that **it is dangerous to leave the rows within each group in their original order**, so the rows are shuffled with .sample(frac=1) before being stored in train_shuffle.
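Incidentally, the same consecutive-run grouping can be written more compactly with pandas. Here is a minimal sketch assuming the same column names; the helper name split_by_race is my own, not from the original code:

def split_by_race(df):
    # a new race starts whenever Round differs from the previous row
    race_id = df["Round"].ne(df["Round"].shift()).cumsum()
    # shuffle rows within each race; race_id is monotonically increasing,
    # so groupby preserves the original race order
    shuffled = df.groupby(race_id, group_keys=False).apply(lambda g: g.sample(frac=1))
    group_sizes = df.groupby(race_id).size().tolist()
    return shuffled, group_sizes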
Applying the same code to the validation data produces the validation query data (val_group) along with val_y and the validation features.
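With the sketch above, the validation side reduces to something like this (val_file is a placeholder mirroring train_file):

val = pd.read_csv(val_file)
val_shuffle, val_group = split_by_race(val)
val_y = val_shuffle[target_cols].astype(int)
val = val_shuffle[feature_cols]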
I will omit the missing-value handling, feature engineering, one-hot encoding, and so on, but once the training data and query data are ready, running machine learning is easy these days. One caveat: possibly a lightgbm restriction, an error occurs if column names contain Japanese, so the following processing was added.
# rename columns to ASCII-only names, since lightgbm rejects Japanese ones
column_list = []
for i in range(len(comb_onehot.columns)):
    column_list.append(str(i) + '_column')
comb_onehot.columns = column_list

train_onehot = comb_onehot[:len(train)]
val_onehot = comb_onehot[len(train):]
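For reference, comb_onehot here is presumably the training and validation features concatenated and then one-hot encoded, which is also why the last two lines split it back apart at len(train). A sketch of that omitted step (encoding all five feature columns is my assumption):

# stack training features on top of validation features
comb = pd.concat([train, val], axis=0, ignore_index=True)
# one-hot encode the categorical features
comb_onehot = pd.get_dummies(comb, columns=feature_cols)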
Splitting comb_onehot at len(train) re-separates the rows for training and validation. Now, let's run the training.
import lightgbm as lgb

lgbm_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'lambdarank',  # <- this is where we specify learning to rank!
    'metric': 'ndcg',           # for lambdarank
    'ndcg_eval_at': [1, 2, 3],  # we want to predict the top three (trifecta)
    'max_position': 6,          # a boat race has at most 6 boats
    'learning_rate': 0.01,
    'min_data': 1,
    'min_data_in_bin': 1,
    # 'num_leaves': 31,
    # 'min_data_in_leaf': 20,
    # 'max_depth': 35,
}

lgtrain = lgb.Dataset(train_onehot, train_y, group=train_group)
lgvalid = lgb.Dataset(val_onehot, val_y, group=val_group)

lgb_clf = lgb.train(
    lgbm_params,
    lgtrain,
    num_boost_round=250,
    valid_sets=[lgtrain, lgvalid],
    valid_names=['train', 'valid'],
    early_stopping_rounds=20,
    verbose_eval=5,
)
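Note that early_stopping_rounds and verbose_eval were removed as train() arguments in LightGBM 4.x; on a recent version the same call would use callbacks instead:

lgb_clf = lgb.train(
    lgbm_params,
    lgtrain,
    num_boost_round=250,
    valid_sets=[lgtrain, lgvalid],
    valid_names=['train', 'valid'],
    # early stopping and per-5-round logging as callbacks
    callbacks=[lgb.early_stopping(20), lgb.log_evaluation(5)],
)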
Hyperparameters such as num_leaves ought to be tuned, but let's push ahead without worrying about them here. Prediction on the validation data looks like this. What a convenient age we live in...
y_pred = lgb_clf.predict(val_onehot, group=val_group, num_iteration=lgb_clf.best_iteration)
The resulting hit rates from learning to rank are shown below. The trifecta (top three in exact order) hits 8.15% of the time!
Incidentally, to compute the hit rates above (especially 2nd and 3rd place), I wrote the following code. Hmm, redundant!
#Calculation of validation-data hit rates
import numpy as np

j = 0
solo_count = 0  # win: 1st place predicted correctly
doub_count = 0  # exacta: 1st and 2nd correct, in order
tri_count = 0   # trifecta: 1st, 2nd and 3rd correct, in order

for i in val_group:
    result = y_pred[j:j + i]
    ans = val_y[j:j + i].reset_index()
    # Position is used as the label as-is (1 = winner is the smallest label),
    # so lambdarank learns to give the winner the lowest score: hence argmin
    result1st = np.argmin(result)
    if i > 1 and len(np.where(result == sorted(result)[1])[0]) > 1:
        # tie: the second-smallest score appears more than once
        result2nd = np.where(result == sorted(result)[1])[0][0]
        result3rd = np.where(result == sorted(result)[1])[0][1]
    else:
        if i > 1:
            result2nd = np.where(result == sorted(result)[1])[0][0]
        if i > 2:
            result3rd = np.where(result == sorted(result)[2])[0][0]
    ans1st = int(ans[ans["Position"] == 1].index.values)
    if len(ans[ans["Position"] == 2].index.values) > 1:
        # dead heat: two boats share 2nd place
        ans2nd = int(ans[ans["Position"] == 2].index.values[0])
        ans3rd = int(ans[ans["Position"] == 2].index.values[1])
    else:
        if i > 1:
            ans2nd = int(ans[ans["Position"] == 2].index.values[0])
        if i > 2:
            ans3rd = int(ans[ans["Position"] == 3].index.values[0])
    if ans1st == result1st:
        #print(ans1st, result1st)
        solo_count = solo_count + 1
    if i > 1:
        if (ans1st == result1st) & (ans2nd == result2nd):
            doub_count = doub_count + 1
    if i > 2:
        if (ans1st == result1st) & (ans2nd == result2nd) & (ans3rd == result3rd):
            tri_count = tri_count + 1
    j = j + i

print("Win hit rate:", round(solo_count / len(val_group) * 100, 2), "%")
print("Exacta hit rate:", round(doub_count / len(val_group) * 100, 2), "%")
print("Trifecta hit rate:", round(tri_count / len(val_group) * 100, 2), "%")
The result above is a better hit rate than betting blindly: the most frequent trifecta combination is "1-2-3", and even that occurs only about 7% of the time.
Still, this hit rate alone is not good enough, so I felt one more trick was needed. I would like to cover that in a separate article.