I tried forecasting J League match results from past data, reaching 55% accuracy at predicting wins, losses, and draws (training data: 2,573 games; test data: 200 games)
I take no responsibility for any use of the content of this article.
Around 2014, Google predicted World Cup results. Reading the article "When Google analyzed big data to predict the World Cup, it hit every quarterfinal match. Can it go all the way?" made me want to try the same thing for the J League. What interested me about Google's approach: positional data for every player and the ball is available from the soccer data provider OPTA (I searched for the data, but it seems out of reach for an individual), and they used it to, for example, simulate entire games with the Monte Carlo method.
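For a flavor of what such a simulation looks like, here is a toy sketch (not Google's actual method) that draws each team's goal count from a Poisson distribution with made-up scoring rates and tallies outcomes over many simulated games:

import numpy as np

rng = np.random.default_rng(0)
home_rate, away_rate = 1.5, 1.1  # made-up expected goals per team
n_sims = 100_000

# Simulate many games by sampling each team's goals from a Poisson.
home_goals = rng.poisson(home_rate, n_sims)
away_goals = rng.poisson(away_rate, n_sims)

print("win :", np.mean(home_goals > away_goals))
print("draw:", np.mean(home_goals == away_goals))
print("loss:", np.mean(home_goals < away_goals))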
There are quite a few sites with J League data. I collected the data gently, so as not to put any load on them.
(754 input features in total)
Each match is labeled 0 (loss), 1 (win), or 2 (draw).
This time I used BeautifulSoup in earnest for the first time to extract the data from HTML. It's so convenient; I should have started using it sooner. The features should also be normalized to the 0-1 range, which takes about two lines with np.max.
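A minimal sketch of the scraping and normalization steps. The URL, page structure, and CSS selector below are hypothetical stand-ins; the real site's markup will differ.

import time
import numpy as np
import requests
from bs4 import BeautifulSoup

def fetch_match_rows(url):
    # Download one results page and pull the text out of each table row.
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.results tr"):  # hypothetical selector
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    time.sleep(1)  # be polite: pause between requests
    return rows

# Normalizing each feature column to 0-1 really is about two lines with
# np.max (this assumes all features are non-negative):
features = np.load('input.npy').astype(float)
features = features / np.max(features, axis=0)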
Pretty ordinary scikit-learn code:
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Load the scraped feature matrix and labels.
all_input_np = np.load('input.npy')
all_label_np = np.load('label.npy')

# Hold out the last 200 games as the test set.
train_input = all_input_np[:-200]
test_input = all_input_np[-200:]
train_result = all_label_np[:-200]
test_result = all_label_np[-200:]

# Grid-search the regularization parameter C of a linear SVM.
tuned_parameters = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
clf = GridSearchCV(SVC(), tuned_parameters, n_jobs=4, scoring="accuracy")

print(">>>START")
clf.fit(train_input, train_result)

print("Best parameters set found on development set: %s" % clf.best_params_)
print("Grid scores on development set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

print("The scores are computed on the full evaluation set.")
y_true, y_pred = test_result, clf.predict(test_input)
print(classification_report(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
             precision    recall  f1-score   support

        0.0       0.61      0.56      0.59        87
        1.0       0.51      0.78      0.62        78
        2.0       0.00      0.00      0.00        35

avg / total       0.46      0.55      0.50       200

0.55
The accuracy is 55%. As for the labels, 0 is a loss, 1 is a win, and 2 is a draw.
Here is one trick. It seems that about 20% of all soccer matches end in a draw, so to raise the overall accuracy I relabeled draws as wins in the training data and had the model predict only wins and losses. (Conversely, draws become unpredictable.)
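The relabeling itself is not shown in the article; a minimal sketch, assuming the label convention above (0 = loss, 1 = win, 2 = draw), would be:

# Relabel draws (2) as wins (1) in the training labels only; the model
# then never learns to output 2, and test-set draws are always misses.
train_result_wl = np.where(train_result == 2, 1, train_result)
clf.fit(train_input, train_result_wl)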
With a multilayer perceptron (3 layers, ELU activations, dropout rate around 0.5), the accuracy was about 50%.
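The MLP code isn't shown either; here is a sketch of a comparable network in Keras with the stated settings (3 layers, ELU, dropout around 0.5). The layer widths and training schedule are my guesses.

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(754,)),             # the 754 features above
    keras.layers.Dense(256, activation='elu'),    # width is a guess
    keras.layers.Dropout(0.5),
    keras.layers.Dense(128, activation='elu'),    # width is a guess
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation='elu'),     # width is a guess
    keras.layers.Dropout(0.5),
    keras.layers.Dense(3, activation='softmax'),  # loss / win / draw
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_input, train_result, epochs=50, batch_size=32,
          validation_data=(test_input, test_result))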
In the end, it is hard to do much with only the limited data an individual can access. With the data collected this time, I suspect that even pushing the mathematical model to its limit would yield at most about 60% accuracy. A whole-game simulation like Google's seems like a good idea; I'm skeptical that soccer can be reproduced from pass success rates and other aggregate numbers alone. Two directions seem promising. One is linking in image data from the matches, which might enable analysis that humans overlook or that humans cannot keep up with. The other is visualizing the data-analysis process, which I think would show how to improve the model and its input data.