The other day I read the book Bayesian Statistical Modeling with Stan and R, and it made me want to try statistical modeling myself. It is a very good book and I recommend it.
Stan can be used not only from R but also from Python, via the PyStan library. This time, I will build a model that estimates the strength (a rating-like value) of each team in the 2016 soccer J1 League from the match results.
You can check J-League match results at http://www.jleague.jp/stats/SFTD01.html. Suppose that, one way or another, you have collected the following data for the 2016 1st stage: https://gist.github.com/mokemokechicken/a26eb7230e51fba7018476f7706e5b0b
away_score | away_team | date | home_score | home_team | score | visitors |
---|---|---|---|---|---|---|
1 | Nagoya | 02/27 (Sat) | 0 | Iwata | 0-1 | 14333 |
1 | Kawasaki F | 02/27 (Sat) | 0 | Hiroshima | 0-1 | 18120 |
1 | Fukuoka | 02/27 (Sat) | 2 | Tosu | 2-1 | 19762 |
... | ... | ... | ... | ... | ... | ... |
There are 153 matches in total in the 1st stage (= 18 * 17 / 2), since each of the 18 teams plays every other team once.

Let's assume that a match result (the score) is generated by the following kind of mechanism: each team has an "attack power" and a "defense power", the home team additionally gets a "home power" boost, and the number of goals a team scores is driven by its own attack power minus the opposing team's defense power. Well, nothing about this seems particularly strange so far.
Next, let's move into the world of statistics. First, assume (somewhat arbitrarily) that a "score" follows a Poisson distribution. In other words,
(a team's score in a match) ~ Poisson(that team's attack power - the opposing team's defense power)

(By the way, `~` means "follows the distribution on the right".)
However, the rate parameter of a Poisson distribution cannot be negative, so for convenience we wrap the whole expression in `exp`, and while we are at it we add a home power term:
(home team's score in a match) ~ Poisson(exp((home power + home team's attack power) - away team's defense power))

and, symmetrically, for the goals the home team concedes (i.e. the away team's score):

(home team's goals conceded in a match) ~ Poisson(exp(away team's attack power - (home power + home team's defense power)))
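As a quick sanity check, here is a minimal numpy sketch of this generative assumption (the parameter values are made up purely for illustration, not estimates):

```python
import numpy as np

# illustrative parameter values only
home_power, atk_home, def_home = 0.1, 0.3, 0.1
atk_away, def_away = 0.2, 0.0

rng = np.random.default_rng(0)
home_goals = rng.poisson(np.exp((home_power + atk_home) - def_away))
away_goals = rng.poisson(np.exp(atk_away - (home_power + def_home)))
print("simulated score: {}-{}".format(home_goals, away_goals))
```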
Written as a Stan model, it turned out to be something like this.
model_code = """
data {
  int N;     // number of games
  int K;     // number of teams
  int Th[N]; // home team ID
  int Ta[N]; // away team ID
  int Sh[N]; // home team score
  int Sa[N]; // away team score
}

parameters {
  real atk[K];
  real def[K];
  real home_power[K];
  real<lower=0> sigma;
  real<lower=0> hp_sigma;
}

model {
  // priors: team abilities are centered at 0
  for (k in 1:K) {
    atk[k] ~ normal(0, sigma);
    def[k] ~ normal(0, sigma);
    home_power[k] ~ normal(0, hp_sigma);
  }

  // likelihood: home and away scores of each game
  for (n in 1:N) {
    Sh[n] ~ poisson(exp(
      (home_power[Th[n]] + atk[Th[n]]) -
      (def[Ta[n]])
    ));
    Sa[n] ~ poisson(exp(
      (atk[Ta[n]]) -
      (def[Th[n]] + home_power[Th[n]])
    ));
  }
}

generated quantities {
  // simulated scores for every home/away pairing of teams
  real games[K, K, 2];
  for (th in 1:K) {
    for (ta in 1:K) {
      games[th, ta, 1] = poisson_rng(exp((home_power[th] + atk[th]) - (def[ta])));
      games[th, ta, 2] = poisson_rng(exp((atk[ta]) - (def[th] + home_power[th])));
    }
  }
}
"""
Regarding Stan, the book I introduced earlier is very helpful, but I will explain it briefly here as well.
data {
  int N;     // number of games
  int K;     // number of teams
  int Th[N]; // home team ID
  int Ta[N]; // away team ID
  int Sh[N]; // home team score
  int Sa[N]; // away team score
}
This `data` block declares the data that is supplied from outside. The data collected earlier will be passed in later under these names.
parameters {
  real atk[K];
  real def[K];
  real home_power[K];
  real<lower=0> sigma;
  real<lower=0> hp_sigma;
}
This `parameters` block declares the parameters to be estimated. The main ones this time are attack power (`atk`), defense power (`def`), and home power (`home_power`). `<lower=0>` constrains the range the parameter may take.
model {
  // priors: team abilities are centered at 0
  for (k in 1:K) {
    atk[k] ~ normal(0, sigma);
    def[k] ~ normal(0, sigma);
    home_power[k] ~ normal(0, hp_sigma);
  }

  // likelihood: home and away scores of each game
  for (n in 1:N) {
    Sh[n] ~ poisson(exp(
      (home_power[Th[n]] + atk[Th[n]]) -
      (def[Ta[n]])
    ));
    Sa[n] ~ poisson(exp(
      (atk[Ta[n]]) -
      (def[Th[n]] + home_power[Th[n]])
    ));
  }
}
This `model` block uses probability distributions to express the relationship between the input data and the parameters to be estimated; the "mechanism" described earlier is written down here.

In addition, each team's attack power and defense power are assumed to follow a normal distribution with mean 0 and standard deviation `sigma`, and this `sigma` is itself (incidentally) one of the estimated parameters. Without these priors the calculation (MCMC) did not converge (the values wandered off over all sorts of ranges). In effect, it is like writing down the hope "please settle somewhere around here for now".
Interestingly, the calculation converged more easily when `sigma` was estimated than when it was hard-coded as 1 (the estimated `sigma` came out much smaller than 1). Intuitively, fixing it at 1 feels like it should constrain the range more, so it is interesting that this was not the case.
generated quantities {
  // simulated scores for every home/away pairing of teams
  real games[K, K, 2];
  for (th in 1:K) {
    for (ta in 1:K) {
      games[th, ta, 1] = poisson_rng(exp((home_power[th] + atk[th]) - (def[ta])));
      games[th, ta, 2] = poisson_rng(exp((atk[ta]) - (def[th] + home_power[th])));
    }
  }
}
This `generated quantities` block is a kind of bonus: it performs additional calculations using the sampled parameter values. Here, each team's parameters are used to sample the score of a match between every pair of teams. You could do this calculation separately on the Python side, but since the logic has to match the model, it seems easier to maintain if it is kept here.
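As a side note, here is roughly what that simulated-score array looks like on the Python side once the model has been fit (the `fitted` object is only created further down, so this is just a forward-looking sketch):

```python
# posterior draws are stacked along the first axis by fitted.extract()
games = fitted.extract()["games"]
print(games.shape)  # (number of draws, K, K, 2)
```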
There is also a `transformed parameters` block, which lets you define new variables from the `data` and `parameters` variables. Expressions such as `(home_power[th] + atk[th]) - (def[ta])` appear in both `model` and `generated quantities` this time, so it might be cleaner to factor them out there... maybe.
from hashlib import sha256
import pickle
import time

import requests
import pandas as pd
from pystan import StanModel


def stan_cache(file=None, model_code=None, model_name=None, cache_dir="."):
    """Use just as you would `stan`"""
    # http://pystan.readthedocs.io/en/latest/avoiding_recompilation.html#automatically-reusing-models
    if file:
        if file.startswith("http"):
            model_code = requests.get(file).text
        else:
            with open(file) as f:
                model_code = f.read()
    code_hash = sha256(model_code.encode('utf8')).hexdigest()
    if model_name is None:
        cache_fn = '{}/cached-model-{}.pkl'.format(cache_dir, code_hash)
    else:
        cache_fn = '{}/cached-{}-{}.pkl'.format(cache_dir, model_name, code_hash)
    try:
        sm = pickle.load(open(cache_fn, 'rb'))
    except:
        start_time = time.time()
        sm = StanModel(model_code=model_code)
        print("compile time: %s sec" % (time.time() - start_time))
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    return sm


def sampling_summary_to_df(fit4model):
    """
    :param StanFit4model fit4model:
    :return:
    """
    s = fit4model.summary()
    return pd.DataFrame(s["summary"], columns=s['summary_colnames'], index=s['summary_rownames'])
The `stan_cache` function caches the compiled Stan model, and it can also fetch the model code over HTTP. Convenient.

The `sampling_summary_to_df` function turns the result of fitting the model into a pandas DataFrame, which is easier to look at when working in Jupyter.
import numpy as np
import pandas as pd

df = pd.read_csv("https://gist.githubusercontent.com/mokemokechicken/a26eb7230e51fba7018476f7706e5b0b/raw/90e1c26e172b92d5f52d53d0591a611e0258bd2b/2016-j1league-1st.csv")

team_list = list(sorted(set(df['away_team']) | set(df['home_team'])))

# map both ways in one dict: 1-based team ID -> name and name -> team ID
teams = dict(zip(range(1, len(team_list)+1), team_list))
for k, v in list(teams.items()):
    teams[v] = k

model = stan_cache(model_code=model_code)

stan_input = {
    "N": len(df),
    "K": len(team_list),
    "Th": df['home_team'].apply(lambda x: teams[x]),
    "Ta": df['away_team'].apply(lambda x: teams[x]),
    "Sh": df['home_score'],
    "Sa": df['away_score'],
}
fitted = model.sampling(data=stan_input, seed=999)
sampling_summary_to_df(fitted).sort_values("Rhat", ascending=False)
For example, the summary of the parameter estimation results is as follows.
The book says that if `Rhat` is below 1.1 for every estimated parameter, the calculation can be considered to have converged, so I will treat it as converged for now.
However, `n_eff`, roughly the number of effectively independent samples, is below 100 for some parameters, which is a little questionable (especially for interval estimates), but I will let it pass this time.
 | mean | se_mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | n_eff | Rhat |
---|---|---|---|---|---|---|---|---|---|---|
lp__ | -214.114891 | 1.542823 | 13.712915 | -237.604900 | -223.960556 | -215.747425 | -205.033248 | -183.817981 | 79.0 | 1.049519 |
hp_sigma | 0.094069 | 0.005835 | 0.063921 | 0.011408 | 0.044970 | 0.081836 | 0.132181 | 0.245524 | 120.0 | 1.031258 |
atk[12] | 0.048430 | 0.009507 | 0.158794 | -0.318161 | -0.052664 | 0.052359 | 0.155878 | 0.342186 | 279.0 | 1.018430 |
atk[16] | 0.225840 | 0.008570 | 0.158951 | -0.075769 | 0.115129 | 0.223274 | 0.330571 | 0.531899 | 344.0 | 1.013612 |
sigma | 0.220279 | 0.002456 | 0.056227 | 0.112035 | 0.182132 | 0.217550 | 0.255500 | 0.344382 | 524.0 | 1.011379 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
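For what it's worth, the same convergence check can be done programmatically with the helper defined earlier:

```python
summary_df = sampling_summary_to_df(fitted)

# rule of thumb from the book: Rhat < 1.1 for every parameter
print("parameters with Rhat >= 1.1:", (summary_df["Rhat"] >= 1.1).sum())

# n_eff is a rough count of effectively independent samples per parameter
print("parameters with n_eff < 100:", (summary_df["n_eff"] < 100).sum())
```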
Collect parameters by team.
tk = {}
params = fitted.extract()
atks = params['atk']
defs = params['def']
home_power = params['home_power']
games = params['games']

# posterior mean of each parameter, per team
for k, team in enumerate(team_list):
    tk[team] = {
        "atk": np.mean(atks[:, k]),
        "def": np.mean(defs[:, k]),
        "home_power": np.mean(home_power[:, k]),
    }

tdf = pd.DataFrame(tk).T
For the time being, I will arrange them in descending order of "home power".
tdf.sort_values("home_power", ascending=False)
 | atk | def | home_power |
---|---|---|---|
Kashima | 0.225840 | 0.217699 | 0.068898 |
Urawa | 0.152638 | 0.063192 | 0.041830 |
Kobe | 0.099154 | -0.163383 | 0.030443 |
Hiroshima | 0.293808 | -0.000985 | 0.029861 |
Nagoya | 0.130757 | -0.247969 | 0.015849 |
Kawasaki F | 0.312593 | 0.087859 | 0.014728 |
Niigata | 0.010827 | -0.146252 | 0.011885 |
Iwata | 0.048430 | -0.105230 | 0.011843 |
Sendai | 0.037960 | -0.150151 | 0.005462 |
FC Tokyo | -0.072115 | 0.017485 | -0.005050 |
Gamba Osaka | 0.087034 | -0.033666 | -0.005532 |
Tosu | -0.232595 | 0.103412 | -0.006186 |
Kashiwa | 0.036172 | -0.053332 | -0.009568 |
Omiya | -0.040014 | 0.027842 | -0.020942 |
Yokohama FM | 0.059420 | -0.009322 | -0.021393 |
Fukuoka | -0.185292 | -0.130759 | -0.026643 |
Kofu | 0.002608 | -0.268061 | -0.037079 |
Shonan | -0.000052 | -0.184515 | -0.049440 |
So this is how it turned out. The average strength is centered at 0 in this model, so being above or below 0 means being better or worse than the J1 average.
By the way, running the same thing on the 2nd-stage data gives quite different results (Kashima comes out much worse). In the end this is simply what the 1st-stage data says, so I would not read too much into it.
If you think about it this is only natural, but the result is close to the 2016 1st-stage standings: https://ja.wikipedia.org/wiki/2016%E5%B9%B4%E3%81%AEJ1%E3%83%AA%E3%83%BC%E3%82%B0#.E9.A0.86.E4.BD.8D.E8.A1.A8 In particular, total goals scored / conceded and the estimated attack / defense power seem broadly correlated (it would be a problem if they were not).
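That kind of rough check can be done directly from the data we already have (a sketch reusing `df` and `tdf` from above):

```python
# compare estimated abilities with the goals each team actually scored / conceded
goals_for = (df.groupby("home_team")["home_score"].sum()
             + df.groupby("away_team")["away_score"].sum())
goals_against = (df.groupby("home_team")["away_score"].sum()
                 + df.groupby("away_team")["home_score"].sum())

check = tdf.copy()
check["goals_for"] = goals_for
check["goals_against"] = goals_against

# atk should correlate positively with goals scored,
# def should correlate negatively with goals conceded
print(check["atk"].corr(check["goals_for"]))
print(check["def"].corr(check["goals_against"]))
```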
I also tried simulating what would have happened if I had bet on 2nd-stage Toto with this prediction model, but the results looked so bad that I stopped (laughs). It would probably do better when many matches are crammed into a short period; predicting round 17 from the data up to round 16 might work a little better.
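For reference, here is roughly how match-outcome probabilities could be read off the `games` array sampled in `generated quantities` (a sketch; `teams`, `games`, and `np` are the objects defined above, and the team names are only illustrative — they have to match the strings in the CSV):

```python
def outcome_probabilities(home_team, away_team):
    """Estimate P(home win), P(draw), P(away win) from the posterior samples."""
    th, ta = teams[home_team] - 1, teams[away_team] - 1  # 0-based indices into games
    home_goals = games[:, th, ta, 0]
    away_goals = games[:, th, ta, 1]
    return {
        "home_win": float(np.mean(home_goals > away_goals)),
        "draw": float(np.mean(home_goals == away_goals)),
        "away_win": float(np.mean(home_goals < away_goals)),
    }

# e.g. Kashima at home against Urawa (illustrative team names)
print(outcome_probabilities("Kashima", "Urawa"))
```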
It's amazing that this kind of modeling has become easy, and it's a lot of fun.