The other day I read the book Bayesian Statistical Modeling with Stan and R, and it made me want to try statistical modeling myself. It is a very good book and I recommend it.
Stan can be used not only from R but also from Python, via the PyStan library. This time, I will build a model that estimates the strength (a rating-like value) of each team in the 2016 soccer J1 League from the match results.
You can check J-League match results at http://www.jleague.jp/stats/SFTD01.html. Suppose that, one way or another, you have collected the following data for the 2016 1st stage: https://gist.github.com/mokemokechicken/a26eb7230e51fba7018476f7706e5b0b
away_score | away_team | date | home_score | home_team | score | visitors |
---|---|---|---|---|---|---|
1 | Nagoya | 02/27 (Sat) | 0 | Iwata | 0-1 | 14333 |
1 | Kawasaki F | 02/27 (Sat) | 0 | Hiroshima | 0-1 | 18120 |
1 | Fukuoka | 02/27 (Sat) | 2 | Tosu | 2-1 | 19762 |
... | ... | ... | ... | ... | ... | ... |
There are 153 matches in total in the 1st stage (= 18 * 17 / 2), since each of the 18 teams plays every other team once.

Let's assume that a match result (the score) is generated by the following kind of mechanism: each team has an "attack power" and a "defense power", the home team additionally gets a "home power" boost, and the number of goals a team scores is driven by its own attack power minus the opposing team's defense power. Well, nothing about this seems particularly strange so far.
Next, let's move into the world of statistics. First, assume (somewhat arbitrarily) that a "score" follows a Poisson distribution. In other words,
(a team's score in a match) ~ Poisson(that team's attack power - the opposing team's defense power)

(By the way, `~` means "follows the distribution on the right".)
However, the rate parameter of a Poisson distribution cannot be negative, so for convenience we wrap the whole expression in `exp`, and while we are at it we add a home power term:
(home team's score in a match) ~ Poisson(exp((home power + home team's attack power) - away team's defense power))

and, symmetrically, for the goals the home team concedes (i.e. the away team's score):

(home team's goals conceded in a match) ~ Poisson(exp(away team's attack power - (home power + home team's defense power)))
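As a quick sanity check, here is a minimal numpy sketch of this generative assumption (the parameter values are made up purely for illustration, not estimates):

```python
import numpy as np

# illustrative parameter values only
home_power, atk_home, def_home = 0.1, 0.3, 0.1
atk_away, def_away = 0.2, 0.0

rng = np.random.default_rng(0)
home_goals = rng.poisson(np.exp((home_power + atk_home) - def_away))
away_goals = rng.poisson(np.exp(atk_away - (home_power + def_home)))
print("simulated score: {}-{}".format(home_goals, away_goals))
```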
Written as a Stan model, it turned out to be something like this.
model_code = """
data {
  int N;     // number of games
  int K;     // number of teams
  int Th[N]; // home team ID
  int Ta[N]; // away team ID
  int Sh[N]; // home team score
  int Sa[N]; // away team score
}

parameters {
  real atk[K];
  real def[K];
  real home_power[K];
  real<lower=0> sigma;
  real<lower=0> hp_sigma;
}

model {
  // priors: team abilities are centered at 0
  for (k in 1:K) {
    atk[k] ~ normal(0, sigma);
    def[k] ~ normal(0, sigma);
    home_power[k] ~ normal(0, hp_sigma);
  }

  // likelihood: home and away scores of each game
  for (n in 1:N) {
    Sh[n] ~ poisson(exp(
      (home_power[Th[n]] + atk[Th[n]]) -
      (def[Ta[n]])
    ));
    Sa[n] ~ poisson(exp(
      (atk[Ta[n]]) -
      (def[Th[n]] + home_power[Th[n]])
    ));
  }
}

generated quantities {
  // simulated scores for every home/away pairing of teams
  real games[K, K, 2];
  for (th in 1:K) {
    for (ta in 1:K) {
      games[th, ta, 1] = poisson_rng(exp((home_power[th] + atk[th]) - (def[ta])));
      games[th, ta, 2] = poisson_rng(exp((atk[ta]) - (def[th] + home_power[th])));
    }
  }
}
"""
Regarding Stan, the book I introduced earlier is very helpful, but I will explain it briefly here as well.
data {
  int N;     // number of games
  int K;     // number of teams
  int Th[N]; // home team ID
  int Ta[N]; // away team ID
  int Sh[N]; // home team score
  int Sa[N]; // away team score
}
This `data` block declares the data that is supplied from outside. The data collected earlier will be passed in later under these names.
parameters {
  real atk[K];
  real def[K];
  real home_power[K];
  real<lower=0> sigma;
  real<lower=0> hp_sigma;
}
This `parameters` block declares the parameters to be estimated. The main ones this time are attack power (`atk`), defense power (`def`), and home power (`home_power`). `<lower=0>` constrains the range the parameter may take.
model {
  // priors: team abilities are centered at 0
  for (k in 1:K) {
    atk[k] ~ normal(0, sigma);
    def[k] ~ normal(0, sigma);
    home_power[k] ~ normal(0, hp_sigma);
  }

  // likelihood: home and away scores of each game
  for (n in 1:N) {
    Sh[n] ~ poisson(exp(
      (home_power[Th[n]] + atk[Th[n]]) -
      (def[Ta[n]])
    ));
    Sa[n] ~ poisson(exp(
      (atk[Ta[n]]) -
      (def[Th[n]] + home_power[Th[n]])
    ));
  }
}
This `model` block uses probability distributions to express the relationship between the input data and the parameters to be estimated; the "mechanism" described earlier is written down here.

In addition, each team's attack power and defense power are assumed to follow a normal distribution with mean 0 and standard deviation `sigma`, and this `sigma` is itself (incidentally) one of the estimated parameters. Without these priors the calculation (MCMC) did not converge (the values wandered off over all sorts of ranges). In effect, it is like writing down the hope "please settle somewhere around here for now".
Interestingly, the calculation converged more easily when `sigma` was estimated than when it was hard-coded as 1 (the estimated `sigma` came out much smaller than 1). Intuitively, fixing it at 1 feels like it should constrain the range more, so it is interesting that this was not the case.
generated quantities {
  // simulated scores for every home/away pairing of teams
  real games[K, K, 2];
  for (th in 1:K) {
    for (ta in 1:K) {
      games[th, ta, 1] = poisson_rng(exp((home_power[th] + atk[th]) - (def[ta])));
      games[th, ta, 2] = poisson_rng(exp((atk[ta]) - (def[th] + home_power[th])));
    }
  }
}
This `generated quantities` block is a kind of bonus: it performs additional calculations using the sampled parameter values. Here, each team's parameters are used to sample the score of a match between every pair of teams. You could do this calculation separately on the Python side, but since the logic has to match the model, it seems easier to maintain if it is kept here.
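As a side note, here is roughly what that simulated-score array looks like on the Python side once the model has been fit (the `fitted` object is only created further down, so this is just a forward-looking sketch):

```python
# posterior draws are stacked along the first axis by fitted.extract()
games = fitted.extract()["games"]
print(games.shape)  # (number of draws, K, K, 2)
```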
There is also a `transformed parameters` block, which lets you define new variables from the `data` and `parameters` variables. Expressions such as `(home_power[th] + atk[th]) - (def[ta])` appear in both `model` and `generated quantities` this time, so it might be cleaner to factor them out there... maybe.
from hashlib import sha256
import pickle
import time

import requests
import pandas as pd
from pystan import StanModel


def stan_cache(file=None, model_code=None, model_name=None, cache_dir="."):
    """Use just as you would `stan`"""
    # http://pystan.readthedocs.io/en/latest/avoiding_recompilation.html#automatically-reusing-models
    if file:
        if file.startswith("http"):
            model_code = requests.get(file).text
        else:
            with open(file) as f:
                model_code = f.read()
    code_hash = sha256(model_code.encode('utf8')).hexdigest()
    if model_name is None:
        cache_fn = '{}/cached-model-{}.pkl'.format(cache_dir, code_hash)
    else:
        cache_fn = '{}/cached-{}-{}.pkl'.format(cache_dir, model_name, code_hash)
    try:
        sm = pickle.load(open(cache_fn, 'rb'))
    except:
        start_time = time.time()
        sm = StanModel(model_code=model_code)
        print("compile time: %s sec" % (time.time() - start_time))
        with open(cache_fn, 'wb') as f:
            pickle.dump(sm, f)
    return sm


def sampling_summary_to_df(fit4model):
    """
    :param StanFit4model fit4model:
    :return:
    """
    s = fit4model.summary()
    return pd.DataFrame(s["summary"], columns=s['summary_colnames'], index=s['summary_rownames'])
The `stan_cache` function caches the compiled Stan model, and it can also fetch the model code over HTTP. Convenient.

The `sampling_summary_to_df` function turns the result of fitting the model into a pandas DataFrame, which is easier to look at when working in Jupyter.
import numpy as np
import pandas as pd

df = pd.read_csv("https://gist.githubusercontent.com/mokemokechicken/a26eb7230e51fba7018476f7706e5b0b/raw/90e1c26e172b92d5f52d53d0591a611e0258bd2b/2016-j1league-1st.csv")

team_list = list(sorted(set(df['away_team']) | set(df['home_team'])))

# map both ways in one dict: 1-based team ID -> name and name -> team ID
teams = dict(zip(range(1, len(team_list)+1), team_list))
for k, v in list(teams.items()):
    teams[v] = k

model = stan_cache(model_code=model_code)

stan_input = {
    "N": len(df),
    "K": len(team_list),
    "Th": df['home_team'].apply(lambda x: teams[x]),
    "Ta": df['away_team'].apply(lambda x: teams[x]),
    "Sh": df['home_score'],
    "Sa": df['away_score'],
}
fitted = model.sampling(data=stan_input, seed=999)
sampling_summary_to_df(fitted).sort_values("Rhat", ascending=False)
For example, the summary of the parameter estimation results is as follows.
The book says that if `Rhat` is below 1.1 for every estimated parameter, the calculation can be considered to have converged, so I will treat it as converged for now.
However, `n_eff`, roughly the number of effectively independent samples, is below 100 for some parameters, which is a little questionable (especially for interval estimates), but I will let it pass this time.
 | mean | se_mean | sd | 2.5% | 25% | 50% | 75% | 97.5% | n_eff | Rhat |
---|---|---|---|---|---|---|---|---|---|---|
lp__ | -214.114891 | 1.542823 | 13.712915 | -237.604900 | -223.960556 | -215.747425 | -205.033248 | -183.817981 | 79.0 | 1.049519 |
hp_sigma | 0.094069 | 0.005835 | 0.063921 | 0.011408 | 0.044970 | 0.081836 | 0.132181 | 0.245524 | 120.0 | 1.031258 |
atk[12] | 0.048430 | 0.009507 | 0.158794 | -0.318161 | -0.052664 | 0.052359 | 0.155878 | 0.342186 | 279.0 | 1.018430 |
atk[16] | 0.225840 | 0.008570 | 0.158951 | -0.075769 | 0.115129 | 0.223274 | 0.330571 | 0.531899 | 344.0 | 1.013612 |
sigma | 0.220279 | 0.002456 | 0.056227 | 0.112035 | 0.182132 | 0.217550 | 0.255500 | 0.344382 | 524.0 | 1.011379 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
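For what it's worth, the same convergence check can be done programmatically with the helper defined earlier:

```python
summary_df = sampling_summary_to_df(fitted)

# rule of thumb from the book: Rhat < 1.1 for every parameter
print("parameters with Rhat >= 1.1:", (summary_df["Rhat"] >= 1.1).sum())

# n_eff is a rough count of effectively independent samples per parameter
print("parameters with n_eff < 100:", (summary_df["n_eff"] < 100).sum())
```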
Collect parameters by team.
tk = {}
params = fitted.extract()
atks = params['atk']
defs = params['def']
home_power = params['home_power']
games = params['games']

# posterior mean of each parameter, per team
for k, team in enumerate(team_list):
    tk[team] = {
        "atk": np.mean(atks[:, k]),
        "def": np.mean(defs[:, k]),
        "home_power": np.mean(home_power[:, k]),
    }

tdf = pd.DataFrame(tk).T
For the time being, I will arrange them in descending order of "home power".
tdf.sort_values("home_power", ascending=False)
 | atk | def | home_power |
---|---|---|---|
Kashima | 0.225840 | 0.217699 | 0.068898 |
Urawa | 0.152638 | 0.063192 | 0.041830 |
Kobe | 0.099154 | -0.163383 | 0.030443 |
Hiroshima | 0.293808 | -0.000985 | 0.029861 |
Nagoya | 0.130757 | -0.247969 | 0.015849 |
Kawasaki F | 0.312593 | 0.087859 | 0.014728 |
Niigata | 0.010827 | -0.146252 | 0.011885 |
Iwata | 0.048430 | -0.105230 | 0.011843 |
Sendai | 0.037960 | -0.150151 | 0.005462 |
FC Tokyo | -0.072115 | 0.017485 | -0.005050 |
Gamba Osaka | 0.087034 | -0.033666 | -0.005532 |
Tosu | -0.232595 | 0.103412 | -0.006186 |
Kashiwa | 0.036172 | -0.053332 | -0.009568 |
Omiya | -0.040014 | 0.027842 | -0.020942 |
Yokohama FM | 0.059420 | -0.009322 | -0.021393 |
Fukuoka | -0.185292 | -0.130759 | -0.026643 |
Kofu | 0.002608 | -0.268061 | -0.037079 |
Shonan | -0.000052 | -0.184515 | -0.049440 |
So this is how it turned out. The average strength is centered at 0 in this model, so being above or below 0 means being better or worse than the J1 average.
By the way, running the same thing on the 2nd-stage data gives quite different results (Kashima comes out much worse). In the end this is simply what the 1st-stage data says, so I would not read too much into it.
If you think about it this is only natural, but the result is close to the 2016 1st-stage standings: https://ja.wikipedia.org/wiki/2016%E5%B9%B4%E3%81%AEJ1%E3%83%AA%E3%83%BC%E3%82%B0#.E9.A0.86.E4.BD.8D.E8.A1.A8 In particular, total goals scored / conceded and the estimated attack / defense power seem broadly correlated (it would be a problem if they were not).
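That kind of rough check can be done directly from the data we already have (a sketch reusing `df` and `tdf` from above):

```python
# compare estimated abilities with the goals each team actually scored / conceded
goals_for = (df.groupby("home_team")["home_score"].sum()
             + df.groupby("away_team")["away_score"].sum())
goals_against = (df.groupby("home_team")["away_score"].sum()
                 + df.groupby("away_team")["home_score"].sum())

check = tdf.copy()
check["goals_for"] = goals_for
check["goals_against"] = goals_against

# atk should correlate positively with goals scored,
# def should correlate negatively with goals conceded
print(check["atk"].corr(check["goals_for"]))
print(check["def"].corr(check["goals_against"]))
```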
I also tried simulating what would have happened if I had bet on 2nd-stage Toto with this prediction model, but the results looked so bad that I stopped (laughs). It would probably do better when many matches are crammed into a short period; predicting round 17 from the data up to round 16 might work a little better.
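For reference, here is roughly how match-outcome probabilities could be read off the `games` array sampled in `generated quantities` (a sketch; `teams`, `games`, and `np` are the objects defined above, and the team names are only illustrative — they have to match the strings in the CSV):

```python
def outcome_probabilities(home_team, away_team):
    """Estimate P(home win), P(draw), P(away win) from the posterior samples."""
    th, ta = teams[home_team] - 1, teams[away_team] - 1  # 0-based indices into games
    home_goals = games[:, th, ta, 0]
    away_goals = games[:, th, ta, 1]
    return {
        "home_win": float(np.mean(home_goals > away_goals)),
        "draw": float(np.mean(home_goals == away_goals)),
        "away_win": float(np.mean(home_goals < away_goals)),
    }

# e.g. Kashima at home against Urawa (illustrative team names)
print(outcome_probabilities("Kashima", "Urawa"))
```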
It's amazing that this kind of modeling has become easy, and it's a lot of fun.