**This is the Day 18 article of the NTT DoCoMo Service Innovation Department Advent Calendar.**
Hello! This is Osugi from NTT DoCoMo.
I spent my school days playing soccer and futsal, and now I'm doing marketing-related data analysis work.
Today, I would like to introduce socceraction, a Python package for soccer analytics, while analyzing match data from the 2018 FIFA World Cup in Russia.
socceraction is featured in **"Actions Speak Louder than Goals: Valuing Player Actions in Soccer"** [^1], which won the Best Paper Award in the KDD 2019 Applied Data Science Track.
[^1]: Decroos, Tom, et al. "Actions speak louder than goals: Valuing player actions in soccer." Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2019.
This paper proposes a new metric for valuing the actions of soccer players during a match. Specifically, it covers the following.
- Definition of **SPADL (Soccer Player Action Description Language)**, which represents the actions of players during a match
- Definition of **VAEP (Valuing Actions by Estimating Probabilities)**, a framework for valuing players
- Prediction of the probabilities of scoring and conceding
- Results and discussion from an analysis of data from the major European leagues, 2012/2013 to 2017/2018
socceraction makes it easy to try out the analytical methods described in the paper. You can also run the whole series of analyses by following the public-notebooks published on GitHub. This time, I would like to explore the data based on these public-notebooks.
The code used in this article is posted at the end of this article, so I hope everyone will give it a try.
You can install socceraction with pip as follows.
pip install socceraction
The paper uses Wyscout data, but the socceraction public-notebooks obtain their data from StatsBomb. socceraction can also convert Opta data to SPADL.
This time, I also use the StatsBomb data.
[Image source]: https://github.com/statsbomb/open-data/blob/master/README.md
StatsBomb has released open data [^3]. As of December 17, 2019, the released competitions include a particularly rich set of La Liga (Spain) data. This time, we will use the data from the 2018 World Cup in Russia, in which the Japanese national team also took part.
As in 1-download-statsbomb-data.ipynb, you can obtain the data by downloading and extracting the zip file. This time, so that the conversion to SPADL and the VAEP calculation work correctly, I obtained the data by following that notebook.
Incidentally, another way to get the StatsBomb data is the Python package statsbomb.
The code below fetches StatsBomb data with it.
# https://pypi.org/project/statsbomb/
import statsbomb as sb
# Competitions
comps = sb.Competitions()
comps_df = comps.get_dataframe() #Tournament list
# Matches (FIFA World Cup: competition_id (event_id) = 43, season_id = 3)
matches = sb.Matches(event_id='43', season_id='3')
matches_df = matches.get_dataframe() #Match list
# Events(Japan VS Belgium : event_id = '7584')
# Details of event_type are at the link below
# https://github.com/imrankhan17/statsbomb-parser/blob/master/statsbomb/events.yaml
events = sb.Events(event_id='7584')
events.get_dataframe(event_type='substitution') #Data at the time of player change
The format of SPADL (Soccer Player Action Description Language) is as follows.
- **StartTime**: Start time of the action
- **EndTime**: End time of the action
- **StartLoc**: Location at the start of the action
- **EndLoc**: Location at the end of the action
- **Player**: The player who performed the action
- **Team**: The team the player belongs to
- **ActionType**: Type of action (21 types such as pass / shot / dribble / interception / throw-in)
- **BodyPart**: Body part the player used for the action
- **Result**: Result of the action (success or failure)
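To make the nine attributes concrete, here is a minimal sketch of a single SPADL action as a Python dataclass. The class name, field names and example values are assumptions made for illustration; socceraction itself stores actions as DataFrame rows (the code at the end of this article uses columns such as `start_x`, `end_x`, `type_name` and `result_name`).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpadlAction:
    """One on-the-ball action with the nine SPADL attributes (hypothetical layout)."""
    start_time: float               # seconds since the start of the period
    end_time: float
    start_loc: Tuple[float, float]  # (x, y) position on the pitch
    end_loc: Tuple[float, float]
    player: str
    team: str
    action_type: str                # e.g. "pass", "dribble", "shot"
    body_part: str                  # e.g. "foot", "head"
    result: str                     # "success" or "fail"


# Invented values, purely for illustration.
example = SpadlAction(
    start_time=10.0, end_time=12.5,
    start_loc=(30.0, 40.0), end_loc=(55.0, 35.0),
    player="Player A", team="Team X",
    action_type="pass", body_part="foot", result="success",
)
print(example.action_type, example.result, example.end_time - example.start_time)
```

Because every data provider is mapped onto the same small set of fields, the later steps (features, labels, VAEP) do not need to know where the data came from.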
The reference notebook contains the code that converts the StatsBomb data to the SPADL format above, so you can easily do the conversion by following it. The procedure has two steps.
1. Convert the StatsBomb JSON to HDF5: `socceraction.spadl.api.statsbombjson_to_statsbombh5(statsbomb_json, statsbomb_h5)`
2. Convert the StatsBomb HDF5 to SPADL: `socceraction.spadl.api.statsbombh5_to_spadlh5(statsbomb_h5, spadl_h5)`
You can also plot SPADL data with `matplotsoccer.actions` from the Python package **matplotsoccer**.
Here, I plotted one scene [^9] from the match between Japan and Belgium, which Japanese soccer fans will remember vividly. The figure and table show who made which play and when.
[^9]: The scene of Belgium's third goal.
Next, we create features from SPADL and estimate the probabilities of scoring and conceding. This time, I built the prediction model with the previous play included in the features, so each feature comes in two versions: one for the current play and one for the previous play.
The features and the target variables can be derived with `socceraction.classification.features` and `socceraction.classification.labels`.
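The idea of pairing each play with the previous one can be sketched with plain pandas. This is a simplified illustration, not socceraction's implementation: the toy DataFrame is invented, and only the `_a0` / `_a1` suffix convention is taken from the feature names used in the code at the end of this article (for example `end_dist_to_goal_a0`). Reusing the first action as its own predecessor is an assumption made here to avoid missing values.

```python
import pandas as pd

# Hypothetical mini action log; the values are illustrative only.
actions = pd.DataFrame({
    "type_name": ["pass", "dribble", "shot"],
    "end_x": [60.0, 85.0, 105.0],
})

# a0 = the current action, a1 = the action one step earlier.
current = actions.add_suffix("_a0")

# shift(1) moves every row down by one; the first row then has no predecessor,
# so bfill() copies the first action into that gap.
previous = actions.shift(1).bfill().add_suffix("_a1")

features = pd.concat([current, previous], axis=1)
print(features)
```

Each row now describes a short "game state" of two consecutive actions, which is what the model sees when it estimates the scoring and conceding probabilities.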
I trained prediction models with xgboost on these features and checked their accuracy. The results are as follows.
| | Scores | Concedes |
|---|---|---|
| brier_score_loss | 0.0092 | 0.0025 |
| AUC | 0.8512 | 0.8865 |
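For readers unfamiliar with the Brier score, here is a small sketch of what it measures: the mean squared difference between the predicted probability and the 0/1 outcome, so lower is better. The labels and probabilities below are invented. Because goals are rare, even a model that always predicts the base rate gets a small Brier score, so it helps to compare against that baseline.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels (1 = a goal followed) and predicted probabilities.
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.0, 0.2, 0.7]

score = brier_score(y_true, y_prob)

# Baseline: always predict the observed base rate of the positive class.
base_rate = sum(y_true) / len(y_true)
baseline = brier_score(y_true, [base_rate] * len(y_true))

print(f"model: {score:.4f}  baseline: {baseline:.4f}")
```

The same function applied to the real labels and predictions should agree with `sklearn.metrics.brier_score_loss`, which is what the code at the end of this article uses.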
We also used SHAP [^5] [^6] to see how the features contribute to the prediction model. Let's look at the features that contribute to the scoring probability with summary_plot.
[^5]: Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." Advances in Neural Information Processing Systems. 2017.
[^6]: Lundberg, Scott M., et al. "Explainable AI for Trees: From Local Explanations to Global Understanding." arXiv preprint arXiv:1905.04610 (2019).
SHAP lets you see visually how each feature affects the target variable.
If a particular variable interests you, you can also look at it more closely. The dependence_plot below shows how the distance to the goal contributes to the scoring probability. The horizontal axis is the distance from the goal at the end of the action and the vertical axis is the SHAP value, so you can see that the shorter the distance after the action, the higher the scoring probability.
VAEP (Valuing Actions by Estimating Probabilities) is calculated from the scoring and conceding probabilities produced by the prediction models above. The VAEP of action $a_i$ by team $x$ is calculated as follows.

$$
V(a_i,x) = \Delta P_{scores}(a_i,x) + (- \Delta P_{concedes}(a_i,x))
$$

Here, $\Delta P_{scores}(a_i,x)$ is the increase in the probability of scoring caused by the action, and $\Delta P_{concedes}(a_i,x)$ is the increase in the probability of conceding caused by the action.
In other words, VAEP will be higher for actions that (1) increase the probability of scoring and (2) decrease the probability of conceding.
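As a worked example of the formula, the sketch below computes the value of a single action from the probabilities before and after it. The function name and the numbers are hypothetical; it deliberately ignores details such as changes of possession between consecutive actions, which the real implementation has to handle.

```python
def vaep_value(p_scores_before, p_scores_after,
               p_concedes_before, p_concedes_after):
    """V = delta P_scores + (-delta P_concedes) for one action."""
    delta_scores = p_scores_after - p_scores_before
    delta_concedes = p_concedes_after - p_concedes_before
    return delta_scores - delta_concedes


# Hypothetical pass into the box: scoring chance rises from 2% to 10%,
# conceding chance falls from 1% to 0.5%.
good_pass = vaep_value(0.02, 0.10, 0.01, 0.005)

# Hypothetical misplaced back pass: scoring chance falls from 5% to 1%,
# conceding chance rises from 1% to 4%.
bad_pass = vaep_value(0.05, 0.01, 0.01, 0.04)

print(f"good pass: {good_pass:+.3f}, bad pass: {bad_pass:+.3f}")
```

The good pass is worth about +0.085 and the bad pass about -0.07, so the sign of the value tells you whether the action helped or hurt the team.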
Let's actually calculate VAEP with socceraction.
This can be done with `socceraction.vaep.value()`.
Sorting the players by total VAEP in descending order, Mbappe of the winning French national team had the highest total VAEP.
The total VAEP above does not take playing time into account. So, as in the paper, we normalize by playing time and calculate VAEP per 90 minutes. As a condition, we keep only the players who played more than 180 minutes.
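The normalization is simple arithmetic, sketched below on an invented table. The player names and numbers are hypothetical; the `> 180` filter and the `vaep_value * 90 / minutes_played` formula mirror the code at the end of this article.

```python
import pandas as pd

# Hypothetical totals for three players.
stats = pd.DataFrame({
    "player": ["Starter", "Super-sub", "Cameo"],
    "vaep_value": [3.0, 2.0, 0.5],
    "minutes_played": [540, 200, 30],
})

# Drop players with too little playing time, then scale to 90 minutes.
stats = stats[stats.minutes_played > 180].copy()
stats["vaep_rating"] = stats.vaep_value * 90 / stats.minutes_played

ranking = stats.sort_values("vaep_rating", ascending=False)
print(ranking)
```

In this toy example the substitute overtakes the regular starter once playing time is taken into account, and the player with only 30 minutes is excluded so that a short lucky spell cannot top the ranking.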
Looking at VAEP per 90 minutes, Russia's Denis Cheryshev came in first. Cheryshev scored 4 goals in 5 games, but because he often came on as a substitute or was substituted off, his playing time was short, which seems to be why he rises to first place when VAEP is normalized per 90 minutes. It was also interesting that Germany's Toni Kroos ranks highly even though Germany was eliminated in the group stage of this tournament.
In addition, calculating the average VAEP per play may make it possible to pick out players who contribute a lot even though they are involved in few plays.
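A minimal sketch of that per-play average, assuming a table with one row per action and a `vaep_value` column like the one built in the code at the end of this article. The two players and their values are invented.

```python
import pandas as pd

# Hypothetical action-level values: player P has three actions, player Q has one.
A = pd.DataFrame({
    "player": ["P", "P", "P", "Q"],
    "vaep_value": [0.1, 0.0, 0.2, 0.25],
})

# Total, number of actions, and average value per action for each player.
per_action = (
    A.groupby("player")["vaep_value"]
    .agg(["sum", "count", "mean"])
    .sort_values("mean", ascending=False)
)
print(per_action)
```

Player Q ranks first on the average despite having the smaller total. With so few actions the average is noisy, so in practice a minimum number of actions would probably be needed, just like the minutes filter above.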
Let's add the calculated VAEP to the matplotsoccer plot introduced earlier. This makes it possible to quantify and evaluate each player's actions in the scene. Looking at the figure, apart from the shot and the assist, De Bruyne's pass is the most highly rated.
This time, I introduced socceraction using real data from the 2018 FIFA World Cup in Russia. Frankly, I found it very convenient to be able to convert multiple data sources such as StatsBomb and Wyscout to SPADL. The StatsBomb data is also very detailed, and I felt it could be used for many kinds of analysis. (I am grateful that it is available for free ...) The socceraction public-notebooks on GitHub process the data in HDF5 format, which I do not usually use, so that was a learning experience too. Above all, I found it very interesting to analyze real match data in this way! If you are interested in sports and soccer analytics like me, please give socceraction a try.
# ----
#Referenced public-notebook MIT license
# (c) 2019 KU Leuven Machine Learning Research Group
# Released under the MIT license.
# see https://github.com/ML-KULeuven/socceraction/blob/master/LICENSE
# ----
# package
%load_ext autoreload
%autoreload 2
import os; import sys;
import tqdm
import requests
import math
import zipfile
import warnings
import pandas as pd
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
import socceraction.spadl.api as spadl
import matplotsoccer
import matplotlib
#Folder name/Specifying the file name
datafolder = "hogehoge" #Specify the folder name
statsbombzip = os.path.join(datafolder, "open-data-master.zip")
statsbombroot = os.path.join(datafolder, "statsbomb-root")
statsbombdata = os.path.join(datafolder, "statsbomb-root", "open-data-master", "data")
#Extract zip file
with zipfile.ZipFile(statsbombzip, 'r') as zipObj:
    zipObj.extractall(statsbombroot)
# Convert StatsBomb data (JSON) to SPADL (HDF5)
## StatsBomb(Raw Data) : json -> StatsBomb(Raw Data) : h5
statsbomb_json = os.path.join(datafolder,"statsbomb-root","open-data-master","data")
statsbomb_h5 = os.path.join(datafolder,"statsbomb.h5")
spadl_h5 = os.path.join(datafolder,"spadl-statsbomb.h5")
spadl.statsbombjson_to_statsbombh5(statsbomb_json,statsbomb_h5)
tablenames = ["matches","players","teams","competitions"]
tables = {name : pd.read_hdf(statsbomb_h5,key=name) for name in tablenames}
match_id = tables["matches"].match_id[0]
tables["events"] = pd.read_hdf(statsbomb_h5,f"events/match_{match_id}")
for k,df in tables.items():
    print("#",k)
    print(df.columns,"\n")
## StatsBomb(Raw Data) : h5 -> SPADL : h5
spadl.statsbombh5_to_spadlh5(statsbomb_h5,spadl_h5)
tablenames = ["games","players","teams","competitions","actiontypes","bodyparts","results"]
tables = {name : pd.read_hdf(spadl_h5,key=name) for name in tablenames}
game_id = tables["games"].game_id[0]
tables["actions"] = pd.read_hdf(spadl_h5,f"actions/game_{game_id}")
for k,df in tables.items():
    print("#",k)
    print(df.columns,"\n")
# FIFA World Cup: visualize the match between Japan and Belgium
## Extract the game_id
tablenames = ["games","players","teams","competitions","actiontypes","bodyparts","results"]
tables = {name: pd.read_hdf(spadl_h5, key=name) for name in tablenames}
games = tables["games"].merge(tables["competitions"])
game_id = games[(games.competition_name == "FIFA World Cup")
                & (games.away_team_name == "Japan")
                & (games.home_team_name == "Belgium")].game_id.values[0]
game_id # 7584
## Extract the indices of the actions that scored a goal
actions = pd.read_hdf(spadl_h5, f"actions/game_{game_id}")
actions = (
    actions.merge(tables["actiontypes"])
    .merge(tables["results"])
    .merge(tables["bodyparts"])
    .merge(tables["players"],"left",on="player_id")
    .merge(tables["teams"],"left",on="team_id")
    .sort_values(["period_id", "time_seconds", "timestamp"])
    .reset_index(drop=True))
actions["player"] = actions[["player_nickname",
                             "player_name"]].apply(lambda x: x[0] if x[0] else x[1],axis=1)
list(actions[(actions.type_name=='shot')&(actions.result_name=='success')].index)
# [1215, 1334, 1658, 1742, 2153]
## Belgium's third goal
shot = 2153
a = actions[shot-8:shot+1]
games = tables["games"]
g = list(games[games.game_id == a.game_id.values[0]].itertuples())[0]
minute = int((a.period_id.values[0]-1)*45 +a.time_seconds.values[0] // 60) + 1
game_info = f"{g.match_date} {g.home_team_name} {g.home_score}-{g.away_score} {g.away_team_name} {minute}'"
print(game_info)
labels = a[["time_seconds", "type_name", "player", "team_name"]]
matplotsoccer.actions(
    location=a[["start_x", "start_y", "end_x", "end_y"]],
    action_type=a.type_name,
    team=a.team_name,
    result=a.result_name == "success",
    label=labels,
    labeltitle=["time","actiontype","player","team"],
    zoom=False,
    figsize=6)
# ----
#Referenced public-notebook MIT license
# (c) 2019 KU Leuven Machine Learning Research Group
# Released under the MIT license.
# see https://github.com/ML-KULeuven/socceraction/blob/master/LICENSE
# ----
# package
%load_ext autoreload
%autoreload 2
import os; import sys; sys.path.insert(0,'hogehoge')#Folder name
import pandas as pd
import tqdm
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
import socceraction.classification.features as fs
import socceraction.classification.labels as lab
import xgboost
from sklearn.metrics import roc_auc_score,brier_score_loss
import shap
shap.initjs()
#File and folder name definitions
datafolder = "hogehoge" #Specify the folder name
spadl_h5 = os.path.join(datafolder,"spadl-statsbomb.h5")
features_h5 = os.path.join(datafolder,"features.h5")
labels_h5 = os.path.join(datafolder,"labels.h5")
predictions_h5 = os.path.join(datafolder,"predictions.h5")
#Data reading
games = pd.read_hdf(spadl_h5,"games")
games = games[games.competition_name == "FIFA World Cup"]
print("nb of games:", len(games))
actiontypes = pd.read_hdf(spadl_h5, "actiontypes")
bodyparts = pd.read_hdf(spadl_h5, "bodyparts")
results = pd.read_hdf(spadl_h5, "results")
#Creating a label
yfns = [lab.scores,lab.concedes,lab.goal_from_shot]
for game in tqdm.tqdm(list(games.itertuples()),
                      desc=f"Computing and storing labels in {labels_h5}"):
    actions = pd.read_hdf(spadl_h5,f"actions/game_{game.game_id}")
    actions = (
        actions.merge(actiontypes,how="left")
        .merge(results,how="left")
        .merge(bodyparts,how="left")
        .sort_values(["period_id", "time_seconds", "timestamp",'action_id'])
        .reset_index(drop=True))
    Y = pd.concat([fn(actions) for fn in yfns],axis=1)
    Y.to_hdf(labels_h5,f"game_{game.game_id}")
#Creation of features
xfns = [fs.actiontype,
fs.actiontype_onehot,
fs.bodypart,
fs.bodypart_onehot,
fs.result,
fs.result_onehot,
fs.goalscore,
fs.startlocation,
fs.endlocation,
fs.movement,
fs.space_delta,
fs.startpolar,
fs.endpolar,
fs.team,
fs.time,
fs.time_delta]
for game in tqdm.tqdm(list(games.itertuples()),
                      desc=f"Generating and storing features in {features_h5}"):
    actions = pd.read_hdf(spadl_h5,f"actions/game_{game.game_id}")
    actions = (
        actions.merge(actiontypes,how="left")
        .merge(results,how="left")
        .merge(bodyparts,how="left")
        .sort_values(["period_id", "time_seconds", "timestamp",'action_id'])
        .reset_index(drop=True))
    # Game state = the current action plus the one before it (2 actions in total)
    gamestates = fs.gamestates(actions,2)
    gamestates = fs.play_left_to_right(gamestates,game.home_team_id)
    X = pd.concat([fn(gamestates) for fn in xfns],axis=1)
    X.to_hdf(features_h5,f"game_{game.game_id}")
xfns = [fs.actiontype_onehot,
fs.bodypart_onehot,
fs.result,
fs.goalscore,
fs.startlocation,
fs.endlocation,
fs.movement,
fs.space_delta,
fs.startpolar,
fs.endpolar,
fs.team,
fs.time_delta]
nb_prev_actions = 2
Xcols = fs.feature_column_names(xfns,nb_prev_actions)
X = []
for game_id in tqdm.tqdm(games.game_id,desc="selecting features"):
    Xi = pd.read_hdf(features_h5,f"game_{game_id}")
    X.append(Xi[Xcols])
X = pd.concat(X)
Ycols = ["scores","concedes"]
Y = []
for game_id in tqdm.tqdm(games.game_id,desc="selecting label"):
    Yi = pd.read_hdf(labels_h5,f"game_{game_id}")
    Y.append(Yi[Ycols])
Y = pd.concat(Y)
print("X:", list(X.columns))
print("Y:", list(Y.columns))
#Prediction model construction by xgboost
%%time
# scores
model_scores = xgboost.XGBClassifier()
model_scores.fit(X,Y['scores'])
# concedes
model_concedes = xgboost.XGBClassifier()
model_concedes.fit(X,Y['concedes'])
Y_hat = pd.DataFrame()
Y_hat['scores'] = model_scores.predict_proba(X)[:,1]
Y_hat['concedes'] = model_concedes.predict_proba(X)[:,1]
#Prediction accuracy
print(f"scores_brier : \t\t{brier_score_loss(Y['scores'],Y_hat['scores']).round(4)}")
print(f"concedes_brier : \t{brier_score_loss(Y['concedes'],Y_hat['concedes']).round(4)}")
print(f"scores_auc : \t\t{roc_auc_score(Y['scores'],Y_hat['scores']).round(4)}")
print(f"concedes_auc : \t{roc_auc_score(Y['concedes'],Y_hat['concedes']).round(4)}")
#Identification of predictive factors using SHAP(scores)
explainer_scores = shap.TreeExplainer(model_scores)
shap_scores = explainer_scores.shap_values(X)
## summary_plot
shap.summary_plot(shap_scores,features=X,feature_names=X.columns)
## dependence_plot
shap.dependence_plot('end_dist_to_goal_a0',
shap_scores,
features=X,
feature_names=X.columns,
interaction_index='end_dist_to_goal_a0')
#Saving prediction results
A = []
for game_id in tqdm.tqdm(games.game_id,"loading game ids"):
    Ai = pd.read_hdf(spadl_h5,f"actions/game_{game_id}")
    A.append(Ai[["game_id"]])
A = pd.concat(A)
A = A.reset_index(drop=True)
grouped_predictions = pd.concat([A,Y_hat],axis=1).groupby("game_id")
for k,df in tqdm.tqdm(grouped_predictions,desc="saving predictions per game"):
    df = df.reset_index(drop=True)
    df[Y_hat.columns].to_hdf(predictions_h5,f"game_{int(k)}")
# ----
#Referenced public-notebook MIT license
# (c) 2019 KU Leuven Machine Learning Research Group
# Released under the MIT license.
# see https://github.com/ML-KULeuven/socceraction/blob/master/LICENSE
# ----
# package
%load_ext autoreload
%autoreload 2
import os; import sys; sys.path.insert(0,'hogehoge') #Folder name
import pandas as pd
import tqdm
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
import socceraction.vaep as vaep
import matplotsoccer
import matplotlib
#File and folder name definitions
datafolder = "hogehoge" #Folder name
spadl_h5 = os.path.join(datafolder,"spadl-statsbomb.h5")
predictions_h5 = os.path.join(datafolder,"predictions.h5")
#Data acquisition
games = pd.read_hdf(spadl_h5,"games")
games = games[games.competition_name == "FIFA World Cup"]
print("nb of games:", len(games))
players = pd.read_hdf(spadl_h5,"players")
teams = pd.read_hdf(spadl_h5,"teams")
actiontypes = pd.read_hdf(spadl_h5, "actiontypes")
bodyparts = pd.read_hdf(spadl_h5, "bodyparts")
results = pd.read_hdf(spadl_h5, "results")
#Calculation of VAEP
A = []
for game in tqdm.tqdm(list(games.itertuples())):
    actions = pd.read_hdf(spadl_h5,f"actions/game_{game.game_id}")
    actions = (
        actions.merge(actiontypes)
        .merge(results)
        .merge(bodyparts)
        .merge(players,"left",on="player_id")
        .merge(teams,"left",on="team_id")
        .sort_values(["period_id", "time_seconds", "timestamp"])
        .reset_index(drop=True)
    )
    preds = pd.read_hdf(predictions_h5,f"game_{game.game_id}")
    values = vaep.value(actions,preds.scores,preds.concedes)
    A.append(pd.concat([actions,preds,values],axis=1))
A = pd.concat(A).sort_values(["game_id","period_id", "time_seconds", "timestamp"]).reset_index(drop=True)
A.columns
A["player"] = A[["player_nickname",
                 "player_name"]].apply(lambda x: x[0] if x[0] else x[1],axis=1)
#Calculate the total VAEP of each player and check in descending order
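# Group by player_id as well, so that the grouping keys are unique
# ('player' was listed twice, which breaks reset_index)
summary = A.groupby(['player_id',
                     'team_name',
                     'player'])[['offensive_value',
                                 'defensive_value',
                                 'vaep_value']].sum().reset_index()
summary.sort_values('vaep_value',ascending = False).head(10)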
#Calculate the average VAEP per 90 minutes and check in descending order
A["count"] = 1  # one row per action, so the summed "count" is the number of actions
players = A[["player_id",
             "team_name",
             "player",
             "vaep_value",
             "count"]].groupby(["player_id",
                                "team_name",
                                "player"]).sum().reset_index()
players = players.sort_values("vaep_value",ascending=False)
pg = pd.read_hdf(spadl_h5,"player_games")
pg = pg[pg.game_id.isin(games.game_id)]
mp = pg[["player_id","minutes_played"]].groupby("player_id").sum().reset_index()
stats = players.merge(mp)
stats = stats[stats.minutes_played > 180]
stats["vaep_rating"] = stats.vaep_value * 90 / stats.minutes_played
stats.sort_values("vaep_rating",ascending=False).head(10)
#Visualization by matplotsoccer
##Extraction of scenes with goals
shot_goal_index = A[(A.game_id == 7584)&A.type_name.str.contains("shot")&(A.result_name=='success')]
## Visualize Belgium's third goal
def get_time(period_id,time_seconds):
    # Convert period and seconds into a "<minutes>m<seconds>s" match clock string
    m = int((period_id-1)*45 + time_seconds // 60)
    s = time_seconds % 60
    if s == int(s):
        s = int(s)
    return f"{m}m{s}s"
###Extraction of scenes
a = A.iloc[shot_goal_index.index[4]-6:shot_goal_index.index[4]+1,:].sort_values('action_id')
a["player"] = a[["player_nickname",
"player_name"]].apply(lambda x: x[0] if x[0] else x[1],axis=1)
###Match information
g = list(games[games.game_id == a.game_id.values[0]].itertuples())[0]
game_info = f"{g.match_date} {g.home_team_name} {g.home_score}-{g.away_score} {g.away_team_name}"
minute = get_time(int(a[a.index == a.index[-1]].period_id),int(a[a.index == a.index[-1]].time_seconds))
print(f"{game_info} {minute}' {a[a.index == a.index[-1]].type_name.values[0]} {a[a.index == a.index[-1]].player_name.values[0]}")
###Data shaping
a["scores"] = a.scores.apply(lambda x : "%.3f" % x )
a["vaep_value"] = a.vaep_value.apply(lambda x : "%.3f" % x )
a["time"] = a[["period_id","time_seconds"]].apply(lambda x: get_time(*x),axis=1)
cols = ["time","type_name","player","team_name","scores","vaep_value"]
###plot
matplotsoccer.actions(a[["start_x","start_y","end_x","end_y"]],
                      a.type_name,
                      team=a.team_name,
                      result=a.result_name == "success",
                      label=a[cols],
                      labeltitle=cols,
                      zoom=False)