Tabular Classification
English
xgboost
sports-analytics
soccer
football
vaep
action-valuation
statsbomb
wyscout
silly-kicks

VAEP Model: StatsBomb + Wyscout

Two XGBClassifier models that estimate P(scoring) and P(conceding) within the next 10 actions, enabling per-action valuation of every on-ball event in a soccer match. Trained on ~2,388 matches from StatsBomb Open Data and Wyscout via Hugging Face Jobs (CPU).

Part of the (Right! Luxury!) Lakehouse soccer analytics platform.

Model Description

VAEP (Valuing Actions by Estimating Probabilities) scores each on-ball action by its impact on the probability of scoring and conceding within the next 10 actions, as described in:

Decroos, T., Bransen, L., Van Haaren, J., & Davis, J. (2019). Actions Speak Louder than Goals: Valuing Player Actions in Soccer. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

This model implements the VAEP framework using two independent XGBClassifier models:

  • P(scores): Probability that the team in possession scores within the next 10 actions
  • P(concedes): Probability that the team in possession concedes within the next 10 actions

The net VAEP value of an action is the change in scoring probability minus the change in conceding probability before and after that action:

VAEP(action_i) = [P_scores(S_i) - P_scores(S_{i-1})] + [P_concedes(S_{i-1}) - P_concedes(S_i)]

where S_i is the game state after action i.
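As a small worked example of the formula above, the sketch below applies it to a short list of probabilities. The function name `vaep_values` and the numbers are hypothetical and chosen only for illustration; they are not model outputs.

```python
def vaep_values(p_scores, p_concedes):
    """Apply the VAEP formula to consecutive game states S_{i-1} -> S_i."""
    values = []
    for i in range(1, len(p_scores)):
        offensive = p_scores[i] - p_scores[i - 1]      # gain in P(scores)
        defensive = p_concedes[i - 1] - p_concedes[i]  # drop in P(concedes)
        values.append(offensive + defensive)
    return values


# Hypothetical probabilities for three consecutive game states
p_scores = [0.02, 0.05, 0.04]
p_concedes = [0.01, 0.01, 0.03]

values = vaep_values(p_scores, p_concedes)
print(values)  # one value per action after the first state
```

In this toy sequence the second action raises the scoring probability without changing the conceding probability, so its value is positive; the third action lowers the first and raises the second, so its value is negative.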

Architecture

Both models are XGBClassifier instances with identical hyperparameters:

| Parameter | Value |
| --- | --- |
| n_estimators | 100 |
| max_depth | 3 |
| learning_rate | 0.1 |
| objective | binary:logistic |
| eval_metric | logloss |
| random_state | 42 |
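A minimal sketch of how these hyperparameters could be passed to XGBoost. The names `VAEP_XGB_PARAMS` and `build_vaep_models` are hypothetical, and passing `eval_metric` to the constructor assumes a reasonably recent XGBoost release.

```python
# Hyperparameters copied from the table above
VAEP_XGB_PARAMS = {
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.1,
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "random_state": 42,
}


def build_vaep_models():
    """Return two untrained classifiers: (P(scores) model, P(concedes) model)."""
    # Imported lazily so the parameter dict can be inspected without xgboost installed
    from xgboost import XGBClassifier

    return XGBClassifier(**VAEP_XGB_PARAMS), XGBClassifier(**VAEP_XGB_PARAMS)
```

Keeping the parameters in one dictionary makes it harder for the two models to drift apart when a value is changed.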

Feature Extraction

Features are extracted using the silly-kicks library with 11 feature functions applied to game states composed of the current action and the previous NB_PREV_ACTIONS = 3 actions:

| Feature Function | Description |
| --- | --- |
| actiontype_onehot | One-hot encoding of the 23 SPADL action types |
| result_onehot | One-hot encoding of action outcomes (success, fail, etc.) |
| bodypart_onehot | One-hot encoding of body part (foot, head, other) |
| time | Period and time within the period |
| startlocation | Start x, y coordinates (SPADL 105×68 m pitch) |
| endlocation | End x, y coordinates |
| startpolar | Start location in polar coordinates (distance + angle to goal) |
| endpolar | End location in polar coordinates |
| movement | Displacement between start and end locations |
| team | Whether the team changed between consecutive actions |
| time_delta | Time elapsed between consecutive actions |

With NB_PREV_ACTIONS = 3, each feature function generates columns for the current action plus the 3 preceding actions, creating a rich game state representation.
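The windowing idea can be sketched in plain Python. This is an illustration only, not the silly-kicks implementation: `build_gamestates` is a hypothetical helper, and padding the start of the sequence with the first action is an assumption made for this example.

```python
NB_PREV_ACTIONS = 3


def build_gamestates(actions, nb_prev_actions=NB_PREV_ACTIONS):
    """For each action, collect it plus the previous `nb_prev_actions` actions.

    Assumption: slots before the first action are padded with the first action.
    """
    states = []
    for i in range(len(actions)):
        # k = 0 is the current action, k = 1..nb_prev_actions are the predecessors
        window = [actions[max(i - k, 0)] for k in range(nb_prev_actions + 1)]
        states.append(window)
    return states


# Toy sequence of action types standing in for full SPADL rows
actions = ["pass", "dribble", "cross", "shot", "clearance"]
states = build_gamestates(actions)
print(states[3])
```

Each game state has `nb_prev_actions + 1` slots, which is why every feature function contributes one group of columns per slot.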

Serialization

Both models are serialized in a single JSON envelope. No pickle is used (banned by project security policy):

  • Each model's XGBoost booster is saved via save_raw("json") and base64-encoded
  • The envelope includes the number of input features and nb_prev_actions for reproducibility
{
  "model_type": "vaep_xgboost_v1",
  "scores_booster_b64": "...",
  "concedes_booster_b64": "...",
  "n_features": 264,
  "nb_prev_actions": 3
}

This makes model weights fully inspectable, version-controllable, and safe to load without arbitrary code execution.
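The envelope layout can be illustrated with a round trip through the standard library. This is a sketch, not the project's training code: `build_envelope` is a hypothetical helper, and the placeholder byte strings stand in for what `save_raw("json")` would return for each booster.

```python
import base64
import json


def build_envelope(scores_raw, concedes_raw, n_features, nb_prev_actions):
    """Wrap two raw booster payloads in the JSON envelope described above."""
    return {
        "model_type": "vaep_xgboost_v1",
        "scores_booster_b64": base64.b64encode(scores_raw).decode("ascii"),
        "concedes_booster_b64": base64.b64encode(concedes_raw).decode("ascii"),
        "n_features": n_features,
        "nb_prev_actions": nb_prev_actions,
    }


# Placeholder payloads (assumption): real ones come from the trained boosters
scores_raw = b'{"learner": "scores-placeholder"}'
concedes_raw = b'{"learner": "concedes-placeholder"}'

envelope_text = json.dumps(build_envelope(scores_raw, concedes_raw, 264, 3))

# Round trip: parse the JSON and decode one payload back to bytes
restored = json.loads(envelope_text)
restored_scores = base64.b64decode(restored["scores_booster_b64"])
```

Because every field is plain JSON, the file can be diffed and reviewed like any other text artifact.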

Training Data

| Source | Matches | License |
| --- | --- | --- |
| StatsBomb Open Data | ~3,000 | CC-BY 4.0 |
| Wyscout Public Dataset | ~1,900 | CC-BY-NC 4.0 |
| Total | ~2,388 (deduplicated) | CC-BY-NC 4.0 (most restrictive applies) |

Coverage includes the Premier League, La Liga, Serie A, Bundesliga, Ligue 1, Champions League, World Cup, and more.

All event data is converted to the SPADL (Soccer Player Action Description Language) unified format with standardized coordinates (105×68 meters) and 23 canonical action types, enabling cross-source training without vendor-specific adapters.

Training is performed on Hugging Face Jobs using the cpu-basic flavor.

Training / Test Split

  • 80/20 split by game, stratified by competition_id
  • Evaluation metrics computed on the held-out test set
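One way to realize a game-level split stratified by competition is sketched below. The card does not say how the split was implemented, so the helper `split_games`, the seed, and the toy game IDs are all assumptions.

```python
import random
from collections import defaultdict


def split_games(game_to_competition, test_fraction=0.2, seed=42):
    """Split game IDs into train/test lists, holding out a share of each competition."""
    by_competition = defaultdict(list)
    for game_id, competition_id in game_to_competition.items():
        by_competition[competition_id].append(game_id)

    rng = random.Random(seed)
    train, test = [], []
    for competition_id in sorted(by_competition):
        games = sorted(by_competition[competition_id])
        rng.shuffle(games)
        n_test = round(len(games) * test_fraction)
        test.extend(games[:n_test])
        train.extend(games[n_test:])
    return train, test


# Toy mapping: 10 games in each of two hypothetical competitions
games = {g: ("league_a" if g < 10 else "league_b") for g in range(20)}
train_games, test_games = split_games(games)
```

Splitting by game rather than by action keeps all actions of a match on the same side of the split, which avoids leaking near-duplicate game states into the test set.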

How to Use

Quick Start

pip install huggingface_hub xgboost

import json
import base64

from huggingface_hub import snapshot_download
from xgboost import XGBClassifier

# Download model
model_dir = snapshot_download("luxury-lakehouse/vaep-model-statsbomb-wyscout")

# Load from JSON envelope
with open(f"{model_dir}/vaep_model.json") as f:
    envelope = json.load(f)

# Deserialize P(scores) model
model_scores = XGBClassifier()
scores_raw = base64.b64decode(envelope["scores_booster_b64"])
model_scores.load_model(bytearray(scores_raw))

# Deserialize P(concedes) model
model_concedes = XGBClassifier()
concedes_raw = base64.b64decode(envelope["concedes_booster_b64"])
model_concedes.load_model(bytearray(concedes_raw))

# Predict probabilities (requires silly-kicks feature extraction)
# p_scores = model_scores.predict_proba(X)[:, 1]
# p_concedes = model_concedes.predict_proba(X)[:, 1]

Full Pipeline (with silly-kicks)

import pandas as pd  # needed for pd.concat below

import silly_kicks.spadl as spadl
import silly_kicks.vaep.features as fs
import silly_kicks.vaep.labels as labels

NB_PREV_ACTIONS = 3

FEATURE_FNS = [
    fs.actiontype_onehot, fs.result_onehot, fs.bodypart_onehot,
    fs.time, fs.startlocation, fs.endlocation,
    fs.startpolar, fs.endpolar, fs.movement, fs.team, fs.time_delta,
]

# actions: pandas DataFrame in SPADL format
gamestates = fs.gamestates(actions, nb_prev_actions=NB_PREV_ACTIONS)
X = pd.concat([fn(gamestates) for fn in FEATURE_FNS], axis=1)

p_scores = model_scores.predict_proba(X)[:, 1]
p_concedes = model_concedes.predict_proba(X)[:, 1]

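# Compute VAEP values (change in probabilities between consecutive game states).
# Simplified: assumes `actions` holds a single game in chronological order.
# The result has len(actions) - 1 entries; vaep_value[k] belongs to actions.iloc[k + 1].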
vaep_offensive = p_scores[1:] - p_scores[:-1]   # delta P(scores)
vaep_defensive = p_concedes[:-1] - p_concedes[1:]  # delta P(concedes), note sign flip
vaep_value = vaep_offensive + vaep_defensive

Intended Use

  • Player valuation: Rank players by total VAEP contribution beyond goals and assists
  • Action analysis: Identify the most impactful passes, carries, and defensive actions in a match
  • Tactical analysis: Evaluate team playing styles by aggregating VAEP across action types
  • Scouting: Compare players across leagues using a unified valuation framework
  • Research: Reproducible VAEP implementation on open data
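The player-valuation use case amounts to summing action values per player. A minimal sketch, with a hypothetical helper `rank_players` and made-up player IDs and values:

```python
from collections import defaultdict


def rank_players(action_values):
    """Sum VAEP values per player and sort from highest to lowest total."""
    totals = defaultdict(float)
    for player_id, value in action_values:
        totals[player_id] += value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (player_id, vaep_value) pairs, not real model output
action_values = [
    ("p1", 0.03),
    ("p2", -0.01),
    ("p1", 0.02),
    ("p3", 0.04),
    ("p2", 0.005),
]

ranking = rank_players(action_values)
for player_id, total in ranking:
    print(player_id, round(total, 3))
```

Raw totals favor players with more minutes, so in practice the totals are usually normalized, for example per 90 minutes played, before players are compared.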

EU AI Act: Intended Use and Non-Use

This model is published for research and reproducibility purposes on public, open-licensed match data. It is not intended for, not validated for, and not supplied to any use that would fall within Annex III §4 (Employment, workers management and access to self-employment) of Regulation (EU) 2024/1689, including recruitment or selection of natural persons, decisions affecting work-related contractual relationships, promotion, termination, task allocation based on individual traits, or the monitoring and evaluation of performance and behaviour of workers for employment decisions.

Any deployer who wishes to use this model for such a purpose is responsible for performing their own conformity assessment under Article 43, for drawing up the technical documentation required by Article 11 and Annex IV, for implementing the human oversight measures required by Article 14, for declaring accuracy metrics under Article 15, and for ensuring the data governance obligations of Article 10 are met. Note specifically that the training data contains no protected attributes and therefore cannot support the group-fairness audits required by Article 10(2)(g) without ingesting additional personal data.

See the AI_GOVERNANCE.md gap analysis in the source repository for the project's full risk classification, re-classification triggers, and governance posture.

Limitations

  • Open data only: Trained on publicly available StatsBomb and Wyscout data. Commercial datasets with richer event annotations may yield different VAEP scores.
  • No tracking data: VAEP is event-based. Off-ball positioning, pressing intensity, and space creation are not captured. See OBSO for tracking-based approaches.
  • Competition-agnostic: The models are trained across all competitions jointly. League-specific models may produce more calibrated probabilities for individual leagues.
  • Cross-source alignment: StatsBomb and Wyscout use different event taxonomies. The SPADL adapter normalizes them, but subtle differences in event definitions (e.g., duel classification) remain.
  • No calibration: Unlike the xG models, the VAEP classifiers do not include post-hoc isotonic calibration. Predicted probabilities may not be perfectly calibrated in absolute terms. For ranking players or actions, this is less consequential; for applications requiring absolute probability values, validate with a reliability diagram.
  • 10-action horizon: VAEP considers only the next 10 actions. Longer-range effects (e.g., a switch of play that leads to a goal 20 actions later) are not captured.
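For the calibration check mentioned under "No calibration", the data behind a reliability diagram can be computed with a few lines of plain Python. This is a sketch with equal-width bins; `reliability_bins` and the sample values are hypothetical.

```python
def reliability_bins(probabilities, outcomes, n_bins=10):
    """Return (mean predicted probability, observed frequency, count) per non-empty bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[index] += p
        hits[index] += y
        counts[index] += 1
    return [
        (sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Made-up predictions and binary outcomes (1 = goal within the horizon)
probs = [0.05, 0.08, 0.02, 0.35, 0.31, 0.92]
labels = [0, 0, 0, 1, 0, 1]

bins = reliability_bins(probs, labels)
for mean_pred, observed, count in bins:
    print(round(mean_pred, 3), round(observed, 3), count)
```

A well-calibrated model has mean predicted probability close to observed frequency in each bin. Because scoring probabilities tend to cluster near zero, quantile-based bins may be more informative than equal-width bins for this model.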

Model Files

vaep_model.json    -- scores + concedes XGBoost boosters (JSON envelope, no pickle)
metrics.json       -- evaluation metrics and training configuration

Citation

If you use this model, please cite the original VAEP paper:

@inproceedings{decroos2019actions,
  title={Actions Speak Louder than Goals: Valuing Player Actions in Soccer},
  author={Decroos, Tom and Bransen, Lotte and Van Haaren, Jan and Davis, Jesse},
  booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  pages={1851--1861},
  year={2019},
  publisher={ACM}
}

And the silly-kicks library:

@article{silly-kicks,
  title={silly-kicks: A Python library for valuing soccer actions},
  author={Decroos, Tom and Van Haaren, Jan and Davis, Jesse},
  year={2020},
  url={https://github.com/karsten-s-nielsen/silly-kicks}
}

@software{nielsen2026vaep,
  title={VAEP Model: Action Valuation on StatsBomb and Wyscout Open Data},
  author={Nielsen, Karsten Skytt},
  year={2026},
  url={https://github.com/karsten-s-nielsen/luxury-lakehouse}
}

Companion Resources

| Dataset | Description |
| --- | --- |
| SPADL/VAEP Action Values | Pre-computed per-action VAEP valuations (~9.5M actions) |
| xG Shot Data | Tabular shot features for xG training |
| Player Embeddings | Pre-computed behavioral + statistical vectors |

Demo

Try the interactive Soccer Analytics Explorer to explore player impact rankings powered by VAEP valuations and to compare players across leagues.

Explore interactively: HF Space demo
