xG Model: StatsBomb + Wyscout

Two expected goals (xG) models trained on ~131K professional soccer shots from StatsBomb Open Data and Wyscout:

  1. Logistic regression baseline: distance + angle only (interpretable reference point)
  2. Calibrated XGBoost: all 13 features with isotonic calibration (production model)

Part of the (Right! Luxury!) Lakehouse soccer analytics platform.

Model Description

Logistic Baseline

A logistic regression fitted on two geometric features (distance to goal center, shot angle) with isotonic calibration. It serves as an interpretable lower bound: any production model must beat this baseline.

Calibrated XGBoost

A gradient-boosted tree classifier (XGBClassifier) fitted on all 13 features, wrapped in scikit-learn's CalibratedClassifierCV with isotonic regression. The calibration step maps the raw model scores to probabilities that match observed goal rates (a 0.15 xG prediction means ~15% of such shots are goals).

Serialization

Both models are serialized as JSON envelopes; no pickle is used (banned by project security policy):

  • XGBoost: Booster saved via save_raw("json"), base64-encoded in a JSON envelope alongside isotonic calibrator thresholds
  • Logistic: Coefficients, intercept, and classes stored as JSON arrays alongside isotonic calibrator thresholds

This makes model weights fully inspectable, version-controllable, and safe to load without arbitrary code execution.
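As a rough illustration of why such an envelope is safe to load, the sketch below builds and reads back a toy envelope using only `json` and `base64`, then applies isotonic thresholds by clipped linear interpolation. Only the `booster_b64` key appears in the Quick Start; the `calibrator_x` and `calibrator_y` keys, the toy payload, and the helper names are assumptions made for illustration, not the project's actual schema.

```python
import base64
import json
from bisect import bisect_right


def make_envelope(raw_model: bytes, x_thresholds, y_thresholds) -> str:
    """Pack raw model bytes and calibrator thresholds into a JSON string."""
    return json.dumps({
        "booster_b64": base64.b64encode(raw_model).decode("ascii"),
        "calibrator_x": list(x_thresholds),  # hypothetical key name
        "calibrator_y": list(y_thresholds),  # hypothetical key name
    })


def read_envelope(text: str):
    """Decode the envelope; json.loads only parses data and never runs code."""
    env = json.loads(text)
    raw = base64.b64decode(env["booster_b64"])
    return raw, env["calibrator_x"], env["calibrator_y"]


def apply_isotonic(score: float, xs, ys) -> float:
    """Piecewise-linear interpolation between thresholds, clipped at both ends."""
    if score <= xs[0]:
        return ys[0]
    if score >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, score)  # now xs[i - 1] <= score < xs[i]
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)


# Toy payload and thresholds, not real model weights.
envelope = make_envelope(b'{"toy": "model"}', [0.0, 0.5, 1.0], [0.0, 0.2, 0.9])
raw, xs, ys = read_envelope(envelope)
calibrated = apply_isotonic(0.25, xs, ys)
```

Because the envelope is plain JSON, it can be diffed in version control and inspected in any text editor before loading.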

Training Data

| Source | Shots | License |
| --- | --- | --- |
| StatsBomb Open Data | ~95K | CC-BY 4.0 (match data) |
| Wyscout Public Dataset | ~36K | CC-BY-NC 4.0 (research) |

Coverage includes the Premier League, La Liga, Serie A, Bundesliga, Ligue 1, Champions League, World Cup, and more. Both sources are unified to the StatsBomb 120×80 yard coordinate system at the dbt staging layer.

Features

All 13 features used by the XGBoost model:

| Feature | Type | Description |
| --- | --- | --- |
| distance_to_goal | Numeric | Euclidean distance from shot location to goal center (yards) |
| shot_angle | Numeric | Angle subtended by the goal from the shot location (radians) |
| location_x | Numeric | Shot x-coordinate (0–120 yards, attacking direction) |
| location_y | Numeric | Shot y-coordinate (0–80 yards) |
| end_location_x | Numeric | Shot end x-coordinate |
| end_location_y | Numeric | Shot end y-coordinate |
| period | Numeric | Match period (1 = first half, 2 = second half, etc.) |
| minute | Numeric | Match minute |
| is_first_time | Boolean | Whether the shot was taken first-time (no prior touch) |
| shot_body_part | Categorical | Body part used (Right Foot, Left Foot, Head, Other) |
| shot_technique | Categorical | Technique (Normal, Volley, Half Volley, Lob, Overhead Kick, etc.) |
| shot_type | Categorical | Shot type (Open Play, Free Kick, Corner, Penalty, etc.) |
| play_pattern | Categorical | Build-up pattern (Regular Play, From Counter, From Corner, etc.) |

Categorical features are one-hot encoded. The logistic baseline uses only distance_to_goal and shot_angle.
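A minimal sketch of one-hot encoding a single categorical feature is shown below. The body-part vocabulary is taken from the feature table; the `one_hot` helper, the `prefix_value` column naming, and the example shot are assumptions for illustration, and the training notebook may name and order columns differently.

```python
# Vocabulary as listed in the feature table above.
BODY_PARTS = ["Right Foot", "Left Foot", "Head", "Other"]


def one_hot(value: str, vocabulary: list[str], prefix: str) -> dict[str, int]:
    """Return one 0/1 column per known category; unknown values give all zeros."""
    return {f"{prefix}_{cat}": int(value == cat) for cat in vocabulary}


# Hypothetical shot record with one categorical field.
shot = {"distance_to_goal": 12.0, "shot_angle": 0.45, "shot_body_part": "Head"}

# Keep the numeric fields and expand the categorical one into indicator columns.
row = {k: v for k, v in shot.items() if k != "shot_body_part"}
row.update(one_hot(shot["shot_body_part"], BODY_PARTS, "shot_body_part"))
```

At inference time the encoded columns must match the training layout exactly, which is why a fixed vocabulary is used here instead of one inferred from the data.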

Coordinate System

All coordinates are in the StatsBomb system: 120 × 80 yards, with (0, 0) at the bottom-left corner of the pitch and the attacking goal at x = 120. Wyscout coordinates (0–100% scale) are converted at the dbt staging layer.
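The two geometric features can be derived from these coordinates roughly as follows. This is a sketch under stated assumptions: the goal center is placed at (120, 40) with posts 8 yards apart, the Wyscout conversion is a plain linear rescale without any axis flip, and the function names are illustrative. The actual dbt staging logic may differ.

```python
import math

PITCH_LENGTH, PITCH_WIDTH = 120.0, 80.0
GOAL_X, GOAL_Y = 120.0, 40.0  # assumed goal center on the attacking goal line
GOAL_WIDTH = 8.0              # assumed: posts at y = 36 and y = 44 (yards)


def wyscout_to_statsbomb(x_pct: float, y_pct: float) -> tuple[float, float]:
    """Linearly rescale 0-100 percentages to yards (no axis flip is applied)."""
    return x_pct / 100.0 * PITCH_LENGTH, y_pct / 100.0 * PITCH_WIDTH


def distance_to_goal(x: float, y: float) -> float:
    """Euclidean distance in yards from the shot location to the goal center."""
    return math.hypot(GOAL_X - x, GOAL_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle in radians between the lines from (x, y) to each goal post."""
    dx = GOAL_X - x
    to_far_post = math.atan2(GOAL_Y + GOAL_WIDTH / 2 - y, dx)
    to_near_post = math.atan2(GOAL_Y - GOAL_WIDTH / 2 - y, dx)
    return abs(to_far_post - to_near_post)


# A central shot 12 yards from goal, and the Wyscout pitch center.
central_distance = distance_to_goal(108.0, 40.0)
central_angle = shot_angle(108.0, 40.0)
pitch_center = wyscout_to_statsbomb(50.0, 50.0)
```

The angle shrinks both with distance and as the shot moves toward the byline, which is what lets the two-feature baseline separate central chances from wide ones.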

Hyperparameters

| Parameter | Value |
| --- | --- |
| XGBoost n_estimators | 100 |
| XGBoost max_depth | 3 |
| XGBoost learning_rate | 0.1 |
| XGBoost eval_metric | logloss |
| Calibration method | Isotonic regression |
| Test split | 20% (stratified by competition_id) |
| Random state | 42 |
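To make the stratified split concrete, here is a standard-library sketch that holds out about 20% of rows within each competition_id using a fixed seed of 42. The `stratified_split` helper and the toy shot records are assumptions for illustration; the training notebook presumably relies on a library routine, and its exact shuffling will differ.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=42):
    """Hold out roughly test_fraction of the rows within each stratum."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    train, test = [], []
    for stratum in sorted(groups):      # fixed order keeps the split reproducible
        members = groups[stratum][:]    # copy so the input is not mutated
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data: 300 shots spread evenly over three hypothetical competitions.
shots = [{"shot_id": i, "competition_id": i % 3} for i in range(300)]
train, test = stratified_split(shots, key="competition_id")
```

Stratifying by competition keeps each league's share of shots the same in both sets, so the test metrics are not dominated by whichever competition happens to be over-sampled.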

Evaluation Metrics

Both models are evaluated on a held-out test set using:

| Metric | Description |
| --- | --- |
| Brier score | Mean squared error of probability estimates (lower is better) |
| Log loss | Logarithmic loss (lower is better) |
| ROC-AUC | Area under the ROC curve (higher is better) |
| Calibration error (ECE) | Expected calibration error across 10 uniform bins (lower is better) |
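For reference, three of these metrics can be written in a few lines of plain Python. This is a sketch of the standard definitions, not the project's evaluation code; the function names and the five toy predictions are made up for illustration.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean gap between observed goal rate and mean prediction per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((y, p))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            predicted = sum(p for _, p in members) / len(members)
            ece += len(members) / n * abs(observed - predicted)
    return ece


# Toy outcomes (1 = goal) and predicted xG values, for illustration only.
y_true = [0, 0, 1, 0, 1]
y_prob = [0.1, 0.2, 0.7, 0.1, 0.9]
brier = brier_score(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
loss = log_loss(y_true, y_prob)
```

Brier score and log loss reward both ranking and calibration, while ECE isolates calibration alone, so a model can have a high ROC-AUC and still score poorly on ECE.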

Results (held-out test set, ~26K shots)

| Model | ROC-AUC | Brier Score |
| --- | --- | --- |
| Custom XGBoost (calibrated) | 0.979 | 0.059 |
| Custom Logistic (baseline) | 0.761 | 0.082 |

StatsBomb xG Benchmark

The custom XGBoost model is benchmarked against StatsBomb's proprietary xG on the StatsBomb subset of the test set. Acceptance criterion: custom xG Brier score must be within 10% of StatsBomb xG Brier score.
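One way to read this criterion is as a one-sided check: the custom Brier score may be at most 10% higher (worse) than the StatsBomb Brier score, and any lower score passes. The sketch below encodes that reading. The one-sided interpretation, the `within_tolerance` name, and the example numbers are assumptions, not values from the benchmark.

```python
def within_tolerance(custom_brier: float, reference_brier: float,
                     tolerance: float = 0.10) -> bool:
    """True if the custom Brier score is at most `tolerance` worse than the reference."""
    return custom_brier <= reference_brier * (1.0 + tolerance)


# Made-up Brier scores, used only to exercise the check.
passes = within_tolerance(custom_brier=0.085, reference_brier=0.080)
fails = within_tolerance(custom_brier=0.090, reference_brier=0.080)
```

With a reference of 0.080 the threshold is 0.088, so the first call passes and the second does not.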

How to Use

Quick Start

pip install huggingface_hub xgboost scikit-learn

import json
import base64

from huggingface_hub import snapshot_download
from xgboost import XGBClassifier
import numpy as np

# Download model
model_dir = snapshot_download("luxury-lakehouse/xg-model-statsbomb-wyscout")

# Load XGBoost model from JSON envelope
with open(f"{model_dir}/xgboost_model.json") as f:
    envelope = json.load(f)

booster_raw = base64.b64decode(envelope["booster_b64"])
xgb = XGBClassifier()
xgb.load_model(bytearray(booster_raw))

# Predict xG for a shot (requires one-hot encoded feature vector)
# See training notebook for full feature engineering pipeline

Note: The raw XGBoost booster above produces uncalibrated probabilities. For production use, load the model with deserialize_xgboost_model (shown below), which wraps the booster in scikit-learn's CalibratedClassifierCV with isotonic regression. Without this calibration step, predicted xG values may be systematically over- or under-confident.

Full Pipeline (with calibration)

For production use with isotonic calibration, use the deserialize_xgboost_model and deserialize_logistic_model functions from the analytics.xg_model module:

from analytics.xg_model import deserialize_xgboost_model, deserialize_logistic_model

with open(f"{model_dir}/xgboost_model.json", "rb") as f:
    xgb_model = deserialize_xgboost_model(f.read())

with open(f"{model_dir}/logistic_model.json", "rb") as f:
    logistic_model = deserialize_logistic_model(f.read())

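# X is the feature matrix: one row per shot, built from the 13 features
# with the same one-hot encoding used in training (see the training notebook).
# predict_proba returns calibrated probabilities; column 1 is P(goal)
xg_proba = xgb_model.predict_proba(X)[:, 1]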

Intended Use

  • Shot valuation: Assign expected goal probabilities to shots for match analysis
  • Player evaluation: Aggregate xG for player performance assessment (goals vs. xG)
  • Tactical analysis: Identify high-quality shooting opportunities by location and context
  • Research: Reproducible xG baseline for sports analytics on open data
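The player-evaluation use case boils down to summing goals and xG per player and comparing the two. A minimal sketch follows; the `player`, `is_goal`, and `xg` field names, the helper name, and the sample shots are hypothetical and not taken from the platform's schema.

```python
from collections import defaultdict


def goals_minus_xg(shots):
    """Sum goals and xG per player; a positive diff means goals exceeded xG."""
    totals = defaultdict(lambda: {"goals": 0, "xg": 0.0})
    for shot in shots:
        entry = totals[shot["player"]]
        entry["goals"] += int(shot["is_goal"])
        entry["xg"] += shot["xg"]
    return {
        player: {**entry, "diff": entry["goals"] - entry["xg"]}
        for player, entry in totals.items()
    }


# Made-up shots for two hypothetical players.
sample_shots = [
    {"player": "A", "is_goal": True, "xg": 0.25},
    {"player": "A", "is_goal": False, "xg": 0.25},
    {"player": "B", "is_goal": False, "xg": 0.5},
]
summary = goals_minus_xg(sample_shots)
```

Differences over a handful of shots are mostly noise, so this comparison is more informative over a full season than over a single match.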

EU AI Act: Intended Use and Non-Use

This model is published for research and reproducibility purposes on public, open-licensed match data. It is not intended for, not validated for, and not supplied to any use that would fall within Annex III §4 (Employment, workers management and access to self-employment) of Regulation (EU) 2024/1689, including recruitment or selection of natural persons, decisions affecting work-related contractual relationships, promotion, termination, task allocation based on individual traits, or the monitoring and evaluation of performance and behaviour of workers for employment decisions.

Any deployer who wishes to use this model for such a purpose is responsible for performing their own conformity assessment under Article 43, for drawing up the technical documentation required by Article 11 and Annex IV, for implementing the human oversight measures required by Article 14, for declaring accuracy metrics under Article 15, and for ensuring the data governance obligations of Article 10 are met. Note specifically that the training data contains no protected attributes and therefore cannot support the group-fairness audits required by Article 10(2)(g) without ingesting additional personal data.

See the AI_GOVERNANCE.md gap analysis in the source repository for the project's full risk classification, re-classification triggers, and governance posture.

Limitations

  • Open data only: Trained on publicly available StatsBomb and Wyscout data. Commercial datasets with richer features (freeze-frame defenders, goalkeeper position) would yield better models.
  • No defensive context: The model does not include freeze-frame features (number of defenders, goalkeeper position, blocking angle). These are available in StatsBomb 360 data but not universally across all matches.
  • Cross-source alignment: StatsBomb and Wyscout use different event taxonomies and coordinate systems. The dbt staging layer normalizes them, but subtle differences in shot classification may remain.
  • Calibration on open data: Isotonic calibration is fitted on the same open-data distribution used for training. Applying the model to a substantially different league or era may require recalibration.

Citation

If you use this model, please cite the XGBoost method and this repository:

@inproceedings{chen2016xgboost,
  title={XGBoost: A Scalable Tree Boosting System},
  author={Chen, Tianqi and Guestrin, Carlos},
  booktitle={Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  year={2016}
}
@software{nielsen2026xgmodel,
  title={Custom xG Model: Logistic Baseline + Calibrated XGBoost on StatsBomb and Wyscout Open Data},
  author={Nielsen, Karsten Skytt},
  year={2026},
  url={https://github.com/karsten-s-nielsen/luxury-lakehouse}
}

Model Files

xgboost_model.json    -- calibrated XGBoost (JSON envelope, no pickle)
logistic_model.json   -- calibrated logistic baseline (JSON envelope, no pickle)
metrics.json          -- evaluation metrics and training configuration

Training

  • Notebook: notebooks/train_xg_model.py (Databricks notebook)
  • UC Volume: /Volumes/soccer_analytics/dev_gold/model_weights/xg_model/
  • Source module: src/analytics/xg_model.py

Companion Resources

Pre-computed datasets derived from the platform's analytics pipelines:

| Dataset | Description |
| --- | --- |
| SPADL/VAEP Action Values | Per-action offensive/defensive VAEP valuations |
| Line-Breaking Passes | Pass dataset with defensive line-breaking labels |
| Player Embeddings | Pre-computed behavioral + statistical vectors (career/season/match) |
| Pitch Control Tracking | Per-player per-frame pitch control values from tracking data |

Demo

Try the interactive Soccer Analytics Explorer to explore shot maps, pitch control, player similarity, and more.

Explore interactively: HF Space demo

More Information

  • License: CC-BY-NC 4.0 (inherited from Wyscout training data)