xG v2: Context-Aware Expected Goals with Freeze-Frame Set Encoding
Context-aware expected goals (xG) model that conditions on the visible player positions at the moment of each shot. Trained on ~131K shots from StatsBomb Open Data and Wyscout. Includes MC dropout uncertainty quantification: every prediction comes with a 95% confidence interval.
Part of the Luxury Lakehouse soccer analytics platform.
Model Description
Standard xG models treat each shot in isolation: distance, angle, body part, and a handful of tabular features. xG v2 adds spatial context by encoding the positions of all visible players from StatsBomb 360 freeze frames into a fixed-length context vector using a Deep Sets architecture (Zaheer et al. 2017).
The model answers the question: given where the shooter is, where the defenders are, and where the goalkeeper is, what is the probability this shot results in a goal?
Key properties:
- Permutation-invariant: Handles any number of visible players in any order. There is no fixed roster slot or player identity assumption.
- Graceful degradation: When no freeze-frame data is available, the context vector is zeroed out and the model degrades to tabular-only prediction, identical in structure to the v1 XGBoost baseline.
- Uncertainty-aware: MC dropout produces a mean xG estimate plus a 95% confidence interval, quantifying model confidence per shot rather than collapsing to a single scalar.
- Serverless-compatible: Pure NumPy inference. No PyTorch, no ONNX, no GPU. The JSON-serialized weight file is under 100 KB and loads on Databricks serverless executors.
Architecture
The model combines a set encoder that processes freeze-frame player positions with a prediction MLP that fuses tabular shot features:
Set Encoder (per-player, shared weights):
- Input: N players × 4 features (x_norm, y_norm, is_keeper, is_teammate)
- Per-player MLP: Linear(4 → 32) → ReLU → Linear(32 → 16) → ReLU
- Sum aggregation (permutation invariant) → context vector (16-dim)
Prediction MLP:
- Concatenate: context vector (16-dim) + tabular features (13+ dim)
- Linear(→ 64) → ReLU → Dropout
- Linear(→ 32) → ReLU → Dropout
- Linear(→ 1) → Sigmoid → xG score in [0, 1]
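The architecture above can be sketched as a pure NumPy forward pass. This is a minimal illustration, not the repo's actual code: the function names (`encode_player_set_sketch`, `predict_xg_sketch`) are hypothetical, and dropout is omitted for clarity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode_player_set_sketch(players, W1, b1, W2, b2):
    """Deep Sets encoder: a shared per-player MLP followed by a sum over
    players, which makes the output independent of player order.
    players: (n, 4) array of [x_norm, y_norm, is_keeper, is_teammate]."""
    h = relu(players @ W1.T + b1)   # (n, 32)
    h = relu(h @ W2.T + b2)         # (n, 16)
    return h.sum(axis=0)            # (16,) context vector

def predict_xg_sketch(context, tabular, Wp1, bp1, Wp2, bp2, Wp3, bp3):
    """Prediction MLP: concatenate context + tabular features, two hidden
    layers, sigmoid output."""
    z = np.concatenate([context, tabular])
    z = relu(z @ Wp1.T + bp1)        # 64 hidden units
    z = relu(z @ Wp2.T + bp2)        # 32 hidden units
    logit = (z @ Wp3.T + bp3)[0]     # single output unit
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> xG in [0, 1]
```

Because the encoder sums over the player axis, shuffling the rows of `players` leaves the context vector, and therefore the xG score, unchanged.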
Set Encoder Hyperparameters
| Parameter | Value |
|---|---|
| Player feature dim | 4 (x_norm, y_norm, is_keeper, is_teammate) |
| Encoder hidden dim | 32 |
| Context dim (output) | 16 |
| Aggregation | Sum (permutation invariant) |
Prediction MLP Hyperparameters
| Parameter | Value |
|---|---|
| Hidden layer 1 | 64 units, ReLU |
| Hidden layer 2 | 32 units, ReLU |
| Output | 1 unit, Sigmoid |
| Dropout rate | 0.1 |
| MC dropout samples | 50 |
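The two hyperparameter tables can be summarized in a single config object. The sketch below mirrors the repo's `SetEncoderConfig` in spirit only; the class and field names here are assumptions, not the actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SetEncoderConfigSketch:
    """Hypothetical summary of the hyperparameters above; field names are
    illustrative, not the repo's actual SetEncoderConfig."""
    player_dim: int = 4           # x_norm, y_norm, is_keeper, is_teammate
    encoder_hidden_dim: int = 32  # per-player MLP hidden width
    context_dim: int = 16         # pooled context vector size
    mlp_hidden_dims: tuple = (64, 32)
    dropout_rate: float = 0.1
    mc_dropout_samples: int = 50
```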
Uncertainty Quantification
xG v2 uses MC Dropout (Gal & Ghahramani 2016) as a practical Bayesian approximation. Dropout is active at inference time, and 50 stochastic forward passes are run per shot:
predictions = empty(50)
for i in range(50):
    # fresh dropout mask each pass (keep probability 1 - 0.1)
    mask = Bernoulli(1 - 0.1, size=64 + 32)
    predictions[i] = forward_pass(shot, mask)

mean = predictions.mean()
std = predictions.std()
ci_95 = (clip(mean - 1.96 * std, 0, 1),
         clip(mean + 1.96 * std, 0, 1))

(The snippet shows the textbook values; the deployed model uses the calibrated z-multiplier and inference dropout rate reported under Performance.)
Each prediction returns a 4-tuple: (mean, std, ci_lower, ci_upper).
Interpretation: A narrow CI (e.g., xG = 0.72 ± 0.03) indicates the model is confident. A wide CI (e.g., xG = 0.35 ± 0.18) signals high uncertainty, typical for partially occluded freeze frames or unusual shot geometries.
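The loop above is runnable end-to-end with a stand-in network. In this sketch, `mc_dropout_summary` and `toy_forward` are illustrative names, and the toy forward function only mimics the stochasticity of dropout; the uncertainty summary logic is the part that matches the text.

```python
import numpy as np

rng = np.random.default_rng(7)

def mc_dropout_summary(forward_fn, shot, n_samples=50, p_drop=0.1, z=1.96):
    """Run n_samples stochastic passes with dropout active and summarize
    as (mean, std, ci_lower, ci_upper), clipped to the [0, 1] xG range."""
    preds = np.empty(n_samples)
    for i in range(n_samples):
        mask = rng.random(64 + 32) > p_drop   # keep-mask over hidden units
        preds[i] = forward_fn(shot, mask)
    mean, std = preds.mean(), preds.std()
    return (float(mean), float(std),
            float(np.clip(mean - z * std, 0.0, 1.0)),
            float(np.clip(mean + z * std, 0.0, 1.0)))

# Toy forward function: a base rate perturbed by the sampled mask, standing
# in for the real prediction MLP.
toy_forward = lambda shot, mask: 0.3 + 0.02 * mask[:8].sum() / 8
mean, std, lo, hi = mc_dropout_summary(toy_forward, shot=None)
```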
Training Data
| Source | Shots | License |
|---|---|---|
| StatsBomb Open Data | ~75K | CC-BY 4.0 |
| Wyscout Public Dataset | ~56K | CC-BY-NC 4.0 |
| Total | ~131K | CC-BY-NC 4.0 (most restrictive applies) |
Freeze-frame coverage comes from StatsBomb 360 data: approximately 15.58M freeze-frame rows across 323 matches, embedded inline within the events JSON (shot_freeze_frame field). Wyscout shots contribute tabular features only; no freeze frames.
Coverage includes the Premier League, La Liga, Serie A, Bundesliga, Ligue 1, Champions League, World Cup, and more.
Training is performed on Hugging Face Jobs using PyTorch. Inference uses the pure NumPy forward pass exported from the trained weights.
Features
Tabular Features (13)
These features are the same as the v1 XGBoost baseline:
| Feature | Type | Description |
|---|---|---|
| `distance_to_goal` | Numeric | Euclidean distance from shot location to goal center (yards) |
| `shot_angle` | Numeric | Angle subtended by the goal from the shot location (radians) |
| `location_x` | Numeric | Shot x-coordinate (StatsBomb: 0–120) |
| `location_y` | Numeric | Shot y-coordinate (StatsBomb: 0–80) |
| `end_location_x` | Numeric | Intended x-coordinate of shot trajectory |
| `end_location_y` | Numeric | Intended y-coordinate of shot trajectory |
| `period` | Numeric | Match period (1–5) |
| `minute` | Numeric | Minute of the match |
| `is_first_time` | Boolean | Shot taken first-time (no control touch) |
| `shot_body_part` | Categorical | Head, Right Foot, Left Foot, No Touch |
| `shot_technique` | Categorical | Normal, Volley, Half Volley, Backheel, Overhead Kick, Diving Header, Lob |
| `shot_type` | Categorical | Open Play, Free Kick, Corner, Kick Off, Penalty |
| `play_pattern` | Categorical | From Counter, From Keeper, From Free Kick, From Corner, etc. |
Set Encoder Input (variable-length, per visible player)
| Feature | Type | Description |
|---|---|---|
| `x_norm` | Float [0, 1] | Player x-position, normalized by the 120-yard StatsBomb pitch length |
| `y_norm` | Float [0, 1] | Player y-position, normalized by the 80-yard StatsBomb pitch width |
| `is_keeper` | Binary | 1 if this player is the goalkeeper, 0 otherwise |
| `is_teammate` | Binary | 1 if this player is on the shooter's team, 0 for opponent |
Player identity is never used. The set encoder sees only spatial position and role.
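Building this input from raw data is a short mapping step. The helper below is a sketch that assumes 360-style freeze-frame entries with `location`, `teammate`, and `keeper` fields; check your exact JSON shape (the events-embedded `shot_freeze_frame` uses slightly different keys) before relying on these lookups.

```python
import numpy as np

def freeze_frame_to_features(frame):
    """Build the (N, 4) set-encoder input from a freeze frame.
    Assumes 360-style entries with 'location', 'teammate', and 'keeper'
    fields; adapt the key lookups to your exact JSON shape."""
    rows = [[p["location"][0] / 120.0,           # x_norm
             p["location"][1] / 80.0,            # y_norm
             1.0 if p.get("keeper") else 0.0,    # is_keeper
             1.0 if p.get("teammate") else 0.0]  # is_teammate
            for p in frame]
    return np.asarray(rows, dtype=np.float64)
```

Missing booleans default to 0, so an entry with no `keeper` flag is treated as an outfield opponent.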
Performance
| Model | ROC-AUC | Brier Score | Log Loss |
|---|---|---|---|
| v1 XGBoost + Isotonic Calibration (13 features) | 0.825 | 0.057 | 1.212 |
| v2 Set Encoder (raw, pre-calibration) | 0.901 | 0.061 | n/a |
| v2 Set Encoder + Isotonic Calibration + MC Dropout | 0.915 | 0.060 | 0.200 |
ROC-AUC improved by +0.090 over the v1 XGBoost baseline (0.825 → 0.915), a large gain in discrimination for xG models, where +0.02 is typically meaningful. Isotonic calibration closed the Brier score gap to 0.003 while reducing log loss sixfold (1.212 → 0.200). MC dropout 95% CI coverage: 95.1% (properly calibrated).
Training: 153 seconds on HF Jobs A10G-small. MC dropout z-multiplier: 4.2; inference dropout rate: 0.30 (3× the training dropout of 0.10).
Evaluation protocol: 80/20 train/test split by competition. Metrics computed on held-out test set.
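For reference, the two calibration metrics in the table can be computed in a few lines. This is a sketch of the standard metric definitions only, not the project's evaluation harness.

```python
import numpy as np

def brier_score(y_true, p):
    """Mean squared error between predicted probability and 0/1 outcome."""
    y_true, p = np.asarray(y_true, float), np.asarray(p, float)
    return float(np.mean((p - y_true) ** 2))

def log_loss(y_true, p, eps=1e-15):
    """Negative mean log-likelihood, with clipping for numerical safety."""
    y_true = np.asarray(y_true, float)
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))
```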
Coordinate System
All spatial features use the StatsBomb coordinate system:
- Pitch dimensions: 120 yards (length) × 80 yards (width)
- Origin: bottom-left corner of the pitch
- Attacking direction: left to right (x increases toward opponent goal)
- Goal center: approximately (120, 40)
Set encoder inputs normalize these to [0, 1]:
x_norm = location_x / 120.0
y_norm = location_y / 80.0
This normalization ensures that the per-player MLP receives consistent scale inputs regardless of pitch dimension conventions.
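The two geometry features from the tabular table are derived from these same coordinates. The sketch below uses standard xG geometry and assumes the 8-yard goal mouth spans y = 36 to y = 44 on the x = 120 goal line; the function names are illustrative, not the repo's API.

```python
import numpy as np

def distance_to_goal(x, y):
    """Euclidean distance (yards) from the shot to the goal center (120, 40)."""
    return float(np.hypot(120.0 - x, 40.0 - y))

def shot_angle(x, y):
    """Angle (radians) subtended by the goal mouth, posts at (120, 36) and
    (120, 44), as seen from the shot location."""
    left = np.array([120.0 - x, 36.0 - y])    # vector to near post
    right = np.array([120.0 - x, 44.0 - y])   # vector to far post
    cos_t = left @ right / (np.linalg.norm(left) * np.linalg.norm(right))
    return float(np.arccos(np.clip(cos_t, -1.0, 1.0)))
```

From the penalty spot (108, 40), for example, the distance is 12 yards and the goal subtends roughly 0.64 radians.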
Inference
The model is serialized as a JSON file with base64-encoded NumPy arrays β no pickle, no PyTorch dependency at inference time.
from huggingface_hub import hf_hub_download
import json
# Download weights
weights_path = hf_hub_download(
repo_id="luxury-lakehouse/xg-v2-model-set-encoder",
filename="xg_v2_weights.json",
)
with open(weights_path, "rb") as f:
weights_bytes = f.read()
# Load weights (NumPy only)
from src.analytics.set_encoder import deserialize_set_encoder_weights
weights = deserialize_set_encoder_weights(weights_bytes)
# Encode freeze-frame player positions
import numpy as np
from src.analytics.set_encoder import encode_player_set, predict_xg_with_uncertainty
player_features = np.array([
[0.85, 0.50, 1, 0], # goalkeeper: x=102, y=40
[0.80, 0.45, 0, 0], # defender 1
[0.78, 0.55, 0, 0], # defender 2
[0.82, 0.48, 0, 1], # teammate
], dtype=np.float64)
context = encode_player_set(player_features, weights)
# Tabular features (pre-processed with build_features)
tabular = np.array([...]) # 13+ features after one-hot encoding
# Predict with uncertainty
mean_xg, std, ci_lower, ci_upper = predict_xg_with_uncertainty(
tabular, context, weights
)
print(f"xG = {mean_xg:.3f} (95% CI: {ci_lower:.3f}-{ci_upper:.3f})")
For shots without freeze-frame data, pass a zero context vector:
from src.analytics.set_encoder import SetEncoderConfig
config = SetEncoderConfig()
context = np.zeros(config.context_dim) # graceful degradation to tabular-only
Serialization Format
Weights are stored as a JSON envelope with base64-encoded arrays:
{
"model_type": "set_encoder_xg_v2",
"weights": {
"encoder_fc1_weight": {"data": "...", "shape": [32, 4], "dtype": "float64"},
"encoder_fc1_bias": {"data": "...", "shape": [32], "dtype": "float64"},
"encoder_fc2_weight": {"data": "...", "shape": [16, 32], "dtype": "float64"},
"encoder_fc2_bias": {"data": "...", "shape": [16], "dtype": "float64"},
"pred_fc1_weight": {"data": "...", "shape": [64, ...], "dtype": "float64"},
...
}
}
No pickle is used anywhere in the serialization or deserialization path (banned by project security policy).
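Decoding one envelope entry takes only a couple of lines. The sketch below assumes the `data` field holds the array's raw bytes, base64-encoded; `deserialize_set_encoder_weights` in the repo is the authoritative loader, and both helper names here are illustrative.

```python
import base64
import numpy as np

def decode_array(entry):
    """Rebuild one weight array from a {'data', 'shape', 'dtype'} envelope.
    Assumes 'data' is the array's raw bytes, base64-encoded."""
    buf = base64.b64decode(entry["data"])
    return np.frombuffer(buf, dtype=entry["dtype"]).reshape(entry["shape"])

def encode_array(arr):
    """Inverse of decode_array, handy for round-trip checks."""
    raw = np.ascontiguousarray(arr).tobytes()
    return {"data": base64.b64encode(raw).decode("ascii"),
            "shape": list(arr.shape),
            "dtype": str(arr.dtype)}
```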
EU AI Act β Intended Use and Non-Use
This model is published for research and reproducibility purposes on public, open-licensed match data. It is not intended for, not validated for, and not supplied to any use that would fall within Annex III §4 (Employment, workers management and access to self-employment) of Regulation (EU) 2024/1689, including recruitment or selection of natural persons, decisions affecting work-related contractual relationships, promotion, termination, task allocation based on individual traits, or the monitoring and evaluation of performance and behaviour of workers for employment decisions.
Any deployer who wishes to use this model for such a purpose is responsible for performing their own conformity assessment under Article 43, for drawing up the technical documentation required by Article 11 and Annex IV, for implementing the human oversight measures required by Article 14, for declaring accuracy metrics under Article 15, and for ensuring the data governance obligations of Article 10 are met. Note specifically that the training data contains no protected attributes and therefore cannot support the group-fairness audits required by Article 10(2)(g) without ingesting additional personal data.
See the AI_GOVERNANCE.md gap analysis in the source repository for the project's full risk classification, re-classification triggers, and governance posture.
Limitations
- Anonymous freeze frames: The set encoder receives only position and role (keeper/teammate flag). Player identity, stamina, height, dominant foot, and tactical assignment are not encoded. Two players in identical positions produce identical context contributions.
- Missing freeze-frame coverage: Only StatsBomb 360 matches include freeze frames (~323 of ~3,000 StatsBomb matches). All Wyscout shots and non-360 StatsBomb shots fall back to the zero context vector.
- Partial occlusion: StatsBomb 360 freeze frames capture only visible players. Players behind the camera or in crowded areas may be absent. The set encoder handles this gracefully (sum over fewer players), but predictions may underestimate defensive pressure when multiple defenders are occluded.
- Open data only: Trained on publicly available StatsBomb and Wyscout data. Models trained on full broadcast-quality tracking data with complete visibility would likely produce narrower uncertainty intervals and higher discrimination.
- Static snapshot: The freeze frame captures player positions at the instant of the shot only. Prior positioning (run-up angle, off-ball movement, pressing intensity) is not encoded.
- No player clustering or identity: The set encoder cannot distinguish a massed low block from an isolated goalkeeper. Tactical shape is implicit in the aggregate position distribution, not explicit.
Model Files
xg_v2_weights.json -- set encoder weights (JSON + base64, ~100 KB)
Citation
If you use this model, please cite the Deep Sets architecture and the MC Dropout method:
@inproceedings{zaheer2017deep,
title={Deep Sets},
author={Zaheer, Manzil and Kottur, Satwik and Ravanbakhsh, Siamak
and P{\'o}czos, Barnab{\'a}s and Salakhutdinov, Ruslan
and Smola, Alexander J.},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
volume={30},
year={2017}
}
@inproceedings{gal2016dropout,
title={Dropout as a Bayesian Approximation: Representing Model Uncertainty
in Deep Learning},
author={Gal, Yarin and Ghahramani, Zoubin},
booktitle={International Conference on Machine Learning (ICML)},
pages={1050--1059},
year={2016}
}
@software{nielsen2026xgv2,
title={xG v2: Context-Aware Expected Goals with Freeze-Frame Set Encoding},
author={Nielsen, Karsten Skytt},
year={2026},
url={https://github.com/karsten-s-nielsen/luxury-lakehouse}
}
Companion Resources
| Dataset | Description |
|---|---|
| xG Shot Data | Tabular shot features used for training and evaluation |
| xG Freeze Frame Data | StatsBomb 360 freeze-frame player positions (15.58M rows, 323 matches) |
| SPADL/VAEP Action Values | Per-action offensive/defensive VAEP valuations |
| Player Embeddings | Pre-computed behavioral + statistical vectors (career/season/match) |
Demo
Try the interactive Soccer Analytics Explorer: visualize shot maps with v2 xG values and uncertainty bands, and compare v1 vs v2 predictions side-by-side.
Explore interactively: HF Space demo
More Information
- License: CC-BY-NC 4.0 (inherited from Wyscout training data)
- v1 baseline model: xG v1 (XGBoost)
- Platform: Luxury Lakehouse Soccer Analytics