Football2Vec v2 — Transformer Player Embeddings with Adversarial Team Debiasing

128-dimensional player embedding vectors from a 4-layer transformer encoder trained via masked language modeling (MLM) on ~87K SPADL action sequences from ~4,900 professional soccer matches. Adversarial team debiasing via gradient reversal (Ganin et al. 2016) removes competition-specific confounds, producing style representations that generalize across leagues.

Part of the Luxury Lakehouse soccer analytics platform. Replaces the v1 Doc2Vec model (32-dim) as the @Champion version.

Model Description

Football2Vec v2 learns contextual player embeddings from SPADL action sequences. Each player-match is represented as a sequence of tokenized actions with continuous spatial coordinates. The transformer encoder processes these sequences, and mean pooling over valid tokens produces a fixed-length 128-dim embedding capturing playing style.

Architecture

| Component | Detail |
|---|---|
| Token embedding | 23 SPADL action types → 128-dim lookup table |
| Spatial encoding | MLP(x) + MLP(y) → 128-dim each, summed with token embedding |
| Positional embedding | Learnable, max 512 tokens |
| Encoder | 4-layer TransformerEncoder, 4 attention heads, GELU activation, 4× FFN |
| Pooling | Mean pooling over valid (non-padding) tokens → 128-dim |
| Adversarial head | Gradient reversal layer (λ = 0.2) + competition classifier |
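As a rough illustration, the components in the table above can be assembled into a forward pass like the following PyTorch sketch. Class and variable names are illustrative, not the released implementation; only the layer sizes come from the table.

```python
import torch
import torch.nn as nn

class Football2VecSketch(nn.Module):
    """Illustrative encoder matching the architecture table (not the released code)."""

    def __init__(self, n_actions=23, d=128, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(n_actions, d)       # 23 SPADL types -> 128-dim
        self.pos = nn.Embedding(max_len, d)         # learnable positions, max 512
        # Spatial MLPs: scalar coordinate -> 64 -> 128, summed with token embedding
        self.mlp_x = nn.Sequential(nn.Linear(1, 64), nn.GELU(), nn.Linear(64, d))
        self.mlp_y = nn.Sequential(nn.Linear(1, 64), nn.GELU(), nn.Linear(64, d))
        layer = nn.TransformerEncoderLayer(
            d, nhead=4, dim_feedforward=4 * d, activation="gelu", batch_first=True
        )
        self.enc = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens, x, y, pad_mask):
        # tokens: (B, T) action ids; x, y: (B, T) in [0, 1]; pad_mask: (B, T), True = padding
        T = tokens.size(1)
        h = (self.tok(tokens)
             + self.pos(torch.arange(T, device=tokens.device))
             + self.mlp_x(x.unsqueeze(-1))
             + self.mlp_y(y.unsqueeze(-1)))
        h = self.enc(h, src_key_padding_mask=pad_mask)
        # Mean-pool over valid (non-padding) tokens -> (B, 128)
        valid = (~pad_mask).unsqueeze(-1).float()
        return (h * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1)
```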

Two-Stage Training

Stage 1 — Masked Language Modeling: 15% of action tokens are masked; the model predicts the original action type from surrounding context. This forces the encoder to learn meaningful action-context representations.
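A minimal sketch of the masking step, assuming plain full-token masking at the stated 15% rate (whether the released model also uses BERT-style 80/10/10 random/keep replacement is not specified here):

```python
import numpy as np

MASK_ID = 23  # hypothetical extra vocabulary slot for [MASK], beyond the 23 action ids

def mask_tokens(tokens, p=0.15, rng=None):
    """Mask ~p of the action tokens; return (masked sequence, MLM labels)."""
    rng = rng or np.random.default_rng(0)
    tokens = np.asarray(tokens)
    is_masked = rng.random(tokens.shape) < p
    masked = np.where(is_masked, MASK_ID, tokens)
    # Labels: original ids at masked positions, -100 (ignore index) elsewhere
    labels = np.where(is_masked, tokens, -100)
    return masked, labels
```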

Stage 2 — Adversarial Debiasing: A competition classifier head is attached via a gradient reversal layer (Ganin et al. 2016). The encoder learns to produce embeddings that cannot predict which competition a player belongs to, removing league-style confounds (e.g., Serie A defending vs Premier League pressing) while retaining individual style signal.
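The gradient reversal layer itself is small: identity in the forward pass, gradient scaled by -λ in the backward pass. A PyTorch sketch (names illustrative):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal (Ganin et al. 2016): identity forward, -lambda * grad backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # The reversed gradient pushes the encoder to *confuse* the competition
        # classifier, while the classifier head itself trains normally.
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=0.2):
    return GradReverse.apply(x, lam)
```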

Dual-Vector Architecture

This model provides the behavioral half of a dual-vector player representation:

| Vector | Dimensions | Source | Captures |
|---|---|---|---|
| Behavioral (this model) | 128 | Transformer on SPADL sequences | Playing style, spatial patterns, action context |
| Statistical | 13 | Z-score normalized per-90 stats | Goals, assists, xG, passes, VAEP, defensive metrics |

Both vectors are stored in PostgreSQL with pgvector HNSW indexes (128-dim, cosine ops) for sub-10ms similarity queries.

Training Data

| Source | Matches | Events | License |
|---|---|---|---|
| StatsBomb Open Data | ~3,000 | ~3M | CC-BY 4.0 |
| Wyscout Public Dataset | ~1,900 | ~3M | CC-BY-NC 4.0 |

Training data is published as luxury-lakehouse/football2vec-training-data on HF Hub.

Tokenization

Events are tokenized using the 23-type SPADL vocabulary (from fct_action_values): pass, cross, throw_in, freekick_crossed, freekick_short, corner_crossed, corner_short, take_on, foul, tackle, interception, shot, shot_penalty, shot_freekick, keeper_save, keeper_claim, keeper_punch, keeper_pick_up, clearance, bad_touch, non_action, dribble, goalkick.

Continuous spatial coordinates (x, y) normalized to [0, 1] on a 105×68 m pitch are injected via learned MLP projections summed with the token embeddings.
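A minimal sketch of this tokenization step, mapping one action to a (token id, x, y) triple. The id ordering below is an assumption for illustration, not the released vocabulary file:

```python
# The 23-type SPADL vocabulary listed above, mapped to integer ids.
# This ordering is illustrative, not the released vocabulary.
SPADL_ACTIONS = [
    "pass", "cross", "throw_in", "freekick_crossed", "freekick_short",
    "corner_crossed", "corner_short", "take_on", "foul", "tackle",
    "interception", "shot", "shot_penalty", "shot_freekick", "keeper_save",
    "keeper_claim", "keeper_punch", "keeper_pick_up", "clearance",
    "bad_touch", "non_action", "dribble", "goalkick",
]
TOKEN_ID = {a: i for i, a in enumerate(SPADL_ACTIONS)}

PITCH_LENGTH, PITCH_WIDTH = 105.0, 68.0  # metres

def encode_action(action_type, x_m, y_m):
    """Map one SPADL action to (token_id, x, y) with coordinates in [0, 1]."""
    return TOKEN_ID[action_type], x_m / PITCH_LENGTH, y_m / PITCH_WIDTH
```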

Hyperparameters

| Parameter | Value |
|---|---|
| Hidden dimension | 128 |
| Encoder layers | 4 |
| Attention heads | 4 |
| FFN multiplier | 4× (512) |
| Dropout | 0.1 |
| Max sequence length | 512 |
| MLM mask probability | 0.15 |
| Spatial MLP intermediate dim | 64 |
| Batch size | 256 |
| Learning rate | 1e-4 |
| Weight decay | 0.01 |
| Warmup fraction | 10% |
| Adversarial λ max | 0.2 |
| Adversarial warmup epochs | 5 |

Training runs on HF Jobs A10G-large GPU (~2 hours total for both stages).
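The adversarial warmup entries in the table suggest λ ramping from 0 to its maximum over the first 5 epochs. A sketch assuming a linear ramp (the exact schedule shape, e.g. the sigmoid ramp of Ganin et al., is not specified here):

```python
LAMBDA_MAX = 0.2      # "Adversarial lambda max" from the hyperparameter table
WARMUP_EPOCHS = 5     # "Adversarial warmup epochs"

def adversarial_lambda(epoch):
    """Assumed linear warmup: 0 at epoch 0, LAMBDA_MAX from epoch 5 onward."""
    return LAMBDA_MAX * min(1.0, epoch / WARMUP_EPOCHS)
```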

How to Use

Quick Start

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json

# Download model weights (Stage 2 — adversarial debiased)
weights_path = hf_hub_download("luxury-lakehouse/football2vec-v2", "stage2/model.safetensors")
config_path = hf_hub_download("luxury-lakehouse/football2vec-v2", "stage2/config.json")

with open(config_path) as f:
    config = json.load(f)

state_dict = load_file(weights_path)
print(f"Config: {config['hidden_dim']}-dim, {config['num_layers']} layers")
print(f"Parameters: {sum(p.numel() for p in state_dict.values()):,}")
```

Pre-Computed Embeddings (recommended)

For most use cases, load the pre-computed embeddings directly — no model inference needed:

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("luxury-lakehouse/football2vec-player-embeddings")
df = ds["train"].to_pandas()

vectors = np.array(df["behavioral_vector"].tolist())
print(f"{vectors.shape[0]} players, {vectors.shape[1]}-dim embeddings")  # (8950, 128)
```
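Given that `vectors` array, a nearest-neighbour query (the same cosine-similarity search pgvector runs server-side) can be sketched in plain numpy:

```python
import numpy as np

def most_similar(vectors, query_idx, k=5):
    """Return indices of the k rows most cosine-similar to vectors[query_idx]."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize
    sims = v @ v[query_idx]                                       # cosine similarities
    order = np.argsort(-sims)                                     # descending
    return [i for i in order if i != query_idx][:k]               # drop the query itself
```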

Intended Use

  • Player similarity search: "Find players with a similar playing style to X" via cosine distance on 128-dim vectors
  • Scouting: Identify transfer targets by behavioral profile, independent of league context
  • Tactical analysis: Cluster players by on-pitch behavior with team-agnostic representations
  • Research: Reproducible player embeddings for sports analytics, with adversarial debiasing removing confounding league effects
  • Downstream features: Input to GNN tactical pattern models, sequence models (ScoutGPT), or combined with tracking-based features

EU AI Act — Intended Use and Non-Use

This model is published for research and reproducibility purposes on public, open-licensed match data. It is not intended for, not validated for, and not supplied to any use that would fall within Annex III §4 (Employment, workers management and access to self-employment) of Regulation (EU) 2024/1689 — including recruitment or selection of natural persons, decisions affecting work-related contractual relationships, promotion, termination, task allocation based on individual traits, or the monitoring and evaluation of performance and behaviour of workers for employment decisions. Because player similarity search is a canonical scouting workflow, any deployer is responsible for treating this model as decision support at most, never as a decision system.

Any deployer who wishes to use this model for such a purpose is responsible for performing their own conformity assessment under Article 43, for drawing up the technical documentation required by Article 11 and Annex IV, for implementing the human oversight measures required by Article 14, for declaring accuracy metrics under Article 15, and for ensuring the data governance obligations of Article 10 are met. Note specifically that the training data contains no protected attributes and therefore cannot support the group-fairness audits required by Article 10(2)(g) without ingesting additional personal data. The adversarial competition debiasing (Ganin et al. 2016) described above addresses a confounding effect (league-style leakage) and is not a substitute for an Article 10 protected-attribute audit.

See the AI_GOVERNANCE.md gap analysis in the source repository for the project's full risk classification, re-classification triggers, and governance posture.

Limitations

  • Event-based only: Captures on-ball actions. Off-ball movement, positioning, and pressing intensity are not represented.
  • Open data only: Trained on publicly available StatsBomb and Wyscout data. Commercial datasets may yield different representations.
  • Competition debiasing, not team debiasing: The adversarial head targets competition ID (the strongest contextual confounder). Within-league team effects are attenuated but not fully removed.
  • No temporal hierarchy: The transformer processes a flat sequence per match. Season-level or career-level patterns emerge only through post-hoc aggregation (mean pooling across matches).
  • Cross-source alignment: Player identity is unified via the entity resolution pipeline (dim_players), but subtle cross-source differences in event definitions remain.

Freshness

| Metric | Value |
|---|---|
| Training data freshness SLA | 168 hours (7 days) |
| Inference schedule | Daily 06:00 UTC |
| Skip guard | match_id-level — only new matches are processed |

Model Files

stage1/model.safetensors      -- Stage 1 MLM checkpoint (safetensors format)
stage2/model.safetensors      -- Stage 2 adversarial, final (safetensors format)
stage2/config.json            -- Football2VecConfig as JSON
zscore_params.json            -- z-score normalization parameters (13-dim stat vector)
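Applying `zscore_params.json` to a raw per-90 stat vector is a plain standardization. The sketch below assumes the file holds `{"mean": [...], "std": [...]}` arrays; the actual schema of the released file may differ:

```python
import numpy as np

def standardize(raw_stats, params):
    """Z-score a raw stat vector using the stored per-feature mean and std.

    `params` is assumed to look like {"mean": [...], "std": [...]} with 13 entries each.
    """
    mean = np.asarray(params["mean"])
    std = np.asarray(params["std"])
    return (np.asarray(raw_stats) - mean) / std
```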

Model weights use the safetensors format — a tensor-only serialization with zero pickle surface and no code execution capability. Pre-computed embeddings are delivered as Parquet (non-executable).

Citation

@article{ganin2016domain,
  title={Domain-Adversarial Training of Neural Networks},
  author={Ganin, Yaroslav and Ustinova, Evgeniya and Ajakan, Hana
          and Germain, Pascal and Larochelle, Hugo and Laviolette, Fran{\c{c}}ois
          and Marchand, Mario and Lempitsky, Victor},
  journal={Journal of Machine Learning Research},
  volume={17},
  number={59},
  pages={1--35},
  year={2016}
}
@inproceedings{le2014distributed,
  title={Distributed Representations of Sentences and Documents},
  author={Le, Quoc and Mikolov, Tomas},
  booktitle={International Conference on Machine Learning},
  year={2014}
}
@software{nielsen2026football2vec_v2,
  title={Football2Vec v2: Transformer Player Embeddings with Adversarial Team Debiasing},
  author={Nielsen, Karsten Skytt},
  year={2026},
  url={https://github.com/karsten-s-nielsen/luxury-lakehouse}
}

Companion Resources

| Resource | Description |
|---|---|
| Training Data | SPADL action sequences used for training |
| Player Embeddings | Pre-computed 128-dim vectors (career/season/match) |
| SPADL/VAEP Action Values | Per-action offensive/defensive VAEP valuations |
| v1 Doc2Vec baseline | 32-dim Doc2Vec model (retained as baseline) |

Demo

Try the interactive Soccer Analytics App — search for similar players by 128-dim behavioral embedding on the Player Similarity page.

Explore interactively: HF Space demo
