AI & ML interests

Soccer analytics, sports analytics, player embeddings, pitch control, action valuation, expected threat, tracking data, VAEP, Doc2Vec, entity resolution, pgvector, defensive valuation, line-breaking passes, physics-based models

Organization Card

Luxury Lakehouse

(Right! Luxury!) Lakehouse

"Luxury! We used to dream of serverless!"

Open-source soccer analytics platform built on Databricks Lakebase — replacing a 6-service traditional AWS pipeline with a unified lakehouse architecture that scales to zero. The Hugging Face Hub serves as the public distribution layer for models, datasets, and interactive demos.

Try it now: Full Dashboard — 16-page Taipy app with live data from 380+ matches across 5 providers. Or explore the Gradio Demo for a quick look.


Platform Scale & Data Engineering

The infrastructure uses a Medallion architecture (Bronze → Silver → Gold) provisioned entirely via Terraform IaC, unifying multi-vendor event and tracking data into a single analytical layer.

  • 38M+ tracking frames ingested from three optical tracking providers (25fps and 10fps)
  • 5 distinct data sources unified: StatsBomb, Wyscout, Metrica Sports, IDSSE (Bundesliga), and SkillCorner (A-League)
  • 16 Taipy dashboard pages deployed on Hugging Face Spaces (Docker SDK), querying Lakebase PostgreSQL via Databricks OAuth
  • 34 synced tables with Zero-ETL continuous sync from Gold Delta Lake to Lakebase PostgreSQL 17
  • 56 PostgreSQL indexes (50 btree + 6 HNSW vector indexes: 4x128d + 2x144d) for sub-10ms OLTP queries
  • Pipeline reliability enforced through 1,118+ unit tests and 381+ dbt data tests

The Hugging Face Footprint

All public artifacts are hosted entirely within the HF ecosystem.

Models

| Model | Architecture | Scale |
|---|---|---|
| football2vec-v2 | Transformer encoder (128-dim) + adversarial team debiasing (Ganin GRL) | 87K per-match vectors across 8,950 players, debiased for team identity |
| football2vec-statsbomb-wyscout | Doc2Vec (PV-DM) 32-dim behavioral embeddings (v1 baseline) | 87K per-match vectors across 8,950 players from ~3,000 matches |
| xg-model-statsbomb-wyscout | Calibrated XGBoost + logistic baseline (13 features) | Trained on ~131K shots, ROC-AUC 0.979 on held-out test set |
| vaep-model-statsbomb-wyscout | 2× XGBClassifier (P(scores) + P(concedes)) | Trained on ~2,388 matches from StatsBomb + Wyscout |
| xg-v2-model-set-encoder | Deep Sets (Zaheer et al. 2017) + MC dropout (Gal & Ghahramani 2016) | ROC-AUC 0.915, trained on ~131K shots with 360 freeze frames |
| psxg-model | Logistic regression on goalmouth coordinates (Butcher et al. 2025) | Trained on ~15K on-target shots, JSON-serialised weights |
| football2vec-360 | Transformer encoder (128-dim) + Deep Sets 360 context (16-dim) = 144-dim | 323 StatsBomb 360 matches, adversarial team debiasing |
| pitch-control | Physics-based team-control probability surface (Spearman 2017) | Heuristic method card (no trained weights); substrate for OBSO / Off-Ball xT / Space Creation |
| defcon | XGBoost counterfactual value estimator (Kim et al. 2025 DEFCON-lite) | Inline-trained per run; per-defender credit assignment on open 360/tracking data |
| off-ball-xt | Heuristic xT × pitch-control attribution (Singh 2018, Spearman 2017) | Method card: attributes attacking threat to off-ball players |
| obso-pausa-method | OBSO surface + PAUSA pass timing (Spearman 2018; Fernández & Bornn 2018; Lee et al. 2026) | Method card: pass-timing counterfactuals on GPU-accelerated OBSO |
| space-creation-method | Counterfactual pitch control (Fernández & Bornn 2018) | Method card: per-player EPV-weighted space-creation value |

All model serialization uses JSON envelopes — zero pickle files (banned by project security policy). Every model card above carries an EU AI Act — Intended Use and Non-Use stanza per the project's AI_GOVERNANCE.md gap analysis (SEC1, April 2026).
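As an illustration of the JSON-only policy, the sketch below round-trips a small logistic-regression model through a JSON envelope and scores one example from the restored copy. It is a minimal sketch, not the project's actual schema: the envelope fields (`format`, `version`, `model_type`, `feature_names`, `weights`, `bias`), the function names, and the example coefficients are all assumptions made for this example.

```python
import json
import math


def to_envelope(weights, bias, feature_names):
    """Wrap logistic-regression parameters in a plain, JSON-serialisable dict.

    All field names here are hypothetical; the real envelope layout is not
    shown in this document.
    """
    return {
        "format": "json-envelope-sketch",
        "version": 1,
        "model_type": "logistic_regression",
        "feature_names": list(feature_names),
        "weights": [float(w) for w in weights],
        "bias": float(bias),
    }


def predict_proba(envelope, features):
    """Score one example using only the data stored in the envelope."""
    z = envelope["bias"] + sum(
        w * features[name]
        for w, name in zip(envelope["weights"], envelope["feature_names"])
    )
    return 1.0 / (1.0 + math.exp(-z))


# Made-up coefficients for two illustrative features.
envelope = to_envelope([0.8, -1.2], 0.1, ["goal_x", "goal_y"])

# Serialise to text and back: no pickle, no code execution on load.
payload = json.dumps(envelope)
restored = json.loads(payload)

probability = predict_proba(restored, {"goal_x": 0.5, "goal_y": 0.25})
```

Because loading is just `json.loads`, a consumer only ever reads numbers and strings, which is the property a pickle ban is meant to guarantee.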

Datasets

| Dataset | Scale | Description |
|---|---|---|
| spadl-vaep-action-values | ~9.5M actions | Per-action offensive/defensive VAEP valuations |
| line-breaking-passes | ~5M passes | All passes with defensive line-breaking labels via Ward clustering on 360 freeze frames |
| football2vec-player-embeddings | 87K vectors | Pre-computed behavioral (128-d transformer) + statistical (13-d) player vectors |
| football2vec-training-data | ~87K sequences | Tokenized SPADL action sequences for transformer training |
| pitch-control-tracking | 38M frames | Per-player per-frame Spearman (2017) physics-based pitch control |
| expected-threat-grids | 12×8 grid | Data-driven Expected Threat values computed from 2.2M SPADL actions |
| obso-pausa-inputs | 7 matches | ELASTIC-synced event-tracking inputs for OBSO/PAUSA computation |
| obso-pausa-values | ~3,500 passes | PAUSA pass timing scores with OBSO temporal/spatial decomposition |
| obso-trained-grids | 8 competitions + global | Data-driven ball reachability (100×64) + EPV (50×32) grids for OBSO |
| xg-freeze-frame-data | 137K player rows | StatsBomb 360 freeze-frame player positions for xG v2 set encoder |
| xg-shot-data | 131K shots | Tabular shot features from StatsBomb + Wyscout for xG model training |
| space-creation-values | 875K player-frames | Per-player space creation/destruction via differential OBSO (Fernández & Bornn 2018) |
| statsbomb-shots-on-target | ~15K shots | On-target shots with goalmouth coordinates for PSxG training |
| psxg-predictions | ~15K shots | Per-shot PSxG probabilities from logistic model |
| football2vec-360-training-data | ~2M actions | SPADL action sequences with 360 freeze frame context |
| football2vec-statsbomb-wyscout | 87K vectors | Per-match v1 Doc2Vec (32-dim) + v2 transformer (128-dim) raw embeddings |
| football2vec-360-embeddings | ~4K players | 144-dim player embeddings from 360-enriched model |
| scoutgpt-training-data | 894K episodes | SPADL possession episodes with per-action player attribution (Hong et al. 2025) |
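
The offensive/defensive split in spadl-vaep-action-values follows the VAEP idea of Decroos et al. (2019): an action is worth the change it causes in the probability of scoring, minus the change in the probability of conceding. The sketch below shows that arithmetic only. It assumes one team keeps possession for the whole sequence and uses a zero baseline before the first action; both simplifications, the function name `vaep_values`, and the example probabilities are assumptions for illustration, not the project's implementation.

```python
def vaep_values(p_scores, p_concedes):
    """Return (offensive, defensive, total) values for each action.

    p_scores[i] and p_concedes[i] are the model's estimated probabilities of
    scoring and conceding in the near future after action i. Simplifying
    assumptions: one team is in possession throughout, and the state before
    the first action has probability 0.0 for both outcomes.
    """
    values = []
    for i in range(len(p_scores)):
        prev_scores = p_scores[i - 1] if i > 0 else 0.0
        prev_concedes = p_concedes[i - 1] if i > 0 else 0.0
        offensive = p_scores[i] - prev_scores
        # Raising the chance of conceding is bad, hence the minus sign.
        defensive = -(p_concedes[i] - prev_concedes)
        values.append((offensive, defensive, offensive + defensive))
    return values


# Made-up probabilities for a three-action move towards goal.
example_values = vaep_values(
    p_scores=[0.02, 0.05, 0.12],
    p_concedes=[0.01, 0.01, 0.005],
)
```

In this toy sequence the third action gains value twice: it raises the scoring probability and lowers the conceding probability.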

Interactive Spaces

| Space | What it is |
|---|---|
| Soccer Analytics App | Full 16-page Taipy dashboard (Docker SDK) querying Lakebase PostgreSQL via Databricks OAuth. Live data from 380+ matches. Shot maps, pass networks, player comparison, GK analytics, tactical positions, pitch control, PAUSA pass timing, DEFCON defensive pressure, and more. |
| Soccer Analytics Demo | Lightweight 6-tab Gradio explorer with pre-cached Parquet data. No database dependency: instant load for quick exploration. |

Compute & Bidirectional Sync

While Databricks handles core data engineering, we use HF Jobs for workloads where a serverless Python environment is the right tool.

Examples:

  • Expected Threat grids run as a CPU-based HF Jobs pipeline that downloads SPADL data from an HF Dataset, runs Markov chain value iteration, and publishes xT grids back to the Hub.
  • The xG v2 neural model trains on an A10G GPU via HF Jobs: a Deep Sets architecture with MC dropout that processes 131K shots with 360 freeze-frame context and exports pure-NumPy weights for serverless inference.
  • Space Creation computes per-player counterfactual pitch control surfaces on an A10G via a JAX double-vmap, producing 875K player-frame values across 40K frames in under 6 minutes.
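
The Markov chain value iteration behind the Expected Threat grids can be sketched in a few lines. This is a minimal illustration in the spirit of Singh (2018), not the project's pipeline: the three-zone pitch, every probability below, and the function name `expected_threat` are assumptions made for the example, whereas the published grids use 12×8 zones estimated from SPADL actions.

```python
def expected_threat(shoot_prob, goal_prob, move_prob, transition,
                    max_iter=200, tol=1e-9):
    """Iterate xT(z) = s(z) * g(z) + m(z) * sum_j T[z][j] * xT(j).

    shoot_prob[z]    probability that the next action in zone z is a shot
    goal_prob[z]     probability that a shot from zone z scores
    move_prob[z]     probability that the next action moves the ball
    transition[z][j] probability that a move from z ends successfully in j
                     (rows may sum to less than 1 because of turnovers)
    """
    n_zones = len(shoot_prob)
    xt = [0.0] * n_zones
    for _ in range(max_iter):
        updated = []
        for z in range(n_zones):
            move_value = sum(transition[z][j] * xt[j] for j in range(n_zones))
            updated.append(shoot_prob[z] * goal_prob[z] + move_prob[z] * move_value)
        delta = max(abs(a - b) for a, b in zip(updated, xt))
        xt = updated
        if delta < tol:
            break
    return xt


# Hypothetical three-zone pitch: 0 = own third, 1 = middle, 2 = final third.
SHOOT = [0.0, 0.1, 0.5]
GOAL = [0.0, 0.05, 0.3]
MOVE = [1.0, 0.9, 0.5]
TRANSITION = [
    [0.5, 0.3, 0.05],
    [0.2, 0.4, 0.2],
    [0.1, 0.3, 0.4],
]

xt_values = expected_threat(SHOOT, GOAL, MOVE, TRANSITION)
```

Starting from zeros, each sweep adds the threat reachable with one more action, so zones closer to goal end up with higher values in this toy setup.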

All HF Jobs scripts use PEP 723 inline script metadata for zero-setup reproducibility.
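
For readers unfamiliar with PEP 723, the sketch below shows what an inline metadata block looks like and how a runner can locate it. The regular expression and the comment-stripping step are taken from the reference implementation in PEP 723 itself; the example script, its dependency list, and the helper name `read_metadata` are assumptions for illustration and do not reproduce any of the project's job scripts.

```python
import re
from typing import Optional

# A hypothetical job script. The commented block at the top is the PEP 723
# inline metadata that a compatible runner (for example uv) reads before
# executing the file.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy",
#   "huggingface_hub",
# ]
# ///
print("compute xT grids here")
'''

# Reference regular expression from PEP 723.
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_metadata(script: str, block_type: str = "script") -> Optional[str]:
    """Return the TOML text of the first matching metadata block, if any."""
    for match in BLOCK_RE.finditer(script):
        if match.group("type") == block_type:
            lines = match.group("content").splitlines(keepends=True)
            # Strip the leading "# " (or bare "#") from every line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in lines
            )
    return None


toml_text = read_metadata(EXAMPLE_SCRIPT)
```

The extracted text is ordinary TOML, so the script carries its own Python version and dependency requirements and needs no separate requirements file.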

Model weights published to HF Hub are synced back to Databricks UC Volumes for inference in the production Taipy app. This creates a bidirectional flow: Databricks produces training data → HF Hub hosts artifacts → Databricks consumes model weights for scoring.

Academic Foundations

Every analytics module is grounded in published research, cited directly in the platform UI:

| Module | Foundation |
|---|---|
| Pitch Control | Spearman, "Beyond Expected Goals" (2017) |
| Expected Threat | Karun Singh (2018), Markov chain value iteration |
| VAEP | Decroos et al., "Actions Speak Louder than Goals" (2019) |
| DEFCON | Kim et al., defensive contribution framework (2025) |
| Player Embeddings | Le & Mikolov, Doc2Vec (2014); Theiner et al., football2vec (2022) |
| Line-Breaking | Ward clustering on StatsBomb 360 freeze frames; adapted from Parma Calcio 1913 |
| xG Model | Rathke, "An examination of expected goals" (2017); XGBoost with isotonic calibration |
| PAUSA | Lee et al., "Valuing La Pausa: Quantifying Optimal Pass Timing Beyond Speed" (2026) |
| Space Creation | Fernández & Bornn, "Wide Open Spaces" (2018), differential OBSO integration |
| xG v2 Set Encoder | Zaheer et al., "Deep Sets" (NeurIPS 2017); Gal & Ghahramani, "Dropout as a Bayesian Approximation" (ICML 2016) |
| Pass Networks | Pena & Touchette, "A network theory analysis of football strategies" (2012) |
| ScoutGPT Decoder | Hong et al., "ScoutGPT: Player-conditioned Football Language Model for Counterfactual Evaluation" (2025, arXiv:2512.17266) |

Engineering Quality

The platform maintains professional-grade engineering standards:

  • Security: OAuth M2M everywhere, HTTPS-only, zero secrets in code, input validation on all identifiers, SSL verification enforced, JSON-only model serialization
  • Type safety: Pyright basic mode, Pydantic models for configuration
  • Testing: 1,118+ pytest unit tests (including performance benchmarks), 381+ dbt data quality tests
  • CI/CD: GitHub Actions with OIDC federation (zero-secret CI), ruff linting, import-linter boundary enforcement, pre-commit hooks
  • UX discipline: 71 of 78 findings resolved across two cognitive interface audits (CHI-AUDIT-180, CHI-AUDIT-190), grounded in 15 HCI frameworks including Norman, Sweller, Gergle, Kahneman, and Cleveland & McGill. Every metric has a help tooltip, every page has academic citations, and every analytics term is defined in a context-sensitive glossary.
  • AI governance: The project is assessed against Regulation (EU) 2024/1689 (the EU AI Act). Under the current operating posture — a solo research project on public data, not sold or licensed to clubs, not used for employment decisions — none of the thirteen per-player evaluative ML systems is classified as high-risk. Every model card carries an explicit intended-use / non-use stanza; the full gap analysis, conformity-assessment mapping, and re-classification triggers live in AI_GOVERNANCE.md. Enforcement is via src/tests/test_ai_governance_md.py, which fails CI if the document drifts from the workflow-card inventory or if the annual review date goes more than 30 days stale.
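
The two governance checks described in the last bullet (inventory drift and a stale review date) could look roughly like the sketch below. It is not the content of src/tests/test_ai_governance_md.py: the "Next review date:" line format, the helper names, and the sample document are assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# Hypothetical convention: the governance document contains a line such as
# "Next review date: 2026-04-01".
REVIEW_LINE = re.compile(
    r"^Next review date:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
)
MAX_OVERDUE = timedelta(days=30)


def review_is_stale(markdown: str, today: date) -> bool:
    """True if the review date is missing or more than 30 days in the past."""
    match = REVIEW_LINE.search(markdown)
    if match is None:
        return True
    review_date = date.fromisoformat(match.group(1))
    return today - review_date > MAX_OVERDUE


def cards_missing_from_doc(markdown: str, card_names) -> list:
    """Return card names from the inventory that the document never mentions."""
    return sorted(name for name in card_names if name not in markdown)


# Made-up document and inventory for demonstration.
SAMPLE_DOC = "Next review date: 2026-04-01\n- psxg-model\n- pitch-control\n"
SAMPLE_INVENTORY = ["psxg-model", "pitch-control", "defcon"]

missing = cards_missing_from_doc(SAMPLE_DOC, SAMPLE_INVENTORY)
stale_on_time = review_is_stale(SAMPLE_DOC, date(2026, 5, 1))
stale_overdue = review_is_stale(SAMPLE_DOC, date(2026, 5, 2))
```

In a pytest suite, each helper would sit behind a plain `assert`, so that either kind of drift turns the CI run red.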

Links

Named after Monty Python's Four Yorkshiremen sketch, where each comedian one-ups the others about how deprived their childhood was. In data engineering, moving from hand-managed EC2 instances and 5-hop Reverse ETL pipelines to serverless Lakebase truly is... right luxury.