(Right! Luxury!) Lakehouse

AI & ML interests

Soccer analytics, sports analytics, player embeddings, pitch control, action valuation, expected threat, tracking data, VAEP, Doc2Vec, entity resolution, pgvector, defensive valuation, line-breaking passes, physics-based models


"Luxury! We used to dream of serverless!"

Open-source soccer analytics platform built on Databricks Lakebase — replacing a 6-service traditional AWS pipeline with a unified lakehouse architecture that scales to zero. The Hugging Face Hub serves as the public distribution layer for models, datasets, and interactive demos.

Try it now: Full Dashboard — 16-page Taipy app with live data from 380+ matches across 5 providers. Or explore the Gradio Demo for a quick look.

Platform Scale & Data Engineering

The infrastructure uses a Medallion architecture (Bronze → Silver → Gold) provisioned entirely via Terraform IaC, unifying multi-vendor event and tracking data into a single analytical layer.

38M+ tracking frames ingested from three optical tracking providers (25fps and 10fps)
5 distinct data sources unified: StatsBomb, Wyscout, Metrica Sports, IDSSE (Bundesliga), and SkillCorner (A-League)
16 Taipy dashboard pages deployed on Hugging Face Spaces (Docker SDK), querying Lakebase PostgreSQL via Databricks OAuth
34 synced tables with Zero-ETL continuous sync from Gold Delta Lake to Lakebase PostgreSQL 17
56 PostgreSQL indexes (50 btree + 6 HNSW vector indexes: 4×128d + 2×144d) for sub-10ms OLTP queries
Pipeline reliability enforced through 1,118+ unit tests and 381+ dbt data tests
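
The HNSW vector indexes listed above serve nearest-neighbour lookups over the 128-d and 144-d embeddings. As a rough illustration of what such a lookup computes, here is an exact brute-force search in plain Python. It assumes cosine distance (the card does not say which pgvector distance operator is used), and the player names and 3-d vectors are invented for the example. An HNSW index returns an approximation of this ranking without scanning every row.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(a, b); 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_players(query, embeddings, k=3):
    """Exact k-nearest neighbours by cosine distance (brute-force scan)."""
    scored = [(cosine_distance(query, vec), name) for name, vec in embeddings.items()]
    scored.sort()
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings; the real vectors are 128-d or 144-d.
embeddings = {
    "player_a": [1.0, 0.0, 0.0],
    "player_b": [0.9, 0.1, 0.0],
    "player_c": [0.0, 1.0, 0.0],
    "player_d": [0.0, 0.0, 1.0],
}

closest = nearest_players([1.0, 0.05, 0.0], embeddings, k=2)
```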

The Hugging Face Footprint

All public artifacts are hosted entirely within the HF ecosystem.

Models

Model	Architecture	Scale
football2vec-v2	Transformer encoder (128-dim) + adversarial team debiasing (Ganin GRL)	87K per-match vectors across 8,950 players, debiased for team identity
football2vec-statsbomb-wyscout	Doc2Vec (PV-DM) 32-dim behavioral embeddings (v1 baseline)	87K per-match vectors across 8,950 players from ~3,000 matches
xg-model-statsbomb-wyscout	Calibrated XGBoost + logistic baseline (13 features)	Trained on ~131K shots, ROC-AUC 0.979 on held-out test set
vaep-model-statsbomb-wyscout	2× XGBClassifier (P(scores) + P(concedes))	Trained on ~2,388 matches from StatsBomb + Wyscout
xg-v2-model-set-encoder	Deep Sets (Zaheer et al. 2017) + MC dropout (Gal & Ghahramani 2016)	ROC-AUC 0.915, trained on ~131K shots with 360 freeze frames
psxg-model	Logistic regression on goalmouth coordinates (Butcher et al. 2025)	Trained on ~15K on-target shots, JSON-serialised weights
football2vec-360	Transformer encoder (128-dim) + Deep Sets 360 context (16-dim) = 144-dim	323 StatsBomb 360 matches, adversarial team debiasing
pitch-control	Physics-based team-control probability surface (Spearman 2017)	Heuristic method card — no trained weights; substrate for OBSO / Off-Ball xT / Space Creation
defcon	XGBoost counterfactual value estimator (Kim et al. 2025 DEFCON-lite)	Inline-trained per run; per-defender credit assignment on open 360/tracking data
off-ball-xt	Heuristic xT × pitch-control attribution (Singh 2018, Spearman 2017)	Method card — attributes attacking threat to off-ball players
obso-pausa-method	OBSO surface + PAUSA pass timing (Spearman 2018; Fernández & Bornn 2018; Lee et al. 2026)	Method card — pass-timing counterfactuals on GPU-accelerated OBSO
space-creation-method	Counterfactual pitch control (Fernández & Bornn 2018)	Method card — per-player EPV-weighted space-creation value
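
The two-classifier VAEP setup in the table (P(scores) and P(concedes)) turns into per-action values by differencing the probabilities before and after each action, following Decroos et al. The sketch below assumes the probabilities are already computed and that the same team stays in possession across the sequence; the function name and the sample numbers are illustrative, not taken from the project.

```python
def vaep_values(p_scores, p_concedes):
    """Per-action VAEP values from consecutive game-state probabilities.

    p_scores[i] and p_concedes[i] are the team's scoring and conceding
    probabilities in state i; action i moves the game from state i-1 to i.
    Assumption: possession does not change team within the sequence.
    """
    if len(p_scores) != len(p_concedes) or len(p_scores) < 2:
        raise ValueError("need two aligned lists with at least two states")
    values = []
    for i in range(1, len(p_scores)):
        offensive = p_scores[i] - p_scores[i - 1]
        # Lowering the chance of conceding is a positive contribution.
        defensive = -(p_concedes[i] - p_concedes[i - 1])
        values.append(
            {"offensive": offensive, "defensive": defensive, "total": offensive + defensive}
        )
    return values


# Illustrative probabilities for three states (two actions).
example = vaep_values([0.02, 0.05, 0.12], [0.010, 0.010, 0.005])
```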

All model serialization uses JSON envelopes — zero pickle files (banned by project security policy). Every model card above carries an EU AI Act — Intended Use and Non-Use stanza per the project's AI_GOVERNANCE.md gap analysis (SEC1, April 2026).
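
A minimal sketch of what a JSON envelope for a small logistic model might look like, to show why pickle is not needed for models of this kind. The field names (`format`, `version`, `feature_names`, `weights`, `bias`) are assumptions for illustration; the project's actual envelope schema is not documented in this card.

```python
import json
import math


def dump_envelope(weights, bias, feature_names):
    """Serialise logistic-regression parameters as a plain JSON string."""
    envelope = {
        "format": "logistic-regression",  # hypothetical schema tag
        "version": 1,
        "feature_names": feature_names,
        "weights": weights,
        "bias": bias,
    }
    return json.dumps(envelope, sort_keys=True)


def load_envelope(text):
    """Parse and validate an envelope; no code is executed on load."""
    envelope = json.loads(text)
    if envelope.get("format") != "logistic-regression":
        raise ValueError("unexpected model format")
    if len(envelope["weights"]) != len(envelope["feature_names"]):
        raise ValueError("weights and feature_names differ in length")
    return envelope


def predict_proba(envelope, features):
    """Logistic prediction from a dict of feature name -> value."""
    z = envelope["bias"] + sum(
        w * features[name]
        for w, name in zip(envelope["weights"], envelope["feature_names"])
    )
    return 1.0 / (1.0 + math.exp(-z))


# Round trip with made-up weights for two goalmouth coordinates.
text = dump_envelope([0.8, -0.5], -1.0, ["goalmouth_y", "goalmouth_z"])
model = load_envelope(text)
probability = predict_proba(model, {"goalmouth_y": 1.0, "goalmouth_z": 0.5})
```

Unlike unpickling, loading this file only parses data, so a tampered artifact cannot run arbitrary code at load time.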

Datasets

Dataset	Scale	Description
spadl-vaep-action-values	~9.5M actions	Per-action offensive/defensive VAEP valuations
line-breaking-passes	~5M passes	All passes with defensive line-breaking labels via Ward clustering on 360 freeze frames
football2vec-player-embeddings	87K vectors	Pre-computed behavioral (128-d transformer) + statistical (13-d) player vectors
football2vec-training-data	~87K sequences	Tokenized SPADL action sequences for transformer training
pitch-control-tracking	38M frames	Per-player per-frame Spearman (2017) physics-based pitch control
expected-threat-grids	12×8 grid	Data-driven Expected Threat values computed from 2.2M SPADL actions
obso-pausa-inputs	7 matches	ELASTIC-synced event-tracking inputs for OBSO/PAUSA computation
obso-pausa-values	~3,500 passes	PAUSA pass timing scores with OBSO temporal/spatial decomposition
obso-trained-grids	8 competitions + global	Data-driven ball reachability (100×64) + EPV (50×32) grids for OBSO
xg-freeze-frame-data	137K player rows	StatsBomb 360 freeze-frame player positions for xG v2 set encoder
xg-shot-data	131K shots	Tabular shot features from StatsBomb + Wyscout for xG model training
space-creation-values	875K player-frames	Per-player space creation/destruction via differential OBSO (Fernández & Bornn 2018)
statsbomb-shots-on-target	~15K shots	On-target shots with goalmouth coordinates for PSxG training
psxg-predictions	~15K shots	Per-shot PSxG probabilities from logistic model
football2vec-360-training-data	~2M actions	SPADL action sequences with 360 freeze frame context
football2vec-statsbomb-wyscout	87K vectors	Per-match v1 Doc2Vec (32-dim) + v2 transformer (128-dim) raw embeddings
football2vec-360-embeddings	~4K players	144-dim player embeddings from 360-enriched model
scoutgpt-training-data	894K episodes	SPADL possession episodes with per-action player attribution (Hong et al. 2025)

Interactive Spaces

Space	What it is
Soccer Analytics App	Full 16-page Taipy dashboard (Docker SDK) querying Lakebase PostgreSQL via Databricks OAuth. Live data from 380+ matches. Shot maps, pass networks, player comparison, GK analytics, tactical positions, pitch control, PAUSA pass timing, DEFCON defensive pressure, and more.
Soccer Analytics Demo	Lightweight 6-tab Gradio explorer with pre-cached Parquet data. No database dependency — instant load for quick exploration.

Compute & Bidirectional Sync

While Databricks handles core data engineering, we use HF Jobs for workloads where a serverless Python environment is the right tool.

Examples:

Expected Threat grids run as a CPU-based HF Jobs pipeline that downloads SPADL data from an HF Dataset, runs Markov chain value iteration, and publishes the xT grids back to the Hub.
The xG v2 neural model trains on an A10G GPU via HF Jobs: a Deep Sets architecture with MC dropout that processes 131K shots with 360 freeze-frame context and exports pure-NumPy weights for serverless inference.
Space Creation computes per-player counterfactual pitch control surfaces on an A10G via a JAX double-vmap, producing 875K player-frame values across 40K frames in under 6 minutes.
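
The Markov chain value iteration behind the Expected Threat grids can be sketched in a few lines. This follows the usual xT formulation (threat of a zone = shoot probability × goal probability + move probability × expected threat of the destination); the three-zone pitch and every probability below are made up for illustration, whereas the published grids use 12×8 zones estimated from SPADL actions.

```python
def expected_threat(shoot_prob, goal_prob, move_prob, transition, iterations=50):
    """Iterate xT(z) = s_z * g_z + m_z * sum_t T[z][t] * xT(t) from zero."""
    n = len(shoot_prob)
    xt = [0.0] * n
    for _ in range(iterations):
        xt = [
            shoot_prob[z] * goal_prob[z]
            + move_prob[z] * sum(transition[z][t] * xt[t] for t in range(n))
            for z in range(n)
        ]
    return xt


# Toy pitch with three zones: own half, midfield, final third.
shoot_prob = [0.0, 0.1, 0.5]
goal_prob = [0.0, 0.05, 0.3]
move_prob = [1.0, 0.9, 0.5]
# Rows sum to less than 1: the remainder models moves that lose the ball.
transition = [
    [0.4, 0.4, 0.0],
    [0.2, 0.3, 0.4],
    [0.0, 0.3, 0.5],
]

xt = expected_threat(shoot_prob, goal_prob, move_prob, transition)
```

Each iteration extends the horizon by one more move, so the values grow from zero toward a fixed point, with zones closer to goal ending up with higher threat.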

All HF Jobs scripts use PEP 723 inline script metadata for zero-setup reproducibility.
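
For readers unfamiliar with PEP 723, the metadata is a TOML block embedded in comments at the top of the script, which a compatible runner reads before installing dependencies. The sketch below shows such a block inside a sample script string and a small stdlib-only reader for it. The listed dependencies are placeholders, not the project's actual requirements, and `read_inline_metadata` is a hypothetical helper written for this example.

```python
SAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "huggingface_hub",
# ]
# ///

print("job body goes here")
'''


def read_inline_metadata(script_text):
    """Return the TOML text of the first `script` metadata block, or None."""
    lines = script_text.splitlines()
    try:
        start = lines.index("# /// script")
    except ValueError:
        return None
    body = []
    for line in lines[start + 1:]:
        if line == "# ///":
            return "\n".join(body)
        if line == "#":
            body.append("")
        elif line.startswith("# "):
            body.append(line[2:])
        else:
            return None  # a non-comment line appeared before the block closed
    return None


metadata = read_inline_metadata(SAMPLE_SCRIPT)
```

Because the requirements travel with the file, a single script is enough to reproduce a job without a separate requirements file or environment definition.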

Model weights published to the HF Hub are synced back to Databricks Unity Catalog (UC) Volumes for inference in the production Taipy app. The flow is bidirectional: Databricks produces training data → HF Hub hosts artifacts → Databricks consumes model weights for scoring.

Academic Foundations

Every analytics module is grounded in peer-reviewed research, cited directly in the platform UI:

Module	Foundation
Pitch Control	Spearman et al., "Physics-Based Modeling of Pass Probabilities in Soccer" (2017)
Expected Threat	Karun Singh (2018), Markov chain value iteration
VAEP	Decroos et al., "Actions Speak Louder than Goals" (2019)
DEFCON	Kim et al., defensive contribution framework (2025)
Player Embeddings	Le & Mikolov, Doc2Vec (2014); Theiner et al., football2vec (2022)
Line-Breaking	Ward clustering on StatsBomb 360 freeze frames; adapted from Parma Calcio 1913
xG Model	Rathke, "An examination of expected goals" (2017); XGBoost with isotonic calibration
PAUSA	Lee et al., "Valuing La Pausa: Quantifying Optimal Pass Timing Beyond Speed" (2026)
Space Creation	Fernández & Bornn, "Wide Open Spaces" (2018), differential OBSO integration
xG v2 Set Encoder	Zaheer et al., "Deep Sets" (NeurIPS 2017); Gal & Ghahramani, "Dropout as a Bayesian Approximation" (ICML 2016)
Pass Networks	Peña & Touchette, "A network theory analysis of football strategies" (2012)
ScoutGPT Decoder	Hong et al., "ScoutGPT: Player-conditioned Football Language Model for Counterfactual Evaluation" (2025, arXiv:2512.17266)
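
The Deep Sets and MC dropout combination cited for the xG v2 Set Encoder can be illustrated with a tiny forward pass. Deep Sets encodes each freeze-frame player independently, sum-pools the encodings (so player order does not matter), and maps the pooled vector to a logit; MC dropout keeps dropout active at inference and reports the spread over repeated passes as an uncertainty estimate. The layer sizes, weights, and two-feature player representation below are invented for the example and are far smaller than any real model.

```python
import math
import random
import statistics


def phi(player, w_phi):
    """Per-player encoder: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, player))) for row in w_phi]


def deep_sets_logit(players, w_phi, w_rho, rng=None, drop_p=0.0):
    """Sum-pool per-player encodings, optionally apply dropout, project to a logit."""
    pooled = [0.0] * len(w_phi)
    for player in players:
        for i, h in enumerate(phi(player, w_phi)):
            pooled[i] += h
    if drop_p > 0.0:
        # Inverted dropout: zero some units, rescale the survivors.
        pooled = [0.0 if rng.random() < drop_p else h / (1.0 - drop_p) for h in pooled]
    return sum(w * h for w, h in zip(w_rho, pooled))


def mc_dropout_xg(players, w_phi, w_rho, passes=200, drop_p=0.2, seed=0):
    """Mean and standard deviation of xG over stochastic forward passes."""
    rng = random.Random(seed)
    samples = []
    for _ in range(passes):
        logit = deep_sets_logit(players, w_phi, w_rho, rng, drop_p)
        samples.append(1.0 / (1.0 + math.exp(-logit)))
    return statistics.fmean(samples), statistics.pstdev(samples)


# Hypothetical freeze frame: (dx, dy) of three players relative to the shooter.
players = [(1.0, 2.0), (3.0, 0.5), (0.5, 1.5)]
w_phi = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.6]]
w_rho = [0.4, -0.6, 0.3]

xg_mean, xg_std = mc_dropout_xg(players, w_phi, w_rho)
```

Sum pooling is what lets the encoder accept a variable number of visible players per freeze frame.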

Engineering Quality

The platform maintains professional-grade engineering standards:

Security: OAuth M2M everywhere, HTTPS-only, zero secrets in code, input validation on all identifiers, SSL verification enforced, JSON-only model serialization
Type safety: Pyright basic mode, Pydantic models for configuration
Testing: 1,118+ pytest unit tests (including performance benchmarks), 381+ dbt data quality tests
CI/CD: GitHub Actions with OIDC federation (zero-secret CI), ruff linting, import-linter boundary enforcement, pre-commit hooks
UX discipline: 71 of 78 findings resolved across two cognitive interface audits (CHI-AUDIT-180, CHI-AUDIT-190), grounded in 15 HCI frameworks including Norman, Sweller, Gergle, Kahneman, and Cleveland & McGill. Every metric has a help tooltip, every page has academic citations, and every analytics term is defined in a context-sensitive glossary.
AI governance: The project is assessed against Regulation (EU) 2024/1689 (the EU AI Act). Under the current operating posture — a solo research project on public data, not sold or licensed to clubs, not used for employment decisions — none of the thirteen per-player evaluative ML systems is classified as high-risk. Every model card carries an explicit intended-use / non-use stanza; the full gap analysis, conformity-assessment mapping, and re-classification triggers live in AI_GOVERNANCE.md. Enforcement is via src/tests/test_ai_governance_md.py, which fails CI if the document drifts from the workflow-card inventory or if the annual review date goes more than 30 days stale.
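
The staleness half of that CI check could look roughly like the sketch below. This is not the contents of src/tests/test_ai_governance_md.py; the `Next review due:` line format and the function name are assumptions made for illustration, and "stale" is read here as more than 30 days past the stated review date.

```python
import re
from datetime import date, timedelta

# Hypothetical line format inside the governance document.
REVIEW_LINE = re.compile(r"^Next review due:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def review_is_stale(doc_text, today, grace_days=30):
    """True when `today` is more than `grace_days` past the review date."""
    match = REVIEW_LINE.search(doc_text)
    if match is None:
        raise ValueError("no 'Next review due:' line found")
    due = date.fromisoformat(match.group(1))
    return today > due + timedelta(days=grace_days)


sample_doc = "# AI governance\n\nNext review due: 2026-04-01\n"
on_time = review_is_stale(sample_doc, date(2026, 5, 1))
overdue = review_is_stale(sample_doc, date(2026, 5, 2))
```

In a test suite, the function would be called with `date.today()` and wrapped in an assertion so that an overdue review fails the build.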

Links

License: Apache 2.0

Named after Monty Python's Four Yorkshiremen sketch, where each comedian one-ups the others about how deprived their childhood was. In data engineering, moving from hand-managed EC2 instances and 5-hop Reverse ETL pipelines to serverless Lakebase truly is... right luxury.