01 / 10

🏏

CricketCaptain-LLM

A multi-agent RL training environment for strategic coherence — teaching LLMs to mean what they say across 300 consecutive decisions.

"I'll consolidate and preserve wickets" → then actually plays defensively for 4 overs.

Targets: Theme #1 Multi-Agent Interaction + Theme #2 Long-Horizon Planning

🌐 OpenEnv ≥ 0.2.2 ⚡ TRL MT-GRPO 🤖 google/gemma-4-26B-A4B-it 🔁 HF Router 📊 Cricsheet Markov Engine 🎯 WDCT Benchmark
02 / 10

The Gap We're Closing

WDCT Benchmark — Words & Deeds Consistency

arxiv:2503.07003 — the only public benchmark directly measuring whether LLMs execute what they declare.

0.49
Smaller models (7B class)
0.76
GPT-4 (best published)
No RL training environment has targeted this benchmark directly.
We built one.

Why It Matters Beyond Cricket

  • Planning agents that can't commit to strategy fail silently
  • Reasoning traces are only useful if they predict the next action
  • Chain-of-thought gains are undermined by declaration-execution drift
  • Every agentic system suffers from this; no training environment targets it
A model that says "preserve wickets" but plays aggressive shots hasn't learned strategy — it's learned to sound strategic.
03 / 10

Why Cricket Is the Right Stress Test

PropertyCapability Forced
300 consecutive decisions (50×6)Long-horizon coherence
10 wickets as irreversible budgetConsequence-aware risk planning
Powerplay → Middle → Death phasesStrategic revision at regime shifts
DLS par = ground-truth optimal scoreObjective performance signal
"Declare strategy, then play shots"Declaration-execution directly testable
LLM opponent (HF Router)Theory-of-mind / opponent modeling
Full match: bat & bowl both inningsEnd-to-end role adaptation

Why Not Chess / Math / Coding?

  • Chess: no natural language declarations; coherence untestable
  • Math: single-step; no 300-turn consistency requirement
  • Coding: rare phase transitions; no risk budget exhaustion
  • Cricket: declarations are mandatory tool calls, shots are mandatory tool calls — alignment is structurally enforced
Cricket is not the goal. It's the measurement apparatus for a capability that transfers to every agentic domain.
04 / 10

Environment Architecture

OpenEnv Server — State Machine

TOSS  →  BATTING  →  BOWLING  →  RESULT
Markov Engine
5-dim key: over × wickets × score_band × phase × bowler_type → (runs, wicket_fell)
Cricsheet Data
Ball-by-ball transition probs from real ODI/T20 matches; synthetic fallback
Format Mapper
T5 / T20 / ODI rules auto-selected by closest max_overs; phase-aware shot weights, batter & bowler roles from format_rules.json
Player Rosters
10 T20I team profiles; fuzzy name lookup (exact → surname → word-overlap); real aggression/style fed into select_batter / choose_bowler
Tool Budget
3 overhead calls/over (analyze, reflect, plan_delivery, set_strategy, set_bowling_strategy); −0.04 fine per excess call; plan_shot budget-free
LLM Opponent
google/gemma-4-26B-A4B-it via HF Router (default); graceful heuristic fallback; llm_cached mode for reproducible eval
DLS Par
Duckworth-Lewis par score as objective target; used in r_result
Concurrent Sessions
SUPPORTS_CONCURRENT_SESSIONS = True; max 4 parallel envs

12 Tools — 4 Categories

PLANNING

call_toss set_match_plan update_match_plan

BATTING EXECUTION

set_strategy plan_shot play_delivery

BOWLING EXECUTION

choose_bowler set_bowling_strategy bowl_delivery

ANALYSIS

analyze_situation reflect_after_ball
Tools are phase-gated — batting tools unavailable during bowling, etc. Invalid phase = 0 reward turn.
05 / 10

What the Agent Sees — Observation Space

Per-Turn Observation (structured JSON + rendered text)

// game_context
{ "over": 14, "ball": 3, "score": 112, "wickets": 2,
  "run_rate": 7.8, "req_rate": 8.4, "phase": "MIDDLE",
  "bowler_type": "spin", "field_setting": "Attacking" }

// declared_strategy (agent's own prior declaration)
{ "phase_intent": "consolidate", "aggression": 0.35,
  "rationale": "Preserve wickets, build platform" }

// tool_budget (per-over overhead counter)
{ "overhead_calls_this_over": 1, "budget": 3,
  "remaining": 2, "fines_accumulated": 0.0 }

// last_outcome
{ "runs": 1, "wicket": false, "extras": 0 }

// available_tools + tool_history (last 5)

State Fields Used as Reward Signals

coherence_scores[ ]
Per-delivery aggression_match × rationale_specificity × phase_fit
adaptation_scores[ ]
Strategy updated after wicket / phase shift; 0 if stuck
opponent_awareness_scores[ ]
Response to opponent's stated field/line changes
regret_scores[ ]
Counterfactual: did agent outperform or underperform heuristic baseline?
plan_commitment_scores[ ]
Keyword overlap: match_plan rationale → delivery rationale
plan_staleness_penalties[ ]
Penalty if plan not refreshed for 2+ overs when context shifted
prompt_text is a rendered summary of all above — fed directly to the LLM. Strategy extracted from rendered text for stateless GRPO.
06 / 10

4-Rubric Composite Reward — Hard to Game

RubricWeightFrequencyMeasuresKey Sub-signals
r_cricket 45% Per ball Dream11 proxy: runs, wickets, milestones dot%, boundary%, 50s/100s, maiden overs, economy
r_behavior 25% Every turn Declaration-execution alignment coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%)
r_result 20% Innings/episode end Win/loss vs DLS par, target margin score/par, wickets_remaining, lead/deficit, +0.25 progress bonus
r_validity 10% Every turn Parseable XML/JSON tool call Format gate; 0 = parse fail, 1 = valid

Rebalanced from 55/25/15/5 → 45/25/20/10 to match the SWE-RL recipe (60% intermediate / 40% terminal). Reasoning: partial-trajectory training rarely fires r_result; weighting it 55% wastes gradient on a near-zero signal.

Coherence Score Formula (per delivery)

coherence = (
  aggression_match   # |declared_aggression - shot_aggression_proxy|
  × rationale_specificity # min(words / 15, 1.0)
  × phase_appropriate  # 1.0 if shot fits phase norms, 0.6 otherwise
)

Single-Stage Training + Format Curriculum

  • Warmup (2–3 over curriculum): per-scenario max_overs sampled from [2,2,2,2,2,2,3,3,3] so episodes complete in budget and r_result can fire
  • Main run (5-over end-to-end): resumes warmup adapter, trains on target eval distribution
  • Qwen3-4B-Instruct-2507 emits <tool_call>...</tool_call> natively — no Stage 1 SFT needed
  • GRPO group size = 4; full episode advantages (TRL environment_factory)
07 / 10

Training Pipeline — Stateless GRPO

The Core Technical Insight

TRL's GRPOTrainer requires a stateless reward function: reward_fn(prompts, completions) → list[float]. No env.step() inside.

The strategy the agent declared is embedded in the rendered prompt_text as "Batting Strategy: …". We parse it back with regex — no shared env state needed.
_STRATEGY_RE = re.compile(
  r"Batting Strategy:\s*(.+)$", re.MULTILINE
)
_PHASE_RE    = re.compile(
  r"Phase:\s+(POWERPLAY|MIDDLE|DEATH)", re.I
)

# r_behavior scored from (prompt, completion) alone
# r_result injected at episode end → all turns

Stack

gemma-4-26B-A4B-it Unsloth 4-bit LoRA TRL GRPOTrainer MT-GRPO per-turn advantage

End-to-End Commands

# 1. Start server
uvicorn server.app:app --port 8766

# 2. Sanity test (3-over match, heuristic)
python train.py train-smoke \
  --config configs/default.yaml \
  --matches 1 --max-overs 3

# 3. Live LLM match (HF Router)
python inference.py \
  --config configs/default.yaml \
  --max-overs 3 --opponent-mode llm_live

# 4. Warmup → Main chained run (auto-resumes adapter)
bash scripts/run_warmup_then_main.sh

# 5. Eval: untrained vs trained head-to-head
python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
  --label baseline --episodes 20 --max-overs 5 \
  --output eval_results/baseline.json
python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
  --adapter ./checkpoints/stage2_final \
  --label trained --episodes 20 --max-overs 5 \
  --output eval_results/trained.json
python compare_eval.py --compare \
  eval_results/baseline.json eval_results/trained.json
All model / API / env settings live in configs/default.yaml. Zero hardcoding.
08 / 10

What We Measured — Baseline Results

0%
Parse error rate across all 14 runs — heuristic, cricsheet, llm_live opponents
1.0
r_validity across all 9 train-smoke matches (3 opponent modes × 3 matches)
0.62
Peak mean coherence (train-smoke, cricsheet opponent, 5-over)
3
Opponent modes verified end-to-end: heuristic · cricsheet · llm_live

What training should produce (target)

  • r_validity: 0.70 → 0.98+ after warmup (25 steps)
  • Coherence: ~0.52 (random) → 0.75+ after main run
  • analyze_situation calls cluster at over 6, 16, 36 transitions
  • Strategy declarations become more specific (>15 word rationales)
  • Shot choices match declared aggression level >80% of deliveries

Reward signals verified working ✅

  • plan_commitment_scores populated per delivery
  • plan_staleness_penalties active at over-end
  • coherence_scores differentiate matching vs mismatching strategies
  • adaptation_scores fire on wicket loss & phase transitions
  • opponent_awareness_scores respond to field change
All signals verified. Full reward curves pending GRPO training run. Colab notebook ready.
09 / 10

Why This Scores on Every Judging Dimension

CriterionWeightOur AngleEvidence
Environment Innovation 40% First RL env purpose-built for WDCT / declaration-execution alignment 12-tool multi-phase, multi-agent, 300-decision episodes with Cricsheet Markov engine
Storytelling 30% Clear capability gap → environment design → reward signal → training → measurable WDCT improvement README, this slide deck, Colab notebook, Gradio live demo
Improvement in Rewards 20% Two-stage curriculum produces observable r_validity spike then coherence rise Colab plots, before/after tool call samples, coherence heatmap
Reward & Pipeline 10% 4-rubric composite, hard to game (r_result at episode end enforces real match outcomes) server/reward_calculator.py, server/coherence_grader.py, stateless GRPO reward fn
Unique technical contributions:
Stateless GRPO via prompt-text parsing · Plan staleness penalty · Per-delivery commitment scoring · Phase-gated tool availability · LLM opponent via HF Router · Format-aware rules (T5/T20/ODI) · Tool budget + fine system · Real player roster lookup
Minimum requirements met:
✅ OpenEnv latest   ✅ TRL/Unsloth Colab   ✅ HF Space (ready)   ✅ README with results
✅ 3 opponent modes verified (heuristic · cricsheet · llm_live)   ⚠️ Blog / video post-training
10 / 10

Roadmap to Submission

🔴 Critical Path (on-site, Day 1–2)

  • Run Colab notebook (notebooks/colab_train_minimal.ipynb) → warmup → main chained training
  • Export: reward_curves.png, coherence_heatmap.png, tool_timeline.png
  • Deploy to HuggingFace Spaces → live interactive Gradio demo URL
  • Add HF Space URL + plot images to README
  • Write 500-word mini-blog on HF (problem → env → results)
  • Run Cricsheet data curation (scripts/curate_transitions.py) for real ball probs

💰 The Money Shot for Judges

A heatmap: episode × delivery coherence score, showing the gradient rising from ~0.35 (random) toward 0.75+ as training progresses. This directly visualizes the declared coherence improvement.

🟡 Stretch (improves score)

  • WDCT before/after comparison on canonical states (Over 35, 180/3)
  • Opponent cache for reproducible eval without API calls
  • <2 min screen demo video (Gradio UI + reward curve walkthrough)
github.com/[team]/cricket-captain-llm  ·  huggingface.co/spaces/[team]/cricket-captain
← → arrow keys to navigate