🏏
A multi-agent RL training environment for strategic coherence — teaching LLMs to mean what they say across 300 consecutive decisions.
"I'll consolidate and preserve wickets" → then actually plays defensively for 4 overs.
Targets: Theme #1 Multi-Agent Interaction + Theme #2 Long-Horizon Planning
arXiv:2503.07003, the only public benchmark directly measuring whether LLMs execute what they declare.
| Property | Capability Forced |
|---|---|
| 300 consecutive decisions (50×6) | Long-horizon coherence |
| 10 wickets as irreversible budget | Consequence-aware risk planning |
| Powerplay → Middle → Death phases | Strategic revision at regime shifts |
| DLS par = ground-truth optimal score | Objective performance signal |
| "Declare strategy, then play shots" | Declaration-execution directly testable |
| LLM opponent (HF Router) | Theory-of-mind / opponent modeling |
| Full match: bat & bowl both innings | End-to-end role adaptation |
Tool groups (see `format_rules.json`):
- PLANNING
- BATTING EXECUTION
- BOWLING EXECUTION
- ANALYSIS
```jsonc
// game_context
{ "over": 14, "ball": 3, "score": 112, "wickets": 2, "run_rate": 7.8,
  "req_rate": 8.4, "phase": "MIDDLE", "bowler_type": "spin",
  "field_setting": "Attacking" }

// declared_strategy (agent's own prior declaration)
{ "phase_intent": "consolidate", "aggression": 0.35,
  "rationale": "Preserve wickets, build platform" }

// tool_budget (per-over overhead counter)
{ "overhead_calls_this_over": 1, "budget": 3, "remaining": 2,
  "fines_accumulated": 0.0 }

// last_outcome
{ "runs": 1, "wicket": false, "extras": 0 }

// available_tools + tool_history (last 5)
```
`prompt_text` is a rendered summary of all of the above, fed directly to the LLM. The declared strategy is later extracted back out of that rendered text, which keeps the GRPO reward function stateless.
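A minimal sketch of that rendering step, assuming the field names from the observation above (`render_prompt` is illustrative, not the repo's actual function):

```python
def render_prompt(obs: dict) -> str:
    # Flatten the structured observation into the text the LLM actually sees.
    gc, ds = obs["game_context"], obs["declared_strategy"]
    return (
        f"Phase: {gc['phase']} | Over {gc['over']}.{gc['ball']} | "
        f"{gc['score']}/{gc['wickets']} (RR {gc['run_rate']}, req {gc['req_rate']})\n"
        f"Bowler: {gc['bowler_type']} | Field: {gc['field_setting']}\n"
        f"Batting Strategy: {ds['rationale']} (aggression {ds['aggression']})"
    )
```

The `Phase:` and `Batting Strategy:` lines are exactly what the regexes further down recover.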
| Rubric | Weight | Frequency | Measures | Key Sub-signals |
|---|---|---|---|---|
| r_cricket | 45% | Per ball | Dream11 proxy: runs, wickets, milestones | dot%, boundary%, 50s/100s, maiden overs, economy |
| r_behavior | 25% | Every turn | Declaration-execution alignment | coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| r_result | 20% | Innings/episode end | Win/loss vs DLS par, target margin | score/par, wickets_remaining, lead/deficit, +0.25 progress bonus |
| r_validity | 10% | Every turn | Parseable XML/JSON tool call | Format gate; 0 = parse fail, 1 = valid |
Rebalanced from 55/25/15/5 to 45/25/20/10, following the SWE-RL recipe (roughly 60% intermediate / 40% terminal). Reasoning: partial-trajectory training rarely fires r_result, so over-weighting it wastes gradient on a near-zero signal.
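A minimal sketch of the composite under these weights; the real logic lives in `server/reward_calculator.py`, and the gate behavior shown here is an assumption:

```python
WEIGHTS = {"cricket": 0.45, "behavior": 0.25, "result": 0.20, "validity": 0.10}

def composite_reward(r: dict[str, float]) -> float:
    # One reading of the "format gate": an unparseable tool call (r_validity = 0)
    # zeroes the whole turn. The repo's exact gating may differ.
    if r["validity"] == 0.0:
        return 0.0
    return sum(w * r[k] for k, w in WEIGHTS.items())
```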
r_behavior's dominant sub-signal, coherence, is the product of three factors:

```python
coherence = (
    aggression_match          # 1 - |declared_aggression - shot_aggression_proxy|
    * rationale_specificity   # min(words / 15, 1.0)
    * phase_appropriate       # 1.0 if shot fits phase norms, 0.6 otherwise
)
```
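Worked example: declared aggression 0.35 against a shot proxy of 0.50 gives aggression_match = 1 - 0.15 = 0.85; a 9-word rationale gives 9/15 = 0.6; a phase-appropriate shot gives 1.0, so coherence = 0.85 × 0.6 × 1.0 = 0.51.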
- `max_overs` sampled from `[2,2,2,2,2,2,3,3,3]`, so episodes complete within the step budget and r_result can actually fire
- The model emits `<tool_call>...</tool_call>` natively, so no Stage 1 SFT is needed
- The environment is instantiated via `environment_factory`
- TRL's `GRPOTrainer` requires a stateless reward function, `reward_fn(prompts, completions) -> list[float]`, with no `env.step()` inside
The agent's declaration is rendered into `prompt_text` as "Batting Strategy: …". We parse it back with regex; no shared env state needed.
```python
_STRATEGY_RE = re.compile(
    r"Batting Strategy:\s*(.+)$", re.MULTILINE
)
_PHASE_RE = re.compile(
    r"Phase:\s+(POWERPLAY|MIDDLE|DEATH)", re.I
)
# r_behavior scored from (prompt, completion) alone
# r_result injected at episode end → all turns
```
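Put together, a sketch of the stateless reward function TRL calls; `score_behavior` here is an illustrative stand-in for the actual coherence grader:

```python
import re

_STRATEGY_RE = re.compile(r"Batting Strategy:\s*(.+)$", re.MULTILINE)

def score_behavior(declared: str, completion: str) -> float:
    # Stand-in: the real grader compares declared vs. executed aggression.
    return 1.0 if declared and declared.split()[0].lower() in completion.lower() else 0.0

def reward_fn(prompts: list[str], completions: list[str], **kwargs) -> list[float]:
    # Stateless: everything needed is re-parsed from the rendered prompt text.
    rewards = []
    for prompt, completion in zip(prompts, completions):
        m = _STRATEGY_RE.search(prompt)
        declared = m.group(1) if m else ""
        rewards.append(score_behavior(declared, completion))
    return rewards
```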
```bash
# 1. Start server
uvicorn server.app:app --port 8766

# 2. Sanity test (3-over match, heuristic)
python train.py train-smoke \
  --config configs/default.yaml \
  --matches 1 --max-overs 3

# 3. Live LLM match (HF Router)
python inference.py \
  --config configs/default.yaml \
  --max-overs 3 --opponent-mode llm_live

# 4. Warmup → Main chained run (auto-resumes adapter)
bash scripts/run_warmup_then_main.sh

# 5. Eval: untrained vs trained head-to-head
python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
  --label baseline --episodes 20 --max-overs 5 \
  --output eval_results/baseline.json
python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
  --adapter ./checkpoints/stage2_final \
  --label trained --episodes 20 --max-overs 5 \
  --output eval_results/trained.json
python compare_eval.py --compare \
  eval_results/baseline.json eval_results/trained.json
```
Every knob lives in `configs/default.yaml`. Zero hardcoding.
| Criterion | Weight | Our Angle | Evidence |
|---|---|---|---|
| Environment Innovation | 40% | First RL env purpose-built for WDCT / declaration-execution alignment | 12-tool multi-phase, multi-agent, 300-decision episodes with Cricsheet Markov engine |
| Storytelling | 30% | Clear capability gap → environment design → reward signal → training → measurable WDCT improvement | README, this slide deck, Colab notebook, Gradio live demo |
| Improvement in Rewards | 20% | Two-stage curriculum produces observable r_validity spike then coherence rise | Colab plots, before/after tool call samples, coherence heatmap |
| Reward & Pipeline | 10% | 4-rubric composite, hard to game (r_result at episode end enforces real match outcomes) | `server/reward_calculator.py`, `server/coherence_grader.py`, stateless GRPO reward fn |
Cricsheet transitions are curated offline (`scripts/curate_transitions.py`) to provide real ball-by-ball probabilities.
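The Markov engine's per-ball sampling can be pictured like this; the table shape and numbers below are invented for illustration, with real probabilities coming from the curated Cricsheet transitions:

```python
import random

# Invented example: (phase, bowler_type) -> outcome distribution.
TRANSITIONS = {
    ("MIDDLE", "spin"): {"0": 0.35, "1": 0.30, "2": 0.08, "3": 0.01,
                         "4": 0.12, "6": 0.05, "W": 0.04, "extra": 0.05},
}

def sample_ball(phase: str, bowler_type: str, rng: random.Random) -> str:
    # Draw one ball outcome from the conditional distribution.
    outcomes = TRANSITIONS[(phase, bowler_type)]
    return rng.choices(list(outcomes), weights=list(outcomes.values()), k=1)[0]

# e.g. sample_ball("MIDDLE", "spin", random.Random(0))
```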