🏏
A multi-agent RL training environment for strategic coherence — teaching LLMs to mean what they say across 300 consecutive decisions.
"I'll consolidate and preserve wickets" → then actually plays defensively for 4 overs.
Targets: Theme #1 Multi-Agent Interaction + Theme #2 Long-Horizon Planning
arXiv:2503.07003, the only public benchmark directly measuring whether LLMs execute what they declare.
| Property | Capability Forced |
|---|---|
| 300 consecutive decisions (50×6) | Long-horizon coherence |
| 10 wickets as irreversible budget | Consequence-aware risk planning |
| Powerplay → Middle → Death phases | Strategic revision at regime shifts |
| DLS par = ground-truth optimal score | Objective performance signal |
| "Declare strategy, then play shots" | Declaration-execution directly testable |
| LLM opponent (HF Router) | Theory-of-mind / opponent modeling |
| Full match: bat & bowl both innings | End-to-end role adaptation |
Tool groups (see `format_rules.json`):
- PLANNING
- BATTING EXECUTION
- BOWLING EXECUTION
- ANALYSIS
```jsonc
// game_context
{ "over": 14, "ball": 3, "score": 112, "wickets": 2, "run_rate": 7.8,
  "req_rate": 8.4, "phase": "MIDDLE", "bowler_type": "spin",
  "field_setting": "Attacking" }

// declared_strategy (agent's own prior declaration)
{ "phase_intent": "consolidate", "aggression": 0.35,
  "rationale": "Preserve wickets, build platform" }

// tool_budget (per-over overhead counter)
{ "overhead_calls_this_over": 1, "budget": 3, "remaining": 2,
  "fines_accumulated": 0.0 }

// last_outcome
{ "runs": 1, "wicket": false, "extras": 0 }

// available_tools + tool_history (last 5)
```
`prompt_text` is a rendered summary of all of the above, fed directly to the LLM. The declared strategy is later extracted back out of that rendered text, which keeps the GRPO reward function stateless.
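A minimal sketch of that rendering step, assuming the field names from the observation above (`render_prompt` is illustrative, not the repo's actual function):

```python
def render_prompt(obs: dict) -> str:
    # Flatten the structured observation into the text the LLM actually sees.
    gc, ds = obs["game_context"], obs["declared_strategy"]
    return (
        f"Phase: {gc['phase']} | Over {gc['over']}.{gc['ball']} | "
        f"{gc['score']}/{gc['wickets']} (RR {gc['run_rate']}, req {gc['req_rate']})\n"
        f"Bowler: {gc['bowler_type']} | Field: {gc['field_setting']}\n"
        f"Batting Strategy: {ds['rationale']} (aggression {ds['aggression']})"
    )
```

The `Phase:` and `Batting Strategy:` lines are exactly what the regexes further down recover.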
| Rubric | Weight | Frequency | Measures | Key Sub-signals |
|---|---|---|---|---|
| r_cricket | 45% | Per ball | Dream11 proxy: runs, wickets, milestones | dot%, boundary%, 50s/100s, maiden overs, economy |
| r_behavior | 25% | Every turn | Declaration-execution alignment | coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| r_result | 20% | Innings/episode end | Win/loss vs DLS par, target margin | score/par, wickets_remaining, lead/deficit, +0.25 progress bonus |
| r_validity | 10% | Every turn | Parseable XML/JSON tool call | Format gate; 0 = parse fail, 1 = valid |
Rebalanced from 55/25/15/5 to 45/25/20/10, following the SWE-RL recipe (roughly 60% intermediate / 40% terminal). Reasoning: partial-trajectory training rarely fires r_result, so over-weighting it wastes gradient on a near-zero signal.
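A minimal sketch of the composite under these weights; the real logic lives in `server/reward_calculator.py`, and the gate behavior shown here is an assumption:

```python
WEIGHTS = {"cricket": 0.45, "behavior": 0.25, "result": 0.20, "validity": 0.10}

def composite_reward(r: dict[str, float]) -> float:
    # One reading of the "format gate": an unparseable tool call (r_validity = 0)
    # zeroes the whole turn. The repo's exact gating may differ.
    if r["validity"] == 0.0:
        return 0.0
    return sum(w * r[k] for k, w in WEIGHTS.items())
```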
r_behavior's dominant sub-signal, coherence, is the product of three factors:

```python
coherence = (
    aggression_match          # 1 - |declared_aggression - shot_aggression_proxy|
    * rationale_specificity   # min(words / 15, 1.0)
    * phase_appropriate       # 1.0 if shot fits phase norms, 0.6 otherwise
)
```
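Worked example: declared aggression 0.35 against a shot proxy of 0.50 gives aggression_match = 1 - 0.15 = 0.85; a 9-word rationale gives 9/15 = 0.6; a phase-appropriate shot gives 1.0, so coherence = 0.85 × 0.6 × 1.0 = 0.51.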
- `max_overs` sampled from `[2,2,2,2,2,2,3,3,3]`, so episodes complete within the step budget and r_result can actually fire
- The model emits `<tool_call>...</tool_call>` natively, so no Stage 1 SFT is needed
- The environment is instantiated via `environment_factory`
- TRL's `GRPOTrainer` requires a stateless reward function, `reward_fn(prompts, completions) -> list[float]`, with no `env.step()` inside
The agent's declaration is rendered into `prompt_text` as "Batting Strategy: …". We parse it back with regex; no shared env state needed.
```python
_STRATEGY_RE = re.compile(
    r"Batting Strategy:\s*(.+)$", re.MULTILINE
)
_PHASE_RE = re.compile(
    r"Phase:\s+(POWERPLAY|MIDDLE|DEATH)", re.I
)
# r_behavior scored from (prompt, completion) alone
# r_result injected at episode end → all turns
```
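Put together, a sketch of the stateless reward function TRL calls; `score_behavior` here is an illustrative stand-in for the actual coherence grader:

```python
import re

_STRATEGY_RE = re.compile(r"Batting Strategy:\s*(.+)$", re.MULTILINE)

def score_behavior(declared: str, completion: str) -> float:
    # Stand-in: the real grader compares declared vs. executed aggression.
    return 1.0 if declared and declared.split()[0].lower() in completion.lower() else 0.0

def reward_fn(prompts: list[str], completions: list[str], **kwargs) -> list[float]:
    # Stateless: everything needed is re-parsed from the rendered prompt text.
    rewards = []
    for prompt, completion in zip(prompts, completions):
        m = _STRATEGY_RE.search(prompt)
        declared = m.group(1) if m else ""
        rewards.append(score_behavior(declared, completion))
    return rewards
```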
```bash
# 1. Start server
uvicorn server.app:app --port 8766

# 2. Sanity test (3-over match, heuristic)
python train.py train-smoke \
  --config configs/default.yaml \
  --matches 1 --max-overs 3

# 3. Live LLM match (HF Router)
python inference.py \
  --config configs/default.yaml \
  --max-overs 3 --opponent-mode llm_live

# 4. Warmup → Main chained run (auto-resumes adapter)
bash scripts/run_warmup_then_main.sh

# 5. Eval: untrained vs trained head-to-head
python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
  --label baseline --episodes 20 --max-overs 5 \
  --output eval_results/baseline.json
python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
  --adapter ./checkpoints/stage2_final \
  --label trained --episodes 20 --max-overs 5 \
  --output eval_results/trained.json
python compare_eval.py --compare \
  eval_results/baseline.json eval_results/trained.json
```
Every knob lives in `configs/default.yaml`. Zero hardcoding.
| Criterion | Weight | Our Angle | Evidence |
|---|---|---|---|
| Environment Innovation | 40% | First RL env purpose-built for WDCT / declaration-execution alignment | 12-tool multi-phase, multi-agent, 300-decision episodes with Cricsheet Markov engine |
| Storytelling | 30% | Clear capability gap → environment design → reward signal → training → measurable WDCT improvement | README, this slide deck, Colab notebook, Gradio live demo |
| Improvement in Rewards | 20% | Two-stage curriculum produces observable r_validity spike then coherence rise | Colab plots, before/after tool call samples, coherence heatmap |
| Reward & Pipeline | 10% | 4-rubric composite, hard to game (r_result at episode end enforces real match outcomes) | `server/reward_calculator.py`, `server/coherence_grader.py`, stateless GRPO reward fn |
Cricsheet transitions are curated offline (`scripts/curate_transitions.py`) to provide real ball-by-ball probabilities.
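The Markov engine's per-ball sampling can be pictured like this; the table shape and numbers below are invented for illustration, with real probabilities coming from the curated Cricsheet transitions:

```python
import random

# Invented example: (phase, bowler_type) -> outcome distribution.
TRANSITIONS = {
    ("MIDDLE", "spin"): {"0": 0.35, "1": 0.30, "2": 0.08, "3": 0.01,
                         "4": 0.12, "6": 0.05, "W": 0.04, "extra": 0.05},
}

def sample_ball(phase: str, bowler_type: str, rng: random.Random) -> str:
    # Draw one ball outcome from the conditional distribution.
    outcomes = TRANSITIONS[(phase, bowler_type)]
    return rng.choices(list(outcomes), weights=list(outcomes.values()), k=1)[0]

# e.g. sample_ball("MIDDLE", "spin", random.Random(0))
```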