CricketCaptain Benchmark Explainer
1. Problem Statement
CricketCaptain-LLM is an OpenEnv environment for training language-model agents on adaptive strategic decision-making.
The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes.
In cricket, every ball changes the world:
- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- Required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.
CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:
- Plan before acting.
- Execute a tool call.
- Observe the ball result.
- Model the opponent's plan.
- Adapt tactics over a long match.
- Manage resources: wickets, overs, target, field, batters, bowlers.
The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:
Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?
2. Fit With OpenEnv Competition Themes
Multi-Agent Interactions
The submitted captain agent plays against an opponent policy. The opponent can be:
- heuristic: fast format-aware cricket logic (T5/T20/ODI rules).
- cricsheet: real Cricsheet ball-by-ball match data sampled by game context.
- llm_live: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- llm_cached: replayed opponent decisions for reproducible evaluation.
This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.
Long-Horizon Planning
A full match has many decisions across innings, phases, wickets, and pressure states. Rewards are partly delayed, so the agent cannot optimize only the next tool call.
World Modeling
The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.
Self-Improvement
The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.
3. Environment Flow
At the top level, the match lifecycle is:
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
Within each batting or bowling phase, the tactical loop is:
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
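The tactical loop above can be sketched as a client-side driver. This is a minimal sketch with a stub environment; the real `client.py` `CricketCaptainEnv` API (method names, observation fields) may differ.

```python
import json

class StubEnv:
    """Toy stand-in for the real CricketCaptainEnv client (illustrative only)."""
    def __init__(self, balls=3):
        self.balls = balls

    def reset(self):
        return {"game_state": "batting", "strategic_phase": "pre_ball",
                "last_outcome": None}

    def step(self, action_json):
        self.balls -= 1
        return {"game_state": "batting" if self.balls > 0 else "finished",
                "strategic_phase": "post_ball",
                "last_outcome": {"runs": 1}}

def run_loop(env, choose_action):
    """PRE_BALL: commit a tool call; BALL_RESOLUTION happens server-side;
    POST_BALL: record the outcome for the agent's next decision."""
    obs, history = env.reset(), []
    while obs["game_state"] != "finished":
        action = choose_action(obs)
        obs = env.step(json.dumps(action))
        history.append(obs["last_outcome"])
    return history

outcomes = run_loop(StubEnv(), lambda obs: {"tool": "play_delivery",
                                            "arguments": {"shot_intent": "single"}})
```

The same driver shape works for the bowling phase; only the tool calls the agent emits change.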
Toss
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
Batting Tools
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
plan_shot is not an overhead tool. Only set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, and analyze_situation count against the 3-per-over limit (see Tool budget).
Bowling Tools
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
Post-Ball Reflection
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
Analysis Tool
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
4. Tool budget
The environment enforces a 3-call overhead budget per over (see CricketEnvironment.TOOL_BUDGET_PER_OVER and TOOL_FINE_PER_EXCESS in server/cricket_environment.py).
Overhead tools (increment the per-over counter; the 4th+ in the same over are fined):
set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, analyze_situation
Not overhead (do not use the 3 free “slots”):
play_delivery, bowl_delivery, plan_shot, call_toss, select_batter, choose_bowler, set_field_setting, set_match_plan, update_match_plan
Each overhead call beyond the third in that over incurs an immediate −0.04 step reward. The prompt shows Tool budget: N/3 overhead calls used this over.
Training connection: train.py train uses real CricketEnvironment steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for analyze_situation and reflect_after_ball, while set_match_plan / update_match_plan let them carry structure across overs without spending overhead budget.
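The budget rule above can be sketched as follows. This is illustrative only; the actual constants and accounting live in `server/cricket_environment.py` as `TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS`.

```python
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04

def budget_fine(calls_this_over, tool):
    """Return (new overhead count, step-reward penalty) for one tool call.

    Non-overhead tools never touch the counter; the 4th+ overhead call
    in the same over draws the -0.04 fine."""
    if tool not in OVERHEAD_TOOLS:
        return calls_this_over, 0.0
    calls_this_over += 1
    fine = -TOOL_FINE_PER_EXCESS if calls_this_over > TOOL_BUDGET_PER_OVER else 0.0
    return calls_this_over, fine
```

The counter would reset to zero at each PRE_OVER transition, which is why `set_match_plan` / `update_match_plan` (non-overhead) are the cheap way to carry structure across overs.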
5. OpenEnv Architecture
LLM Agent / Evaluator
|
| WebSocket (OpenEnv)
v
FastAPI server (server/app.py)
|
v
CricketEnvironment (server/cricket_environment.py)
|
+--> MarkovCricketEngine (server/markov_engine.py)
+--> FormatMapper (server/format_mapper.py)
+--> OpponentPolicy (server/opponent_policy.py)
+--> PlayerRoster (server/player_roster.py)
+--> CoherenceGrader (server/coherence_grader.py)
+--> RewardCalculator (server/reward_calculator.py)
+--> FieldModel (server/field_model.py)
Key files:
| File | Role |
|---|---|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | reset, step, state implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | CricketAction, CricketObservation, CricketState |
| `client.py` | WebSocket client CricketCaptainEnv |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |
6. Format-Aware Rules
server/format_mapper.py auto-selects T5 / T20 / ODI rules by |max_overs − format_overs|:
| Format | max_overs | Key differences |
|---|---|---|
| T5 | ≤ 7 | High-aggression throughout, powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |
The format mapper provides:
- Phase-aware shot weights: boundary/six probability rises sharply in death overs
- Batter roles with `overs_active` windows (opener, anchor, middle_order, finisher)
- Bowler roles with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- Bowling strategy per phase (line, length, delivery_type, field_setting)

Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.
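The band boundaries from the format table above can be sketched as a simple selector. This is a sketch of the selection rule, not the actual `format_mapper.py` code.

```python
def select_format(max_overs: int) -> str:
    """Map a scenario's max_overs to the closest rule set, using the
    band boundaries from the format table (<=7 T5, 8-35 T20, >35 ODI)."""
    if max_overs <= 7:
        return "T5"
    if max_overs <= 35:
        return "T20"
    return "ODI"
```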
7. Player Rosters
server/player_roster.py loads team profiles from data/player_profiles/ — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.
When the agent calls select_batter or choose_bowler with a player name, the roster performs fuzzy lookup (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.
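The exact → surname → word-overlap cascade can be sketched as follows. The real logic lives in `server/player_roster.py`; this illustrative version assumes `roster` maps full names to profile dicts.

```python
def fuzzy_lookup(name, roster):
    """Resolve a player name with the exact -> surname -> word-overlap
    cascade described above (illustrative sketch)."""
    if name in roster:                                   # 1) exact match
        return roster[name]
    surname = name.split()[-1].lower()
    for full, profile in roster.items():                 # 2) surname match
        if full.split()[-1].lower() == surname:
            return profile
    query = set(name.lower().split())                    # 3) word overlap
    best = max(roster, key=lambda f: len(query & set(f.lower().split())),
               default=None)
    if best and query & set(best.lower().split()):
        return roster[best]
    return None
```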
8. What The Observation Contains
Each step returns a CricketObservation with:
- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for current state (phase-gated)
- `tool_budget`: overhead calls used this over vs 3-call limit
- `prompt_text`: rendered prompt for the LLM
The LLM sees enough information to reason tactically, but not simulator internals.
9. Opponent Policies
Four modes in server/opponent_policy.py:
heuristic
Format-aware local policy using T5/T20/ODI rules from format_mapper.py. Picks shot intent from phase-weighted distributions, shifts conservative once seven or more wickets are down, and selects batter/bowler roles by current over and format. Fast, no API key needed.
cricsheet
Samples real Cricsheet ball-by-ball deliveries indexed by (phase, wickets_band, innings_type). Automatically selects T20 or ODI data based on max_overs:
- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- > 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)
Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Heuristic fallback if data file absent.
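The widening order can be sketched as a key cascade. This assumes `index` maps `(phase, wickets_band, innings_type)` tuples to lists of recorded deliveries; the real implementation in `server/opponent_policy.py` may structure the index differently.

```python
import random

def sample_outcome(index, phase, wickets_band, innings_type):
    """Try the full key first, then progressively widen:
    drop innings_type, then drop wickets_band (any phase record)."""
    for key in [(phase, wickets_band, innings_type),
                (phase, wickets_band, None),
                (phase, None, None)]:
        bucket = index.get(key)
        if bucket:
            return random.choice(bucket)
    return None  # the real code falls back to the heuristic policy here
```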
llm_live
Calls google/gemma-4-26B-A4B-it via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
llm_cached
Replays pre-recorded opponent decisions from JSONL. Does not call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.
10. Ball Physics And Markov Engine
The simulation uses server/markov_engine.py plus field/zone definitions in server/field_model.py.
Ball transition tables keyed by (over, wickets, score_band, phase, bowler_type):
- Cricsheet-derived: `data/processed/cricket_transitions_v1.pkl` when available
- Calibrated synthetic: `data/transition_probs.json` as fallback

After the base Markov draw, a hybrid tactical layer applies:
- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful
11. Reward Design
Four-rubric composite reward:
| Rubric | Weight | Frequency | Measures |
|---|---|---|---|
| `r_cricket` | 45% | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | 25% | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | 20% | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | 10% | Every turn | Valid tool-call structure and legal phase-gated tool use |
Plus a progress bonus added to r_result: min(0.25, tool_calls_made / 40.0) — caps at +0.25 once the agent makes ≥10 tool calls. Directly rewards escaping the planning-loop trap (where the policy maxes overhead tools without ever calling play_delivery).
Why these weights (rebalanced from the original 55/25/15/5): partial-trajectory training means r_result rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.
r_tools is computed and logged but excluded from the composite — tool discipline is measured through outcomes.
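The composite can be sketched as below. One caveat: whether the progress bonus is added before or after the `r_result` weighting is an implementation detail of `server/reward_calculator.py`; this sketch adds it unweighted so the stated +0.25 cap holds at face value.

```python
def composite_reward(r_cricket, r_behavior, r_result, r_validity, tool_calls_made):
    """45/25/20/10 composite plus the progress bonus (sketch).

    The bonus caps at +0.25 once the agent has made >= 10 tool calls,
    rewarding escape from the planning-loop trap."""
    progress = min(0.25, tool_calls_made / 40.0)
    return (0.45 * r_cricket + 0.25 * r_behavior
            + 0.20 * r_result + 0.10 * r_validity + progress)
```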
Coherence Scoring (batting)
coherence = aggression_match × rationale_specificity × phase_appropriate
aggression_match = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate = 1 − |declared_aggression − phase_baseline|
phase_baselines: powerplay=0.55, middle=0.35, death=0.75
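A direct transcription of the three factors and their product:

```python
PHASE_BASELINES = {"powerplay": 0.55, "middle": 0.35, "death": 0.75}

def coherence(declared_aggression, shot_aggression_proxy,
              word_count_score, keyword_density, phase):
    """Batting coherence = aggression_match x rationale_specificity
    x phase_appropriate, per the formulas above."""
    aggression_match = 1 - abs(declared_aggression - shot_aggression_proxy)
    rationale_specificity = (word_count_score + keyword_density) / 2
    phase_appropriate = 1 - abs(declared_aggression - PHASE_BASELINES[phase])
    return aggression_match * rationale_specificity * phase_appropriate
```

Because the factors multiply, a perfectly matched but phase-inappropriate declaration (e.g. aggression 0.75 in the middle overs) still scores well below 1.0.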
Single-Stage Training with Format Curriculum
The original two-stage (format → strategy) curriculum was collapsed because
Qwen3-4B-Instruct-2507 emits <tool_call>...</tool_call> natively. The full composite
reward fires from step 0.
What remains is a format-length curriculum within the warmup config: per-scenario
max_overs is sampled from [2, 2, 2, 2, 2, 2, 3, 3, 3] (heavy on T2 so episodes
complete inside the token budget). The main run then trains on full 5-over end-to-end
matches, resuming from the warmup adapter.
12. Data Curation Pipeline
Step 1: Curate Ball Outcomes
python scripts/curate_transitions.py --format t20 # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi # → ball_outcomes_odi_v1.pkl
Both files already generated:
- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence
Step 2: Build Evaluation Pack
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
Step 3: Generate Opponent Cache
python scripts/generate_opponent_cache.py \
--eval-pack data/eval_packs/adaptive_t20_v1.json \
--mode heuristic \
--split official \
--output data/opponent_cache/adaptive_t20_v1.jsonl
13. Training Pipeline
Recommended: Single-Command Chain
# Warmup (short 2–3-over curriculum, 25 steps) → Main (5-over end-to-end, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
Run Components Individually
Warmup only — short curriculum, bootstraps the LoRA adapter:
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if resume_from is empty):
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
train.py train uses TRL GRPOTrainer with environment_factory=CricketCaptainToolEnv. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. opponent-mode llm_live affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
The default training model is Qwen/Qwen3-4B-Instruct-2507 (256k native context, native Qwen3ForCausalLM in vLLM, no thinking blocks). The default live opponent model is google/gemma-4-26B-A4B-it. Roster-backed training requires --agent-team or env.agent_team in YAML so select_batter and choose_bowler use real player profiles instead of generic names.
14. Current Status (2026-04-25)
Implemented and verified
| Component | Status |
|---|---|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (environment_factory agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |
Verified end-to-end (2026-04-25)
All 3 opponent modes verified at 5-over inference + train-smoke:
| Mode | inference parse_err | train-smoke r_validity | coherence |
|---|---|---|---|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live | 0% | 1.0 | 0.537 |
Pending for submission
- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video
15. Recommended Demo Story
Show a late chase scenario: Over 16.0, 128/5, target 172
Random/untrained model: invalid tools, blind aggression, ignores field/opponent
Trained model: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
Show metrics: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑
The model learned to captain, not just emit a valid tool-call object.