
CricketCaptain Benchmark Explainer

1. Problem Statement

CricketCaptain-LLM is an OpenEnv environment for training language-model agents on adaptive strategic decision-making.

The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes.

In cricket, every ball changes the world:

  • A wicket can force a lower-risk batting plan.
  • A boundary can force a field or bowling change.
  • Required run rate changes during a chase.
  • Batter/bowler matchups matter.
  • Field settings affect which shots are sensible.
  • A captain must think before the ball and adapt after the ball.

CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:

  1. Plan before acting.
  2. Execute a tool call.
  3. Observe the ball result.
  4. Model the opponent's plan.
  5. Adapt tactics over a long match.
  6. Manage resources: wickets, overs, target, field, batters, bowlers.

The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:

Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?

2. Fit With OpenEnv Competition Themes

Multi-Agent Interactions

The submitted captain agent plays against an opponent policy. The opponent can be:

  • heuristic: fast format-aware cricket logic (T5/T20/ODI rules).
  • cricsheet: real Cricsheet ball-by-ball match data sampled by game context.
  • llm_live: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
  • llm_cached: replayed opponent decisions for reproducible evaluation.

This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.

Long-Horizon Planning

A full match has many decisions across innings, phases, wickets, and pressure states. Rewards are partly delayed, so the agent cannot optimize only the next tool call.

World Modeling

The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.

Self-Improvement

The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.

3. Environment Flow

At the top level, the match lifecycle is:

TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED

Within each batting or bowling phase, the tactical loop is:

PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision

Toss

{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}

Batting Tools

{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}

plan_shot is not an overhead tool. Only set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, and analyze_situation count against the 3-per-over limit (see Tool budget).

Bowling Tools

{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}

Post-Ball Reflection

{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}

Analysis Tool

{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}

4. Tool budget

The environment enforces a 3-call overhead budget per over (see CricketEnvironment.TOOL_BUDGET_PER_OVER and TOOL_FINE_PER_EXCESS in server/cricket_environment.py).

Overhead tools (increment the per-over counter; the 4th+ in the same over are fined): set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, analyze_situation

Not overhead (do not consume the 3 free “slots”): play_delivery, bowl_delivery, plan_shot, call_toss, select_batter, choose_bowler, set_field_setting, set_match_plan, update_match_plan

Each overhead call beyond the third in that over incurs an immediate −0.04 step reward. The prompt shows Tool budget: N/3 overhead calls used this over.
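
A minimal sketch of the budget check, assuming a simple per-over counter; the constant names come from server/cricket_environment.py, but the surrounding logic here is illustrative only:

OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}
TOOL_BUDGET_PER_OVER = 3      # free overhead calls per over
TOOL_FINE_PER_EXCESS = 0.04   # step-reward penalty per extra overhead call

def overhead_fine(tool: str, overhead_calls_this_over: int) -> float:
    # Non-overhead tools never consume the budget or draw a fine.
    if tool not in OVERHEAD_TOOLS:
        return 0.0
    # The 4th and later overhead calls in the same over are each fined.
    if overhead_calls_this_over >= TOOL_BUDGET_PER_OVER:
        return TOOL_FINE_PER_EXCESS
    return 0.0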

Training connection: train.py train uses real CricketEnvironment steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for analyze_situation and reflect_after_ball, while set_match_plan / update_match_plan let them carry structure across overs without spending overhead budget.

5. OpenEnv Architecture

LLM Agent / Evaluator
        |
        | WebSocket (OpenEnv)
        v
FastAPI server (server/app.py)
        |
        v
CricketEnvironment (server/cricket_environment.py)
        |
        +--> MarkovCricketEngine  (server/markov_engine.py)
        +--> FormatMapper         (server/format_mapper.py)
        +--> OpponentPolicy       (server/opponent_policy.py)
        +--> PlayerRoster         (server/player_roster.py)
        +--> CoherenceGrader      (server/coherence_grader.py)
        +--> RewardCalculator     (server/reward_calculator.py)
        +--> FieldModel           (server/field_model.py)

Key files:

  • server/app.py: OpenEnv server entry point
  • server/cricket_environment.py: reset, step, and state implementation
  • server/format_mapper.py: T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles
  • server/opponent_policy.py: heuristic, Cricsheet, live LLM, and cached LLM opponent policies
  • server/player_roster.py: fuzzy player lookup; batter/bowler profile extractor
  • models.py: CricketAction, CricketObservation, CricketState
  • client.py: WebSocket client CricketCaptainEnv
  • inference.py: random + LLM agent evaluation
  • train.py: MT-GRPO + SFT training pipeline
  • eval.py: coherence heatmaps, reward curves, tool analytics
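
For orientation, a hedged sketch of a driver loop talking to the environment through the WebSocket client. The class and module names come from the table above; the constructor argument, the reset/step return shape, and the fixed play_delivery action are assumptions for illustration (a real agent must respect available_tools and phase gating):

from client import CricketCaptainEnv   # WebSocket client from client.py
from models import CricketAction       # action schema from models.py

# base_url and the exact field names are assumptions; see client.py / models.py.
env = CricketCaptainEnv(base_url="ws://localhost:8000")
obs = env.reset()
while obs.game_state != "finished":
    # Trivial driver for illustration only: always plays the next delivery.
    action = CricketAction(tool="play_delivery",
                           arguments={"shot_intent": "single",
                                      "explanation": "Rotate strike."})
    obs = env.step(action)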

6. Format-Aware Rules

server/format_mapper.py auto-selects T5 / T20 / ODI rules by |max_overs − format_overs|:

  • T5 (max_overs ≤ 7): high aggression throughout; powerplay rules dominate all overs
  • T20 (max_overs 8–35): three phases (powerplay / middle / death); spin-heavy middle overs
  • ODI (max_overs > 35): four phases (powerplay / middle-early / middle-late / death); anchor roles
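
As a rough sketch of the closest-format rule above; the reference over counts here are illustrative assumptions, and the real values and tie-breaking live in server/format_mapper.py:

FORMAT_OVERS = {"t5": 5, "t20": 20, "odi": 50}   # assumed canonical lengths

def closest_format(max_overs: int) -> str:
    # Pick the format whose canonical length minimizes |max_overs - format_overs|.
    return min(FORMAT_OVERS, key=lambda fmt: abs(max_overs - FORMAT_OVERS[fmt]))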

The format mapper provides:

  • Phase-aware shot weights: boundary/six probability rises sharply in death overs
  • Batter roles with overs_active windows (opener, anchor, middle_order, finisher)
  • Bowler roles with preferred_phases (pace_opener, spin_controller, death_specialist)
  • Bowling strategy per phase (line, length, delivery_type, field_setting)

Both the heuristic opponent and the select_batter / choose_bowler tools draw from these tables.

7. Player Rosters

server/player_roster.py loads team profiles from data/player_profiles/ — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.

When the agent calls select_batter or choose_bowler with a player name, the roster performs fuzzy lookup (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.
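
A minimal sketch of the exact → surname → word-overlap cascade, assuming profiles are stored in a dict keyed by full player name; the function name and data layout are hypothetical, and the real extractor is in server/player_roster.py:

def fuzzy_lookup(name: str, profiles: dict) -> dict | None:
    # 1. Exact match on the full name.
    if name in profiles:
        return profiles[name]
    # 2. Surname match: compare the last token of the query and each roster key.
    surname = name.split()[-1].lower()
    for full_name, profile in profiles.items():
        if full_name.split()[-1].lower() == surname:
            return profile
    # 3. Word-overlap match: pick the roster entry sharing the most tokens.
    query = set(name.lower().split())
    best = max(profiles, key=lambda k: len(query & set(k.lower().split())), default=None)
    if best and query & set(best.lower().split()):
        return profiles[best]
    return None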

8. What The Observation Contains

Each step returns a CricketObservation with:

  • game_state: toss / batting / bowling / finished
  • strategic_phase: pre_over / pre_ball / ball_resolution / post_ball
  • game_context: score, wickets, over, ball, target, phase, run_rate, req_rate
  • declared_strategy: current batting strategy (aggression, intent, rationale)
  • bowling_strategy: current bowling plan
  • field_setting: Aggressive / Balanced / Defensive
  • current_batter: batter profile (style, aggression, phase strengths)
  • current_bowler: bowler profile
  • last_outcome: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
  • available_tools: legal tools for current state (phase-gated)
  • tool_budget: overhead calls used this over vs 3-call limit
  • prompt_text: rendered prompt for the LLM

The LLM sees enough information to reason tactically, but not simulator internals.
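
A hedged sketch of the observation shape as a dataclass, using the field names listed above; the exact types and defaults live in models.py and may differ:

from dataclasses import dataclass
from typing import Any

@dataclass
class CricketObservationSketch:
    # Field names mirror the list above; types here are assumptions.
    game_state: str                    # toss / batting / bowling / finished
    strategic_phase: str               # pre_over / pre_ball / ball_resolution / post_ball
    game_context: dict[str, Any]       # score, wickets, over, ball, target, ...
    declared_strategy: dict[str, Any]  # aggression, intent, rationale
    bowling_strategy: dict[str, Any]
    field_setting: str                 # Aggressive / Balanced / Defensive
    current_batter: dict[str, Any]
    current_bowler: dict[str, Any]
    last_outcome: dict[str, Any]
    available_tools: list[str]
    tool_budget: str                   # e.g. "1/3 overhead calls used this over"
    prompt_text: str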

9. Opponent Policies

Four modes in server/opponent_policy.py:

heuristic

Format-aware local policy using T5/T20/ODI rules from format_mapper.py. Picks shot intent from phase-weighted distributions, shifts toward conservative options when seven or more wickets are down, and selects batter/bowler roles by current over and format. Fast, no API key needed.

cricsheet

Samples real Cricsheet ball-by-ball deliveries indexed by (phase, wickets_band, innings_type). Automatically selects T20 or ODI data based on max_overs:

  • ≤ 25 overs → ball_outcomes_t20_v1.pkl (1.17M T20 deliveries from 5,176 matches)
  • > 25 overs → ball_outcomes_odi_v1.pkl (1.65M ODI deliveries from 3,116 matches)

Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Heuristic fallback if data file absent.
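
A minimal sketch of the progressive widening, assuming the pickle is indexed by (phase, wickets_band, innings_type) tuples mapping to lists of recorded deliveries; the exact index layout lives in server/opponent_policy.py:

import random

def sample_delivery(index: dict, phase: str, wickets_band: str, innings_type: str):
    # Try the most specific bucket first, then widen step by step.
    keys = [
        (phase, wickets_band, innings_type),  # full context
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band: any record for the phase
    ]
    for key in keys:
        bucket = index.get(key)
        if bucket:
            return random.choice(bucket)
    return None  # caller falls back to the heuristic policy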

llm_live

Calls google/gemma-4-26B-A4B-it via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
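
A hedged sketch of the graceful-fallback pattern using the standard OpenAI-compatible client; the environment variable name, the router base URL, and the heuristic_decision stub are assumptions for illustration:

import os
from openai import OpenAI

def heuristic_decision(prompt: str) -> str:
    # Stand-in for the format-aware heuristic policy (hypothetical stub).
    return '{"tool": "play_delivery", "arguments": {"shot_intent": "single"}}'

def opponent_decision(prompt: str) -> str:
    api_key = os.environ.get("HF_TOKEN")   # assumed env var name
    if not api_key:
        # No key configured: fall back to the heuristic policy so local
        # development keeps working without network access.
        return heuristic_decision(prompt)
    client = OpenAI(base_url="https://router.huggingface.co/v1", api_key=api_key)
    resp = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content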

llm_cached

Replays pre-recorded opponent decisions from JSONL. Does not call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.

10. Ball Physics And Markov Engine

The simulation uses server/markov_engine.py plus field/zone definitions in server/field_model.py.

Ball transition tables keyed by (over, wickets, score_band, phase, bowler_type):

  1. Cricsheet-derived: data/processed/cricket_transitions_v1.pkl when available
  2. Calibrated synthetic: data/transition_probs.json as fallback
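
A minimal sketch of the keyed lookup and base draw described above, assuming each table entry maps ball outcomes to probabilities; the bucket structure is an assumption, and the tactical layer below then adjusts the drawn outcome:

import random

def draw_base_outcome(tables: dict, over: int, wickets: int,
                      score_band: str, phase: str, bowler_type: str) -> str:
    # Cricsheet-derived table preferred; calibrated synthetic table as fallback
    # (see the two sources listed above).
    key = (over, wickets, score_band, phase, bowler_type)
    dist = tables[key]                     # e.g. {"0": 0.35, "1": 0.30, "4": 0.10, "W": 0.05, ...}
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]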

After the base Markov draw, a hybrid tactical layer applies:

  • Shot target zones (cover, point, midwicket, long_on, …) matched against delivery line/length/variation
  • Field presets (Aggressive, Balanced, Defensive) expand into named fielder zones
  • Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
  • Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
  • High chase pressure makes defensive batting less useful

11. Reward Design

Four-rubric composite reward:

  • r_cricket (45%, per ball): Dream11 proxy covering runs, wickets, dots, milestones, economy, strike rate
  • r_behavior (25%, every turn): coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%)
  • r_result (20%, innings/episode end): win/loss vs DLS par, target margin, wickets
  • r_validity (10%, every turn): valid tool-call structure and legal phase-gated tool use

Plus a progress bonus added to r_result: min(0.25, tool_calls_made / 40.0) — caps at +0.25 once the agent makes ≥10 tool calls. Directly rewards escaping the planning-loop trap (where the policy maxes overhead tools without ever calling play_delivery).
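
Putting the weights and the progress bonus together, a minimal sketch of the composite; the rubric values themselves come from server/reward_calculator.py and server/coherence_grader.py, and whether the bonus is weighted with r_result (as written here, following the description above) is an assumption:

def composite_reward(r_cricket: float, r_behavior: float, r_result: float,
                     r_validity: float, tool_calls_made: int) -> float:
    # Progress bonus is added to r_result and caps at +0.25 (>= 10 tool calls).
    progress_bonus = min(0.25, tool_calls_made / 40.0)
    return (0.45 * r_cricket
            + 0.25 * r_behavior
            + 0.20 * (r_result + progress_bonus)
            + 0.10 * r_validity)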

Why these weights (rebalanced from the original 55/25/15/5): partial-trajectory training means r_result rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.

r_tools is computed and logged but excluded from the composite — tool discipline is measured through outcomes.

Coherence Scoring (batting)

coherence = aggression_match × rationale_specificity × phase_appropriate
aggression_match   = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate  = 1 − |declared_aggression − phase_baseline|
phase_baselines: powerplay=0.55, middle=0.35, death=0.75
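
A direct transcription of the batting coherence formula into Python, assuming all inputs are normalized to [0, 1]; the word-count and keyword-density scoring here are simplified stand-ins for the grader in server/coherence_grader.py:

PHASE_BASELINES = {"powerplay": 0.55, "middle": 0.35, "death": 0.75}

def batting_coherence(declared_aggression: float, shot_aggression_proxy: float,
                      rationale: str, phase: str) -> float:
    aggression_match = 1.0 - abs(declared_aggression - shot_aggression_proxy)
    words = rationale.lower().split()
    # Assumptions: word-count score saturates at 15 words; density scaled so
    # roughly one cricket keyword per five words saturates.
    word_count_score = min(1.0, len(words) / 15.0)
    keywords = {"single", "boundary", "yorker", "spin", "field", "gap", "wicket"}
    keyword_density = min(1.0, 5.0 * sum(w in keywords for w in words) / max(len(words), 1))
    rationale_specificity = (word_count_score + keyword_density) / 2.0
    phase_appropriate = 1.0 - abs(declared_aggression - PHASE_BASELINES.get(phase, 0.5))
    return aggression_match * rationale_specificity * phase_appropriate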

Single-Stage Training with Format Curriculum

The original two-stage (format → strategy) curriculum was collapsed because Qwen3-4B-Instruct-2507 emits <tool_call>...</tool_call> natively. The full composite reward fires from step 0.

What remains is a format-length curriculum within the warmup config: per-scenario max_overs is sampled from [2, 2, 2, 2, 2, 2, 3, 3, 3], weighted toward 2-over games so episodes complete inside the token budget. The main run then trains on full 5-over end-to-end matches, resuming from the warmup adapter.

12. Data Curation Pipeline

Step 1: Curate Ball Outcomes

python scripts/curate_transitions.py --format t20   # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi   # → ball_outcomes_odi_v1.pkl

Both files already generated:

  • data/processed/ball_outcomes_t20_v1.pkl — 1.17M T20 deliveries, 5,176 matches
  • data/processed/ball_outcomes_odi_v1.pkl — 1.65M ODI deliveries, 3,116 matches
  • data/processed/cricket_transitions_v1.pkl — 5,138 Markov keys, 2,878 high-confidence

Step 2: Build Evaluation Pack

python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1

Step 3: Generate Opponent Cache

python scripts/generate_opponent_cache.py \
  --eval-pack data/eval_packs/adaptive_t20_v1.json \
  --mode heuristic \
  --split official \
  --output data/opponent_cache/adaptive_t20_v1.jsonl

13. Training Pipeline

Recommended: Single-Command Chain

# Warmup (5-over curriculum, 25 steps) → Main (20-over T20, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh

Run Components Individually

Warmup only — short curriculum, bootstraps the LoRA adapter:

PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3_warmup.yaml

Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if resume_from is empty):

PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3.yaml

train.py train uses TRL GRPOTrainer with environment_factory=CricketCaptainToolEnv. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. Setting the opponent mode to llm_live affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.

The default training model is Qwen/Qwen3-4B-Instruct-2507 (256k native context, native Qwen3ForCausalLM in vLLM, no thinking blocks). The default live opponent model is google/gemma-4-26B-A4B-it. Roster-backed training requires --agent-team or env.agent_team in YAML so select_batter and choose_bowler use real player profiles instead of generic names.

14. Current Status (2026-04-25)

Implemented and verified

  • OpenEnv server + client
  • 14-tool strategic surface
  • 4-rubric reward system
  • Tool budget system (3/over, −0.04 fine)
  • Format mapper (T5/T20/ODI)
  • Player rosters (10 T20I teams, fuzzy lookup)
  • Cricsheet T20 data (1.17M deliveries)
  • Cricsheet ODI data (1.65M deliveries)
  • Heuristic opponent (format-aware)
  • Cricsheet opponent (T20+ODI, context-indexed)
  • LLM live opponent (HF Router / OpenAI-compatible API)
  • LLM cached opponent
  • GRPO training script (environment_factory agent rollouts)
  • SFT data generator
  • Gradio demo UI
  • Colab training notebook

Verified end-to-end (2026-04-25)

All 3 opponent modes verified at 5-over inference + train-smoke:

  • heuristic: 0% inference parse errors, train-smoke r_validity 1.0, coherence 0.556
  • cricsheet: 0% inference parse errors, train-smoke r_validity 1.0, coherence 0.620
  • llm_live: 0% inference parse errors, train-smoke r_validity 1.0, coherence 0.537

Pending for submission

  • Real GRPO training run with reward curves (requires HF compute)
  • HF Space deployment URL
  • Training-vs-baseline comparison plots
  • Mini-blog / video

15. Recommended Demo Story

  1. Show a late chase scenario: Over 16.0, 128/5, target 172

  2. Random/untrained model: invalid tools, blind aggression, ignores field/opponent

  3. Trained model: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level

  4. Show metrics: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑

The model learned to captain, not just emit a valid tool-call object.