# CricketCaptain Benchmark Explainer

## 1. Problem Statement

CricketCaptain-LLM is an OpenEnv environment for training language-model agents on **adaptive strategic decision-making**. The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes. In cricket, every ball changes the world:

- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- The required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.

CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:

1. Plan before acting.
2. Execute a tool call.
3. Observe the ball result.
4. Model the opponent's plan.
5. Adapt tactics over a long match.
6. Manage resources: wickets, overs, target, field, batters, bowlers.

The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:

> Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?

## 2. Fit With OpenEnv Competition Themes

### Multi-Agent Interactions

The submitted captain agent plays against an opponent policy. The opponent can be:

- `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
- `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
- `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- `llm_cached`: replayed opponent decisions for reproducible evaluation.

This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.

### Long-Horizon Planning

A full match has many decisions across innings, phases, wickets, and pressure states.
Rewards are partly delayed, so the agent cannot optimize only the next tool call.

### World Modeling

The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.

### Self-Improvement

The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.

## 3. Environment Flow

At the top level, the match lifecycle is:

```text
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
```

Within each batting or bowling phase, the tactical loop is:

```text
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
```

### Toss

```json
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
```

### Batting Tools

```json
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
```

`plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the 3-per-over limit (see Tool budget).
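The overhead/non-overhead split can be sketched as a small per-over counter. This is a hypothetical illustration only (the class and method names here are made up); the real logic lives in `server/cricket_environment.py` behind the constants described in the Tool budget section:

```python
# Hypothetical sketch of the per-over overhead budget; the actual
# implementation lives in server/cricket_environment.py.
TOOL_BUDGET_PER_OVER = 3      # free overhead calls per over
TOOL_FINE_PER_EXCESS = 0.04   # step-reward penalty per excess call

OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}

class OverheadBudget:
    """Counts overhead tool calls within an over and returns any fine."""

    def __init__(self) -> None:
        self.used = 0

    def on_new_over(self) -> None:
        self.used = 0  # the budget resets at the start of each over

    def charge(self, tool: str) -> float:
        """Return the (negative) reward adjustment for this tool call."""
        if tool not in OVERHEAD_TOOLS:
            return 0.0  # play_delivery, bowl_delivery, plan_shot, etc. are free
        self.used += 1
        if self.used > TOOL_BUDGET_PER_OVER:
            return -TOOL_FINE_PER_EXCESS
        return 0.0
```

For example, a fourth `analyze_situation` inside one over would return −0.04, while any number of `play_delivery` calls stays free.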
### Bowling Tools

```json
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
```

### Post-Ball Reflection

```json
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
```

### Analysis Tool

```json
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
```

## 4. Tool budget

The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).

**Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined): `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`

**Not overhead** (do not use the 3 free “slots”): `play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`

Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.

**Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes.
That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.

## 5. OpenEnv Architecture

```text
LLM Agent / Evaluator
        |
        | WebSocket (OpenEnv)
        v
FastAPI server (server/app.py)
        |
        v
CricketEnvironment (server/cricket_environment.py)
        |
        +--> MarkovCricketEngine (server/markov_engine.py)
        +--> FormatMapper (server/format_mapper.py)
        +--> OpponentPolicy (server/opponent_policy.py)
        +--> PlayerRoster (server/player_roster.py)
        +--> CoherenceGrader (server/coherence_grader.py)
        +--> RewardCalculator (server/reward_calculator.py)
        +--> FieldModel (server/field_model.py)
```

Key files:

| File | Role |
|------|------|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
| `client.py` | WebSocket client `CricketCaptainEnv` |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |

## 6. Format-Aware Rules

`server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:

| Format | max_overs | Key differences |
|--------|-----------|-----------------|
| T5 | ≤ 7 | High aggression throughout; powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |

The format mapper provides:

- **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
- **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
- **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- **Bowling strategy** per phase (line, length, delivery_type, field_setting)

Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.

## 7. Player Rosters

`server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.

When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.

## 8. What The Observation Contains

Each step returns a `CricketObservation` with:

- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for the current state (phase-gated)
- `tool_budget`: overhead calls used this over vs the 3-call limit
- `prompt_text`: rendered prompt for the LLM

The LLM sees enough information to reason tactically, but not simulator internals.

## 9. Opponent Policies

Four modes in `server/opponent_policy.py`:

### `heuristic`

Format-aware local policy using T5/T20/ODI rules from `format_mapper.py`. Picks shot intent from phase-weighted distributions, shifts toward conservative intents when 7+ wickets are down, and selects batter/bowler roles by current over and format. Fast; no API key needed.

### `cricsheet`

Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:

- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- \> 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)

Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Falls back to the heuristic policy if the data file is absent.

### `llm_live`

Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
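The progressive fallback widening above amounts to relaxing the lookup key until a non-empty bucket appears. A minimal sketch, assuming a dictionary index where widened keys use `None` for dropped fields (the function name and index shape are hypothetical; the real logic is in `server/opponent_policy.py`):

```python
def sample_bucket(index: dict, phase: str, wickets_band: str,
                  innings_type: str) -> list:
    """Progressively widen a (phase, wickets_band, innings_type) lookup:
    exact key -> drop innings_type -> drop wickets_band -> any record for phase."""
    candidates = (
        (phase, wickets_band, innings_type),  # exact match on game context
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band too
    )
    for key in candidates:
        deliveries = index.get(key)
        if deliveries:
            return deliveries
    # Last resort: pool every record whose phase matches.
    return [d for k, ds in index.items() if k[0] == phase for d in ds]
```

The design point is simply that every query terminates at some non-empty pool as long as the phase has any data at all, which is what "no dead buckets" means here.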
### `llm_cached`

Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.

## 10. Ball Physics And Markov Engine

The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`. Ball transition tables are keyed by `(over, wickets, score_band, phase, bowler_type)`:

1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
2. **Calibrated synthetic**: `data/transition_probs.json` as fallback

After the base Markov draw, a **hybrid tactical layer** applies:

- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful

## 11. Reward Design

Four-rubric composite reward:

| Rubric | Weight | Frequency | Measures |
|--------|--------|-----------|----------|
| `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |

Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)` — caps at +0.25 once the agent makes ≥10 tool calls. This directly rewards escaping the planning-loop trap (where the policy maxes overhead tools without ever calling `play_delivery`).
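With those weights, the composite reduces to a weighted sum plus the capped bonus. A minimal sketch with made-up rubric values (the real computation lives in `server/reward_calculator.py`; in particular, whether the bonus is added before or after weighting is an assumption here):

```python
# Hypothetical sketch of the 45/25/20/10 composite; the real computation
# lives in server/reward_calculator.py and may differ in detail.
WEIGHTS = {"r_cricket": 0.45, "r_behavior": 0.25, "r_result": 0.20, "r_validity": 0.10}

def progress_bonus(tool_calls_made: int) -> float:
    """min(0.25, tool_calls_made / 40.0): caps at +0.25 from 10 tool calls on."""
    return min(0.25, tool_calls_made / 40.0)

def composite_reward(rubrics: dict, tool_calls_made: int) -> float:
    """Weighted sum of the four rubrics, with the progress bonus folded into
    r_result before weighting (assumption: the text only says 'added to r_result')."""
    adjusted = dict(rubrics)
    adjusted["r_result"] = adjusted["r_result"] + progress_bonus(tool_calls_made)
    return sum(WEIGHTS[name] * adjusted[name] for name in WEIGHTS)
```

Under this reading, a policy that never acts forfeits not only `r_cricket` runs but also up to 0.25 of effective `r_result`, which is the anti-planning-loop pressure described above.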
**Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use. `r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.

### Coherence Scoring (batting)

```
coherence = aggression_match × rationale_specificity × phase_appropriate

aggression_match      = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate     = 1 − |declared_aggression − phase_baseline|

phase_baselines: powerplay=0.55, middle=0.35, death=0.75
```

### Single-Stage Training with Format Curriculum

The original two-stage (format → strategy) curriculum was collapsed because Qwen3-4B-Instruct-2507 emits `...` natively. The full composite reward fires from step 0. What remains is a **format-length curriculum within the warmup config**: per-scenario `max_overs` is sampled from `[2, 2, 2, 2, 2, 2, 3, 3, 3]` (heavily weighted toward 2-over matches so episodes complete inside the token budget). The main run then trains on full 5-over end-to-end matches, resuming from the warmup adapter.

## 12. Data Curation Pipeline

### Step 1: Curate Ball Outcomes

```bash
python scripts/curate_transitions.py --format t20   # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi   # → ball_outcomes_odi_v1.pkl
```

Both files are already generated:

- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence

### Step 2: Build Evaluation Pack

```bash
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
```

### Step 3: Generate Opponent Cache

```bash
python scripts/generate_opponent_cache.py \
  --eval-pack data/eval_packs/adaptive_t20_v1.json \
  --mode heuristic \
  --split official \
  --output data/opponent_cache/adaptive_t20_v1.jsonl
```

## 13. Training Pipeline

### Recommended: Single-Command Chain

```bash
# Warmup (5-over curriculum, 25 steps) → Main (20-over T20, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
```

### Run Components Individually

**Warmup only — short curriculum, bootstraps the LoRA adapter:**

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```

**Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if `resume_from` is empty):**

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```

`train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
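The warmup's format-length curriculum described in the reward-design section reduces to a weighted draw over episode lengths. A minimal sketch (hypothetical helper name; only the `[2, 2, 2, 2, 2, 2, 3, 3, 3]` distribution comes from this document):

```python
import random

# Hypothetical sketch of the warmup format-length curriculum:
# max_overs is drawn 2-heavy so most episodes finish inside the token budget.
CURRICULUM_MAX_OVERS = [2, 2, 2, 2, 2, 2, 3, 3, 3]

def sample_max_overs(rng: random.Random) -> int:
    """Draw a per-scenario episode length: 2 overs with p = 2/3, 3 overs with p = 1/3."""
    return rng.choice(CURRICULUM_MAX_OVERS)
```

Over many sampled scenarios, roughly two-thirds of warmup episodes are 2-over matches, which is what keeps rollouts short enough to complete.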
The default training model is `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks). The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.

## 14. Current Status (2026-04-25)

### Implemented and verified

| Component | Status |
|-----------|--------|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (`environment_factory` agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |

### Verified end-to-end (2026-04-25)

All 3 opponent modes verified at 5-over inference + train-smoke:

| Mode | inference parse_err | train-smoke r_validity | coherence |
|------|--------------------|-----------------------|-----------|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live | 0% | 1.0 | 0.537 |

### Pending for submission

- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video

## 15. Recommended Demo Story

1. **Show a late chase scenario**: over 16.0, 128/5, target 172
2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent
3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑

> The model learned to captain, not just emit a valid tool-call object.