# CricketCaptain Benchmark Explainer
## 1. Problem Statement
CricketCaptain-LLM is an OpenEnv environment for training language-model agents on **adaptive strategic decision-making**.
The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes.
In cricket, every ball changes the world:
- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- Required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.
CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:
1. Plan before acting.
2. Execute a tool call.
3. Observe the ball result.
4. Model the opponent's plan.
5. Adapt tactics over a long match.
6. Manage resources: wickets, overs, target, field, batters, bowlers.
The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:
> Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?
## 2. Fit With OpenEnv Competition Themes
### Multi-Agent Interactions
The submitted captain agent plays against an opponent policy. The opponent can be:
- `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
- `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
- `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- `llm_cached`: replayed opponent decisions for reproducible evaluation.
This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.
### Long-Horizon Planning
A full match has many decisions across innings, phases, wickets, and pressure states. Rewards are partly delayed, so the agent cannot optimize only the next tool call.
### World Modeling
The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.
### Self-Improvement
The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.
## 3. Environment Flow
At the top level, the match lifecycle is:
```text
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
```
Within each batting or bowling phase, the tactical loop is:
```text
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
```
### Toss
```json
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
```
### Batting Tools
```json
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
```
`plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the 3-per-over limit (see Tool budget).
### Bowling Tools
```json
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
```
### Post-Ball Reflection
```json
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
```
### Analysis Tool
```json
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
```
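Putting the pieces together, a minimal client-side episode loop looks roughly like this. It is a sketch, not the verbatim `client.py` API: the constructor argument and the exact `reset`/`step` signatures are assumptions, grounded only in the `reset`/`step`/`state` surface and the observation fields documented later in this explainer.

```python
from client import CricketCaptainEnv  # WebSocket client (see Section 5)
from models import CricketAction      # action schema (see Section 5)

# Sketch of one decision loop; signatures are assumptions.
env = CricketCaptainEnv("ws://localhost:8000")
obs = env.reset()

while obs.game_state != "finished":
    # In practice an LLM maps obs.prompt_text to a tool call; here a
    # simple batting action is hard-coded for illustration.
    action = CricketAction(tool="play_delivery",
                           arguments={"shot_intent": "single",
                                      "explanation": "Rotate strike."})
    obs = env.step(action)
```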
## 4. Tool budget
The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).
**Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined):
`set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`
**Not overhead** (do not consume the 3 free “slots”):
`play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`
Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.
**Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.
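A minimal sketch of the enforcement logic (the real constants are `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`; the function and set names here are illustrative):

```python
TOOL_BUDGET_PER_OVER = 3      # free overhead calls per over
TOOL_FINE_PER_EXCESS = -0.04  # step-reward fine per excess call

OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

def charge_overhead(tool: str, used_this_over: int) -> tuple[int, float]:
    """Return the updated per-over counter and any fine for this call."""
    if tool not in OVERHEAD_TOOLS:
        return used_this_over, 0.0               # e.g. play_delivery is free
    used_this_over += 1
    fine = TOOL_FINE_PER_EXCESS if used_this_over > TOOL_BUDGET_PER_OVER else 0.0
    return used_this_over, fine                  # counter resets each new over
```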
## 5. OpenEnv Architecture
```text
LLM Agent / Evaluator
|
| WebSocket (OpenEnv)
v
FastAPI server (server/app.py)
|
v
CricketEnvironment (server/cricket_environment.py)
|
+--> MarkovCricketEngine (server/markov_engine.py)
+--> FormatMapper (server/format_mapper.py)
+--> OpponentPolicy (server/opponent_policy.py)
+--> PlayerRoster (server/player_roster.py)
+--> CoherenceGrader (server/coherence_grader.py)
+--> RewardCalculator (server/reward_calculator.py)
+--> FieldModel (server/field_model.py)
```
Key files:
| File | Role |
|------|------|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
| `client.py` | WebSocket client `CricketCaptainEnv` |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |
## 6. Format-Aware Rules
`server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:
| Format | max_overs | Key differences |
|--------|-----------|-----------------|
| T5 | ≤ 7 | High-aggression throughout, powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |
The format mapper provides:
- **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
- **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
- **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- **Bowling strategy** per phase (line, length, delivery_type, field_setting)
Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.
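As a sketch, the bands in the table can be applied directly (the real selector in `server/format_mapper.py` picks the nearest format by `|max_overs − format_overs|`; this simplified version just reproduces the documented cutoffs):

```python
def select_format(max_overs: int) -> str:
    """Map a scenario length to the closest documented rule set."""
    if max_overs <= 7:
        return "T5"    # high-aggression throughout
    if max_overs <= 35:
        return "T20"   # PP / Middle / Death phases
    return "ODI"       # four phases, anchor roles

assert select_format(5) == "T5" and select_format(50) == "ODI"
```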
## 7. Player Rosters
`server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.
When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.
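A minimal sketch of that fallback chain (the real `player_roster.py` also extracts the profile fields; names here are illustrative):

```python
def fuzzy_lookup(name: str, roster: list[str]) -> str | None:
    """Resolve a player name: exact -> surname -> word-overlap."""
    query = name.strip().lower()
    for player in roster:                         # 1. exact match
        if player.lower() == query:
            return player
    for player in roster:                         # 2. surname match
        if player.lower().split()[-1] == query.split()[-1]:
            return player
    tokens = set(query.split())
    for player in roster:                         # 3. any shared name token
        if tokens & set(player.lower().split()):
            return player
    return None

fuzzy_lookup("Kohli", ["Rohit Sharma", "Virat Kohli"])  # -> "Virat Kohli"
```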
## 8. What The Observation Contains
Each step returns a `CricketObservation` with:
- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for current state (phase-gated)
- `tool_budget`: overhead calls used this over vs 3-call limit
- `prompt_text`: rendered prompt for the LLM
The LLM sees enough information to reason tactically, but not simulator internals.
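For intuition, a trimmed, illustrative observation for the late-chase scenario from Section 15 might look like this (field values are made up; the exact schema lives in `models.py`):

```python
observation = {
    "game_state": "batting",
    "strategic_phase": "pre_ball",
    "game_context": {"score": 128, "wickets": 5, "over": 16, "ball": 0,
                     "target": 172, "phase": "death",
                     "run_rate": 8.0, "req_rate": 11.0},
    "declared_strategy": {"phase_intent": "accelerate", "aggression": 0.7,
                          "rationale": "Chase demands boundaries."},
    "field_setting": "Defensive",
    "last_outcome": {"event": "single", "shot_zone": "midwicket"},
    "available_tools": ["plan_shot", "play_delivery", "set_strategy"],
    "tool_budget": "1/3 overhead calls used this over",
}
```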
## 9. Opponent Policies
Four modes in `server/opponent_policy.py`:
### `heuristic`
Format-aware local policy using T5/T20/ODI rules from `format_mapper.py`. Picks shot intent from phase-weighted distributions, adjusts for wicket pressure (shifting toward conservative shots once 7+ wickets are down), and selects batter/bowler roles by current over and format. Fast, no API key needed.
### `cricsheet`
Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:
- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- > 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)
Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Heuristic fallback if data file absent.
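A sketch of the widening order (bucket keys follow the `(phase, wickets_band, innings_type)` index described above; helper names are illustrative):

```python
import random

def sample_delivery(buckets: dict, phase: str, wickets_band: str,
                    innings_type: str):
    """Sample a real delivery, widening the key until a bucket is non-empty."""
    for key in [(phase, wickets_band, innings_type),  # exact context
                (phase, wickets_band),                # drop innings_type
                (phase,)]:                            # any record in this phase
        records = buckets.get(key)
        if records:
            return random.choice(records)
    return None  # caller falls back to the heuristic policy
```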
### `llm_live`
Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
### `llm_cached`
Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.
## 10. Ball Physics And Markov Engine
The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`.
Ball transition tables keyed by `(over, wickets, score_band, phase, bowler_type)`:
1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
2. **Calibrated synthetic**: `data/transition_probs.json` as fallback
After the base Markov draw, a **hybrid tactical layer** applies:
- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful
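Put together, a single delivery resolves roughly as in this sketch: a base Markov draw followed by the tactical adjustment (all names, and the specific adjustment rule shown, are illustrative rather than the engine's actual code):

```python
import random

def resolve_ball(tables, ctx, shot_plan, rider_zones):
    # Stage 1: raw outcome drawn from the transition table for this context.
    key = (ctx["over"], ctx["wickets"], ctx["score_band"],
           ctx["phase"], ctx["bowler_type"])
    outcomes, probs = tables[key]
    outcome = random.choices(outcomes, weights=probs, k=1)[0]

    # Stage 2: hybrid tactical layer, e.g. a boundary hit into a zone
    # patrolled by a boundary rider gets cut off to a run or two.
    if outcome == "FOUR" and shot_plan["target_area"] in rider_zones:
        outcome = random.choice(["ONE", "TWO"])
    return outcome
```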
## 11. Reward Design
Four-rubric composite reward:
| Rubric | Weight | Frequency | Measures |
|--------|--------|-----------|----------|
| `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |
Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)`, which caps at +0.25 once the agent makes ≥10 tool calls. This directly rewards escaping the planning-loop trap, where the policy spends every turn on overhead tools without ever calling `play_delivery`.
**Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.
`r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.
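In sketch form (weights from the table above; the actual computation lives in `server/reward_calculator.py`):

```python
def composite_reward(r_cricket: float, r_behavior: float, r_result: float,
                     r_validity: float, tool_calls_made: int) -> float:
    """Four-rubric composite with the progress bonus folded into r_result."""
    progress_bonus = min(0.25, tool_calls_made / 40.0)  # +0.25 at >= 10 calls
    r_result += progress_bonus
    return (0.45 * r_cricket + 0.25 * r_behavior
            + 0.20 * r_result + 0.10 * r_validity)
```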
### Coherence Scoring (batting)
```
coherence = aggression_match × rationale_specificity × phase_appropriate
aggression_match = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate = 1 − |declared_aggression − phase_baseline|
phase_baselines: powerplay=0.55, middle=0.35, death=0.75
```
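Translated directly into Python (a sketch; the grader in `server/coherence_grader.py` also covers bowling and edge cases):

```python
PHASE_BASELINES = {"powerplay": 0.55, "middle": 0.35, "death": 0.75}

def batting_coherence(declared_aggression: float, shot_aggression_proxy: float,
                      word_count_score: float, cricket_keyword_density: float,
                      phase: str) -> float:
    """Product of the three sub-scores defined above (each in [0, 1])."""
    aggression_match = 1 - abs(declared_aggression - shot_aggression_proxy)
    rationale_specificity = (word_count_score + cricket_keyword_density) / 2
    phase_appropriate = 1 - abs(declared_aggression - PHASE_BASELINES[phase])
    return aggression_match * rationale_specificity * phase_appropriate
```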
### Single-Stage Training with Format Curriculum
The original two-stage (format → strategy) curriculum was collapsed because
Qwen3-4B-Instruct-2507 emits `<tool_call>...</tool_call>` natively. The full composite
reward fires from step 0.
What remains is a **format-length curriculum within the warmup config**: per-scenario
`max_overs` is sampled from `[2, 2, 2, 2, 2, 2, 3, 3, 3]` (heavy on T2 so episodes
complete inside the token budget). The main run then trains on full 5-over end-to-end
matches, resuming from the warmup adapter.
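The sampling itself is one line (a sketch of the warmup-config behavior):

```python
import random

# Six 2-over entries to three 3-over entries: most warmup episodes run
# 2 overs so they complete inside the token budget.
max_overs = random.choice([2, 2, 2, 2, 2, 2, 3, 3, 3])
```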
## 12. Data Curation Pipeline
### Step 1: Curate Ball Outcomes
```bash
python scripts/curate_transitions.py --format t20 # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi # → ball_outcomes_odi_v1.pkl
```
Both files already generated:
- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence
### Step 2: Build Evaluation Pack
```bash
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
```
### Step 3: Generate Opponent Cache
```bash
python scripts/generate_opponent_cache.py \
--eval-pack data/eval_packs/adaptive_t20_v1.json \
--mode heuristic \
--split official \
--output data/opponent_cache/adaptive_t20_v1.jsonl
```
## 13. Training Pipeline
### Recommended: Single-Command Chain
```bash
# Warmup (short 2-3 over curriculum, 25 steps) → Main (5-over end-to-end, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
```
### Run Components Individually
**Warmup only — short curriculum, bootstraps the LoRA adapter:**
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```
**Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if `resume_from` is empty):**
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```
`train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
The default training model is `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks). The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.
## 14. Current Status (2026-04-25)
### Implemented and verified
| Component | Status |
|-----------|--------|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (`environment_factory` agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |
### Verified end-to-end (2026-04-25)
Three of the four opponent modes (all but `llm_cached`) were verified at 5-over inference + train-smoke:
| Mode | inference parse_err | train-smoke r_validity | coherence |
|------|--------------------|-----------------------|-----------|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live | 0% | 1.0 | 0.537 |
### Pending for submission
- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video
## 15. Recommended Demo Story
1. **Show a late chase scenario**: Over 16.0, 128/5, target 172
2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent
3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑
> The model learned to captain, not just emit a valid tool-call object.