# CricketCaptain Benchmark Explainer
## 1. Problem Statement
CricketCaptain-LLM is an OpenEnv environment for training language-model agents on **adaptive strategic decision-making**.
The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes.
In cricket, every ball changes the world:
- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- Required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.
CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:
1. Plan before acting.
2. Execute a tool call.
3. Observe the ball result.
4. Model the opponent's plan.
5. Adapt tactics over a long match.
6. Manage resources: wickets, overs, target, field, batters, bowlers.
The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:
> Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?
## 2. Fit With OpenEnv Competition Themes
### Multi-Agent Interactions
The submitted captain agent plays against an opponent policy. The opponent can be:
- `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
- `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
- `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- `llm_cached`: replayed opponent decisions for reproducible evaluation.
This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.
### Long-Horizon Planning
A full match has many decisions across innings, phases, wickets, and pressure states. Rewards are partly delayed, so the agent cannot optimize only the next tool call.
### World Modeling
The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.
### Self-Improvement
The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.
## 3. Environment Flow
At the top level, the match lifecycle is:
```text
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
```
Within each batting or bowling phase, the tactical loop is:
```text
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
```
### Toss
```json
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
```
### Batting Tools
```json
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
```
`plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the three-per-over limit (see Tool budget).
### Bowling Tools
```json
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
```
### Post-Ball Reflection
```json
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
```
### Analysis Tool
```json
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
```
## 4. Tool budget
The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).
**Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined):
`set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`
**Not overhead** (these do not consume the three free slots):
`play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`
Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.
**Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.
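A minimal sketch of this accounting, using the constant names cited above; the authoritative implementation is in `server/cricket_environment.py`:

```python
# Sketch of the per-over overhead budget; constant names follow the doc,
# the real logic lives in server/cricket_environment.py.
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = -0.04

OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}

def apply_tool_budget(tool: str, calls_this_over: int) -> tuple[int, float]:
    """Return the updated overhead count and any fine for this call."""
    if tool not in OVERHEAD_TOOLS:
        return calls_this_over, 0.0           # free tools never count
    calls_this_over += 1
    if calls_this_over > TOOL_BUDGET_PER_OVER:
        return calls_this_over, TOOL_FINE_PER_EXCESS  # 4th+ call is fined
    return calls_this_over, 0.0
```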
## 5. OpenEnv Architecture
```text
LLM Agent / Evaluator
|
| WebSocket (OpenEnv)
v
FastAPI server (server/app.py)
|
v
CricketEnvironment (server/cricket_environment.py)
|
+--> MarkovCricketEngine (server/markov_engine.py)
+--> FormatMapper (server/format_mapper.py)
+--> OpponentPolicy (server/opponent_policy.py)
+--> PlayerRoster (server/player_roster.py)
+--> CoherenceGrader (server/coherence_grader.py)
+--> RewardCalculator (server/reward_calculator.py)
+--> FieldModel (server/field_model.py)
```
Key files:
| File | Role |
|------|------|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
| `client.py` | WebSocket client `CricketCaptainEnv` |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |
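For orientation, a hypothetical driver loop against this stack. `CricketCaptainEnv` and `CricketAction` are the classes named in the table, but the constructor arguments and the `step` return shape shown here are assumptions, not the repo's exact API:

```python
# Hypothetical client loop; signatures are assumptions (see client.py / models.py).
from client import CricketCaptainEnv
from models import CricketAction

env = CricketCaptainEnv(base_url="ws://localhost:8000")  # assumed endpoint
obs = env.reset()
done = False
while not done:
    # A real agent feeds obs.prompt_text to the LLM and parses its tool call;
    # here the toss is hard-coded just to show the step shape.
    action = CricketAction(tool="call_toss",
                           arguments={"call": "heads", "decision": "bat"})
    obs, reward, done = env.step(action)  # assumed return tuple
```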
## 6. Format-Aware Rules
`server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:
| Format | max_overs | Key differences |
|--------|-----------|-----------------|
| T5 | ≤ 7 | High-aggression throughout, powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |
The format mapper provides:
- **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
- **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
- **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- **Bowling strategy** per phase (line, length, delivery_type, field_setting)
Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.
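The bands in the table fall out of the distance rule; since the reference over counts live inside `format_mapper.py`, this sketch hard-codes the resulting bands rather than guessing the reference values:

```python
# Sketch of format selection; the repo derives this from |max_overs - format_overs|,
# this version hard-codes the bands that rule produces per the table above.
def closest_format(max_overs: int) -> str:
    if max_overs <= 7:
        return "T5"
    if max_overs <= 35:
        return "T20"
    return "ODI"

assert closest_format(5) == "T5"
assert closest_format(20) == "T20"
assert closest_format(50) == "ODI"
```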
## 7. Player Rosters
`server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.
When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.
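A minimal sketch of that lookup chain, assuming a flat `{full_name: profile}` roster dict; the real implementation and profile schema are in `server/player_roster.py`:

```python
# Sketch of the exact -> surname -> word-overlap lookup chain.
def fuzzy_lookup(name: str, roster: dict[str, dict]) -> dict | None:
    # 1. Exact match
    if name in roster:
        return roster[name]
    # 2. Surname match
    surname = name.split()[-1].lower()
    for full, profile in roster.items():
        if full.split()[-1].lower() == surname:
            return profile
    # 3. Word-overlap match: share at least one name token
    tokens = {t.lower() for t in name.split()}
    for full, profile in roster.items():
        if tokens & {t.lower() for t in full.split()}:
            return profile
    return None
```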
## 8. What The Observation Contains
Each step returns a `CricketObservation` with:
- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for current state (phase-gated)
- `tool_budget`: overhead calls used this over vs 3-call limit
- `prompt_text`: rendered prompt for the LLM
The LLM sees enough information to reason tactically, but not simulator internals.
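For reference, a field-name sketch of the observation shape; the names follow the list above, but the types are assumptions and the authoritative definitions are in `models.py`:

```python
# Field-name sketch of CricketObservation; types are assumptions.
from dataclasses import dataclass
from typing import Any

@dataclass
class CricketObservation:
    game_state: str                 # toss / batting / bowling / finished
    strategic_phase: str            # pre_over / pre_ball / ball_resolution / post_ball
    game_context: dict[str, Any]    # score, wickets, over, ball, target, ...
    declared_strategy: dict[str, Any] | None
    bowling_strategy: dict[str, Any] | None
    field_setting: str              # Aggressive / Balanced / Defensive
    current_batter: dict[str, Any] | None
    current_bowler: dict[str, Any] | None
    last_outcome: dict[str, Any] | None
    available_tools: list[str]      # phase-gated legal tools
    tool_budget: str                # e.g. "1/3 overhead calls used this over"
    prompt_text: str
```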
## 9. Opponent Policies
Four modes in `server/opponent_policy.py`:
### `heuristic`
Format-aware local policy using T5/T20/ODI rules from `format_mapper.py`. Picks shot intent from phase-weighted distributions, adjusts for wicket pressure (shifts toward conservative options once seven or more wickets are down), and selects batter/bowler roles by current over and format. Fast, no API key needed.
### `cricsheet`
Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:
- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- > 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)
Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets, as sketched below; if the data file is absent, the policy falls back to the heuristic opponent.
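A sketch of that widening, assuming the index is keyed by `(phase, wickets_band, innings_type)` tuples with `None` as the wildcard; the real index layout in `server/opponent_policy.py` may differ:

```python
# Sketch of progressive bucket widening over a context-keyed delivery index.
import random

def sample_delivery(index: dict, phase: str, wickets_band: str, innings_type: str):
    for key in [
        (phase, wickets_band, innings_type),  # full context
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band: any phase record
    ]:
        bucket = index.get(key)
        if bucket:
            return random.choice(bucket)
    return None  # caller falls back to the heuristic policy
```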
### `llm_live`
Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
### `llm_cached`
Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.
## 10. Ball Physics And Markov Engine
The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`.
Ball transition tables keyed by `(over, wickets, score_band, phase, bowler_type)`:
1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
2. **Calibrated synthetic**: `data/transition_probs.json` as fallback
After the base Markov draw, a **hybrid tactical layer** applies:
- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful
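A sketch of the base Markov draw, assuming the pickle holds a dict from context keys to outcome distributions; the hybrid tactical layer above is applied on top of this and is omitted here:

```python
# Sketch of the keyed transition draw; the table shape is an assumption,
# the real engine is server/markov_engine.py.
import pickle
import random

with open("data/processed/cricket_transitions_v1.pkl", "rb") as f:
    tables = pickle.load(f)  # assumed: {(over, wickets, score_band, phase, bowler_type): {outcome: prob}}

def draw_outcome(over, wickets, score_band, phase, bowler_type):
    dist = tables.get((over, wickets, score_band, phase, bowler_type))
    if dist is None:
        return None  # engine falls back to data/transition_probs.json
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]
```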
## 11. Reward Design
Four-rubric composite reward:
| Rubric | Weight | Frequency | Measures |
|--------|--------|-----------|----------|
| `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |
Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)` — caps at +0.25 once the agent makes ≥10 tool calls. This directly rewards escaping the planning-loop trap, where the policy maxes out overhead tool calls without ever calling `play_delivery`.
**Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.
`r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.
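Putting the table together, a minimal sketch of the composite, assuming the four rubric scores are already computed; the authoritative version is `server/reward_calculator.py`:

```python
# Sketch of the 45/25/20/10 composite with the progress bonus folded into r_result.
def composite_reward(r_cricket: float, r_behavior: float, r_result: float,
                     r_validity: float, tool_calls_made: int) -> float:
    progress_bonus = min(0.25, tool_calls_made / 40.0)  # caps at 10+ calls
    r_result = r_result + progress_bonus
    return (0.45 * r_cricket + 0.25 * r_behavior
            + 0.20 * r_result + 0.10 * r_validity)
```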
### Coherence Scoring (batting)
```
coherence = aggression_match × rationale_specificity × phase_appropriate
aggression_match = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate = 1 − |declared_aggression − phase_baseline|
phase_baselines: powerplay=0.55, middle=0.35, death=0.75
```
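A direct Python transcription of these formulas; the proxy and rationale sub-scores are computed elsewhere (`server/coherence_grader.py`) and passed in here:

```python
# Transcription of the coherence formulas above; inputs assumed in [0, 1].
PHASE_BASELINES = {"powerplay": 0.55, "middle": 0.35, "death": 0.75}

def coherence(declared_aggression: float, shot_aggression_proxy: float,
              word_count_score: float, cricket_keyword_density: float,
              phase: str) -> float:
    aggression_match = 1 - abs(declared_aggression - shot_aggression_proxy)
    rationale_specificity = (word_count_score + cricket_keyword_density) / 2
    phase_appropriate = 1 - abs(declared_aggression - PHASE_BASELINES[phase])
    return aggression_match * rationale_specificity * phase_appropriate
```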
### Single-Stage Training with Format Curriculum
The original two-stage (format → strategy) curriculum was collapsed because
Qwen3-4B-Instruct-2507 emits `<tool_call>...</tool_call>` natively. The full composite
reward fires from step 0.
What remains is a **format-length curriculum within the warmup config**: per-scenario
`max_overs` is sampled from `[2, 2, 2, 2, 2, 2, 3, 3, 3]` (heavy on T2 so episodes
complete inside the token budget). The main run then trains on full 5-over end-to-end
matches, resuming from the warmup adapter.
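A sketch of that per-scenario sampling; where exactly `train.py` draws it is an assumption:

```python
# Sketch of the warmup format-length curriculum: two-thirds 2-over scenarios.
import random

WARMUP_MAX_OVERS = [2, 2, 2, 2, 2, 2, 3, 3, 3]

def sample_scenario_overs() -> int:
    return random.choice(WARMUP_MAX_OVERS)
```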
## 12. Data Curation Pipeline
### Step 1: Curate Ball Outcomes
```bash
python scripts/curate_transitions.py --format t20 # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi # → ball_outcomes_odi_v1.pkl
```
Both files already generated:
- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence
### Step 2: Build Evaluation Pack
```bash
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
```
### Step 3: Generate Opponent Cache
```bash
python scripts/generate_opponent_cache.py \
--eval-pack data/eval_packs/adaptive_t20_v1.json \
--mode heuristic \
--split official \
--output data/opponent_cache/adaptive_t20_v1.jsonl
```
## 13. Training Pipeline
### Recommended: Single-Command Chain
```bash
# Warmup (short-format curriculum, 25 steps) → Main (5-over end-to-end, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
```
### Run Components Individually
**Warmup only — short curriculum, bootstraps the LoRA adapter:**
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```
**Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if `resume_from` is empty):**
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```
`train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
The default training model is `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks). The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.
## 14. Current Status (2026-04-25)
### Implemented and verified
| Component | Status |
|-----------|--------|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (`environment_factory` agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |
### Verified end-to-end (2026-04-25)
Three of the four opponent modes were verified at 5-over inference + train-smoke:
| Mode | inference parse_err | train-smoke r_validity | coherence |
|------|--------------------|-----------------------|-----------|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live | 0% | 1.0 | 0.537 |
### Pending for submission
- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video
## 15. Recommended Demo Story
1. **Show a late chase scenario**: Over 16.0, 128/5, target 172
2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent
3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑
> The model learned to captain, not just emit a valid tool-call object.