# CricketCaptain Benchmark Explainer

## 1. Problem Statement

CricketCaptain-LLM is an OpenEnv environment for training language-model agents on **adaptive strategic decision-making**. The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes. In cricket, every ball changes the world:

- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- The required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.

CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:

1. Plan before acting.
2. Execute a tool call.
3. Observe the ball result.
4. Model the opponent's plan.
5. Adapt tactics over a long match.
6. Manage resources: wickets, overs, target, field, batters, bowlers.

The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:

> Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?

## 2. Fit With OpenEnv Competition Themes

### Multi-Agent Interactions

The submitted captain agent plays against an opponent policy. The opponent can be:

- `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
- `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
- `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- `llm_cached`: replayed opponent decisions for reproducible evaluation.

This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.

### Long-Horizon Planning

A full match has many decisions across innings, phases, wickets, and pressure states.
Rewards are partly delayed, so the agent cannot optimize only the next tool call.

### World Modeling

The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.

### Self-Improvement

The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.

## 3. Environment Flow

At the top level, the match lifecycle is:

```text
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
```

Within each batting or bowling phase, the tactical loop is:

```text
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
```

### Toss

```json
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
```

### Batting Tools

```json
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
```

`plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the 3-per-over limit (see Tool budget).
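The overhead/non-overhead split can be sketched as a small per-over counter. This is a hypothetical illustration only (the class and method names here are made up); the real logic lives in `server/cricket_environment.py` behind the constants described in the Tool budget section:

```python
# Hypothetical sketch of the per-over overhead budget; the actual
# implementation lives in server/cricket_environment.py.
TOOL_BUDGET_PER_OVER = 3      # free overhead calls per over
TOOL_FINE_PER_EXCESS = 0.04   # step-reward penalty per excess call

OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}

class OverheadBudget:
    """Counts overhead tool calls within an over and returns any fine."""

    def __init__(self) -> None:
        self.used = 0

    def on_new_over(self) -> None:
        self.used = 0  # the budget resets at the start of each over

    def charge(self, tool: str) -> float:
        """Return the (negative) reward adjustment for this tool call."""
        if tool not in OVERHEAD_TOOLS:
            return 0.0  # play_delivery, bowl_delivery, plan_shot, etc. are free
        self.used += 1
        if self.used > TOOL_BUDGET_PER_OVER:
            return -TOOL_FINE_PER_EXCESS
        return 0.0
```

For example, a fourth `analyze_situation` inside one over would return −0.04, while any number of `play_delivery` calls stays free.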
### Bowling Tools

```json
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
```

### Post-Ball Reflection

```json
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
```

### Analysis Tool

```json
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
```

## 4. Tool budget

The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).

**Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined): `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`

**Not overhead** (do not use the 3 free “slots”): `play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`

Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.

**Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes.
That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.

## 5. OpenEnv Architecture

```text
LLM Agent / Evaluator
        |
        | WebSocket (OpenEnv)
        v
FastAPI server (server/app.py)
        |
        v
CricketEnvironment (server/cricket_environment.py)
        |
        +--> MarkovCricketEngine (server/markov_engine.py)
        +--> FormatMapper (server/format_mapper.py)
        +--> OpponentPolicy (server/opponent_policy.py)
        +--> PlayerRoster (server/player_roster.py)
        +--> CoherenceGrader (server/coherence_grader.py)
        +--> RewardCalculator (server/reward_calculator.py)
        +--> FieldModel (server/field_model.py)
```

Key files:

| File | Role |
|------|------|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
| `client.py` | WebSocket client `CricketCaptainEnv` |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |

## 6. Format-Aware Rules

`server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:

| Format | max_overs | Key differences |
|--------|-----------|-----------------|
| T5 | ≤ 7 | High aggression throughout; powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |

The format mapper provides:

- **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
- **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
- **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- **Bowling strategy** per phase (line, length, delivery_type, field_setting)

Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.

## 7. Player Rosters

`server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.

When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.

## 8. What The Observation Contains

Each step returns a `CricketObservation` with:

- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for the current state (phase-gated)
- `tool_budget`: overhead calls used this over vs the 3-call limit
- `prompt_text`: rendered prompt for the LLM

The LLM sees enough information to reason tactically, but not simulator internals.

## 9. Opponent Policies

Four modes in `server/opponent_policy.py`:

### `heuristic`

Format-aware local policy using T5/T20/ODI rules from `format_mapper.py`. Picks shot intent from phase-weighted distributions, shifts toward conservative intents when 7+ wickets are down, and selects batter/bowler roles by current over and format. Fast; no API key needed.

### `cricsheet`

Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:

- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- \> 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)

Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Falls back to the heuristic policy if the data file is absent.

### `llm_live`

Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
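The progressive fallback widening above amounts to relaxing the lookup key until a non-empty bucket appears. A minimal sketch, assuming a dictionary index where widened keys use `None` for dropped fields (the function name and index shape are hypothetical; the real logic is in `server/opponent_policy.py`):

```python
def sample_bucket(index: dict, phase: str, wickets_band: str,
                  innings_type: str) -> list:
    """Progressively widen a (phase, wickets_band, innings_type) lookup:
    exact key -> drop innings_type -> drop wickets_band -> any record for phase."""
    candidates = (
        (phase, wickets_band, innings_type),  # exact match on game context
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band too
    )
    for key in candidates:
        deliveries = index.get(key)
        if deliveries:
            return deliveries
    # Last resort: pool every record whose phase matches.
    return [d for k, ds in index.items() if k[0] == phase for d in ds]
```

The design point is simply that every query terminates at some non-empty pool as long as the phase has any data at all, which is what "no dead buckets" means here.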
### `llm_cached`

Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.

## 10. Ball Physics And Markov Engine

The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`. Ball transition tables are keyed by `(over, wickets, score_band, phase, bowler_type)`:

1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
2. **Calibrated synthetic**: `data/transition_probs.json` as fallback

After the base Markov draw, a **hybrid tactical layer** applies:

- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful

## 11. Reward Design

Four-rubric composite reward:

| Rubric | Weight | Frequency | Measures |
|--------|--------|-----------|----------|
| `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |

Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)` — caps at +0.25 once the agent makes ≥10 tool calls. This directly rewards escaping the planning-loop trap (where the policy maxes overhead tools without ever calling `play_delivery`).
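With those weights, the composite reduces to a weighted sum plus the capped bonus. A minimal sketch with made-up rubric values (the real computation lives in `server/reward_calculator.py`; in particular, whether the bonus is added before or after weighting is an assumption here):

```python
# Hypothetical sketch of the 45/25/20/10 composite; the real computation
# lives in server/reward_calculator.py and may differ in detail.
WEIGHTS = {"r_cricket": 0.45, "r_behavior": 0.25, "r_result": 0.20, "r_validity": 0.10}

def progress_bonus(tool_calls_made: int) -> float:
    """min(0.25, tool_calls_made / 40.0): caps at +0.25 from 10 tool calls on."""
    return min(0.25, tool_calls_made / 40.0)

def composite_reward(rubrics: dict, tool_calls_made: int) -> float:
    """Weighted sum of the four rubrics, with the progress bonus folded into
    r_result before weighting (assumption: the text only says 'added to r_result')."""
    adjusted = dict(rubrics)
    adjusted["r_result"] = adjusted["r_result"] + progress_bonus(tool_calls_made)
    return sum(WEIGHTS[name] * adjusted[name] for name in WEIGHTS)
```

Under this reading, a policy that never acts forfeits not only `r_cricket` runs but also up to 0.25 of effective `r_result`, which is the anti-planning-loop pressure described above.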
**Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use. `r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.

### Coherence Scoring (batting)

```
coherence = aggression_match × rationale_specificity × phase_appropriate

aggression_match      = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate     = 1 − |declared_aggression − phase_baseline|

phase_baselines: powerplay=0.55, middle=0.35, death=0.75
```

### Single-Stage Training with Format Curriculum

The original two-stage (format → strategy) curriculum was collapsed because Qwen3-4B-Instruct-2507 emits `...` natively. The full composite reward fires from step 0. What remains is a **format-length curriculum within the warmup config**: per-scenario `max_overs` is sampled from `[2, 2, 2, 2, 2, 2, 3, 3, 3]` (heavily weighted toward 2-over matches so episodes complete inside the token budget). The main run then trains on full 5-over end-to-end matches, resuming from the warmup adapter.

## 12. Data Curation Pipeline

### Step 1: Curate Ball Outcomes

```bash
python scripts/curate_transitions.py --format t20   # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi   # → ball_outcomes_odi_v1.pkl
```

Both files are already generated:

- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence

### Step 2: Build Evaluation Pack

```bash
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
```

### Step 3: Generate Opponent Cache

```bash
python scripts/generate_opponent_cache.py \
  --eval-pack data/eval_packs/adaptive_t20_v1.json \
  --mode heuristic \
  --split official \
  --output data/opponent_cache/adaptive_t20_v1.jsonl
```

## 13. Training Pipeline

### Recommended: Single-Command Chain

```bash
# Warmup (5-over curriculum, 25 steps) → Main (20-over T20, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
```

### Run Components Individually

**Warmup only — short curriculum, bootstraps the LoRA adapter:**

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```

**Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if `resume_from` is empty):**

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```

`train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
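The warmup's format-length curriculum described in the reward-design section reduces to a weighted draw over episode lengths. A minimal sketch (hypothetical helper name; only the `[2, 2, 2, 2, 2, 2, 3, 3, 3]` distribution comes from this document):

```python
import random

# Hypothetical sketch of the warmup format-length curriculum:
# max_overs is drawn 2-heavy so most episodes finish inside the token budget.
CURRICULUM_MAX_OVERS = [2, 2, 2, 2, 2, 2, 3, 3, 3]

def sample_max_overs(rng: random.Random) -> int:
    """Draw a per-scenario episode length: 2 overs with p = 2/3, 3 overs with p = 1/3."""
    return rng.choice(CURRICULUM_MAX_OVERS)
```

Over many sampled scenarios, roughly two-thirds of warmup episodes are 2-over matches, which is what keeps rollouts short enough to complete.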
The default training model is `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks). The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.

## 14. Current Status (2026-04-25)

### Implemented and verified

| Component | Status |
|-----------|--------|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (`environment_factory` agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |

### Verified end-to-end (2026-04-25)

All 3 opponent modes verified at 5-over inference + train-smoke:

| Mode | inference parse_err | train-smoke r_validity | coherence |
|------|--------------------|-----------------------|-----------|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live | 0% | 1.0 | 0.537 |

### Pending for submission

- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video

## 15. Recommended Demo Story

1. **Show a late chase scenario**: over 16.0, 128/5, target 172
2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent
3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑

> The model learned to captain, not just emit a valid tool-call object.