# CricketCaptain Benchmark Explainer
## 1. Problem Statement
CricketCaptain-LLM is an OpenEnv environment for training language-model agents on **adaptive strategic decision-making**.
The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes.
In cricket, every ball changes the world:
- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- Required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.
CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:
1. Plan before acting.
2. Execute a tool call.
3. Observe the ball result.
4. Model the opponent's plan.
5. Adapt tactics over a long match.
6. Manage resources: wickets, overs, target, field, batters, bowlers.
The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:
> Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?
## 2. Fit With OpenEnv Competition Themes
### Multi-Agent Interactions
The submitted captain agent plays against an opponent policy. The opponent can be:
- `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
- `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
- `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- `llm_cached`: replayed opponent decisions for reproducible evaluation.
This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.
### Long-Horizon Planning
A full match has many decisions across innings, phases, wickets, and pressure states. Rewards are partly delayed, so the agent cannot optimize only the next tool call.
### World Modeling
The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.
### Self-Improvement
The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.
## 3. Environment Flow
At the top level, the match lifecycle is:
```text
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
```
Within each batting or bowling phase, the tactical loop is:
```text
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
```
### Toss
```json
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
```
### Batting Tools
```json
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
```
`plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the 3-per-over limit (see Tool budget).
### Bowling Tools
```json
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
```
### Post-Ball Reflection
```json
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
```
### Analysis Tool
```json
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
```
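Putting the pieces together, a minimal client-side episode loop looks roughly like this. It is a sketch, not the verbatim `client.py` API: the constructor argument and the exact `reset`/`step` signatures are assumptions, grounded only in the `reset`/`step`/`state` surface and the observation fields documented later in this explainer.

```python
from client import CricketCaptainEnv  # WebSocket client (see Section 5)
from models import CricketAction      # action schema (see Section 5)

# Sketch of one decision loop; signatures are assumptions.
env = CricketCaptainEnv("ws://localhost:8000")
obs = env.reset()

while obs.game_state != "finished":
    # In practice an LLM maps obs.prompt_text to a tool call; here a
    # simple batting action is hard-coded for illustration.
    action = CricketAction(tool="play_delivery",
                           arguments={"shot_intent": "single",
                                      "explanation": "Rotate strike."})
    obs = env.step(action)
```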
## 4. Tool budget
The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).
**Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined):
`set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`
**Not overhead** (do not consume the 3 free “slots”):
`play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`
Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.
**Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.
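A minimal sketch of the enforcement logic (the real constants are `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`; the function and set names here are illustrative):

```python
TOOL_BUDGET_PER_OVER = 3      # free overhead calls per over
TOOL_FINE_PER_EXCESS = -0.04  # step-reward fine per excess call

OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

def charge_overhead(tool: str, used_this_over: int) -> tuple[int, float]:
    """Return the updated per-over counter and any fine for this call."""
    if tool not in OVERHEAD_TOOLS:
        return used_this_over, 0.0               # e.g. play_delivery is free
    used_this_over += 1
    fine = TOOL_FINE_PER_EXCESS if used_this_over > TOOL_BUDGET_PER_OVER else 0.0
    return used_this_over, fine                  # counter resets each new over
```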
## 5. OpenEnv Architecture
```text
LLM Agent / Evaluator
|
| WebSocket (OpenEnv)
v
FastAPI server (server/app.py)
|
v
CricketEnvironment (server/cricket_environment.py)
|
+--> MarkovCricketEngine (server/markov_engine.py)
+--> FormatMapper (server/format_mapper.py)
+--> OpponentPolicy (server/opponent_policy.py)
+--> PlayerRoster (server/player_roster.py)
+--> CoherenceGrader (server/coherence_grader.py)
+--> RewardCalculator (server/reward_calculator.py)
+--> FieldModel (server/field_model.py)
```
Key files:
| File | Role |
|------|------|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
| `client.py` | WebSocket client `CricketCaptainEnv` |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |
## 6. Format-Aware Rules
`server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:
| Format | max_overs | Key differences |
|--------|-----------|-----------------|
| T5 | ≤ 7 | High-aggression throughout, powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |
The format mapper provides:
- **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
- **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
- **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- **Bowling strategy** per phase (line, length, delivery_type, field_setting)
Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.
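As a sketch, the bands in the table can be applied directly (the real selector in `server/format_mapper.py` picks the nearest format by `|max_overs − format_overs|`; this simplified version just reproduces the documented cutoffs):

```python
def select_format(max_overs: int) -> str:
    """Map a scenario length to the closest documented rule set."""
    if max_overs <= 7:
        return "T5"    # high-aggression throughout
    if max_overs <= 35:
        return "T20"   # PP / Middle / Death phases
    return "ODI"       # four phases, anchor roles

assert select_format(5) == "T5" and select_format(50) == "ODI"
```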
## 7. Player Rosters
`server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.
When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.
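A minimal sketch of that fallback chain (the real `player_roster.py` also extracts the profile fields; names here are illustrative):

```python
def fuzzy_lookup(name: str, roster: list[str]) -> str | None:
    """Resolve a player name: exact -> surname -> word-overlap."""
    query = name.strip().lower()
    for player in roster:                         # 1. exact match
        if player.lower() == query:
            return player
    for player in roster:                         # 2. surname match
        if player.lower().split()[-1] == query.split()[-1]:
            return player
    tokens = set(query.split())
    for player in roster:                         # 3. any shared name token
        if tokens & set(player.lower().split()):
            return player
    return None

fuzzy_lookup("Kohli", ["Rohit Sharma", "Virat Kohli"])  # -> "Virat Kohli"
```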
## 8. What The Observation Contains
Each step returns a `CricketObservation` with:
- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for current state (phase-gated)
- `tool_budget`: overhead calls used this over vs 3-call limit
- `prompt_text`: rendered prompt for the LLM
The LLM sees enough information to reason tactically, but not simulator internals.
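For intuition, a trimmed, illustrative observation for the late-chase scenario from Section 15 might look like this (field values are made up; the exact schema lives in `models.py`):

```python
observation = {
    "game_state": "batting",
    "strategic_phase": "pre_ball",
    "game_context": {"score": 128, "wickets": 5, "over": 16, "ball": 0,
                     "target": 172, "phase": "death",
                     "run_rate": 8.0, "req_rate": 11.0},
    "declared_strategy": {"phase_intent": "accelerate", "aggression": 0.7,
                          "rationale": "Chase demands boundaries."},
    "field_setting": "Defensive",
    "last_outcome": {"event": "single", "shot_zone": "midwicket"},
    "available_tools": ["plan_shot", "play_delivery", "set_strategy"],
    "tool_budget": "1/3 overhead calls used this over",
}
```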
## 9. Opponent Policies
Four modes in `server/opponent_policy.py`:
### `heuristic`
Format-aware local policy using T5/T20/ODI rules from `format_mapper.py`. Picks shot intent from phase-weighted distributions, adjusts for wicket pressure (shifting toward conservative shots once 7+ wickets are down), and selects batter/bowler roles by current over and format. Fast, no API key needed.
### `cricsheet`
Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:
- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- > 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)
Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Heuristic fallback if data file absent.
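A sketch of the widening order (bucket keys follow the `(phase, wickets_band, innings_type)` index described above; helper names are illustrative):

```python
import random

def sample_delivery(buckets: dict, phase: str, wickets_band: str,
                    innings_type: str):
    """Sample a real delivery, widening the key until a bucket is non-empty."""
    for key in [(phase, wickets_band, innings_type),  # exact context
                (phase, wickets_band),                # drop innings_type
                (phase,)]:                            # any record in this phase
        records = buckets.get(key)
        if records:
            return random.choice(records)
    return None  # caller falls back to the heuristic policy
```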
### `llm_live`
Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
### `llm_cached`
Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.
## 10. Ball Physics And Markov Engine
The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`.
Ball transition tables keyed by `(over, wickets, score_band, phase, bowler_type)`:
1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
2. **Calibrated synthetic**: `data/transition_probs.json` as fallback
After the base Markov draw, a **hybrid tactical layer** applies:
- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful
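Put together, a single delivery resolves roughly as in this sketch: a base Markov draw followed by the tactical adjustment (all names, and the specific adjustment rule shown, are illustrative rather than the engine's actual code):

```python
import random

def resolve_ball(tables, ctx, shot_plan, rider_zones):
    # Stage 1: raw outcome drawn from the transition table for this context.
    key = (ctx["over"], ctx["wickets"], ctx["score_band"],
           ctx["phase"], ctx["bowler_type"])
    outcomes, probs = tables[key]
    outcome = random.choices(outcomes, weights=probs, k=1)[0]

    # Stage 2: hybrid tactical layer, e.g. a boundary hit into a zone
    # patrolled by a boundary rider gets cut off to a run or two.
    if outcome == "FOUR" and shot_plan["target_area"] in rider_zones:
        outcome = random.choice(["ONE", "TWO"])
    return outcome
```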
## 11. Reward Design
Four-rubric composite reward:
| Rubric | Weight | Frequency | Measures |
|--------|--------|-----------|----------|
| `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |
Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)`, which caps at +0.25 once the agent makes ≥10 tool calls. This directly rewards escaping the planning-loop trap, where the policy spends every turn on overhead tools without ever calling `play_delivery`.
**Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.
`r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.
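In sketch form (weights from the table above; the actual computation lives in `server/reward_calculator.py`):

```python
def composite_reward(r_cricket: float, r_behavior: float, r_result: float,
                     r_validity: float, tool_calls_made: int) -> float:
    """Four-rubric composite with the progress bonus folded into r_result."""
    progress_bonus = min(0.25, tool_calls_made / 40.0)  # +0.25 at >= 10 calls
    r_result += progress_bonus
    return (0.45 * r_cricket + 0.25 * r_behavior
            + 0.20 * r_result + 0.10 * r_validity)
```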
### Coherence Scoring (batting)
```
coherence = aggression_match × rationale_specificity × phase_appropriate
aggression_match = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate = 1 − |declared_aggression − phase_baseline|
phase_baselines: powerplay=0.55, middle=0.35, death=0.75
```
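Translated directly into Python (a sketch; the grader in `server/coherence_grader.py` also covers bowling and edge cases):

```python
PHASE_BASELINES = {"powerplay": 0.55, "middle": 0.35, "death": 0.75}

def batting_coherence(declared_aggression: float, shot_aggression_proxy: float,
                      word_count_score: float, cricket_keyword_density: float,
                      phase: str) -> float:
    """Product of the three sub-scores defined above (each in [0, 1])."""
    aggression_match = 1 - abs(declared_aggression - shot_aggression_proxy)
    rationale_specificity = (word_count_score + cricket_keyword_density) / 2
    phase_appropriate = 1 - abs(declared_aggression - PHASE_BASELINES[phase])
    return aggression_match * rationale_specificity * phase_appropriate
```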
### Single-Stage Training with Format Curriculum
The original two-stage (format → strategy) curriculum was collapsed because
Qwen3-4B-Instruct-2507 emits `<tool_call>...</tool_call>` natively. The full composite
reward fires from step 0.
What remains is a **format-length curriculum within the warmup config**: per-scenario
`max_overs` is sampled from `[2, 2, 2, 2, 2, 2, 3, 3, 3]` (heavy on T2 so episodes
complete inside the token budget). The main run then trains on full 5-over end-to-end
matches, resuming from the warmup adapter.
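The sampling itself is one line (a sketch of the warmup-config behavior):

```python
import random

# Six 2-over entries to three 3-over entries: most warmup episodes run
# 2 overs so they complete inside the token budget.
max_overs = random.choice([2, 2, 2, 2, 2, 2, 3, 3, 3])
```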
## 12. Data Curation Pipeline
### Step 1: Curate Ball Outcomes
```bash
python scripts/curate_transitions.py --format t20 # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi # → ball_outcomes_odi_v1.pkl
```
Both files already generated:
- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence
### Step 2: Build Evaluation Pack
```bash
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
```
### Step 3: Generate Opponent Cache
```bash
python scripts/generate_opponent_cache.py \
--eval-pack data/eval_packs/adaptive_t20_v1.json \
--mode heuristic \
--split official \
--output data/opponent_cache/adaptive_t20_v1.jsonl
```
## 13. Training Pipeline
### Recommended: Single-Command Chain
```bash
# Warmup (short 2-3 over curriculum, 25 steps) → Main (5-over end-to-end, 100 steps).
# Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
```
### Run Components Individually
**Warmup only — short curriculum, bootstraps the LoRA adapter:**
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```
**Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if `resume_from` is empty):**
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```
`train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
The default training model is `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks). The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.
## 14. Current Status (2026-04-25)
### Implemented and verified
| Component | Status |
|-----------|--------|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (`environment_factory` agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |
### Verified end-to-end (2026-04-25)
Three of the four opponent modes (all but `llm_cached`) were verified at 5-over inference + train-smoke:
| Mode | inference parse_err | train-smoke r_validity | coherence |
|------|--------------------|-----------------------|-----------|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live | 0% | 1.0 | 0.537 |
### Pending for submission
- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video
## 15. Recommended Demo Story
1. **Show a late chase scenario**: Over 16.0, 128/5, target 172
2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent
3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑
> The model learned to captain, not just emit a valid tool-call object.