# CricketCaptain Benchmark Explainer

## 1. Problem Statement

CricketCaptain-LLM is an OpenEnv environment for training language-model agents on **adaptive strategic decision-making**.

The core problem is not just whether an LLM can produce a good-sounding plan. The harder question is whether the model can keep that plan useful while the world changes.

In cricket, every ball changes the world:

- A wicket can force a lower-risk batting plan.
- A boundary can force a field or bowling change.
- Required run rate changes during a chase.
- Batter/bowler matchups matter.
- Field settings affect which shots are sensible.
- A captain must think before the ball and adapt after the ball.

CricketCaptain turns this into a trainable benchmark. The LLM acts as a cricket captain and must:

1. Plan before acting.
2. Execute a tool call.
3. Observe the ball result.
4. Model the opponent's plan.
5. Adapt tactics over a long match.
6. Manage resources: wickets, overs, target, field, batters, bowlers.

The original motivation came from strategic coherence: LLMs often say one thing and do another. This benchmark extends that into a fuller strategic setting:

> Can an LLM maintain, revise, and execute a strategy under adversarial match pressure?

## 2. Fit With OpenEnv Competition Themes

### Multi-Agent Interactions

The submitted captain agent plays against an opponent policy. The opponent can be:

- `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
- `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
- `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
- `llm_cached`: replayed opponent decisions for reproducible evaluation.

This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.

### Long-Horizon Planning

A full match has many decisions across innings, phases, wickets, and pressure states. Rewards are partly delayed, so the agent cannot optimize only the next tool call.

### World Modeling

The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.

### Self-Improvement

The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.

## 3. Environment Flow

At the top level, the match lifecycle is:

```text
TOSS -> BATTING/BOWLING -> INNINGS_SWAP -> BATTING/BOWLING -> FINISHED
```

Within each batting or bowling phase, the tactical loop is:

```text
PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
```
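
The tactical loop above gates which tools are legal at each step. As a purely hypothetical illustration (the real mapping lives in `server/cricket_environment.py` and may differ in detail), phase gating can be modeled as a lookup table:

```python
# Hypothetical phase -> legal-tools mapping; illustrative only, not the
# authoritative gating logic from server/cricket_environment.py.
PHASE_TOOLS = {
    "pre_over": ["set_strategy", "set_bowling_strategy", "select_batter",
                 "choose_bowler", "set_field_setting", "analyze_situation"],
    "pre_ball": ["plan_shot", "plan_delivery", "analyze_situation"],
    "ball_resolution": ["play_delivery", "bowl_delivery"],
    "post_ball": ["reflect_after_ball", "analyze_situation"],
}

def legal_tools(strategic_phase: str) -> list[str]:
    """Return the tools an agent may call in the given strategic phase."""
    return PHASE_TOOLS.get(strategic_phase, [])
```

The observation's `available_tools` field carries the result of this gating, so the agent never has to guess which calls are legal.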

### Toss

```json
{"tool": "call_toss", "arguments": {"call": "heads", "decision": "bat"}}
```

### Batting Tools

```json
{"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
{"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
{"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
{"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
```

`plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the 3-call-per-over limit (see Tool budget).

### Bowling Tools

```json
{"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
{"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
{"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
{"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
{"tool": "bowl_delivery", "arguments": {}}
```

### Post-Ball Reflection

```json
{"tool": "reflect_after_ball", "arguments": {"reflection": "Boundary came through cover; shift wider and protect the off-side gap."}}
```

### Analysis Tool

```json
{"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
```

## 4. Tool budget

The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).

**Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined):
`set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`

**Not overhead** (do not use the 3 free “slots”):
`play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`

Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.

**Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.
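
The budget mechanics can be sketched in a few lines. This is a minimal illustration using the constants named above; class and method names are stand-ins, and the real logic lives in `server/cricket_environment.py`:

```python
# Minimal sketch of the per-over overhead budget. Constants match the
# documented values; OverheadBudget itself is an illustrative stand-in.
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

class OverheadBudget:
    def __init__(self):
        self.used = 0

    def new_over(self):
        """Reset the overhead counter at the start of each over."""
        self.used = 0

    def charge(self, tool: str) -> float:
        """Return the step-reward penalty for this tool call (0.0 if free)."""
        if tool not in OVERHEAD_TOOLS:
            return 0.0  # action tools never consume the budget
        self.used += 1
        return -TOOL_FINE_PER_EXCESS if self.used > TOOL_BUDGET_PER_OVER else 0.0
```

Under this sketch, the fourth and every later overhead call in the same over each return −0.04, while `play_delivery` and `bowl_delivery` are always free.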

## 5. OpenEnv Architecture

```text
LLM Agent / Evaluator
        |
        | WebSocket (OpenEnv)
        v
FastAPI server (server/app.py)
        |
        v
CricketEnvironment (server/cricket_environment.py)
        |
        +--> MarkovCricketEngine  (server/markov_engine.py)
        +--> FormatMapper         (server/format_mapper.py)
        +--> OpponentPolicy       (server/opponent_policy.py)
        +--> PlayerRoster         (server/player_roster.py)
        +--> CoherenceGrader      (server/coherence_grader.py)
        +--> RewardCalculator     (server/reward_calculator.py)
        +--> FieldModel           (server/field_model.py)
```

Key files:

| File | Role |
|------|------|
| `server/app.py` | OpenEnv server entry point |
| `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
| `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
| `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
| `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
| `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
| `client.py` | WebSocket client `CricketCaptainEnv` |
| `inference.py` | Random + LLM agent evaluation |
| `train.py` | MT-GRPO + SFT training pipeline |
| `eval.py` | Coherence heatmaps, reward curves, tool analytics |

## 6. Format-Aware Rules

`server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:

| Format | max_overs | Key differences |
|--------|-----------|-----------------|
| T5 | ≤ 7 | High-aggression throughout, powerplay dominates all overs |
| T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
| ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |

The format mapper provides:
- **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
- **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
- **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
- **Bowling strategy** per phase (line, length, delivery_type, field_setting)

Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.
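
The band boundaries in the table translate directly into a selector. This sketch follows the table's bands; the actual `format_mapper.py` may instead compute the `|max_overs − format_overs|` distance directly:

```python
def select_format(max_overs: int) -> str:
    """Pick the closest rule set for a match length, following the band
    boundaries in the table above (illustrative sketch)."""
    if max_overs <= 7:
        return "T5"
    if max_overs <= 35:
        return "T20"
    return "ODI"
```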

## 7. Player Rosters

`server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.

When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.

## 8. What The Observation Contains

Each step returns a `CricketObservation` with:

- `game_state`: toss / batting / bowling / finished
- `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
- `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
- `declared_strategy`: current batting strategy (aggression, intent, rationale)
- `bowling_strategy`: current bowling plan
- `field_setting`: Aggressive / Balanced / Defensive
- `current_batter`: batter profile (style, aggression, phase strengths)
- `current_bowler`: bowler profile
- `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
- `available_tools`: legal tools for current state (phase-gated)
- `tool_budget`: overhead calls used this over vs 3-call limit
- `prompt_text`: rendered prompt for the LLM

The LLM sees enough information to reason tactically, but not simulator internals.

## 9. Opponent Policies

Four modes in `server/opponent_policy.py`:

### `heuristic`

Format-aware local policy using the T5/T20/ODI rules from `format_mapper.py`. It picks shot intent from phase-weighted distributions, shifts conservative under wicket pressure (seven or more wickets down), and selects batter/bowler roles by current over and format. Fast, and no API key needed.

### `cricsheet`

Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:
- ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
- > 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)

Progressive fallback widening (drop innings_type → drop wickets_band → any phase record) ensures no dead buckets. Heuristic fallback if data file absent.
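
The widening order can be sketched as successive lookups against an increasingly loose key. The bucket keys and widening order follow the description above; the index structure itself is a hypothetical dict of outcome lists:

```python
# Sketch of progressive fallback widening for Cricsheet sampling.
import random

def sample_outcome(index, phase, wickets_band, innings_type, rng=random):
    """Try increasingly wide keys until a non-empty bucket is found."""
    for key in (
        (phase, wickets_band, innings_type),  # exact context
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band
    ):
        bucket = index.get(key)
        if bucket:
            return rng.choice(bucket)
    # Any record from any phase, as a last resort.
    all_records = [o for b in index.values() for o in b]
    return rng.choice(all_records) if all_records else None
```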

### `llm_live`

Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.

### `llm_cached`

Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.
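
A minimal replay reader looks like this; the one-decision-per-line schema is illustrative, with the real cache produced by `scripts/generate_opponent_cache.py`:

```python
# Minimal cached-opponent replay reader (schema is illustrative).
import json

def iter_cached_decisions(path: str):
    """Yield pre-recorded opponent decisions in file order, one per line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

Because the file is replayed verbatim, every captain evaluated against the same cache faces an identical opponent trajectory.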

## 10. Ball Physics And Markov Engine

The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`.

Ball transition tables keyed by `(over, wickets, score_band, phase, bowler_type)`:
1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
2. **Calibrated synthetic**: `data/transition_probs.json` as fallback

After the base Markov draw, a **hybrid tactical layer** applies:
- Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
- Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
- Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
- Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
- High chase pressure makes defensive batting less useful
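
The base Markov draw amounts to sampling from an outcome distribution keyed by match context. The key shape follows the text above; the table contents here are made-up numbers, not the calibrated Cricsheet data:

```python
# Sketch of the base Markov draw; probabilities are illustrative.
import random

TRANSITIONS = {
    # (over, wickets, score_band, phase, bowler_type) -> {outcome: prob}
    (16, 5, "120-140", "death", "pace"): {
        "dot": 0.30, "1": 0.30, "2": 0.10, "4": 0.15, "6": 0.07, "W": 0.08,
    },
}

def draw_outcome(key, rng=random):
    dist = TRANSITIONS.get(key)
    if dist is None:
        return "1"  # stand-in fallback; the real engine widens the key instead
    outcomes, probs = zip(*dist.items())
    return rng.choices(outcomes, weights=probs, k=1)[0]
```

The hybrid tactical layer then post-processes this draw against the declared shot zone, delivery features, and field preset before the outcome is reported.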

## 11. Reward Design

Four-rubric composite reward:

| Rubric | Weight | Frequency | Measures |
|--------|--------|-----------|----------|
| `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
| `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
| `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
| `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |

Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)` — caps at +0.25 once the agent makes ≥10 tool calls. Directly rewards escaping the planning-loop trap (where the policy maxes overhead tools without ever calling `play_delivery`).

**Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.

`r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.
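
Putting the weights and the progress bonus together, the composite can be sketched as follows (an illustration of the description above; see `server/reward_calculator.py` for the real implementation):

```python
# Composite reward sketch using the documented 45/25/20/10 weights.
WEIGHTS = {"r_cricket": 0.45, "r_behavior": 0.25, "r_result": 0.20, "r_validity": 0.10}

def progress_bonus(tool_calls_made: int) -> float:
    """Caps at +0.25 once the agent has made at least 10 tool calls."""
    return min(0.25, tool_calls_made / 40.0)

def composite_reward(rubrics: dict[str, float], tool_calls_made: int) -> float:
    r = dict(rubrics)
    # The progress bonus is folded into r_result before weighting.
    r["r_result"] = r.get("r_result", 0.0) + progress_bonus(tool_calls_made)
    return sum(WEIGHTS[k] * r.get(k, 0.0) for k in WEIGHTS)
```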

### Coherence Scoring (batting)

```
coherence = aggression_match × rationale_specificity × phase_appropriate
aggression_match   = 1 − |declared_aggression − shot_aggression_proxy|
rationale_specificity = (word_count_score + cricket_keyword_density) / 2
phase_appropriate  = 1 − |declared_aggression − phase_baseline|
phase_baselines: powerplay=0.55, middle=0.35, death=0.75
```
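
The formula above transcribes directly into Python. This sketch assumes all component scores live in [0, 1] and takes `rationale_specificity` as a precomputed input:

```python
# Direct transcription of the batting coherence formula above.
PHASE_BASELINES = {"powerplay": 0.55, "middle": 0.35, "death": 0.75}

def coherence(declared_aggression: float, shot_aggression_proxy: float,
              rationale_specificity: float, phase: str) -> float:
    aggression_match = 1 - abs(declared_aggression - shot_aggression_proxy)
    phase_appropriate = 1 - abs(declared_aggression - PHASE_BASELINES[phase])
    return aggression_match * rationale_specificity * phase_appropriate
```

A declared aggression of 0.35 with a matching shot in the middle overs scores a perfect 1.0, while the same declaration paired with a 0.75-aggression slog drops the score multiplicatively on both the match and phase terms.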

### Single-Stage Training with Format Curriculum

The original two-stage (format → strategy) curriculum was collapsed because
Qwen3-4B-Instruct-2507 emits `<tool_call>...</tool_call>` natively. The full composite
reward fires from step 0.

What remains is a **format-length curriculum within the warmup config**: per-scenario
`max_overs` is sampled from `[2, 2, 2, 2, 2, 2, 3, 3, 3]` (heavy on T2 so episodes
complete inside the token budget). The main run then trains on full 5-over end-to-end
matches, resuming from the warmup adapter.

## 12. Data Curation Pipeline

### Step 1: Curate Ball Outcomes

```bash
python scripts/curate_transitions.py --format t20   # → ball_outcomes_t20_v1.pkl
python scripts/curate_transitions.py --format odi   # → ball_outcomes_odi_v1.pkl
```

Both files already generated:
- `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
- `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
- `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence

### Step 2: Build Evaluation Pack

```bash
python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
```

### Step 3: Generate Opponent Cache

```bash
python scripts/generate_opponent_cache.py \
  --eval-pack data/eval_packs/adaptive_t20_v1.json \
  --mode heuristic \
  --split official \
  --output data/opponent_cache/adaptive_t20_v1.jsonl
```

## 13. Training Pipeline

### Recommended: Single-Command Chain

```bash
# Warmup (short 2–3-over curriculum, 25 steps) → Main (5-over end-to-end, 100 steps).
# Main auto-resumes from the warmup adapter at ./checkpoints/stage2_final.
bash scripts/run_warmup_then_main.sh
```

### Run Components Individually

**Warmup only — short curriculum, bootstraps the LoRA adapter:**

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```

**Main only — 5-over end-to-end, resumes the warmup adapter (or fresh if `resume_from` is empty):**

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3.yaml
```

`train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.

The default training model is `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks). The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.

## 14. Current Status (2026-04-25)

### Implemented and verified

| Component | Status |
|-----------|--------|
| OpenEnv server + client | ✅ |
| 14-tool strategic surface | ✅ |
| 4-rubric reward system | ✅ |
| Tool budget system (3/over, −0.04 fine) | ✅ |
| Format mapper (T5/T20/ODI) | ✅ |
| Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
| Cricsheet T20 data (1.17M deliveries) | ✅ |
| Cricsheet ODI data (1.65M deliveries) | ✅ |
| Heuristic opponent (format-aware) | ✅ |
| Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
| LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
| LLM cached opponent | ✅ |
| GRPO training script (`environment_factory` agent rollouts) | ✅ |
| SFT data generator | ✅ |
| Gradio demo UI | ✅ |
| Colab training notebook | ✅ |

### Verified end-to-end (2026-04-25)

All 3 opponent modes verified at 5-over inference + train-smoke:

| Mode | inference parse_err | train-smoke r_validity | coherence |
|------|--------------------|-----------------------|-----------|
| heuristic | 0% | 1.0 | 0.556 |
| cricsheet | 0% | 1.0 | 0.620 |
| llm_live  | 0% | 1.0 | 0.537 |

### Pending for submission

- Real GRPO training run with reward curves (requires HF compute)
- HF Space deployment URL
- Training-vs-baseline comparison plots
- Mini-blog / video

## 15. Recommended Demo Story

1. **Show a late chase scenario**: Over 16.0, 128/5, target 172

2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent

3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level

4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑

> The model learned to captain, not just emit a valid tool-call object.