# Experiment Workflow: Baselines, Opponents, Short Runs, and Training

This document explains how to run CricketCaptain experiments in a practical order: smoke checks, 5-over baselines, training, then longer evaluation.

## 1. Why Start With 5-Over Smoke + Baselines?

A full T20 innings is 20 overs. That is useful for final evaluation but slow for debugging.

For early experiments, 5-over runs are better because they quickly answer:

- Is the OpenEnv server working?
- Is the client connecting correctly?
- Is the model producing valid JSON?
- Are tool calls accepted?
- Are rewards non-zero and sensible?
- Does the opponent policy work?
- Do metrics like coherence, adaptation, and opponent awareness update?

The workflow should be:

```text
5-over smoke → 5-over untrained baseline → short training → 5-over trained eval → 20-over final eval
```

Do not start with full 20-over training unless the 5-over smoke loop is stable.

## 2. Opponent Modes

Four opponent modes are implemented in `server/opponent_policy.py`, controlled via the `--opponent-mode` flag, the `CRICKET_OPPONENT_MODE` environment variable, or the `opponent.mode` field in any training/eval YAML.

**Default in the current Qwen3 training configs is `heuristic`** (`configs/cricket_train_qwen3*.yaml`) for fast iteration. Switch to `llm_live` for adversarial evaluation when an HF token is set. The legacy `configs/extras/default.yaml` defaulted to `llm_live`.
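For reference, the lookup can be pictured as a tiny resolver; a sketch assuming the conventional CLI-flag > env-var > YAML precedence (the helper name is hypothetical):

```python
import os

def resolve_opponent_mode(cli_mode=None, yaml_cfg=None):
    """Hypothetical helper: assumes the CLI flag wins, then the
    CRICKET_OPPONENT_MODE env var, then opponent.mode from parsed YAML."""
    if cli_mode:
        return cli_mode
    env_mode = os.environ.get("CRICKET_OPPONENT_MODE")
    if env_mode:
        return env_mode
    return (yaml_cfg or {}).get("opponent", {}).get("mode", "heuristic")
```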

### 2.1 Heuristic Opponent

```bash
--opponent-mode heuristic
```

Format-aware local policy. Uses T5/T20/ODI rules from `server/format_mapper.py`:
- Phase-weighted shot distributions (powerplay/middle/death per format)
- Wicket-pressure shift (heavier weight toward defensive shots when 7+ down)
- Batter/bowler roles selected from `data/format_rules.json`

Use for: fast local tests, cheap training rollouts, deterministic-ish baselines.
**No API key needed.**
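A minimal sketch of the phase-weighted idea (shot names and weights here are illustrative; the real per-format tables come from `data/format_rules.json` via `server/format_mapper.py`):

```python
import random

# Illustrative weights only; real tables live in data/format_rules.json.
SHOT_WEIGHTS = {
    "powerplay": {"aggressive": 0.5, "rotate": 0.3, "defensive": 0.2},
    "middle":    {"aggressive": 0.3, "rotate": 0.5, "defensive": 0.2},
    "death":     {"aggressive": 0.7, "rotate": 0.2, "defensive": 0.1},
}

def pick_shot(phase: str, wickets_down: int) -> str:
    weights = dict(SHOT_WEIGHTS[phase])
    if wickets_down >= 7:  # wicket-pressure shift toward defensive shots
        weights["defensive"] += 0.3
        weights["aggressive"] = max(0.05, weights["aggressive"] - 0.3)
    shots = list(weights)
    return random.choices(shots, weights=[weights[s] for s in shots])[0]
```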

### 2.2 Cricsheet Opponent

```bash
--opponent-mode cricsheet
```

Samples real Cricsheet ball-by-ball deliveries, indexed by `(phase, wickets_band, innings_type)`.

Auto-selects data by format:
- `max_overs ≤ 25` → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries)
- `max_overs > 25` → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries)

Progressive fallback: drop innings_type → drop wickets_band → any phase record.
**No API key needed.** Data files must be present under `data/processed/`.
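The fallback chain amounts to retrying with successively coarser keys; a sketch (key shapes are illustrative, the real index is built from the pkl files):

```python
import random

def sample_delivery(index, phase, wickets_band, innings_type):
    """Try the full (phase, wickets_band, innings_type) key, then drop
    innings_type, then wickets_band, matching the fallback order above."""
    for key in [
        (phase, wickets_band, innings_type),
        (phase, wickets_band, None),
        (phase, None, None),  # any record for this phase
    ]:
        bucket = index.get(key)
        if bucket:
            return random.choice(bucket)
    raise LookupError(f"no Cricsheet records for phase={phase!r}")
```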

### 2.3 Live LLM Opponent

```bash
export HF_TOKEN=hf_...
--opponent-mode llm_live
```

Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible endpoint). Set `HF_TOKEN` or pass `--opponent-api-key`; otherwise use `--opponent-mode heuristic` for local runs.

Use for: demos, realistic opponent behavior, self-play experiments.
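Under the hood this is a standard OpenAI-compatible chat call; a minimal sketch using the `openai` client (the prompt text is illustrative):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",  # HF Router endpoint
    api_key=os.environ["HF_TOKEN"],
)
resp = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user",
               "content": "You are the opposing captain. Choose the next delivery: ..."}],
)
print(resp.choices[0].message.content)
```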

### 2.4 Cached LLM Opponent

```bash
export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1.jsonl
--opponent-mode llm_cached
```

Replays pre-recorded decisions. Does **not** call any live model. Use for official/reproducible eval — every compared captain faces identical opponent decisions.
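A sketch of the replay mechanics (the JSONL record schema is an assumption; only the sequential, state-independent replay is from this doc):

```python
import json
import os

def load_cache(path):
    """Each JSONL line is one pre-recorded opponent decision. Replaying
    them in order means every compared captain sees the same sequence."""
    with open(path) as f:
        return [json.loads(line) for line in f]

decisions = load_cache(os.environ["CRICKET_OPPONENT_CACHE"])
```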

## 3. Starting The Server

```bash
# Recommended (uvicorn auto-reload)
cd cricket_captain
python -m uvicorn server.app:app --port 8001 --reload

# Or via app.py directly
PYTHONPATH=. python server/app.py --port 8001
```

Health check: `curl http://localhost:8001/health` → `{"status":"healthy"}`

Set the URL for runners:
```bash
export CRICKET_CAPTAIN_ENV_URL=http://localhost:8001
```

On Lightning / remote runtimes, expose the port and pass the external URL:
```bash
export CRICKET_CAPTAIN_ENV_URL=ws://<lightning-exposed-host>/ws
```
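A small Python preflight, equivalent to the curl check (assumes the HTTP URL form and the `requests` package):

```python
import os
import requests

url = os.environ.get("CRICKET_CAPTAIN_ENV_URL", "http://localhost:8001")
r = requests.get(f"{url}/health", timeout=5)
r.raise_for_status()
assert r.json().get("status") == "healthy", r.text
print("server OK:", url)
```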

## 4. Step 1: Random Baseline (heuristic, cricsheet, llm_live)

No GPU needed, and the random captain needs no model API key; only the `llm_live` opponent run requires `HF_TOKEN` (Section 2.3). Verify the full loop works.

```bash
# Run each of the three modes (sequentially, or in parallel shells)
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode heuristic  --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode cricsheet  --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode llm_live   --env-url http://localhost:8001
```

**Verified baselines (2026-04-25):**

| Opponent | score | coherence | reward | parse_err |
|----------|-------|-----------|--------|-----------|
| heuristic | 20.8 | 0.556 | −0.826 | 0% |
| cricsheet | 28.0 | 0.527 | −0.410 | 0% |
| llm_live  | 27.4 | 0.537 | −0.723 | 0% |

## 5. Step 2: Train-Smoke (verify reward signals, no GPU)

`train.py train-smoke` runs direct `CricketEnvironment` rollouts — **no server needed**.

```bash
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode heuristic
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode cricsheet
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode llm_live
```

**Verified train-smoke baselines (2026-04-25):** r_validity=1.0 on all 9 matches (3 modes × 3 matches). All reward signals active: coherence, adaptation, opponent_awareness, plan_commitment, staleness, regret.

Quick 2-over smoke for CI-style checks:

```bash
python train.py train-smoke --matches 1 --max-overs 2 --max-steps 240 --log-steps 90 --opponent-mode heuristic
```
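Conceptually, train-smoke is a direct rollout loop against the environment class, with no HTTP hop. A sketch only: the constructor and method names below are assumptions, not the repo's verified API:

```python
# Pseudocode sketch; check server/cricket_environment.py for the real API.
from server.cricket_environment import CricketEnvironment

env = CricketEnvironment(max_overs=5, opponent_mode="heuristic")  # assumed signature
obs = env.reset()
done, total = False, 0.0
while not done:
    action = env.sample_random_action()          # assumed helper
    obs, reward, done, info = env.step(action)   # assumed gym-style step
    total += reward
print("composite reward:", total)
```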

## 6. Step 3: Untrained LLM Baseline (requires HF token)

```bash
export HF_TOKEN=hf_...
python inference.py \
  --model google/gemma-4-26B-A4B-it \
  --api-base https://router.huggingface.co/v1 \
  --api-key "$HF_TOKEN" \
  --episodes 3 --max-overs 5 \
  --opponent-mode llm_live \
  --env-url http://localhost:8001
```

**Verified LLM captain run (2026-04-25):**
```
model: google/gemma-4-26B-A4B-it via HF Router
coherence: 0.657 | adaptation: 0.502 | opp_aware: 0.750
parse errors: 0.0% | reward: +0.168
```

## 7. Step 4: SFT Tool Warmup

```bash
python train.py sft-data --output data/training/tool_sft_examples.jsonl --examples 500
```

Teaches tool-call shape, tool names, and argument schemas before RL. Not the main objective — just reduces wasted GRPO rollouts on syntax/tool-selection errors.
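For intuition, one hypothetical record (field names are illustrative; inspect the generated JSONL for the real schema, only the `<tool_call>` wrapper is from this doc):

```python
import json

example = {
    "prompt": "Over 3.2, 41/2 chasing 48. Decide the next action.",
    "completion": (
        "<tool_call>"
        '{"name": "plan_shot", "arguments": {"shot": "rotate_strike", "risk": "low"}}'
        "</tool_call>"
    ),
}
print(json.dumps(example))  # one JSONL line
```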

## 8. Step 5: GRPO Warmup (2–3 over curriculum)

Single-stage training (full composite reward from step 0) because Qwen3-4B-Instruct-2507
emits `<tool_call>...</tool_call>` natively. Warmup is a fast curriculum on short formats
to bootstrap the LoRA adapter before the longer main run:

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```

Config: `overs_distribution=[2,2,2,2,2,2,3,3,3]`, `steps=30`, `num_generations=4`, `batch_size=1`,
`grad_accum=4`, `max_completion_length=16384`, `save_steps=5`. Approx 50–60 min on H200.

Goal: bootstrap the LoRA adapter on short matches that complete inside the token budget.
Watch in WandB: `reward/composite_mean`, `reward/r_result_mean`, `tools/freq_*`,
`rollout/match_completion_rate`, `completions/clipped_ratio`.
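To sanity-check the YAML against the values above, a small loader (assumes these keys live under the `train:` section; requires PyYAML):

```python
import yaml

with open("configs/cricket_train_qwen3_warmup.yaml") as f:
    cfg = yaml.safe_load(f)

train = cfg.get("train", {})  # section name per the Configs list below
for key in ("overs_distribution", "steps", "num_generations",
            "batch_size", "grad_accum", "max_completion_length", "save_steps"):
    print(f"{key} = {train.get(key)}")
```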

## 9. Step 6: GRPO Main Run (5-over end-to-end)

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3.yaml
```

Config: `max_overs=5`, `steps=100`, `num_generations=4`, `batch_size=1`, `grad_accum=4`,
`max_completion_length=24576`, `save_steps=20`, `beta=0.0` (no reference model).
Approx 3–4 min/step → ~5–7 hrs total on H200.

The config sets `resume_from: ./checkpoints/stage2_final` so the main run picks up the
warmup adapter automatically. Comment out that line (or pass `--resume-from ""`) for a
fresh-adapter main run. The base model still loads from `Qwen/Qwen3-4B-Instruct-2507`;
only the LoRA weights resume.

Goal: improve coherence, adaptation, opponent awareness, match outcomes on full T20s.
Watch in WandB: `reward/r_result_mean` (sparse outcome), `reward/r_coherence_mean`,
`reward/r_adaptation_mean`, `episode/tool_calls_mean` (should approach 720), `episode/agent_score_mean`.

Switch the opponent in YAML (`opponent.mode: llm_live`) and set `HF_TOKEN` for adversarial
training against live Gemma. Use `cricsheet` or `heuristic` for cheaper ablations.

`train.py train` uses TRL `GRPOTrainer(environment_factory=CricketCaptainToolEnv)`, so the model interacts with a live environment over multiple tool-calling turns. This is not inference through the HF Router; the trained captain model is loaded locally by Transformers/TRL, with LoRA adapters when using quantized weights.
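A compressed sketch of that wiring (pseudocode mirroring this doc's description; the `GRPOConfig` fields are standard TRL, but the `environment_factory` hook and exact call live in the repo's `train.py`):

```python
# Pseudocode; see train.py for the actual setup.
from trl import GRPOConfig, GRPOTrainer
from cricket_captain.tool_env import CricketCaptainToolEnv  # import path assumed

config = GRPOConfig(
    num_generations=4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_completion_length=24576,
    beta=0.0,  # no reference model, per the main-run config
)
trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    args=config,
    environment_factory=CricketCaptainToolEnv,  # multi-turn tool-calling rollouts
)
trainer.train()
```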

## 10. Step 7: Evaluation

```bash
python eval.py \
  --episodes 20 \
  --env-url http://localhost:8001 \
  --eval-pack-id adaptive_t20_v1 \
  --opponent-mode llm_cached
```

Compare: random baseline → untrained Qwen3-4B-Instruct-2507 → trained Qwen3-4B-Instruct-2507 (warmup + main adapter via `compare_eval.py`).

## 11. Format Comparison

| max_overs | Format selected | Data used by cricsheet | Typical target |
|-----------|----------------|------------------------|---------------|
| 5 | T5 | T20 pkl (closest) | ~47 runs |
| 7 | T5 | T20 pkl | ~66 runs |
| 20 | T20 | T20 pkl | ~160 runs |
| 50 | ODI | ODI pkl | ~290 runs |

All formats work with all opponent modes. Use `--max-overs N` with any runner.
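The selection logic reduces to simple thresholds; a sketch consistent with the table (the ≤25 data cutoff is from Section 2.2, the T5/T20 format boundary is an assumption):

```python
def select_data_file(max_overs: int) -> str:
    # Section 2.2: up to 25 overs uses the T20 pkl, otherwise ODI.
    if max_overs <= 25:
        return "data/processed/ball_outcomes_t20_v1.pkl"
    return "data/processed/ball_outcomes_odi_v1.pkl"

def select_format(max_overs: int) -> str:
    # Boundaries inferred from the table; the real ones live in server/format_mapper.py.
    if max_overs <= 7:
        return "T5"
    if max_overs <= 25:
        return "T20"
    return "ODI"
```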

## 12. Tool Budget and Training

Implemented in `server/cricket_environment.py` (`TOOL_BUDGET_PER_OVER=3`, `TOOL_FINE_PER_EXCESS=0.04`).

**Overhead tools (only these increment the per-over counter):** `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`.

**Not overhead:** `plan_shot`, `set_match_plan`, `update_match_plan`, `select_batter`, `choose_bowler`, `set_field_setting`, `play_delivery`, `bowl_delivery`, `call_toss`, and other tools that advance or directly execute the ball.

**Rule:** the first 3 overhead calls in each over are not fined; each further overhead call in that over adds **−0.04** to the step reward. The prompt includes `Tool budget: N/3 overhead calls used this over`.
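In code form, the cumulative fine for one over (constants from `cricket_environment.py`; the function name is illustrative):

```python
TOOL_BUDGET_PER_OVER = 3     # free overhead calls per over
TOOL_FINE_PER_EXCESS = 0.04  # reward penalty per call beyond the budget

def over_fine(overhead_calls_this_over: int) -> float:
    """Total penalty accrued this over: the first 3 overhead calls are
    free; each further one subtracts 0.04 from its step reward."""
    excess = max(0, overhead_calls_this_over - TOOL_BUDGET_PER_OVER)
    return -TOOL_FINE_PER_EXCESS * excess
```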

**Why this matters for GRPO:** training uses the same environment as inference. Fines are part of the reward the trainer optimizes, so the policy learns to use reflection and `analyze_situation` when they matter, and to lean on `plan_shot` plus match-level planning (`set_match_plan` / `update_match_plan`) for routine structure without spending the 3 free overhead calls every ball.

## 13. Configs

```bash
# Start server with the legacy default config (llm_live opponent, 5-over default)
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/default.yaml

# Start with reproducible cached eval
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/cached_eval.yaml
```

Config controls: `env.agent_team`, `env.max_overs`, `env.eval_pack_id`, `train.model`, `train.max_completion_length`, `train.max_tool_calling_iterations`, `opponent.mode`, `opponent.model`, `opponent.api_base`, `captain.model`, and `captain.api_base`.

## 14. Latest Verified Run Results (2026-04-25)

All run logs live in `illustrations/`. Zero parse errors across all 14 inference runs. r_validity=1.0 across all train-smoke matches.

```
Random agent — 5-over, heuristic:   score=20.8  coherence=0.556  reward=−0.826
Random agent — 5-over, cricsheet:   score=28.0  coherence=0.527  reward=−0.410
Random agent — 5-over, llm_live:    score=27.4  coherence=0.537  reward=−0.723
Random agent — 20-over, cricsheet:  score=63.6  coherence=0.568  reward=−5.632
Random agent — 20-over, heuristic:  score=82.4  coherence=0.545  reward=−8.174

Train-smoke — 5-over, heuristic:    r_validity=1.0  coherence=0.596  3W/0L
Train-smoke — 5-over, cricsheet:    r_validity=1.0  coherence=0.620  2W/1L
Train-smoke — 5-over, llm_live:     r_validity=1.0  coherence=0.552  2W/1L

LLM captain (gemma-4-26B) — 3-over, llm_live:
  coherence=0.657  adaptation=0.502  opp_aware=0.750  parse_err=0%
```

## 15. Immediate Next Steps

1. **Run GRPO training** via `bash scripts/run_warmup_then_main.sh` (warmup curriculum + main run + auto-resume) to produce reward curves.
2. **Deploy HF Space** for live Gradio demo — Dockerfile present, just needs HF push.
3. **Generate opponent cache** using `llm_live` for reproducible official eval.
4. **Produce training plots** — coherence heatmap, reward curve, tool-usage timeline.