# Experiment Workflow: Baselines, Opponents, Short Runs, and Training
This document explains how to run CricketCaptain experiments in a practical order: smoke checks, 5-over baselines, training, then longer evaluation.
## 1. Why Start With 5-Over Smoke + Baselines?
A full T20 innings is 20 overs. That is useful for final evaluation but slow for debugging.
For early experiments, 5-over runs are better because they quickly answer:
- Is the OpenEnv server working?
- Is the client connecting correctly?
- Is the model producing valid JSON?
- Are tool calls accepted?
- Are rewards non-zero and sensible?
- Does the opponent policy work?
- Do metrics like coherence, adaptation, and opponent awareness update?
The workflow should be:
```text
5-over smoke → 5-over untrained baseline → short training → 5-over trained eval → 20-over final eval
```
Do not start with full 20-over training unless the 5-over smoke loop is stable.
## 2. Opponent Modes
Four opponent modes are implemented in `server/opponent_policy.py`. Select one via `--opponent-mode`, the `CRICKET_OPPONENT_MODE` environment variable, or the `opponent.mode` field in any training/eval YAML.
**Default in the current Qwen3 training configs is `heuristic`** (`configs/cricket_train_qwen3*.yaml`) for fast iteration. Switch to `llm_live` for adversarial evaluation when an HF token is set. The legacy `configs/extras/default.yaml` defaulted to `llm_live`.
### 2.1 Heuristic Opponent
```bash
--opponent-mode heuristic
```
Format-aware local policy. Uses T5/T20/ODI rules from `server/format_mapper.py`:
- Phase-weighted shot distributions (powerplay/middle/death per format)
- Wicket-pressure shift (heavier weight toward defensive shots when 7+ down)
- Batter/bowler roles selected from `data/format_rules.json`
Use for: fast local tests, cheap training rollouts, deterministic-ish baselines.
**No API key needed.**
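A minimal sketch of the phase-weighted selection described above; the weight values and names are illustrative placeholders, not the actual tables in `server/opponent_policy.py` or `data/format_rules.json`:
```python
import random

# Illustrative per-phase shot weights for T20 (placeholder values).
PHASE_WEIGHTS = {
    "powerplay": {"drive": 0.35, "cut": 0.25, "loft": 0.25, "defend": 0.15},
    "middle":    {"drive": 0.30, "cut": 0.20, "loft": 0.15, "defend": 0.35},
    "death":     {"loft": 0.45, "drive": 0.30, "cut": 0.15, "defend": 0.10},
}

def pick_shot(phase: str, wickets_down: int) -> str:
    """Sample a shot, shifting weight toward defense when 7+ wickets are down."""
    weights = dict(PHASE_WEIGHTS[phase])
    if wickets_down >= 7:
        weights["defend"] += 0.30  # wicket-pressure shift
    shots, probs = zip(*weights.items())
    return random.choices(shots, weights=probs, k=1)[0]

print(pick_shot("death", wickets_down=8))
```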
### 2.2 Cricsheet Opponent
```bash
--opponent-mode cricsheet
```
Samples real Cricsheet ball-by-ball deliveries, indexed by `(phase, wickets_band, innings_type)`.
Auto-selects data by format:
- `max_overs ≤ 25` → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries)
- `max_overs > 25` → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries)
Progressive fallback: drop innings_type → drop wickets_band → any phase record.
**No API key needed.** Data files must be present under `data/processed/`.
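An illustrative sketch of the file selection and progressive fallback; the index layout and pickle schema are assumptions, not the actual structure of the `.pkl` files:
```python
import random

def outcomes_file(max_overs: int) -> str:
    """Pick the Cricsheet outcomes file by format, per the rule above."""
    if max_overs <= 25:
        return "data/processed/ball_outcomes_t20_v1.pkl"
    return "data/processed/ball_outcomes_odi_v1.pkl"

def sample_delivery(index, phase, wickets_band, innings_type):
    """index: dict mapping (phase, wickets_band, innings_type) -> deliveries."""
    for key in [
        (phase, wickets_band, innings_type),  # exact match
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band
    ]:
        bucket = index.get(key)
        if bucket:
            return random.choice(bucket)
    # Last resort: any phase record at all.
    everything = [d for bucket in index.values() for d in bucket]
    return random.choice(everything)
```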
### 2.3 Live LLM Opponent
```bash
export HF_TOKEN=hf_...
--opponent-mode llm_live
```
Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible endpoint). Set `HF_TOKEN` or pass `--opponent-api-key`; otherwise use `--opponent-mode heuristic` for local runs.
Use for: demos, realistic opponent behavior, self-play experiments.
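The call shape, sketched with the `openai` client against the same router endpoint used elsewhere in this doc; the prompt is a placeholder, not the real opponent prompt:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",  # HF Router, OpenAI-compatible
    api_key=os.environ["HF_TOKEN"],
)
resp = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Ball 3.4, 45/2 chasing 88: pick a shot."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```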
### 2.4 Cached LLM Opponent
```bash
--opponent-mode llm_cached
export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1.jsonl
```
Replays pre-recorded decisions. Does **not** call any live model. Use for official/reproducible eval — every compared captain faces identical opponent decisions.
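A sketch of the replay idea, assuming one JSON decision per line; the record schema of `adaptive_t20_v1.jsonl` is an assumption here:
```python
import json

def load_cache(path: str):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Decisions are consumed in recorded order, so every evaluated captain
# faces exactly the same opponent behavior.
cache = iter(load_cache("data/opponent_cache/adaptive_t20_v1.jsonl"))
decision = next(cache)
```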
## 3. Starting The Server
```bash
# Recommended: uvicorn (pass --reload for auto-reload during development)
cd cricket_captain
python -m uvicorn server.app:app --port 8001 --reload

# Or run app.py directly
PYTHONPATH=. python server/app.py --port 8001
```
Health check: `curl http://localhost:8001/health` → `{"status":"healthy"}`
Set the URL for runners:
```bash
export CRICKET_CAPTAIN_ENV_URL=http://localhost:8001
```
On Lightning / remote runtimes, expose the port and pass the external URL:
```bash
export CRICKET_CAPTAIN_ENV_URL=ws://<lightning-exposed-host>/ws
```
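A programmatic equivalent of the curl check, reading the same env var the runners use (assumes the plain-HTTP URL form):
```python
import os
import requests

url = os.environ.get("CRICKET_CAPTAIN_ENV_URL", "http://localhost:8001")
resp = requests.get(f"{url}/health", timeout=5)
resp.raise_for_status()
assert resp.json().get("status") == "healthy", resp.text
```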
## 4. Step 1: Random Baseline (all 3 opponent modes)
No GPU needed; the heuristic and cricsheet runs need no API key, while the llm_live run requires `HF_TOKEN` (see §2.3). Verify the full loop works.
```bash
# Run each of the 3 modes (use separate shells to run them in parallel)
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode heuristic --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode cricsheet --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode llm_live --env-url http://localhost:8001
```
**Verified baselines (2026-04-25):**
| Opponent | score | coherence | reward | parse_err |
|----------|-------|-----------|--------|-----------|
| heuristic | 20.8 | 0.556 | −0.826 | 0% |
| cricsheet | 28.0 | 0.527 | −0.410 | 0% |
| llm_live | 27.4 | 0.537 | −0.723 | 0% |
## 5. Step 2: Train-Smoke (verify reward signals, no GPU)
`train.py train-smoke` runs direct `CricketEnvironment` rollouts — **no server needed**.
```bash
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode heuristic
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode cricsheet
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode llm_live
```
**Verified train-smoke baselines (2026-04-25):** r_validity=1.0 on all 9 matches (3 modes × 3 matches). All reward signals active: coherence, adaptation, opponent_awareness, plan_commitment, staleness, regret.
Quick 2-over smoke for CI-style checks:
```bash
python train.py train-smoke --matches 1 --max-overs 2 --max-steps 240 --log-steps 90 --opponent-mode heuristic
```
## 6. Step 3: Untrained LLM Baseline (requires HF token)
```bash
export HF_TOKEN=hf_...
python inference.py \
--model google/gemma-4-26B-A4B-it \
--api-base https://router.huggingface.co/v1 \
--api-key "$HF_TOKEN" \
--episodes 3 --max-overs 5 \
--opponent-mode llm_live \
--env-url http://localhost:8001
```
**Verified LLM captain run (2026-04-25):**
```
model: google/gemma-4-26B-A4B-it via HF Router
coherence: 0.657 | adaptation: 0.502 | opp_aware: 0.750
parse errors: 0.0% | reward: +0.168
```
## 7. Step 4: SFT Tool Warmup
```bash
python train.py sft-data --output data/training/tool_sft_examples.jsonl --examples 500
```
Teaches tool-call shape, tool names, and argument schemas before RL. This is not the main objective; it simply reduces wasted GRPO rollouts on syntax and tool-selection errors.
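For orientation, one plausible record shape; the actual fields in `tool_sft_examples.jsonl` may differ, and the `set_strategy` arguments below are invented:
```python
import json

example = {
    "messages": [
        {"role": "user", "content": "Over 4.2, 31/1 chasing 52. Set the batting strategy."},
        {
            "role": "assistant",
            # Qwen3-style native tool-call markup (see the warmup section below).
            "content": '<tool_call>{"name": "set_strategy", '
                       '"arguments": {"aggression": "balanced"}}</tool_call>',
        },
    ]
}
print(json.dumps(example))
```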
## 8. Step 5: GRPO Warmup (2–3 over curriculum)
Training is single-stage (the full composite reward applies from step 0) because Qwen3-4B-Instruct-2507
emits `<tool_call>...</tool_call>` natively. The warmup is a fast curriculum on short formats
that bootstraps the LoRA adapter before the longer main run:
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```
Config: `overs_distribution=[2,2,2,2,2,2,3,3,3]`, `steps=30`, `num_generations=4`, `batch_size=1`,
`grad_accum=4`, `max_completion_length=16384`, `save_steps=5`. Approx 50–60 min on H200.
Goal: bootstrap the LoRA adapter on short matches that complete inside the token budget.
Watch in WandB: `reward/composite_mean`, `reward/r_result_mean`, `tools/freq_*`,
`rollout/match_completion_rate`, `completions/clipped_ratio`.
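One plausible reading of how the curriculum draws a format per rollout; the trainer's actual sampling code may differ:
```python
import random

overs_distribution = [2, 2, 2, 2, 2, 2, 3, 3, 3]  # from the warmup config
max_overs = random.choice(overs_distribution)     # ~2/3 two-over, ~1/3 three-over matches
```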
## 9. Step 6: GRPO Main Run (5-over end-to-end)
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```
Config: `max_overs=5`, `steps=100`, `num_generations=4`, `batch_size=1`, `grad_accum=4`,
`max_completion_length=24576`, `save_steps=20`, `beta=0.0` (no reference model).
Approx 3–4 min/step → ~5–7 hrs total on H200.
The config sets `resume_from: ./checkpoints/stage2_final` so the main run picks up the
warmup adapter automatically. Comment out that line (or pass `--resume-from ""`) for a
fresh-adapter main run. The base model still loads from `Qwen/Qwen3-4B-Instruct-2507`;
only the LoRA weights resume.
Goal: improve coherence, adaptation, opponent awareness, and match outcomes on full T20s.
Watch in WandB: `reward/r_result_mean` (sparse outcome), `reward/r_coherence_mean`,
`reward/r_adaptation_mean`, `episode/tool_calls_mean` (should approach 720), `episode/agent_score_mean`.
Switch the opponent in YAML (`opponent.mode: llm_live`) and set `HF_TOKEN` for adversarial
training against live Gemma. Use `cricsheet` or `heuristic` for cheaper ablations.
`train.py train` uses TRL `GRPOTrainer(environment_factory=CricketCaptainToolEnv)`, so the model interacts with a live environment over multiple tool-calling turns. This is not inference through the HF Router; the trained captain model is loaded locally by Transformers/TRL, with LoRA adapters when using quantized weights.
## 10. Step 7: Evaluation
```bash
python eval.py \
--episodes 20 \
--env-url http://localhost:8001 \
--eval-pack-id adaptive_t20_v1 \
--opponent-mode llm_cached
```
Compare: random baseline → untrained Qwen3-4B-Instruct-2507 → trained Qwen3-4B-Instruct-2507 (warmup + main adapter via `compare_eval.py`).
## 11. Format Comparison
| max_overs | Format selected | Data used by cricsheet | Typical target |
|-----------|----------------|------------------------|---------------|
| 5 | T5 | T20 pkl (closest) | ~47 runs |
| 7 | T5 | T20 pkl | ~66 runs |
| 20 | T20 | T20 pkl | ~160 runs |
| 50 | ODI | ODI pkl | ~290 runs |
All formats work with all opponent modes. Use `--max-overs N` with any runner.
## 12. Tool budget and training
Implemented in `server/cricket_environment.py` (`TOOL_BUDGET_PER_OVER=3`, `TOOL_FINE_PER_EXCESS=0.04`).
**Overhead tools (only these increment the per-over counter):** `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`.
**Not overhead:** `plan_shot`, `set_match_plan`, `update_match_plan`, `select_batter`, `choose_bowler`, `set_field_setting`, `play_delivery`, `bowl_delivery`, `call_toss`, and other tools that advance or directly execute the ball.
**Rule:** the first 3 overhead calls in each over are not fined; each further overhead call in that over adds **−0.04** to the step reward. The prompt includes `Tool budget: N/3 overhead calls used this over`.
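A minimal sketch of the fine rule, using the constants named above; the function shape is illustrative, not the code in `server/cricket_environment.py`:
```python
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

def overhead_fine(tool: str, overhead_calls_so_far_this_over: int) -> float:
    """Reward penalty for one call: 0.0 within budget, -0.04 per excess call."""
    if tool not in OVERHEAD_TOOLS:
        return 0.0  # executing/advancing tools never count against the budget
    if overhead_calls_so_far_this_over < TOOL_BUDGET_PER_OVER:
        return 0.0  # first 3 overhead calls in the over are free
    return -TOOL_FINE_PER_EXCESS

print(overhead_fine("plan_delivery", overhead_calls_so_far_this_over=3))  # -0.04
```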
**Why this matters for GRPO:** training uses the same environment as inference. Fines are part of the reward the trainer optimizes, so the policy learns to use reflection and `analyze_situation` when they matter, and to lean on `plan_shot` plus match-level planning (`set_match_plan` / `update_match_plan`) for routine structure without spending the 3 free overhead calls every ball.
## 13. Configs
```bash
# Start server with the legacy default config (llm_live opponent, 5-over default)
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/default.yaml
# Start with reproducible cached eval
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/cached_eval.yaml
```
Config controls: `env.agent_team`, `env.max_overs`, `env.eval_pack_id`, `train.model`, `train.max_completion_length`, `train.max_tool_calling_iterations`, `opponent.mode`, `opponent.model`, `opponent.api_base`, `captain.model`, and `captain.api_base`.
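For orientation, the same keys as a nested mapping; all values below are illustrative placeholders, not shipped defaults:
```python
config = {
    "env": {"agent_team": "A", "max_overs": 5, "eval_pack_id": "adaptive_t20_v1"},
    "train": {
        "model": "Qwen/Qwen3-4B-Instruct-2507",
        "max_completion_length": 24576,
        "max_tool_calling_iterations": 40,  # placeholder value
    },
    "opponent": {"mode": "heuristic", "model": None, "api_base": None},
    "captain": {"model": None, "api_base": None},
}
```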
## 14. Latest Verified Run Results (2026-04-25)
All runs in `illustrations/`. Zero parse errors across all 14 inference runs. r_validity=1.0 across all train-smoke matches.
```
Random agent — 5-over, heuristic: score=20.8 coherence=0.556 reward=−0.826
Random agent — 5-over, cricsheet: score=28.0 coherence=0.527 reward=−0.410
Random agent — 5-over, llm_live: score=27.4 coherence=0.537 reward=−0.723
Random agent — 20-over, cricsheet: score=63.6 coherence=0.568 reward=−5.632
Random agent — 20-over, heuristic: score=82.4 coherence=0.545 reward=−8.174
Train-smoke — 5-over, heuristic: r_validity=1.0 coherence=0.596 3W/0L
Train-smoke — 5-over, cricsheet: r_validity=1.0 coherence=0.620 2W/1L
Train-smoke — 5-over, llm_live: r_validity=1.0 coherence=0.552 2W/1L
LLM captain (gemma-4-26B) — 3-over, llm_live:
coherence=0.657 adaptation=0.502 opp_aware=0.750 parse_err=0%
```
## 15. Immediate Next Steps
1. **Run GRPO training** via `bash scripts/run_warmup_then_main.sh` (warmup curriculum + main run + auto-resume) to produce reward curves.
2. **Deploy HF Space** for live Gradio demo — Dockerfile present, just needs HF push.
3. **Generate opponent cache** using `llm_live` for reproducible official eval.
4. **Produce training plots** — coherence heatmap, reward curve, tool-usage timeline.