# Experiment Workflow: Baselines, Opponents, Short Runs, and Training
This document explains how to run CricketCaptain experiments in a practical order: smoke checks, 5-over baselines, training, then longer evaluation.

## 1. Why Start With 5-Over Smoke + Baselines?

A full T20 innings is 20 overs. That is useful for final evaluation but slow for debugging.
For early experiments, 5-over runs are better because they quickly answer:
- Is the OpenEnv server working?
- Is the client connecting correctly?
- Is the model producing valid JSON?
- Are tool calls accepted?
- Are rewards non-zero and sensible?
- Does the opponent policy work?
- Do metrics like coherence, adaptation, and opponent awareness update?

The workflow should be:

```text
5-over smoke → 5-over untrained baseline → short training → 5-over trained eval → 20-over final eval
```

Do not start with full 20-over training unless the 5-over smoke loop is stable.
## 2. Opponent Modes

There are four modes in `server/opponent_policy.py`, controlled via `--opponent-mode`, `CRICKET_OPPONENT_MODE`, or the `opponent.mode` field in any training/eval YAML.
**The default in the current Qwen3 training configs is `heuristic`** (`configs/cricket_train_qwen3*.yaml`) for fast iteration. Switch to `llm_live` for adversarial evaluation when an HF token is set. The legacy `configs/extras/default.yaml` defaulted to `llm_live`.
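The three control surfaces above can be sketched as a precedence resolver. This is a hypothetical helper for illustration only: the real resolution lives in the project's config loading, and the CLI > env var > YAML ordering is an assumption, not something the codebase confirms.

```python
import os

def resolve_opponent_mode(cli_mode=None, yaml_cfg=None, default="heuristic"):
    """Resolve the opponent mode from (assumed) highest to lowest priority:
    --opponent-mode flag > CRICKET_OPPONENT_MODE env var > opponent.mode
    in the YAML config > built-in default."""
    if cli_mode:
        return cli_mode
    env_mode = os.environ.get("CRICKET_OPPONENT_MODE")
    if env_mode:
        return env_mode
    if yaml_cfg and yaml_cfg.get("opponent", {}).get("mode"):
        return yaml_cfg["opponent"]["mode"]
    return default
```

A YAML config would only win here when neither the flag nor the env var is set, which matches how most CLI tools layer configuration.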
### 2.1 Heuristic Opponent

```bash
--opponent-mode heuristic
```

Format-aware local policy. Uses T5/T20/ODI rules from `server/format_mapper.py`:
- Phase-weighted shot distributions (powerplay/middle/death, per format)
- Wicket-pressure shift (heavier weight toward defensive shots when 7+ wickets are down)
- Batter/bowler roles selected from `data/format_rules.json`

Use for: fast local tests, cheap training rollouts, deterministic-ish baselines.
**No API key needed.**
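The phase weighting and wicket-pressure shift can be sketched as weighted sampling. The weight numbers and shot labels below are made up for illustration; the real tables live in `server/format_mapper.py` and `data/format_rules.json`.

```python
import random

# Illustrative T20 phase weights (invented numbers, not the project's).
SHOT_WEIGHTS_T20 = {
    "powerplay": {"aggressive": 0.5, "rotate": 0.3, "defensive": 0.2},
    "middle":    {"aggressive": 0.3, "rotate": 0.5, "defensive": 0.2},
    "death":     {"aggressive": 0.7, "rotate": 0.2, "defensive": 0.1},
}

def pick_shot(phase, wickets_down, rng=random):
    """Sample a shot class, shifting weight toward defence when 7+ down."""
    weights = dict(SHOT_WEIGHTS_T20[phase])
    if wickets_down >= 7:  # wicket-pressure shift
        weights["defensive"] += 0.3
        weights["aggressive"] = max(0.05, weights["aggressive"] - 0.3)
    shots, probs = zip(*weights.items())
    return rng.choices(shots, weights=probs, k=1)[0]
```

Seeding the RNG is what makes this "deterministic-ish": the same seed and match state reproduce the same shot sequence.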
### 2.2 Cricsheet Opponent

```bash
--opponent-mode cricsheet
```

Samples real Cricsheet ball-by-ball deliveries, indexed by `(phase, wickets_band, innings_type)`.
Auto-selects data by format:
- `max_overs ≤ 25` → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries)
- `max_overs > 25` → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries)

Progressive fallback: drop `innings_type` → drop `wickets_band` → any phase record.
**No API key needed.** Data files must be present under `data/processed/`.
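The progressive fallback can be sketched as follows. The index shape (a dict from `(phase, wickets_band, innings_type)` tuples to lists of delivery records) is an assumption for illustration; the real index is built from the pkl files above.

```python
import random

def sample_delivery(index, phase, wickets_band, innings_type, rng=random):
    """Sample a recorded delivery, relaxing the key progressively:
    exact match -> drop innings_type -> drop wickets_band -> None."""
    exact = index.get((phase, wickets_band, innings_type))
    if exact:
        return rng.choice(exact)
    # Drop innings_type: any record matching phase + wickets_band.
    pool = [r for k, recs in index.items() for r in recs
            if k[0] == phase and k[1] == wickets_band]
    if pool:
        return rng.choice(pool)
    # Drop wickets_band: any record for this phase.
    pool = [r for k, recs in index.items() for r in recs if k[0] == phase]
    return rng.choice(pool) if pool else None
```

Each relaxation trades situational fidelity for coverage, so rare states (e.g. 9 down in the powerplay) still get a plausible real delivery.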
### 2.3 Live LLM Opponent

```bash
export HF_TOKEN=hf_...
--opponent-mode llm_live
```

Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible endpoint). Set `HF_TOKEN` or pass `--opponent-api-key`; otherwise use `--opponent-mode heuristic` for local runs.
Use for: demos, realistic opponent behavior, self-play experiments.
### 2.4 Cached LLM Opponent

```bash
export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1.jsonl
--opponent-mode llm_cached
```

Replays pre-recorded decisions. Does **not** call any live model. Use for official/reproducible eval — every compared captain faces identical opponent decisions.
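A minimal sketch of cached replay, assuming one JSON object per line in the cache file (the field names below are hypothetical; check the actual cache files for the real schema):

```python
import json

def load_opponent_cache(path):
    """Load pre-recorded opponent decisions from a JSONL cache file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

class CachedOpponent:
    """Replays cached decisions in order, so every evaluated captain
    faces exactly the same opponent sequence."""
    def __init__(self, decisions):
        self.decisions = decisions
        self.cursor = 0

    def next_decision(self):
        d = self.decisions[self.cursor % len(self.decisions)]
        self.cursor += 1
        return d
```

Because the cursor advances deterministically, two captains evaluated against the same cache see byte-identical opponent behavior, which is what makes cached eval reproducible.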
## 3. Starting The Server

```bash
# Recommended (uvicorn with auto-reload during development)
cd cricket_captain
python -m uvicorn server.app:app --port 8001 --reload

# Or via app.py directly
PYTHONPATH=. python server/app.py --port 8001
```
Health check: `curl http://localhost:8001/health` → `{"status":"healthy"}`
Set the URL for runners:

```bash
export CRICKET_CAPTAIN_ENV_URL=http://localhost:8001
```

On Lightning / remote runtimes, expose the port and pass the external URL:

```bash
export CRICKET_CAPTAIN_ENV_URL=ws://<lightning-exposed-host>/ws
```
## 4. Step 1: Random Baseline (all 3 opponent modes)

No GPU needed. `heuristic` and `cricsheet` require no API key; `llm_live` needs `HF_TOKEN` set (Section 2.3). Verify the full loop works:

```bash
# Run each of the 3 modes
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode heuristic --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode cricsheet --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode llm_live --env-url http://localhost:8001
```
**Verified baselines (2026-04-25):**

| Opponent  | score | coherence | reward | parse_err |
|-----------|-------|-----------|--------|-----------|
| heuristic | 20.8  | 0.556     | −0.826 | 0%        |
| cricsheet | 28.0  | 0.527     | −0.410 | 0%        |
| llm_live  | 27.4  | 0.537     | −0.723 | 0%        |
## 5. Step 2: Train-Smoke (verify reward signals, no GPU)

`train.py train-smoke` runs direct `CricketEnvironment` rollouts — **no server needed**.

```bash
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode heuristic
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode cricsheet
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode llm_live
```

**Verified train-smoke baselines (2026-04-25):** r_validity=1.0 on all 9 matches (3 modes × 3 matches). All reward signals active: coherence, adaptation, opponent_awareness, plan_commitment, staleness, regret.
Quick 2-over smoke for CI-style checks:

```bash
python train.py train-smoke --matches 1 --max-overs 2 --max-steps 240 --log-steps 90 --opponent-mode heuristic
```
## 6. Step 3: Untrained LLM Baseline (requires HF token)

```bash
export HF_TOKEN=hf_...
python inference.py \
  --model google/gemma-4-26B-A4B-it \
  --api-base https://router.huggingface.co/v1 \
  --api-key "$HF_TOKEN" \
  --episodes 3 --max-overs 5 \
  --opponent-mode llm_live \
  --env-url http://localhost:8001
```

**Verified LLM captain run (2026-04-25):**

```
model: google/gemma-4-26B-A4B-it via HF Router
coherence: 0.657 | adaptation: 0.502 | opp_aware: 0.750
parse errors: 0.0% | reward: +0.168
```
## 7. Step 4: SFT Tool Warmup

```bash
python train.py sft-data --output data/training/tool_sft_examples.jsonl --examples 500
```

Teaches tool-call shape, tool names, and argument schemas before RL. Not the main objective — it just reduces wasted GRPO rollouts on syntax/tool-selection errors.
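One warmup example might look like the record below. The message layout and field names are an assumption for illustration; the real schema is whatever `train.py sft-data` emits, so inspect the generated JSONL before relying on this shape.

```python
import json

# Hypothetical SFT record: a match-state prompt paired with a well-formed
# tool call, teaching shape/name/argument schema rather than strategy.
example = {
    "messages": [
        {"role": "user",
         "content": "Over 4.2, chasing 47 off 10 balls. Plan the next shot."},
        {"role": "assistant",
         "content": json.dumps({
             "name": "plan_shot",
             "arguments": {"shot_type": "rotate", "risk": "low"},
         })},
    ]
}
line = json.dumps(example)  # one line of the output JSONL file
```

The payoff is mechanical: after warmup, nearly every GRPO rollout spends its token budget on decisions rather than malformed tool calls.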
## 8. Step 5: GRPO Warmup (2–3 over curriculum)

Single-stage training (full composite reward from step 0) because Qwen3-4B-Instruct-2507 emits `<tool_call>...</tool_call>` natively. Warmup is a fast curriculum on short formats to bootstrap the LoRA adapter before the longer main run:

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
```

Config: `overs_distribution=[2,2,2,2,2,2,3,3,3]`, `steps=30`, `num_generations=4`, `batch_size=1`, `grad_accum=4`, `max_completion_length=16384`, `save_steps=5`. Approx 50–60 min on an H200.
Goal: bootstrap the LoRA adapter on short matches that complete inside the token budget.
Watch in WandB: `reward/composite_mean`, `reward/r_result_mean`, `tools/freq_*`, `rollout/match_completion_rate`, `completions/clipped_ratio`.
## 9. Step 6: GRPO Main Run (5-over end-to-end)

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```

Config: `max_overs=5`, `steps=100`, `num_generations=4`, `batch_size=1`, `grad_accum=4`, `max_completion_length=24576`, `save_steps=20`, `beta=0.0` (no reference model). Approx 3–4 min/step → ~5–7 hrs total on an H200.
The config sets `resume_from: ./checkpoints/stage2_final`, so the main run picks up the warmup adapter automatically. Comment out that line (or pass `--resume-from ""`) for a fresh-adapter main run. The base model still loads from `Qwen/Qwen3-4B-Instruct-2507`; only the LoRA weights resume.
Goal: improve coherence, adaptation, opponent awareness, and match outcomes on full T20s.
Watch in WandB: `reward/r_result_mean` (sparse outcome), `reward/r_coherence_mean`, `reward/r_adaptation_mean`, `episode/tool_calls_mean` (should approach 720), `episode/agent_score_mean`.
Switch the opponent in YAML (`opponent.mode: llm_live`) and set `HF_TOKEN` for adversarial training against live Gemma. Use `cricsheet` or `heuristic` for cheaper ablations.
`train.py train` uses TRL `GRPOTrainer(environment_factory=CricketCaptainToolEnv)`, so the model interacts with a live environment over multiple tool-calling turns. This is not inference through the HF Router; the trained captain model is loaded locally by Transformers/TRL, with LoRA adapters when using quantized weights.
## 10. Step 7: Evaluation

```bash
python eval.py \
  --episodes 20 \
  --env-url http://localhost:8001 \
  --eval-pack-id adaptive_t20_v1 \
  --opponent-mode llm_cached
```

Compare: random baseline → untrained Qwen3-4B-Instruct-2507 → trained Qwen3-4B-Instruct-2507 (warmup + main adapter) via `compare_eval.py`.
## 11. Format Comparison

| max_overs | Format selected | Data used by cricsheet | Typical target |
|-----------|-----------------|------------------------|----------------|
| 5         | T5              | T20 pkl (closest)      | ~47 runs       |
| 7         | T5              | T20 pkl                | ~66 runs       |
| 20        | T20             | T20 pkl                | ~160 runs      |
| 50        | ODI             | ODI pkl                | ~290 runs      |

All formats work with all opponent modes. Use `--max-overs N` with any runner.
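The table's format selection can be sketched as two threshold functions. The ≤25-over rule for pkl selection is stated in Section 2.2; the T5/T20 boundary below is inferred from the table rows (7 overs still maps to T5), so treat it as an assumption and check `server/format_mapper.py` for the canonical thresholds.

```python
def select_format(max_overs):
    """Map max_overs to a format label, mirroring the table above."""
    if max_overs <= 7:      # inferred boundary: 5 and 7 overs both show T5
        return "T5"
    if max_overs <= 25:
        return "T20"
    return "ODI"

def cricsheet_pkl_for(max_overs):
    """Pick the Cricsheet delivery file, per the rule in Section 2.2."""
    return ("ball_outcomes_t20_v1.pkl" if max_overs <= 25
            else "ball_outcomes_odi_v1.pkl")
```

Note the two thresholds differ on purpose: a 20-over match is "T20" for rules but still draws from the T20 pkl, the closest available data for anything up to 25 overs.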
## 12. Tool Budget and Training

Implemented in `server/cricket_environment.py` (`TOOL_BUDGET_PER_OVER=3`, `TOOL_FINE_PER_EXCESS=0.04`).
**Overhead tools (only these increment the per-over counter):** `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`.
**Not overhead:** `plan_shot`, `set_match_plan`, `update_match_plan`, `select_batter`, `choose_bowler`, `set_field_setting`, `play_delivery`, `bowl_delivery`, `call_toss`, and other tools that advance play or directly execute the delivery.
**Rule:** the first 3 overhead calls in each over are not fined; each further overhead call in that over adds **−0.04** to the step reward. The prompt includes `Tool budget: N/3 overhead calls used this over`.
**Why this matters for GRPO:** training uses the same environment as inference. Fines are part of the reward the trainer optimizes, so the policy learns to use reflection and `analyze_situation` when they matter, and to lean on `plan_shot` plus match-level planning (`set_match_plan` / `update_match_plan`) for routine structure without spending the 3 free overhead calls on every ball.
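The budget rule above reduces to a small per-over counter. This is a minimal sketch of that rule, not the actual implementation in `server/cricket_environment.py`:

```python
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

class OverheadBudget:
    """Per-over overhead-call counter: first 3 overhead calls are free,
    each further one costs TOOL_FINE_PER_EXCESS off the step reward."""
    def __init__(self):
        self.used = 0

    def new_over(self):
        self.used = 0  # budget resets at the start of every over

    def fine_for(self, tool_name):
        """Return the reward penalty for this tool call (0.0 if free)."""
        if tool_name not in OVERHEAD_TOOLS:
            return 0.0  # execution tools never count against the budget
        self.used += 1
        return TOOL_FINE_PER_EXCESS if self.used > TOOL_BUDGET_PER_OVER else 0.0
```

Because the fine lands directly on the step reward, GRPO sees the cost of over-analysis in the same signal as match outcomes and spends the budget only where it pays off.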
## 13. Configs

```bash
# Start server with the legacy default config (llm_live opponent, 5-over default)
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/default.yaml

# Start with the reproducible cached-eval config
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/cached_eval.yaml
```

Config controls: `env.agent_team`, `env.max_overs`, `env.eval_pack_id`, `train.model`, `train.max_completion_length`, `train.max_tool_calling_iterations`, `opponent.mode`, `opponent.model`, `opponent.api_base`, `captain.model`, and `captain.api_base`.
## 14. Latest Verified Run Results (2026-04-25)

All runs are in `illustrations/`. Zero parse errors across all 14 inference runs; r_validity=1.0 across all train-smoke matches.

```
Random agent — 5-over, heuristic:  score=20.8 coherence=0.556 reward=−0.826
Random agent — 5-over, cricsheet:  score=28.0 coherence=0.527 reward=−0.410
Random agent — 5-over, llm_live:   score=27.4 coherence=0.537 reward=−0.723
Random agent — 20-over, cricsheet: score=63.6 coherence=0.568 reward=−5.632
Random agent — 20-over, heuristic: score=82.4 coherence=0.545 reward=−8.174
Train-smoke — 5-over, heuristic:   r_validity=1.0 coherence=0.596 3W/0L
Train-smoke — 5-over, cricsheet:   r_validity=1.0 coherence=0.620 2W/1L
Train-smoke — 5-over, llm_live:    r_validity=1.0 coherence=0.552 2W/1L
LLM captain (gemma-4-26B) — 3-over, llm_live:
  coherence=0.657 adaptation=0.502 opp_aware=0.750 parse_err=0%
```
## 15. Immediate Next Steps

1. **Run GRPO training** via `bash scripts/run_warmup_then_main.sh` (warmup curriculum + main run + auto-resume) to produce reward curves.
2. **Deploy HF Space** for a live Gradio demo — the Dockerfile is present; it just needs an HF push.
3. **Generate an opponent cache** using `llm_live` for reproducible official eval.
4. **Produce training plots** — coherence heatmap, reward curve, tool-usage timeline.