Experiment Workflow: Baselines, Opponents, Short Runs, and Training
This document explains how to run CricketCaptain experiments in a practical order: smoke checks, 5-over baselines, training, then longer evaluation.
1. Why Start With 5-Over Smoke + Baselines?
A full T20 innings is 20 overs. That is useful for final evaluation but slow for debugging.
For early experiments, 5-over runs are better because they quickly answer:
- Is the OpenEnv server working?
- Is the client connecting correctly?
- Is the model producing valid JSON?
- Are tool calls accepted?
- Are rewards non-zero and sensible?
- Does the opponent policy work?
- Do metrics like coherence, adaptation, and opponent awareness update?
The workflow should be:
5-over smoke → 5-over untrained baseline → short training → 5-over trained eval → 20-over final eval
Do not start with full 20-over training unless the 5-over smoke loop is stable.
2. Opponent Modes
Four opponent modes are implemented in server/opponent_policy.py, controlled via --opponent-mode, the CRICKET_OPPONENT_MODE environment variable, or the opponent.mode field in any training/eval YAML.
Default in the current Qwen3 training configs is heuristic (configs/cricket_train_qwen3*.yaml) for fast iteration. Switch to llm_live for adversarial evaluation when an HF token is set. The legacy configs/extras/default.yaml defaulted to llm_live.
2.1 Heuristic Opponent
--opponent-mode heuristic
Format-aware local policy. Uses T5/T20/ODI rules from server/format_mapper.py:
- Phase-weighted shot distributions (powerplay/middle/death per format)
- Wicket-pressure shift (heavier weight toward defensive shots when 7+ down)
- Batter/bowler roles selected from data/format_rules.json
Use for: fast local tests, cheap training rollouts, deterministic-ish baselines. No API key needed.
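For intuition, phase-weighted sampling with a wicket-pressure shift can be sketched in a few lines. This is a minimal illustration only: the weight tables, function name, and phase labels below are invented, and the real logic lives in server/format_mapper.py with rules from data/format_rules.json.

```python
import random

# Hypothetical per-phase shot weights; the real tables come from data/format_rules.json.
PHASE_SHOT_WEIGHTS = {
    "powerplay": {"aggressive": 0.50, "rotate": 0.35, "defensive": 0.15},
    "middle":    {"aggressive": 0.30, "rotate": 0.45, "defensive": 0.25},
    "death":     {"aggressive": 0.60, "rotate": 0.25, "defensive": 0.15},
}

def pick_shot(phase: str, wickets_down: int, rng: random.Random) -> str:
    """Sample a shot type, shifting weight toward defence when 7+ wickets are down."""
    weights = dict(PHASE_SHOT_WEIGHTS[phase])
    if wickets_down >= 7:  # the wicket-pressure shift described above
        weights["defensive"] += 0.20
        weights["aggressive"] = max(0.05, weights["aggressive"] - 0.20)
    shots = list(weights)
    return rng.choices(shots, weights=[weights[s] for s in shots], k=1)[0]

print(pick_shot("death", wickets_down=8, rng=random.Random(0)))
```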
2.2 Cricsheet Opponent
--opponent-mode cricsheet
Samples real Cricsheet ball-by-ball deliveries, indexed by (phase, wickets_band, innings_type).
Auto-selects data by format:
- max_overs ≤ 25 → ball_outcomes_t20_v1.pkl (1.17M T20 deliveries)
- max_overs > 25 → ball_outcomes_odi_v1.pkl (1.65M ODI deliveries)
Progressive fallback: drop innings_type → drop wickets_band → any phase record.
No API key needed. Data files must be present under data/processed/.
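A minimal sketch of that lookup, assuming the pickle holds a dict keyed by (phase, wickets_band, innings_type); the key layout, loader helper, and field names are assumptions, not the actual implementation:

```python
import pickle
import random

def load_outcomes(max_overs: int, data_dir: str = "data/processed"):
    """Pick the T20 or ODI pickle based on max_overs, as described above."""
    name = "ball_outcomes_t20_v1.pkl" if max_overs <= 25 else "ball_outcomes_odi_v1.pkl"
    with open(f"{data_dir}/{name}", "rb") as f:
        return pickle.load(f)  # assumed: dict mapping (phase, wickets_band, innings_type) -> records

def sample_delivery(index, phase, wickets_band, innings_type, rng=random):
    """Progressive fallback: drop innings_type, then wickets_band, then take any record for the phase."""
    for key in [(phase, wickets_band, innings_type),
                (phase, wickets_band, None),
                (phase, None, None)]:
        records = index.get(key)
        if records:
            return rng.choice(records)
    raise KeyError(f"no Cricsheet records found for phase={phase!r}")
```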
2.3 Live LLM Opponent
export HF_TOKEN=hf_...
--opponent-mode llm_live
Calls google/gemma-4-26B-A4B-it via HF Router (or any OpenAI-compatible endpoint). Set HF_TOKEN or pass --opponent-api-key; otherwise use --opponent-mode heuristic for local runs.
Use for: demos, realistic opponent behavior, self-play experiments.
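For reference, a call to an OpenAI-compatible endpoint like the HF Router follows the standard chat-completions shape. The sketch below is not the opponent policy's actual client code (its prompts and response parsing are omitted); it only shows the request format, using the api-base and model name quoted in this document.

```python
import os
import requests

def ask_opponent(prompt: str) -> str:
    """Minimal chat-completions request against the HF Router (any OpenAI-compatible base URL works)."""
    resp = requests.post(
        "https://router.huggingface.co/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
        json={
            "model": "google/gemma-4-26B-A4B-it",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```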
2.4 Cached LLM Opponent
--opponent-mode llm_cached
export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1.jsonl
Replays pre-recorded decisions. Does not call any live model. Use for official/reproducible eval — every compared captain faces identical opponent decisions.
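Replaying a cache is just reading the JSONL back in order. The record contents below are illustrative (the real schema is whatever the opponent policy wrote when the cache was recorded):

```python
import json

def load_cache(path: str) -> list[dict]:
    """Read pre-recorded opponent decisions, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

decisions = load_cache("data/opponent_cache/adaptive_t20_v1.jsonl")
# Every compared captain is served the same decision at the same point in the replay.
for i, decision in enumerate(decisions[:3]):
    print(i, decision)
```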
3. Starting The Server
# Recommended (uvicorn auto-reload)
cd cricket_captain
python -m uvicorn server.app:app --port 8001 --reload
# Or via app.py directly
PYTHONPATH=. python server/app.py --port 8001
Health check: curl http://localhost:8001/health → {"status":"healthy"}
Set the URL for runners:
export CRICKET_CAPTAIN_ENV_URL=http://localhost:8001
On Lightning / remote runtimes, expose the port and pass the external URL:
export CRICKET_CAPTAIN_ENV_URL=ws://<lightning-exposed-host>/ws
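Before launching a batch of runs it is worth scripting the health check above against the same URL the runners will read (this only exercises the documented /health endpoint over the local HTTP URL):

```python
import os
import requests

env_url = os.environ.get("CRICKET_CAPTAIN_ENV_URL", "http://localhost:8001")
resp = requests.get(f"{env_url}/health", timeout=5)
resp.raise_for_status()
assert resp.json().get("status") == "healthy", resp.text
print(f"server at {env_url} is healthy")
```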
4. Step 1: Random Baseline (all 3 opponent modes)
No GPU needed, and the heuristic and cricsheet runs need no API key (the llm_live run needs HF_TOKEN for the opponent). Verify the full loop works.
# Run all 3 modes (in separate terminals if you want them in parallel)
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode heuristic --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode cricsheet --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode llm_live --env-url http://localhost:8001
Verified baselines (2026-04-25):
| Opponent | score | coherence | reward | parse_err |
|---|---|---|---|---|
| heuristic | 20.8 | 0.556 | −0.826 | 0% |
| cricsheet | 28.0 | 0.527 | −0.410 | 0% |
| llm_live | 27.4 | 0.537 | −0.723 | 0% |
5. Step 2: Train-Smoke (verify reward signals, no GPU)
train.py train-smoke runs direct CricketEnvironment rollouts — no server needed.
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode heuristic
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode cricsheet
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode llm_live
Verified train-smoke baselines (2026-04-25): r_validity=1.0 on all 9 matches (3 modes × 3 matches). All reward signals active: coherence, adaptation, opponent_awareness, plan_commitment, staleness, regret.
Quick 2-over smoke for CI-style checks:
python train.py train-smoke --matches 1 --max-overs 2 --max-steps 240 --log-steps 90 --opponent-mode heuristic
6. Step 3: Untrained LLM Baseline (requires HF token)
export HF_TOKEN=hf_...
python inference.py \
--model google/gemma-4-26B-A4B-it \
--api-base https://router.huggingface.co/v1 \
--api-key "$HF_TOKEN" \
--episodes 3 --max-overs 5 \
--opponent-mode llm_live \
--env-url http://localhost:8001
Verified LLM captain run (2026-04-25):
model: google/gemma-4-26B-A4B-it via HF Router
coherence: 0.657 | adaptation: 0.502 | opp_aware: 0.750
parse errors: 0.0% | reward: +0.168
7. Step 4: SFT Tool Warmup
python train.py sft-data --output data/training/tool_sft_examples.jsonl --examples 500
Teaches tool-call shape, tool names, and argument schemas before RL. Not the main objective — just reduces wasted GRPO rollouts on syntax/tool-selection errors.
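Each SFT record only needs to demonstrate the tool-call surface form. A hypothetical single JSONL line might look like the example below; the field names, prompt text, and tool arguments are illustrative, not the schema sft-data actually emits.

```python
import json

example = {
    "messages": [
        {"role": "user", "content": "Over 3.4, 2 wickets down, 31/2 chasing 58. Decide the next ball."},
        {"role": "assistant",
         "content": '<tool_call>{"name": "plan_shot", '
                    '"arguments": {"shot": "rotate_strike", "risk": "low"}}</tool_call>'},
    ]
}
print(json.dumps(example))  # one line of data/training/tool_sft_examples.jsonl
```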
8. Step 5: GRPO Warmup (2–3 over curriculum)
Single-stage training (full composite reward from step 0) because Qwen3-4B-Instruct-2507
emits <tool_call>...</tool_call> natively. Warmup is a fast curriculum on short formats
to bootstrap the LoRA adapter before the longer main run:
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
Config: overs_distribution=[2,2,2,2,2,2,3,3,3], steps=30, num_generations=4, batch_size=1,
grad_accum=4, max_completion_length=16384, save_steps=5. Approx 50–60 min on H200.
Goal: bootstrap the LoRA adapter on short matches that complete inside the token budget.
Watch in WandB: reward/composite_mean, reward/r_result_mean, tools/freq_*,
rollout/match_completion_rate, completions/clipped_ratio.
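The overs_distribution list reads as a sampling pool for match length during the curriculum, weighting 2-over matches over 3-over ones roughly 2:1; how the trainer actually draws from it is not shown in this document, so treat the snippet below as the implied behaviour rather than the confirmed one.

```python
import random
from collections import Counter

# From configs/cricket_train_qwen3_warmup.yaml
overs_distribution = [2, 2, 2, 2, 2, 2, 3, 3, 3]

# Uniform draws from the pool give ~2/3 two-over and ~1/3 three-over matches.
draws = [random.choice(overs_distribution) for _ in range(30)]
print(Counter(draws))
```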
9. Step 6: GRPO Main Run (5-over end-to-end)
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
Config: max_overs=5, steps=100, num_generations=4, batch_size=1, grad_accum=4,
max_completion_length=24576, save_steps=20, beta=0.0 (no reference model).
Approx 3–4 min/step → ~5–7 hrs total on H200.
The config sets resume_from: ./checkpoints/stage2_final so the main run picks up the
warmup adapter automatically. Comment out that line (or pass --resume-from "") for a
fresh-adapter main run. The base model still loads from Qwen/Qwen3-4B-Instruct-2507;
only the LoRA weights resume.
Goal: improve coherence, adaptation, opponent awareness, match outcomes on full T20s.
Watch in WandB: reward/r_result_mean (sparse outcome), reward/r_coherence_mean,
reward/r_adaptation_mean, episode/tool_calls_mean (should approach 720), episode/agent_score_mean.
Switch the opponent in YAML (opponent.mode: llm_live) and set HF_TOKEN for adversarial
training against live Gemma. Use cricsheet or heuristic for cheaper ablations.
train.py train uses TRL GRPOTrainer(environment_factory=CricketCaptainToolEnv), so the model interacts with a live environment over multiple tool-calling turns. This is not inference through the HF Router; the trained captain model is loaded locally by Transformers/TRL, with LoRA adapters when using quantized weights.
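For completeness, loading the trained captain locally (base model plus LoRA adapter) follows the usual Transformers/PEFT pattern. The adapter directory below is an assumption; point it at whichever checkpoint your run wrote.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-4B-Instruct-2507"
adapter_dir = "./checkpoints/stage2_final"  # assumed path; use your own run's checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_dir)  # attach the LoRA weights on top of the base model
model.eval()
```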
10. Step 7: Evaluation
python eval.py \
--episodes 20 \
--env-url http://localhost:8001 \
--eval-pack-id adaptive_t20_v1 \
--opponent-mode llm_cached
Compare: random baseline → untrained Qwen3-4B-Instruct-2507 → trained Qwen3-4B-Instruct-2507 (warmup + main adapter via compare_eval.py).
11. Format Comparison
| max_overs | Format selected | Data used by cricsheet | Typical target |
|---|---|---|---|
| 5 | T5 | T20 pkl (closest) | ~47 runs |
| 7 | T5 | T20 pkl | ~66 runs |
| 20 | T20 | T20 pkl | ~160 runs |
| 50 | ODI | ODI pkl | ~290 runs |
All formats work with all opponent modes. Use --max-overs N with any runner.
12. Tool budget and training
Implemented in server/cricket_environment.py (TOOL_BUDGET_PER_OVER=3, TOOL_FINE_PER_EXCESS=0.04).
Overhead tools (only these increment the per-over counter): set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, analyze_situation.
Not overhead: plan_shot, set_match_plan, update_match_plan, select_batter, choose_bowler, set_field_setting, play_delivery, bowl_delivery, call_toss, and other tools that advance the match state or directly execute the delivery.
Rule: the first 3 overhead calls in each over are not fined; each further overhead call in that over adds −0.04 to the step reward. The prompt shows "Tool budget: N/3 overhead calls used this over".
Why this matters for GRPO: training uses the same environment as inference. Fines are part of the reward the trainer optimizes, so the policy learns to use reflection and analyze_situation when they matter, and to lean on plan_shot plus match-level planning (set_match_plan / update_match_plan) for routine structure without spending the 3 free overhead calls every ball.
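The fine itself is simple bookkeeping. Below is a sketch of the per-over accounting using the constants quoted above; the function and variable names are illustrative, not the actual cricket_environment.py code.

```python
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04
OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
                  "reflect_after_ball", "analyze_situation"}

def overhead_fine(tool_name: str, overhead_calls_this_over: int) -> tuple[float, int]:
    """Return (reward penalty, updated counter) for one tool call within the current over."""
    if tool_name not in OVERHEAD_TOOLS:
        return 0.0, overhead_calls_this_over             # plan_shot, play_delivery, etc. are never fined
    overhead_calls_this_over += 1
    if overhead_calls_this_over <= TOOL_BUDGET_PER_OVER:
        return 0.0, overhead_calls_this_over             # the first 3 overhead calls per over are free
    return -TOOL_FINE_PER_EXCESS, overhead_calls_this_over  # each extra overhead call costs 0.04
```

The per-over counter resets when a new over starts; that reset is handled by the environment and omitted here.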
13. Configs
# Start server with the legacy default config (llm_live opponent, 5-over default)
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/default.yaml
# Start with reproducible cached eval
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/cached_eval.yaml
Config controls: env.agent_team, env.max_overs, env.eval_pack_id, train.model, train.max_completion_length, train.max_tool_calling_iterations, opponent.mode, opponent.model, opponent.api_base, captain.model, and captain.api_base.
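The dotted names above presumably map onto nested YAML sections. A hypothetical minimal config combining a few of them (values are placeholders, not a shipped config file):

```python
import yaml  # pip install pyyaml

minimal_config = yaml.safe_load("""
env:
  agent_team: placeholder_team
  max_overs: 5
  eval_pack_id: adaptive_t20_v1
train:
  model: Qwen/Qwen3-4B-Instruct-2507
  max_completion_length: 24576
opponent:
  mode: heuristic
""")
print(minimal_config["opponent"]["mode"])  # -> heuristic
```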
14. Latest Verified Run Results (2026-04-25)
All runs in illustrations/. Zero parse errors across all 14 inference runs. r_validity=1.0 across all train-smoke matches.
Random agent — 5-over, heuristic: score=20.8 coherence=0.556 reward=−0.826
Random agent — 5-over, cricsheet: score=28.0 coherence=0.527 reward=−0.410
Random agent — 5-over, llm_live: score=27.4 coherence=0.537 reward=−0.723
Random agent — 20-over, cricsheet: score=63.6 coherence=0.568 reward=−5.632
Random agent — 20-over, heuristic: score=82.4 coherence=0.545 reward=−8.174
Train-smoke — 5-over, heuristic: r_validity=1.0 coherence=0.596 3W/0L
Train-smoke — 5-over, cricsheet: r_validity=1.0 coherence=0.620 2W/1L
Train-smoke — 5-over, llm_live: r_validity=1.0 coherence=0.552 2W/1L
LLM captain (gemma-4-26B) — 5-over (3 episodes), llm_live:
coherence=0.657 adaptation=0.502 opp_aware=0.750 parse_err=0%
15. Immediate Next Steps
- Run GRPO training via bash scripts/run_warmup_then_main.sh (warmup curriculum + main run + auto-resume) to produce reward curves.
- Deploy the HF Space for the live Gradio demo: Dockerfile present, just needs HF push.
- Generate an opponent cache using llm_live for reproducible official eval.
- Produce training plots: coherence heatmap, reward curve, tool-usage timeline.