cricket-captain-llm/docs/experiment_workflow.md

Experiment Workflow: Baselines, Opponents, Short Runs, and Training

This document explains how to run CricketCaptain experiments in a practical order: smoke checks, 5-over baselines, training, then longer evaluation.

1. Why Start With 5-Over Smoke + Baselines?

A full T20 innings is 20 overs. That is useful for final evaluation but slow for debugging.

For early experiments, 5-over runs are better because they quickly answer:

  • Is the OpenEnv server working?
  • Is the client connecting correctly?
  • Is the model producing valid JSON?
  • Are tool calls accepted?
  • Are rewards non-zero and sensible?
  • Does the opponent policy work?
  • Do metrics like coherence, adaptation, and opponent awareness update?

The workflow should be:

5-over smoke → 5-over untrained baseline → short training → 5-over trained eval → 20-over final eval

Do not start with full 20-over training unless the 5-over smoke loop is stable.

2. Opponent Modes

Four modes are implemented in server/opponent_policy.py. Select one via --opponent-mode, the CRICKET_OPPONENT_MODE env var, or the opponent.mode field in any training/eval YAML.

Default in the current Qwen3 training configs is heuristic (configs/cricket_train_qwen3*.yaml) for fast iteration. Switch to llm_live for adversarial evaluation when an HF token is set. The legacy configs/extras/default.yaml defaulted to llm_live.

2.1 Heuristic Opponent

--opponent-mode heuristic

Format-aware local policy. Uses T5/T20/ODI rules from server/format_mapper.py:

  • Phase-weighted shot distributions (powerplay/middle/death per format)
  • Wicket-pressure shift (heavier weight toward defensive shots when 7+ down)
  • Batter/bowler roles selected from data/format_rules.json

Use for: fast local tests, cheap training rollouts, deterministic-ish baselines. No API key needed.
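The mechanics described above can be sketched roughly as follows. This is an illustration only, not the code in server/opponent_policy.py: the shot names, weights, and the size of the pressure shift are all made up; the real tables live in server/format_mapper.py and data/format_rules.json.

```python
import random

# Hypothetical phase-weighted shot distributions (illustrative values only).
SHOT_WEIGHTS = {
    "powerplay": {"drive": 0.4, "cut": 0.3, "defend": 0.3},
    "middle":    {"drive": 0.3, "cut": 0.2, "defend": 0.5},
    "death":     {"slog": 0.5, "drive": 0.3, "defend": 0.2},
}

def pick_shot(phase: str, wickets_down: int, rng: random.Random) -> str:
    """Sample a shot from the phase distribution, shifted defensive under pressure."""
    weights = dict(SHOT_WEIGHTS[phase])
    # Wicket-pressure shift: lean heavily defensive once 7+ wickets are down.
    if wickets_down >= 7:
        weights["defend"] = weights.get("defend", 0.0) + 0.5
    shots, probs = zip(*weights.items())
    return rng.choices(shots, weights=probs, k=1)[0]
```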

2.2 Cricsheet Opponent

--opponent-mode cricsheet

Samples real Cricsheet ball-by-ball deliveries, indexed by (phase, wickets_band, innings_type).

Auto-selects data by format:

  • max_overs ≤ 25 → ball_outcomes_t20_v1.pkl (1.17M T20 deliveries)
  • max_overs > 25 → ball_outcomes_odi_v1.pkl (1.65M ODI deliveries)

Progressive fallback: drop innings_type → drop wickets_band → any phase record. No API key needed. Data files must be present under data/processed/.
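The file-selection threshold and the fallback order come straight from the rules above; the shape of the index (a dict keyed by tuples, with None for a dropped key component) is an assumption for the sketch, not the real loader in the repo.

```python
def select_outcomes_file(max_overs: int) -> str:
    """Pick the Cricsheet sample by format: ≤ 25 overs uses the T20 pkl, else ODI."""
    return ("data/processed/ball_outcomes_t20_v1.pkl" if max_overs <= 25
            else "data/processed/ball_outcomes_odi_v1.pkl")

def lookup_outcomes(index: dict, phase: str, wickets_band: str, innings_type: str):
    """Progressive fallback: full key, then drop innings_type, then drop
    wickets_band (leaving any record for the phase)."""
    for key in (
        (phase, wickets_band, innings_type),
        (phase, wickets_band, None),
        (phase, None, None),
    ):
        if key in index:
            return index[key]
    return None
```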

2.3 Live LLM Opponent

export HF_TOKEN=hf_...
--opponent-mode llm_live

Calls google/gemma-4-26B-A4B-it via HF Router (or any OpenAI-compatible endpoint). Set HF_TOKEN or pass --opponent-api-key; otherwise use --opponent-mode heuristic for local runs.

Use for: demos, realistic opponent behavior, self-play experiments.

2.4 Cached LLM Opponent

--opponent-mode llm_cached
export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1.jsonl

Replays pre-recorded decisions. Does not call any live model. Use for official/reproducible eval — every compared captain faces identical opponent decisions.

3. Starting The Server

# Recommended (uvicorn; add --reload for auto-reload while developing)
cd cricket_captain
python -m uvicorn server.app:app --port 8001

# Or via app.py directly
PYTHONPATH=. python server/app.py --port 8001

Health check: curl http://localhost:8001/health → {"status":"healthy"}

Set the URL for runners:

export CRICKET_CAPTAIN_ENV_URL=http://localhost:8001

On Lightning / remote runtimes, expose the port and pass the external URL:

export CRICKET_CAPTAIN_ENV_URL=ws://<lightning-exposed-host>/ws
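Before kicking off a batch of runs it can help to block until the server is actually up. The /health endpoint and its {"status":"healthy"} body are from the docs above; the polling helper itself is just a convenience sketch, not part of the repo.

```python
import json
import time
import urllib.request

def is_healthy(body: str) -> bool:
    """True if the /health response body reports {"status": "healthy"}."""
    try:
        return json.loads(body).get("status") == "healthy"
    except (ValueError, AttributeError):
        return False

def wait_for_server(url: str, timeout_s: float = 30.0) -> bool:
    """Poll <url>/health until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=2) as r:
                if is_healthy(r.read().decode()):
                    return True
        except OSError:
            pass
        time.sleep(1.0)
    return False
```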

4. Step 1: Random Baseline (heuristic, cricsheet, llm_live)

No GPU needed, and heuristic/cricsheet need no API key (the llm_live run needs HF_TOKEN for the opponent). Verify the full loop works.

# Run all 3 modes (sequentially as written; use separate shells to parallelize)
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode heuristic  --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode cricsheet  --env-url http://localhost:8001
python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode llm_live   --env-url http://localhost:8001

Verified baselines (2026-04-25):

| Opponent  | score | coherence | reward | parse_err |
|-----------|-------|-----------|--------|-----------|
| heuristic | 20.8  | 0.556     | −0.826 | 0%        |
| cricsheet | 28.0  | 0.527     | −0.410 | 0%        |
| llm_live  | 27.4  | 0.537     | −0.723 | 0%        |

5. Step 2: Train-Smoke (verify reward signals, no GPU)

train.py train-smoke runs direct CricketEnvironment rollouts — no server needed.

python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode heuristic
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode cricsheet
python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode llm_live

Verified train-smoke baselines (2026-04-25): r_validity=1.0 on all 9 matches (3 modes × 3 matches). All reward signals active: coherence, adaptation, opponent_awareness, plan_commitment, staleness, regret.
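The signal names above come from the train-smoke logs; how they are combined into the composite reward is not spelled out here, so the weights and the validity-gating below are assumptions made purely to illustrate the idea of a multi-term shaped reward.

```python
# Illustrative composite reward. Signal names match the train-smoke logs,
# but these weights are invented for the sketch.
REWARD_WEIGHTS = {
    "coherence": 0.3,
    "adaptation": 0.2,
    "opponent_awareness": 0.2,
    "plan_commitment": 0.1,
    "staleness": -0.1,   # penalty-style signals get negative weight
    "regret": -0.1,
}

def composite_reward(signals: dict[str, float], validity: float = 1.0) -> float:
    """Weighted sum of shaped signals, gated on validity so an
    unparseable action (r_validity=0) earns no shaping reward."""
    shaped = sum(REWARD_WEIGHTS.get(k, 0.0) * v for k, v in signals.items())
    return validity * shaped
```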

Quick 2-over smoke for CI-style checks:

python train.py train-smoke --matches 1 --max-overs 2 --max-steps 240 --log-steps 90 --opponent-mode heuristic

6. Step 3: Untrained LLM Baseline (requires HF token)

export HF_TOKEN=hf_...
python inference.py \
  --model google/gemma-4-26B-A4B-it \
  --api-base https://router.huggingface.co/v1 \
  --api-key "$HF_TOKEN" \
  --episodes 3 --max-overs 5 \
  --opponent-mode llm_live \
  --env-url http://localhost:8001

Verified LLM captain run (2026-04-25):

model: google/gemma-4-26B-A4B-it via HF Router
coherence: 0.657 | adaptation: 0.502 | opp_aware: 0.750
parse errors: 0.0% | reward: +0.168

7. Step 4: SFT Tool Warmup

python train.py sft-data --output data/training/tool_sft_examples.jsonl --examples 500

Teaches tool-call shape, tool names, and argument schemas before RL. Not the main objective — just reduces wasted GRPO rollouts on syntax/tool-selection errors.
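For intuition, one SFT example presumably pairs a match situation with the expected <tool_call> completion. The message schema, field names, and tool arguments below are hypothetical; inspect data/training/tool_sft_examples.jsonl for the real layout.

```python
import json

def make_sft_example(situation: str, tool: str, args: dict) -> str:
    """Build one hypothetical JSONL line: a prompt plus the expected
    <tool_call>...</tool_call> completion (field names are made up)."""
    payload = json.dumps({"name": tool, "arguments": args})
    completion = f"<tool_call>{payload}</tool_call>"
    return json.dumps({"prompt": situation, "completion": completion})

line = make_sft_example(
    "Over 4.2, 7 wickets down, chasing 48.",
    "plan_shot",
    {"aggression": "defensive"},
)
```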

8. Step 5: GRPO Warmup (2–3 over curriculum)

Single-stage training (full composite reward from step 0) because Qwen3-4B-Instruct-2507 emits <tool_call>...</tool_call> natively. Warmup is a fast curriculum on short formats to bootstrap the LoRA adapter before the longer main run:

PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3_warmup.yaml

Config: overs_distribution=[2,2,2,2,2,2,3,3,3], steps=30, num_generations=4, batch_size=1, grad_accum=4, max_completion_length=16384, save_steps=5. Approx 50–60 min on H200.
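The overs_distribution list presumably drives the curriculum by having each rollout draw its match length from the configured values, so here roughly two thirds of warmup matches are 2 overs and one third are 3. That sampling mechanic is an assumption; only the list itself comes from the config.

```python
import random

# overs_distribution from configs/cricket_train_qwen3_warmup.yaml:
# mostly 2-over matches with some 3-over matches mixed in.
OVERS_DISTRIBUTION = [2, 2, 2, 2, 2, 2, 3, 3, 3]

def sample_match_overs(rng: random.Random) -> int:
    """Presumed curriculum mechanic: draw each rollout's match length
    uniformly from the configured list (P(2)=2/3, P(3)=1/3 here)."""
    return rng.choice(OVERS_DISTRIBUTION)
```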

Goal: bootstrap the LoRA adapter on short matches that complete inside the token budget. Watch in WandB: reward/composite_mean, reward/r_result_mean, tools/freq_*, rollout/match_completion_rate, completions/clipped_ratio.

9. Step 6: GRPO Main Run (5-over end-to-end)

PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3.yaml

Config: max_overs=5, steps=100, num_generations=4, batch_size=1, grad_accum=4, max_completion_length=24576, save_steps=20, beta=0.0 (no reference model). Approx 3–4 min/step → ~5–7 hrs total on H200.

The config sets resume_from: ./checkpoints/stage2_final so the main run picks up the warmup adapter automatically. Comment out that line (or pass --resume-from "") for a fresh-adapter main run. The base model still loads from Qwen/Qwen3-4B-Instruct-2507; only the LoRA weights resume.

Goal: improve coherence, adaptation, opponent awareness, match outcomes on full T20s. Watch in WandB: reward/r_result_mean (sparse outcome), reward/r_coherence_mean, reward/r_adaptation_mean, episode/tool_calls_mean (should approach 720), episode/agent_score_mean.

Switch the opponent in YAML (opponent.mode: llm_live) and set HF_TOKEN for adversarial training against live Gemma. Use cricsheet or heuristic for cheaper ablations.

train.py train uses TRL GRPOTrainer(environment_factory=CricketCaptainToolEnv), so the model interacts with a live environment over multiple tool-calling turns. This is not inference through the HF Router; the trained captain model is loaded locally by Transformers/TRL, with LoRA adapters when using quantized weights.

10. Step 7: Evaluation

python eval.py \
  --episodes 20 \
  --env-url http://localhost:8001 \
  --eval-pack-id adaptive_t20_v1 \
  --opponent-mode llm_cached

Compare: random baseline → untrained Qwen3-4B-Instruct-2507 → trained Qwen3-4B-Instruct-2507 (warmup + main adapter via compare_eval.py).

11. Format Comparison

| max_overs | Format selected | Data used by cricsheet | Typical target |
|-----------|-----------------|------------------------|----------------|
| 5         | T5              | T20 pkl (closest)      | ~47 runs       |
| 7         | T5              | T20 pkl                | ~66 runs       |
| 20        | T20             | T20 pkl                | ~160 runs      |
| 50        | ODI             | ODI pkl                | ~290 runs      |

All formats work with all opponent modes. Use --max-overs N with any runner.

12. Tool budget and training

Implemented in server/cricket_environment.py (TOOL_BUDGET_PER_OVER=3, TOOL_FINE_PER_EXCESS=0.04).

Overhead tools (only these increment the per-over counter): set_strategy, set_bowling_strategy, plan_delivery, reflect_after_ball, analyze_situation.

Not overhead: plan_shot, set_match_plan, update_match_plan, select_batter, choose_bowler, set_field_setting, play_delivery, bowl_delivery, call_toss, and other tools that advance or directly execute the ball.

Rule: the first 3 overhead calls in each over are not fined; each further overhead call in that over adds −0.04 to the step reward. The prompt includes "Tool budget: N/3 overhead calls used this over."
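The budget arithmetic above (constants from server/cricket_environment.py) can be sketched per over like this; the function shape is illustrative, not the repo's actual code.

```python
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04

# Only these tools increment the per-over overhead counter.
OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}

def overhead_fine(calls_this_over: list[str]) -> float:
    """Fine for one over: the first 3 overhead calls are free, each
    further overhead call costs 0.04. Non-overhead tools never count."""
    overhead = sum(1 for name in calls_this_over if name in OVERHEAD_TOOLS)
    excess = max(0, overhead - TOOL_BUDGET_PER_OVER)
    return -TOOL_FINE_PER_EXCESS * excess
```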

Why this matters for GRPO: training uses the same environment as inference. Fines are part of the reward the trainer optimizes, so the policy learns to use reflection and analyze_situation when they matter, and to lean on plan_shot plus match-level planning (set_match_plan / update_match_plan) for routine structure without spending the 3 free overhead calls every ball.

13. Configs

# Start server with the legacy default config (llm_live opponent, 5-over default)
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/default.yaml

# Start with reproducible cached eval
PYTHONPATH=. python server/app.py --port 8001 --config configs/extras/cached_eval.yaml

Config controls: env.agent_team, env.max_overs, env.eval_pack_id, train.model, train.max_completion_length, train.max_tool_calling_iterations, opponent.mode, opponent.model, opponent.api_base, captain.model, and captain.api_base.

14. Latest Verified Run Results (2026-04-25)

All runs in illustrations/. Zero parse errors across all 14 inference runs. r_validity=1.0 across all train-smoke matches.

Random agent — 5-over, heuristic:   score=20.8  coherence=0.556  reward=−0.826
Random agent — 5-over, cricsheet:   score=28.0  coherence=0.527  reward=−0.410
Random agent — 5-over, llm_live:    score=27.4  coherence=0.537  reward=−0.723
Random agent — 20-over, cricsheet:  score=63.6  coherence=0.568  reward=−5.632
Random agent — 20-over, heuristic:  score=82.4  coherence=0.545  reward=−8.174

Train-smoke — 5-over, heuristic:    r_validity=1.0  coherence=0.596  3W/0L
Train-smoke — 5-over, cricsheet:    r_validity=1.0  coherence=0.620  2W/1L
Train-smoke — 5-over, llm_live:     r_validity=1.0  coherence=0.552  2W/1L

LLM captain (gemma-4-26B) — 5-over, 3 episodes, llm_live:
  coherence=0.657  adaptation=0.502  opp_aware=0.750  parse_err=0%

15. Immediate Next Steps

  1. Run GRPO training via bash scripts/run_warmup_then_main.sh (warmup curriculum + main run + auto-resume) to produce reward curves.
  2. Deploy HF Space for live Gradio demo — Dockerfile present, just needs HF push.
  3. Generate opponent cache using llm_live for reproducible official eval.
  4. Produce training plots — coherence heatmap, reward curve, tool-usage timeline.