pratinavseth committed on
Commit 2fc50a9 · verified · 1 Parent(s): 9431040

sync: today's source updates (XML-only prompt, reward unclip, neg-reward on loss, pinned versions, configs reorg)

.gitignore CHANGED
@@ -14,9 +14,12 @@ checkpoints/
14
  training_curves.png
15
  training_summary.json
16
  wandb/
 
17
  .env
18
  unsloth_compiled_cache/
19
  .ipynb_checkpoints/
20
  .DS_Store
21
  *.log
22
  *.zip
 
 
 
14
  training_curves.png
15
  training_summary.json
16
  wandb/
17
+ logs/
18
  .env
19
  unsloth_compiled_cache/
20
  .ipynb_checkpoints/
21
  .DS_Store
22
  *.log
23
  *.zip
24
+ checkpoints_smoke/
25
+ .venv-qwen3/
PRD.md CHANGED
@@ -128,6 +128,8 @@ Each step returns a `CricketObservation` containing:
128
 
129
  The top-level objective remains long-horizon match success over many simulated matches. Dream11-style reward is auxiliary shaping, not the primary benchmark target.
130
 
 
 
131
  ### 5.4 Curriculum Stages
132
 
133
  | Stage | Episodes | Active Rubrics | Objective |
 
128
 
129
  The top-level objective remains long-horizon match success over many simulated matches. Dream11-style reward is auxiliary shaping, not the primary benchmark target.
130
 
131
+ **Tool budget (operational constraint during play and training):** per over, the environment allows **3 no-fine “overhead” tool calls** among `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation`. Each additional overhead call in that over applies a **−0.04** step reward. `plan_shot`, `set_match_plan`, `update_match_plan`, and ball-advancing tools do **not** count against this limit. Training via `train.py` (TRL GRPO with `CricketEnvironment`) uses the same rule, so the policy learns to ration analysis and re-planning across a full innings without a separate ad-hoc budget in the trainer.
132
+
133
  ### 5.4 Curriculum Stages
134
 
135
  | Stage | Episodes | Active Rubrics | Objective |
README.md CHANGED
@@ -12,12 +12,22 @@ license: mit
12
 
13
  # CricketCaptain-LLM
14
 
15
- **An RL benchmark for adaptive, opponent-aware strategic decision-making.**
16
 
17
- CricketCaptain tests whether language-model agents can plan before a ball, act, observe the result, model the opponent, and revise tactics under changing match pressure. Cricket is the domain: overs, wickets, target pressure, player roles, field settings, and hundreds of tactical decisions per match.
 
 
 
 
 
 
 
 
18
 
19
  The Hugging Face Space exposes the OpenEnv server and a Gradio demo UI at `/web`.
20
 
 
 
21
  ---
22
 
23
  ## The Problem
@@ -36,6 +46,41 @@ CricketCaptain evaluates:
36
 
37
  ---
38
 
39
  ## Read This First
40
 
41
  - [`docs/benchmark_explainer.md`](docs/benchmark_explainer.md): full explanation of the problem statement, OpenEnv architecture, environment loop, rewards, data curation, training, and competition compliance.
@@ -58,18 +103,19 @@ Call the coin and decide whether to bat or bowl first.
58
  ```
59
 
60
  ### 2. Batting (when agent bats)
61
- Choose the batter/strategy, plan the shot, then play each delivery.
62
  ```json
63
- {"tool": "select_batter", "arguments": {"name": "Anchor", "style": "anchor", "aggression": 0.35, "rationale": "Middle overs need wicket preservation and strike rotation."}}
 
64
  {"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Middle overs against spin — rotate strike and preserve wickets."}}
65
  {"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy gap."}}
66
  {"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Working into the gap at mid-wicket."}}
67
  ```
68
 
69
  ### 3. Bowling & Fielding (when agent bowls)
70
- Choose the bowler, set a delivery/field plan, then bowl each delivery against an opponent LLM/heuristic batter.
71
  ```json
72
- {"tool": "choose_bowler", "arguments": {"name": "Death Specialist", "bowler_type": "pace", "style": "yorker", "rationale": "Target the stumps at the death."}}
73
  {"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "delivery_type": "yorker", "line": "stumps", "length": "full"}}
74
  {"tool": "set_field_setting", "arguments": {"setting": "Aggressive"}}
75
  {"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "delivery_type": "yorker", "line": "stumps", "length": "full", "rationale": "Limit swing room against an aggressive finisher."}}
@@ -108,6 +154,8 @@ Query match intel at a small reward cost.
108
  | `bowl_delivery` | Bowling | Bowl the next delivery |
109
  | `reflect_after_ball` | Bat/Bowl | Record post-ball tactical adjustment |
110
  | `analyze_situation` | Any | Query pitch, bowler, field, or match situation |
 
 
111
 
112
  `plan_shot.target_area` is normalized into cricket zones such as `cover`, `point`, `straight`, `midwicket`, `square_leg`, `fine_leg`, `long_on`, and `long_off`. `plan_shot.trajectory` can be `ground`, `lofted`, or `aerial`. Delivery plans normalize line (`outside_off`, `stumps`, `pads`, `wide`), length (`yorker`, `full`, `good`, `short`, `bouncer`), and variation (`stock`, `swing`, `seam`, `slower`, `yorker`, `bouncer`, `off_spin`, `leg_spin`, `googly`).
113
 
@@ -115,22 +163,32 @@ Query match intel at a small reward cost.
115
 
116
  ## Reward Architecture
117
 
 
 
118
  | Rubric | Weight | When | What |
119
  |--------|--------|------|------|
120
- | `r_result` | 55% | Episode end | Match outcome: win/loss, target margin, DLS/par |
121
- | `r_cricket` | 25% | Innings end | Dense Cricket contribution proxy (Dream11-style: runs, wickets, dots, milestones) |
122
- | `r_behavior` | 15% | Every delivery | Plan-action coherence + adaptation + opponent awareness + counterfactual regret |
123
- | `r_validity` | 5% | Every turn | Valid JSON tool call structure (gate/penalty) |
 
 
124
 
125
  `r_tools` is computed and logged for analysis but excluded from the composite — tool discipline is measured through outcome and behavior instead.
126
 
127
- The primary objective is to **win or defend the match over a full long-horizon episode**. `r_cricket` provides dense per-ball feedback so training gets a gradient before the final win/loss result.
128
 
129
- The `r_behavior` bundle (15%) covers: plan-action coherence (50%), strategic adaptation (20%), opponent awareness (20%), counterfactual regret (10%).
 
 
 
 
130
 
131
- **Two-stage curriculum (ToolRL):**
132
- - Stage 1 (episodes 0–100): `r_format` only trains valid JSON
133
- - Stage 2 (episodes 100+): all rubrics trains full-match strategy
 
 
134
 
135
  **Innings-specific scoring:**
136
  - **1st Innings (batting):** Score vs DLS par baseline
@@ -189,7 +247,7 @@ cricket_captain/
189
  ├── inference.py # Random + LLM agent evaluation
190
  ├── client.py # OpenEnv WebSocket client (CricketCaptainEnv)
191
  ├── models.py # GameState, CricketAction, CricketObservation, CricketState
192
- ├── train.py # MT-GRPO training (stateless reward function)
193
  ├── eval.py # Coherence heatmaps, reward curves
194
  ├── scripts/
195
  │ └── curate_transitions.py # Cricsheet → Markov transition table pipeline
@@ -234,9 +292,19 @@ Both the heuristic opponent and the environment's `select_batter` / `choose_bowl
234
 
235
  `server/player_roster.py` loads team profiles from `data/player_profiles/` (10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan). When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs fuzzy lookup (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths.
236
 
237
- ### Tool Budget
 
 
 
 
238
 
239
- The environment enforces a **3-call overhead budget per over**. Overhead tools are `analyze_situation`, `reflect_after_ball`, `plan_delivery`, `set_strategy`, and `set_bowling_strategy`. Each call beyond 3 incurs a **−0.04 reward fine**. `plan_shot` is explicitly excluded from the budget — shot planning is always free. This discourages LLMs from padding with low-information tool calls.
 
 
 
 
 
 
240
 
241
  ### Markov Engine
242
 
@@ -250,81 +318,201 @@ Ball outcomes are sampled from a 5-dimensional transition table keyed by `(over,
250
 
251
  Bowler rotation mirrors real cricket: pace-heavy powerplay (90/10), spin-heavy middle overs (45/55), pace-heavy death (80/20). Each bowler has a 10-over cap before rotation is forced.
252
 
253
- ### Stateless GRPO Reward
254
 
255
- The `reward_fn` passed to `GRPOTrainer` computes rewards purely from `(prompt, completion)` pairs no shared environment state. Strategy and phase are extracted from the rendered prompt text:
256
 
257
- ```python
258
- # Prompt always contains:
259
- # "[CricketCaptain] MIDDLE | FIRST INNINGS"
260
- # "Over 18.2 | Score: 145/4 | ..."
261
- # "Current Strategy: consolidate (aggression=0.30) — Rotate strike against spin..."
262
 
263
- def r_coherence_stateless(prompt: str, completion: str) -> float:
264
- strategy = extract_strategy_from_prompt(prompt)
265
- phase = extract_phase_from_prompt(prompt)
266
- shot = json.loads(completion)["arguments"]["shot_intent"]
267
- return coherence_score(strategy, shot, phase)
268
- ```
269
 
270
  ---
271
 
272
  ## Quickstart
273
 
274
- ### YAML config (recommended)
275
 
276
- Use YAML configs to control **both**:
277
- - **server defaults** (opponent mode/model/cache, eval pack id), and
278
- - **runner defaults** (`inference.py` / `eval.py`: env URL, max overs, captain model/API).
 
 
279
 
280
- Use `configs/default.yaml` when you want both the captain and live opponent to call HF router models. Use `configs/cached_eval.yaml` when you want a live captain model against replayed opponent decisions for reproducible comparison.
281
 
282
  ```bash
283
- # Start server with live HF opponent config
284
- cd cricket_captain
285
- PYTHONPATH=. python server/app.py --port 8001 --config configs/default.yaml
 
 
 
 
 
 
 
 
 
 
286
 
287
- # Run a short HF Gemma captain baseline using config defaults
288
- export CRICKET_CAPTAIN_ENV_URL="ws://localhost:8001"
289
- export HF_TOKEN="hf_..."
290
  python inference.py --config configs/default.yaml --episodes 1
291
  ```
292
 
293
- The default config uses the HF router-compatible model `google/gemma-4-26B-A4B-it` for captain-side inference and live opponent defaults. In `llm_live` mode, the opponent actually calls that model during the run. In `llm_cached` mode, the opponent does **not** call `model`; it replays `cache_path`.
294
 
295
- For fair/reproducible eval:
296
 
 
 
 
297
  ```bash
298
- PYTHONPATH=. python server/app.py --port 8001 --config configs/cached_eval.yaml
299
- export CRICKET_CAPTAIN_ENV_URL="ws://localhost:8001"
300
- export HF_TOKEN="hf_..."
301
- python inference.py --config configs/cached_eval.yaml --episodes 1
302
  ```
303
 
304
  ```bash
305
- # Install
306
- pip install openenv-core>=0.2.2 trl unsloth gradio fastapi uvicorn
 
307
 
308
- # Start environment server
309
- cd cricket_captain
310
- PYTHONPATH=. python server/app.py
311
 
312
- # Set this to your reachable WebSocket endpoint.
313
- # On Lightning, use the public/internal Lightning URL, not localhost from a remote runner.
314
- export CRICKET_CAPTAIN_ENV_URL="ws://<your-lightning-host>/ws"
315
 
316
- # Run random baseline agent (5 episodes)
317
- python inference.py --model random --episodes 5 --verbose --env-url "$CRICKET_CAPTAIN_ENV_URL"
 
318
 
319
- # Play interactively (Gradio UI)
320
- PYTHONPATH=. python server/ui.py # → http://localhost:7860
321
 
322
- # Train (requires GPU + train extras)
323
- python train.py sft-data --output data/training/tool_sft_examples.jsonl
324
- python train.py train --stage 1 --steps 200 --model Qwen/Qwen2.5-7B-Instruct
325
- python train.py train --stage 2 --steps 600 --model ./checkpoints/stage1_final
326
  ```
327
 
328
  ### Lightning / Remote Runtime Notes
329
 
330
  `localhost:8000` only works when the agent process and server process are in the same network namespace. On Lightning, expose the server port and pass the resulting WebSocket URL via:
@@ -351,10 +539,9 @@ For fast iteration, start with short 5-over runs before full 20-over evaluation:
351
 
352
  1. Random baseline with heuristic opponent.
353
  2. Base/untrained LLM baseline.
354
- 3. Optional SFT tool-format warmup.
355
- 4. GRPO Stage 1 for format/tool correctness.
356
- 5. GRPO Stage 2 for adaptive strategy.
357
- 6. Eval with `adaptive_t20_v1` and cached opponent decisions.
358
 
359
  See [`docs/experiment_workflow.md`](docs/experiment_workflow.md) for exact commands and rationale.
360
 
@@ -494,6 +681,30 @@ Still needed for final submission: real trained-vs-baseline plots, HF Space URL,
494
 
495
  ---
496
 
497
  ## Citation
498
 
499
  ```
 
12
 
13
  # CricketCaptain-LLM
14
 
15
+ **An OpenEnv benchmark for long-horizon multi-turn agentic RL: train an LLM to captain a full cricket match.**
16
 
17
+ 🔗 **Live Hugging Face Space:** https://huggingface.co/spaces/pratinavseth/cricket-captain-llm
18
+ 📊 **Training runs (W&B):** https://wandb.ai/ptnv-s-research/huggingface
19
+ 📝 **Mini-blog / video:** _coming soon — placeholder, will link before submission deadline_
20
+
21
+ **Hackathon theme alignment — Theme #2: (Super) Long-Horizon Planning & Instruction Following.** A T20 match is up to **240 legal balls of strategic decision-making across both offensive and defensive roles** — exactly the regime where current LLM agents struggle: deep multi-step reasoning, sparse terminal reward (win/loss arrives 100+ turns after the early decisions that caused it), and recovery from early mistakes (a wicket in over 1 should reshape the plan for over 19).
22
+
23
+ CricketCaptain is a serious test of an LLM's ability to handle **long-horizon planning under sparse, delayed reward** with **opponent modeling** and **multi-action-type credit assignment**. One match = ~180 sequential tool calls across batting AND bowling phases, with 13 state-conditioned tools, a real Markov ball-outcome engine trained on **1.65M cricsheet deliveries**, and a composite reward signal that scores plan coherence, tactical adaptation, opponent awareness, regret, and final match outcome.
24
+
25
+ This is the agentic-RL equivalent of training a coding agent — partial trajectory rollouts, dense intermediate rewards, sparse terminal outcome — but in a domain where the strategy space is genuinely **two-sided** (you don't just write code; you also *anticipate the opposing captain*) and the action distribution is **mixed-discrete-continuous** (categorical tool choice + numeric aggression + free-text rationale).
26
 
27
  The Hugging Face Space exposes the OpenEnv server and a Gradio demo UI at `/web`.
28
 
29
+ **Why this is novel:** No prior agentic-RL benchmark covers strategic two-sided sports captaincy with phase-aware tool gating, real-data outcome simulation, and a composable rubric system. Cricket is also one of the few domains where *the same agent must alternate between offensive and defensive policies* within a single episode — a capability gap most multi-turn agent benchmarks (SWE-bench, WebArena, AgentBench) don't touch.
30
+
31
  ---
32
 
33
  ## The Problem
 
46
 
47
  ---
48
 
49
+ ## Train → Eval Story (the framing)
50
+
51
+ This is **multi-turn agentic RL with partial-trajectory training and full-task eval generalization** — the same pattern coding-agent RL papers (SWE-RL, AgentR) use.
52
+
53
+ | | TRAINING (warmup) | TRAINING (main) | EVALUATION |
54
+ |---|---|---|---|
55
+ | max overs | 2-3 (curriculum) | 5 (end-to-end) | full T20 (20 overs) |
56
+ | steps | 30 | 100 | n/a |
57
+ | token budget per rollout | 16k (≈ 80–180 turns) | 24k (≈ 120–240 turns) | unlimited (full match plays out) |
58
+ | reward signal | composite (`r_result` + `r_cricket` + `r_behavior` + `r_validity`) — see `server/reward_calculator.py` | same | **headline metric: win rate vs heuristic baseline** |
59
+ | what model learns | format mastery + per-state decisions on short formats | full-match strategic depth on 5-over | composes per-state decisions into match-winning trajectories |
60
+ | script | `python train.py train --config configs/cricket_train_qwen3_warmup.yaml` | `python train.py train --config configs/cricket_train_qwen3.yaml` | `compare_eval.py` (baseline + trained, prints comparison) |
61
+
62
+ The training operates on shorter formats than evaluation, but the trained policy generalizes to full matches at inference because it learns good per-state decisions, not specific trajectory lengths. The whole chain runs via `bash scripts/run_warmup_then_main.sh`.
63
+
64
+ ## Results
65
+
66
+ **Live W&B project:** https://wandb.ai/ptnv-s-research/huggingface
67
+
68
+ Training stack: Qwen3-4B-Instruct-2507 + LoRA r=64 + TRL GRPO, vLLM colocate, 1× H200. Warmup is on a 2–3 over curriculum (30 steps); the main run trains on 5-over end-to-end matches (100 steps) and resumes from the warmup adapter.
69
+
70
+ Static plots will land here once the chain completes:
71
+ - **Training reward curve:** [docs/plots/training_reward_over_steps.png](docs/plots/training_reward_over_steps.png)
72
+ - **Per-rubric breakdown:** [docs/plots/per_rubric_breakdown.png](docs/plots/per_rubric_breakdown.png)
73
+ - **Tool-call execution frequency:** [docs/plots/tool_call_frequency.png](docs/plots/tool_call_frequency.png)
74
+ - **Before/after comparison:** [docs/plots/before_after_comparison.png](docs/plots/before_after_comparison.png)
75
+
76
+ ### Key engineering findings (documented in commit history)
77
+
78
+ | Issue surfaced | Fix | Effect |
79
+ |---|---|---|
80
+ | Only ~19% of rollouts reached `done` naturally | Restrict the system prompt to a single `<tool_call>...</tool_call>` XML format (the prompt previously also advertised bare JSON, which TRL's response-schema parser rejects, ending the rollout) | tools/call_frequency 9 → 73; rollouts 5–8× longer; matches actually play out |
81
+ | GRPO group std collapsing to 0 once matches completed | Remove the `[-1, 1]` reward clip — let GRPO standardize the advantage itself | reward std 0.0 → 1.5; gradient signal restored |
82
+ | Composite reward stayed positive even on dominant losses | Add explicit `outcome_bonus = -1.0` for losses (was 0.0); reduce the always-positive `progress_bonus` cap | composite now spans negative AND positive — model has a real reason to win |
83
+
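For reference, the contract behind that first fix is easy to state: a valid completion contains exactly one `<tool_call>...</tool_call>` block wrapping a JSON tool call. The sketch below is illustrative only; the actual parser is `_parse_completion` in `train.py` (reused by `compare_eval.py`), and the regex and helper name here are assumptions.

```python
# Minimal sketch (NOT the actual train.py _parse_completion) of parsing the
# XML-wrapped tool-call format described above. Assumes the completion holds a
# single <tool_call>{...}</tool_call> block with "tool"/"name" and "arguments".
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(completion: str) -> dict | None:
    """Return {"tool": ..., "arguments": {...}} or None if unparseable."""
    match = TOOL_CALL_RE.search(completion)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    tool = payload.get("tool") or payload.get("name")
    if not tool:
        return None
    return {"tool": tool, "arguments": payload.get("arguments") or {}}

# parse_tool_call('<tool_call>{"name": "play_delivery", "arguments": {"shot_intent": "single"}}</tool_call>')
# -> {"tool": "play_delivery", "arguments": {"shot_intent": "single"}}
```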
84
  ## Read This First
85
 
86
  - [`docs/benchmark_explainer.md`](docs/benchmark_explainer.md): full explanation of the problem statement, OpenEnv architecture, environment loop, rewards, data curation, training, and competition compliance.
 
103
  ```
104
 
105
  ### 2. Batting (when agent bats)
106
+ Choose a real roster batter, set/update the long-horizon plan, plan the shot, then play each delivery.
107
  ```json
108
+ {"tool": "set_match_plan", "arguments": {"powerplay_intent": "Use V Kohli and NT Tilak Varma to build a stable platform.", "middle_intent": "Rotate against spin and attack weak matchups.", "death_intent": "Use finishers for boundary options.", "risk_budget": "Escalate only with wickets in hand or target pressure.", "trigger_conditions": "Review after wickets, phase changes, or repeated dots/boundaries.", "rationale": "Roster-aware plan for a short chase."}}
109
+ {"tool": "select_batter", "arguments": {"name": "V Kohli", "style": "balanced", "aggression": 0.45, "rationale": "Reliable top-order batter to control risk early."}}
110
  {"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Middle overs against spin — rotate strike and preserve wickets."}}
111
  {"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy gap."}}
112
  {"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Working into the gap at mid-wicket."}}
113
  ```
114
 
115
  ### 3. Bowling & Fielding (when agent bowls)
116
+ Choose a real roster bowler, set a delivery/field plan, then bowl each delivery against an opponent policy.
117
  ```json
118
+ {"tool": "choose_bowler", "arguments": {"name": "BB Sran", "bowler_type": "pace", "style": "economy", "rationale": "Use a roster pacer in the powerplay with a new ball."}}
119
  {"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "delivery_type": "yorker", "line": "stumps", "length": "full"}}
120
  {"tool": "set_field_setting", "arguments": {"setting": "Aggressive"}}
121
  {"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "delivery_type": "yorker", "line": "stumps", "length": "full", "rationale": "Limit swing room against an aggressive finisher."}}
 
154
  | `bowl_delivery` | Bowling | Bowl the next delivery |
155
  | `reflect_after_ball` | Bat/Bowl | Record post-ball tactical adjustment |
156
  | `analyze_situation` | Any | Query pitch, bowler, field, or match situation |
157
+ | `set_match_plan` | Bat/Bowl | Establish powerplay/middle/death plan, risk budget, and triggers |
158
+ | `update_match_plan` | Bat/Bowl | Revise match plan with a match-state reason |
159
 
160
  `plan_shot.target_area` is normalized into cricket zones such as `cover`, `point`, `straight`, `midwicket`, `square_leg`, `fine_leg`, `long_on`, and `long_off`. `plan_shot.trajectory` can be `ground`, `lofted`, or `aerial`. Delivery plans normalize line (`outside_off`, `stumps`, `pads`, `wide`), length (`yorker`, `full`, `good`, `short`, `bouncer`), and variation (`stock`, `swing`, `seam`, `slower`, `yorker`, `bouncer`, `off_spin`, `leg_spin`, `googly`).
161
 
 
163
 
164
  ## Reward Architecture
165
 
166
+ A **composable 4-rubric composite** following the SWE-RL recipe (60% intermediate / 40% terminal) — chosen because partial-trajectory training (where most episodes truncate before completion) needs gradient signal that actually fires. Putting most weight on the rare terminal reward washes out learning.
167
+
168
  | Rubric | Weight | When | What |
169
  |--------|--------|------|------|
170
+ | `r_cricket` | **45%** | Per ball | Dream11-style proxy — runs, wickets, dots, boundaries, economy, milestones |
171
+ | `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
172
+ | `r_result` | **20%** | Innings/episode end | Match outcome: chase progress, defense margin, win bonus, DLS par |
173
+ | `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |
174
+
175
+ Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)` — caps at +0.25 once the agent makes ≥10 tool calls. Directly rewards escaping the "planning loop" trap (where the policy maxes overhead tools without ever calling `play_delivery`).
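As a rough illustration of how the table and the progress bonus combine (the authoritative computation lives in `server/reward_calculator.py`; this sketch only applies the weights and formula quoted above):

```python
# Illustrative sketch only. Real logic: server/reward_calculator.py.
# Weights follow the rubric table above; the progress bonus uses the quoted formula.
def composite_reward(r_cricket: float, r_behavior: float, r_result: float,
                     r_validity: float, tool_calls_made: int) -> float:
    progress_bonus = min(0.25, tool_calls_made / 40.0)  # caps at +0.25 after 10 calls
    r_result = r_result + progress_bonus                # bonus is folded into r_result
    return (0.45 * r_cricket
            + 0.25 * r_behavior
            + 0.20 * r_result
            + 0.10 * r_validity)

# A truncated rollout with decent per-ball play but no terminal result:
# composite_reward(r_cricket=0.6, r_behavior=0.4, r_result=0.0,
#                  r_validity=1.0, tool_calls_made=30)  # = 0.52
```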
176
 
177
  `r_tools` is computed and logged for analysis but excluded from the composite — tool discipline is measured through outcome and behavior instead.
178
 
179
+ **Why this weighting works:** in partial-trajectory training every turn produces a reward (validity + behavior, plus the per-ball Dream11 signal whenever a ball is actually bowled); the terminal `r_result` only fires when an episode completes. The weights are calibrated to put gradient on the dense signals that fire most often, mirroring the SWE-RL / coding-agent-RL recipe.
180
 
181
+ **Single-stage training (full composite reward from step 0):**
182
+ Qwen3-4B-Instruct-2507 emits `<tool_call>...</tool_call>` natively, so we skip the legacy
183
+ "format mastery" warm-up and run the full composite reward (`r_result + r_cricket +
184
+ r_behavior + r_validity`) from step 0. The internal `curriculum_stage` field is still
185
+ set to `2` for code-path compatibility — it just means "full reward".
186
 
187
+ **Two-config workflow:**
188
+ - [`configs/cricket_train_qwen3_warmup.yaml`](configs/cricket_train_qwen3_warmup.yaml) — short
189
+ 2-3 over curriculum, 30 steps. Bootstraps the LoRA adapter on a fast format.
190
+ - [`configs/cricket_train_qwen3.yaml`](configs/cricket_train_qwen3.yaml) — 5-over end-to-end,
191
+ 100 steps. `resume_from: ./checkpoints/stage2_final` picks up the warmup adapter.
192
 
193
  **Innings-specific scoring:**
194
  - **1st Innings (batting):** Score vs DLS par baseline
 
247
  ├── inference.py # Random + LLM agent evaluation
248
  ├── client.py # OpenEnv WebSocket client (CricketCaptainEnv)
249
  ├── models.py # GameState, CricketAction, CricketObservation, CricketState
250
+ ├── train.py # TRL GRPO agent training with environment_factory tool calls
251
  ├── eval.py # Coherence heatmaps, reward curves
252
  ├── scripts/
253
  │ └── curate_transitions.py # Cricsheet → Markov transition table pipeline
 
292
 
293
  `server/player_roster.py` loads team profiles from `data/player_profiles/` (10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan). When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs fuzzy lookup (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths.
294
 
295
+ ### Tool budget (per over)
296
+
297
+ The simulator counts **strategic / analysis** tools that do not advance the ball. Constants live in `CricketEnvironment` as `TOOL_BUDGET_PER_OVER` (3) and `TOOL_FINE_PER_EXCESS` (0.04).
298
+
299
+ **Overhead tools (count toward the 3 / over):** `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`.
300
 
301
+ **Not overhead (no per-over counter):** `plan_shot`, `set_match_plan`, `update_match_plan`, `select_batter`, `choose_bowler`, `set_field_setting`, `play_delivery`, `bowl_delivery`, `call_toss`, and other execution tools.
302
+
303
+ The first **three** overhead calls in a given over are free of this fine. Each additional overhead call in that over applies an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over` so the model can learn to ration reflection and re-planning.
304
+
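A minimal sketch of the accounting this implies, assuming the constants named above; the helper is illustrative and not the actual `CricketEnvironment` code:

```python
# Sketch of the per-over overhead fine described above (illustrative names).
TOOL_BUDGET_PER_OVER = 3      # free overhead calls per over
TOOL_FINE_PER_EXCESS = 0.04   # step-reward penalty per extra overhead call

OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}

def overhead_fine(tool: str, overhead_calls_this_over: int) -> float:
    """Return the (negative) step-reward adjustment for this tool call."""
    if tool not in OVERHEAD_TOOLS:
        return 0.0                     # plan_shot, play_delivery, etc. are never fined
    if overhead_calls_this_over < TOOL_BUDGET_PER_OVER:
        return 0.0                     # first three overhead calls in the over are free
    return -TOOL_FINE_PER_EXCESS       # every additional one costs -0.04

# The 4th analyze_situation in the same over:
# overhead_fine("analyze_situation", overhead_calls_this_over=3)  # -> -0.04
```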
305
+ ### Tool budget and training
306
+
307
+ `train.py train` rollouts are full environment episodes: the same fines apply on every `step` the policy takes. Over a long match, repeatedly burning the budget (for example, `analyze_situation` or `reflect_after_ball` on most balls) **accumulates** many small penalties and competes with match outcome and behavior rewards. GRPO therefore sees a direct signal to use overhead tools when they change decisions, not as padding. Long-horizon **match plans** (`set_match_plan` / `update_match_plan`) are not charged against this overhead budget, so the agent can state multi-phase intent without spending the 3 “slots” on raw analysis calls.
308
 
309
  ### Markov Engine
310
 
 
318
 
319
  Bowler rotation mirrors real cricket: pace-heavy powerplay (90/10), spin-heavy middle overs (45/55), pace-heavy death (80/20). Each bowler has a 10-over cap before rotation is forced.
320
 
321
+ ### GRPO Agent Training
322
 
323
+ `train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The trainer creates live `CricketEnvironment` instances, exposes captaincy actions as tool methods, and lets the model interact over multiple tool-calling turns instead of scoring isolated prompt/completion strings.
324
 
325
+ The dataset is a set of seeded cricket scenarios. Each rollout resets the environment with `agent_team`, `opponent_mode`, `max_overs`, and optional eval-pack/cache settings. Rewards come back from the environment after real state transitions, so training sees wickets, targets, role swaps, plan updates, and terminal match results.
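A hedged sketch of what one seeded scenario looks like when turned into a live rollout; the reset options mirror the ones `compare_eval.py` passes, while the scenario dict itself is illustrative:

```python
# Sketch: one seeded scenario -> one live environment rollout.
# Reset options mirror compare_eval.py; the scenario dict is illustrative.
from server.cricket_environment import CricketEnvironment

scenario = {
    "seed": 1234,
    "max_overs": 5,
    "opponent_mode": "heuristic",
    "agent_team": "india",
    "eval_pack_id": "adaptive_t20_v1",
}

env = CricketEnvironment()
obs = env.reset(seed=scenario["seed"], options={
    "task": "stage2_full",
    "random_start": False,
    "max_overs": scenario["max_overs"],
    "eval_pack_id": scenario["eval_pack_id"],
    "opponent_mode": scenario["opponent_mode"],
    "agent_team": scenario["agent_team"],
})
# obs.prompt_text is the first user turn; the policy then alternates tool calls
# (CricketAction) with environment observations until obs.done.
```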
 
 
 
 
326
 
327
+ The training model is configured separately from inference/demo models:
328
+
329
+ - `train.model`: model being optimized with GRPO, currently `Qwen/Qwen3-4B-Instruct-2507` (256k native context, native `Qwen3ForCausalLM` in vLLM, no thinking blocks).
330
+ - `opponent.model`: live opponent model used when `opponent.mode=llm_live`, currently `google/gemma-4-26B-A4B-it`.
331
+ - `captain.model`: inference/evaluation captain model used by `inference.py` and `eval.py`.
 
332
 
333
  ---
334
 
335
  ## Quickstart
336
 
337
+ ### 1. Install (uv)
338
+
339
+ System requirements: Python 3.10+, CUDA 12.x for training. Inference / random baselines / Gradio UI work CPU-only.
340
+
341
+ This project is managed with [uv](https://docs.astral.sh/uv/). Versions in [pyproject.toml](pyproject.toml) are pinned to a known-working set (transformers 5.6.2 + trl 1.2.0 + vllm 0.19.1 + torch 2.10.0) — this is the lowest combination that supports TRL multi-turn `environment_factory` AND vLLM colocate AND transformers v5 chat templates. Earlier vLLM pins (<0.19) force transformers <5 and break multi-turn training.
342
+
343
+ ```bash
344
+ # Clone and enter
345
+ git clone <this-repo> cricket-captain-llm
346
+ cd cricket-captain-llm
347
+
348
+ # Create a venv and install core + training extras + eval plots
349
+ uv venv .venv --python 3.10
350
+ uv pip install --python .venv/bin/python -e ".[train,eval]"
351
+
352
+ # Activate
353
+ source .venv/bin/activate
354
+
355
+ # HuggingFace login (only needed for gated models / model downloads)
356
+ huggingface-cli login
357
+ # or: export HF_TOKEN=hf_...
358
+ ```
359
+
360
+ Inference-only (no GPU) install:
361
+
362
+ ```bash
363
+ uv venv .venv --python 3.10
364
+ uv pip install --python .venv/bin/python -e .
365
+ ```
366
+
367
+ ### 2. YAML config (single source of truth)
368
+
369
+ All commands read defaults from a YAML — pass `--config configs/cricket_train_qwen3.yaml` (or one of the others under `configs/`) and only override what you need on the CLI. Four config groups:
370
+
371
+ | Group | Keys | Used by |
372
+ |---|---|---|
373
+ | `env.*` | `agent_team`, `max_overs`, `eval_pack_id`, `env_url` | server, inference, training |
374
+ | `opponent.*` | `mode`, `model`, `api_base`, `api_key_env`, `cache_path` | server (heuristic / cricsheet / llm_live / llm_cached) |
375
+ | `captain.*` | `model`, `api_base`, `api_key_env` | inference / eval (when using a live captain LLM) |
376
+ | `train.*` | `model`, `resume_from`, `stage`, `prompts`, `steps`, `batch_size`, `grad_accum`, `num_generations`, `max_completion_length`, `max_tool_calling_iterations`, `learning_rate`, `beta`, `temperature`, `top_p`, `gradient_checkpointing`, `gradient_checkpointing_use_reentrant`, `dataloader_pin_memory`, `dataloader_num_workers`, `bf16_base`, `save_steps`, `save_total_limit`, `report_to`, `run_name` | GRPO trainer |
377
 
378
+ - `configs/cricket_train_qwen3_warmup.yaml` **GRPO warmup** (2-3 over curriculum, 30 steps). Run first.
379
+ - `configs/cricket_train_qwen3.yaml` — **GRPO main** (5-over end-to-end, 100 steps). Resumes from warmup adapter via `resume_from:`.
380
+ - `configs/cricket_train_qwen3_smoke.yaml` 2-step infrastructure smoke test.
381
+ - `configs/game_knowledge.yaml` — reward weights and game constants (loaded at import time).
382
+ - `configs/extras/` — legacy Qwen3.5 configs and `default.yaml`, kept for reference.
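The precedence rule in practice: the YAML supplies defaults and an explicit CLI flag wins. A minimal sketch (helper layout is illustrative, not the actual `train.py` / `inference.py` code):

```python
# Sketch of YAML-default / CLI-override precedence (illustrative, not train.py).
import argparse
import yaml  # pyyaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", default="configs/cricket_train_qwen3.yaml")
parser.add_argument("--max-overs", type=int, default=None)  # None means "not overridden"
args = parser.parse_args()

with open(args.config) as f:
    cfg = yaml.safe_load(f)

# An explicit CLI flag wins; otherwise fall back to the YAML default.
max_overs = args.max_overs if args.max_overs is not None else cfg["env"]["max_overs"]
print(f"max_overs = {max_overs}")
```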
383
 
384
+ ### 3. Run the environment server (for inference / Gradio)
385
 
386
  ```bash
387
+ cd cricket-captain-llm
388
+ PYTHONPATH=. python server/app.py --port 8000 --config configs/extras/default.yaml
389
+ ```
390
+
391
+ The server exposes the OpenEnv WebSocket at `ws://localhost:8000/ws` and a Gradio UI at `http://localhost:8000/web`.
392
+
393
+ ### 4. Inference baselines (no training required)
394
+
395
+ ```bash
396
+ export CRICKET_CAPTAIN_ENV_URL="ws://localhost:8000"
397
+
398
+ # Random agent (no API key needed) — fast sanity check
399
+ python inference.py --model random --episodes 5 --opponent-mode heuristic --max-overs 5
400
 
401
+ # Live HF Gemma captain baseline using config defaults
402
+ export HF_TOKEN=hf_...
 
403
  python inference.py --config configs/default.yaml --episodes 1
404
  ```
405
 
406
+ ### 5. GRPO training
407
 
408
+ Training does **not** need the server — it instantiates `CricketEnvironment` directly and runs it through TRL `GRPOTrainer` with `environment_factory`. All rollouts are simulated live (no static dataset).
409
 
410
+ The recommended workflow is **warmup → main run**, both controlled entirely from YAML:
411
+
412
+ **Single-command chain** (warmup → main, auto-resume from warmup adapter):
413
  ```bash
414
+ bash scripts/run_warmup_then_main.sh
415
+ # Logs: /tmp/train_warmup.log (then /tmp/train_main.log on success)
416
+ # Final adapter: ./checkpoints/stage2_final/
 
417
  ```
418
 
419
+ This chain is what we run to produce the trained model. Internally:
420
+
421
+ **Step 1 — Warmup (2-3 over curriculum, 30 steps, ~50–60 min on a single H200):**
422
+ - Curriculum-distributed `max_overs` (heavy on 2-over, tail to 3-over) so episodes
423
+ complete inside the token budget and `r_result` fires reliably.
424
+ - Bootstraps the LoRA adapter from base Qwen3-4B-Instruct-2507 → saves to
425
+ `./checkpoints/stage2_final/`.
426
+
427
+ **Step 2 — Main (5-over end-to-end, 100 steps, ~5–7 hrs):**
428
+ - Resumes the warmup adapter via `resume_from: ./checkpoints/stage2_final`.
429
+ - Trains on full 5-over matches with the `r_result` outcome signal as the dominant gradient driver.
430
+ - Final adapter at `./checkpoints/stage2_final/` (overwritten — that's the deliverable).
431
+
432
+ **Run components individually:**
433
  ```bash
434
+ # Warmup only
435
+ PYTORCH_ALLOC_CONF=expandable_segments:True \
436
+ python train.py train --config configs/cricket_train_qwen3_warmup.yaml
437
 
438
+ # Main only (assumes ./checkpoints/stage2_final/ exists)
439
+ PYTORCH_ALLOC_CONF=expandable_segments:True \
440
+ python train.py train --config configs/cricket_train_qwen3.yaml
441
 
442
+ # Main without resuming (fresh adapter)
443
+ python train.py train --config configs/cricket_train_qwen3.yaml --resume-from ""
444
+ ```
445
 
446
+ The opponent is `heuristic` (rule-based) by default for fast iteration.
447
+ Switch to `mode: llm_live` in `configs/cricket_train_qwen3.yaml` (and set `HF_TOKEN`) to train
448
+ against the live Gemma adversary.
449
 
450
+ ### 6. Evaluating the trained model
 
451
 
452
+ The eval-time story is the headline. Training caps rollouts at the warmup/main token budgets (16k / 24k), so warmup rollouts cover 2–3 overs and main rollouts cover 5-over matches. At inference there's no token cap — full T20 matches play out. This is the same pattern coding-agent RL papers (SWE-RL, AgentR) use: train on partial windows, evaluate on full task completion.
453
+
454
+ ```bash
455
+ # Baseline (untrained Qwen3-4B-Instruct-2507)
456
+ python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
457
+ --label baseline --episodes 20 --max-overs 5 \
458
+ --output eval_results/baseline.json
459
+
460
+ # Trained (uses LoRA adapter)
461
+ python compare_eval.py --model Qwen/Qwen3-4B-Instruct-2507 \
462
+ --adapter ./checkpoints/stage2_final \
463
+ --label trained --episodes 20 --max-overs 5 \
464
+ --output eval_results/trained.json
465
+
466
+ # Side-by-side comparison table
467
+ python compare_eval.py --compare \
468
+ eval_results/baseline.json \
469
+ eval_results/trained.json
470
  ```
471
 
472
+ Each `--episodes 20` run takes ~30-45 min depending on match length. The comparison
473
+ prints a side-by-side table of: match completion rate, win rate, mean agent score,
474
+ mean wickets lost, mean tool calls, validity rate, and per-rubric reward breakdown.
475
+
476
+ ### 7. Tuning batch size for your GPU
477
+
478
+ The dominant memory consumer during GRPO is **attention prefill of input prompts** (game state ~1300 tokens) for `generation_batch_size = batch_size × grad_accum` simultaneous sequences.
479
+
480
+ | GPU VRAM | flash-attn? | Recommended `batch_size` × `grad_accum` | `max_completion_length` |
481
+ |---|---|---|---|
482
+ | 24 GB (A10/3090) | required | 1 × 4 = 4 | 512 |
483
+ | 46 GB (L40S) | recommended | 4 × 4 = 16 | 512 |
484
+ | 46 GB (L40S) | not installed | 2 × 4 = 8 | 512 |
485
+ | 80 GB (A100/H100) | required | 8 × 4 = 32 | 1024 |
486
+
487
+ Cricket tool calls are short JSON objects (~20–300 tokens), so `max_completion_length: 512` is plenty. Without flash-attn, SDPA allocates `O(seq_len²)` attention matrices — keep batches small or install flash-attn.
488
+
489
+ If you OOM, halve `batch_size` first. If you OOM on prefill specifically (during generation, not gradient), it's the prompt length × generation_batch_size — install flash-attn or shrink `grad_accum`.
490
+
491
+ ### 7. Where to find logs and outputs
492
+
493
+ | Path | What |
494
+ |---|---|
495
+ | stdout | Per-step loss, reward, lr (every `logging_steps=10`) plus full sampled completions |
496
+ | `checkpoints/stage{1,2}/` | HF Trainer state, intermediate LoRA checkpoints (every ~80 steps) |
497
+ | `checkpoints/stage{1,2}_final/` | Final LoRA + tokenizer |
498
+ | `illustrations/exp_*/run_output.txt` | Per-step environment trace from `inference.py` and `train.py train-smoke` |
499
+
500
+ For a tensorboard dashboard, set `train.report_to: tensorboard` in the YAML, then:
501
+ ```bash
502
+ tensorboard --logdir checkpoints/
503
+ ```
504
+
505
+ ### 8. Smoke test (no model load)
506
+
507
+ Verify the environment + opponent + tool budget end-to-end without loading any model:
508
+ ```bash
509
+ PYTHONPATH=. python train.py train-smoke \
510
+ --config configs/extras/default.yaml \
511
+ --matches 1 --max-overs 2 \
512
+ --opponent-mode heuristic
513
+ ```
514
+ This runs one short match with random actions and writes a full step log to `illustrations/exp_*/`.
515
+
516
  ### Lightning / Remote Runtime Notes
517
 
518
  `localhost:8000` only works when the agent process and server process are in the same network namespace. On Lightning, expose the server port and pass the resulting WebSocket URL via:
 
539
 
540
  1. Random baseline with heuristic opponent.
541
  2. Base/untrained LLM baseline.
542
+ 3. GRPO warmup (`configs/cricket_train_qwen3_warmup.yaml`) for format mastery on short matches.
543
+ 4. GRPO main (`configs/cricket_train_qwen3.yaml`) for full 5-over strategic depth, resuming the warmup adapter.
544
+ 5. Eval with `adaptive_t20_v1` and cached opponent decisions.
 
545
 
546
  See [`docs/experiment_workflow.md`](docs/experiment_workflow.md) for exact commands and rationale.
547
 
 
681
 
682
  ---
683
 
684
+ ## Hackathon Submission Materials
685
+
686
+ | Material | Link |
687
+ |---|---|
688
+ | **Live HF Space** (env runs here) | https://huggingface.co/spaces/pratinavseth/cricket-captain-llm |
689
+ | **GitHub repo** | https://github.com/pratinavseth/cricket-captain-llm |
690
+ | **W&B project** (training runs) | https://wandb.ai/ptnv-s-research/huggingface |
691
+ | **Mini-blog (HF blog)** | _placeholder — will be added before submission_ |
692
+ | **Demo video (≤2 min, YouTube)** | _placeholder — will be added before submission_ |
693
+ | **Slide deck** | _optional — TBD_ |
694
+
695
+ ### Submission checklist (per hackathon guidelines)
696
+
697
+ - [x] Use OpenEnv (latest release): `openenv-core[core]>=0.2.2` ([pyproject.toml](pyproject.toml))
698
+ - [x] Working training script (TRL GRPO): [train.py](train.py)
699
+ - [x] Reward + training pipeline coherent: 4-rubric composite (`r_result`, `r_cricket`, `r_behavior`, `r_validity`) with documented signal flow
700
+ - [x] HF Space pushed and discoverable
701
+ - [ ] Loss + reward plots from real run (in progress — generated after main run completes)
702
+ - [ ] Mini-blog or ≤2 min video (in progress)
703
+ - [x] README motivates problem, explains env, links materials
704
+ - [x] OpenEnv compliance: `Environment` base class, valid `openenv.yaml`, no reserved tool names
705
+
706
+ ---
707
+
708
  ## Citation
709
 
710
  ```
compare_eval.py ADDED
@@ -0,0 +1,395 @@
1
+ """
2
+ compare_eval.py — Baseline vs trained head-to-head evaluation.
3
+
4
+ Plays N full cricket matches with the BASELINE model (untrained Qwen3.5-4B)
5
+ and the TRAINED model (Qwen3.5-4B + LoRA adapter from a checkpoint), then
6
+ dumps a comparison table:
7
+
8
+ win_rate, mean_agent_score, mean_opp_score, mean_wickets, match_completion_rate,
9
+ mean_tool_calls_per_episode, validity_rate, plus a few illustrative transcripts.
10
+
11
+ Why this is the right eval for our setup
12
+ ----------------------------------------
13
+ Training uses a token budget per rollout (~4096 tokens, ~16 turns) which truncates
14
+ most matches. At EVAL time we lift that cap entirely — the model gets unlimited
15
+ context and can actually play full matches. This is the same pattern coding-agent
16
+ RL papers use: train on partial windows, eval on full task completion. The trained
17
+ policy generalizes because it learned good per-state decisions, not a specific
18
+ trajectory length.
19
+
20
+ Usage
21
+ -----
22
+ # Baseline (untrained Qwen3.5-4B base)
23
+ python compare_eval.py \\
24
+ --model Qwen/Qwen3.5-4B \\
25
+ --label baseline \\
26
+ --episodes 20 --max-overs 5 \\
27
+ --output eval_results/baseline.json
28
+
29
+ # Trained (warmup + main checkpoint)
30
+ python compare_eval.py \\
31
+ --model Qwen/Qwen3.5-4B \\
32
+ --adapter ./checkpoints/stage2_final \\
33
+ --label trained \\
34
+ --episodes 20 --max-overs 5 \\
35
+ --output eval_results/trained.json
36
+
37
+ # Side-by-side comparison
38
+ python compare_eval.py --compare eval_results/baseline.json eval_results/trained.json
39
+ """
40
+
41
+ import argparse
42
+ import json
43
+ import os
44
+ import sys
45
+ import time
46
+ from collections import Counter
47
+ from pathlib import Path
48
+
49
+ import torch
50
+ from peft import PeftModel
51
+ from transformers import AutoModelForCausalLM, AutoTokenizer
52
+
53
+ from server.cricket_environment import CricketEnvironment
54
+ from models import CricketAction
55
+ import train as train_module # reuse SYSTEM_PROMPT and _parse_completion
56
+
57
+
58
+ # ----------------------------------------------------------------------------
59
+ # Model loading
60
+ # ----------------------------------------------------------------------------
61
+
62
+ def load_model_for_eval(model_name: str, adapter_path: str | None = None):
63
+ """Load base model in bf16; optionally apply a LoRA adapter on top."""
64
+ print(f"Loading base model: {model_name}")
65
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
66
+ if tokenizer.pad_token is None:
67
+ tokenizer.pad_token = tokenizer.eos_token
68
+ model = AutoModelForCausalLM.from_pretrained(
69
+ model_name,
70
+ torch_dtype=torch.bfloat16,
71
+ device_map="auto",
72
+ trust_remote_code=True,
73
+ )
74
+ if adapter_path:
75
+ print(f"Loading LoRA adapter: {adapter_path}")
76
+ model = PeftModel.from_pretrained(model, adapter_path, is_trainable=False)
77
+ model.eval()
78
+ return model, tokenizer
79
+
80
+
81
+ # ----------------------------------------------------------------------------
82
+ # Single-episode rollout (no token cap — let matches actually complete)
83
+ # ----------------------------------------------------------------------------
84
+
85
+ def play_one_episode(
86
+ *,
87
+ model,
88
+ tokenizer,
89
+ max_overs: int,
90
+ opponent_mode: str,
91
+ agent_team: str,
92
+ eval_pack_id: str,
93
+ seed: int,
94
+ max_tool_calls: int = 800,
95
+ max_completion_per_turn: int = 256, # per-turn (NOT per-rollout) — eval is turn-by-turn
96
+ temperature: float = 0.3, # deterministic-ish at eval
97
+ verbose: bool = False,
98
+ ) -> dict:
99
+ """Run one full match. Returns per-episode stats."""
100
+ env = CricketEnvironment()
101
+ obs = env.reset(seed=seed, options={
102
+ "task": "stage2_full",
103
+ "random_start": False,
104
+ "max_overs": max_overs,
105
+ "eval_pack_id": eval_pack_id,
106
+ "opponent_mode": opponent_mode,
107
+ "agent_team": agent_team,
108
+ })
109
+
110
+ # Build the message log progressively. Each turn appends model output + tool response.
111
+ system_prompt = train_module.SYSTEM_PROMPT
112
+ messages = [
113
+ {"role": "system", "content": system_prompt},
114
+ {"role": "user", "content": obs.prompt_text},
115
+ ]
116
+
117
+ tool_calls_made = 0
118
+ tool_breakdown: Counter = Counter()
119
+ parse_failures = 0
120
+ illegal_tool_attempts = 0
121
+ start_t = time.time()
122
+
123
+ while not obs.done and (tool_calls_made + parse_failures + illegal_tool_attempts) < max_tool_calls:  # bound retries too, so repeated parse failures cannot spin forever
124
+ # Render chat using model's tool template
125
+ try:
126
+ inputs = tokenizer.apply_chat_template(
127
+ messages,
128
+ tokenize=True,
129
+ add_generation_prompt=True,
130
+ return_tensors="pt",
131
+ ).to(model.device)
132
+ except Exception as e:
133
+ print(f" apply_chat_template error: {e}")
134
+ break
135
+
136
+ with torch.no_grad():
137
+ out = model.generate(
138
+ inputs,
139
+ max_new_tokens=max_completion_per_turn,
140
+ do_sample=(temperature > 0),
141
+ temperature=max(temperature, 1e-5),
142
+ pad_token_id=tokenizer.pad_token_id,
143
+ )
144
+ gen_ids = out[0, inputs.shape[1]:]
145
+ completion = tokenizer.decode(gen_ids, skip_special_tokens=False)
146
+
147
+ # Parse the tool call
148
+ parsed = train_module._parse_completion(completion)
149
+ if parsed is None:
150
+ parse_failures += 1
151
+ if verbose:
152
+ print(f" PARSE FAIL: {completion[:200]}...")
153
+ messages.append({"role": "assistant", "content": completion})
154
+ messages.append({"role": "user", "content": "Your previous output was not parseable. Please emit exactly one tool call."})
155
+ continue
156
+
157
+ tool_name = parsed.get("tool", "")
158
+ tool_args = parsed.get("arguments", {}) or {}
159
+ tool_breakdown[tool_name] += 1
160
+
161
+ # Apply to env
162
+ try:
163
+ obs = env.step(CricketAction(tool=tool_name, arguments=tool_args))
164
+ tool_calls_made += 1
165
+ except Exception as e:
166
+ illegal_tool_attempts += 1
167
+ if verbose:
168
+ print(f" ILLEGAL TOOL: {tool_name} → {e}")
169
+ messages.append({"role": "assistant", "content": completion})
170
+ messages.append({"role": "user", "content": f"Tool error: {e}. Try a different tool."})
171
+ continue
172
+
173
+ messages.append({"role": "assistant", "content": completion})
174
+ messages.append({"role": "user", "content": obs.prompt_text})
175
+
176
+ elapsed = time.time() - start_t
177
+ state = env.state
178
+ breakdown = state.reward_breakdown or {}
179
+
180
+ # Determine match result
181
+ is_complete = bool(obs.done)
182
+ agent_score = int(state.total_score or 0)
183
+ opp_score = int(state.first_innings_score or 0) if state.innings_type == "second" else None
184
+ target = state.target
185
+ won = None
186
+ if is_complete:
187
+ # Crude win check; env's match_result string is the canonical source
188
+ result_str = (state.match_result or "").lower()
189
+ if "won" in result_str and "agent" in result_str:
190
+ won = True
191
+ elif "lost" in result_str or "won" in result_str:
192
+ won = False
193
+ else:
194
+ won = None
195
+
196
+ return {
197
+ "seed": seed,
198
+ "max_overs": max_overs,
199
+ "opponent_mode": opponent_mode,
200
+ "tool_calls_made": tool_calls_made,
201
+ "match_complete": is_complete,
202
+ "won": won,
203
+ "agent_score": agent_score,
204
+ "opponent_first_innings_score": opp_score,
205
+ "target": target,
206
+ "wickets_lost": int(state.wickets_lost or 0),
207
+ "match_result": state.match_result or "",
208
+ "tool_breakdown": dict(tool_breakdown),
209
+ "parse_failures": parse_failures,
210
+ "illegal_tool_attempts": illegal_tool_attempts,
211
+ "validity_rate": round(1.0 - (parse_failures + illegal_tool_attempts) / max(tool_calls_made + parse_failures + illegal_tool_attempts, 1), 4),
212
+ "reward_breakdown": dict(breakdown),
213
+ "elapsed_seconds": round(elapsed, 1),
214
+ }
215
+
216
+
217
+ # ----------------------------------------------------------------------------
218
+ # Run N episodes
219
+ # ----------------------------------------------------------------------------
220
+
221
+ def run_n_episodes(
222
+ *, model, tokenizer, episodes: int, max_overs: int, opponent_mode: str,
223
+ agent_team: str, eval_pack_id: str, seed_base: int, max_tool_calls: int,
224
+ max_completion_per_turn: int, temperature: float, verbose: bool,
225
+ ) -> dict:
226
+ results = []
227
+ for i in range(episodes):
228
+ seed = seed_base + i
229
+ print(f" [{i+1}/{episodes}] seed={seed} …", end="", flush=True)
230
+ try:
231
+ res = play_one_episode(
232
+ model=model, tokenizer=tokenizer,
233
+ max_overs=max_overs, opponent_mode=opponent_mode,
234
+ agent_team=agent_team, eval_pack_id=eval_pack_id, seed=seed,
235
+ max_tool_calls=max_tool_calls,
236
+ max_completion_per_turn=max_completion_per_turn,
237
+ temperature=temperature, verbose=verbose,
238
+ )
239
+ print(f" {res['tool_calls_made']} tool calls, "
240
+ f"{'COMPLETE' if res['match_complete'] else 'truncated'}, "
241
+ f"score {res['agent_score']}/{res['wickets_lost']}, "
242
+ f"{res['elapsed_seconds']}s")
243
+ results.append(res)
244
+ except Exception as e:
245
+ print(f" FAILED: {e}")
246
+ results.append({"seed": seed, "error": str(e)})
247
+
248
+ # Aggregate
249
+ valid = [r for r in results if "error" not in r]
250
+ n = len(valid)
251
+ if n == 0:
252
+ return {"results": results, "summary": {"n": 0, "error": "all episodes failed"}}
253
+
254
+ completed = [r for r in valid if r["match_complete"]]
255
+ won = [r for r in completed if r.get("won") is True]
256
+ summary = {
257
+ "n_episodes": n,
258
+ "match_completion_rate": round(len(completed) / n, 4),
259
+ "win_rate_among_completed": round(len(won) / max(len(completed), 1), 4),
260
+ "win_rate_overall": round(len(won) / n, 4),
261
+ "mean_agent_score": round(sum(r["agent_score"] for r in valid) / n, 2),
262
+ "mean_wickets_lost": round(sum(r["wickets_lost"] for r in valid) / n, 2),
263
+ "mean_tool_calls": round(sum(r["tool_calls_made"] for r in valid) / n, 1),
264
+ "mean_validity_rate": round(sum(r["validity_rate"] for r in valid) / n, 4),
265
+ "mean_composite_reward": round(sum(r["reward_breakdown"].get("composite", 0.0) for r in valid) / n, 4),
266
+ "mean_r_result": round(sum(r["reward_breakdown"].get("r_result", 0.0) for r in valid) / n, 4),
267
+ "mean_r_cricket": round(sum(r["reward_breakdown"].get("r_cricket", 0.0) for r in valid) / n, 4),
268
+ "mean_r_behavior": round(sum(r["reward_breakdown"].get("r_behavior", 0.0) for r in valid) / n, 4),
269
+ "mean_r_validity": round(sum(r["reward_breakdown"].get("r_validity", 0.0) for r in valid) / n, 4),
270
+ "tool_freq": {},
271
+ }
272
+ # Aggregate tool frequencies
273
+ all_tools: Counter = Counter()
274
+ for r in valid:
275
+ for t, c in (r.get("tool_breakdown") or {}).items():
276
+ all_tools[t] += c
277
+ total = sum(all_tools.values()) or 1
278
+ summary["tool_freq"] = {t: round(c / total, 3) for t, c in all_tools.most_common()}
279
+
280
+ return {"results": results, "summary": summary}
281
+
282
+
283
+ # ----------------------------------------------------------------------------
284
+ # Comparison printer
285
+ # ----------------------------------------------------------------------------
286
+
287
+ def print_comparison(baseline_path: str, trained_path: str):
288
+ with open(baseline_path) as f:
289
+ b = json.load(f)
290
+ with open(trained_path) as f:
291
+ t = json.load(f)
292
+ bs = b["summary"]
293
+ ts = t["summary"]
294
+
295
+ def row(label, key, fmt="{:.4f}"):
296
+ bv = bs.get(key)
297
+ tv = ts.get(key)
298
+ b_str = fmt.format(bv) if bv is not None else "-"
299
+ t_str = fmt.format(tv) if tv is not None else "-"
300
+ delta = ""
301
+ if isinstance(bv, (int, float)) and isinstance(tv, (int, float)):
302
+ d = tv - bv
303
+ delta = f" ({'+' if d >= 0 else ''}{d:.3f})"
304
+ print(f" {label:<32} {b_str:>12} {t_str:>12}{delta}")
305
+
306
+ print(f"\n{'='*80}")
307
+ print(f"BASELINE vs TRAINED — {bs['n_episodes']} episodes each")
308
+ print(f" baseline label: {b.get('label')} | trained label: {t.get('label')}")
309
+ print(f"{'='*80}")
310
+ print(f" {'metric':<32} {'baseline':>12} {'trained':>12}")
311
+ print(f" {'-'*32} {'-'*12} {'-'*12}")
312
+ row("match_completion_rate", "match_completion_rate")
313
+ row("win_rate_overall", "win_rate_overall")
314
+ row("win_rate_among_completed", "win_rate_among_completed")
315
+ row("mean_agent_score", "mean_agent_score", "{:.2f}")
316
+ row("mean_wickets_lost", "mean_wickets_lost", "{:.2f}")
317
+ row("mean_tool_calls", "mean_tool_calls", "{:.1f}")
318
+ row("mean_validity_rate", "mean_validity_rate")
319
+ row("mean_composite_reward", "mean_composite_reward")
320
+ row("mean_r_result", "mean_r_result")
321
+ row("mean_r_cricket", "mean_r_cricket")
322
+ row("mean_r_behavior", "mean_r_behavior")
323
+ row("mean_r_validity", "mean_r_validity")
324
+ print(f"{'='*80}\n")
325
+
326
+
327
+ # ----------------------------------------------------------------------------
328
+ # Main
329
+ # ----------------------------------------------------------------------------
330
+
331
+ def main():
332
+ parser = argparse.ArgumentParser(description="Baseline vs trained eval for CricketCaptain.")
333
+ parser.add_argument("--model", default="Qwen/Qwen3.5-4B", help="Base HF model id")
334
+ parser.add_argument("--adapter", default=None, help="Optional LoRA adapter directory")
335
+ parser.add_argument("--label", default="run", help="Label for this run (used in output)")
336
+ parser.add_argument("--episodes", type=int, default=10)
337
+ parser.add_argument("--max-overs", type=int, default=5)
338
+ parser.add_argument("--opponent-mode", default="heuristic",
339
+ choices=["heuristic", "llm_live", "llm_cached", "cricsheet"])
340
+ parser.add_argument("--agent-team", default="india")
341
+ parser.add_argument("--eval-pack-id", default="adaptive_t20_v1")
342
+ parser.add_argument("--seed-base", type=int, default=10000)
343
+ parser.add_argument("--max-tool-calls", type=int, default=800)
344
+ parser.add_argument("--max-completion-per-turn", type=int, default=256)
345
+ parser.add_argument("--temperature", type=float, default=0.3)
346
+ parser.add_argument("--output", default=None, help="JSON output path")
347
+ parser.add_argument("--verbose", action="store_true")
348
+
349
+ parser.add_argument("--compare", nargs=2, default=None, metavar=("BASELINE_JSON", "TRAINED_JSON"),
350
+ help="Skip eval; just print comparison from two existing JSON files")
351
+ args = parser.parse_args()
352
+
353
+ if args.compare:
354
+ print_comparison(args.compare[0], args.compare[1])
355
+ return
356
+
357
+ print(f"\nCricketCaptain compare-eval — label='{args.label}'")
358
+ print(f" model={args.model} adapter={args.adapter or '(none)'}")
359
+ print(f" {args.episodes} episodes × {args.max_overs} overs vs {args.opponent_mode} opponent\n")
360
+
361
+ model, tokenizer = load_model_for_eval(args.model, args.adapter)
362
+
363
+ out = run_n_episodes(
364
+ model=model, tokenizer=tokenizer,
365
+ episodes=args.episodes, max_overs=args.max_overs,
366
+ opponent_mode=args.opponent_mode,
367
+ agent_team=args.agent_team, eval_pack_id=args.eval_pack_id,
368
+ seed_base=args.seed_base, max_tool_calls=args.max_tool_calls,
369
+ max_completion_per_turn=args.max_completion_per_turn,
370
+ temperature=args.temperature, verbose=args.verbose,
371
+ )
372
+ out["label"] = args.label
373
+ out["model"] = args.model
374
+ out["adapter"] = args.adapter
375
+ out["config"] = {
376
+ "episodes": args.episodes, "max_overs": args.max_overs,
377
+ "opponent_mode": args.opponent_mode, "agent_team": args.agent_team,
378
+ "max_tool_calls": args.max_tool_calls,
379
+ "max_completion_per_turn": args.max_completion_per_turn,
380
+ "temperature": args.temperature,
381
+ }
382
+
383
+ print("\n=== SUMMARY ===")
384
+ print(json.dumps(out["summary"], indent=2))
385
+
386
+ if args.output:
387
+ out_path = Path(args.output)
388
+ out_path.parent.mkdir(parents=True, exist_ok=True)
389
+ with out_path.open("w") as f:
390
+ json.dump(out, f, indent=2)
391
+ print(f"\nResults → {out_path}")
392
+
393
+
394
+ if __name__ == "__main__":
395
+ main()
configs/cricket_train_qwen3.yaml ADDED
@@ -0,0 +1,99 @@
 
 
 
 
1
+ # =============================================================================
2
+ # CricketCaptain-LLM — Qwen3 MAIN run (5-over format, end-to-end matches)
3
+ #
4
+ # Differences from cricket_train.yaml:
5
+ # - Model: Qwen3-4B-Instruct-2507 (vs Qwen3.5-4B which has no vLLM class)
6
+ # - vLLM colocate enabled (3-5× throughput gain)
7
+ # - 5-over format instead of 20-over (20-over end-to-end is infeasible on 1× H200)
8
+ # - Resumes Qwen3 warmup adapter, not Qwen3.5 adapter
9
+ # - Reward clip removed at train.py:1240 (separately) — let GRPO standardize advantage
10
+ # - More steps (100 vs 30) since vLLM makes them affordable
11
+ #
12
+ # Stack: TRL 1.2.0 + transformers 5.6.2 + vLLM + PEFT (LoRA) + bf16 base
13
+ # Target hardware: 1× H200 (144 GB), single node
14
+ # =============================================================================
15
+
16
+ env:
17
+ eval_pack_id: adaptive_t20_v1
18
+ agent_team: india
19
+ max_overs: 5 # 5-over end-to-end. ~180 tool calls per match.
20
+ env_url: ws://localhost:8000
21
+
22
+ opponent:
23
+ mode: heuristic # llm_live optional but slows training ~2-3×
24
+
25
+ train:
26
+ # ---- Model & adapter ----
27
+ model: Qwen/Qwen3-4B-Instruct-2507
28
+ # Resume from the Qwen3 warmup adapter (NOT from Qwen3.5 stage2_final).
29
+ # Comment this line out to start the main run with a fresh adapter.
30
+ resume_from: ./checkpoints/stage2_final
31
+ stage: 2
32
+
33
+ # ---- Dataset ----
34
+ prompts: 256
35
+
36
+ # ---- Schedule ----
37
+ # 5-over rollouts ≈ ~180 tool calls each. With vLLM colocate at B=1 sims=4,
38
+ # roughly ~3-4 min/step. 100 steps ≈ ~5-7 hrs.
39
+ steps: 100
40
+ logging_steps: 1
41
+ save_steps: 20
42
+ save_total_limit: 5
43
+
44
+ # ---- Throughput knobs ----
45
+ # B=1 × grad_accum=4 × G=4 → gen_batch=4, 4 sims in flight per step.
46
+ # Matches the working warmup config. Warmup at B=2 / 24k OOM'd on backward
47
+ # due to gradient-accum buffer churn — at 32k completion it would OOM worse.
48
+ # KV cache ≈ 4 × 36k × 144 KB ≈ 21 GB, comfortable in vLLM 0.55 pool (78 GB).
49
+ batch_size: 1
50
+ grad_accum: 4
51
+ num_generations: 4
52
+
53
+ # ---- Length budget ----
54
+ # 24k completion. At 32k we OOM'd at step 7 of main (140+ GB used). 24k gives
55
+ # ~7 GB headroom for backward-pass activation memory. Full 5-over match needs
56
+ # ~9k tokens of model output, so 24k is 2.5x headroom — plenty.
57
+ max_completion_length: 24576
58
+ max_tool_calling_iterations: 240
59
+
60
+ # ---- Optimizer ----
61
+ learning_rate: 5.0e-6
62
+
63
+ # ---- GRPO ----
64
+ # beta=0.0: no reference model. With 24k completion + B=16 sims, the 8 GB
65
+ # ref model would push us past the H200's 144 GB. Format penalty in reward
66
+ # is the soft anchor instead.
67
+ beta: 0.0
68
+ temperature: 0.9
69
+ top_p: 0.95
70
+
71
+ # ---- Memory ----
72
+ gradient_checkpointing: true
73
+ gradient_checkpointing_use_reentrant: false
74
+ bf16_base: true
75
+
76
+ # ---- vLLM ----
77
+ use_vllm: true
78
+ vllm_gpu_memory: 0.55
79
+
80
+ # ---- Dataloader ----
81
+ dataloader_pin_memory: true
82
+ dataloader_num_workers: 4
83
+
84
+ # ---- Logging ----
85
+ report_to: wandb
86
+ run_name: cricket_qwen3_main
87
+
88
+ # =============================================================================
89
+ # Reward composition (defined in server/reward_calculator.py):
90
+ # composite = 0.35·r_result + 0.30·r_cricket + 0.25·r_behavior + 0.10·r_validity (weights set in configs/game_knowledge.yaml)
91
+ #
92
+ # WATCH during training:
93
+ # - rollout/match_completion_rate ≥ 0.7 within ~50 steps
94
+ # → if not, episodes are still hitting cap; tighten per-turn schema or drop max_overs
95
+ # - reward/r_result_mean separate from composite
96
+ # → if r_result stays at 0 while composite rises, you're optimizing format only
97
+ # - episode/tool_calls_mean
98
+ # → should be ~150-200 for 5-over; >220 means truncation events are common
99
+ # =============================================================================
configs/cricket_train_qwen3_smoke.yaml ADDED
@@ -0,0 +1,99 @@
 
 
 
 
1
+ # =============================================================================
2
+ # CricketCaptain-LLM — Qwen3 SMOKE TEST (validate vLLM colocate works)
3
+ #
4
+ # Purpose: shortest possible run to confirm:
5
+ # 1. Qwen3-4B-Instruct-2507 loads cleanly into vLLM colocate
6
+ # (no Qwen3.5 architecture-class registration error)
7
+ # 2. TRL multi-turn environment_factory steps execute without crashing
8
+ # 3. At least one episode reaches `done=True` so `r_result` fires
9
+ #
10
+ # Expected runtime: ~10-15 min on a single H200.
11
+ # Run before kicking off cricket_train_qwen3_warmup.yaml.
12
+ #
13
+ # Stack target: conda cloudspace (torch 2.10, transformers 5.6.2, trl 1.2.0)
14
+ # + vllm + flash-attn installed via .venv-qwen3
15
+ # =============================================================================
16
+
17
+ env:
18
+ eval_pack_id: adaptive_t20_v1
19
+ agent_team: india
20
+ max_overs: 2 # smallest format — match must complete in ~70 turns
21
+ env_url: ws://localhost:8000
22
+
23
+ opponent:
24
+ mode: heuristic # deterministic-ish, no API costs
25
+
26
+ train:
27
+ # ---- Model ----
28
+ model: Qwen/Qwen3-4B-Instruct-2507
29
+ # 256k native context, no <think> blocks, native Qwen3ForCausalLM in vLLM.
30
+ # Fresh adapter — do NOT load Qwen3.5-trained weights into Qwen3 base.
31
+ # resume_from intentionally omitted.
32
+
33
+ stage: 2
34
+
35
+ # ---- Dataset ----
36
+ prompts: 16 # tiny — smoke only
37
+
38
+ # ---- Schedule ----
39
+ steps: 2 # absolute minimum to test gradient + save
40
+ logging_steps: 1
41
+ save_steps: 2
42
+ save_total_limit: 1
43
+
44
+ # ---- Throughput knobs ----
45
+ # bs=1 + grad_accum=4 + G=4 → generation_batch_size=4 divides G cleanly
46
+ # (TRL 1.2 GRPOConfig requires bs*grad_accum divisible by num_generations).
47
+ # 4 sim episodes in flight. KV cache ≈ 4 × 16k × 144 KB ≈ 9.5 GB → tiny.
48
+ batch_size: 1
49
+ grad_accum: 4
50
+ num_generations: 4
51
+
52
+ # ---- Length budget ----
53
+ # 2-over needs ~70 tool calls. At <120 tok/turn this fits in ~8k.
54
+ # Generous 16k completion to catch any per-turn bloat.
55
+ max_completion_length: 16384
56
+ max_tool_calling_iterations: 120
57
+
58
+ # ---- Optimizer ----
59
+ learning_rate: 5.0e-6
60
+
61
+ # ---- GRPO ----
62
+ beta: 0.0 # no reference model (saves ~8 GB VRAM)
63
+ temperature: 0.9
64
+ top_p: 0.95
65
+
66
+ # ---- Memory ----
67
+ gradient_checkpointing: true
68
+ gradient_checkpointing_use_reentrant: false
69
+ bf16_base: true
70
+
71
+ # ---- vLLM colocate (THE thing being tested) ----
72
+ use_vllm: true
73
+ vllm_gpu_memory: 0.50
74
+ # vllm_model_impl omitted → vLLM picks Qwen3ForCausalLM natively.
75
+ # If you fall back to Qwen3.5-4B for some reason, set this to "transformers".
76
+
77
+ # ---- Dataloader ----
78
+ dataloader_pin_memory: true
79
+ dataloader_num_workers: 2
80
+
81
+ # ---- Logging ----
82
+ report_to: wandb
83
+ run_name: cricket_qwen3_smoke
84
+
85
+ # =============================================================================
86
+ # Run with:
87
+ # source .venv-qwen3/bin/activate
88
+ # python train.py train --config configs/cricket_train_qwen3_smoke.yaml
89
+ #
90
+ # Pass criteria:
91
+ # - 2 gradient steps complete without OOM
92
+ # - logs show `rollout/match_completion_rate > 0`
93
+ # - at least one episode in episode_stats.jsonl has `termination_reason: natural`
94
+ #
95
+ # If pass → run cricket_train_qwen3_warmup.yaml
96
+ # If fail with "Qwen3_5..." → wrong model name, check spelling
97
+ # If fail with vLLM class error → vLLM build doesn't include Qwen3 support, upgrade
98
+ # If fail with LoRA error → known TRL+vLLM+LoRA issue, set vllm_model_impl: transformers
99
+ # =============================================================================
configs/cricket_train_qwen3_warmup.yaml ADDED
@@ -0,0 +1,98 @@
 
 
 
 
1
+ # =============================================================================
2
+ # CricketCaptain-LLM — Qwen3 WARMUP (2-3 over curriculum, fast iterations)
3
+ #
4
+ # Differences from cricket_train_warmup.yaml:
5
+ # - Model: Qwen3-4B-Instruct-2507 (256k native, no <think>, native vLLM class)
6
+ # - vLLM colocate enabled (works because Qwen3ForCausalLM is a registered class)
7
+ # - Fresh adapter (do NOT resume Qwen3.5 LoRA on Qwen3 base — incompatible)
8
+ # - Slightly tighter schedule given vLLM throughput gain (~3-5x)
9
+ #
10
+ # Stack: TRL 1.2.0 + transformers 5.6.2 + vLLM + PEFT (LoRA) + bf16 base
11
+ # Target hardware: 1× H200 (144 GB)
12
+ # =============================================================================
13
+
14
+ env:
15
+ eval_pack_id: adaptive_t20_v1
16
+ agent_team: india
17
+ max_overs: 0 # 0 = use overs_distribution below
18
+ env_url: ws://localhost:8000
19
+
20
+ opponent:
21
+ mode: heuristic # fast iteration; switch to llm_live for final eval
22
+
23
+ train:
24
+ # ---- Model & adapter ----
25
+ model: Qwen/Qwen3-4B-Instruct-2507
26
+ stage: 2 # full composite reward
27
+ # No resume_from — start fresh on Qwen3 base.
28
+
29
+ # ---- Dataset ----
30
+ prompts: 64
31
+
32
+ # ---- Schedule ----
33
+ # 2-3 over rollouts ≈ ~70-110 tool calls each. With vLLM colocate, ~2-3 min/step
34
+ # at 16 sim episodes. 30 steps ≈ ~1.5 hrs total.
35
+ steps: 30
36
+ logging_steps: 1
37
+ save_steps: 5
38
+ save_total_limit: 5
39
+
40
+ # ---- Curriculum (per-scenario max_overs) ----
41
+ # Heavier on T2 (cleanly completes in token budget), tail to T3.
42
+ # Skip T4/T5 in warmup — those go in the main run.
43
+ overs_distribution: [2, 2, 2, 2, 2, 2, 3, 3, 3]
44
+
45
+ # ---- Throughput knobs ----
46
+ # B=1 × grad_accum=4 × G=4 → gen_batch=4, 4 sims in flight per step.
47
+ # Matches the smoke config that ran cleanly. Per-episode throughput is lower than
+ # B=2 (~30-40s/step here vs ~55s/step for twice the episodes), but B=2 OOM'd on backward of step 2 due to gradient-accumulation
49
+ # micro-batch buffer churn even with expandable_segments. 30 steps ≈ ~20 min.
50
+ batch_size: 1
51
+ grad_accum: 4
52
+ num_generations: 4
53
+
54
+ # ---- Length budget ----
55
+ # 24k completion = ~130 tok/turn × 180 turns. Generous for 2-3 over format,
56
+ # matches Qwen3-4B-Instruct-2507 recommendation of ≥32k output for most queries
57
+ # (per-rollout cumulative across multi-turn).
58
+ max_completion_length: 24576
59
+ max_tool_calling_iterations: 240
60
+
61
+ # ---- Optimizer ----
62
+ learning_rate: 5.0e-6
63
+
64
+ # ---- GRPO ----
65
+ # beta=0.0: no reference model (saves ~8 GB VRAM, lets G=4 fit at 16k completion).
66
+ # Reward shaping has format penalty as soft anchor.
67
+ beta: 0.0
68
+ temperature: 0.9
69
+ top_p: 0.95
70
+
71
+ # ---- Memory ----
72
+ gradient_checkpointing: true
73
+ gradient_checkpointing_use_reentrant: false
74
+ bf16_base: true
75
+
76
+ # ---- vLLM (THE big change vs Qwen3.5 config) ----
77
+ use_vllm: true
78
+ vllm_gpu_memory: 0.55
79
+ # vllm_model_impl omitted → vLLM picks Qwen3ForCausalLM natively.
80
+
81
+ # ---- Dataloader ----
82
+ dataloader_pin_memory: true
83
+ dataloader_num_workers: 4
84
+
85
+ # ---- Logging ----
86
+ report_to: wandb
87
+ run_name: cricket_qwen3_warmup
88
+
89
+ # =============================================================================
90
+ # Workflow:
91
+ # 1. Smoke test (10-15 min):
92
+ # python train.py train --config configs/cricket_train_qwen3_smoke.yaml
93
+ # 2. Warmup (1-2 hrs):
94
+ # python train.py train --config configs/cricket_train_qwen3_warmup.yaml
95
+ # 3. Main run (5-7 hrs):
96
+ # python train.py train --config configs/cricket_train_qwen3.yaml
97
+ # (resumes from ./checkpoints/stage2_final saved by step 2)
98
+ # =============================================================================
configs/extras/cached_eval.yaml ADDED
@@ -0,0 +1,18 @@
 
 
 
 
1
+ env:
2
+ # Used by server + runners for reproducible comparison runs.
3
+ eval_pack_id: adaptive_t20_v1
4
+ max_overs: 5
5
+ env_url: ws://localhost:8000
6
+
7
+ opponent:
8
+ # llm_cached does not call `model` live. It replays pre-generated decisions
9
+ # from cache_path so every captain model faces the same opponent behavior.
10
+ mode: llm_cached
11
+ cache_path: data/opponent_cache/adaptive_t20_v1_official_gemma2b.jsonl
12
+
13
+ captain:
14
+ # Captain still calls HF router live in this config.
15
+ model: google/gemma-4-26B-A4B-it
16
+ api_base: https://router.huggingface.co/v1
17
+ api_key_env: HF_TOKEN
18
+
configs/extras/cricket_train.yaml ADDED
@@ -0,0 +1,125 @@
 
 
 
 
1
+ # =============================================================================
2
+ # CricketCaptain-LLM — dedicated training config
3
+ # Stack: TRL 1.2.0 GRPO + Transformers 5.6.2 + PEFT (LoRA) + bf16 base
4
+ # NOT using vLLM (incompatible with our env), NOT using Unsloth (incompatible
5
+ # with TRL's multi-turn environment_factory).
6
+ # Target run: 10-hour budget on a single H200 (144 GB VRAM), Qwen3.5-4B base.
7
+ # =============================================================================
8
+
9
+ env:
10
+ # eval_pack_id selects scenario distribution + opponent rosters
11
+ eval_pack_id: adaptive_t20_v1
12
+ agent_team: india
13
+ # Match length per episode. 20 = full T20.
14
+ # 5-over (~180 tool calls) → ~5 min/step
15
+ # 20-over (~720 tool calls) → ~15-18 min/step ← current choice
16
+ max_overs: 20
17
+ env_url: ws://localhost:8000
18
+
19
+ opponent:
20
+ # heuristic | llm_live | llm_cached | cricsheet
21
+ # heuristic: rule-based, fast, deterministic-ish — best for fast iteration.
22
+ # llm_live: adversarial Gemma via HF router — realistic but slow + costs API.
23
+ mode: heuristic
24
+
25
+
26
+ train:
27
+ # ---- Model & adapter ----
28
+ model: Qwen/Qwen3.5-4B
29
+ # Resume LoRA from the warmup checkpoint. When this run starts, base model
30
+ # loads from Qwen/Qwen3.5-4B above + LoRA adapter loads from this path.
31
+ # Comment out (with #) to start with a fresh adapter instead.
32
+ resume_from: ./checkpoints/stage2_final
33
+ # Single-stage training. Code uses curriculum_stage=2 internally to mean
34
+ # "full composite reward". Stage 1 (validity-only warm-up) was dropped because
35
+ # Qwen3.5-4B already knows tool calling natively (XML+JSON both accepted).
36
+ stage: 2
37
+
38
+ # ---- Dataset ----
39
+ prompts: 256 # number of unique scenarios; trainer cycles through them
40
+
41
+ # ---- Schedule ----
42
+ # 30 steps × ~15 min/step = ~7-8 hrs for the main run. Total chain (warmup +
43
+ # main) fits in ~10-hr budget. Bump to 100+ for a longer training run if
44
+ # compute is unconstrained — the resume_from path lets you continue cleanly.
45
+ steps: 30
46
+ logging_steps: 1 # log every step (set 10 for less wandb noise)
47
+
48
+ # ---- Throughput knobs (H200, 144 GB VRAM) ----
49
+ # batch_size × grad_accum = effective rollout per gradient update.
50
+ # batch_size × num_generations = simultaneous in-flight episodes (memory driver).
51
+ # 4 × 4 = 16 sim ≈ ~80 GB peak at 4096 max_completion — needed since 32 sim @ 4096
52
+ # OOMs at ~110 GB. grad_accum=4 keeps effective batch healthy.
53
+ batch_size: 4
54
+ grad_accum: 4
55
+ num_generations: 4 # GRPO requires ≥2 for group advantage; 4 is the sweet spot
56
+
57
+ # ---- Length budget ----
58
+ # 4096 = ~16 turns per rollout at ~250 tok/turn. Larger budget than warmup
59
+ # because main run uses smaller batch (32 sim @ batch=8 × num_gen=4) so
60
+ # KV cache fits comfortably.
61
+ max_completion_length: 4096
62
+ # Hard cap on tool calls per episode. 5-over needs ~180; 20-over needs ~720.
63
+ # 800 leaves slack for extra-balls (no-balls/wides) without truncating matches.
64
+ max_tool_calling_iterations: 800
65
+
66
+ # ---- Optimizer ----
67
+ # Bumped 5e-6 → 1e-5 for the main run. Warmup at 5e-6 showed flat reward and
68
+ # tiny loss magnitudes (~0.02) with grad_norm 0.35 — gradients flowing but
69
+ # weights barely moving. 2× LR doubles step size without re-entering instability
70
+ # territory (still well below the 5e-5 commonly used for r=64 LoRA SFT).
71
+ learning_rate: 1.0e-5
72
+
73
+ # ---- GRPO knobs ----
74
+ # KL coefficient. Default in TRL 1.2 is 0.0 — and 0.0 specifically means the
75
+ # reference model is NOT loaded (saves ~8 GB weights + ref-forward-pass mem).
76
+ # Any beta > 0 loads a frozen copy of the base model for KL anchoring.
77
+ # We use 0.0 because: (a) memory budget is tight at batch=16 + 20-over,
78
+ # (b) reward is well-shaped (composite with format penalty), so format collapse
79
+ # is unlikely, (c) cricket strategy isn't in the base distribution — we want drift.
80
+ beta: 0.0
81
+ # Sampling — slight bump from 0.8 for GRPO group diversity. top_p=0.95 trims
82
+ # the long tail (rare bad tokens during 720-turn rollouts).
83
+ temperature: 0.9
84
+ top_p: 0.95
85
+
86
+ # ---- Memory ----
87
+ # Trade ~30% extra backward-pass compute for big activation memory savings
88
+ # (>20 GB). Required to fit batch_size=16 + num_gen=4 + 3072 completion in 144 GB.
89
+ # use_reentrant=False is the modern path, more stable with LoRA than the legacy True.
90
+ gradient_checkpointing: true
91
+ gradient_checkpointing_use_reentrant: false
92
+
93
+ # ---- Dataloader ----
94
+ # Cheap micro-opts — pin host memory + a few worker threads so CPU isn't a
95
+ # bottleneck feeding prompts. Tiny win since our dataset is only 256 prompts.
96
+ dataloader_pin_memory: true
97
+ dataloader_num_workers: 4
98
+
99
+ # ---- Checkpointing ----
100
+ # Save every 10 steps; keep only the 5 most recent on disk to cap usage.
101
+ save_steps: 10
102
+ save_total_limit: 5
103
+
104
+ # ---- LoRA (currently hardcoded in train.py load_model()) ----
105
+ # r=64, alpha=128, dropout=0.05, targets q/k/v/o/gate/up/down
106
+ # → 85M trainable params (1.98% of 4.2B base)
107
+
108
+ # ---- Precision ----
109
+ # bf16 base + bf16 LoRA adapter. NO 4-bit quant — H200 has the VRAM and bf16
110
+ # is 15-20% faster than 4-bit dequant on every forward pass.
111
+ bf16_base: true
112
+
113
+ # ---- Logging ----
114
+ report_to: wandb # or "tensorboard" for local-only
115
+ run_name: cricket_captain_v10
116
+
117
+ # =============================================================================
118
+ # Reward composition (weights live in configs/game_knowledge.yaml `reward:` block)
119
+ # composite = 0.35·r_result + 0.30·r_cricket + 0.25·r_behavior + 0.10·r_validity
120
+ #
121
+ # r_result — match outcome (chase margin, defense margin, win/tie bonus)
122
+ # r_cricket — Dream11 fantasy proxy as dense per-ball signal
123
+ # r_behavior — coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%)
124
+ # r_validity — fraction of legal tool calls
125
+ # =============================================================================
configs/extras/cricket_train_warmup.yaml ADDED
@@ -0,0 +1,95 @@
 
 
 
 
1
+ # =============================================================================
2
+ # CricketCaptain-LLM — WARMUP training config (5-over format, fast iterations)
3
+ # Stack: TRL 1.2.0 GRPO + Transformers 5.6.2 + PEFT (LoRA) + bf16 base
4
+ # Purpose: short matches + bigger GRPO group = fast format mastery before
5
+ # moving to the full 20-over training run (configs/cricket_train.yaml).
6
+ #
7
+ # Why bigger num_generations here?
8
+ # 5-over rollouts are ~4× faster per step than 20-over, so we have memory +
9
+ # wall-clock budget for G=8. Bigger group → more stable GRPO advantage signal,
10
+ # especially useful when the model is just learning format/tactics.
11
+ # =============================================================================
12
+
13
+ env:
14
+ eval_pack_id: adaptive_t20_v1
15
+ agent_team: india
16
+ # max_overs: 0 unlocks the per-scenario curriculum distribution (see train.overs_distribution).
17
+ # Set to a positive integer to lock all scenarios to that single format.
18
+ max_overs: 0
19
+ env_url: ws://localhost:8000
20
+
21
+ opponent:
22
+ # Heuristic for fast iteration. Switch to llm_live for final eval, not warmup.
23
+ mode: heuristic
24
+
25
+
26
+ train:
27
+ # ---- Model & adapter ----
28
+ model: Qwen/Qwen3.5-4B
29
+ # Single-stage. curriculum_stage=2 internally = full composite reward.
30
+ stage: 2
31
+
32
+ # ---- Dataset ----
33
+ prompts: 64 # smaller dataset since we're only doing 25 steps
34
+
35
+ # ---- Schedule ----
36
+ # 5-over ≈ ~180 tool calls / episode → ~4-5 min / step at 32 sim episodes.
37
+ # 25 steps ≈ ~2 hrs total. Quick warmup before kicking off the full
38
+ # 20-over run with cricket_train.yaml.
39
+ steps: 25
40
+ logging_steps: 1
41
+ # Save every 5 steps → 5 checkpoints across 25 steps. Keep only 5 on disk.
42
+ save_steps: 5
43
+ save_total_limit: 5
44
+
45
+ # ---- Curriculum (per-scenario max_overs) ----
46
+ # Heavy on T2 (the only format that actually completes inside our token budget),
47
+ # tail to T5 (target eval distribution). Activated when env.max_overs is 0 or unset.
48
+ # Frequencies: ~45% T2, ~27% T3, ~18% T4, ~9% T5.
49
+ overs_distribution: [2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]
50
+
51
+ # ---- Throughput knobs ----
52
+ # 4 × 4 = 16 sim episodes — needed because 4096 max_completion at 32 sim OOMs
53
+ # (~110 GB observed). At 16 sim: ~80 GB, fits with margin. grad_accum=4 keeps
54
+ # the effective gradient batch size where G=4 still gives stable advantages.
55
+ batch_size: 4
56
+ grad_accum: 4
57
+ num_generations: 4
58
+
59
+ # ---- Length budget ----
60
+ # 4096 = ~16 turns per rollout at ~250 tok/turn. Lets T2 episodes get further
61
+ # toward completion (~5-8 balls per innings vs ~3-5 at 3072).
62
+ max_completion_length: 4096
63
+ # Cap above what 5-over needs (~180) so no truncation from extras.
64
+ max_tool_calling_iterations: 240
65
+
66
+ # ---- Optimizer ----
67
+ learning_rate: 5.0e-6
68
+
69
+ # ---- GRPO knobs ----
70
+ # beta=0.0 → no reference model loaded → saves ~8 GB VRAM and lets G=8 fit.
71
+ beta: 0.0
72
+ temperature: 0.9
73
+ top_p: 0.95
74
+
75
+ # ---- Memory ----
76
+ gradient_checkpointing: true
77
+ gradient_checkpointing_use_reentrant: false
78
+
79
+ # ---- Dataloader ----
80
+ dataloader_pin_memory: true
81
+ dataloader_num_workers: 4
82
+
83
+ # ---- Precision ----
84
+ bf16_base: true
85
+
86
+ # ---- Logging ----
87
+ report_to: wandb
88
+ run_name: cricket_captain_warmup_5over
89
+
90
+ # =============================================================================
91
+ # Workflow:
92
+ # 1. Train warmup: python train.py train --config configs/cricket_train_warmup.yaml
93
+ # 2. Train main run: python train.py train --config configs/cricket_train.yaml \
94
+ # --model ./checkpoints/stage2_final (resume from warmup)
95
+ # =============================================================================
configs/extras/default.yaml ADDED
@@ -0,0 +1,59 @@
 
 
 
 
1
+ env:
2
+ # Used by server + runners
3
+ eval_pack_id: adaptive_t20_v1
4
+ max_overs: 20
5
+ agent_team: india
6
+ env_url: ws://localhost:8000
7
+
8
+ opponent:
9
+ # heuristic | llm_live | llm_cached
10
+ # llm_live is the adversarial opponent. The captain being trained still runs
11
+ # locally through TRL/Transformers, not through the HF inference endpoint.
12
+ # For reproducible cached evaluation, use configs/cached_eval.yaml instead.
13
+ # H200 default: use the live LLM opponent for realism.
14
+ mode: llm_live
15
+ model: google/gemma-4-26B-A4B-it
16
+ api_base: https://router.huggingface.co/v1
17
+ api_key_env: HF_TOKEN
18
+
19
+ captain:
20
+ # For inference/eval runner when using an API model (OpenAI-compatible).
21
+ # You can still pass --model random for baseline runs.
22
+ model: google/gemma-4-26B-A4B-it
23
+ api_base: https://router.huggingface.co/v1
24
+ api_key_env: HF_TOKEN
25
+
26
+ train:
27
+ model: Qwen/Qwen3.5-4B
28
+ # SINGLE-STAGE training. The two-stage curriculum (Stage 1 = format-only)
29
+ # was dropped — Qwen3.5-4B already knows tool calling natively (XML+JSON),
30
+ # so we run the full composite reward (r_result + r_cricket + r_behavior + r_validity)
31
+ # from step 0. Code still uses curriculum_stage=2 internally to mean "full reward".
32
+ stage: 2
33
+ prompts: 256
34
+ # 10-hr H200 budget. ~150 steps × ~4-5 min each = ~10 hrs.
35
+ steps: 150
36
+ # H200 (144GB) v8 config: bigger micro-batch, fewer grad accum steps, more updates.
37
+ # Effective rollout per gradient update = batch_size * grad_accum = 32 prompts
38
+ # → with num_generations=4 that's 128 episodes per gradient update.
39
+ batch_size: 16
40
+ grad_accum: 2
41
+ num_generations: 4 # GRPO requires >= 2 for group advantage; 4 is the standard sweet spot
42
+ # 3072 = enough headroom for thinking (v5 mean was 1115 tokens) without the
43
+ # KV-cache bloat of 4096. Cuts memory for the rollout buffer ~25%.
44
+ max_completion_length: 3072
45
+ # Episodes terminate either when match finishes or this cap hits.
46
+ # 5-over match needs ~180 tool calls, 20-over T20 needs ~720. Don't truncate.
47
+ max_tool_calling_iterations: 800
48
+ # Learning rate — was hardcoded to 1e-5; lowered to 5e-6 because of LoRA r=64
49
+ # (4× more trainable params than r=16) and partially-sparse outcome reward.
50
+ # Smaller LR = more stable updates when r_result is binary at innings end.
51
+ learning_rate: 5e-6
52
+ # Trade ~30% extra compute per backward pass for big activation memory savings
53
+ # (>20 GB freed). Lets us push micro-batch / num_generations with safety margin.
54
+ gradient_checkpointing: true
55
+ logging_steps: 1 # log loss/reward every step (set 10 for less noise)
56
+ # Switch to `tensorboard` if you prefer local-only logging.
57
+ report_to: wandb
58
+ run_name: cricket_captain_v10
59
+
configs/game_knowledge.yaml CHANGED
@@ -29,11 +29,22 @@ transition_overs: [6, 16]
29
  # Reward weights (must sum to 1.0)
30
  # ---------------------------------------------------------------------------
31
  reward:
32
- # Episode-level composite
33
- r_result: 0.55 # match outcome: win/loss, target margin, DLS/par
34
- r_cricket: 0.25 # dense cricket position signal (Dream11 proxy)
35
- r_behavior: 0.15 # plan-action coherence, adaptation, opponent awareness
36
- r_validity: 0.05 # legal JSON tool use gate
 
 
 
 
 
 
 
 
 
 
 
37
 
38
  # Within r_behavior
39
  behavior:
 
29
  # Reward weights (must sum to 1.0)
30
  # ---------------------------------------------------------------------------
31
  reward:
32
+ # Episode-level composite — REBALANCED for partial-trajectory training.
33
+ # Original 55/25/15/5 was textbook "outcome-dominated" but assumed matches
34
+ # would actually complete in the token budget. Reality: episodes truncate at
35
+ # ~25% of T2 and r_result almost never fires. Putting 55% weight on a signal
36
+ # that fires <5% of the time washes out gradient.
37
+ #
38
+ # New weights match the SWE-RL / coding-agent-RL recipe. Re-rebalanced for
39
+ # the main 20-over run: at full T20 length r_result actually fires (matches
40
+ # complete), so we shift 0.15 weight back from r_cricket → r_result to give
41
+ # the outcome signal real gradient pull. r_cricket alone was producing flat,
42
+ # bouncy reward curves in warmup — per-ball Dream11 is too noisy as 45% of
43
+ # the gradient when group size is only 4.
44
+ r_result: 0.35 # match outcome — fires reliably at 20 overs, biggest signal
45
+ r_cricket: 0.30 # dense per-ball Dream11 (analog: partial test-pass)
46
+ r_behavior: 0.25 # per-turn coherence/adaptation (analog: lint/quality)
47
+ r_validity: 0.10 # tool format (analog: compile success)
48
 
49
  # Within r_behavior
50
  behavior:
docs/benchmark_explainer.md CHANGED
@@ -30,14 +30,13 @@ The original motivation came from strategic coherence: LLMs often say one thing
30
 
31
  ## 2. Fit With OpenEnv Competition Themes
32
 
33
- The environment aligns with multiple OpenEnv hackathon themes.
34
-
35
  ### Multi-Agent Interactions
36
 
37
  The submitted captain agent plays against an opponent policy. The opponent can be:
38
 
39
- - `heuristic`: fast deterministic/stochastic cricket logic.
40
- - `llm_live`: live OpenAI-compatible LLM opponent.
 
41
  - `llm_cached`: replayed opponent decisions for reproducible evaluation.
42
 
43
  This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.
@@ -48,29 +47,11 @@ A full match has many decisions across innings, phases, wickets, and pressure st
48
 
49
  ### World Modeling
50
 
51
- The agent observes a partially summarized cricket world:
52
-
53
- - score,
54
- - over/ball,
55
- - wickets,
56
- - target,
57
- - phase,
58
- - field,
59
- - batter profile,
60
- - bowler profile,
61
- - opponent plan,
62
- - previous outcome.
63
-
64
- It must maintain an internal model of what is happening and update that model after every ball.
65
 
66
  ### Self-Improvement
67
 
68
- The same environment can support:
69
-
70
- - heuristic curriculum training,
71
- - cached-opponent official evaluation,
72
- - live LLM opponent self-play,
73
- - future agent-vs-agent training.
74
 
75
  ## 3. Environment Flow
76
 
@@ -86,8 +67,6 @@ Within each batting or bowling phase, the tactical loop is:
86
  PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
87
  ```
88
 
89
- The captain can use tools at different points in that loop.
90
-
91
  ### Toss
92
 
93
  ```json
@@ -97,40 +76,21 @@ The captain can use tools at different points in that loop.
97
  ### Batting Tools
98
 
99
  ```json
100
- {"tool": "select_batter", "arguments": {"name": "Anchor", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
101
- ```
102
-
103
- ```json
104
  {"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
105
- ```
106
-
107
- ```json
108
  {"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
109
- ```
110
-
111
- ```json
112
  {"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
113
  ```
114
 
115
- ### Bowling Tools
116
 
117
- ```json
118
- {"tool": "choose_bowler", "arguments": {"name": "Death Specialist", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
119
- ```
120
 
121
  ```json
 
122
  {"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
123
- ```
124
-
125
- ```json
126
  {"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
127
- ```
128
-
129
- ```json
130
  {"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
131
- ```
132
-
133
- ```json
134
  {"tool": "bowl_delivery", "arguments": {}}
135
  ```
136
 
@@ -146,309 +106,197 @@ The captain can use tools at different points in that loop.
146
  {"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
147
  ```
148
 
149
- ## 4. OpenEnv Architecture
 
 
 
 
 
 
 
 
150
 
151
- The environment follows OpenEnv's standard client/server split.
 
 
 
 
152
 
153
  ```text
154
  LLM Agent / Evaluator
155
  |
156
- | WebSocket
157
  v
158
- OpenEnv FastAPI server
159
  |
160
  v
161
- CricketEnvironment
162
  |
163
- +--> MarkovCricketEngine
164
- +--> OpponentPolicy
165
- +--> CoherenceGrader
166
- +--> RewardCalculator
167
- ```
168
-
169
- Important files:
170
-
171
- - `server/app.py`: creates the OpenEnv server using `create_app(...)`.
172
- - `server/cricket_environment.py`: implements `reset`, `step`, and `state`.
173
- - `models.py`: defines `CricketAction`, `CricketObservation`, and `CricketState`.
174
- - `client.py`: defines the WebSocket client `CricketCaptainEnv`.
175
- - `inference.py`: runs random or OpenAI-compatible agents against the server.
176
- - `eval.py`: runs evaluation episodes and saves plots/raw logs.
177
-
178
- The standard OpenEnv lifecycle is:
179
-
180
- ```python
181
- obs = env.reset(...)
182
- obs = env.step(CricketAction(tool="...", arguments={...}))
183
- state = env.state
184
- ```
185
-
186
- This matters for competition compliance because clients do not need to import server internals. They interact through the OpenEnv API.
187
-
188
- ## 5. What The Observation Contains
189
-
190
- Each step returns a `CricketObservation` with fields like:
191
-
192
- - `game_state`: toss / batting / bowling / finished.
193
- - `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball.
194
- - `game_context`: score, wickets, over, ball, target, phase.
195
- - `declared_strategy`: current batting strategy.
196
- - `bowling_strategy`: current bowling plan.
197
- - `field_setting`: Aggressive / Balanced / Defensive.
198
- - `current_batter`: batter profile.
199
- - `current_bowler`: bowler profile.
200
- - `opponent_plan`: last visible opponent policy decision.
201
- - `last_outcome`: previous ball outcome plus tactical metadata such as event type, shot zone, delivery features, field pressure, and fielder effect.
202
- - `available_tools`: legal tools for current state.
203
- - `prompt_text`: rendered prompt for the LLM.
204
-
205
- The LLM sees enough information to reason tactically, but not the entire simulator internals.
206
-
207
- ## 6. Opponent Policies
208
-
209
- Opponent behavior lives in `server/opponent_policy.py`.
210
-
211
- There are three modes:
212
-
213
- ### `heuristic`
214
-
215
- Fast local policy. Useful for:
216
-
217
- - tests,
218
- - development,
219
- - cheap training rollouts,
220
- - baseline comparison.
221
-
222
- ### `llm_live`
223
-
224
- Calls an OpenAI-compatible LLM with a fixed prompt. Useful for:
225
-
226
- - demos,
227
- - realistic opponent behavior,
228
- - self-play-style experiments.
229
-
230
- The current default live opponent/captain model is:
231
-
232
- ```text
233
- google/gemma-4-26B-A4B-it via https://router.huggingface.co/v1
234
  ```
235
 
236
- In this mode, the opponent actually calls the configured model during the run.
237
-
238
- ### `llm_cached`
239
-
240
- Reads pre-recorded opponent decisions from JSONL. Useful for:
241
 
242
- - official leaderboard-style evaluation,
243
- - reproducibility,
244
- - preventing eval randomness.
 
 
 
 
 
 
 
 
 
245
 
246
- In this mode, the opponent does **not** call the configured model live. It replays the JSONL cache so every compared captain faces the same opponent decisions.
247
 
248
- The key idea:
249
 
250
- > Teams can change their agent however they want, but the evaluation opponent should be frozen.
251
-
252
- ## 7. Ball Physics And Markov Engine
253
-
254
- The simulation uses `server/markov_engine.py` plus the field/zone definitions in `server/field_model.py`.
255
-
256
- It supports:
257
-
258
- 1. Synthetic transition probabilities from `data/transition_probs.json`.
259
- 2. Cricsheet-derived transition tables from `data/processed/cricket_transitions_v1.pkl`.
260
-
261
- The upgraded ball resolver uses both sides' plans:
262
-
263
- ```text
264
- outcome ~ P(
265
- shot_plan,
266
- delivery_plan,
267
- batter_profile,
268
- bowler_profile,
269
- field_setting,
270
- phase,
271
- score,
272
- wickets,
273
- target_pressure
274
- )
275
- ```
276
 
277
- The engine first samples from the transition table, then applies a hybrid tactical layer:
 
 
 
 
278
 
279
- - hitter/finisher benefits aggressive shots,
280
- - anchor benefits low-risk rotation,
281
- - yorker/death specialist suppresses boundaries,
282
- - shot target zones (`cover`, `point`, `midwicket`, `long_on`, etc.) are matched against delivery line/length/variation,
283
- - field presets expand `Aggressive`, `Balanced`, and `Defensive` into named fielder zones,
284
- - boundary riders can cut off fours/sixes and inner ring fielders can save singles,
285
- - close catchers/slips/gully can convert edges into wickets,
286
- - wides/no-balls, drops, misfields, overthrows, run-outs, bowled/LBW routes, and caught-in-zone events add bounded stochastic noise,
287
- - high chase pressure makes defensive batting less useful.
288
 
289
- This keeps the simulator simple enough to train on while making actions meaningfully interact: a six toward an unprotected leg-side boundary is not the same as a lofted hit toward deep midwicket with a rider waiting.
290
 
291
- ## 8. Reward Design
292
 
293
- Rewards are intentionally multi-component. We do not want an agent that wins by gaming one metric.
294
 
295
- Main components:
296
 
297
- ### Result Quality
298
 
299
- Measures the long-horizon cricket objective:
 
 
 
 
 
 
 
 
 
 
 
300
 
301
- - win/loss,
302
- - score vs DLS/par,
303
- - chase success,
304
- - defense success,
305
- - wickets preserved or taken.
306
 
307
- This is the benchmark's primary outcome. The agent is trained over many simulated matches so it learns policies that improve match result, not just isolated ball-level actions.
308
 
309
- ### Dream11 Auxiliary Signal
310
 
311
- Provides a dense cricket contribution proxy:
312
 
313
- - batting runs, strike rate, boundaries,
314
- - bowling wickets, dots, economy,
315
- - milestone and dismissal bonuses.
316
 
317
- This helps training get intermediate signal, but it does **not** replace the win/loss objective.
318
 
319
- ### Plan-Action Coherence
 
 
320
 
321
- Checks whether the action matches the declared plan.
322
 
323
- Example:
324
 
325
- - Declared aggression `0.30` plus `single` is coherent.
326
- - Declared aggression `0.30` plus `six` is less coherent.
327
 
328
- ### Strategic Adaptation
329
 
330
- Rewards plans that change with context:
331
 
332
- - new phase,
333
- - target pressure,
334
- - wickets down,
335
- - previous reflection,
336
- - opponent behavior.
337
 
338
- ### Opponent Awareness
339
 
340
- Rewards plans that mention or respond to:
 
 
341
 
342
- - field setting,
343
- - bowler type,
344
- - batter style,
345
- - opponent plan,
346
- - phase/matchup.
 
347
 
348
- ### Regret-Style Score
349
 
350
- Compares chosen action quality against simple heuristic alternatives.
351
 
352
- ### Tool Efficiency
 
 
 
 
 
353
 
354
- Rewards useful `analyze_situation` calls and penalizes spam indirectly.
355
 
356
- ### Format Validity
357
 
358
- Rewards valid JSON and valid tools.
359
 
360
- The episode reward combines:
361
 
362
- ```text
363
- 25% result quality / match outcome
364
- 10% Dream11 dense cricket proxy
365
- 30% strategy bundle
366
- 20% tool efficiency
367
- 15% format validity
368
  ```
369
-
370
- The strategy bundle includes coherence, adaptation, opponent awareness, and regret-style scoring.
371
-
372
- ## 9. Data Curation Pipeline
373
-
374
- The data pipeline is designed to keep training and environment behavior aligned.
375
-
376
- ### Step 1: Curate Ball Outcomes
377
-
378
- Script:
379
-
380
- ```bash
381
- python scripts/curate_transitions.py --format t20
382
  ```
383
 
384
- Outputs:
385
-
386
- - `data/processed/ball_outcomes_t20_v1.pkl`
387
- - `data/processed/cricket_transitions_v1.pkl`
388
 
389
- The rich ball outcome records include:
 
 
390
 
391
- - both innings,
392
- - target,
393
- - required rate,
394
- - legal vs extra delivery,
395
- - batter name,
396
- - bowler name,
397
- - bowler type,
398
- - dismissal type,
399
- - phase,
400
- - score before ball,
401
- - wickets before ball,
402
- - runs and wicket outcome.
403
 
404
- ### Step 2: Build Player Profiles
405
 
406
- Script:
407
 
408
  ```bash
409
- python scripts/build_player_profiles.py
 
410
  ```
411
 
412
- Output:
413
-
414
- - `data/processed/player_profiles_t20_v1.json`
415
-
416
- Profiles include:
417
-
418
- - batter style: anchor, balanced, hitter, finisher,
419
- - bowler style: pace, spin, death specialist, economy, wicket-taker,
420
- - phase strengths,
421
- - economy,
422
- - strike rate,
423
- - dot rate,
424
- - Dream11-style pressure proxy.
425
 
426
- ### Step 3: Build Evaluation Pack
427
-
428
- Script:
429
 
430
  ```bash
431
  python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
432
  ```
433
 
434
- Output:
435
-
436
- - `data/eval_packs/adaptive_t20_v1.json`
437
-
438
- The pack has:
439
-
440
- - dev scenarios,
441
- - official scenarios,
442
- - chase states,
443
- - defense states,
444
- - collapse states,
445
- - death-over states,
446
- - matchup states,
447
- - opponent config.
448
-
449
- ### Step 4: Generate Opponent Cache
450
-
451
- Script:
452
 
453
  ```bash
454
  python scripts/generate_opponent_cache.py \
@@ -458,199 +306,90 @@ python scripts/generate_opponent_cache.py \
458
  --output data/opponent_cache/adaptive_t20_v1.jsonl
459
  ```
460
 
461
- For official evaluation, use `llm_cached` mode with a fixed cache.
462
-
463
- ## 10. Training Pipeline
464
-
465
- The training plan uses SFT only as a bootstrapping stage. The main optimization path remains GRPO.
466
-
467
- ### Stage 0: SFT Tool Warmup
468
 
469
- Command:
470
 
471
  ```bash
472
- python train.py sft-data --output data/training/tool_sft_examples.jsonl
 
 
473
  ```
474
 
475
- Purpose:
476
-
477
- - teach valid JSON,
478
- - teach tool names,
479
- - teach argument schemas,
480
- - reduce parse errors before RL.
481
-
482
- This does not replace RL. It makes RL less wasteful.
483
-
484
- ### Stage 1: GRPO Format / Tool Correctness
485
 
486
- Command:
487
 
488
  ```bash
489
- python train.py train --stage 1 --steps 200 --model Qwen/Qwen2.5-7B-Instruct
 
490
  ```
491
 
492
- Purpose:
493
-
494
- - train valid tool calls,
495
- - reduce invalid JSON,
496
- - stabilize action format.
497
-
498
- ### Stage 2: GRPO Strategic Behavior
499
-
500
- Command:
501
 
502
  ```bash
503
- python train.py train --stage 2 --steps 600 --model ./checkpoints/stage1_final
 
504
  ```
505
 
506
- Purpose:
507
-
508
- - improve coherence,
509
- - improve adaptation,
510
- - improve opponent awareness,
511
- - improve tool efficiency,
512
- - improve match result quality.
513
-
514
- ### Evaluation
515
-
516
- Command:
517
 
518
  ```bash
519
- python eval.py --episodes 20 --env-url "$CRICKET_CAPTAIN_ENV_URL" --eval-pack-id adaptive_t20_v1
520
  ```
521
 
522
- Expected comparison:
523
-
524
- - random/untrained baseline,
525
- - SFT-warmed model,
526
- - GRPO-trained model.
527
-
528
- For the competition, we should produce plots showing:
529
-
530
- - reward over training,
531
- - parse error rate,
532
- - coherence,
533
- - adaptation,
534
- - opponent awareness,
535
- - score/chase/defense metrics.
536
-
537
- ## 11. How This Complies With The Competition Instructions
538
-
539
- The competition requires:
540
-
541
- ### Use OpenEnv
542
-
543
- Implemented through:
544
-
545
- - `server/app.py`
546
- - `server/cricket_environment.py`
547
- - `models.py`
548
- - `client.py`
549
-
550
- The environment follows `reset`, `step`, and `state`.
551
-
552
- ### Training Script With HF TRL / Unsloth
553
-
554
- Implemented through:
555
-
556
- - `train.py`
557
-
558
- It uses Hugging Face TRL GRPO paths when training dependencies are installed.
559
-
560
- ### Hosted Environment
561
-
562
- The repo has Hugging Face Spaces metadata in `README.md` and a Docker-based app path. The server binds to `0.0.0.0`, and remote clients should use `CRICKET_CAPTAIN_ENV_URL`.
563
-
564
- ### README With Problem / Environment / Results
565
-
566
- The README now explains:
567
-
568
- - problem statement,
569
- - tools,
570
- - reward architecture,
571
- - environment design,
572
- - data pipeline,
573
- - Lightning/HF runtime notes.
574
-
575
- Still needed for final submission:
576
-
577
- - actual HF Space URL,
578
- - training result plots,
579
- - mini-blog/video link.
580
-
581
- ### Show Improvement
582
-
583
- The environment and scripts support this, but the final artifact still needs a real training run with plots.
584
-
585
- Minimum evidence to add:
586
-
587
- - random baseline metrics,
588
- - trained model metrics,
589
- - reward curve,
590
- - parse error curve,
591
- - example before/after decisions.
592
-
593
- ## 12. Recommended Demo Story
594
-
595
- A simple judge-friendly demo:
596
-
597
- 1. Show a late chase scenario:
598
-
599
- ```text
600
- Over 16.0, 128/5, target 172
601
- ```
602
 
603
- 2. Random/untrained model:
604
 
605
- - may choose invalid tools,
606
- - may attack blindly,
607
- - may ignore field/opponent.
608
 
609
- 3. Trained/adaptive model:
610
 
611
- - checks target pressure,
612
- - selects finisher,
613
- - plans boundary zones,
614
- - responds after wicket/boundary,
615
- - changes risk level.
 
 
 
 
 
 
 
 
 
 
 
 
 
616
 
617
- 4. Show metrics:
618
 
619
- - parse errors down,
620
- - adaptation up,
621
- - opponent awareness up,
622
- - reward up.
623
 
624
- This tells the story clearly:
 
 
 
 
625
 
626
- > The model learned to captain, not just output JSON.
627
 
628
- ## 13. Current Status
 
 
 
629
 
630
- Implemented:
631
 
632
- - OpenEnv environment.
633
- - Rich strategic tool surface.
634
- - Opponent policies.
635
- - Adaptive eval pack.
636
- - T20 data curation script.
637
- - Player profile builder.
638
- - Opponent cache generator.
639
- - GRPO training script path.
640
- - SFT bootstrap data generator.
641
- - Eval and plotting scripts.
642
 
643
- Verified with smoke tests:
644
 
645
- - Python compile checks.
646
- - Lint checks.
647
- - `train.py test`.
648
- - `train.py sft-data`.
649
- - opponent cache generation.
650
- - server startup.
651
- - random inference run.
652
- - eval run with plots.
653
 
654
- Next major task:
655
 
656
- Run real training on compute and commit the resulting plots/metrics for the final submission.
 
30
 
31
  ## 2. Fit With OpenEnv Competition Themes
32
 
 
 
33
  ### Multi-Agent Interactions
34
 
35
  The submitted captain agent plays against an opponent policy. The opponent can be:
36
 
37
+ - `heuristic`: fast format-aware cricket logic (T5/T20/ODI rules).
38
+ - `cricsheet`: real Cricsheet ball-by-ball match data sampled by game context.
39
+ - `llm_live`: live OpenAI-compatible LLM opponent (google/gemma-4-26B-A4B-it via HF Router).
40
  - `llm_cached`: replayed opponent decisions for reproducible evaluation.
41
 
42
  This tests whether the agent can reason about another actor's incentives, field settings, and likely plans.
 
47
 
48
  ### World Modeling
49
 
50
+ The agent observes a partially summarized cricket world: score, over/ball, wickets, target, phase, field, batter profile, bowler profile, previous outcome. It must maintain an internal model of what is happening and update that model after every ball.
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
  ### Self-Improvement
53
 
54
+ The same environment can support heuristic curriculum training, cached-opponent official evaluation, live LLM opponent self-play, and future agent-vs-agent training.
 
 
 
 
 
55
 
56
  ## 3. Environment Flow
57
 
 
67
  PRE_OVER -> PRE_BALL -> BALL_RESOLUTION -> POST_BALL -> next decision
68
  ```
69
 
 
 
70
  ### Toss
71
 
72
  ```json
 
76
  ### Batting Tools
77
 
78
  ```json
79
+ {"tool": "select_batter", "arguments": {"name": "Virat Kohli", "style": "anchor", "aggression": 0.35, "rationale": "Preserve wickets in the middle overs."}}
 
 
 
80
  {"tool": "set_strategy", "arguments": {"phase_intent": "consolidate", "aggression": 0.35, "rationale": "Rotate strike against spin and keep wickets in hand."}}
 
 
 
81
  {"tool": "plan_shot", "arguments": {"shot_intent": "single", "target_area": "midwicket", "risk": "low", "trajectory": "ground", "rationale": "Field is spread, so take the easy single."}}
 
 
 
82
  {"tool": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "Work into the gap."}}
83
  ```
84
 
85
+ `plan_shot` is **not** an overhead tool. Only `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, and `analyze_situation` count against the three-per-over limit (see the Tool budget section below).
86
 
87
+ ### Bowling Tools
 
 
88
 
89
  ```json
90
+ {"tool": "choose_bowler", "arguments": {"name": "Jasprit Bumrah", "bowler_type": "pace", "style": "yorker", "rationale": "Attack the stumps at the death."}}
91
  {"tool": "set_bowling_strategy", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Limit swing room."}}
 
 
 
92
  {"tool": "set_field_setting", "arguments": {"setting": "Defensive"}}
 
 
 
93
  {"tool": "plan_delivery", "arguments": {"bowler_type": "pace", "line": "stumps", "length": "full", "delivery_type": "yorker", "rationale": "Protect boundaries and force a low-percentage shot."}}
 
 
 
94
  {"tool": "bowl_delivery", "arguments": {}}
95
  ```
96
 
 
106
  {"tool": "analyze_situation", "arguments": {"query_type": "match_situation"}}
107
  ```
108
 
109
+ ## 4. Tool budget
110
+
111
+ The environment enforces a **3-call overhead budget per over** (see `CricketEnvironment.TOOL_BUDGET_PER_OVER` and `TOOL_FINE_PER_EXCESS` in `server/cricket_environment.py`).
112
+
113
+ **Overhead tools** (increment the per-over counter; the 4th+ in the same over are fined):
114
+ `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`
115
+
116
+ **Not overhead** (do not use the 3 free “slots”):
117
+ `play_delivery`, `bowl_delivery`, `plan_shot`, `call_toss`, `select_batter`, `choose_bowler`, `set_field_setting`, `set_match_plan`, `update_match_plan`
118
 
119
+ Each overhead call **beyond the third in that over** incurs an immediate **−0.04** step reward. The prompt shows `Tool budget: N/3 overhead calls used this over`.
120
+
121
+ **Training connection:** `train.py train` uses real `CricketEnvironment` steps, so these fines are part of the return GRPO optimizes. That keeps long-horizon training aligned with the benchmark: agents must choose when to pay for `analyze_situation` and `reflect_after_ball`, while `set_match_plan` / `update_match_plan` let them carry structure across overs without spending overhead budget.
122
+
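To make the accounting concrete, here is a minimal sketch of the per-over counter and fine. The constant names mirror the ones cited above; the helper function itself is hypothetical and is not the code in `server/cricket_environment.py`.

```python
# Hypothetical sketch of the per-over overhead accounting described above.
# Constant names mirror CricketEnvironment.TOOL_BUDGET_PER_OVER / TOOL_FINE_PER_EXCESS.
TOOL_BUDGET_PER_OVER = 3
TOOL_FINE_PER_EXCESS = 0.04
OVERHEAD_TOOLS = {
    "set_strategy", "set_bowling_strategy", "plan_delivery",
    "reflect_after_ball", "analyze_situation",
}

def overhead_fine(tool: str, over_counter: dict) -> float:
    """Return the step-reward penalty for one tool call; mutates the per-over counter."""
    if tool not in OVERHEAD_TOOLS:
        return 0.0  # ball-advancing calls, plan_shot, and match-plan tools are free
    over_counter["overhead"] = over_counter.get("overhead", 0) + 1
    if over_counter["overhead"] > TOOL_BUDGET_PER_OVER:
        return -TOOL_FINE_PER_EXCESS  # 4th+ overhead call in the same over
    return 0.0

counter = {}  # reset at the start of every over
for tool in ["analyze_situation", "set_strategy", "plan_delivery", "reflect_after_ball"]:
    print(tool, overhead_fine(tool, counter))  # the last call prints -0.04
```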
123
+ ## 5. OpenEnv Architecture
124
 
125
  ```text
126
  LLM Agent / Evaluator
127
  |
128
+ | WebSocket (OpenEnv)
129
  v
130
+ FastAPI server (server/app.py)
131
  |
132
  v
133
+ CricketEnvironment (server/cricket_environment.py)
134
  |
135
+ +--> MarkovCricketEngine (server/markov_engine.py)
136
+ +--> FormatMapper (server/format_mapper.py)
137
+ +--> OpponentPolicy (server/opponent_policy.py)
138
+ +--> PlayerRoster (server/player_roster.py)
139
+ +--> CoherenceGrader (server/coherence_grader.py)
140
+ +--> RewardCalculator (server/reward_calculator.py)
141
+ +--> FieldModel (server/field_model.py)
 
 
 
 
142
  ```
143
 
144
+ Key files:
 
 
 
 
145
 
146
+ | File | Role |
147
+ |------|------|
148
+ | `server/app.py` | OpenEnv server entry point |
149
+ | `server/cricket_environment.py` | `reset`, `step`, `state` implementation |
150
+ | `server/format_mapper.py` | T5/T20/ODI closest-format selector; phase-aware shot weights, batter/bowler roles |
151
+ | `server/opponent_policy.py` | Heuristic, Cricsheet, live LLM, cached LLM opponent policies |
152
+ | `server/player_roster.py` | Fuzzy player lookup; batter/bowler profile extractor |
153
+ | `models.py` | `CricketAction`, `CricketObservation`, `CricketState` |
154
+ | `client.py` | WebSocket client `CricketCaptainEnv` |
155
+ | `inference.py` | Random + LLM agent evaluation |
156
+ | `train.py` | MT-GRPO + SFT training pipeline |
157
+ | `eval.py` | Coherence heatmaps, reward curves, tool analytics |
158
 
159
+ ## 6. Format-Aware Rules
160
 
161
+ `server/format_mapper.py` auto-selects T5 / T20 / ODI rules by `|max_overs − format_overs|`:
162
 
163
+ | Format | max_overs | Key differences |
164
+ |--------|-----------|-----------------|
165
+ | T5 | ≤ 7 | High-aggression throughout, powerplay dominates all overs |
166
+ | T20 | 8–35 | Three phases (PP/Middle/Death); spin-heavy middle |
167
+ | ODI | > 35 | Four phases (PP/Middle-early/Middle-late/Death); anchor roles |
 
 
 
 
 
168
 
169
+ The format mapper provides:
170
+ - **Phase-aware shot weights**: boundary/six probability rises sharply in death overs
171
+ - **Batter roles** with `overs_active` windows (opener, anchor, middle_order, finisher)
172
+ - **Bowler roles** with `preferred_phases` (pace_opener, spin_controller, death_specialist)
173
+ - **Bowling strategy** per phase (line, length, delivery_type, field_setting)
174
 
175
+ Both the heuristic opponent and the `select_batter` / `choose_bowler` tools draw from these tables.
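As a quick illustration of the selection rule, a band check equivalent to the table above fits in a few lines. The real selector in `server/format_mapper.py` is described as picking the closest format by `|max_overs − format_overs|`, so treat the cutoffs below as the documented bands, not the exact code.

```python
# Illustration only: format selection by the bands in the table above.
def pick_format(max_overs: int) -> str:
    if max_overs <= 7:
        return "T5"
    if max_overs <= 35:
        return "T20"
    return "ODI"

for overs in (2, 5, 20, 50):
    print(overs, pick_format(overs))  # 2->T5, 5->T5, 20->T20, 50->ODI
```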
 
 
 
 
 
 
 
 
176
 
177
+ ## 7. Player Rosters
178
 
179
+ `server/player_roster.py` loads team profiles from `data/player_profiles/` — 10 T20I squads: India, Australia, England, Pakistan, South Africa, New Zealand, West Indies, Sri Lanka, Bangladesh, Afghanistan.
180
 
181
+ When the agent calls `select_batter` or `choose_bowler` with a player name, the roster performs **fuzzy lookup** (exact → surname → word-overlap) and fills in real aggression, batting/bowling style, and phase strengths from the profile.
182
 
183
+ ## 8. What The Observation Contains
184
 
185
+ Each step returns a `CricketObservation` with:
186
 
187
+ - `game_state`: toss / batting / bowling / finished
188
+ - `strategic_phase`: pre_over / pre_ball / ball_resolution / post_ball
189
+ - `game_context`: score, wickets, over, ball, target, phase, run_rate, req_rate
190
+ - `declared_strategy`: current batting strategy (aggression, intent, rationale)
191
+ - `bowling_strategy`: current bowling plan
192
+ - `field_setting`: Aggressive / Balanced / Defensive
193
+ - `current_batter`: batter profile (style, aggression, phase strengths)
194
+ - `current_bowler`: bowler profile
195
+ - `last_outcome`: ball outcome + tactical metadata (event type, shot zone, delivery features, field pressure, fielder effect)
196
+ - `available_tools`: legal tools for current state (phase-gated)
197
+ - `tool_budget`: overhead calls used this over vs 3-call limit
198
+ - `prompt_text`: rendered prompt for the LLM
199
 
200
+ The LLM sees enough information to reason tactically, but not simulator internals.
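In client terms, each of these fields arrives on the observation object returned by `reset` / `step`. A rough usage sketch is below; the constructor argument and attribute access are assumptions, so check `client.py` and `models.py` for the actual signatures.

```python
# Rough usage sketch (constructor argument and attribute names are assumptions;
# the reset/step/state lifecycle and CricketAction fields follow this repo's docs).
import os
from client import CricketCaptainEnv
from models import CricketAction

env = CricketCaptainEnv(os.environ.get("CRICKET_CAPTAIN_ENV_URL", "ws://localhost:8000"))
obs = env.reset()
print(obs.game_state, obs.strategic_phase)   # e.g. "toss", "pre_over"
print(obs.available_tools)                   # phase-gated legal tools
print(obs.tool_budget)                       # overhead calls used this over

# analyze_situation counts against the overhead budget and is only legal
# where the phase-gated tool list allows it.
obs = env.step(CricketAction(tool="analyze_situation",
                             arguments={"query_type": "match_situation"}))
print(obs.game_context)                      # score / wickets / over / target
```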
 
 
 
 
201
 
202
+ ## 9. Opponent Policies
203
 
204
+ Four modes in `server/opponent_policy.py`:
205
 
206
+ ### `heuristic`
207
 
208
+ Format-aware local policy using T5/T20/ODI rules from `format_mapper.py`. Picks shot intent from phase-weighted distributions, adjusts for wicket pressure (shifts toward conservative intents once 7+ wickets are down), and selects batter/bowler roles by current over and format. Fast, and no API key needed.
 
 
209
 
210
+ ### `cricsheet`
211
 
212
+ Samples real Cricsheet ball-by-ball deliveries indexed by `(phase, wickets_band, innings_type)`. Automatically selects T20 or ODI data based on `max_overs`:
213
+ - ≤ 25 overs → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries from 5,176 matches)
214
+ - > 25 overs → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries from 3,116 matches)
215
 
216
+ Progressive fallback widening (drop innings_type → drop wickets_band → any record in the phase) ensures no dead buckets; if the data file is absent, the policy falls back to the heuristic.
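The bucket lookup plus widening reduces to a short loop over progressively looser keys. A hypothetical sketch (the real logic lives in `server/opponent_policy.py`):

```python
import random

# Hypothetical sketch of bucketed sampling with progressive fallback widening.
def sample_delivery(outcomes, phase, wickets_band, innings_type, rng=random):
    """outcomes maps (phase, wickets_band, innings_type) -> list of ball records."""
    keys = [
        (phase, wickets_band, innings_type),  # exact bucket
        (phase, wickets_band, None),          # drop innings_type
        (phase, None, None),                  # drop wickets_band: any record in this phase
    ]
    for key in keys:
        bucket = outcomes.get(key)
        if bucket:
            return rng.choice(bucket)
    return None  # caller falls back to the heuristic opponent

outcomes = {("death", None, None): [{"runs": 6}, {"runs": 1}, {"runs": 0, "wicket": True}]}
print(sample_delivery(outcomes, "death", "3-5", "chase"))  # widened to the phase-only bucket
```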
217
 
218
+ ### `llm_live`
219
 
220
+ Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible API). Graceful heuristic fallback when no API key is present, so local development never breaks.
 
221
 
222
+ ### `llm_cached`
223
 
224
+ Replays pre-recorded opponent decisions from JSONL. Does **not** call the configured model live. Use for official leaderboard-style evaluation where every compared captain faces identical opponent decisions.
225
 
226
+ ## 10. Ball Physics And Markov Engine
 
 
 
 
227
 
228
+ The simulation uses `server/markov_engine.py` plus field/zone definitions in `server/field_model.py`.
229
 
230
+ Ball transition tables keyed by `(over, wickets, score_band, phase, bowler_type)`:
231
+ 1. **Cricsheet-derived**: `data/processed/cricket_transitions_v1.pkl` when available
232
+ 2. **Calibrated synthetic**: `data/transition_probs.json` as fallback
233
 
234
+ After the base Markov draw, a **hybrid tactical layer** applies (see the sketch after this list):
235
+ - Shot target zones (`cover`, `point`, `midwicket`, `long_on`, …) matched against delivery line/length/variation
236
+ - Field presets (`Aggressive`, `Balanced`, `Defensive`) expand into named fielder zones
237
+ - Boundary riders cut off fours/sixes; inner-ring fielders save singles; slips/gully convert edges
238
+ - Wides/no-balls, drops, misfields, overthrows, run-outs, caught-in-zone events add bounded stochastic noise
239
+ - High chase pressure makes defensive batting less useful
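+ A condensed sketch of the two-step resolution (table layout, probabilities, and helper names are illustrative assumptions; the key tuple and the layering order come from the description above):

+ ```python
+ import random
+ 
+ def resolve_ball(transitions, context, field_setting):
+     # 1. Base Markov draw, keyed as described above.
+     key = (context["over"], context["wickets"], context["score_band"],
+            context["phase"], context["bowler_type"])
+     outcomes, probs = transitions[key]          # e.g. (["0", "1", "2", "4", "6", "W"], [...])
+     outcome = random.choices(outcomes, weights=probs, k=1)[0]
+ 
+     # 2. Hybrid tactical layer: field placement can downgrade boundaries (illustrative odds).
+     if outcome == "4" and field_setting == "Defensive" and random.random() < 0.3:
+         outcome = "2"                           # boundary rider cuts the four off
+     return outcome
+ ```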
240
 
241
+ ## 11. Reward Design
242
 
243
+ Four-rubric composite reward:
244
 
245
+ | Rubric | Weight | Frequency | Measures |
246
+ |--------|--------|-----------|----------|
247
+ | `r_cricket` | **45%** | Per ball | Dream11 proxy: runs, wickets, dots, milestones, economy, strike rate |
248
+ | `r_behavior` | **25%** | Every turn | Coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%) |
249
+ | `r_result` | **20%** | Innings/episode end | Win/loss vs DLS par, target margin, wickets |
250
+ | `r_validity` | **10%** | Every turn | Valid tool-call structure and legal phase-gated tool use |
251
 
252
+ Plus a **progress bonus** added to `r_result`: `min(0.25, tool_calls_made / 40.0)` — caps at +0.25 once the agent makes ≥10 tool calls. Directly rewards escaping the planning-loop trap (where the policy maxes overhead tools without ever calling `play_delivery`).
253
 
254
+ **Why these weights** (rebalanced from the original 55/25/15/5): partial-trajectory training means `r_result` rarely fires (episodes truncate before completion). Putting 55% weight on a signal that fires <5% of the time washes out the gradient. The new 45/25/20/10 split mirrors the SWE-RL recipe (60% intermediate / 40% terminal) and matches what working coding-agent RL setups actually use.
255
 
256
+ `r_tools` is computed and logged but excluded from the composite — tool discipline is measured through outcomes.
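+ Putting the weights and the progress bonus together, the composite reduces to roughly the following (a sketch; whether the progress bonus is weighted together with `r_result` or added afterwards is an assumption here):

+ ```python
+ def composite_reward(r_cricket, r_behavior, r_result, r_validity, tool_calls_made):
+     # Progress bonus: min(0.25, tool_calls_made / 40.0), caps at +0.25 once >= 10 calls.
+     progress_bonus = min(0.25, tool_calls_made / 40.0)
+     return (
+         0.45 * r_cricket
+         + 0.25 * r_behavior
+         + 0.20 * (r_result + progress_bonus)   # assumption: bonus shares the r_result weight
+         + 0.10 * r_validity
+     )
+ ```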
257
 
258
+ ### Coherence Scoring (batting)
259
 
 
 
 
 
 
 
260
  ```
261
+ coherence = aggression_match × rationale_specificity × phase_appropriate
262
+ aggression_match = 1 − |declared_aggression − shot_aggression_proxy|
263
+ rationale_specificity = (word_count_score + cricket_keyword_density) / 2
264
+ phase_appropriate = 1 − |declared_aggression − phase_baseline|
265
+ phase_baselines: powerplay=0.55, middle=0.35, death=0.75
 
 
 
 
 
 
 
 
266
  ```
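+ A quick worked example with illustrative numbers:

+ ```python
+ declared_aggression = 0.40      # declared "consolidate" intent
+ shot_aggression_proxy = 0.60    # but the shot actually played was fairly aggressive
+ phase_baseline = 0.35           # middle overs
+ 
+ aggression_match = 1 - abs(declared_aggression - shot_aggression_proxy)    # 0.80
+ rationale_specificity = (0.7 + 0.5) / 2                                     # 0.60 (assumed sub-scores)
+ phase_appropriate = 1 - abs(declared_aggression - phase_baseline)           # 0.95
+ 
+ coherence = aggression_match * rationale_specificity * phase_appropriate    # ~0.456
+ print(round(coherence, 3))
+ ```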
267
 
268
+ ### Single-Stage Training with Format Curriculum
 
 
 
269
 
270
+ The original two-stage (format → strategy) curriculum was collapsed because Qwen3.5-4B
271
+ already does tool calling natively (XML+JSON via `_parse_completion`). The full composite
272
+ reward fires from step 0.
273
 
274
+ What remains is a **format-length curriculum within the warmup config**: per-scenario
275
+ `max_overs` is sampled from `[2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]` (heavy on T2-T3 so episodes
276
+ actually complete inside the token budget). The main run then trains on full T20 (20-over)
277
+ matches, optionally resuming from the warmup adapter.
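+ The warmup sampling is just a weighted choice over that list, for example (a sketch, not the exact loader code):

+ ```python
+ import random
+ 
+ CURRICULUM_OVERS = [2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]
+ 
+ def sample_max_overs() -> int:
+     # ~45% of warmup scenarios are 2-over matches; only ~9% reach 5 overs.
+     return random.choice(CURRICULUM_OVERS)
+ ```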
 
 
 
 
 
 
 
 
278
 
279
+ ## 12. Data Curation Pipeline
280
 
281
+ ### Step 1: Curate Ball Outcomes
282
 
283
  ```bash
284
+ python scripts/curate_transitions.py --format t20 # → ball_outcomes_t20_v1.pkl
285
+ python scripts/curate_transitions.py --format odi # → ball_outcomes_odi_v1.pkl
286
  ```
287
 
288
+ Both files already generated:
289
+ - `data/processed/ball_outcomes_t20_v1.pkl` — 1.17M T20 deliveries, 5,176 matches
290
+ - `data/processed/ball_outcomes_odi_v1.pkl` — 1.65M ODI deliveries, 3,116 matches
291
+ - `data/processed/cricket_transitions_v1.pkl` — 5,138 Markov keys, 2,878 high-confidence
 
 
 
 
 
 
 
 
 
292
 
293
+ ### Step 2: Build Evaluation Pack
 
 
294
 
295
  ```bash
296
  python scripts/build_eval_pack.py --eval-pack-id adaptive_t20_v1
297
  ```
298
 
299
+ ### Step 3: Generate Opponent Cache
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
300
 
301
  ```bash
302
  python scripts/generate_opponent_cache.py \
 
306
  --output data/opponent_cache/adaptive_t20_v1.jsonl
307
  ```
308
 
309
+ ## 13. Training Pipeline
 
 
 
 
 
 
310
 
311
+ ### Recommended: Single-Command Chain
312
 
313
  ```bash
314
+ # Warmup (5-over curriculum, 25 steps) → Main (20-over T20, 100 steps).
315
+ # Main auto-resumes from warmup adapter at ./checkpoints/stage2_final.
316
+ bash scripts/run_warmup_then_main.sh
317
  ```
318
 
319
+ ### Run Components Individually
 
 
 
 
 
 
 
 
 
320
 
321
+ **Warmup only — short curriculum, bootstraps the LoRA adapter:**
322
 
323
  ```bash
324
+ PYTORCH_ALLOC_CONF=expandable_segments:True \
325
+ python train.py train --config configs/cricket_train_warmup.yaml
326
  ```
327
 
328
+ **Main only — full T20, resumes warmup adapter (or fresh if `resume_from` is empty):**
 
 
 
 
 
 
 
 
329
 
330
  ```bash
331
+ PYTORCH_ALLOC_CONF=expandable_segments:True \
332
+ python train.py train --config configs/cricket_train.yaml
333
  ```
334
 
335
+ **Optional SFT bootstrap** (legacy, not needed for Qwen3.5-4B which has native tool calling):
 
 
 
 
 
 
 
 
 
 
336
 
337
  ```bash
338
+ python train.py sft-data --output data/training/tool_sft_examples.jsonl
339
  ```
340
 
341
+ `train.py train` uses TRL `GRPOTrainer` with `environment_factory=CricketCaptainToolEnv`. The captain being trained is loaded locally by Transformers/TRL and interacts with live environment instances through tool methods. `--opponent-mode llm_live` affects only the adversary; it does not mean the trained captain is served through the HF inference endpoint.
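+ Conceptually the wiring looks like the sketch below (the constructor arguments and import path are assumptions pieced together from this description and the YAML keys; `train.py` is the authoritative version):

+ ```python
+ from trl import GRPOConfig, GRPOTrainer
+ 
+ from cricket_env import CricketCaptainToolEnv  # hypothetical import path; real class lives in this repo
+ 
+ config = GRPOConfig(
+     output_dir="./checkpoints",
+     num_generations=4,            # GRPO group size from the main-run config
+     max_completion_length=3072,
+     beta=0.0,                     # no reference model, as in cricket_train.yaml
+ )
+ 
+ trainer = GRPOTrainer(
+     model="Qwen/Qwen3.5-4B",
+     args=config,
+     environment_factory=CricketCaptainToolEnv,  # multi-turn tool-calling rollouts
+     # prompt dataset and reward plumbing omitted here; see train.py for the full call
+ )
+ trainer.train()
+ ```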
 
342
 
343
+ The default training model is `Qwen/Qwen3.5-4B`. The default live opponent model is `google/gemma-4-26B-A4B-it`. Roster-backed training requires `--agent-team` or `env.agent_team` in YAML so `select_batter` and `choose_bowler` use real player profiles instead of generic names.
344
 
345
+ ## 14. Current Status (2026-04-25)
 
 
346
 
347
+ ### Implemented and verified
348
 
349
+ | Component | Status |
350
+ |-----------|--------|
351
+ | OpenEnv server + client | ✅ |
352
+ | 14-tool strategic surface | ✅ |
353
+ | 4-rubric reward system | ✅ |
354
+ | Tool budget system (3/over, −0.04 fine) | ✅ |
355
+ | Format mapper (T5/T20/ODI) | ✅ |
356
+ | Player rosters (10 T20I teams, fuzzy lookup) | ✅ |
357
+ | Cricsheet T20 data (1.17M deliveries) | ✅ |
358
+ | Cricsheet ODI data (1.65M deliveries) | ✅ |
359
+ | Heuristic opponent (format-aware) | ✅ |
360
+ | Cricsheet opponent (T20+ODI, context-indexed) | ✅ |
361
+ | LLM live opponent (HF Router / OpenAI-compatible API) | ✅ |
362
+ | LLM cached opponent | ✅ |
363
+ | GRPO training script (`environment_factory` agent rollouts) | ✅ |
364
+ | SFT data generator | ✅ |
365
+ | Gradio demo UI | ✅ |
366
+ | Colab training notebook | ✅ |
367
 
368
+ ### Verified end-to-end (2026-04-25)
369
 
370
+ Three of the four opponent modes (heuristic, cricsheet, llm_live) verified at 5-over inference + train-smoke:
 
 
 
371
 
372
+ | Mode | inference parse_err | train-smoke r_validity | coherence |
373
+ |------|--------------------|-----------------------|-----------|
374
+ | heuristic | 0% | 1.0 | 0.556 |
375
+ | cricsheet | 0% | 1.0 | 0.620 |
376
+ | llm_live | 0% | 1.0 | 0.537 |
377
 
378
+ ### Pending for submission
379
 
380
+ - Real GRPO training run with reward curves (requires HF compute)
381
+ - HF Space deployment URL
382
+ - Training-vs-baseline comparison plots
383
+ - Mini-blog / video
384
 
385
+ ## 15. Recommended Demo Story
386
 
387
+ 1. **Show a late chase scenario**: Over 16.0, 128/5, target 172
 
 
 
 
 
 
 
 
 
388
 
389
+ 2. **Random/untrained model**: invalid tools, blind aggression, ignores field/opponent
390
 
391
+ 3. **Trained model**: checks target pressure → selects finisher → plans boundary zones → responds after wicket → changes risk level
 
 
 
 
 
 
 
392
 
393
+ 4. **Show metrics**: parse errors ↓, coherence ↑, adaptation ↑, opponent_awareness ↑, reward ↑
394
 
395
+ > The model learned to captain, not just emit a valid tool-call object.
docs/experiment_workflow.md CHANGED
@@ -1,12 +1,12 @@
1
  # Experiment Workflow: Baselines, Opponents, Short Runs, and Training
2
 
3
- This document explains how to run CricketCaptain experiments in a practical order: first short 2-over smoke checks, then 5-over untrained model baselines, then GRPO training, then longer 20-over evaluation.
4
 
5
- ## 1. Why Start With 2-Over Smoke + 5-Over Baselines?
6
 
7
- A full T20 innings is 20 overs. That is useful for final evaluation, but it is slow and noisy for debugging.
8
 
9
- For early code-path experiments, 2-over smoke runs are better because they quickly answer:
10
 
11
  - Is the OpenEnv server working?
12
  - Is the client connecting correctly?
@@ -19,461 +19,267 @@ For early code-path experiments, 2-over smoke runs are better because they quick
19
  The workflow should be:
20
 
21
  ```text
22
- 2-over smoke -> 5-over untrained baseline -> short training -> 5-over trained eval -> 20-over final eval
23
  ```
24
 
25
- Do not start with full 20-over training unless the 2-over smoke and 5-over baseline loops are stable.
26
 
27
- ## 2. Current Opponent Modes
28
 
29
- Opponent behavior is controlled by YAML (`configs/default.yaml` / `configs/cached_eval.yaml`), `CRICKET_OPPONENT_MODE`, or `--opponent-mode`.
30
 
31
- The code supports three modes in `server/opponent_policy.py`.
32
 
33
- ## 2.1 Heuristic Opponent
34
 
35
  ```bash
36
- export CRICKET_OPPONENT_MODE=heuristic
37
  ```
38
 
39
- This is the current default.
 
 
 
40
 
41
- It is not a live LLM. It is a deterministic/stochastic cricket policy that chooses sensible batting and bowling plans based on phase, wickets, target, batter style, and field.
 
42
 
43
- Use it for:
44
-
45
- - fast local tests,
46
- - smoke tests,
47
- - cheap training rollouts,
48
- - deterministic-ish baselines.
49
-
50
- Pros:
51
-
52
- - cheap,
53
- - fast,
54
- - no API key,
55
- - stable enough for debugging.
56
-
57
- Cons:
58
-
59
- - less realistic than an LLM opponent,
60
- - less diverse.
61
-
62
- ## 2.2 Live LLM Opponent
63
 
64
  ```bash
65
- export CRICKET_OPPONENT_MODE=llm_live
66
- export CRICKET_OPPONENT_MODEL=google/gemma-4-26B-A4B-it
67
- export CRICKET_OPPONENT_API_BASE=https://router.huggingface.co/v1
68
- export HF_TOKEN=...
69
  ```
70
 
71
- The default live opponent model in code/config is:
72
-
73
- ```text
74
- google/gemma-4-26B-A4B-it
75
- ```
76
-
77
- This mode calls an OpenAI-compatible API from `LLMOpponentPolicy`.
78
-
79
- Use it for:
80
 
81
- - demos,
82
- - more realistic opponent behavior,
83
- - generating cached opponent decisions,
84
- - future self-play-style experiments.
85
 
86
- Pros:
 
87
 
88
- - more realistic strategic behavior,
89
- - can react with natural tactical reasoning,
90
- - good for storytelling/demo.
91
-
92
- Cons:
93
-
94
- - costs API calls,
95
- - can be non-deterministic,
96
- - not ideal for official evaluation directly.
97
-
98
- ## 2.3 Cached LLM Opponent
99
 
100
  ```bash
101
- export CRICKET_OPPONENT_MODE=llm_cached
102
- export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1_official_gemma2b.jsonl
103
  ```
104
 
105
- This mode replays pre-recorded opponent decisions. It does **not** call `CRICKET_OPPONENT_MODEL` live during the run; `cache_path` is the source of opponent behavior.
106
 
107
- Use it for:
108
 
109
- - official/fair evaluation,
110
- - leaderboard-style comparison,
111
- - reproducible experiments.
112
 
113
- Recommended official flow:
114
-
115
- ```text
116
- llm_live once -> save opponent decisions -> llm_cached for all model comparisons
117
- ```
118
-
119
- This gives the benefit of an LLM opponent while ensuring every model faces the same opponent decisions.
120
-
121
- ## 3. What Model Is The Opposite Team?
122
-
123
- Currently, the default live opponent is:
124
-
125
- ```text
126
- google/gemma-4-26B-A4B-it via https://router.huggingface.co/v1
127
  ```
128
 
129
- Important distinction:
130
-
131
- ```text
132
- llm_live -> calls the configured model during the run
133
- llm_cached -> ignores live model calls and replays cache_path
134
- heuristic -> uses local rule-based cricket policy
135
- ```
136
 
137
- This can be changed with:
138
 
139
  ```bash
140
- export CRICKET_OPPONENT_MODE=llm_live
141
- export CRICKET_OPPONENT_MODEL=<model-name>
142
- ```
143
-
144
- For example:
145
 
146
- ```bash
147
- export CRICKET_OPPONENT_MODEL=google/gemma-4-26B-A4B-it
148
  ```
149
 
150
- or with another OpenAI-compatible server:
151
 
 
152
  ```bash
153
- export CRICKET_OPPONENT_API_BASE=http://localhost:8080/v1
154
- export CRICKET_OPPONENT_MODEL=<local-model-name>
155
  ```
156
 
157
- ## 4. Baseline First: Random Agent
158
-
159
- Before testing any trained model, run the random baseline.
160
-
161
- Start the server:
162
-
163
  ```bash
164
- PYTHONPATH=. python server/app.py
165
  ```
166
 
167
- Run baseline:
 
 
168
 
169
  ```bash
170
- PYTHONPATH=. python inference.py \
171
- --model random \
172
- --episodes 5 \
173
- --env-url "$CRICKET_CAPTAIN_ENV_URL" \
174
- --opponent-mode heuristic \
175
- --eval-pack-id adaptive_t20_v1
176
  ```
177
 
178
- Track:
179
 
180
- - `total_reward`
181
- - `mean_coherence`
182
- - `adaptation`
183
- - `parse_error_rate`
184
- - score/wickets
185
 
186
- This tells us whether the environment works and gives a baseline to beat.
187
 
188
- ## 5. Untrained LLM Baseline
189
-
190
- Next, evaluate a base model without training.
191
-
192
- Example:
193
 
194
  ```bash
195
- PYTHONPATH=. python inference.py \
196
- --model Qwen/Qwen2.5-7B-Instruct \
197
- --episodes 5 \
198
- --env-url "$CRICKET_CAPTAIN_ENV_URL" \
199
- --opponent-mode heuristic \
200
- --eval-pack-id adaptive_t20_v1
201
  ```
202
 
203
- This shows what a general instruction model can do before any RL.
204
-
205
- Expected weaknesses:
206
-
207
- - may output invalid JSON,
208
- - may choose wrong tools,
209
- - may ignore opponent plan,
210
- - may be verbose instead of tool-only,
211
- - may not adapt after bad outcomes.
212
-
213
- ## 6. Why SFT Exists If We Use GRPO
214
-
215
- SFT is not the main training objective. It is a warmup.
216
-
217
- GRPO should optimize strategic behavior, but if the model cannot produce valid tool JSON, GRPO wastes rollouts learning syntax.
218
-
219
- SFT helps the model learn:
220
-
221
- - valid JSON shape,
222
- - available tools,
223
- - argument schemas,
224
- - when tools are legal,
225
- - one-tool-call responses.
226
-
227
- Then GRPO can focus on:
228
-
229
- - coherence,
230
- - adaptation,
231
- - opponent awareness,
232
- - match result quality.
233
 
234
- Recommended stack:
235
 
236
- ```text
237
- SFT/tool warmup -> GRPO stage 1 -> GRPO stage 2 -> eval
238
  ```
239
 
240
- ## 7. Stage 0: Generate SFT Tool Data
241
 
242
  ```bash
243
- python train.py sft-data \
244
- --output data/training/tool_sft_examples.jsonl \
245
- --examples 500
 
 
 
 
 
246
  ```
247
 
248
- This creates supervised examples for:
249
-
250
- - toss,
251
- - batting tools,
252
- - bowling tools,
253
- - reflection,
254
- - analysis.
255
-
256
- These examples are useful for quick tool-format finetuning.
257
 
258
- ## 8. Stage 1: GRPO Format / Tool Training
259
 
260
  ```bash
261
- python train.py train \
262
- --stage 1 \
263
- --steps 100 \
264
- --prompts 200 \
265
- --model Qwen/Qwen2.5-7B-Instruct
266
  ```
267
 
268
- Goal:
269
-
270
- - reduce parse errors,
271
- - make tool calls valid,
272
- - stabilize action format.
273
-
274
- Metrics to watch:
275
 
276
- - format reward,
277
- - parse error rate,
278
- - invalid tool rate.
279
 
280
- ## 9. Stage 2: GRPO Strategic Training
 
 
281
 
282
  ```bash
283
- python train.py train \
284
- --stage 2 \
285
- --steps 200 \
286
- --prompts 300 \
287
- --model ./checkpoints/stage1_final
288
  ```
289
 
290
- Goal:
 
291
 
292
- - improve plan-action coherence,
293
- - improve adaptation,
294
- - improve opponent awareness,
295
- - improve tool efficiency,
296
- - improve cricket result quality.
297
 
298
- Metrics to watch:
299
-
300
- - total reward,
301
- - coherence,
302
- - adaptation,
303
- - opponent awareness,
304
- - regret-style score,
305
- - score/wickets,
306
- - chase/defense success.
307
-
308
- ## 10. 5-Over vs 20-Over Evaluation
309
-
310
- ### 5-Over Evaluation
311
-
312
- Use for:
313
-
314
- - debugging,
315
- - model sanity checks,
316
- - comparing before/after quickly,
317
- - cheap experiments.
318
-
319
- Both `inference.py` and `eval.py` support `--max-overs`, and the YAML configs set `max_overs: 5` by default for quick iteration.
320
-
321
- Random captain sanity check:
322
 
323
  ```bash
324
- PYTHONPATH=. python inference.py \
325
- --model random \
326
- --episodes 5 \
327
- --max-overs 5 \
328
- --env-url "$CRICKET_CAPTAIN_ENV_URL" \
329
- --eval-pack-id adaptive_t20_v1 \
330
- --opponent-mode llm_cached
331
  ```
332
 
333
- HF Gemma captain with live HF inference:
 
334
 
335
- ```bash
336
- export HF_TOKEN="hf_..."
337
- PYTHONPATH=. python inference.py \
338
- --model google/gemma-4-26B-A4B-it \
339
- --api-base https://router.huggingface.co/v1 \
340
- --api-key "$HF_TOKEN" \
341
- --episodes 1 \
342
- --max-overs 5 \
343
- --env-url "$CRICKET_CAPTAIN_ENV_URL" \
344
- --eval-pack-id adaptive_t20_v1 \
345
- --opponent-mode llm_cached
346
  ```
 
 
347
 
348
- ### 20-Over Evaluation
 
 
349
 
350
- Use for:
 
351
 
352
- - final benchmark,
353
- - README numbers,
354
- - competition evidence,
355
- - trained-vs-baseline comparison.
356
 
357
- Command:
358
 
359
  ```bash
360
- PYTHONPATH=. python eval.py \
361
  --episodes 20 \
362
- --env-url "$CRICKET_CAPTAIN_ENV_URL" \
363
- --eval-pack-id adaptive_t20_v1
364
- ```
365
-
366
- ## 11. Evaluation Pack
367
-
368
- The main adaptive pack is:
369
-
370
- ```text
371
- data/eval_packs/adaptive_t20_v1.json
372
  ```
373
 
374
- It contains:
375
 
376
- - 5 dev scenarios,
377
- - 60 official scenarios,
378
- - chase states,
379
- - defend states,
380
- - death-over states,
381
- - collapse states,
382
- - matchup states.
383
 
384
- Use dev scenarios for local iteration and official scenarios for final comparison.
 
 
 
 
 
385
 
386
- ## 12. Recommended Experiment Order
387
 
388
- Use this sequence:
389
 
390
- ```text
391
- 1. Random baseline, 5-over
392
- 2. Base LLM baseline, 5-over
393
- 3. Training-side rollout smoke, 1 match / 5 overs
394
- 4. SFT warmup
395
- 5. GRPO stage 1, short
396
- 6. GRPO stage 2, short
397
- 7. Trained eval, 5-over
398
- 8. Trained eval, 20-over
399
- 9. Cached LLM opponent official eval
400
- 10. Add plots and before/after examples to README
401
- ```
402
 
403
- Training-side smoke command:
404
 
405
- ```bash
406
- python train.py train-smoke \
407
- --matches 1 \
408
- --max-overs 2 \
409
- --max-steps 240 \
410
- --log-steps 90 \
411
- --eval-pack-id adaptive_t20_v1 \
412
- --opponent-mode heuristic
413
- ```
414
 
415
- This does not load a model or run GRPO. It verifies the rollout/prompt/reward path before spending GPU time. Smoke logs include timing fields (`t_elapsed`, `step_dt`, `since_prev`, `match_elapsed`, `avg_step_dt`) for latency analysis.
416
 
417
- ## 13. What To Show In The Final Submission
418
 
419
- For the OpenEnv competition, the strongest evidence is:
420
 
421
- | Evidence | Why It Matters |
422
- |---|---|
423
- | Random baseline | Shows the environment is non-trivial |
424
- | Base LLM baseline | Shows general LLM behavior before training |
425
- | Trained model metrics | Shows improvement |
426
- | Reward curve | Shows learning progress |
427
- | Parse error curve | Shows tool-use improvement |
428
- | Before/after examples | Makes the story clear |
429
- | Eval against cached opponent | Shows fairness/reproducibility |
430
 
431
- Minimum final numbers to report:
 
 
432
 
433
- - total reward,
434
- - parse error rate,
435
- - coherence,
436
- - adaptation,
437
- - opponent awareness,
438
- - score/wickets,
439
- - chase or defense success rate.
440
 
441
- ## 14. Latest Smoke-Test Evidence
442
 
443
- The current reproducible run artifacts live under `illustrations/`.
444
 
445
- ```text
446
- Random captain + cached LLM opponent:
447
- mean score: 13.5 across 2 episodes
448
- mean reward: 0.984
449
- mean coherence: 0.555
450
- parse errors: 0.0%
451
-
452
- HF Gemma 4 captain + cached LLM opponent:
453
- model: google/gemma-4-26B-A4B-it
454
- trace: 40 OpenEnv turns with reset/step/action logs
455
- score: 7/0 after 2.2 overs
456
- reward sum: 0.168
457
- coherence: 0.657
458
- adaptation: 0.502
459
- opponent awareness: 0.750
460
- parse errors: 0.0%
461
-
462
- Latest captured training-side smoke, 1 match / 5 overs:
463
- first innings: opponent 30/6, target 31
464
- first-innings reward: +0.170 from par/run-rate/wicket context
465
- chase: 26/1 in 5 overs
466
- match result: loss
467
- terminal reward: 0.634 (r_cric=0.759, r_dream11=1.317, r_strategy=0.536)
468
- tactical events: deep-cover save, edge/catch chances, no-ball, misfield, caught-in-zone
469
  ```
470
 
471
- These are smoke checks, not final leaderboard numbers. They demonstrate that OpenEnv websocket interaction, HF router inference, tool-call parsing, cached opponent replay, observation updates, opponent plans, target/run-rate context, field-aware tactical outcomes, rule-gated bowler/batter changes, timing instrumentation, and reward metrics are all functioning.
472
-
473
- ## 15. Immediate Next Engineering Improvement
474
-
475
- Next useful work:
476
 
477
- - Generate a fresh cached-opponent file using `google/gemma-4-26B-A4B-it` in `llm_live` mode.
478
- - Run a 5-over base-model comparison across random, Gemma 4, and one trained checkpoint.
479
- - Move the strongest setup to 20-over evaluation.
 
 
1
  # Experiment Workflow: Baselines, Opponents, Short Runs, and Training
2
 
3
+ This document explains how to run CricketCaptain experiments in a practical order: smoke checks, 5-over baselines, training, then longer evaluation.
4
 
5
+ ## 1. Why Start With 5-Over Smoke + Baselines?
6
 
7
+ A full T20 innings is 20 overs. That is useful for final evaluation but slow for debugging.
8
 
9
+ For early experiments, 5-over runs are better because they quickly answer:
10
 
11
  - Is the OpenEnv server working?
12
  - Is the client connecting correctly?
 
19
  The workflow should be:
20
 
21
  ```text
22
+ 5-over smoke → 5-over untrained baseline → short training → 5-over trained eval → 20-over final eval
23
  ```
24
 
25
+ Do not start with full 20-over training unless the 5-over smoke loop is stable.
26
 
27
+ ## 2. Opponent Modes
28
 
29
+ Four modes in `server/opponent_policy.py`. Controlled via `--opponent-mode`, `CRICKET_OPPONENT_MODE`, or `configs/default.yaml`.
30
 
31
+ **Default is `llm_live`** in `configs/default.yaml` so training can face a real LLM opponent when credentials are present. For cheap/local checks, explicitly pass `--opponent-mode heuristic`.
32
 
33
+ ### 2.1 Heuristic Opponent
34
 
35
  ```bash
36
+ --opponent-mode heuristic
37
  ```
38
 
39
+ Format-aware local policy. Uses T5/T20/ODI rules from `server/format_mapper.py`:
40
+ - Phase-weighted shot distributions (powerplay/middle/death per format)
41
+ - Wicket-pressure shift (heavier weight toward defensive shots when 7+ down)
42
+ - Batter/bowler roles selected from `data/format_rules.json`
43
 
44
+ Use for: fast local tests, cheap training rollouts, deterministic-ish baselines.
45
+ **No API key needed.**
46
 
47
+ ### 2.2 Cricsheet Opponent
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
  ```bash
50
+ --opponent-mode cricsheet
 
 
 
51
  ```
52
 
53
+ Samples real Cricsheet ball-by-ball deliveries, indexed by `(phase, wickets_band, innings_type)`.
 
 
 
 
 
 
 
 
54
 
55
+ Auto-selects data by format:
56
+ - `max_overs ≤ 25` → `ball_outcomes_t20_v1.pkl` (1.17M T20 deliveries)
57
+ - `max_overs > 25` → `ball_outcomes_odi_v1.pkl` (1.65M ODI deliveries)
 
58
 
59
+ Progressive fallback: drop innings_type → drop wickets_band → any phase record.
60
+ **No API key needed.** Data files must be present under `data/processed/`.
61
 
62
+ ### 2.3 Live LLM Opponent
 
 
 
 
 
 
 
 
 
 
63
 
64
  ```bash
65
+ export HF_TOKEN=hf_...
66
+ --opponent-mode llm_live
67
  ```
68
 
69
+ Calls `google/gemma-4-26B-A4B-it` via HF Router (or any OpenAI-compatible endpoint). Set `HF_TOKEN` or pass `--opponent-api-key`; otherwise use `--opponent-mode heuristic` for local runs.
70
 
71
+ Use for: demos, realistic opponent behavior, self-play experiments.
72
 
73
+ ### 2.4 Cached LLM Opponent
 
 
74
 
75
+ ```bash
76
+ --opponent-mode llm_cached
77
+ export CRICKET_OPPONENT_CACHE=data/opponent_cache/adaptive_t20_v1.jsonl
 
 
 
 
 
 
 
 
 
 
 
78
  ```
79
 
80
+ Replays pre-recorded decisions. Does **not** call any live model. Use for official/reproducible eval — every compared captain faces identical opponent decisions.
 
 
 
 
 
 
81
 
82
+ ## 3. Starting The Server
83
 
84
  ```bash
85
+ # Recommended (uvicorn auto-reload)
86
+ cd cricket_captain
87
+ python -m uvicorn server.app:app --port 8001
 
 
88
 
89
+ # Or via app.py directly
90
+ PYTHONPATH=. python server/app.py --port 8001
91
  ```
92
 
93
+ Health check: `curl http://localhost:8001/health` → `{"status":"healthy"}`
94
 
95
+ Set the URL for runners:
96
  ```bash
97
+ export CRICKET_CAPTAIN_ENV_URL=http://localhost:8001
 
98
  ```
99
 
100
+ On Lightning / remote runtimes, expose the port and pass the external URL:
 
 
 
 
 
101
  ```bash
102
+ export CRICKET_CAPTAIN_ENV_URL=ws://<lightning-exposed-host>/ws
103
  ```
104
 
105
+ ## 4. Step 1: Random Baseline (heuristic, cricsheet, llm_live)
106
+
107
+ No GPU needed; heuristic and cricsheet need no API key (llm_live needs `HF_TOKEN`). Verify the full loop works.
108
 
109
  ```bash
110
+ # Run all 3 modes in parallel
111
+ python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode heuristic --env-url http://localhost:8001
112
+ python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode cricsheet --env-url http://localhost:8001
113
+ python inference.py --model random --episodes 5 --max-overs 5 --opponent-mode llm_live --env-url http://localhost:8001
 
 
114
  ```
115
 
116
+ **Verified baselines (2026-04-25):**
117
 
118
+ | Opponent | score | coherence | reward | parse_err |
119
+ |----------|-------|-----------|--------|-----------|
120
+ | heuristic | 20.8 | 0.556 | −0.826 | 0% |
121
+ | cricsheet | 28.0 | 0.527 | −0.410 | 0% |
122
+ | llm_live | 27.4 | 0.537 | −0.723 | 0% |
123
 
124
+ ## 5. Step 2: Train-Smoke (verify reward signals, no GPU)
125
 
126
+ `train.py train-smoke` runs direct `CricketEnvironment` rollouts — **no server needed**.
 
 
 
 
127
 
128
  ```bash
129
+ python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode heuristic
130
+ python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode cricsheet
131
+ python train.py train-smoke --matches 3 --max-overs 5 --opponent-mode llm_live
 
 
 
132
  ```
133
 
134
+ **Verified train-smoke baselines (2026-04-25):** r_validity=1.0 on all 9 matches (3 modes × 3 matches). All reward signals active: coherence, adaptation, opponent_awareness, plan_commitment, staleness, regret.
 
135
 
136
+ Quick 2-over smoke for CI-style checks:
137
 
138
+ ```bash
139
+ python train.py train-smoke --matches 1 --max-overs 2 --max-steps 240 --log-steps 90 --opponent-mode heuristic
140
  ```
141
 
142
+ ## 6. Step 3: Untrained LLM Baseline (requires HF token)
143
 
144
  ```bash
145
+ export HF_TOKEN=hf_...
146
+ python inference.py \
147
+ --model google/gemma-4-26B-A4B-it \
148
+ --api-base https://router.huggingface.co/v1 \
149
+ --api-key "$HF_TOKEN" \
150
+ --episodes 3 --max-overs 5 \
151
+ --opponent-mode llm_live \
152
+ --env-url http://localhost:8001
153
  ```
154
 
155
+ **Verified LLM captain run (2026-04-25):**
156
+ ```
157
+ model: google/gemma-4-26B-A4B-it via HF Router
158
+ coherence: 0.657 | adaptation: 0.502 | opp_aware: 0.750
159
+ parse errors: 0.0% | reward: +0.168
160
+ ```
 
 
 
161
 
162
+ ## 7. Step 4: SFT Tool Warmup
163
 
164
  ```bash
165
+ python train.py sft-data --output data/training/tool_sft_examples.jsonl --examples 500
 
 
 
 
166
  ```
167
 
168
+ Teaches tool-call shape, tool names, and argument schemas before RL. Not the main objective — just reduces wasted GRPO rollouts on syntax/tool-selection errors.
 
 
 
 
 
 
169
 
170
+ ## 8. Step 5: GRPO Warmup (5-over format)
 
 
171
 
172
+ The two-stage curriculum was collapsed into a single-stage full-reward run because Qwen3.5-4B
173
+ already supports tool calling natively (XML+JSON both accepted). What used to be "Stage 1" is
174
+ now a fast 5-over warmup, controlled entirely from YAML:
175
 
176
  ```bash
177
+ PYTORCH_ALLOC_CONF=expandable_segments:True \
178
+ python train.py train --config configs/cricket_train_warmup.yaml
 
 
 
179
  ```
180
 
181
+ Config: `max_overs=5`, `steps=25`, `num_generations=8`, `batch_size=8`, `max_completion_length=3072`,
182
+ `save_steps=5`, full composite reward from step 0. Approx 1.5–2 hrs on H200.
183
 
184
+ Goal: bootstrap the LoRA adapter on cheap short matches before the longer 20-over run.
185
+ Watch in WandB: `reward/composite_mean`, `tools/freq_*`, `rollout/match_completion_rate`,
186
+ `completions/clipped_ratio`.
 
 
187
 
188
+ ## 9. Step 6: GRPO Main Run (20-over T20)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
189
 
190
  ```bash
191
+ PYTORCH_ALLOC_CONF=expandable_segments:True \
192
+ python train.py train --config configs/cricket_train.yaml
 
 
 
 
 
193
  ```
194
 
195
+ Config: `max_overs=20`, `steps=100`, `num_generations=4`, `batch_size=8`, `max_completion_length=3072`,
196
+ `save_steps=10`, `beta=0.0` (no reference model). Approx 15-18 min/step (~30-40 steps fit a 10-hr budget).
197
 
198
+ To **resume from the warmup adapter**, uncomment in `configs/cricket_train.yaml`:
199
+ ```yaml
200
+ resume_from: ./checkpoints/stage2_final
 
 
 
 
 
 
 
 
201
  ```
202
+ or pass `--resume-from ./checkpoints/stage2_final` on the CLI. The base model still loads from
203
+ `Qwen/Qwen3.5-4B`; only the LoRA weights resume.
204
 
205
+ Goal: improve coherence, adaptation, opponent awareness, match outcomes on full T20s.
206
+ Watch in WandB: `reward/r_result_mean` (sparse outcome), `reward/r_coherence_mean`,
207
+ `reward/r_adaptation_mean`, `episode/tool_calls_mean` (should approach 720), `episode/agent_score_mean`.
208
 
209
+ Switch the opponent in YAML (`opponent.mode: llm_live`) and set `HF_TOKEN` for adversarial
210
+ training against live Gemma. Use `cricsheet` or `heuristic` for cheaper ablations.
211
 
212
+ `train.py train` uses TRL `GRPOTrainer(environment_factory=CricketCaptainToolEnv)`, so the model interacts with a live environment over multiple tool-calling turns. This is not inference through the HF Router; the trained captain model is loaded locally by Transformers/TRL, with LoRA adapters when using quantized weights.
 
 
 
213
 
214
+ ## 10. Step 7: Evaluation
215
 
216
  ```bash
217
+ python eval.py \
218
  --episodes 20 \
219
+ --env-url http://localhost:8001 \
220
+ --eval-pack-id adaptive_t20_v1 \
221
+ --opponent-mode llm_cached
 
 
 
 
 
 
 
222
  ```
223
 
224
+ Compare: random baseline → untrained Qwen3.5-4B → trained Qwen3.5-4B (warmup + main adapter via `compare_eval.py`).
225
 
226
+ ## 11. Format Comparison
 
 
 
 
 
 
227
 
228
+ | max_overs | Format selected | Data used by cricsheet | Typical target |
229
+ |-----------|----------------|------------------------|---------------|
230
+ | 5 | T5 | T20 pkl (closest) | ~47 runs |
231
+ | 7 | T5 | T20 pkl | ~66 runs |
232
+ | 20 | T20 | T20 pkl | ~160 runs |
233
+ | 50 | ODI | ODI pkl | ~290 runs |
234
 
235
+ All formats work with all opponent modes. Use `--max-overs N` with any runner.
236
 
237
+ ## 12. Tool budget and training
238
 
239
+ Implemented in `server/cricket_environment.py` (`TOOL_BUDGET_PER_OVER=3`, `TOOL_FINE_PER_EXCESS=0.04`).
 
 
 
 
 
 
 
 
 
 
 
240
 
241
+ **Overhead tools (only these increment the per-over counter):** `set_strategy`, `set_bowling_strategy`, `plan_delivery`, `reflect_after_ball`, `analyze_situation`.
242
 
243
+ **Not overhead:** `plan_shot`, `set_match_plan`, `update_match_plan`, `select_batter`, `choose_bowler`, `set_field_setting`, `play_delivery`, `bowl_delivery`, `call_toss`, and other tools that advance or directly execute the ball.
 
 
 
 
 
 
 
 
244
 
245
+ **Rule:** the first 3 overhead calls in each over are not fined; each further overhead call in that over adds **−0.04** to the step reward. The prompt includes `Tool budget: N/3 overhead calls used this over`.
246
 
247
+ **Why this matters for GRPO:** training uses the same environment as inference. Fines are part of the reward the trainer optimizes, so the policy learns to use reflection and `analyze_situation` when they matter, and to lean on `plan_shot` plus match-level planning (`set_match_plan` / `update_match_plan`) for routine structure without spending the 3 free overhead calls every ball.
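+ A sketch of the budget bookkeeping described above (the constants follow the text; the counter handling shown here is illustrative):

+ ```python
+ TOOL_BUDGET_PER_OVER = 3
+ TOOL_FINE_PER_EXCESS = 0.04
+ 
+ OVERHEAD_TOOLS = {"set_strategy", "set_bowling_strategy", "plan_delivery",
+                   "reflect_after_ball", "analyze_situation"}
+ 
+ def overhead_fine(tool: str, overhead_used_this_over: int) -> float:
+     """Step-reward penalty for one tool call (0.0 if not overhead or still within budget)."""
+     if tool not in OVERHEAD_TOOLS:
+         return 0.0
+     if overhead_used_this_over < TOOL_BUDGET_PER_OVER:
+         return 0.0                  # still inside the 3 free overhead calls for this over
+     return -TOOL_FINE_PER_EXCESS    # every further overhead call this over costs 0.04
+ 
+ # The environment increments the counter on each overhead call and resets it at the start
+ # of every over; ball-advancing tools like play_delivery never touch it.
+ ```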
248
 
249
+ ## 13. Configs
250
 
251
+ ```bash
252
+ # Start server with default config (llm_live opponent, 5-over default)
253
+ PYTHONPATH=. python server/app.py --port 8001 --config configs/default.yaml
 
 
 
 
 
 
254
 
255
+ # Start with reproducible cached eval
256
+ PYTHONPATH=. python server/app.py --port 8001 --config configs/cached_eval.yaml
257
+ ```
258
 
259
+ Config controls: `env.agent_team`, `env.max_overs`, `env.eval_pack_id`, `train.model`, `train.max_completion_length`, `train.max_tool_calling_iterations`, `opponent.mode`, `opponent.model`, `opponent.api_base`, `captain.model`, and `captain.api_base`.
 
 
 
 
 
 
260
 
261
+ ## 14. Latest Verified Run Results (2026-04-25)
262
 
263
+ All runs in `illustrations/`. Zero parse errors across all 14 inference runs. r_validity=1.0 across all train-smoke matches.
264
 
265
+ ```
266
+ Random agent — 5-over, heuristic: score=20.8 coherence=0.556 reward=−0.826
267
+ Random agent — 5-over, cricsheet: score=28.0 coherence=0.527 reward=−0.410
268
+ Random agent — 5-over, llm_live: score=27.4 coherence=0.537 reward=−0.723
269
+ Random agent — 20-over, cricsheet: score=63.6 coherence=0.568 reward=−5.632
270
+ Random agent — 20-over, heuristic: score=82.4 coherence=0.545 reward=−8.174
271
+
272
+ Train-smoke — 5-over, heuristic: r_validity=1.0 coherence=0.596 3W/0L
273
+ Train-smoke — 5-over, cricsheet: r_validity=1.0 coherence=0.620 2W/1L
274
+ Train-smoke — 5-over, llm_live: r_validity=1.0 coherence=0.552 2W/1L
275
+
276
+ LLM captain (gemma-4-26B) — 3-over, llm_live:
277
+ coherence=0.657 adaptation=0.502 opp_aware=0.750 parse_err=0%
 
 
 
 
 
 
 
 
 
 
 
278
  ```
279
 
280
+ ## 15. Immediate Next Steps
 
 
 
 
281
 
282
+ 1. **Run GRPO training** via `bash scripts/run_warmup_then_main.sh` (warmup curriculum + main run + auto-resume) to produce reward curves.
283
+ 2. **Deploy HF Space** for live Gradio demo — Dockerfile present, just needs HF push.
284
+ 3. **Generate opponent cache** using `llm_live` for reproducible official eval.
285
+ 4. **Produce training plots** — coherence heatmap, reward curve, tool-usage timeline.
docs/slides.html CHANGED
@@ -350,35 +350,36 @@
350
  <tr>
351
  <th>Rubric</th><th>Weight</th><th>Frequency</th><th>Measures</th><th>Key Sub-signals</th>
352
  </tr>
353
- <tr>
354
- <td><code>r_result</code></td>
355
- <td><strong>55%</strong></td>
356
- <td>Episode end</td>
357
- <td>Win/loss vs DLS par, target margin</td>
358
- <td>score/par, wickets_remaining, lead/deficit</td>
359
- </tr>
360
  <tr>
361
  <td><code>r_cricket</code></td>
362
- <td><strong>25%</strong></td>
363
- <td>Innings end</td>
364
  <td>Dream11 proxy: runs, wickets, milestones</td>
365
- <td>dot%, boundary%, 50s/100s, maiden overs</td>
366
  </tr>
367
  <tr>
368
  <td><code>r_behavior</code></td>
369
- <td><strong>15%</strong></td>
370
- <td>Every delivery</td>
371
  <td>Declaration-execution alignment</td>
372
  <td>coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%)</td>
373
  </tr>
 
 
 
 
 
 
 
374
  <tr>
375
  <td><code>r_validity</code></td>
376
- <td><strong>5%</strong></td>
377
  <td>Every turn</td>
378
- <td>Parseable JSON tool call</td>
379
  <td>Format gate; 0 = parse fail, 1 = valid</td>
380
  </tr>
381
  </table>
 
382
  <div class="two-col" style="margin-top:18px;">
383
  <div>
384
  <h3>Coherence Score Formula (per delivery)</h3>
@@ -389,12 +390,12 @@
389
  )</pre>
390
  </div>
391
  <div>
392
- <h3>Two-Stage Curriculum (ToolRL)</h3>
393
  <ul>
394
- <li><strong>Stage 1:</strong> <code>r_validity</code> only teaches JSON format fast</li>
395
- <li><strong>Stage 2:</strong> all 4 rubrics teaches strategy and coherence</li>
396
- <li>Non-zero floor (0.05–0.15) for valid structural callsprevents dead gradient</li>
397
- <li>GRPO group size = 8; per-turn advantage estimation (MT-GRPO)</li>
398
  </ul>
399
  </div>
400
  </div>
@@ -445,15 +446,19 @@ python inference.py \
445
  --config configs/default.yaml \
446
  --max-overs 3 --opponent-mode llm_live
447
 
448
- <span class="dim"># 4. Stage 1 format mastery</span>
449
- python train.py train \
450
- --config configs/default.yaml \
451
- --stage 1 --steps 200
452
-
453
- <span class="dim"># 5. Stage 2 strategic coherence</span>
454
- python train.py train \
455
- --config configs/default.yaml \
456
- --stage 2 --steps 600</pre>
 
 
 
 
457
  <div class="wn" style="font-size:0.84rem;">
458
  All model / API / env settings live in <code>configs/default.yaml</code>. Zero hardcoding.
459
  </div>
@@ -489,8 +494,8 @@ python train.py train \
489
  <div>
490
  <h3>What training should produce (target)</h3>
491
  <ul>
492
- <li>r_validity: 0.70 → 0.98+ after Stage 1 (50 steps)</li>
493
- <li>Coherence: ~0.52 (random) → 0.75+ after Stage 2</li>
494
  <li>analyze_situation calls cluster at over 6, 16, 36 transitions</li>
495
  <li>Strategy declarations become more specific (&gt;15 word rationales)</li>
496
  <li>Shot choices match declared aggression level &gt;80% of deliveries</li>
@@ -570,7 +575,7 @@ python train.py train \
570
  <div>
571
  <h3>🔴 Critical Path (on-site, Day 1–2)</h3>
572
  <ul>
573
- <li>Run Colab notebook on HF compute credits Stage 1 then Stage 2 training</li>
574
  <li>Export: reward_curves.png, coherence_heatmap.png, tool_timeline.png</li>
575
  <li>Deploy to HuggingFace Spaces → live interactive Gradio demo URL</li>
576
  <li>Add HF Space URL + plot images to README</li>
 
350
  <tr>
351
  <th>Rubric</th><th>Weight</th><th>Frequency</th><th>Measures</th><th>Key Sub-signals</th>
352
  </tr>
 
 
 
 
 
 
 
353
  <tr>
354
  <td><code>r_cricket</code></td>
355
+ <td><strong>45%</strong></td>
356
+ <td>Per ball</td>
357
  <td>Dream11 proxy: runs, wickets, milestones</td>
358
+ <td>dot%, boundary%, 50s/100s, maiden overs, economy</td>
359
  </tr>
360
  <tr>
361
  <td><code>r_behavior</code></td>
362
+ <td><strong>25%</strong></td>
363
+ <td>Every turn</td>
364
  <td>Declaration-execution alignment</td>
365
  <td>coherence (50%) + adaptation (20%) + opponent_awareness (20%) + regret (10%)</td>
366
  </tr>
367
+ <tr>
368
+ <td><code>r_result</code></td>
369
+ <td><strong>20%</strong></td>
370
+ <td>Innings/episode end</td>
371
+ <td>Win/loss vs DLS par, target margin</td>
372
+ <td>score/par, wickets_remaining, lead/deficit, +0.25 progress bonus</td>
373
+ </tr>
374
  <tr>
375
  <td><code>r_validity</code></td>
376
+ <td><strong>10%</strong></td>
377
  <td>Every turn</td>
378
+ <td>Parseable XML/JSON tool call</td>
379
  <td>Format gate; 0 = parse fail, 1 = valid</td>
380
  </tr>
381
  </table>
382
+ <p style="margin-top:8px;font-size:0.9em;color:#888">Rebalanced from 55/25/15/5 → 45/25/20/10 to match the SWE-RL recipe (60% intermediate / 40% terminal). Reasoning: partial-trajectory training rarely fires <code>r_result</code>; weighting it 55% wastes gradient on a near-zero signal.</p>
383
  <div class="two-col" style="margin-top:18px;">
384
  <div>
385
  <h3>Coherence Score Formula (per delivery)</h3>
 
390
  )</pre>
391
  </div>
392
  <div>
393
+ <h3>Single-Stage Training + Format Curriculum</h3>
394
  <ul>
395
+ <li><strong>Warmup (5-over curriculum):</strong> per-scenario <code>max_overs</code> sampled from [2,2,2,2,2,3,3,3,4,4,5] so episodes complete in budget and <code>r_result</code> can fire</li>
396
+ <li><strong>Main run (full T20):</strong> resumes warmup adapter, trains on target eval distribution</li>
397
+ <li>Qwen3.5-4B already does tool calling natively (XML+JSON) — no Stage 1 SFT needed</li>
398
+ <li>GRPO group size = 4; full episode advantages (TRL <code>environment_factory</code>)</li>
399
  </ul>
400
  </div>
401
  </div>
 
446
  --config configs/default.yaml \
447
  --max-overs 3 --opponent-mode llm_live
448
 
449
+ <span class="dim"># 4. Warmup → Main chained run (auto-resumes adapter)</span>
450
+ bash scripts/run_warmup_then_main.sh
451
+
452
+ <span class="dim"># 5. Eval: untrained vs trained head-to-head</span>
453
+ python compare_eval.py --model Qwen/Qwen3.5-4B \
454
+ --label baseline --episodes 20 --max-overs 5 \
455
+ --output eval_results/baseline.json
456
+ python compare_eval.py --model Qwen/Qwen3.5-4B \
457
+ --adapter ./checkpoints/stage2_final \
458
+ --label trained --episodes 20 --max-overs 5 \
459
+ --output eval_results/trained.json
460
+ python compare_eval.py --compare \
461
+ eval_results/baseline.json eval_results/trained.json</pre>
462
  <div class="wn" style="font-size:0.84rem;">
463
  All model / API / env settings live in <code>configs/default.yaml</code>. Zero hardcoding.
464
  </div>
 
494
  <div>
495
  <h3>What training should produce (target)</h3>
496
  <ul>
497
+ <li>r_validity: 0.70 → 0.98+ after warmup (25 steps)</li>
498
+ <li>Coherence: ~0.52 (random) → 0.75+ after main run</li>
499
  <li>analyze_situation calls cluster at over 6, 16, 36 transitions</li>
500
  <li>Strategy declarations become more specific (&gt;15 word rationales)</li>
501
  <li>Shot choices match declared aggression level &gt;80% of deliveries</li>
 
575
  <div>
576
  <h3>🔴 Critical Path (on-site, Day 1–2)</h3>
577
  <ul>
578
+ <li>Run Colab notebook (notebooks/colab_train_minimal.ipynb) — warmup → main chained training</li>
579
  <li>Export: reward_curves.png, coherence_heatmap.png, tool_timeline.png</li>
580
  <li>Deploy to HuggingFace Spaces → live interactive Gradio demo URL</li>
581
  <li>Add HF Space URL + plot images to README</li>
openenv.yaml CHANGED
@@ -6,23 +6,24 @@ app: server.app:app
6
  port: 8000
7
 
8
  description: >
9
- CricketCaptain-LLM trains LLMs to exhibit strategic coherence aligning
10
- declared intentions with executed actions across 300 sequential decisions.
11
- The agent uses 14 tools (toss, match-plan, batting, bowling, reflection)
12
- and is scored on four rubrics: match outcome (55%), cricket contribution (25%),
13
- behavioral coherence (15%), and tool-call validity (5%).
 
 
14
  Two-sided: a live or heuristic LLM opponent plays the opposing team.
15
- Two-stage ToolRL curriculum (format mastery full strategic reward).
 
 
 
16
 
17
  tasks:
18
- - name: stage1_format
19
- description: "5-over mini-match. r_validity only — teaches valid JSON tool-call structure (ToolRL Stage 1)."
20
- difficulty: easy
21
-
22
  - name: stage2_full
23
- description: "20-over match. Full reward: r_result (55%) + r_cricket (25%) + r_behavior (15%) + r_validity (5%)."
24
  difficulty: medium
25
 
26
  - name: eval_50over
27
- description: "Full 50-over ODI. Evaluation benchmark — measures trained coherence vs DLS par."
28
  difficulty: hard
 
6
  port: 8000
7
 
8
  description: >
9
+ CricketCaptain-LLM is a multi-turn agentic RL benchmark: train an LLM to captain
10
+ a cricket match end-to-end, alternating between batting and bowling phases across
11
+ ~180 sequential tool calls. The agent uses 14 phase-gated tools (toss, match plan,
12
+ batting, bowling, fielding, reflection, analysis) and is scored by a composite
13
+ 4-rubric reward: Dream11-style per-ball cricket-position signal (45%), per-turn
14
+ behavioral coherence/adaptation/opponent-awareness/regret (25%), match outcome
15
+ with DLS par + win bonus + progress bonus (20%), and tool-call validity (10%).
16
  Two-sided: a live or heuristic LLM opponent plays the opposing team.
17
+ Real Markov outcome engine trained on 1.65M cricsheet deliveries.
18
+ Single-stage GRPO training with format-length curriculum (T2 → T5 in warmup,
19
+ full T20 in main run). Partial-trajectory training generalizes to full match
20
+ completion at eval time (same recipe as SWE-RL coding agents).
21
 
22
  tasks:
 
 
 
 
23
  - name: stage2_full
24
+ description: "Full T20 match (configurable max_overs 2-50). All 4 rubrics active. Composite reward: r_cricket (45%) + r_behavior (25%) + r_result (20%) + r_validity (10%)."
25
  difficulty: medium
26
 
27
  - name: eval_50over
28
+ description: "Full 50-over ODI. Evaluation benchmark — measures trained captaincy across the longest format (DLS par + chase pressure)."
29
  difficulty: hard
pyproject.toml CHANGED
@@ -15,15 +15,26 @@ dependencies = [
15
  "openai>=1.0.0",
16
  ]
17
 
 
 
 
 
 
 
 
18
  [project.optional-dependencies]
19
  train = [
20
- "trl>=0.24.0",
21
- "transformers>=4.50.0",
 
 
 
22
  "accelerate>=1.0.0",
23
  "datasets>=4.0.0",
24
  "bitsandbytes>=0.43.0",
25
- "mergekit>=0.1.0",
26
- "peft>=0.13.0",
 
27
  ]
28
  eval = [
29
  "matplotlib>=3.8.0",
 
15
  "openai>=1.0.0",
16
  ]
17
 
18
+ # Training extras — these are the versions that actually reconcile in 2026:
19
+ # transformers 5.6.2 ─┐
20
+ # trl 1.2.0 ├─ TRL multi-turn environment_factory needs transformers >=5.2,
21
+ # vllm 0.19.1 ┘ vLLM 0.19+ supports transformers 5, vLLM 0.18 does not.
22
+ # Earlier we tried vllm 0.11.x — it pinned transformers <5 and broke environment_factory.
23
+ # mergekit removed: pinned pydantic <2.11 which conflicts with openenv-core 0.2.3 (>=2.11.7).
24
+ # Not used by training anyway.
25
  [project.optional-dependencies]
26
  train = [
27
+ "torch==2.10.0",
28
+ "transformers==5.6.2",
29
+ "trl==1.2.0",
30
+ "vllm==0.19.1",
31
+ "peft>=0.13.0,<0.20.0",
32
  "accelerate>=1.0.0",
33
  "datasets>=4.0.0",
34
  "bitsandbytes>=0.43.0",
35
+ "wandb>=0.16",
36
+ # flash-attn is optional — vLLM has its own attention backends; uncomment if you want it:
37
+ # "flash-attn>=2.5.0",
38
  ]
39
  eval = [
40
  "matplotlib>=3.8.0",
scripts/eval_all_checkpoints.sh ADDED
@@ -0,0 +1,95 @@
1
+ #!/usr/bin/env bash
2
+ # Evaluate every saved checkpoint and find the best one.
3
+ #
4
+ # RL training is bumpy — the FINAL checkpoint isn't always the best policy.
5
+ # This script runs compare_eval.py against the baseline AND each checkpoint
6
+ # in ./checkpoints/, then picks whichever has the highest mean composite
7
+ # reward to be the "best trained" submission.
8
+ #
9
+ # Usage:
10
+ # bash scripts/eval_all_checkpoints.sh # 10 episodes each
11
+ # EVAL_EPISODES=20 bash scripts/eval_all_checkpoints.sh # more confidence
12
+
13
+ cd "$(dirname "$0")/.."
14
+ mkdir -p eval_results
15
+
16
+ EVAL_EPISODES="${EVAL_EPISODES:-10}"
17
+ EVAL_MAX_OVERS="${EVAL_MAX_OVERS:-5}"
18
+ BASE_MODEL="${BASE_MODEL:-Qwen/Qwen3.5-4B}"
19
+
20
+ # 1. Baseline (untrained)
21
+ if [ ! -f eval_results/baseline.json ]; then
22
+ echo "=== [baseline] running ==="
23
+ python compare_eval.py \
24
+ --model "$BASE_MODEL" \
25
+ --label baseline \
26
+ --episodes "$EVAL_EPISODES" \
27
+ --max-overs "$EVAL_MAX_OVERS" \
28
+ --opponent-mode heuristic \
29
+ --output eval_results/baseline.json
30
+ fi
31
+
32
+ # 2. Each checkpoint
33
+ for ckpt in ./checkpoints/stage*/checkpoint-* ./checkpoints/stage*_final; do
34
+ if [ -d "$ckpt" ]; then
35
+ # Verify it's a PEFT adapter dir (has adapter_config.json)
36
+ if [ ! -f "$ckpt/adapter_config.json" ]; then
37
+ echo "=== [$ckpt] no adapter_config.json — skip ==="
38
+ continue
39
+ fi
40
+ # Use a safe filename derived from path
41
+ label=$(echo "$ckpt" | sed 's|[/.]|_|g' | sed 's|^_*||')
42
+ out="eval_results/${label}.json"
43
+ if [ -f "$out" ]; then
44
+ echo "=== [$ckpt] already evaluated, reading $out ==="
45
+ continue
46
+ fi
47
+ echo "=== [$ckpt] evaluating ==="
48
+ python compare_eval.py \
49
+ --model "$BASE_MODEL" \
50
+ --adapter "$ckpt" \
51
+ --label "$label" \
52
+ --episodes "$EVAL_EPISODES" \
53
+ --max-overs "$EVAL_MAX_OVERS" \
54
+ --opponent-mode heuristic \
55
+ --output "$out"
56
+ fi
57
+ done
58
+
59
+ # 3. Pick the best one and run a final compare against baseline
60
+ echo ""
61
+ echo "=== picking best checkpoint by mean composite reward ==="
62
+ python - <<'PYEOF'
63
+ import json, glob
64
+
65
+ best = None
66
+ for path in glob.glob("eval_results/*.json"):
67
+ if path.endswith("baseline.json"):
68
+ continue
69
+ try:
70
+ with open(path) as f:
71
+ data = json.load(f)
72
+ score = data["summary"].get("mean_composite_reward", 0.0) or 0.0
73
+ win_rate = data["summary"].get("win_rate_overall", 0.0) or 0.0
74
+ # Composite ranking: composite reward + 0.5 * win_rate
75
+ composite_score = score + 0.5 * win_rate
76
+ print(f" {path:50s} composite={score:.4f} win_rate={win_rate:.4f} ranking={composite_score:.4f}")
77
+ if best is None or composite_score > best[0]:
78
+ best = (composite_score, path, data)
79
+ except Exception as e:
80
+ print(f" {path}: skip ({e})")
81
+
82
+ if best:
83
+ print(f"\nBEST: {best[1]} (composite_score={best[0]:.4f}, label={best[2].get('label')})")
84
+ print(f" adapter: {best[2].get('adapter')}")
85
+
86
+ # Save a symlink/copy as 'best.json' for easy reference
87
+ import shutil
88
+ shutil.copy(best[1], "eval_results/best_trained.json")
89
+ print(f"\nCopied to eval_results/best_trained.json")
90
+ PYEOF
91
+
92
+ echo ""
93
+ echo "=== final comparison: baseline vs best trained ==="
94
+ python compare_eval.py --compare eval_results/baseline.json eval_results/best_trained.json \
95
+ | tee eval_results/final_comparison.txt
scripts/generate_training_plots.py ADDED
@@ -0,0 +1,248 @@
1
+ """
2
+ Generate labeled PNG plots for the README from a WandB run OR from local
3
+ episode_stats.jsonl files.
4
+
5
+ Usage:
6
+ # From a WandB run id (preferred — uses the per-step rebalanced metrics)
7
+ python scripts/generate_training_plots.py \\
8
+ --wandb-run ptnv-s-research/huggingface/<RUN_ID> \\
9
+ --output-dir docs/plots/
10
+
11
+ # From local episode_stats.jsonl (faster, no API call)
12
+ python scripts/generate_training_plots.py \\
13
+ --jsonl logs/run_*/episode_stats.jsonl \\
14
+ --output-dir docs/plots/
15
+
16
+ Generates (with axis labels + units):
17
+ docs/plots/training_reward_over_steps.png
18
+ docs/plots/per_rubric_breakdown.png
19
+ docs/plots/tool_call_frequency.png
20
+ docs/plots/match_completion_rate.png
21
+ docs/plots/before_after_comparison.png (if --compare given)
22
+ """
23
+ import argparse
24
+ import glob
25
+ import json
26
+ import os
27
+ from pathlib import Path
28
+ from typing import Any
29
+
30
+ import matplotlib
31
+ matplotlib.use("Agg") # headless
32
+ import matplotlib.pyplot as plt
33
+
34
+
35
+ def _load_jsonl(path: str) -> list[dict[str, Any]]:
36
+ rows = []
37
+ paths = glob.glob(path) if "*" in path else [path]
38
+ for p in paths:
39
+ with open(p) as f:
40
+ for line in f:
41
+ line = line.strip()
42
+ if line:
43
+ try:
44
+ rows.append(json.loads(line))
45
+ except json.JSONDecodeError:
46
+ continue
47
+ return rows
48
+
49
+
50
+ def _load_wandb(run_path: str) -> tuple[list[dict[str, Any]], dict[str, Any]]:
51
+ """Returns (history, config). Requires `pip install wandb` and login."""
52
+ try:
53
+ import wandb
54
+ except ImportError:
55
+ raise RuntimeError("wandb not installed. pip install wandb")
56
+ api = wandb.Api()
57
+ run = api.run(run_path)
58
+ history = list(run.history(samples=10000))
59
+ return history, run.config
60
+
61
+
62
+ def plot_training_reward(history, out_dir: Path, label: str):
63
+ steps, rewards = [], []
64
+ for row in history:
65
+ if "rewards/environment_reward/mean" in row and row["rewards/environment_reward/mean"] is not None:
66
+ steps.append(row.get("_step", row.get("step", len(steps))))
67
+ rewards.append(row["rewards/environment_reward/mean"])
68
+ if not rewards:
69
+ print(" no environment_reward/mean found, skipping")
70
+ return
71
+ fig, ax = plt.subplots(figsize=(8, 4.5))
72
+ ax.plot(steps, rewards, marker="o", linewidth=1.5, markersize=4, color="#0066cc")
73
+ ax.set_xlabel("Training step (gradient updates)")
74
+ ax.set_ylabel("Mean environment reward (composite)")
75
+ ax.set_title(f"GRPO training reward over time — {label}")
76
+ ax.grid(alpha=0.3)
77
+ fig.tight_layout()
78
+ out_path = out_dir / "training_reward_over_steps.png"
79
+ fig.savefig(out_path, dpi=130)
80
+ plt.close(fig)
81
+ print(f" → {out_path}")
82
+
83
+
84
+ def plot_per_rubric_breakdown(history, out_dir: Path, label: str):
85
+ """Plot the per-step means of all 4 rubrics on one axes."""
86
+ rubrics = ("reward/composite_mean", "reward/r_result_mean",
87
+ "reward/r_cricket_mean", "reward/r_behavior_mean",
88
+ "reward/r_validity_mean")
89
+ series = {r: [] for r in rubrics}
90
+ steps_per = {r: [] for r in rubrics}
91
+ for row in history:
92
+ for r in rubrics:
93
+ if r in row and row[r] is not None:
94
+ series[r].append(row[r])
95
+ steps_per[r].append(row.get("_step", row.get("step", len(series[r]))))
96
+ if not any(series.values()):
97
+ print(" no per-rubric metrics found, skipping")
98
+ return
99
+ fig, ax = plt.subplots(figsize=(9, 5))
100
+ colors = {"reward/composite_mean": "#000",
101
+ "reward/r_result_mean": "#cc0000",
102
+ "reward/r_cricket_mean": "#0066cc",
103
+ "reward/r_behavior_mean": "#009900",
104
+ "reward/r_validity_mean": "#9900cc"}
105
+ for r in rubrics:
106
+ if series[r]:
107
+ ax.plot(steps_per[r], series[r], marker="o", markersize=3, linewidth=1.3,
108
+ label=r.replace("reward/", "").replace("_mean", ""),
109
+ color=colors[r])
110
+ ax.set_xlabel("Training step (gradient updates)")
111
+ ax.set_ylabel("Mean reward")
112
+ ax.set_title(f"Per-rubric reward breakdown — {label}")
113
+ ax.legend(loc="best", fontsize=9)
114
+ ax.grid(alpha=0.3)
115
+ fig.tight_layout()
116
+ out_path = out_dir / "per_rubric_breakdown.png"
117
+ fig.savefig(out_path, dpi=130)
118
+ plt.close(fig)
119
+ print(f" → {out_path}")
120
+
121
+
122
+ def plot_tool_call_frequency(history, out_dir: Path, label: str):
123
+ steps, freq = [], []
124
+ for row in history:
125
+ if "tools/call_frequency" in row and row["tools/call_frequency"] is not None:
126
+ steps.append(row.get("_step", row.get("step", len(steps))))
127
+ freq.append(row["tools/call_frequency"])
128
+ if not freq:
129
+ print(" no tools/call_frequency found, skipping")
130
+ return
131
+ fig, ax = plt.subplots(figsize=(8, 4.5))
132
+ ax.plot(steps, freq, marker="o", linewidth=1.5, markersize=4, color="#cc6600")
133
+ ax.set_xlabel("Training step (gradient updates)")
134
+ ax.set_ylabel("Mean tool calls per rollout")
135
+ ax.set_title(f"Tool-call execution frequency (proxy for match progress) — {label}")
136
+ ax.grid(alpha=0.3)
137
+ fig.tight_layout()
138
+ out_path = out_dir / "tool_call_frequency.png"
139
+ fig.savefig(out_path, dpi=130)
140
+ plt.close(fig)
141
+ print(f" → {out_path}")
142
+
143
+
144
+ def plot_completion_rate(history, out_dir: Path, label: str):
145
+ steps, rate = [], []
146
+ for row in history:
147
+ if "rollout/match_completion_rate" in row and row["rollout/match_completion_rate"] is not None:
148
+ steps.append(row.get("_step", row.get("step", len(steps))))
149
+ rate.append(row["rollout/match_completion_rate"])
150
+ if not rate:
151
+ print(" no match_completion_rate found, skipping")
152
+ return
153
+ fig, ax = plt.subplots(figsize=(8, 4.5))
154
+ ax.plot(steps, rate, marker="o", linewidth=1.5, markersize=4, color="#009966")
155
+ ax.set_xlabel("Training step (gradient updates)")
156
+ ax.set_ylabel("Match completion rate")
157
+ ax.set_ylim(0, 1.05)
158
+ ax.set_title(f"Fraction of rollouts that completed the full match — {label}")
159
+ ax.grid(alpha=0.3)
160
+ fig.tight_layout()
161
+ out_path = out_dir / "match_completion_rate.png"
162
+ fig.savefig(out_path, dpi=130)
163
+ plt.close(fig)
164
+ print(f" → {out_path}")
165
+
166
+
167
+ def plot_before_after(baseline_json: str, trained_json: str, out_dir: Path):
168
+ """Bar chart comparing baseline vs trained on key eval metrics."""
169
+ with open(baseline_json) as f:
170
+ b = json.load(f)
171
+ with open(trained_json) as f:
172
+ t = json.load(f)
173
+ bs, ts = b["summary"], t["summary"]
174
+ metrics = [
175
+ ("match_completion_rate", "Match\ncompletion rate"),
176
+ ("win_rate_overall", "Overall\nwin rate"),
177
+ ("mean_validity_rate", "Mean\nvalidity rate"),
178
+ ("mean_composite_reward", "Mean composite\nreward (scaled)"),
179
+ ]
180
+ bvals = [bs.get(k, 0) or 0 for k, _ in metrics]
181
+ tvals = [ts.get(k, 0) or 0 for k, _ in metrics]
182
+ labels = [lbl for _, lbl in metrics]
183
+
184
+ x = range(len(metrics))
185
+ fig, ax = plt.subplots(figsize=(9, 5))
186
+ width = 0.35
187
+ bars_b = ax.bar([xi - width/2 for xi in x], bvals, width, label="baseline (untrained)", color="#999")
188
+ bars_t = ax.bar([xi + width/2 for xi in x], tvals, width, label="trained (LoRA r=64)", color="#0066cc")
189
+
190
+ for bars in (bars_b, bars_t):
191
+ for bar in bars:
192
+ h = bar.get_height()
193
+ ax.text(bar.get_x() + bar.get_width()/2, h + 0.01,
194
+ f"{h:.2f}", ha="center", fontsize=8)
195
+
196
+ ax.set_xticks(list(x))
197
+ ax.set_xticklabels(labels)
198
+ ax.set_ylabel("Metric value")
199
+ ax.set_title(f"Before vs After training — {bs['n_episodes']} eval matches each")
200
+ ax.legend()
201
+ ax.grid(axis="y", alpha=0.3)
202
+ fig.tight_layout()
203
+ out_path = out_dir / "before_after_comparison.png"
204
+ fig.savefig(out_path, dpi=130)
205
+ plt.close(fig)
206
+ print(f" → {out_path}")
207
+
208
+
209
+ def main():
210
+ p = argparse.ArgumentParser()
211
+ p.add_argument("--wandb-run", default=None,
212
+ help="WandB run path: entity/project/run_id (e.g. ptnv-s-research/huggingface/abc123)")
213
+ p.add_argument("--jsonl", default=None,
214
+ help="Local episode_stats.jsonl path (or glob)")
215
+ p.add_argument("--output-dir", default="docs/plots",
216
+ help="Output directory for PNGs (default: docs/plots/)")
217
+ p.add_argument("--label", default="warmup", help="Label suffix for plot titles")
218
+ p.add_argument("--compare", nargs=2, metavar=("BASELINE_JSON", "TRAINED_JSON"),
219
+ help="Also generate before/after bar chart from two compare_eval JSON files")
220
+ args = p.parse_args()
221
+
222
+ out_dir = Path(args.output_dir)
223
+ out_dir.mkdir(parents=True, exist_ok=True)
224
+
225
+ history = []
226
+ if args.wandb_run:
227
+ print(f"Loading WandB run: {args.wandb_run}")
228
+ history, _ = _load_wandb(args.wandb_run)
229
+ print(f" {len(history)} history rows")
230
+ elif args.jsonl:
231
+ print(f"Loading local jsonl: {args.jsonl}")
232
+ history = _load_jsonl(args.jsonl)
233
+ print(f" {len(history)} rows")
234
+
235
+ if history:
236
+ plot_training_reward(history, out_dir, args.label)
237
+ plot_per_rubric_breakdown(history, out_dir, args.label)
238
+ plot_tool_call_frequency(history, out_dir, args.label)
239
+ plot_completion_rate(history, out_dir, args.label)
240
+
241
+ if args.compare:
242
+ plot_before_after(args.compare[0], args.compare[1], out_dir)
243
+
244
+ print(f"\nDone — PNGs in {out_dir}/")
245
+
246
+
247
+ if __name__ == "__main__":
248
+ main()
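For orientation, `plot_before_after` only reads the `summary` block of the two compare_eval JSON files. A minimal sketch of the shape it expects, with key names taken from the code above and placeholder numbers rather than real results:

```python
# Minimal shape of eval_results/baseline.json / eval_results/trained.json as
# consumed by plot_before_after. Only these keys are read; values are illustrative.
example_eval_json = {
    "summary": {
        "n_episodes": 15,                 # used in the chart title
        "match_completion_rate": 0.80,    # bar 1
        "win_rate_overall": 0.40,         # bar 2
        "mean_validity_rate": 0.95,       # bar 3
        "mean_composite_reward": 0.42,    # bar 4 (scaled)
    }
}
```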
scripts/run_full_pipeline.sh ADDED
@@ -0,0 +1,84 @@
1
+ #!/usr/bin/env bash
2
+ # Full pipeline: warmup → main → baseline eval → trained eval → comparison.
3
+ #
4
+ # This is the end-to-end deliverable:
5
+ # 1. Warmup (2-3 over curriculum, 30 steps) ~2 hr
6
+ # 2. Main run (5-over T20, 100 steps) ~6-8 hr
7
+ # 3. Baseline eval (untrained Qwen3.5-4B) ~30 min
8
+ # 4. Trained eval (warmup+main checkpoint) ~30 min
9
+ # 5. Print comparison table ~instant
10
+ #
11
+ # Outputs:
12
+ # - ./checkpoints/stage2_final/ trained LoRA adapter
13
+ # - eval_results/baseline.json baseline match stats
14
+ # - eval_results/trained.json trained match stats
15
+ # - /tmp/train_warmup.log warmup training log
16
+ # - /tmp/train_main.log main run training log
17
+ # - /tmp/eval_baseline.log baseline eval log
18
+ # - /tmp/eval_trained.log trained eval log
19
+ #
20
+ # Usage: bash scripts/run_full_pipeline.sh
21
+ # (or run individual stages as documented in the README)
22
+
23
+ # NOTE: deliberately NOT using `set -e`. We want to inspect each stage's exit
24
+ # code and decide whether to continue, not abort on first non-zero return.
25
+ cd "$(dirname "$0")/.."
26
+
27
+ export PYTORCH_ALLOC_CONF=expandable_segments:True
28
+
29
+ EVAL_EPISODES="${EVAL_EPISODES:-15}"
30
+ EVAL_MAX_OVERS="${EVAL_MAX_OVERS:-5}"
31
+
32
+ # --------------------------------------------------------------------------
33
+ # Stage 1+2: warmup → main (chained)
34
+ # --------------------------------------------------------------------------
35
+ echo "=== [$(date '+%H:%M:%S')] FULL PIPELINE: warmup → main → eval ==="
36
+ bash scripts/run_warmup_then_main.sh
37
+ PIPE_STATUS=$?
38
+ if [ $PIPE_STATUS -ne 0 ]; then
39
+ echo "!!! Training pipeline failed (exit $PIPE_STATUS). Skipping eval."
40
+ exit $PIPE_STATUS
41
+ fi
42
+
43
+ # --------------------------------------------------------------------------
44
+ # Stage 3: baseline eval (untrained Qwen3.5-4B)
45
+ # --------------------------------------------------------------------------
46
+ mkdir -p eval_results
47
+ echo "=== [$(date '+%H:%M:%S')] EVAL: baseline (untrained Qwen3.5-4B) ==="
48
+ python compare_eval.py \
49
+ --model Qwen/Qwen3.5-4B \
50
+ --label baseline \
51
+ --episodes "$EVAL_EPISODES" \
52
+ --max-overs "$EVAL_MAX_OVERS" \
53
+ --opponent-mode heuristic \
54
+ --output eval_results/baseline.json \
55
+ > /tmp/eval_baseline.log 2>&1
56
+
57
+ # --------------------------------------------------------------------------
58
+ # Stage 4: trained eval (warmup + main adapter)
59
+ # --------------------------------------------------------------------------
60
+ echo "=== [$(date '+%H:%M:%S')] EVAL: trained (LoRA from ./checkpoints/stage2_final) ==="
61
+ python compare_eval.py \
62
+ --model Qwen/Qwen3.5-4B \
63
+ --adapter ./checkpoints/stage2_final \
64
+ --label trained \
65
+ --episodes "$EVAL_EPISODES" \
66
+ --max-overs "$EVAL_MAX_OVERS" \
67
+ --opponent-mode heuristic \
68
+ --output eval_results/trained.json \
69
+ > /tmp/eval_trained.log 2>&1
70
+
71
+ # --------------------------------------------------------------------------
72
+ # Stage 5: comparison
73
+ # --------------------------------------------------------------------------
74
+ echo "=== [$(date '+%H:%M:%S')] COMPARISON ==="
75
+ python compare_eval.py --compare eval_results/baseline.json eval_results/trained.json \
76
+ | tee eval_results/comparison.txt
77
+
78
+ echo ""
79
+ echo "=== [$(date '+%H:%M:%S')] DONE ==="
80
+ echo "Trained adapter: ./checkpoints/stage2_final/"
81
+ echo "Eval JSON: eval_results/{baseline,trained}.json"
82
+ echo "Comparison: eval_results/comparison.txt"
83
+ echo "Training logs: /tmp/train_warmup.log, /tmp/train_main.log"
84
+ echo "Eval logs: /tmp/eval_baseline.log, /tmp/eval_trained.log"
scripts/run_warmup_then_main.sh ADDED
@@ -0,0 +1,43 @@
1
+ #!/usr/bin/env bash
2
+ # Chained run: warmup (2-3 over curriculum) → main (5-over T20).
3
+ # Main auto-resumes from warmup's LoRA adapter at ./checkpoints/stage2_final
4
+ # (set via configs/cricket_train_qwen3.yaml resume_from).
5
+ #
6
+ # Active configs target Qwen3-4B-Instruct-2507 + vLLM colocate. The legacy
7
+ # Qwen3.5-4B configs are archived in configs/extras/.
8
+ #
9
+ # Usage: bash scripts/run_warmup_then_main.sh
10
+ # Logs: /tmp/train_warmup.log then /tmp/train_main.log
11
+
12
+ # NOTE: deliberately NOT using `set -e`. We want to inspect the warmup exit
13
+ # code and decide whether to continue, not abort on first non-zero return.
14
+ cd "$(dirname "$0")/.."
15
+
16
+ export PYTORCH_ALLOC_CONF=expandable_segments:True
17
+ export TRL_EXPERIMENTAL_SILENCE=1
18
+
19
+ echo "=== [$(date '+%H:%M:%S')] WARMUP starting (2-3 over curriculum, 30 steps) ==="
20
+ python train.py train --config configs/cricket_train_qwen3_warmup.yaml \
21
+ > /tmp/train_warmup.log 2>&1
22
+
23
+ WARMUP_EXIT=$?
24
+ if [ $WARMUP_EXIT -ne 0 ]; then
25
+ echo "!!! WARMUP failed with exit $WARMUP_EXIT — see /tmp/train_warmup.log"
26
+ echo " Skipping main run."
27
+ exit $WARMUP_EXIT
28
+ fi
29
+
30
+ # Sanity: confirm checkpoint exists before launching main.
31
+ if [ ! -d ./checkpoints/stage2_final ]; then
32
+ echo "!!! Expected ./checkpoints/stage2_final not found after warmup."
33
+ echo " Main run wants to resume from there — aborting."
34
+ exit 1
35
+ fi
36
+
37
+ echo "=== [$(date '+%H:%M:%S')] WARMUP done — adapter saved to ./checkpoints/stage2_final ==="
38
+ echo "=== [$(date '+%H:%M:%S')] MAIN starting (5-over, 100 steps, resuming from warmup) ==="
39
+
40
+ python train.py train --config configs/cricket_train_qwen3.yaml \
41
+ > /tmp/train_main.log 2>&1
42
+
43
+ echo "=== [$(date '+%H:%M:%S')] MAIN done ==="
server/coherence_grader.py CHANGED
@@ -64,37 +64,44 @@ def bowling_coherence_score(
64
  ) -> float:
65
  """
66
  Grade bowling strategy coherence.
67
- Weights: 40% rationale specificity + 40% line/length/field logic + 20% phase appropriateness.
 
 
 
68
  """
69
  if not bowling_strategy:
70
  return 0.0
71
-
72
  rationale = bowling_strategy.get("rationale", "")
73
  r_spec = rationale_specificity(rationale)
74
-
75
- # Simple logic: Aggressive field with attacking line/length gets high score
76
- line = bowling_strategy.get("line", "outside off")
77
- length = bowling_strategy.get("length", "good length")
78
-
79
- logic_score = 0.5
 
 
 
 
 
80
  if field_setting == "Aggressive":
81
- if line in ["stumps", "on pads"] or length in ["bouncer", "short"]: # Attacking
82
- logic_score = 1.0
83
  elif field_setting == "Defensive":
84
- if line in ["outside off", "wide"] or length in ["yorker", "full"]: # Defensive
85
- logic_score = 1.0
86
- else:
87
- logic_score = 0.8 # Balanced is generally safe
88
-
89
- # Phase appropriate (e.g., spin in middle overs)
90
- p_approp = 1.0
91
  bowler_type = bowling_strategy.get("bowler_type", "pace")
92
  if phase == "middle" and bowler_type == "spin":
93
  p_approp = 1.0
94
- elif phase == "powerplay" and bowler_type == "pace":
95
  p_approp = 1.0
96
-
97
- score = 0.40 * r_spec + 0.40 * logic_score + 0.20 * p_approp
 
 
98
  return round(score, 4)
99
 
100
 
 
64
  ) -> float:
65
  """
66
  Grade bowling strategy coherence.
67
+ Weights (from game_knowledge.yaml): 40% rationale + 30% field logic + 30% phase fit.
68
+
69
+ Line/length values must already be normalized (normalize_line / normalize_length
70
+ from field_model.py) — e.g. "pads" not "on pads", "outside_off" not "outside off".
71
  """
72
  if not bowling_strategy:
73
  return 0.0
74
+
75
  rationale = bowling_strategy.get("rationale", "")
76
  r_spec = rationale_specificity(rationale)
77
+
78
+ line = bowling_strategy.get("line", "outside_off")
79
+ length = bowling_strategy.get("length", "good")
80
+
81
+ # Attacking plan: attack the stumps/pads with short/full threatening lengths
82
+ _ATTACKING_LINES = {"stumps", "pads"}
83
+ _ATTACKING_LENGTHS = {"bouncer", "short", "yorker"}
84
+ # Containing plan: bowl wide or full to restrict scoring
85
+ _DEFENSIVE_LINES = {"outside_off", "wide"}
86
+ _DEFENSIVE_LENGTHS = {"yorker", "full"}
87
+
88
  if field_setting == "Aggressive":
89
+ logic_score = 1.0 if (line in _ATTACKING_LINES or length in _ATTACKING_LENGTHS) else 0.5
 
90
  elif field_setting == "Defensive":
91
+ logic_score = 1.0 if (line in _DEFENSIVE_LINES or length in _DEFENSIVE_LENGTHS) else 0.5
92
+ else: # Balanced
93
+ logic_score = 0.8
94
+
95
+ # Phase appropriateness: spin in middle, pace in powerplay/death
 
 
96
  bowler_type = bowling_strategy.get("bowler_type", "pace")
97
  if phase == "middle" and bowler_type == "spin":
98
  p_approp = 1.0
99
+ elif phase in {"powerplay", "death"} and bowler_type == "pace":
100
  p_approp = 1.0
101
+ else:
102
+ p_approp = 0.6
103
+
104
+ score = 0.40 * r_spec + 0.30 * logic_score + 0.30 * p_approp
105
  return round(score, 4)
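As a quick sanity check on the revised 40/30/30 split, here is an illustrative trace; the `rationale_specificity` value of 0.5 is an assumed placeholder, not a value the grader is guaranteed to return:

```python
# Illustrative trace of the revised bowling_coherence_score weighting.
bowling_strategy = {
    "line": "wide",        # already normalized via normalize_line
    "length": "yorker",    # already normalized via normalize_length
    "bowler_type": "pace",
    "rationale": "Wide yorkers at the death to protect the off-side boundary.",
}
# Suppose rationale_specificity(rationale) -> 0.5 (placeholder).
# field_setting="Defensive": "wide" is a defensive line  -> logic_score = 1.0
# phase="death" with a pace bowler                       -> p_approp   = 1.0
# score = 0.40 * 0.5 + 0.30 * 1.0 + 0.30 * 1.0 = 0.80
```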
106
 
107
 
server/cricket_environment.py CHANGED
@@ -163,7 +163,7 @@ class CricketEnvironment(Environment):
163
  start_wickets = self._rng.randint(0, 9)
164
  start_score = int(start_over * self._rng.uniform(5.5, 8.5))
165
 
166
- start_phase = over_to_phase(start_over)
167
  start_bowler = sample_bowler_type(start_phase, self._rng)
168
 
169
  self._state = CricketState(
@@ -206,6 +206,20 @@ class CricketEnvironment(Environment):
206
  # Load roster for the agent's team
207
  agent_team = options.get("agent_team", os.environ.get("CRICKET_AGENT_TEAM", "india"))
208
  self._agent_roster = load_team_roster(agent_team)
 
 
209
  # Reset tool budget
210
  self._overhead_calls_this_over = 0
211
  self._total_tool_fines = 0.0
@@ -427,6 +441,23 @@ class CricketEnvironment(Environment):
427
  if shot_intent not in VALID_SHOT_INTENTS:
428
  self._format_violations += 1
429
  shot_intent = "defensive"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
430
  context = self._context_for_policy()
431
  self._opponent_plan = self._opponent.bowling_plan(context)
432
  self._state.opponent_plan = self._opponent_plan
@@ -559,6 +590,23 @@ class CricketEnvironment(Environment):
559
  return self._build_obs(last_ball=f"Field set to {setting}.")
560
 
561
  def _handle_bowl_delivery(self, args: dict) -> CricketObservation:
 
 
 
562
  context = self._context_for_policy()
563
  self._opponent_plan = self._opponent.batting_plan(context)
564
  self._state.opponent_plan = self._opponent_plan
@@ -599,6 +647,7 @@ class CricketEnvironment(Environment):
599
  field_setting=self._field_setting,
600
  wickets=self._state.wickets_lost,
601
  score=self._state.total_score,
 
602
  )
603
  metadata = {"event_type": "base_outcome", "target_area": normalize_target_area("", shot_intent)}
604
  return self._process_delivery(runs, wicket, extra, shot_intent, dismissal_type, metadata)
@@ -843,7 +892,7 @@ class CricketEnvironment(Environment):
843
  return round(max(-0.15, min(0.15, reward)), 4)
844
 
845
  def _update_phase(self):
846
- new_phase = over_to_phase(self._state.over)
847
  if new_phase != self._state.phase:
848
  self._state.phase = new_phase
849
  self._bowler_type = sample_bowler_type(new_phase, self._rng)
@@ -1083,8 +1132,8 @@ class CricketEnvironment(Environment):
1083
  if wicket:
1084
  return 0.0
1085
  if self._state.game_state == GameState.BATTING:
1086
- expected = self._engine.expected_runs(self._state.over, shot_intent, self._bowler_type)
1087
- baseline = max(self._engine.expected_runs(self._state.over, shot, self._bowler_type) for shot in SHOT_INTENTS)
1088
  return round(min(1.0, expected / max(baseline, 0.1)), 4)
1089
  # For bowling, lower conceded runs are better.
1090
  return round(max(0.0, 1.0 - (runs / 6.0)), 4)
@@ -1189,7 +1238,10 @@ def _render_prompt(ctx, batting_strat, bowling_strat, shot_plan, delivery_plan,
1189
  budget = 3
1190
  budget_str = f"Tool budget: {overhead_used}/{budget} overhead calls used this over"
1191
  if overhead_used >= budget:
1192
- budget_str += " ⚠ BUDGET EXHAUSTED — further set_strategy/plan_shot/analyze/reflect calls will be FINED"
 
 
 
1193
  lines.append(budget_str)
1194
  # Opponent plan intentionally NOT shown — agent must infer via analyze_situation
1195
 
 
163
  start_wickets = self._rng.randint(0, 9)
164
  start_score = int(start_over * self._rng.uniform(5.5, 8.5))
165
 
166
+ start_phase = over_to_phase(start_over, max_overs)
167
  start_bowler = sample_bowler_type(start_phase, self._rng)
168
 
169
  self._state = CricketState(
 
206
  # Load roster for the agent's team
207
  agent_team = options.get("agent_team", os.environ.get("CRICKET_AGENT_TEAM", "india"))
208
  self._agent_roster = load_team_roster(agent_team)
209
+ if self._agent_roster:
210
+ playing_xi = build_playing_xi(self._agent_roster)
211
+ batters = [p for p in playing_xi if p.get("role") != "bowler"]
212
+ bowlers = [p for p in playing_xi if p.get("bowler_type")]
213
+ if len(batters) >= 2:
214
+ self._current_batter = batter_profile_from_player(batters[0])
215
+ self._non_striker = batter_profile_from_player(batters[1])
216
+ elif len(playing_xi) >= 2:
217
+ self._current_batter = batter_profile_from_player(playing_xi[0])
218
+ self._non_striker = batter_profile_from_player(playing_xi[1])
219
+ matching_bowler = next((p for p in bowlers if p.get("bowler_type") == self._bowler_type), None)
220
+ if matching_bowler or bowlers:
221
+ self._current_bowler = bowler_profile_from_player(matching_bowler or bowlers[0])
222
+ self._bowler_type = self._current_bowler["type"]
223
  # Reset tool budget
224
  self._overhead_calls_this_over = 0
225
  self._total_tool_fines = 0.0
 
441
  if shot_intent not in VALID_SHOT_INTENTS:
442
  self._format_violations += 1
443
  shot_intent = "defensive"
444
+ # Inline shot-plan capture: if the model passes target_area/risk/trajectory/
445
+ # rationale directly to play_delivery (the new "execute-first" pattern that
446
+ # collapses plan_shot+play_delivery into a single turn), update the shot plan
447
+ # and score adaptation/opp_awareness here so we don't lose the reward signal.
448
+ has_plan_args = any(args.get(k) for k in ("target_area", "risk", "trajectory", "rationale", "explanation"))
449
+ if has_plan_args:
450
+ target_area = normalize_target_area(args.get("target_area", "gaps"), shot_intent)
451
+ risk = str(args.get("risk", "balanced")).lower()
452
+ self._shot_plan = {
453
+ "shot_intent": shot_intent,
454
+ "target_area": target_area,
455
+ "trajectory": infer_trajectory(shot_intent, risk, args.get("trajectory")),
456
+ "risk": risk,
457
+ "rationale": str(args.get("rationale", args.get("explanation", ""))),
458
+ }
459
+ self._score_adaptation(self._shot_plan)
460
+ self._score_opponent_awareness(self._shot_plan)
461
  context = self._context_for_policy()
462
  self._opponent_plan = self._opponent.bowling_plan(context)
463
  self._state.opponent_plan = self._opponent_plan
 
590
  return self._build_obs(last_ball=f"Field set to {setting}.")
591
 
592
  def _handle_bowl_delivery(self, args: dict) -> CricketObservation:
593
+ # Inline delivery-plan capture: if the model passes line/length/delivery_type
594
+ # directly to bowl_delivery (the new "execute-first" pattern that collapses
595
+ # plan_delivery+bowl_delivery into a single turn), update the delivery plan
596
+ # and score adaptation/opp_awareness here so we don't lose the reward signal.
597
+ has_plan_args = any(args.get(k) for k in ("line", "length", "delivery_type", "rationale"))
598
+ if has_plan_args:
599
+ current_type = str(self._current_bowler.get("type", self._bowler_type)).lower()
600
+ self._delivery_plan = {
601
+ "bowler_type": current_type,
602
+ "line": normalize_line(args.get("line", "outside off")),
603
+ "length": normalize_length(args.get("length", "good length")),
604
+ "delivery_type": normalize_variation(args.get("delivery_type", "stock"), current_type),
605
+ "rationale": str(args.get("rationale", "")),
606
+ }
607
+ self._bowling_strategy = dict(self._delivery_plan)
608
+ self._score_adaptation(self._delivery_plan)
609
+ self._score_opponent_awareness(self._delivery_plan)
610
  context = self._context_for_policy()
611
  self._opponent_plan = self._opponent.batting_plan(context)
612
  self._state.opponent_plan = self._opponent_plan
 
647
  field_setting=self._field_setting,
648
  wickets=self._state.wickets_lost,
649
  score=self._state.total_score,
650
+ max_overs=self._state.max_overs,
651
  )
652
  metadata = {"event_type": "base_outcome", "target_area": normalize_target_area("", shot_intent)}
653
  return self._process_delivery(runs, wicket, extra, shot_intent, dismissal_type, metadata)
 
892
  return round(max(-0.15, min(0.15, reward)), 4)
893
 
894
  def _update_phase(self):
895
+ new_phase = over_to_phase(self._state.over, self._state.max_overs)
896
  if new_phase != self._state.phase:
897
  self._state.phase = new_phase
898
  self._bowler_type = sample_bowler_type(new_phase, self._rng)
 
1132
  if wicket:
1133
  return 0.0
1134
  if self._state.game_state == GameState.BATTING:
1135
+ expected = self._engine.expected_runs(self._state.over, shot_intent, self._bowler_type, self._state.max_overs)
1136
+ baseline = max(self._engine.expected_runs(self._state.over, shot, self._bowler_type, self._state.max_overs) for shot in SHOT_INTENTS)
1137
  return round(min(1.0, expected / max(baseline, 0.1)), 4)
1138
  # For bowling, lower conceded runs are better.
1139
  return round(max(0.0, 1.0 - (runs / 6.0)), 4)
 
1238
  budget = 3
1239
  budget_str = f"Tool budget: {overhead_used}/{budget} overhead calls used this over"
1240
  if overhead_used >= budget:
1241
+ budget_str += (
1242
+ " ⚠ BUDGET EXHAUSTED — further set_strategy, set_bowling_strategy, plan_delivery, "
1243
+ "analyze_situation, or reflect_after_ball calls will be FINED"
1244
+ )
1245
  lines.append(budget_str)
1246
  # Opponent plan intentionally NOT shown — agent must infer via analyze_situation
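The intent of the inline-capture change is easiest to see from the agent's side: a single `play_delivery` call can now carry the plan fields that previously required a separate `plan_shot` turn, and the environment backfills `self._shot_plan` from them (the same pattern applies to `bowl_delivery`). A sketch of such a call, with argument values chosen purely for illustration:

```python
# One-turn "execute-first" batting action: plan fields ride along with
# play_delivery, so adaptation and opponent-awareness scoring still fire.
action = {
    "tool": "play_delivery",
    "arguments": {
        "shot_intent": "rotate",
        "target_area": "midwicket",
        "risk": "low",
        "trajectory": "ground",
        "rationale": "Spinner angling into the pads; work it square and keep the strike moving.",
    },
}
```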
1247
 
server/markov_engine.py CHANGED
@@ -65,12 +65,19 @@ BOWLER_TYPES = ["pace", "spin"]
65
  EXTRAS_RATE = 0.05
66
 
67
 
68
- def over_to_phase(over: int) -> str:
69
- if over <= 5:
70
- return "powerplay"
71
- if over <= 35:
72
- return "middle"
73
- return "death"
 
 
 
 
 
 
 
74
 
75
 
76
  def sample_bowler_type(phase: str, rng: random.Random) -> str:
@@ -118,6 +125,7 @@ class MarkovCricketEngine:
118
  score: int = 0,
119
  bowler_type: str = "pace",
120
  field_setting: str = "Balanced",
 
121
  ) -> tuple[int, bool, bool, str]:
122
  """Sample an outcome for one delivery.
123
 
@@ -131,7 +139,7 @@ class MarkovCricketEngine:
131
  if self._rng.random() < EXTRAS_RATE:
132
  return 1, False, True, ""
133
 
134
- phase = over_to_phase(over)
135
 
136
  if self._cricsheet is not None:
137
  runs, wicket = self._cricsheet_step(over, wickets, score, phase, bowler_type, shot_intent)
@@ -214,6 +222,7 @@ class MarkovCricketEngine:
214
  score=score,
215
  bowler_type=bowler_type,
216
  field_setting="Balanced",
 
217
  )
218
  if extra:
219
  metadata.update({"event_type": "wide", "fielder_effect": "none", "base_runs": runs, "base_wicket": wicket})
@@ -274,12 +283,13 @@ class MarkovCricketEngine:
274
  field_setting: str = "Balanced",
275
  wickets: int = 0,
276
  score: int = 0,
 
277
  ) -> tuple[int, bool, bool, str]:
278
  """Simulate an AI batter faced with the agent's bowling/fielding.
279
 
280
  Returns (runs, wicket, extra, shot_intent).
281
  """
282
- phase = over_to_phase(over)
283
 
284
  # Decide AI batter's shot intent based on state and phase
285
  # Aggression increases in death overs or with wickets in hand
@@ -311,15 +321,16 @@ class MarkovCricketEngine:
311
  wickets=wickets,
312
  score=score,
313
  bowler_type=bowler_type,
314
- field_setting=field_setting
 
315
  )
316
 
317
  return runs, wicket, extra, shot_intent, dismissal_type
318
 
319
- def expected_runs(self, over: int, shot_intent: str, bowler_type: str = "pace") -> float:
320
  if shot_intent not in SHOT_AGGRESSION:
321
  return 0.0
322
- phase = over_to_phase(over)
323
  if self._cricsheet:
324
  dist = self._get_cricsheet_dist(over, 3, 15, phase, bowler_type, shot_intent)
325
  if dist:
@@ -327,10 +338,10 @@ class MarkovCricketEngine:
327
  dist = self._synthetic[shot_intent][phase]
328
  return sum(r * p for r, _, p in dist)
329
 
330
- def wicket_probability(self, over: int, shot_intent: str, bowler_type: str = "pace") -> float:
331
  if shot_intent not in SHOT_AGGRESSION:
332
  return 0.0
333
- phase = over_to_phase(over)
334
  if self._cricsheet:
335
  dist = self._get_cricsheet_dist(over, 3, 15, phase, bowler_type, shot_intent)
336
  if dist:
 
65
  EXTRAS_RATE = 0.05
66
 
67
 
68
+ def over_to_phase(over: int, max_overs: int | None = None) -> str:
69
+ """Return the phase label for a given over, respecting the match format.
70
+
71
+ Without max_overs the old hardcoded thresholds (designed for ODI) would
72
+ leave T20 overs 16-19 classified as "middle" instead of "death". We now
73
+ delegate to format_mapper.get_phase which reads the correct phase windows
74
+ from data/format_rules.json.
75
+ """
76
+ try:
77
+ from server.format_mapper import get_phase
78
+ except ImportError:
79
+ from .format_mapper import get_phase
80
+ return get_phase(over, max_overs)
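The practical effect of threading `max_overs` through is that phase labels now track the match format rather than the old fixed cutoffs. A hedged sketch of the intended behaviour: the exact windows live in data/format_rules.json, which is not part of this diff, so the expected outputs below are assumptions rather than asserted values.

```python
from server.markov_engine import over_to_phase

# With the old hardcoded thresholds, over 17 of a T20 was labelled "middle".
# With format-aware phases it should resolve to "death" for a 20-over match.
print(over_to_phase(17, max_overs=20))  # expected: "death"     (assumed window)
print(over_to_phase(3, max_overs=20))   # expected: "powerplay" (assumed window)
print(over_to_phase(17))                # no format given: falls back to the default windows
```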
81
 
82
 
83
  def sample_bowler_type(phase: str, rng: random.Random) -> str:
 
125
  score: int = 0,
126
  bowler_type: str = "pace",
127
  field_setting: str = "Balanced",
128
+ max_overs: int | None = None,
129
  ) -> tuple[int, bool, bool, str]:
130
  """Sample an outcome for one delivery.
131
 
 
139
  if self._rng.random() < EXTRAS_RATE:
140
  return 1, False, True, ""
141
 
142
+ phase = over_to_phase(over, max_overs)
143
 
144
  if self._cricsheet is not None:
145
  runs, wicket = self._cricsheet_step(over, wickets, score, phase, bowler_type, shot_intent)
 
222
  score=score,
223
  bowler_type=bowler_type,
224
  field_setting="Balanced",
225
+ max_overs=max_overs,
226
  )
227
  if extra:
228
  metadata.update({"event_type": "wide", "fielder_effect": "none", "base_runs": runs, "base_wicket": wicket})
 
283
  field_setting: str = "Balanced",
284
  wickets: int = 0,
285
  score: int = 0,
286
+ max_overs: int | None = None,
287
  ) -> tuple[int, bool, bool, str]:
288
  """Simulate an AI batter faced with the agent's bowling/fielding.
289
 
290
  Returns (runs, wicket, extra, shot_intent).
291
  """
292
+ phase = over_to_phase(over, max_overs)
293
 
294
  # Decide AI batter's shot intent based on state and phase
295
  # Aggression increases in death overs or with wickets in hand
 
321
  wickets=wickets,
322
  score=score,
323
  bowler_type=bowler_type,
324
+ field_setting=field_setting,
325
+ max_overs=max_overs,
326
  )
327
 
328
  return runs, wicket, extra, shot_intent, dismissal_type
329
 
330
+ def expected_runs(self, over: int, shot_intent: str, bowler_type: str = "pace", max_overs: int | None = None) -> float:
331
  if shot_intent not in SHOT_AGGRESSION:
332
  return 0.0
333
+ phase = over_to_phase(over, max_overs)
334
  if self._cricsheet:
335
  dist = self._get_cricsheet_dist(over, 3, 15, phase, bowler_type, shot_intent)
336
  if dist:
 
338
  dist = self._synthetic[shot_intent][phase]
339
  return sum(r * p for r, _, p in dist)
340
 
341
+ def wicket_probability(self, over: int, shot_intent: str, bowler_type: str = "pace", max_overs: int | None = None) -> float:
342
  if shot_intent not in SHOT_AGGRESSION:
343
  return 0.0
344
+ phase = over_to_phase(over, max_overs)
345
  if self._cricsheet:
346
  dist = self._get_cricsheet_dist(over, 3, 15, phase, bowler_type, shot_intent)
347
  if dist:
server/reward_calculator.py CHANGED
@@ -130,36 +130,47 @@ def compute_episode_reward(
130
  chase_progress = total_score / max(target, 1)
131
  wicket_penalty = wickets_lost * 0.08
132
  if total_score >= target:
133
- outcome_bonus = 1.0
134
  elif total_score == target - 1:
135
- outcome_bonus = 0.35
136
  else:
137
- outcome_bonus = 0.0
138
  r_cric = chase_progress + outcome_bonus - wicket_penalty
139
  else:
140
  # Bowling to defend: reward keeping opponent below target.
141
  defense_margin = max(target - total_score, 0) / max(target, 1)
142
  wicket_pressure = wickets_lost * 0.08
143
  if total_score < target - 1:
144
- outcome_bonus = 1.0
145
  elif total_score == target - 1:
146
- outcome_bonus = 0.35
147
  else:
148
- outcome_bonus = 0.0
149
  r_cric = defense_margin + wicket_pressure + outcome_bonus
150
  elif game_state == "batting":
151
  r_cric = (total_score / max(dls_par, 1.0)) - (wickets_lost * 0.08)
152
  else:
153
  # Bowling first innings: reward conceding fewer runs than DLS par.
154
- r_cric = max(0.0, (dls_par - total_score) / max(dls_par, 1.0))
155
- r_cric = max(-1.5, min(2.5, r_cric))
 
 
 
 
 
 
156
 
157
  # r_cricket: dense per-ball position signal via Dream11 proxy.
158
  # Normalised per innings then averaged so two-innings totals stay in [0, 1].
 
 
 
159
  if dream11_scores:
160
  r_dream11 = mean(normalize_dream11(s) for s in dream11_scores)
 
161
  else:
162
  r_dream11 = 0.0
 
163
 
164
  # Load weights from game_knowledge.yaml (cached after first load).
165
  w = get_reward_weights() if get_reward_weights is not None else None
@@ -179,12 +190,21 @@ def compute_episode_reward(
179
  r_tools = compute_tool_efficiency(tool_calls_made, analyze_calls, overs_played)
180
 
181
  eff_behavior_w = (w.r_behavior if w else 0.15) * coherence_weight_ramp
 
 
 
 
 
 
 
 
 
182
 
183
  composite = (
184
- (w.r_result if w else 0.55) * r_cric
185
- + (w.r_cricket if w else 0.25) * r_dream11
186
  + eff_behavior_w * r_strategy
187
- + (w.r_validity if w else 0.05) * r_format
188
  )
189
 
190
  return {
 
130
  chase_progress = total_score / max(target, 1)
131
  wicket_penalty = wickets_lost * 0.08
132
  if total_score >= target:
133
+ outcome_bonus = 1.0 # win
134
  elif total_score == target - 1:
135
+ outcome_bonus = 0.0 # tie — neutral, no consolation
136
  else:
137
+ outcome_bonus = -1.0 # loss — explicit negative signal
138
  r_cric = chase_progress + outcome_bonus - wicket_penalty
139
  else:
140
  # Bowling to defend: reward keeping opponent below target.
141
  defense_margin = max(target - total_score, 0) / max(target, 1)
142
  wicket_pressure = wickets_lost * 0.08
143
  if total_score < target - 1:
144
+ outcome_bonus = 1.0 # win (defended)
145
  elif total_score == target - 1:
146
+ outcome_bonus = 0.0 # tie
147
  else:
148
+ outcome_bonus = -1.0 # loss (target chased)
149
  r_cric = defense_margin + wicket_pressure + outcome_bonus
150
  elif game_state == "batting":
151
  r_cric = (total_score / max(dls_par, 1.0)) - (wickets_lost * 0.08)
152
  else:
153
  # Bowling first innings: reward conceding fewer runs than DLS par.
154
+ # Allow negative when conceding above par (was clamped to ≥0; now signed).
155
+ r_cric = (dls_par - total_score) / max(dls_par, 1.0)
156
+ # Progress bonus: small reward for actually executing balls instead of getting
157
+ # stuck in a planning loop. Reduced cap (0.25 → 0.10) so it doesn't drown out
158
+ # the loss penalty. Caps at +0.10 once the agent makes >=10 tool calls.
159
+ progress_bonus = min(0.10, tool_calls_made / 100.0)
160
+ r_cric = r_cric + progress_bonus
161
+ r_cric = max(-2.0, min(2.5, r_cric))
162
 
163
  # r_cricket: dense per-ball position signal via Dream11 proxy.
164
  # Normalised per innings then averaged so two-innings totals stay in [0, 1].
165
+ # When no innings has completed yet, dream11_scores is empty: redistribute
166
+ # its weight to r_result so the composite stays in the same [0, 1] range
167
+ # instead of silently capping at 0.75.
168
  if dream11_scores:
169
  r_dream11 = mean(normalize_dream11(s) for s in dream11_scores)
170
+ r_cricket_available = True
171
  else:
172
  r_dream11 = 0.0
173
+ r_cricket_available = False
174
 
175
  # Load weights from game_knowledge.yaml (cached after first load).
176
  w = get_reward_weights() if get_reward_weights is not None else None
 
190
  r_tools = compute_tool_efficiency(tool_calls_made, analyze_calls, overs_played)
191
 
192
  eff_behavior_w = (w.r_behavior if w else 0.15) * coherence_weight_ramp
193
+ r_result_w = w.r_result if w else 0.55
194
+ r_cricket_w = w.r_cricket if w else 0.25
195
+ r_validity_w = w.r_validity if w else 0.05
196
+
197
+ # If no innings has completed, fold r_cricket weight into r_result so the
198
+ # composite ceiling stays at 1.0 and gradients are not systematically suppressed.
199
+ if not r_cricket_available:
200
+ r_result_w = r_result_w + r_cricket_w
201
+ r_cricket_w = 0.0
202
 
203
  composite = (
204
+ r_result_w * r_cric
205
+ + r_cricket_w * r_dream11
206
  + eff_behavior_w * r_strategy
207
+ + r_validity_w * r_format
208
  )
209
 
210
  return {
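A quick numeric check of the new outcome_bonus and the weight redistribution, using the documented default weights (0.55 / 0.25 / 0.15 / 0.05); the match numbers are illustrative:

```python
# Chasing 160 and finishing 150/6: the old scheme gave outcome_bonus = 0.0,
# the new scheme makes the loss explicit.
target, total_score, wickets_lost = 160, 150, 6
chase_progress = total_score / max(target, 1)              # 0.9375
outcome_bonus = -1.0                                       # loss
wicket_penalty = wickets_lost * 0.08                       # 0.48
r_cric = chase_progress + outcome_bonus - wicket_penalty   # ~ -0.54, inside [-2.0, 2.5]

# Weight redistribution when no innings has completed (dream11_scores empty):
r_result_w, r_cricket_w = 0.55, 0.25
r_result_w, r_cricket_w = r_result_w + r_cricket_w, 0.0    # -> 0.80 and 0.0
```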
train.py CHANGED
@@ -2,26 +2,32 @@
2
  MT-GRPO training script for CricketCaptain-LLM.
3
 
4
  Two-stage curriculum (ToolRL-style):
5
- Stage 1 (steps 0–N): format mastery — reward only valid JSON
6
- Stage 2 (steps N–M): full 4-rubric reward coherence + cricket score + tools
7
 
8
  Design:
9
- - Prompts are collected by running the CricketEnvironment directly (no server needed)
10
- - GRPOTrainer generates its own completions and calls our stateless reward_fn
11
- - reward_fn(prompts, completions, **kwargs) no shared env state required
12
- - Plain TRL + transformers + bitsandbytes (no Unsloth)
13
-
14
- Usage:
15
- python train.py --stage 1 --steps 200 --model Qwen/Qwen2.5-7B-Instruct
16
- python train.py --stage 2 --steps 600 --model ./checkpoints/stage1_final
 
 
 
17
  """
18
 
19
  import argparse
 
 
20
  import json
21
  import os
22
  import random
23
  import re
24
  import sys
 
25
  import time
26
  from pathlib import Path
27
  from typing import Any
@@ -32,12 +38,16 @@ from typing import Any
32
  try:
33
  import torch
34
  from datasets import Dataset
 
35
  from trl import GRPOConfig, GRPOTrainer
36
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
37
  _TRAIN_IMPORTS_AVAILABLE = True
38
  except ImportError:
39
  torch = None
40
  Dataset = None
 
 
 
41
  GRPOConfig = None
42
  GRPOTrainer = None
43
  AutoModelForCausalLM = None
@@ -49,12 +59,14 @@ try:
49
  from server.cricket_environment import CricketEnvironment
50
  from server.coherence_grader import aggression_match, phase_appropriate, rationale_specificity
51
  from server.markov_engine import SHOT_AGGRESSION
 
52
  from models import CricketAction
53
  from config_yaml import get_game_constants, get_reward_weights
54
  except ImportError:
55
  from cricket_captain.server.cricket_environment import CricketEnvironment
56
  from cricket_captain.server.coherence_grader import aggression_match, phase_appropriate, rationale_specificity
57
  from cricket_captain.server.markov_engine import SHOT_AGGRESSION
 
58
  from cricket_captain.models import CricketAction
59
  from cricket_captain.config_yaml import get_game_constants, get_reward_weights
60
 
@@ -106,37 +118,172 @@ def extract_phase_from_prompt(prompt: str) -> str:
106
  # Per-turn reward components (all stateless) #
107
  # ------------------------------------------------------------------ #
108
 
 
 
 
 
109
  def _parse_completion(raw: str) -> dict | None:
 
 
 
110
  raw = raw.strip()
 
 
 
 
 
 
 
111
  if raw.startswith("```"):
112
  lines = raw.split("\n")
113
  raw = "\n".join(lines[1:-1]) if len(lines) > 2 else raw
 
 
114
  try:
115
  return json.loads(raw)
116
  except (json.JSONDecodeError, ValueError):
 
 
 
 
117
  return None
118
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
  def r_validity(completion: str) -> float:
121
- """r_validity: legal tool call with valid required fields. Returns 0 or 1."""
 
 
 
 
 
122
  data = _parse_completion(completion)
123
  if data is None:
124
  return 0.0
 
 
125
  tool = data.get("tool", "")
126
  args = data.get("arguments", {})
 
 
127
  if tool not in _VALID_TOOLS:
128
- return 0.0
 
 
129
  if tool == "play_delivery" and args.get("shot_intent") not in SHOT_AGGRESSION:
130
- return 0.0
131
  if tool == "set_strategy":
132
  agg = args.get("aggression")
133
  if not isinstance(agg, (int, float)):
134
- return 0.0
135
  if tool == "plan_shot" and args.get("shot_intent") not in SHOT_AGGRESSION:
136
- return 0.0
137
  if tool in {"choose_bowler", "set_bowling_strategy", "plan_delivery"}:
138
  if args.get("bowler_type") not in (None, "pace", "spin"):
139
- return 0.0
140
  return 1.0
141
 
142
 
@@ -299,13 +446,9 @@ def make_reward_fn(curriculum_stage: int):
299
  Returns reward_fn(prompts, completions, **kwargs) → list[float].
300
 
301
  Weights align with compute_episode_reward in reward_calculator.py:
302
- r_result 55% not computable stateless; omitted, remainder re-scaled
303
- r_cricket 25% not computable stateless; omitted
304
- r_behavior 15% → 0.75 of stateless composite
305
- r_validity 5% → 0.25 of stateless composite
306
-
307
- Scaling: behaviour=0.75, validity=0.25 preserves the relative 15:5 ratio
308
- from the episode-level rubric while using only what's available per-turn.
309
  """
310
  # Minimum reward for any structurally valid completion — ensures GRPO has a
311
  # positive gradient to reinforce valid tool use even for unscored tool types.
@@ -315,8 +458,18 @@ def make_reward_fn(curriculum_stage: int):
315
  rewards = []
316
  for prompt, completion in zip(prompts, completions):
317
  fmt = r_validity(completion)
 
318
  if curriculum_stage == 1:
319
- rewards.append(fmt)
 
 
 
 
 
 
 
 
 
320
  continue
321
 
322
  behavior = r_behavior_stateless(prompt, completion)
@@ -327,9 +480,12 @@ def make_reward_fn(curriculum_stage: int):
327
  + _RW.behavior_adaptation * adapt
328
  + _RW.behavior_opponent_awareness * aware
329
  )
330
- reward = _RW.training_behavior * r_beh + _RW.training_validity * fmt
 
 
 
331
  # Floor: valid JSON should always beat invalid JSON (reward=0)
332
- if fmt > 0.0:
333
  reward = max(reward, _VALID_FLOOR)
334
  rewards.append(round(reward, 4))
335
  return rewards
@@ -344,66 +500,136 @@ def make_reward_fn(curriculum_stage: int):
344
 
345
  SYSTEM_PROMPT = (
346
  "You are an expert adaptive cricket captain. Each turn you receive a scorecard "
347
- "and must respond with a SINGLE valid JSON tool call.\n\n"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
348
  "Available tools:\n"
349
- " call_toss — Call heads/tails and choose bat/bowl\n"
350
- " select_batter — Choose batter profile for the match situation\n"
351
- " set_strategy — Declare batting intent (aggression 0–1, rationale)\n"
352
- " plan_shot — Pre-ball batting plan\n"
353
- " play_delivery — Choose a shot and advance the game\n\n"
354
- " choose_bowler — Choose bowler profile for the situation\n"
355
- " set_bowling_strategy — Declare bowling line/length/type/rationale\n"
356
- " plan_delivery — Pre-ball bowling plan\n"
357
- " set_field_setting — Aggressive/Balanced/Defensive field\n"
358
- " bowl_delivery — Execute the delivery\n"
359
- " reflect_after_ball — Adapt after the previous ball\n"
360
- " analyze_situation — Query pitch/bowler/field info\n\n"
361
  "Shot intents: leave | defensive | single | rotate | boundary | six\n\n"
362
- "Be specific about phase, target pressure, opponent plan, field, batter, and bowler.\n\n"
363
- "Respond with exactly one JSON object on a single line."
364
  )
365
 
 
 
 
 
366
  _RANDOM_SHOTS = list(SHOT_AGGRESSION.keys())
367
  _RANDOM_QUERIES = ["pitch_conditions", "bowler_info", "field_setting", "match_situation"]
368
  _RANDOM_ZONES = ["cover", "point", "straight", "midwicket", "square_leg", "fine_leg", "long_on", "long_off"]
369
 
370
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
371
  def _random_action(
372
  rng: random.Random,
373
  game_state: str = "batting",
374
  available_tools: list[str] | None = None,
375
  current_bowler_type: str | None = None,
 
376
  ) -> CricketAction:
377
- def allowed(action: CricketAction) -> CricketAction:
378
- if available_tools is None or action.tool in available_tools:
379
- return action
380
- if "bowl_delivery" in available_tools:
381
- return CricketAction(tool="bowl_delivery", arguments={})
382
- if "play_delivery" in available_tools:
383
- return CricketAction(tool="play_delivery", arguments={"shot_intent": "defensive", "explanation": "fallback"})
384
- if "call_toss" in available_tools:
385
- return CricketAction(tool="call_toss", arguments={"call": "heads", "decision": "bat"})
386
- return action
 
 
 
 
387
 
388
  if game_state == "toss":
389
- return allowed(CricketAction(
390
  tool="call_toss",
391
  arguments={"call": rng.choice(["heads", "tails"]), "decision": rng.choice(["bat", "bowl"])},
392
- ))
 
 
 
 
 
 
 
 
 
 
393
  if game_state == "bowling":
394
  choice = rng.random()
395
- if choice < 0.15:
396
- return allowed(CricketAction(
 
397
  tool="choose_bowler",
398
  arguments={
399
- "name": rng.choice(["Strike Pacer", "Control Spinner", "Death Specialist"]),
400
- "bowler_type": rng.choice(["pace", "spin"]),
401
- "style": rng.choice(["swing", "economy", "yorker"]),
402
- "rationale": "Match bowler to phase and batter matchup",
403
  },
404
- ))
405
- if choice < 0.35:
406
- return allowed(CricketAction(
407
  tool="plan_delivery",
408
  arguments={
409
  "bowler_type": current_bowler_type or rng.choice(["pace", "spin"]),
@@ -412,74 +638,96 @@ def _random_action(
412
  "delivery_type": rng.choice(["stock", "yorker", "bouncer", "slower ball"]),
413
  "rationale": "Use field and batter style to control scoring zones",
414
  },
415
- ))
416
- if choice < 0.5:
417
- return allowed(CricketAction(tool="set_field_setting", arguments={"setting": rng.choice(["Aggressive", "Balanced", "Defensive"])}))
418
- if choice < 0.6:
419
- return allowed(CricketAction(tool="reflect_after_ball", arguments={"reflection": "Adjust line and field after the last ball"}))
420
- return allowed(CricketAction(tool="bowl_delivery", arguments={}))
 
 
 
 
 
 
 
 
 
 
421
 
422
  choice = rng.random()
423
- if choice < 0.15:
424
- return allowed(CricketAction(
 
425
  tool="select_batter",
426
  arguments={
427
- "name": rng.choice(["Opener", "Anchor", "Finisher"]),
428
- "style": rng.choice(["balanced", "anchor", "hitter", "finisher"]),
429
- "aggression": round(rng.uniform(0.2, 0.8), 2),
430
  "rationale": "Select batter based on phase, wickets, and target pressure",
431
  },
432
- ))
433
- if choice < 0.3:
434
- return allowed(CricketAction(
435
  tool="set_strategy",
436
  arguments={
437
  "phase_intent": rng.choice(["attack", "consolidate", "rotate"]),
438
  "aggression": round(rng.uniform(0.1, 0.9), 2),
439
- "rationale": "random rollout",
440
  },
441
- ))
442
- if choice < 0.45:
443
- return allowed(CricketAction(
444
  tool="plan_shot",
445
  arguments={
446
  "shot_intent": rng.choice(_RANDOM_SHOTS),
447
- "target_area": rng.choice(_RANDOM_ZONES),
448
- "trajectory": rng.choice(["ground", "lofted", "aerial"]),
449
  "risk": rng.choice(["low", "balanced", "high"]),
450
  "rationale": "Plan shot against bowler, field, and required rate",
451
  },
452
- ))
453
- if choice < 0.55:
454
- return allowed(CricketAction(
455
  tool="analyze_situation",
456
  arguments={"query_type": rng.choice(_RANDOM_QUERIES)},
457
- ))
458
- if choice < 0.65:
459
- return allowed(CricketAction(tool="reflect_after_ball", arguments={"reflection": "Revise risk after previous ball"}))
460
- return allowed(CricketAction(
461
- tool="play_delivery",
462
- arguments={"shot_intent": rng.choice(_RANDOM_SHOTS), "explanation": "random"},
463
- ))
 
 
464
 
465
 
466
  def collect_prompts(
467
  n_prompts: int,
468
  task: str = "stage2_full",
469
  seed: int = 42,
 
 
470
  ) -> list[str]:
471
  """
472
  Collect game-state prompts by running episodes with random actions.
473
  Returns a list of prompt strings (one per game state observation).
474
  """
475
  rng = random.Random(seed)
 
 
476
  prompts: list[str] = []
477
  ep_count = 0
478
 
479
  while len(prompts) < n_prompts:
480
  env = CricketEnvironment()
481
- obs = env.reset(seed=rng.randint(0, 99999), options={"task": task, "random_start": True})
482
- prompts.append(_format_prompt(obs.prompt_text))
 
 
 
 
 
483
  steps = 0
484
 
485
  while not obs.done and steps < 80:
@@ -488,10 +736,11 @@ def collect_prompts(
488
  obs.game_state,
489
  obs.available_tools,
490
  obs.current_bowler.get("type") if obs.current_bowler else None,
 
491
  )
492
  obs = env.step(action)
493
  if not obs.done:
494
- prompts.append(_format_prompt(obs.prompt_text))
495
  steps += 1
496
 
497
  ep_count += 1
@@ -513,13 +762,548 @@ def build_dataset(prompts: list[str]) -> Dataset:
513
  return Dataset.from_dict({"prompt": prompts})
514
 
515
 
516
- def generate_sft_examples(out_path: str, n_examples: int = 240, seed: int = 42):
 
 
 
 
517
  """Stage 0 bootstrap data: valid tool JSON for every tool family."""
518
  rng = random.Random(seed)
 
519
  examples = []
520
  for _ in range(n_examples):
521
  game_state = rng.choice(["toss", "batting", "bowling"])
522
- action = _random_action(rng, game_state)
523
  prompt = (
524
  f"{SYSTEM_PROMPT}\n\n"
525
  f"[CricketCaptain] {game_state.upper()} | Example adaptive scenario\n"
@@ -546,28 +1330,68 @@ def generate_sft_examples(out_path: str, n_examples: int = 240, seed: int = 42):
546
  # Model loading (plain transformers + bitsandbytes 4-bit) #
547
  # ------------------------------------------------------------------ #
548
 
549
- def load_model(model_name: str):
 
 
 
 
 
 
 
550
  if not _TRAIN_IMPORTS_AVAILABLE:
551
  raise ImportError("Training dependencies are missing. Install with: pip install '.[train]'")
552
- print(f"Loading {model_name} …")
553
- bnb_cfg = BitsAndBytesConfig(
554
- load_in_4bit=True,
555
- bnb_4bit_compute_dtype=torch.bfloat16,
556
- bnb_4bit_use_double_quant=True,
557
- bnb_4bit_quant_type="nf4",
558
- )
559
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
560
  if tokenizer.pad_token is None:
561
  tokenizer.pad_token = tokenizer.eos_token
562
 
563
- model = AutoModelForCausalLM.from_pretrained(
564
- model_name,
565
- quantization_config=bnb_cfg,
 
 
 
 
566
  device_map="auto",
567
  trust_remote_code=True,
568
  torch_dtype=torch.bfloat16,
 
569
  )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
570
  print(f"Loaded. Parameters: {model.num_parameters():,}")
 
571
  return model, tokenizer
572
 
573
 
@@ -578,52 +1402,173 @@ def load_model(model_name: str):
578
  def train(args):
579
  if not _TRAIN_IMPORTS_AVAILABLE:
580
  raise ImportError("Training dependencies are missing. Install with: pip install '.[train]'")
 
 
 
 
 
 
 
581
  task = "stage1_format" if args.stage == 1 else "stage2_full"
582
- out_dir = f"./checkpoints/stage{args.stage}"
583
- save_dir = f"./checkpoints/stage{args.stage}_final"
 
 
 
584
 
585
  print(f"\n=== Stage {args.stage} Training ===")
586
  print(f"Task: {task} | Prompts: {args.prompts} | Steps: {args.steps}")
587
-
588
- # Collect prompts
589
- print("\nCollecting game-state prompts …")
590
- prompts = collect_prompts(args.prompts, task=task, seed=args.seed)
591
- dataset = build_dataset(prompts)
592
-
593
- # Load model
594
- model, tokenizer = load_model(args.model)
 
 
 
595
 
596
  # GRPO config
 
 
 
 
597
  config = GRPOConfig(
598
  output_dir=out_dir,
 
599
  num_train_epochs=1,
600
  max_steps=args.steps,
601
  per_device_train_batch_size=args.batch_size,
602
  gradient_accumulation_steps=args.grad_accum,
603
- learning_rate=2e-5 if args.stage == 1 else 1e-5,
604
  warmup_ratio=0.05,
605
  lr_scheduler_type="cosine",
606
- logging_steps=10,
607
- save_steps=max(50, args.steps // 5),
608
- save_total_limit=2,
609
  bf16=True,
610
- max_prompt_length=512,
611
- max_completion_length=256,
612
  num_generations=args.num_generations,
613
- temperature=0.8,
614
- report_to="none",
 
 
615
  log_completions=True,
616
  seed=args.seed,
 
 
 
 
617
  )
618
 
619
- reward_fn = make_reward_fn(args.stage)
 
 
 
 
620
 
621
  trainer = GRPOTrainer(
622
  model=model,
623
- reward_funcs=reward_fn,
624
  args=config,
625
  train_dataset=dataset,
626
  processing_class=tokenizer,
 
627
  )
628
 
629
  print(f"\nStarting training ({args.steps} steps, {len(dataset)} prompts) …")
@@ -659,7 +1604,11 @@ def evaluate(args):
659
 
660
  for ep in range(args.eval_episodes):
661
  env = CricketEnvironment()
662
- obs = env.reset(seed=rng.randint(0, 99999), options={"task": "stage2_full", "random_start": False})
 
 
 
 
663
  steps = 0
664
 
665
  while not obs.done and steps < 150:
@@ -670,12 +1619,7 @@ def evaluate(args):
670
  if data:
671
  action = CricketAction(tool=data["tool"], arguments=data.get("arguments", {}))
672
  else:
673
- if obs.game_state == "bowling":
674
- action = CricketAction(tool="bowl_delivery", arguments={})
675
- elif obs.game_state == "toss":
676
- action = CricketAction(tool="call_toss", arguments={"call": "heads", "decision": "bat"})
677
- else:
678
- action = CricketAction(tool="play_delivery", arguments={"shot_intent": "defensive", "explanation": "fallback"})
679
 
680
  obs = env.step(action)
681
  steps += 1
@@ -706,6 +1650,7 @@ def _make_run_folder(prefix: str, model: str | None, opponent_mode: str | None,
706
  def train_smoke(args):
707
  """Run short direct-environment training rollouts without loading a model."""
708
  rng = random.Random(args.seed)
 
709
 
710
  # Auto-create run folder unless --output explicitly given
711
  if args.output:
@@ -746,6 +1691,7 @@ def train_smoke(args):
746
  "eval_pack_id": args.eval_pack_id,
747
  "opponent_mode": args.opponent_mode,
748
  "opponent_cache_path": args.opponent_cache_path,
 
749
  })
750
  prompts = [_format_prompt(obs.prompt_text)]
751
  total_reward = 0.0
@@ -763,6 +1709,7 @@ def train_smoke(args):
763
  obs.game_state,
764
  obs.available_tools,
765
  obs.current_bowler.get("type") if obs.current_bowler else None,
 
766
  )
767
  obs = env.step(action)
768
  step_end = time.perf_counter()
@@ -855,6 +1802,7 @@ def train_smoke(args):
855
  def _apply_yaml_defaults(args, cfg: dict) -> None:
856
  """Merge YAML config values into args, CLI args take precedence."""
857
  captain = cfg.get("captain", {}) or {}
 
858
  env_cfg = cfg.get("env", {}) or {}
859
  train_cfg = cfg.get("train", {}) or {}
860
 
@@ -862,17 +1810,48 @@ def _apply_yaml_defaults(args, cfg: dict) -> None:
862
  if val is not None and getattr(args, attr, None) is None:
863
  setattr(args, attr, val)
864
 
865
- _set("model", captain.get("model"))
 
 
 
866
  _set("api_base", captain.get("api_base"))
867
  _set("api_key", os.environ.get(captain.get("api_key_env", "")) or None)
868
  _set("eval_pack_id", env_cfg.get("eval_pack_id"))
869
- _set("opponent_mode", cfg.get("opponent", {}).get("mode"))
870
- _set("opponent_cache_path", cfg.get("opponent", {}).get("cache_path"))
 
 
 
 
871
  _set("max_overs", env_cfg.get("max_overs"))
 
872
  _set("steps", train_cfg.get("steps"))
873
  _set("prompts", train_cfg.get("prompts"))
874
  _set("batch_size", train_cfg.get("batch_size"))
 
875
  _set("stage", train_cfg.get("stage"))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
876
 
877
 
878
  def main():
@@ -888,15 +1867,55 @@ def main():
888
  t.add_argument("--prompts", type=int, default=None, help="Game state prompts to collect")
889
  t.add_argument("--steps", type=int, default=None, help="GRPOTrainer max_steps")
890
  t.add_argument("--batch-size", type=int, default=None, dest="batch_size")
891
- t.add_argument("--grad-accum", type=int, default=4, dest="grad_accum")
892
- t.add_argument("--num-generations", type=int, default=4, dest="num_generations")
 
 
 
 
 
 
 
 
 
 
 
 
 
893
  t.add_argument("--seed", type=int, default=42)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
894
 
895
  # eval
896
  e = sub.add_parser("eval", help="Evaluate a checkpoint")
897
  e.add_argument("--config", default=None)
898
  e.add_argument("--model", default=None)
899
  e.add_argument("--eval-episodes", type=int, default=10, dest="eval_episodes")
 
900
  e.add_argument("--seed", type=int, default=0)
901
 
902
  # quick test (no GPU needed)
@@ -911,12 +1930,14 @@ def main():
911
  smoke.add_argument("--eval-pack-id", default=None, dest="eval_pack_id")
912
  smoke.add_argument("--opponent-mode", default=None, choices=["heuristic", "llm_live", "llm_cached", "cricsheet"], dest="opponent_mode")
913
  smoke.add_argument("--opponent-cache-path", default=None, dest="opponent_cache_path")
 
914
  smoke.add_argument("--output", default=None)
915
  smoke.add_argument("--seed", type=int, default=42)
916
 
917
  sft = sub.add_parser("sft-data", help="Generate Stage 0 supervised tool-format examples")
918
  sft.add_argument("--output", default="./data/training/tool_sft_examples.jsonl")
919
  sft.add_argument("--examples", type=int, default=240)
 
920
  sft.add_argument("--seed", type=int, default=42)
921
 
922
  args = parser.parse_args()
@@ -934,35 +1955,45 @@ def main():
934
  if getattr(args, "stage", None) is None:
935
  args.stage = 1
936
  if getattr(args, "model", None) is None:
937
- args.model = "Qwen/Qwen2.5-7B-Instruct"
938
  if getattr(args, "steps", None) is None:
939
  args.steps = 200
940
  if getattr(args, "prompts", None) is None:
941
  args.prompts = 500
942
  if getattr(args, "batch_size", None) is None:
943
  args.batch_size = 2
 
 
944
  if getattr(args, "eval_pack_id", None) is None:
945
  args.eval_pack_id = "adaptive_t20_v1"
946
  if getattr(args, "opponent_mode", None) is None:
947
  args.opponent_mode = "llm_live"
948
  if getattr(args, "max_overs", None) is None:
949
  args.max_overs = 5
 
 
 
 
 
 
 
 
950
 
951
  if args.cmd == "train":
952
  train(args)
953
  elif args.cmd == "eval":
954
  evaluate(args)
955
  elif args.cmd == "test":
956
- _smoke_test()
957
  elif args.cmd == "train-smoke":
958
  train_smoke(args)
959
  elif args.cmd == "sft-data":
960
- generate_sft_examples(args.output, args.examples, args.seed)
961
  else:
962
  parser.print_help()
963
 
964
 
965
- def _smoke_test():
966
  """Verify reward functions work correctly."""
967
  cases = [
968
  (
@@ -990,12 +2021,12 @@ def _smoke_test():
990
  ]
991
  print("Reward function smoke test:\n")
992
  for prompt, completion, expected in cases:
993
- fmt = r_format(completion)
994
- coh = r_coherence_stateless(prompt, completion)
995
  print(f" expected={expected:4s} | fmt={fmt:.0f} | coh={coh:.3f} | {completion[:60]}")
996
 
997
  print("\nPrompt collection test (5 prompts):")
998
- p = collect_prompts(5, task="stage1_format", seed=1)
999
  for i, pp in enumerate(p):
1000
  print(f" [{i}] {pp[:80].strip()} …")
1001
 
 
2
  MT-GRPO training script for CricketCaptain-LLM.
3
 
4
  Two-stage curriculum (ToolRL-style):
5
+ Stage 1: tool-call mastery — emphasize valid, phase-legal tool usage
6
+ Stage 2: strategic behavior — full environment-backed reward (result + cricket + behavior + validity)
7
 
8
  Design:
9
+ - Training uses TRL GRPO with environment_factory=CricketCaptainToolEnv
10
+ - The model interacts with live CricketEnvironment instances over multi-turn tool calls
11
+ - Rewards are collected from the environment (environment_reward), not only from stateless prompt parsing
12
+ - The opponent policy is part of the environment: heuristic/cricsheet/llm_live/llm_cached
13
+ - Plain TRL + Transformers + bitsandbytes + PEFT (LoRA adapters for 4-bit models)
14
+
15
+ Usage (canonical Qwen3 setup):
16
+ python train.py train --config configs/cricket_train_qwen3_warmup.yaml # warmup
17
+ python train.py train --config configs/cricket_train_qwen3.yaml # main 5-over
18
+
19
+ Legacy Qwen3.5 configs live in configs/extras/.
20
  """
21
 
22
  import argparse
23
+ import copy
24
+ import datetime
25
  import json
26
  import os
27
  import random
28
  import re
29
  import sys
30
+ import threading
31
  import time
32
  from pathlib import Path
33
  from typing import Any
 
38
  try:
39
  import torch
40
  from datasets import Dataset
41
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
42
  from trl import GRPOConfig, GRPOTrainer
43
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
44
  _TRAIN_IMPORTS_AVAILABLE = True
45
  except ImportError:
46
  torch = None
47
  Dataset = None
48
+ LoraConfig = None
49
+ get_peft_model = None
50
+ prepare_model_for_kbit_training = None
51
  GRPOConfig = None
52
  GRPOTrainer = None
53
  AutoModelForCausalLM = None
 
59
  from server.cricket_environment import CricketEnvironment
60
  from server.coherence_grader import aggression_match, phase_appropriate, rationale_specificity
61
  from server.markov_engine import SHOT_AGGRESSION
62
+ from server.player_roster import build_playing_xi, load_team_roster
63
  from models import CricketAction
64
  from config_yaml import get_game_constants, get_reward_weights
65
  except ImportError:
66
  from cricket_captain.server.cricket_environment import CricketEnvironment
67
  from cricket_captain.server.coherence_grader import aggression_match, phase_appropriate, rationale_specificity
68
  from cricket_captain.server.markov_engine import SHOT_AGGRESSION
69
+ from cricket_captain.server.player_roster import build_playing_xi, load_team_roster
70
  from cricket_captain.models import CricketAction
71
  from cricket_captain.config_yaml import get_game_constants, get_reward_weights
72
 
 
118
  # Per-turn reward components (all stateless) #
119
  # ------------------------------------------------------------------ #
120
 
121
+ _XML_FN_RE = re.compile(r"<function\s*=?\s*([^>\s]+)\s*>", re.IGNORECASE)
122
+ _XML_PARAM_RE = re.compile(r"<parameter\s*=\s*([^>\s]+)\s*>(.*?)</parameter>", re.IGNORECASE | re.DOTALL)
123
+
124
+
125
  def _parse_completion(raw: str) -> dict | None:
126
+ """Parse a tool-call from the raw completion into our canonical {tool, arguments} dict.
127
+
128
+ Handles four common model output patterns:
129
+ 1. Plain JSON (ideal).
130
+ 2. Markdown code block (```json ... ```).
131
+ 3. Thinking-model preamble: <think>...</think> followed by JSON.
132
+ Qwen3/Qwen3.5 in default mode emits reasoning inside <think> tags;
133
+ we strip everything up to and including the closing </think> tag.
134
+ 4. XML function-call format that Qwen3.5 was trained on:
135
+ <function=tool_name><parameter=foo>bar</parameter>...</function>
136
+ Empirically (see logs/run_2026-04-25_21-08-45) every Stage-1 completion
137
+ emitted this XML form instead of JSON — so we extract it as a fallback
138
+ to give GRPO a non-zero gradient before the model has been trained
139
+ to follow the JSON contract.
140
+ """
141
  raw = raw.strip()
142
+
143
+ # Strip <think>...</think> preamble emitted by thinking-mode models.
144
+ if "<think>" in raw:
145
+ think_end = raw.rfind("</think>")
146
+ if think_end != -1:
147
+ raw = raw[think_end + len("</think>"):].strip()
148
+
149
  if raw.startswith("```"):
150
  lines = raw.split("\n")
151
  raw = "\n".join(lines[1:-1]) if len(lines) > 2 else raw
152
+
153
+ # Try parsing the whole string, then fall back to the first {...} block.
154
  try:
155
  return json.loads(raw)
156
  except (json.JSONDecodeError, ValueError):
157
+ pass
158
+
159
+ start = raw.find("{")
160
+ end = raw.rfind("}")
161
+ if start != -1 and end > start:
162
+ try:
163
+ return json.loads(raw[start : end + 1])
164
+ except (json.JSONDecodeError, ValueError):
165
+ pass
166
+
167
+ # XML function-call fallback (Qwen3.5 default tool-call emission style).
168
+ fn_match = _XML_FN_RE.search(raw)
169
+ if fn_match:
170
+ tool = fn_match.group(1).strip().strip("\"'")
171
+ arguments: dict[str, Any] = {}
172
+ for pname, pval in _XML_PARAM_RE.findall(raw):
173
+ v = pval.strip()
174
+ # Coerce numeric/bool literals so downstream validators accept them.
175
+ try:
176
+ arguments[pname] = json.loads(v)
177
+ except (json.JSONDecodeError, ValueError):
178
+ arguments[pname] = v
179
+ return {"tool": tool, "arguments": arguments}
180
+
181
+ return None
182
+
183
+
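As a quick sanity check, a minimal sketch of the three happy paths (the completions below are hypothetical; they assume `_parse_completion` behaves exactly as defined above):

```python
# All three hypothetical completions should normalise to the same canonical dict.
json_form = '{"tool": "play_delivery", "arguments": {"shot_intent": "single"}}'
think_form = "<think>Powerplay, rotate strike.</think>\n" + json_form
xml_form = "<function=play_delivery><parameter=shot_intent>single</parameter></function>"

for raw in (json_form, think_form, xml_form):
    assert _parse_completion(raw) == {"tool": "play_delivery", "arguments": {"shot_intent": "single"}}
```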
184
+ # Bounded FIFO cache (oldest insertion evicted first). Each snapshot is a deepcopy of CricketEnvironment
185
+ # (~1 MB) and only used by the LEGACY single-turn r_environment_rollout path,
186
+ # not by the multi-turn environment_factory training path. Cap at 4096 entries
187
+ # (~4 GB worst case) so a long collect_prompts call can't blow up host RAM.
188
+ _PROMPT_ENV_SNAPSHOTS: dict[str, CricketEnvironment] = {}
189
+ _PROMPT_SNAPSHOT_CAP = 4096
190
+ _ENV_REWARD_ROLLOUT_STEPS = 12
191
+
192
+
193
+ def _remember_prompt(obs_text: str, env: CricketEnvironment) -> str:
194
+ """Format an observation and keep the exact env state for rollout reward."""
195
+ prompt = _format_prompt(obs_text)
196
+ if len(_PROMPT_ENV_SNAPSHOTS) >= _PROMPT_SNAPSHOT_CAP:
197
+ # Evict oldest insertion (dict preserves insertion order in py3.7+).
198
+ oldest_key = next(iter(_PROMPT_ENV_SNAPSHOTS))
199
+ del _PROMPT_ENV_SNAPSHOTS[oldest_key]
200
+ _PROMPT_ENV_SNAPSHOTS[prompt] = copy.deepcopy(env)
201
+ return prompt
202
+
203
+
204
+ def r_environment_rollout(prompt: str, completion: str) -> float | None:
205
+ """Env-backed score for a generated tool call plus short continuation.
206
+
207
+ Returns None when the prompt was not collected from an env snapshot, allowing
208
+ callers to fall back to stateless scoring. Otherwise returns [0, 1], where 0
209
+ means invalid JSON/tool-for-state and higher values reflect the env reward.
210
+ """
211
+ snapshot = _PROMPT_ENV_SNAPSHOTS.get(prompt)
212
+ if snapshot is None:
213
  return None
214
 
215
+ data = _parse_completion(completion)
216
+ if data is None:
217
+ return 0.0
218
+
219
+ tool = data.get("tool", "")
220
+ args = data.get("arguments", {})
221
+ if not isinstance(args, dict):
222
+ return 0.0
223
+
224
+ env = copy.deepcopy(snapshot)
225
+ if tool not in env._get_available_tools():
226
+ return 0.0
227
+
228
+ try:
229
+ obs = env.step(CricketAction(tool=tool, arguments=args))
230
+ except Exception:
231
+ return 0.0
232
+
233
+ reward = float(obs.reward or 0.0)
234
+ rng = random.Random(hash(prompt + completion) & 0xFFFFFFFF)
235
+ roster = build_playing_xi(getattr(env, "_agent_roster", []))
236
+ for _ in range(_ENV_REWARD_ROLLOUT_STEPS):
237
+ if obs.done:
238
+ break
239
+ action = _random_action(
240
+ rng,
241
+ obs.game_state,
242
+ obs.available_tools,
243
+ obs.current_bowler.get("type") if obs.current_bowler else None,
244
+ roster,
245
+ )
246
+ obs = env.step(action)
247
+ reward += float(obs.reward or 0.0)
248
+
249
+ if obs.done and env.state.reward_breakdown:
250
+ reward += float(env.state.reward_breakdown.get("composite", 0.0))
251
+
252
+ # Map rollout reward into [0,1] while preserving penalties for bad tool choices.
253
+ return round(max(0.0, min(1.0, 0.5 + reward)), 4)
254
+
255
 
256
  def r_validity(completion: str) -> float:
257
+ """Schema reward for tool calling.
258
+
259
+ Exact env-executable calls receive 1.0. Malformed but parseable JSON gets a
260
+ small shaping signal so early GRPO has non-zero variance before the model has
261
+ learned the strict `{"tool": ..., "arguments": {...}}` contract.
262
+ """
263
  data = _parse_completion(completion)
264
  if data is None:
265
  return 0.0
266
+ if not isinstance(data, dict):
267
+ return 0.05
268
  tool = data.get("tool", "")
269
  args = data.get("arguments", {})
270
+ if "tool" not in data or "arguments" not in data:
271
+ return 0.15
272
  if tool not in _VALID_TOOLS:
273
+ return 0.25
274
+ if not isinstance(args, dict):
275
+ return 0.35
276
  if tool == "play_delivery" and args.get("shot_intent") not in SHOT_AGGRESSION:
277
+ return 0.5
278
  if tool == "set_strategy":
279
  agg = args.get("aggression")
280
  if not isinstance(agg, (int, float)):
281
+ return 0.5
282
  if tool == "plan_shot" and args.get("shot_intent") not in SHOT_AGGRESSION:
283
+ return 0.5
284
  if tool in {"choose_bowler", "set_bowling_strategy", "plan_delivery"}:
285
  if args.get("bowler_type") not in (None, "pace", "spin"):
286
+ return 0.5
287
  return 1.0
288
 
289
 
 
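To make the tiers concrete, a few hedged examples; the exact scores assume `play_delivery` is in `_VALID_TOOLS`, `single` is in `SHOT_AGGRESSION`, and made-up names like `teleport` or `slog` are not:

```python
r_validity("not a tool call at all")                                              # 0.0  unparseable
r_validity('{"tool": "teleport", "arguments": {}}')                               # 0.25 unknown tool
r_validity('{"tool": "play_delivery", "arguments": {"shot_intent": "slog"}}')     # 0.5  bad shot_intent
r_validity('{"tool": "play_delivery", "arguments": {"shot_intent": "single"}}')   # 1.0  env-executable
```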
446
  Returns reward_fn(prompts, completions, **kwargs) → list[float].
447
 
448
  Weights align with compute_episode_reward in reward_calculator.py:
449
+ r_env one-step env rollout reward when prompt snapshot exists
450
+ r_behavior — stateless tactical/tool coherence
451
+ r_validity JSON/tool schema validity
 
 
 
 
452
  """
453
  # Minimum reward for any structurally valid completion — ensures GRPO has a
454
  # positive gradient to reinforce valid tool use even for unscored tool types.
 
458
  rewards = []
459
  for prompt, completion in zip(prompts, completions):
460
  fmt = r_validity(completion)
461
+ env_score = r_environment_rollout(prompt, completion)
462
  if curriculum_stage == 1:
463
+ # Length-efficiency penalty: a valid JSON tool call is ≤400 chars.
464
+ # Models with thinking mode (Qwen3/3.5) generate 800-2000 char
465
+ # preambles before the JSON; penalise that verbosity so GRPO
466
+ # learns to emit short, direct JSON. The multiplier ramps linearly from
467
+ # 1.0 (no penalty) at ≤400 chars down to 0.0 at ≥2400 chars.
468
+ _JSON_TARGET = 400
469
+ _RAMP_RANGE = 2000
470
+ length_eff = max(0.0, 1.0 - max(0, len(completion) - _JSON_TARGET) / _RAMP_RANGE)
471
+ base = 0.5 * fmt + 0.5 * (env_score if env_score is not None else fmt)
472
+ rewards.append(round(length_eff * base, 4))
473
  continue
474
 
475
  behavior = r_behavior_stateless(prompt, completion)
 
480
  + _RW.behavior_adaptation * adapt
481
  + _RW.behavior_opponent_awareness * aware
482
  )
483
+ if env_score is None:
484
+ reward = _RW.training_behavior * r_beh + _RW.training_validity * fmt
485
+ else:
486
+ reward = 0.45 * env_score + 0.40 * r_beh + 0.15 * fmt
487
  # Floor: valid JSON should always beat invalid JSON (reward=0)
488
+ if fmt > 0.0 and (env_score is None or env_score > 0.0):
489
  reward = max(reward, _VALID_FLOOR)
490
  rewards.append(round(reward, 4))
491
  return rewards
 
500
 
501
  SYSTEM_PROMPT = (
502
  "You are an expert adaptive cricket captain. Each turn you receive a scorecard "
503
+ "and must choose exactly one cricket captaincy tool call.\n\n"
504
+ "EXECUTE FIRST — strict rule:\n"
505
+ " - The match only progresses when you call `play_delivery` (batting) or\n"
506
+ " `bowl_delivery` (bowling). Every other tool is overhead.\n"
507
+ " - Default action on EVERY ball: call `play_delivery` / `bowl_delivery` with\n"
508
+ " plan args INLINE: e.g. `play_delivery(shot_intent='single', risk='low', rationale='rotate')`\n"
509
+ " or `bowl_delivery(line='outside_off', length='good', delivery_type='stock')`.\n"
510
+ " - Use `set_match_plan` ONCE at the very start of an innings to declare strategy.\n"
511
+ " - Use `set_strategy` / `set_bowling_strategy` ONCE per phase boundary.\n"
512
+ " - DO NOT call `plan_shot` or `plan_delivery` (deprecated) — they only add a\n"
513
+ " wasted turn. Pass the same parameters to play_delivery / bowl_delivery directly.\n"
514
+ " - SKIP `reflect_after_ball` unless the previous ball was a wicket or boundary.\n"
515
+ " - You are scored on MATCH OUTCOMES, not on philosophical depth. Bloated\n"
516
+ " pre-ball planning truncates the episode and you forfeit the result reward.\n\n"
517
+ "THINKING BUDGET — HARD LIMIT:\n"
518
+ " - Per turn: ONE sentence of reasoning, max 30 tokens, inside <think>...</think>.\n"
519
+ " - Do NOT enumerate options, restate the scorecard, or re-derive the plan.\n"
520
+ " - Bad: '<think>This is the first ball, the field is balanced, Kohli is on strike at 0.45 aggression, I should consider...'\n"
521
+ " - Good: '<think>Powerplay, balanced field — single to rotate.</think>'\n"
522
+ " - Token budget per rollout is finite. Long thinking = match truncated = ZERO result reward.\n"
523
+ " - The plan you set at the start carries the strategy; do not re-derive it every ball.\n\n"
524
+ "Emit exactly one tool call wrapped in <tool_call>...</tool_call> XML tags. "
525
+ "Bare JSON without the wrapper is NOT recognized and will end the rollout.\n"
526
+ 'Example: <tool_call>{"name": "play_delivery", "arguments": {"shot_intent": "single", "explanation": "rotate strike"}}</tool_call>\n\n'
527
  "Available tools:\n"
528
+ " call_toss — Call heads/tails and choose bat/bowl\n"
529
+ " select_batter — Choose batter profile for the match situation\n"
530
+ " set_strategy — Declare batting intent (aggression 0–1, rationale)\n"
531
+ " plan_shot — Pre-ball batting plan\n"
532
+ " play_delivery — Choose a shot and advance the game\n"
533
+ " choose_bowler — Choose bowler profile for the situation\n"
534
+ " set_bowling_strategy — Declare bowling line/length/type/rationale\n"
535
+ " plan_delivery — Pre-ball bowling plan\n"
536
+ " set_field_setting — Aggressive/Balanced/Defensive field\n"
537
+ " bowl_delivery — Execute the delivery\n"
538
+ " reflect_after_ball — Adapt after the previous ball\n"
539
+ " analyze_situation — Query pitch/bowler/field info\n\n"
540
  "Shot intents: leave | defensive | single | rotate | boundary | six\n\n"
541
+ "PRIORITIES (in order): (1) finish the match, (2) win the match, (3) score well per ball.\n"
542
+ "Verbose reasoning forfeits all three. Decide fast, act, move on."
543
  )
544
 
545
+
546
+ def get_system_prompt(stage: int = 2) -> str:
547
+ return SYSTEM_PROMPT
548
+
549
  _RANDOM_SHOTS = list(SHOT_AGGRESSION.keys())
550
  _RANDOM_QUERIES = ["pitch_conditions", "bowler_info", "field_setting", "match_situation"]
551
  _RANDOM_ZONES = ["cover", "point", "straight", "midwicket", "square_leg", "fine_leg", "long_on", "long_off"]
552
 
553
 
554
+ def _training_roster(agent_team: str | None = None) -> list[dict]:
555
+ team = agent_team or os.environ.get("CRICKET_AGENT_TEAM")
556
+ if not team:
557
+ raise ValueError("Roster-backed training requires --agent-team or CRICKET_AGENT_TEAM.")
558
+ roster = load_team_roster(team)
559
+ if not roster:
560
+ raise ValueError(f"No player profile roster found for agent team '{team}'.")
561
+ playing_xi = build_playing_xi(roster)
562
+ if len(playing_xi) < 11:
563
+ raise ValueError(f"Player profile roster for '{team}' could not produce a playing XI.")
564
+ return playing_xi
565
+
566
+
567
+ def _sample_batter(rng: random.Random, roster: list[dict]) -> dict:
568
+ batters = [p for p in roster if p.get("role") != "bowler"] or roster
569
+ if not batters:
570
+ raise ValueError("Roster-backed training requires at least one batting-capable player.")
571
+ return rng.choice(batters)
572
+
573
+
574
+ def _sample_bowler(rng: random.Random, roster: list[dict]) -> dict:
575
+ bowlers = [p for p in roster if p.get("bowler_type")]
576
+ if not bowlers:
577
+ raise ValueError("Roster-backed training requires at least one bowling-capable player.")
578
+ return rng.choice(bowlers)
579
+
580
+
581
  def _random_action(
582
  rng: random.Random,
583
  game_state: str = "batting",
584
  available_tools: list[str] | None = None,
585
  current_bowler_type: str | None = None,
586
+ roster: list[dict] | None = None,
587
  ) -> CricketAction:
588
+ legal = set(available_tools or [])
589
+
590
+ def can(tool: str) -> bool:
591
+ return available_tools is None or tool in legal
592
+
593
+ def match_plan_action() -> CricketAction:
594
+ return CricketAction(tool="set_match_plan", arguments={
595
+ "powerplay_intent": "Use roster strengths to establish tempo while protecting wickets",
596
+ "middle_intent": "Rotate strike, attack favorable matchups, and preserve finishers",
597
+ "death_intent": "Commit boundary options with wickets and target pressure in mind",
598
+ "risk_budget": "Escalate only when phase, target, and wickets justify the risk",
599
+ "trigger_conditions": "Review after wicket clusters, phase changes, target pressure, or repeated boundary/dot outcomes",
600
+ "rationale": "Create a long-horizon plan before choosing ball-by-ball tactics",
601
+ })
602
 
603
  if game_state == "toss":
604
+ return CricketAction(
605
  tool="call_toss",
606
  arguments={"call": rng.choice(["heads", "tails"]), "decision": rng.choice(["bat", "bowl"])},
607
+ )
608
+
609
+ if can("set_match_plan") and rng.random() < 0.12:
610
+ return match_plan_action()
611
+
612
+ if can("update_match_plan") and rng.random() < 0.08:
613
+ return CricketAction(tool="update_match_plan", arguments={
614
+ "reason": "Adjust plan after phase, score pressure, wickets, and field information",
615
+ "risk_budget": "Shift risk based on current target pressure and wickets in hand",
616
+ })
617
+
618
  if game_state == "bowling":
619
  choice = rng.random()
620
+ if choice < 0.15 and can("choose_bowler"):
621
+ bowler = _sample_bowler(rng, roster or [])
622
+ return CricketAction(
623
  tool="choose_bowler",
624
  arguments={
625
+ "name": bowler["name"],
626
+ "bowler_type": bowler["bowler_type"],
627
+ "style": bowler.get("bowl_style", bowler.get("style", "stock")),
628
+ "rationale": "Match roster bowler to phase, batter matchup, and remaining overs",
629
  },
630
+ )
631
+ if choice < 0.35 and can("plan_delivery"):
632
+ return CricketAction(
633
  tool="plan_delivery",
634
  arguments={
635
  "bowler_type": current_bowler_type or rng.choice(["pace", "spin"]),
 
638
  "delivery_type": rng.choice(["stock", "yorker", "bouncer", "slower ball"]),
639
  "rationale": "Use field and batter style to control scoring zones",
640
  },
641
+ )
642
+ if choice < 0.5 and can("set_field_setting"):
643
+ return CricketAction(tool="set_field_setting", arguments={"setting": rng.choice(["Aggressive", "Balanced", "Defensive"])})
644
+ if choice < 0.6 and can("reflect_after_ball"):
645
+ return CricketAction(tool="reflect_after_ball", arguments={"reflection": "Adjust line and field after the last ball"})
646
+ if can("bowl_delivery"):
647
+ return CricketAction(tool="bowl_delivery", arguments={})
648
+ if can("set_bowling_strategy"):
649
+ return CricketAction(tool="set_bowling_strategy", arguments={
650
+ "bowler_type": current_bowler_type or "pace",
651
+ "line": "outside off",
652
+ "length": "good length",
653
+ "delivery_type": "stock",
654
+ "rationale": "Set a legal bowling plan before executing the delivery",
655
+ })
656
+ raise ValueError(f"No legal bowling action available from tools={available_tools}")
657
 
658
  choice = rng.random()
659
+ if choice < 0.15 and can("select_batter"):
660
+ batter = _sample_batter(rng, roster or [])
661
+ return CricketAction(
662
  tool="select_batter",
663
  arguments={
664
+ "name": batter["name"],
665
+ "style": batter.get("style", "balanced"),
666
+ "aggression": round(float(batter["aggression"]), 2),
667
  "rationale": "Select batter based on phase, wickets, and target pressure",
668
  },
669
+ )
670
+ if choice < 0.3 and can("set_strategy"):
671
+ return CricketAction(
672
  tool="set_strategy",
673
  arguments={
674
  "phase_intent": rng.choice(["attack", "consolidate", "rotate"]),
675
  "aggression": round(rng.uniform(0.1, 0.9), 2),
676
+ "rationale": "Align roster strengths with phase, target pressure, and wickets",
677
  },
678
+ )
679
+ if choice < 0.45 and can("plan_shot"):
680
+ return CricketAction(
681
  tool="plan_shot",
682
  arguments={
683
  "shot_intent": rng.choice(_RANDOM_SHOTS),
684
+ "target_area": rng.choice(_RANDOM_ZONES),
685
+ "trajectory": rng.choice(["ground", "lofted", "aerial"]),
686
  "risk": rng.choice(["low", "balanced", "high"]),
687
  "rationale": "Plan shot against bowler, field, and required rate",
688
  },
689
+ )
690
+ if choice < 0.55 and can("analyze_situation"):
691
+ return CricketAction(
692
  tool="analyze_situation",
693
  arguments={"query_type": rng.choice(_RANDOM_QUERIES)},
694
+ )
695
+ if choice < 0.65 and can("reflect_after_ball"):
696
+ return CricketAction(tool="reflect_after_ball", arguments={"reflection": "Revise risk after previous ball"})
697
+ if can("play_delivery"):
698
+ return CricketAction(
699
+ tool="play_delivery",
700
+ arguments={"shot_intent": rng.choice(_RANDOM_SHOTS), "explanation": "Advance the innings according to the current plan"},
701
+ )
702
+ raise ValueError(f"No legal batting action available from tools={available_tools}")
703
 
704
 
705
  def collect_prompts(
706
  n_prompts: int,
707
  task: str = "stage2_full",
708
  seed: int = 42,
709
+ agent_team: str | None = None,
710
+ opponent_mode: str = "heuristic",
711
  ) -> list[str]:
712
  """
713
  Collect game-state prompts by running episodes with random actions.
714
  Returns a list of prompt strings (one per game state observation).
715
  """
716
  rng = random.Random(seed)
717
+ roster = _training_roster(agent_team)
718
+ _PROMPT_ENV_SNAPSHOTS.clear()
719
  prompts: list[str] = []
720
  ep_count = 0
721
 
722
  while len(prompts) < n_prompts:
723
  env = CricketEnvironment()
724
+ obs = env.reset(seed=rng.randint(0, 99999), options={
725
+ "task": task,
726
+ "random_start": True,
727
+ "agent_team": agent_team or os.environ.get("CRICKET_AGENT_TEAM"),
728
+ "opponent_mode": opponent_mode,
729
+ })
730
+ prompts.append(_remember_prompt(obs.prompt_text, env))
731
  steps = 0
732
 
733
  while not obs.done and steps < 80:
 
736
  obs.game_state,
737
  obs.available_tools,
738
  obs.current_bowler.get("type") if obs.current_bowler else None,
739
+ roster,
740
  )
741
  obs = env.step(action)
742
  if not obs.done:
743
+ prompts.append(_remember_prompt(obs.prompt_text, env))
744
  steps += 1
745
 
746
  ep_count += 1
 
762
  return Dataset.from_dict({"prompt": prompts})
763
 
764
 
765
+ class CricketCaptainToolEnv:
766
+ """TRL environment wrapper exposing CricketCaptain actions as real tools."""
767
+
768
+ _stats_lock = threading.Lock()
769
+
770
+ def __init__(self):
771
+ self.env = CricketEnvironment()
772
+ self.reward = 0.0
773
+ self.done = False
774
+ self.final_reward = 0.0
775
+ self._episode_seed: int | None = None
776
+ self._episode_started = False
777
+ self._max_tool_iters: int | None = None
778
+ self._episode_had_step = False
779
+ self._episode_logged = False
780
+
781
+ def _maybe_log_episode_end(self, termination_reason: str):
782
+ # Avoid double-logging the same episode (e.g. once at termination, again on reset()).
783
+ if self._episode_logged:
784
+ return
785
+ stats_path = os.environ.get("CRICKET_EPISODE_STATS_PATH")
786
+ if not stats_path:
787
+ return
788
+
789
+ state = getattr(self.env, "state", None)
790
+
791
+ payload = {
792
+ "ts": datetime.datetime.now().isoformat(),
793
+ "seed": self._episode_seed,
794
+ "done": bool(self.done),
795
+ "termination_reason": termination_reason,
796
+ "reward_running_sum": float(self.reward),
797
+ "final_reward_bonus": float(self.final_reward),
798
+ }
799
+
800
+ if state is not None:
801
+ # ---- match config / context ----
802
+ payload["max_overs"] = getattr(state, "max_overs", None)
803
+ payload["opponent_mode"] = getattr(state, "opponent_mode", None)
804
+ payload["agent_team"] = getattr(state, "eval_pack_id", None) or getattr(state, "agent_team", None)
805
+ payload["innings_type"] = getattr(state, "innings_type", None)
806
+ payload["game_state"] = getattr(state, "game_state", None)
807
+
808
+ # ---- match outcome ----
809
+ payload["overs_played"] = getattr(state, "over", None)
810
+ payload["balls_played"] = getattr(state, "ball", None)
811
+ payload["agent_score"] = getattr(state, "total_score", None)
812
+ payload["wickets_lost"] = getattr(state, "wickets_lost", None)
813
+ payload["first_innings_score"] = getattr(state, "first_innings_score", None)
814
+ payload["target"] = getattr(state, "target", None)
815
+ payload["match_result"] = getattr(state, "match_result", None) or None
816
+
817
+ # ---- tool calls ----
818
+ tool_calls_made = int(getattr(state, "tool_calls_made", 0) or 0)
819
+ payload["tool_calls"] = tool_calls_made
820
+ tool_history = getattr(state, "tool_history", None) or []
821
+ tool_breakdown: dict[str, int] = {}
822
+ for c in tool_history:
823
+ t = c.get("tool", "unknown")
824
+ tool_breakdown[t] = tool_breakdown.get(t, 0) + 1
825
+ payload["tool_breakdown"] = tool_breakdown
826
+ payload["analyze_calls"] = len(getattr(state, "analyze_calls", []) or [])
827
+
828
+ # ---- per-turn rubric averages (mean across the full episode) ----
829
+ def _mean(xs):
830
+ xs = list(xs or [])
831
+ return round(sum(xs) / len(xs), 4) if xs else None
832
+ payload["mean_coherence"] = _mean(getattr(state, "coherence_scores", None))
833
+ payload["mean_adaptation"] = _mean(getattr(state, "adaptation_scores", None))
834
+ payload["mean_opponent_awareness"] = _mean(getattr(state, "opponent_awareness_scores", None))
835
+ payload["mean_regret"] = _mean(getattr(state, "regret_scores", None))
836
+ payload["mean_plan_commitment"] = _mean(getattr(state, "plan_commitment_scores", None))
837
+ payload["mean_plan_freshness"] = _mean(getattr(state, "plan_freshness_scores", None))
838
+ payload["strategy_changes"] = getattr(state, "strategy_changes", None)
839
+ payload["plan_version"] = getattr(state, "plan_version", None)
840
+
841
+ # ---- composite + per-rubric reward (already computed in reward_calculator) ----
842
+ if getattr(state, "reward_breakdown", None):
843
+ payload["reward_breakdown"] = dict(state.reward_breakdown)
844
+
845
+ with self._stats_lock:
846
+ with open(stats_path, "a", encoding="utf-8") as f:
847
+ f.write(json.dumps(payload, ensure_ascii=False) + "\n")
848
+ f.flush()
849
+ self._episode_logged = True
850
+
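For reference, a single episode_stats.jsonl record assembled from the keys above might look like the sketch below; the key names come from the payload code, every value is invented:

```python
example_record = {
    "ts": "2026-01-01T00:00:00", "seed": 17, "done": True, "termination_reason": "natural",
    "reward_running_sum": 0.62, "final_reward_bonus": 0.41,
    "max_overs": 5, "opponent_mode": "heuristic", "game_state": "bowling",
    "overs_played": 5, "balls_played": 0, "agent_score": 43, "wickets_lost": 3,
    "tool_calls": 176, "tool_breakdown": {"play_delivery": 30, "bowl_delivery": 30, "set_strategy": 3},
    "analyze_calls": 2, "mean_coherence": 0.71, "strategy_changes": 2, "plan_version": 2,
    "reward_breakdown": {"composite": 0.41},
}
```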
851
+ def reset(self, **kwargs) -> str:
852
+ # If the previous episode ended because the trainer hit the tool-iteration cap,
853
+ # TRL will stop calling tools and then call reset() for the next scenario.
854
+ # In that case, self.done will still be False, but tool_calls_made will be at/near the cap.
855
+ if self._episode_started and self._episode_had_step and not self._episode_logged:
856
+ prev_calls = getattr(getattr(self.env, "state", None), "tool_calls_made", None)
857
+ if self.done:
858
+ self._maybe_log_episode_end("natural")
859
+ elif self._max_tool_iters and prev_calls is not None and int(prev_calls) >= int(self._max_tool_iters):
860
+ self._maybe_log_episode_end("cap")
861
+ # Otherwise: trainer reset the env mid-episode (e.g. generation bookkeeping).
862
+ # Don't log — it would skew the termination distribution.
863
+
864
+ self.reward = 0.0
865
+ self.done = False
866
+ self.final_reward = 0.0
867
+
868
+ self._episode_seed = kwargs.get("seed")
869
+ self._episode_started = True
870
+ self._episode_had_step = False
871
+ self._episode_logged = False
872
+ self._max_tool_iters = (
873
+ int(kwargs["max_tool_calling_iterations"])
874
+ if "max_tool_calling_iterations" in kwargs and kwargs["max_tool_calling_iterations"] is not None
875
+ else (int(os.environ["CRICKET_MAX_TOOL_ITERS"]) if os.environ.get("CRICKET_MAX_TOOL_ITERS") else None)
876
+ )
877
+
878
+ obs = self.env.reset(seed=kwargs.get("seed"), options={
879
+ "task": kwargs.get("task", "stage2_full"),
880
+ "random_start": bool(kwargs.get("random_start", False)),
881
+ "max_overs": int(kwargs.get("max_overs", 5)),
882
+ "eval_pack_id": kwargs.get("eval_pack_id", "adaptive_t20_v1"),
883
+ "opponent_mode": kwargs.get("opponent_mode", "heuristic"),
884
+ "opponent_cache_path": kwargs.get("opponent_cache_path"),
885
+ "agent_team": kwargs.get("agent_team"),
886
+ })
887
+ return obs.prompt_text
888
+
889
+ def _apply(self, tool: str, arguments: dict[str, Any]) -> str:
890
+ if self.done:
891
+ raise ValueError("Match is already finished.")
892
+ self._episode_had_step = True
893
+ available = self.env._get_available_tools()
894
+ if tool not in available:
895
+ self.reward -= 0.2
896
+ raise ValueError(f"Tool '{tool}' is not available. Available tools: {available}")
897
+ obs = self.env.step(CricketAction(tool=tool, arguments=arguments))
898
+ self.done = bool(obs.done)
899
+ self.reward += float(obs.reward or 0.0)
900
+ if obs.done and self.env.state.reward_breakdown:
901
+ self.final_reward = float(self.env.state.reward_breakdown.get("composite", 0.0))
902
+ self.reward += self.final_reward
903
+ # Log at the time of termination (do not wait for reset()) so the file appears promptly.
904
+ if self.done:
905
+ self._maybe_log_episode_end("natural")
906
+ # Also log cap termination as soon as we hit it, so runs always get stats even if TRL delays reset().
907
+ elif self._max_tool_iters:
908
+ state = getattr(self.env, "state", None)
909
+ calls = getattr(state, "tool_calls_made", None) if state is not None else None
910
+ if calls is not None and int(calls) >= int(self._max_tool_iters):
911
+ self._maybe_log_episode_end("cap")
912
+ return obs.prompt_text
913
+
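A rough sketch of how one rollout turn flows through this wrapper when driven by hand (TRL normally issues these calls itself; the seed and team id are placeholders):

```python
env = CricketCaptainToolEnv()
prompt = env.reset(seed=7, task="stage2_full", max_overs=2, agent_team="IND")  # team id is a placeholder
prompt = env.call_toss(call="heads", decision="bat")
# Whether play_delivery or bowl_delivery is legal next depends on who won the toss;
# an illegal choice raises ValueError (and costs -0.2), which TRL feeds back to the model.
print(env.reward, env.done)
```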
914
+ def call_toss(self, call: str, decision: str) -> str:
915
+ """
916
+ Call the coin toss and choose whether to bat or bowl if the toss is won.
917
+
918
+ Args:
919
+ call: Coin call, either "heads" or "tails".
920
+ decision: Preferred decision, either "bat" or "bowl".
921
+
922
+ Returns:
923
+ Updated match observation after the toss.
924
+ """
925
+ return self._apply("call_toss", {"call": call, "decision": decision})
926
+
927
+ def set_match_plan(
928
+ self,
929
+ powerplay_intent: str,
930
+ middle_intent: str,
931
+ death_intent: str,
932
+ risk_budget: str,
933
+ trigger_conditions: str,
934
+ rationale: str,
935
+ ) -> str:
936
+ """
937
+ Establish the long-horizon plan for the innings.
938
+
939
+ Args:
940
+ powerplay_intent: Plan for overs in the powerplay.
941
+ middle_intent: Plan for middle overs.
942
+ death_intent: Plan for death overs.
943
+ risk_budget: How wickets, overs, and target pressure affect risk.
944
+ trigger_conditions: Match-state changes that should trigger a plan update.
945
+ rationale: Why this plan fits the roster and match situation.
946
+
947
+ Returns:
948
+ Updated match observation after setting the plan.
949
+ """
950
+ return self._apply("set_match_plan", {
951
+ "powerplay_intent": powerplay_intent,
952
+ "middle_intent": middle_intent,
953
+ "death_intent": death_intent,
954
+ "risk_budget": risk_budget,
955
+ "trigger_conditions": trigger_conditions,
956
+ "rationale": rationale,
957
+ })
958
+
959
+ def update_match_plan(self, reason: str, risk_budget: str = "", trigger_conditions: str = "") -> str:
960
+ """
961
+ Update the long-horizon plan after a meaningful match-state change.
962
+
963
+ Args:
964
+ reason: Specific reason for updating the plan.
965
+ risk_budget: Optional revised risk budget.
966
+ trigger_conditions: Optional revised trigger conditions.
967
+
968
+ Returns:
969
+ Updated match observation after revising the plan.
970
+ """
971
+ args = {"reason": reason}
972
+ if risk_budget:
973
+ args["risk_budget"] = risk_budget
974
+ if trigger_conditions:
975
+ args["trigger_conditions"] = trigger_conditions
976
+ return self._apply("update_match_plan", args)
977
+
978
+ def select_batter(self, name: str, style: str, aggression: float, rationale: str) -> str:
979
+ """
980
+ Select the next batter from the configured roster.
981
+
982
+ Args:
983
+ name: Player name from the team roster.
984
+ style: Batter style from the roster or tactical role.
985
+ aggression: Batting aggression from 0.0 to 1.0.
986
+ rationale: Why this batter fits the phase, wickets, and target.
987
+
988
+ Returns:
989
+ Updated match observation after selecting the batter.
990
+ """
991
+ return self._apply("select_batter", {
992
+ "name": name,
993
+ "style": style,
994
+ "aggression": aggression,
995
+ "rationale": rationale,
996
+ })
997
+
998
+ def set_strategy(self, phase_intent: str, aggression: float, rationale: str) -> str:
999
+ """
1000
+ Set batting strategy for the current phase.
1001
+
1002
+ Args:
1003
+ phase_intent: Tactical batting intent for this phase.
1004
+ aggression: Batting aggression from 0.0 to 1.0.
1005
+ rationale: Why the strategy fits score, wickets, target, and field.
1006
+
1007
+ Returns:
1008
+ Updated match observation after setting batting strategy.
1009
+ """
1010
+ return self._apply("set_strategy", {
1011
+ "phase_intent": phase_intent,
1012
+ "aggression": aggression,
1013
+ "rationale": rationale,
1014
+ })
1015
+
1016
+ def plan_shot(self, shot_intent: str, target_area: str, risk: str, trajectory: str, rationale: str) -> str:
1017
+ """DEPRECATED — pass these args inline to play_delivery() instead.
1018
+
1019
+ Args:
1020
+ shot_intent: leave|defensive|single|rotate|boundary|six.
1021
+ target_area: scoring area.
1022
+ risk: low|balanced|high.
1023
+ trajectory: ground|lofted|aerial.
1024
+ rationale: one-line reason.
1025
+
1026
+ Returns:
1027
+ Updated observation.
1028
+ """
1029
+ return self._apply("plan_shot", {
1030
+ "shot_intent": shot_intent,
1031
+ "target_area": target_area,
1032
+ "risk": risk,
1033
+ "trajectory": trajectory,
1034
+ "rationale": rationale,
1035
+ })
1036
+
1037
+ def play_delivery(
1038
+ self,
1039
+ shot_intent: str = "",
1040
+ target_area: str = "",
1041
+ risk: str = "",
1042
+ trajectory: str = "",
1043
+ rationale: str = "",
1044
+ ) -> str:
1045
+ """
1046
+ Execute the ball. Pass shot params inline to skip plan_shot.
1047
+
1048
+ Args:
1049
+ shot_intent: leave|defensive|single|rotate|boundary|six.
1050
+ target_area: optional scoring area.
1051
+ risk: optional low|balanced|high.
1052
+ trajectory: optional ground|lofted|aerial.
1053
+ rationale: optional one-line reason.
1054
+
1055
+ Returns:
1056
+ Updated observation after the ball outcome.
1057
+ """
1058
+ args: dict[str, Any] = {}
1059
+ if shot_intent: args["shot_intent"] = shot_intent
1060
+ if target_area: args["target_area"] = target_area
1061
+ if risk: args["risk"] = risk
1062
+ if trajectory: args["trajectory"] = trajectory
1063
+ if rationale: args["rationale"] = rationale
1064
+ return self._apply("play_delivery", args)
1065
+
1066
+ def choose_bowler(self, name: str, bowler_type: str, style: str, rationale: str) -> str:
1067
+ """
1068
+ Choose the bowler at the start of an over from the configured roster.
1069
+
1070
+ Args:
1071
+ name: Player name from the team roster.
1072
+ bowler_type: Bowler type, either pace or spin.
1073
+ style: Bowling style or role.
1074
+ rationale: Why this bowler fits phase, matchup, and remaining overs.
1075
+
1076
+ Returns:
1077
+ Updated match observation after choosing the bowler.
1078
+ """
1079
+ return self._apply("choose_bowler", {
1080
+ "name": name,
1081
+ "bowler_type": bowler_type,
1082
+ "style": style,
1083
+ "rationale": rationale,
1084
+ })
1085
+
1086
+ def set_bowling_strategy(self, bowler_type: str, line: str, length: str, delivery_type: str, rationale: str) -> str:
1087
+ """
1088
+ Set bowling strategy for the current bowler.
1089
+
1090
+ Args:
1091
+ bowler_type: Current bowler type, either pace or spin.
1092
+ line: Intended line.
1093
+ length: Intended length.
1094
+ delivery_type: Variation or stock delivery type.
1095
+ rationale: Why this plan fits batter, field, phase, and target.
1096
+
1097
+ Returns:
1098
+ Updated match observation after setting bowling strategy.
1099
+ """
1100
+ return self._apply("set_bowling_strategy", {
1101
+ "bowler_type": bowler_type,
1102
+ "line": line,
1103
+ "length": length,
1104
+ "delivery_type": delivery_type,
1105
+ "rationale": rationale,
1106
+ })
1107
+
1108
+ def plan_delivery(self, bowler_type: str, line: str, length: str, delivery_type: str, rationale: str) -> str:
1109
+ """DEPRECATED — pass these args inline to bowl_delivery() instead.
1110
+
1111
+ Args:
1112
+ bowler_type: pace|spin.
1113
+ line: line.
1114
+ length: length.
1115
+ delivery_type: variation or stock.
1116
+ rationale: one-line reason.
1117
+
1118
+ Returns:
1119
+ Updated observation.
1120
+ """
1121
+ return self._apply("plan_delivery", {
1122
+ "bowler_type": bowler_type,
1123
+ "line": line,
1124
+ "length": length,
1125
+ "delivery_type": delivery_type,
1126
+ "rationale": rationale,
1127
+ })
1128
+
1129
+ def set_field_setting(self, setting: str) -> str:
1130
+ """
1131
+ Set the field preset.
1132
+
1133
+ Args:
1134
+ setting: One of Aggressive, Balanced, or Defensive.
1135
+
1136
+ Returns:
1137
+ Updated match observation after setting the field.
1138
+ """
1139
+ return self._apply("set_field_setting", {"setting": setting})
1140
+
1141
+ def bowl_delivery(
1142
+ self,
1143
+ line: str = "",
1144
+ length: str = "",
1145
+ delivery_type: str = "",
1146
+ rationale: str = "",
1147
+ ) -> str:
1148
+ """
1149
+ Execute the delivery. Pass plan params inline to skip plan_delivery.
1150
+
1151
+ Args:
1152
+ line: optional line.
1153
+ length: optional length.
1154
+ delivery_type: optional variation or stock.
1155
+ rationale: optional one-line reason.
1156
+
1157
+ Returns:
1158
+ Updated observation after the ball outcome.
1159
+ """
1160
+ args: dict[str, Any] = {}
1161
+ if line: args["line"] = line
1162
+ if length: args["length"] = length
1163
+ if delivery_type: args["delivery_type"] = delivery_type
1164
+ if rationale: args["rationale"] = rationale
1165
+ return self._apply("bowl_delivery", args)
1166
+
1167
+ def reflect_after_ball(self, reflection: str) -> str:
1168
+ """
1169
+ Reflect after the previous ball and adapt the plan.
1170
+
1171
+ Args:
1172
+ reflection: Specific tactical lesson from the previous ball.
1173
+
1174
+ Returns:
1175
+ Updated match observation after recording reflection.
1176
+ """
1177
+ return self._apply("reflect_after_ball", {"reflection": reflection})
1178
+
1179
+ def analyze_situation(self, query_type: str) -> str:
1180
+ """
1181
+ Analyze part of the match context.
1182
+
1183
+ Args:
1184
+ query_type: One of pitch_conditions, bowler_info, field_setting, or match_situation.
1185
+
1186
+ Returns:
1187
+ Updated observation containing the analysis result.
1188
+ """
1189
+ return self._apply("analyze_situation", {"query_type": query_type})
1190
+
1191
+
1192
+ def build_agent_dataset(n_examples: int, args) -> Dataset:
1193
+ if Dataset is None:
1194
+ raise ImportError("datasets is required for training. Install with: pip install '.[train]'")
1195
+ rows = []
1196
+ rng = random.Random(args.seed)
1197
+ stage_prompt = get_system_prompt(args.stage)
1198
+ # Curriculum distribution. If --max-overs is set, use it as a fixed format.
1199
+ # Otherwise sample per-scenario from a T2-heavy distribution that tapers to T5.
1200
+ # Rationale: T2 episodes (~72 tool calls) actually COMPLETE within our token
1201
+ # budget so r_result fires; T5 episodes (~180) train the model on its
1202
+ # eval distribution. Heavy weight on short formats early so the policy
1203
+ # escapes the "planning loop" before tackling longer matches.
1204
+ overs_distribution = getattr(args, "overs_distribution", None)
1205
+ fixed_overs = args.max_overs if args.max_overs and args.max_overs > 0 else None
1206
+ if fixed_overs is None and not overs_distribution:
1207
+ # default curriculum (11 weights): ~45% T2, ~27% T3, ~18% T4, ~9% T5
1208
+ overs_distribution = [2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]
1209
+ for idx in range(n_examples):
1210
+ scenario_overs = fixed_overs if fixed_overs is not None else rng.choice(overs_distribution)
1211
+ rows.append({
1212
+ "prompt": [
1213
+ {"role": "system", "content": stage_prompt},
1214
+ {"role": "user", "content": ""},
1215
+ ],
1216
+ "seed": rng.randint(0, 999999),
1217
+ "task": "stage1_format" if args.stage == 1 else "stage2_full",
1218
+ "random_start": False,
1219
+ "max_overs": scenario_overs,
1220
+ "eval_pack_id": args.eval_pack_id,
1221
+ "opponent_mode": args.opponent_mode,
1222
+ "opponent_cache_path": getattr(args, "opponent_cache_path", None),
1223
+ "agent_team": args.agent_team,
1224
+ "scenario_id": idx,
1225
+ })
1226
+ return Dataset.from_list(rows)
1227
+
1228
+
1229
+ def environment_reward(environments, **kwargs) -> list[float]:
1230
+ rewards = []
1231
+ # Aggregate metrics across all envs in this gradient step for WandB logging.
1232
+ agg = {
1233
+ "r_result": [], "r_cricket": [], "r_behavior": [], "r_validity": [],
1234
+ "r_coherence": [], "r_adaptation": [], "r_opponent_awareness": [], "r_regret": [],
1235
+ "composite": [], "tool_calls": [], "wickets_lost": [], "agent_score": [],
1236
+ "matches_completed": 0, "n": 0,
1237
+ }
1238
+ tool_freq: dict[str, int] = {}
1239
+ for env in environments:
1240
+ state = env.env.state
1241
+ breakdown = state.reward_breakdown or {}
1242
+ terminal = float(breakdown.get("composite", 0.0))
1243
+ plan_score = (sum(state.plan_commitment_scores) / len(state.plan_commitment_scores)) if state.plan_commitment_scores else 0.0
1244
+ validity = 1.0 - min(1.0, len([c for c in state.tool_history if c.get("tool") == "invalid_json"]) / max(state.step_count, 1))
1245
+ reward = env.reward + terminal + 0.1 * plan_score + 0.05 * validity
1246
+ # Reward clip removed: when rollouts complete naturally, the composite
1247
+ # reward easily saturates [-1, 1], causing GRPO group-std → 0 and
1248
+ # killing the gradient signal. Let GRPO standardize the advantage itself.
1249
+ rewards.append(round(reward, 4))
1250
+
1251
+ # Collect for aggregate logging.
1252
+ agg["n"] += 1
1253
+ if env.done:
1254
+ agg["matches_completed"] += 1
1255
+ for k in ("r_result", "r_cricket", "r_behavior", "r_validity",
1256
+ "r_coherence", "r_adaptation", "r_opponent_awareness",
1257
+ "r_regret", "composite"):
1258
+ v = breakdown.get(k)
1259
+ if v is not None:
1260
+ agg[k].append(float(v))
1261
+ agg["tool_calls"].append(int(getattr(state, "tool_calls_made", 0) or 0))
1262
+ agg["wickets_lost"].append(int(getattr(state, "wickets_lost", 0) or 0))
1263
+ agg["agent_score"].append(int(getattr(state, "total_score", 0) or 0))
1264
+ for c in (state.tool_history or []):
1265
+ t = c.get("tool", "unknown")
1266
+ tool_freq[t] = tool_freq.get(t, 0) + 1
1267
+
1268
+ # WandB log — only if wandb is initialised in this process.
1269
+ try:
1270
+ import wandb
1271
+ if wandb.run is not None and agg["n"] > 0:
1272
+ log_dict: dict[str, float] = {
1273
+ "rollout/n_episodes": agg["n"],
1274
+ "rollout/matches_completed": agg["matches_completed"],
1275
+ "rollout/match_completion_rate": agg["matches_completed"] / agg["n"],
1276
+ }
1277
+ for k in ("r_result", "r_cricket", "r_behavior", "r_validity",
1278
+ "r_coherence", "r_adaptation", "r_opponent_awareness",
1279
+ "r_regret", "composite"):
1280
+ if agg[k]:
1281
+ log_dict[f"reward/{k}_mean"] = sum(agg[k]) / len(agg[k])
1282
+ log_dict[f"reward/{k}_max"] = max(agg[k])
1283
+ log_dict[f"reward/{k}_min"] = min(agg[k])
1284
+ for k in ("tool_calls", "wickets_lost", "agent_score"):
1285
+ if agg[k]:
1286
+ log_dict[f"episode/{k}_mean"] = sum(agg[k]) / len(agg[k])
1287
+ log_dict[f"episode/{k}_max"] = max(agg[k])
1288
+ # Tool usage breakdown — frequency per tool name across this step.
1289
+ total_tools = sum(tool_freq.values()) or 1
1290
+ for t, n in tool_freq.items():
1291
+ log_dict[f"tools/freq_{t}"] = n / total_tools
1292
+ wandb.log(log_dict)
1293
+ except Exception:
1294
+ # Never let logging break training.
1295
+ pass
1296
+ return rewards
1297
+
1298
+
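As a worked example with invented numbers: an episode with a running in-episode reward of 0.30, a terminal composite of 0.45, mean plan-commitment 0.8 and a validity fraction of 1.0 would score 0.30 + 0.45 + 0.1 * 0.8 + 0.05 * 1.0 = 0.88.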
1299
+ def generate_sft_examples(out_path: str, n_examples: int = 240, seed: int = 42, agent_team: str | None = None):
1300
  """Stage 0 bootstrap data: valid tool JSON for every tool family."""
1301
  rng = random.Random(seed)
1302
+ roster = _training_roster(agent_team)
1303
  examples = []
1304
  for _ in range(n_examples):
1305
  game_state = rng.choice(["toss", "batting", "bowling"])
1306
+ action = _random_action(rng, game_state, roster=roster)
1307
  prompt = (
1308
  f"{SYSTEM_PROMPT}\n\n"
1309
  f"[CricketCaptain] {game_state.upper()} | Example adaptive scenario\n"
 
1330
  # Model loading (plain transformers + bitsandbytes 4-bit) #
1331
  # ------------------------------------------------------------------ #
1332
 
1333
+ def load_model(model_name: str, *, use_vllm: bool = False, resume_adapter_from: str | None = None):
1334
+ """Load base + LoRA. When use_vllm=True, base is loaded in bf16 (vLLM
1335
+ does not support 4-bit BNB inference); otherwise 4-bit NF4.
1336
+
1337
+ resume_adapter_from: optional path to a PEFT adapter directory (e.g. a previous
1338
+ checkpoint dir). If provided, loads the adapter weights instead of initializing
1339
+ a fresh LoRA. The base model is still loaded from `model_name`. The adapter's
1340
+ LoraConfig is preserved (so you can resume even if r or alpha differ between runs)."""
1341
  if not _TRAIN_IMPORTS_AVAILABLE:
1342
  raise ImportError("Training dependencies are missing. Install with: pip install '.[train]'")
1343
+ print(f"Loading {model_name} … (use_vllm={use_vllm}, dtype={'bf16' if use_vllm else '4-bit NF4'})")
 
 
 
 
 
 
1344
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
1345
  if tokenizer.pad_token is None:
1346
  tokenizer.pad_token = tokenizer.eos_token
1347
 
1348
+ try:
1349
+ import flash_attn # noqa: F401
1350
+ attn_impl = "flash_attention_2"
1351
+ except ImportError:
1352
+ attn_impl = "sdpa"
1353
+
1354
+ load_kwargs = dict(
1355
  device_map="auto",
1356
  trust_remote_code=True,
1357
  torch_dtype=torch.bfloat16,
1358
+ attn_implementation=attn_impl,
1359
  )
1360
+ if not use_vllm:
1361
+ load_kwargs["quantization_config"] = BitsAndBytesConfig(
1362
+ load_in_4bit=True,
1363
+ bnb_4bit_compute_dtype=torch.bfloat16,
1364
+ bnb_4bit_use_double_quant=True,
1365
+ bnb_4bit_quant_type="nf4",
1366
+ )
1367
+
1368
+ model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
1369
+ if not use_vllm:
1370
+ model = prepare_model_for_kbit_training(model)
1371
+
1372
+ if resume_adapter_from:
1373
+ # Resume from a previous PEFT adapter checkpoint (e.g. warmup output).
1374
+ # PeftModel.from_pretrained reads the adapter_config.json from the dir,
1375
+ # so any r/alpha/target_modules saved with the warmup run is preserved.
1376
+ from peft import PeftModel
1377
+ adapter_path = Path(resume_adapter_from)
1378
+ if not adapter_path.exists():
1379
+ raise FileNotFoundError(f"resume_adapter_from path does not exist: {adapter_path}")
1380
+ print(f"Resuming LoRA adapter from {adapter_path}")
1381
+ model = PeftModel.from_pretrained(model, str(adapter_path), is_trainable=True)
1382
+ else:
1383
+ lora_cfg = LoraConfig(
1384
+ r=64,
1385
+ lora_alpha=128,
1386
+ lora_dropout=0.05,
1387
+ bias="none",
1388
+ task_type="CAUSAL_LM",
1389
+ target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
1390
+ )
1391
+ model = get_peft_model(model, lora_cfg)
1392
+
1393
  print(f"Loaded. Parameters: {model.num_parameters():,}")
1394
+ model.print_trainable_parameters()
1395
  return model, tokenizer
1396
 
1397
 
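Typical invocations, as a sketch (the model id is just an example here; use whatever the active config pins):

```python
# Fresh 4-bit NF4 base with a newly initialised LoRA adapter:
model, tok = load_model("Qwen/Qwen3-4B-Instruct-2507")

# bf16 base for vLLM colocate mode, resuming the warmup adapter:
model, tok = load_model(
    "Qwen/Qwen3-4B-Instruct-2507",
    use_vllm=True,
    resume_adapter_from="./checkpoints/stage1_final",
)
```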
 
1402
  def train(args):
1403
  if not _TRAIN_IMPORTS_AVAILABLE:
1404
  raise ImportError("Training dependencies are missing. Install with: pip install '.[train]'")
1405
+ if args.opponent_mode == "llm_live":
1406
+ if args.opponent_model:
1407
+ os.environ["CRICKET_OPPONENT_MODEL"] = args.opponent_model
1408
+ if args.opponent_api_base:
1409
+ os.environ["CRICKET_OPPONENT_API_BASE"] = args.opponent_api_base
1410
+ if args.opponent_api_key:
1411
+ os.environ["CRICKET_OPPONENT_API_KEY"] = args.opponent_api_key
1412
  task = "stage1_format" if args.stage == 1 else "stage2_full"
1413
+ # CRICKET_CKPT_ROOT lets a side-by-side run write checkpoints to a different
1414
+ # tree (e.g. ./checkpoints_smoke) without trampling an active production run.
1415
+ # Default unchanged: ./checkpoints/.
1416
+ ckpt_root = os.environ.get("CRICKET_CKPT_ROOT", "./checkpoints").rstrip("/")
1417
+ out_dir = f"{ckpt_root}/stage{args.stage}"
1418
+ save_dir = f"{ckpt_root}/stage{args.stage}_final"
1419
+
1420
+ ts = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
1421
+ log_dir = Path(f"./logs/run_{ts}_stage{args.stage}_{args.opponent_mode}")
1422
+ log_dir.mkdir(parents=True, exist_ok=True)
1423
+
1424
+ # Make episode termination stats available to the environment wrapper.
1425
+ # This lets us distinguish natural terminations from tool-iteration cap truncations.
1426
+ stats_path = log_dir / "episode_stats.jsonl"
1427
+ os.environ["CRICKET_EPISODE_STATS_PATH"] = str(stats_path)
1428
+ os.environ["CRICKET_MAX_TOOL_ITERS"] = str(args.max_tool_calling_iterations)
1429
+ # Create the file immediately so users can find/tail it even before the first termination.
1430
+ stats_path.touch(exist_ok=True)
1431
 
1432
  print(f"\n=== Stage {args.stage} Training ===")
1433
  print(f"Task: {task} | Prompts: {args.prompts} | Steps: {args.steps}")
1434
+ print(f"Logs: {log_dir}/ | Checkpoints: {out_dir}/")
1435
+ print(f"max_tool_calling_iterations={args.max_tool_calling_iterations} (full 5-over match needs ~180; 20-over needs ~720)")
1436
+
1437
+ (log_dir / "metadata.json").write_text(json.dumps({
1438
+ "stage": args.stage, "model": args.model, "agent_team": args.agent_team,
1439
+ "max_overs": args.max_overs, "opponent_mode": args.opponent_mode,
1440
+ "prompts": args.prompts, "steps": args.steps,
1441
+ "batch_size": args.batch_size, "grad_accum": args.grad_accum,
1442
+ "num_generations": args.num_generations,
1443
+ "max_completion_length": args.max_completion_length,
1444
+ "max_tool_calling_iterations": args.max_tool_calling_iterations,
1445
+ "logging_steps": args.logging_steps,
1446
+ "timestamp": ts,
1447
+ }, indent=2))
1448
+
1449
+ # Build scenario seeds. TRL's environment_factory performs the actual
1450
+ # multi-turn rollout and tool execution during training.
1451
+ print("\nBuilding environment scenarios …")
1452
+ dataset = build_agent_dataset(args.prompts, args)
1453
+
1454
+ # Load model — bf16 if vLLM is on (vLLM rejects 4-bit BNB) or --bf16-base, else 4-bit NF4.
1455
+ # If resume_from is set, load the LoRA adapter from that path instead of fresh init.
1456
+ bf16_base = getattr(args, "use_vllm", False) or getattr(args, "bf16_base", False)
1457
+ resume_from = getattr(args, "resume_from", None)
1458
+ model, tokenizer = load_model(args.model, use_vllm=bf16_base, resume_adapter_from=resume_from)
1459
 
1460
  # GRPO config
1461
+ #
1462
+ # Qwen3 / Qwen3.5 ship with hybrid thinking ENABLED by default. Empirically
1463
+ # (see logs/run_2026-04-25_21-08-45 completions parquet) every generation
1464
+ # spent ~1200 chars inside <think>...</think> and then emitted XML-style
1465
+ # <function><parameter> tags instead of the JSON tool call we asked for.
1466
+ # That meant 0/32 generations were parseable, _apply() never advanced the
1467
+ # match, and episodes always hit max_tool_calling_iterations before any
1468
+ # innings finished — so r_result (55% of the composite) was never earned.
1469
+ #
1470
+ chat_template_kwargs = {}
1471
+ generation_kwargs = {}
1472
+
1473
+ completion_len = max(args.max_completion_length, 2048)
1474
+ use_vllm = getattr(args, "use_vllm", False)
1475
+ vllm_kwargs = {}
1476
+ if use_vllm:
1477
+ # vllm_model_impl: None (default) → vLLM picks its native class. Use this for
1478
+ # Qwen3-* (Qwen3ForCausalLM is registered, native path with full LoRA support).
1479
+ # Set to "transformers" only for Qwen3.5-* where vLLM has no text-only class
1480
+ # registered and the native path tries to load a vision tower.
1481
+ vllm_kwargs = dict(
1482
+ use_vllm=True,
1483
+ vllm_mode="colocate",
1484
+ vllm_gpu_memory_utilization=getattr(args, "vllm_gpu_memory", 0.5),
1485
+ vllm_max_model_length=completion_len + 2048,
1486
+ )
1487
+ vllm_impl = getattr(args, "vllm_model_impl", None)
1488
+ if vllm_impl:
1489
+ vllm_kwargs["vllm_model_impl"] = vllm_impl
1490
+
1491
+ # Resolve hyperparameters from YAML/CLI with sensible fallbacks.
1492
+ lr = args.learning_rate if getattr(args, "learning_rate", None) is not None \
1493
+ else (2e-5 if args.stage == 1 else 1e-5)
1494
+ grpo_beta = getattr(args, "beta", None)
1495
+ grpo_temp = getattr(args, "temperature", None) or 0.8
1496
+ grpo_top_p = getattr(args, "top_p", None)
1497
+ grad_ckpt = getattr(args, "gradient_checkpointing", None)
1498
+ grad_ckpt_kwargs = None
1499
+ if grad_ckpt and getattr(args, "gradient_checkpointing_use_reentrant", None) is not None:
1500
+ grad_ckpt_kwargs = {"use_reentrant": bool(args.gradient_checkpointing_use_reentrant)}
1501
+
1502
+ optional_cfg = {}
1503
+ if grpo_beta is not None:
1504
+ optional_cfg["beta"] = grpo_beta
1505
+ if grpo_top_p is not None:
1506
+ optional_cfg["top_p"] = grpo_top_p
1507
+ if grad_ckpt is not None:
1508
+ optional_cfg["gradient_checkpointing"] = bool(grad_ckpt)
1509
+ if grad_ckpt_kwargs is not None:
1510
+ optional_cfg["gradient_checkpointing_kwargs"] = grad_ckpt_kwargs
1511
+ if getattr(args, "dataloader_pin_memory", None) is not None:
1512
+ optional_cfg["dataloader_pin_memory"] = bool(args.dataloader_pin_memory)
1513
+ if getattr(args, "dataloader_num_workers", None) is not None:
1514
+ optional_cfg["dataloader_num_workers"] = int(args.dataloader_num_workers)
1515
+
1516
  config = GRPOConfig(
1517
  output_dir=out_dir,
1518
+ logging_dir=str(log_dir / "tensorboard"),
1519
  num_train_epochs=1,
1520
  max_steps=args.steps,
1521
  per_device_train_batch_size=args.batch_size,
1522
  gradient_accumulation_steps=args.grad_accum,
1523
+ learning_rate=lr,
1524
  warmup_ratio=0.05,
1525
  lr_scheduler_type="cosine",
1526
+ logging_steps=args.logging_steps,
1527
+ save_steps=getattr(args, "save_steps", None) or 10,
1528
+ save_total_limit=getattr(args, "save_total_limit", None) or 20,
1529
  bf16=True,
1530
+ max_completion_length=completion_len,
 
1531
  num_generations=args.num_generations,
1532
+ max_tool_calling_iterations=args.max_tool_calling_iterations,
1533
+ temperature=grpo_temp,
1534
+ report_to=args.report_to,
1535
+ run_name=args.run_name,
1536
  log_completions=True,
1537
  seed=args.seed,
1538
+ chat_template_kwargs=chat_template_kwargs,
1539
+ generation_kwargs=generation_kwargs,
1540
+ **optional_cfg,
1541
+ **vllm_kwargs,
1542
  )
1543
 
1544
+ # TRL's add_response_schema pattern-matches tokenizer.chat_template against
1545
+ # a fixed list and raises "Unrecognized chat template" if no match. Some
1546
+ # newer Qwen3 builds (e.g. Qwen3-4B-Instruct-2507, Aug 2025) ship a
1547
+ # template that differs from TRL's stored string (the Instruct release
1548
+ # dropped the enable_thinking block) — but the tool-call format
1549
+ # (<tool_call>…</tool_call>) is identical, so the appropriate schema still
1550
+ # parses correctly. We assign it manually before GRPOTrainer init; TRL
1551
+ # checks `response_schema is None` first so this is a safe override.
1552
+ if getattr(tokenizer, "response_schema", None) is None:
1553
+ try:
1554
+ from trl.chat_template_utils import qwen3_schema, qwen3_5_schema
1555
+ m = args.model.lower()
1556
+ if "qwen3.5" in m or "qwen3_5" in m:
1557
+ tokenizer.response_schema = qwen3_5_schema
1558
+ print("Set tokenizer.response_schema = qwen3_5_schema (manual override).")
1559
+ elif "qwen3" in m:
1560
+ tokenizer.response_schema = qwen3_schema
1561
+ print("Set tokenizer.response_schema = qwen3_schema (manual override).")
1562
+ except ImportError:
1563
+ pass
1564
 
1565
  trainer = GRPOTrainer(
1566
  model=model,
1567
+ reward_funcs=environment_reward,
1568
  args=config,
1569
  train_dataset=dataset,
1570
  processing_class=tokenizer,
1571
+ environment_factory=CricketCaptainToolEnv,
1572
  )
1573
 
1574
  print(f"\nStarting training ({args.steps} steps, {len(dataset)} prompts) …")
 
1604
 
1605
  for ep in range(args.eval_episodes):
1606
  env = CricketEnvironment()
1607
+ obs = env.reset(seed=rng.randint(0, 99999), options={
1608
+ "task": "stage2_full",
1609
+ "random_start": False,
1610
+ "agent_team": args.agent_team,
1611
+ })
1612
  steps = 0
1613
 
1614
  while not obs.done and steps < 150:
 
1619
  if data:
1620
  action = CricketAction(tool=data["tool"], arguments=data.get("arguments", {}))
1621
  else:
1622
+ action = CricketAction(tool="invalid_json", arguments={})
 
 
 
 
 
1623
 
1624
  obs = env.step(action)
1625
  steps += 1
 
1650
  def train_smoke(args):
1651
  """Run short direct-environment training rollouts without loading a model."""
1652
  rng = random.Random(args.seed)
1653
+ roster = _training_roster(args.agent_team)
1654
 
1655
  # Auto-create run folder unless --output explicitly given
1656
  if args.output:
 
1691
  "eval_pack_id": args.eval_pack_id,
1692
  "opponent_mode": args.opponent_mode,
1693
  "opponent_cache_path": args.opponent_cache_path,
1694
+ "agent_team": args.agent_team,
1695
  })
1696
  prompts = [_format_prompt(obs.prompt_text)]
1697
  total_reward = 0.0
 
1709
  obs.game_state,
1710
  obs.available_tools,
1711
  obs.current_bowler.get("type") if obs.current_bowler else None,
1712
+ roster,
1713
  )
1714
  obs = env.step(action)
1715
  step_end = time.perf_counter()
 
1802
  def _apply_yaml_defaults(args, cfg: dict) -> None:
1803
  """Merge YAML config values into args, CLI args take precedence."""
1804
  captain = cfg.get("captain", {}) or {}
1805
+ opponent = cfg.get("opponent", {}) or {}
1806
  env_cfg = cfg.get("env", {}) or {}
1807
  train_cfg = cfg.get("train", {}) or {}
1808
 
 
1810
  if val is not None and getattr(args, attr, None) is None:
1811
  setattr(args, attr, val)
1812
 
1813
+ if getattr(args, "cmd", None) == "train":
1814
+ _set("model", train_cfg.get("model"))
1815
+ else:
1816
+ _set("model", captain.get("model"))
1817
  _set("api_base", captain.get("api_base"))
1818
  _set("api_key", os.environ.get(captain.get("api_key_env", "")) or None)
1819
  _set("eval_pack_id", env_cfg.get("eval_pack_id"))
1820
+ _set("opponent_mode", opponent.get("mode"))
1821
+ _set("opponent_cache_path", opponent.get("cache_path"))
1822
+ _set("opponent_model", opponent.get("model"))
1823
+ _set("opponent_api_base", opponent.get("api_base"))
1824
+ api_key_env = opponent.get("api_key_env")
1825
+ _set("opponent_api_key", os.environ.get(api_key_env, "") if api_key_env else None)
1826
  _set("max_overs", env_cfg.get("max_overs"))
1827
+ _set("agent_team", env_cfg.get("agent_team"))
1828
  _set("steps", train_cfg.get("steps"))
1829
  _set("prompts", train_cfg.get("prompts"))
1830
  _set("batch_size", train_cfg.get("batch_size"))
1831
+ _set("grad_accum", train_cfg.get("grad_accum"))
1832
  _set("stage", train_cfg.get("stage"))
1833
+ _set("num_generations", train_cfg.get("num_generations"))
1834
+ _set("max_completion_length", train_cfg.get("max_completion_length"))
1835
+ _set("max_tool_calling_iterations", train_cfg.get("max_tool_calling_iterations"))
1836
+ _set("logging_steps", train_cfg.get("logging_steps"))
1837
+ _set("report_to", train_cfg.get("report_to"))
1838
+ _set("run_name", train_cfg.get("run_name"))
1839
+ _set("learning_rate", train_cfg.get("learning_rate"))
1840
+ _set("beta", train_cfg.get("beta"))
1841
+ _set("temperature", train_cfg.get("temperature"))
1842
+ _set("top_p", train_cfg.get("top_p"))
1843
+ _set("gradient_checkpointing", train_cfg.get("gradient_checkpointing"))
1844
+ _set("gradient_checkpointing_use_reentrant", train_cfg.get("gradient_checkpointing_use_reentrant"))
1845
+ _set("dataloader_pin_memory", train_cfg.get("dataloader_pin_memory"))
1846
+ _set("dataloader_num_workers", train_cfg.get("dataloader_num_workers"))
1847
+ _set("bf16_base", train_cfg.get("bf16_base"))
1848
+ _set("save_steps", train_cfg.get("save_steps"))
1849
+ _set("save_total_limit", train_cfg.get("save_total_limit"))
1850
+ _set("resume_from", train_cfg.get("resume_from"))
1851
+ _set("overs_distribution", train_cfg.get("overs_distribution"))
1852
+ _set("use_vllm", train_cfg.get("use_vllm"))
1853
+ _set("vllm_gpu_memory", train_cfg.get("vllm_gpu_memory"))
1854
+ _set("vllm_model_impl", train_cfg.get("vllm_model_impl"))
1855
 
1856
 
1857
  def main():
 
1867
  t.add_argument("--prompts", type=int, default=None, help="Game state prompts to collect")
1868
  t.add_argument("--steps", type=int, default=None, help="GRPOTrainer max_steps")
1869
  t.add_argument("--batch-size", type=int, default=None, dest="batch_size")
1870
+ t.add_argument("--grad-accum", type=int, default=None, dest="grad_accum")
1871
+ t.add_argument("--num-generations", type=int, default=None, dest="num_generations")
1872
+ t.add_argument("--agent-team", default=None, dest="agent_team")
1873
+ t.add_argument("--opponent-mode", default=None, choices=["heuristic", "llm_live", "llm_cached", "cricsheet"], dest="opponent_mode")
1874
+ t.add_argument("--opponent-model", default=None, dest="opponent_model")
1875
+ t.add_argument("--opponent-api-base", default=None, dest="opponent_api_base")
1876
+ t.add_argument("--opponent-api-key", default=None, dest="opponent_api_key")
1877
+ t.add_argument("--max-overs", type=int, default=None, dest="max_overs")
1878
+ t.add_argument("--eval-pack-id", default=None, dest="eval_pack_id")
1879
+ t.add_argument("--opponent-cache-path", default=None, dest="opponent_cache_path")
1880
+ t.add_argument("--max-completion-length", type=int, default=None, dest="max_completion_length")
1881
+ t.add_argument("--max-tool-calling-iterations", type=int, default=None, dest="max_tool_calling_iterations")
1882
+ t.add_argument("--logging-steps", type=int, default=None, dest="logging_steps")
1883
+ t.add_argument("--report-to", default=None, dest="report_to")
1884
+ t.add_argument("--run-name", default=None, dest="run_name")
1885
  t.add_argument("--seed", type=int, default=42)
1886
+ t.add_argument("--resume-from", default=None, dest="resume_from",
1887
+ help="Path to a previous LoRA adapter dir (e.g. ./checkpoints/stage2_final). "
1888
+ "When set, the adapter is loaded on top of the base model instead of a fresh init.")
1889
+ t.add_argument("--save-steps", type=int, default=None, dest="save_steps")
1890
+ t.add_argument("--save-total-limit", type=int, default=None, dest="save_total_limit")
1891
+ t.add_argument("--learning-rate", type=float, default=None, dest="learning_rate")
1892
+ t.add_argument("--beta", type=float, default=None, dest="beta",
1893
+ help="GRPO KL coefficient. Lower = more exploration.")
1894
+ t.add_argument("--temperature", type=float, default=None, dest="temperature")
1895
+ t.add_argument("--top-p", type=float, default=None, dest="top_p")
1896
+ t.add_argument("--gradient-checkpointing", action="store_true", dest="gradient_checkpointing", default=None)
1897
+ t.add_argument("--no-gradient-checkpointing", action="store_false", dest="gradient_checkpointing")
1898
+ t.add_argument("--gradient-checkpointing-use-reentrant", action="store_true",
1899
+ dest="gradient_checkpointing_use_reentrant", default=None)
1900
+ t.add_argument("--dataloader-pin-memory", action="store_true", dest="dataloader_pin_memory", default=None)
1901
+ t.add_argument("--dataloader-num-workers", type=int, default=None, dest="dataloader_num_workers")
1902
+ t.add_argument("--use-vllm", action="store_true", dest="use_vllm", default=None,
1903
+ help="Use vLLM-backed rollouts (colocate). Forces bf16 base.")
1904
+ t.add_argument("--bf16-base", action="store_true", dest="bf16_base", default=None,
1905
+ help="Load base model in bf16 instead of 4-bit NF4. Faster matmul on H200 since 4B fits in 8GB.")
1906
+ t.add_argument("--vllm-gpu-memory", type=float, default=0.5, dest="vllm_gpu_memory",
1907
+ help="Fraction of GPU memory reserved for vLLM (colocate). Default 0.5.")
1908
+ t.add_argument("--vllm-model-impl", default=None, dest="vllm_model_impl",
1909
+ choices=["transformers", "vllm"],
1910
+ help="vLLM model backend. None (default) = native vLLM class (e.g. Qwen3ForCausalLM); "
1911
+ "'transformers' = HF transformers backend (workaround for Qwen3.5 — flaky with LoRA).")
1912
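Taken together, an illustrative invocation of this subcommand (every flag is defined above; the values are placeholders, not recommendations): `python train.py train --steps 200 --batch-size 2 --grad-accum 4 --num-generations 8 --use-vllm --report-to wandb --run-name demo_run`.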
 
1913
  # eval
1914
  e = sub.add_parser("eval", help="Evaluate a checkpoint")
1915
  e.add_argument("--config", default=None)
1916
  e.add_argument("--model", default=None)
1917
  e.add_argument("--eval-episodes", type=int, default=10, dest="eval_episodes")
1918
+ e.add_argument("--agent-team", default=None, dest="agent_team")
1919
  e.add_argument("--seed", type=int, default=0)
1920
 
1921
  # quick test (no GPU needed)
 
1930
  smoke.add_argument("--eval-pack-id", default=None, dest="eval_pack_id")
1931
  smoke.add_argument("--opponent-mode", default=None, choices=["heuristic", "llm_live", "llm_cached", "cricsheet"], dest="opponent_mode")
1932
  smoke.add_argument("--opponent-cache-path", default=None, dest="opponent_cache_path")
1933
+ smoke.add_argument("--agent-team", default=None, dest="agent_team")
1934
  smoke.add_argument("--output", default=None)
1935
  smoke.add_argument("--seed", type=int, default=42)
1936
 
1937
  sft = sub.add_parser("sft-data", help="Generate Stage 0 supervised tool-format examples")
1938
  sft.add_argument("--output", default="./data/training/tool_sft_examples.jsonl")
1939
  sft.add_argument("--examples", type=int, default=240)
1940
+ sft.add_argument("--agent-team", default=None, dest="agent_team")
1941
  sft.add_argument("--seed", type=int, default=42)
1942
 
1943
  args = parser.parse_args()
 
1955
  if getattr(args, "stage", None) is None:
1956
  args.stage = 1
1957
  if getattr(args, "model", None) is None:
1958
+ args.model = "Qwen/Qwen3.5-4B"
1959
  if getattr(args, "steps", None) is None:
1960
  args.steps = 200
1961
  if getattr(args, "prompts", None) is None:
1962
  args.prompts = 500
1963
  if getattr(args, "batch_size", None) is None:
1964
  args.batch_size = 2
1965
+ if getattr(args, "grad_accum", None) is None:
1966
+ args.grad_accum = 4
1967
  if getattr(args, "eval_pack_id", None) is None:
1968
  args.eval_pack_id = "adaptive_t20_v1"
1969
  if getattr(args, "opponent_mode", None) is None:
1970
  args.opponent_mode = "llm_live"
1971
  if getattr(args, "max_overs", None) is None:
1972
  args.max_overs = 5
1973
+ if getattr(args, "agent_team", None) is None:
1974
+ args.agent_team = os.environ.get("CRICKET_AGENT_TEAM")
1975
+ if getattr(args, "max_tool_calling_iterations", None) is None:
1976
+ args.max_tool_calling_iterations = 200
1977
+ if getattr(args, "logging_steps", None) is None:
1978
+ args.logging_steps = 1
1979
+ if getattr(args, "report_to", None) is None:
1980
+ args.report_to = "none"
1981
 
1982
  if args.cmd == "train":
1983
  train(args)
1984
  elif args.cmd == "eval":
1985
  evaluate(args)
1986
  elif args.cmd == "test":
1987
+ _smoke_test(args.agent_team, args.opponent_mode)
1988
  elif args.cmd == "train-smoke":
1989
  train_smoke(args)
1990
  elif args.cmd == "sft-data":
1991
+ generate_sft_examples(args.output, args.examples, args.seed, args.agent_team)
1992
  else:
1993
  parser.print_help()
1994
 
1995
 
1996
+ def _smoke_test(agent_team: str | None, opponent_mode: str):
1997
  """Verify reward functions work correctly."""
1998
  cases = [
1999
  (
 
2021
  ]
2022
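  # Each case above is a (prompt, completion, expected) triple: the elided
  # completions are presumably canned tool-call strings, and "expected" is a
  # short label printed verbatim below.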
  print("Reward function smoke test:\n")
2023
  for prompt, completion, expected in cases:
2024
+ fmt = r_validity(completion)
2025
+ coh = r_behavior_stateless(prompt, completion)
2026
  print(f" expected={expected:4s} | fmt={fmt:.0f} | coh={coh:.3f} | {completion[:60]}")
2027
 
2028
  print("\nPrompt collection test (5 prompts):")
2029
+ p = collect_prompts(5, task="stage1_format", seed=1, agent_team=agent_team, opponent_mode=opponent_mode)
2030
  for i, pp in enumerate(p):
2031
  print(f" [{i}] {pp[:80].strip()} …")
2032