---
title: CaptainRL Captain's Cockpit
emoji: 🏏
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
base_path: /web
pinned: false
license: mit
---
# CaptainRL — Captain's Cockpit
**An OpenEnv benchmark for long-horizon multi-turn agentic RL: train an LLM to captain a full cricket match.**
🔗 **Live Hugging Face Space:** https://huggingface.co/spaces/pratinavseth/cricket-captain-llm
📊 **Training runs (W&B):** https://wandb.ai/ptnv-s-research/huggingface
🤖 **Trained adapter (HF Hub):** https://huggingface.co/pratinavseth/cricket-captain-qwen3-06b-stage2
📝 **Mini-blog:** [blog.md](blog.md)
🎬 **Demo video (≤2 min):** https://youtu.be/Cqsq6MaeNUg
**Hackathon theme alignment — Theme #2: (Super) Long-Horizon Planning & Instruction Following.** A T20 match is up to ~240 legal balls of strategic decision-making across both offensive and defensive roles — exactly the regime where current LLM agents struggle: deep multi-step reasoning, sparse terminal reward (win/loss arrives 100+ turns after the early decisions that caused it), recovery from early mistakes, and _two-sided_ opponent modelling.
One match = ~180 sequential tool calls, 14 state-conditioned tools, a Markov ball-outcome engine trained on **1.65M cricsheet deliveries**, and a 4-rubric composite reward. The Hugging Face Space exposes the OpenEnv server and a Gradio demo UI at `/web`.
For the full problem statement, env design, reward architecture, and tool list, see [`docs/benchmark_explainer.md`](docs/benchmark_explainer.md). For the practical training/eval recipe, see [`docs/experiment_workflow.md`](docs/experiment_workflow.md).
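For a concrete sense of the interface before diving into those docs, here is an illustrative single captain turn. The tool name and argument fields mirror the qualitative sample later in this README; the authoritative schema lives in `models.py` (`CricketAction`).

```python
import json

# Illustrative only: tool name and arguments follow the qualitative sample
# below; see models.py (CricketAction) for the actual schema.
turn = {
    "name": "play_delivery",
    "arguments": {
        "shot_intent": "boundary",
        "risk": "balanced",
        "rationale": "Powerplay: rotate strike while avoiding early wickets",
    },
}

# TRL's parser only accepts the XML wrapper, not bare JSON (see "Key
# engineering findings" below for why that matters):
print(f"<tool_call>\n{json.dumps(turn)}\n</tool_call>")
```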
---
## Train → Eval framing
Multi-turn agentic RL with **partial-trajectory training and full-task eval generalisation** — the same recipe coding-agent RL papers (SWE-RL, AgentR) use.
| | TRAINING (warmup) | TRAINING (main) | EVALUATION |
| ------------------------ | ------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------- | --------------------------------------------------- |
| max overs                | 2-3 (curriculum)                                                                                         | 5 (end-to-end)                                                     | full T20 (20 overs)                                  |
| steps | 30 | 100 | n/a |
| token budget per rollout | 16k (≈ 80–180 turns) | 24k (≈ 120–240 turns) | unlimited (full match plays out) |
| reward signal | composite (`r_result` + `r_cricket` + `r_behavior` + `r_validity`) — see `server/reward_calculator.py` | same | **headline metric: win rate vs heuristic baseline** |
| script | `python train.py train --config configs/cricket_train_qwen3_warmup.yaml` | `python train.py train --config configs/cricket_train_qwen3.yaml` | `inference.py` (Round 1 OpenEnv benchmark runner) |
The trained policy generalises to full matches at inference because it learns _good per-state decisions_, not specific trajectory lengths. The whole chain runs via `bash scripts/run_warmup_then_main.sh`.
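As a rough sketch of how the composite reward row above combines its four rubrics (the weights here are placeholders, not the repo's; the real values live in `configs/game_knowledge.yaml` and the logic in `server/reward_calculator.py`):

```python
# Placeholder weights for illustration only; the repo's actual weights live
# in configs/game_knowledge.yaml and the logic in server/reward_calculator.py.
def composite_reward(r_result: float, r_cricket: float,
                     r_behavior: float, r_validity: float) -> float:
    weights = {"result": 1.0, "cricket": 0.5, "behavior": 0.3, "validity": 0.2}
    return (weights["result"] * r_result
            + weights["cricket"] * r_cricket
            + weights["behavior"] * r_behavior
            + weights["validity"] * r_validity)
```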
---
## Results
Training stack: Qwen3-4B-Instruct-2507 (Branch A) and Qwen3-0.6B (sanity ablation) + LoRA r=64 + TRL GRPO, vLLM colocate, 1× H200.
### Two formulations evaluated
We evaluated CricketCaptain under two task formulations to isolate where the difficulty lives:
| | **Branch A — per-ball multi-turn** (`main` branch) | **Branch B — per-over heuristic-reward** (`per-over` branch) |
| ------------------------------- | -------------------------------------------------------- | ----------------------------------------------------------------------- |
| Captain decisions | every ball (~180/match) | once per over (~35/match) |
| Effective horizon | full 5-over match | 5× shorter |
| Training loop | TRL `environment_factory` multi-turn rollouts (live env) | Stateless GRPO on snapshot prompts + 12-step rollout reward |
| **Theme #2 fit (long-horizon)** | direct match | weaker (smaller horizon, simplified) |
| Step time | ~95 s/step (4B) | ~7 s/step |
| Run code | `train.py` + `server/cricket_environment.py` | `train.py` (per-over branch) + `server/cricket_environment_per_over.py` |
Branch A is the headline submission for Theme #2. Branch B is preserved as a comparison: an architecturally simpler formulation that produces cleaner reward curves at the cost of horizon depth.
### Branch A — training curves (Qwen3-4B-Instruct-2507, 99 steps)
W&B run: https://wandb.ai/ptnv-s-research/huggingface/runs/aeo4hzs1 · Raw CSV: [docs/plots/4B_per_ball_main_run.csv](docs/plots/4B_per_ball_main_run.csv)
> **Outlier policy** — per-step GRPO metrics have natural spikes (one all-rollout-aborted step pushes `reward_std` to 1.14 vs median 0.30; one OOM-recovery step pushes `grad_norm` to 0.006 vs median 0.000). Each metric is winsorized to [p2, p98] and overlaid with a 5-step centered rolling mean; the faint per-step trace is still drawn so the variance stays visible. 0–4 points are clipped per metric out of 60–100 steps. Regeneration script: [scripts/regen_plots_clean_outliers.py](scripts/regen_plots_clean_outliers.py).
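The cleaning step is small enough to sketch inline. This mirrors the policy above rather than reproducing the repo script, and the CSV column name is an assumption:

```python
import pandas as pd

def clean_metric(series: pd.Series, lo: float = 0.02, hi: float = 0.98,
                 window: int = 5) -> pd.DataFrame:
    """Winsorize to [p2, p98], then add a 5-step centered rolling mean."""
    clipped = series.clip(series.quantile(lo), series.quantile(hi))
    return pd.DataFrame({
        "raw": series,                 # faint per-step trace, kept for honesty
        "winsorized": clipped,
        "smoothed": clipped.rolling(window, center=True).mean(),
    })

df = pd.read_csv("docs/plots/4B_per_ball_main_run.csv")
# curves = clean_metric(df["reward_mean"])  # column name is an assumption
```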
| (A) Per-step reward | (A) Training loss (GRPO surrogate) |
| ---------------------------------- | ---------------------------------- |
| ![](docs/plots/4B_reward_mean.png) | ![](docs/plots/4B_loss.png) |
| (A) Gradient magnitude (log) | (A) Match completion rate |
| -------------------------------- | -------------------------------------- |
| ![](docs/plots/4B_grad_norm.png) | ![](docs/plots/4B_completion_rate.png) |
Honest read: the gradient is real and `reward/std` shows GRPO has differential signal (group spread 0.3–1.0), but `reward/mean` is flat across 100 steps — the model isn't beating the heuristic on 5-over matches yet within this compute budget. The flat curve is the honest "100 steps isn't enough at this difficulty" finding; the engineering findings (below) document the structural fixes that actually moved the needle.
#### Qualitative sample — best rollout from step 99
A single 47-tool-call match from late training (env_reward `+0.068`, GRPO advantage `+1.353`). Edited for length; `rationale` fields preserved verbatim.
| # | Tool | Args (key fields) | Outcome |
| --- | ---------------------- | ----------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| 1 | `call_toss` | `call=heads, decision=bat` | Toss heads, agent chose to bat |
| 2 | `set_match_plan` | powerplay=`rotate_strike_and_build_score`, middle=`conservative_play_to_save_wickets`, death=`aggressive_boundary_push` | rationale: _"Balanced field and pace bowler require cautious powerplay to avoid early wickets while building momentum"_ |
| 3 | `set_strategy` | `phase_intent=rotate_strike_and_build_score, aggression=0.4` | rationale: _"Powerplay requires balance to avoid early wickets while rotating strike and building a steady score"_ |
| 4 | `play_delivery` | `shot_intent=boundary, risk=balanced` | **Worked off the hips — a FOUR** |
| 5 | `update_match_plan` | `reason="Loss of wicket — shift to low risk_budget"` | **Turned to leg — sharp catch at short fine leg. OUT!** |
| 6 | `set_bowling_strategy` | `bowler_type=pace, line=outside_off, length=good` | rationale: _"Maintain consistent stock delivery to exploit line and length against aggressive finisher"_ |
| 7 | `bowl_delivery` | `line=outside_off, length=good, delivery_type=stock` | **Driven through the covers — a SIX!** |
What this rollout shows: phase-aware match plan, coherent text rationales referencing the _actual_ phase / field / bowler observed, plan adaptation after a wicket loss, 9 of 14 tools used, 47/47 turns valid `<tool_call>` XML. Match was lost (heuristic chased 29/4) — but the _behaviour_ is strategically coherent, exactly what `r_result` alone can't reward and what the dense `r_behavior` rubric exists to train.
### Branch B — per-over results
The per-over branch trades horizon depth for trainability: each ball collapses to a single deterministic action, leaving only ~35 strategic decisions per match and a stateless reward computation. Curves climb faster because the task is structurally simpler. See [github.com/pratinavseth/cricket-captain-llm/tree/per-over](https://github.com/pratinavseth/cricket-captain-llm/tree/per-over) for that branch's training code. The Branch B port also lives on `main` via `CRICKET_ENV_VARIANT=per_over` + [`configs/cricket_train_qwen3_06b_perover.yaml`](configs/cricket_train_qwen3_06b_perover.yaml) for a controlled side-by-side run on the same model.
**Run:** Qwen/Qwen3.5-4B + LoRA (r=16, alpha=32), 70 steps Stage 1 + 70 steps Stage 2, 1× H200. Stage 1 final reward: **0.73**. Stage 2 final reward: **0.57**. `frac_reward_zero_std=0` throughout both stages (healthy GRPO gradient signal). Plot generation script: [`scripts/plot_per_over.py`](scripts/plot_per_over.py).
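Since `frac_reward_zero_std` is the canary for dead GRPO gradients, here is a minimal sketch of what it measures (TRL logs the metric itself; this is not the library's implementation):

```python
import numpy as np

def frac_reward_zero_std(group_rewards: np.ndarray) -> float:
    """Fraction of GRPO prompt groups with zero within-group reward std.

    group_rewards has shape (num_prompts, group_size). A zero-std group
    yields all-zero advantages and therefore no gradient, so 0.0 across a
    whole run means every group carried learning signal.
    """
    return float((group_rewards.std(axis=1) == 0).mean())

print(frac_reward_zero_std(np.array([[0.2, 0.8, 0.5, 0.6],
                                     [0.7, 0.7, 0.7, 0.7]])))  # -> 0.5
```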
| (B) Training summary — Stage 1 vs Stage 2 |
| --------------------------------------------- |
| ![](docs/plots/per_over/per_over_summary.png) |
| (B) Per-step reward | (B) Within-group reward spread |
| ------------------------------------------------- | ------------------------------------------------ |
| ![](docs/plots/per_over/per_over_reward_mean.png) | ![](docs/plots/per_over/per_over_reward_std.png) |
| (B) Gradient norm | (B) Training loss |
| ----------------------------------------------- | ------------------------------------------ |
| ![](docs/plots/per_over/per_over_grad_norm.png) | ![](docs/plots/per_over/per_over_loss.png) |
### Key engineering findings during Branch A iteration
Three structural bugs the rich tool-RL setup surfaced — all caught by reading the per-step traces, all fixed.
| Issue surfaced | Fix | Effect |
| ------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| Only ~19% of rollouts reached `done` naturally; the rest aborted mid-match | Restrict the system prompt to a single `<tool_call>...</tool_call>` XML format. The original prompt advertised "XML or JSON — both accepted", but TRL's response-schema parser only recognises the XML wrapper. Bare JSON in any later turn aborted the rollout. | `tools/call_frequency` 9 → 73; rollouts 5–8× longer; matches play out; `r_result` actually fires |
| GRPO `reward/std` collapsed toward 0 once matches completed | Remove the `[-1, 1]` reward clip. With completed matches the dense per-ball signal saturated the ceiling, group std → 0, advantages → 0, gradients → 0. Let GRPO standardise the advantage itself. | `reward/std` 0.0 → 1.5; gradient signal restored |
| Composite reward stayed positive even on dominant losses (chase-to-zero with all-out) | Add explicit `outcome_bonus = -1.0` for clear losses (was `0.0`). Reduce always-positive `progress_bonus` cap from 0.25 → 0.10. Allow first-innings bowling reward to be signed. | Composite spans negative AND positive — model has a real reason to win |
These are the kinds of failure modes long-horizon agentic RL tends to hide. Logging them honestly is more valuable than a polished but uninterrogated curve.
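The second finding (the reward clip) is the most transferable, so here is the failure in miniature. The rewards are made-up numbers, but the mechanism is exactly what the table row describes:

```python
import numpy as np

# Illustrative rewards for one GRPO group once matches started completing;
# the dense per-ball signal pushes everything past the old [-1, 1] ceiling.
raw = np.array([1.6, 2.1, 1.8, 2.4])

clipped = np.clip(raw, -1.0, 1.0)    # -> [1. 1. 1. 1.]: group std collapses
print(clipped - clipped.mean())      # all-zero advantages -> zero gradient

# Without the clip, GRPO's own standardisation recovers the ranking signal:
print((raw - raw.mean()) / (raw.std() + 1e-8))
```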
---
## Quickstart
System requirements: Python 3.10+, CUDA 12.x for training. Inference / random baselines / Gradio UI work CPU-only. Versions in [pyproject.toml](pyproject.toml) are pinned to the lowest combination that supports TRL multi-turn `environment_factory` AND vLLM colocate AND transformers v5 chat templates (transformers 5.6.2 + trl 1.2.0 + vllm 0.19.1 + torch 2.10.0).
### Install
```bash
git clone <this-repo> cricket-captain-llm
cd cricket-captain-llm
# Full install (training + eval plots)
uv venv .venv --python 3.10
uv pip install --python .venv/bin/python -e ".[train,eval]"
source .venv/bin/activate
# OR inference-only (no GPU)
uv venv .venv --python 3.10
uv pip install --python .venv/bin/python -e .
# HuggingFace login (only for gated models / Hub uploads)
export HF_TOKEN=hf_...
```
### Train (warmup → main)
```bash
bash scripts/run_warmup_then_main.sh
# Logs: /tmp/train_warmup.log → /tmp/train_main.log
# Final adapter: ./checkpoints/stage2_final/
```
Internally:
- **Warmup** (2-3 over curriculum, 30 steps, ~50–60 min on H200): bootstraps the LoRA adapter from base Qwen3-4B-Instruct-2507.
- **Main** (5-over end-to-end, 100 steps, ~5–7 hr): resumes warmup adapter, trains on full 5-over matches with `r_result` as the dominant gradient driver. Final adapter at `./checkpoints/stage2_final/`.
Run individually:
```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```
### Inference / OpenEnv benchmark
```bash
# Start the env server (port 8000)
PYTHONPATH=. python server/app.py --port 8000 --config configs/extras/default.yaml
# Random captain (no API key needed)
export CRICKET_CAPTAIN_ENV_URL="ws://localhost:8000"
python inference.py --model random --episodes 5 --task easy
# Live HF captain via OpenAI-compat router
export HF_TOKEN=hf_... MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py --episodes 10 --task medium
# Trained adapter served locally via model_server.py
python model_server.py --checkpoint ./checkpoints/stage2_final --port 8080 &
python inference.py --model local --api-base http://localhost:8080/v1 --episodes 20 --task easy
```
`inference.py` emits Round 1 `[START]` / `[STEP]` / `[END]` stdout markers per the OpenEnv benchmark spec. The three task IDs (`easy`/`medium`/`hard`) map to 5/20/50-over matches via `openenv.yaml`.
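A small sketch of consuming those markers from the runner's stdout (any payload after each marker prefix is an assumption here; check `inference.py` for the exact fields):

```python
import subprocess

# Run one cheap random-captain episode and count benchmark markers. We only
# key on the [START]/[STEP]/[END] prefixes the Round 1 spec requires.
proc = subprocess.run(
    ["python", "inference.py", "--model", "random",
     "--episodes", "1", "--task", "easy"],
    capture_output=True, text=True,
)
lines = proc.stdout.splitlines()
steps = sum(line.startswith("[STEP]") for line in lines)
finished = any(line.startswith("[END]") for line in lines)
print(f"{steps} steps logged; clean termination: {finished}")
```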
### Configuration
YAML is the single source of truth. Pass `--config configs/cricket_train_qwen3.yaml` and override on CLI as needed. Key files:
- [`configs/cricket_train_qwen3_warmup.yaml`](configs/cricket_train_qwen3_warmup.yaml) — warmup (2-3 over, 30 steps).
- [`configs/cricket_train_qwen3.yaml`](configs/cricket_train_qwen3.yaml) — main (5-over, 100 steps), resumes from warmup.
- [`configs/cricket_train_qwen3_06b.yaml`](configs/cricket_train_qwen3_06b.yaml) — 0.6B 1-hour ablation with KL anchor on.
- [`configs/cricket_train_qwen3_06b_perover.yaml`](configs/cricket_train_qwen3_06b_perover.yaml) — Branch B port (per-over) on the same 0.6B model.
- [`configs/game_knowledge.yaml`](configs/game_knowledge.yaml) — reward weights and game constants.
GPU sizing, opponent modes, smoke tests, and detailed sub-rubric definitions live in [`docs/experiment_workflow.md`](docs/experiment_workflow.md) and [`docs/benchmark_explainer.md`](docs/benchmark_explainer.md).
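As a sketch of the "YAML is the single source of truth, CLI overrides on top" convention (the `KEY=VALUE` override syntax shown is illustrative; `train.py --help` has the real CLI):

```python
import sys
import yaml

# Illustrative config-loading pattern only; train.py's actual CLI may differ.
with open("configs/cricket_train_qwen3.yaml") as f:
    cfg = yaml.safe_load(f)

# Apply KEY=VALUE overrides from the command line, parsing values as YAML
# so numbers and booleans come through typed rather than as strings.
for arg in sys.argv[1:]:
    if "=" in arg:
        key, value = arg.split("=", 1)
        cfg[key] = yaml.safe_load(value)
```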
---
## Hackathon submission materials
| Material | Link |
| --------------------------------- | -------------------------------------------------------------------- |
| **Live HF Space** (env runs here) | https://huggingface.co/spaces/pratinavseth/cricket-captain-llm |
| **GitHub repo** | https://github.com/pratinavseth/cricket-captain-llm |
| **W&B project** (training runs) | https://wandb.ai/ptnv-s-research/huggingface |
| **Trained adapter (HF Hub)** | https://huggingface.co/pratinavseth/cricket-captain-qwen3-06b-stage2 |
| **Mini-blog** | [blog.md](blog.md) |
| **Demo video (≤2 min, YouTube)** | https://youtu.be/Cqsq6MaeNUg |
### Submission checklist
- [x] OpenEnv (latest release): `openenv-core[core]>=0.2.2` ([pyproject.toml](pyproject.toml))
- [x] Working training script (TRL GRPO): [train.py](train.py)
- [x] Reward + training pipeline coherent: 4-rubric composite (`r_result` / `r_cricket` / `r_behavior` / `r_validity`), documented in `server/reward_calculator.py`
- [x] HF Space pushed and discoverable
- [x] Loss + reward + grad-norm + match-completion plots from real run, with explicit outlier policy
- [x] Mini-blog: [blog.md](blog.md)
- [x] README motivates problem, explains env, links materials
- [x] OpenEnv compliance: `Environment` base class, valid `openenv.yaml`, no reserved tool names
- [x] ≤2 min demo video: https://youtu.be/Cqsq6MaeNUg
---
## Project structure
```
cricket-captain-llm/
├── Dockerfile Container image (multi-stage, openenv-base)
├── openenv.yaml OpenEnv manifest (runtime: fastapi, port 8000)
├── pyproject.toml Pinned deps
├── client.py EnvClient subclass — WebSocket client for the env
├── models.py Pydantic types (CricketAction / CricketObservation / CricketState)
├── train.py GRPO training script (TRL + vLLM colocate + LoRA)
├── eval.py Reward-curve / coherence-heatmap visualisation from training logs
├── inference.py Round 1 OpenEnv benchmark runner ([START]/[STEP]/[END] markers)
├── model_server.py OpenAI-compat server backed by local transformers + PEFT
├── validate-submission.sh HF Space + Docker + openenv-validate health check
├── configs/ cricket_train_qwen3{,_warmup,_06b,_06b_perover}.yaml + game_knowledge.yaml
├── data/ Markov tables, DLS par, format rules, eval packs, player profiles
├── docs/
│ ├── benchmark_explainer.md Full problem + design + reward walkthrough
│ ├── experiment_workflow.md Step-by-step training + eval recipe
│ ├── SUBMISSION.md Operator runbook for HF Space + endpoint + secrets
│   └── plots/                 Branch A (real) / Branch B (placeholder) training-dynamics PNGs + 4B CSV
├── scripts/
│ ├── run_warmup_then_main.sh Chain warmup → main
│ ├── regen_plots_clean_outliers.py Re-render A plots from CSV with [p2, p98] winsorize
│ └── make_b_placeholder.py Re-render B placeholders
└── server/
├── app.py FastAPI entrypoint (create_app + Gradio UI)
├── cricket_environment.py Env state machine (TOSS → BAT → BOWL → FINISHED)
├── cricket_environment_per_over.py Per-over (Branch B) variant
├── markov_engine.py Per-ball outcome sampling (cricsheet-trained)
├── opponent_policy.py heuristic / cricsheet / llm_live / llm_cached
├── coherence_grader.py Plan-action coherence rubric
├── reward_calculator.py 4-rubric composite reward
└── ui.py Custom Gradio dashboard for the Space
```
---
## Team
- **Pratinav Seth** - [GitHub](https://github.com/pratinavseth) | [LinkedIn](https://www.linkedin.com/in/pratinav-seth/)
- **Divyansh Kulshreshtha** - [GitHub](https://github.com/divyanshkul) | [LinkedIn](https://www.linkedin.com/in/divyanshkul)
- **Siddhant Bharadwaj** - [GitHub](https://github.com/ignoreandfly) | [LinkedIn](https://www.linkedin.com/in/siddhant0701)
## Citation
```bibtex
@misc{cricketcaptain2025,
title={CricketCaptain-LLM: Adaptive Strategic Decision-Making for Language Model Agents},
year={2025},
note={OpenEnv Hackathon submission}
}
@article{wdct2025,
title={Large Language Models Often Say One Thing and Do Another},
year={2025},
url={https://arxiv.org/abs/2503.07003}
}
```