---
title: CaptainRL — Captain's Cockpit
emoji: 🏏
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
base_path: /web
pinned: false
license: mit
---

| Resource               | Link                                                                      |
| ---------------------- | ------------------------------------------------------------------------- |
| Hugging Face Space URL | https://huggingface.co/spaces/pratinavseth/cricket-captain-llm            |
| Hugging Face Repo      | https://huggingface.co/spaces/pratinavseth/cricket-captain-llm/tree/main  |
| Demo Video             | https://youtu.be/Cqsq6MaeNUg                                              |

# CaptainRL — Captain's Cockpit

**An OpenEnv benchmark for long-horizon multi-turn agentic RL: train an LLM to captain a full cricket match.**

🔗 **Live Hugging Face Space:** https://huggingface.co/spaces/pratinavseth/cricket-captain-llm
📊 **Training runs (W&B):** https://wandb.ai/ptnv-s-research/huggingface
🤖 **Trained adapter (HF Hub):** https://huggingface.co/pratinavseth/cricket-captain-qwen3-06b-stage2
📝 **Mini-blog:** [blog.md](blog.md)
🎬 **Demo video (≤2 min):** https://youtu.be/Cqsq6MaeNUg

**Hackathon theme alignment — Theme #2: (Super) Long-Horizon Planning & Instruction Following.** A T20 match is up to ~240 legal balls of strategic decision-making across both offensive and defensive roles — exactly the regime where current LLM agents struggle: deep multi-step reasoning, sparse terminal reward (win/loss arrives 100+ turns after the early decisions that caused it), recovery from early mistakes, and _two-sided_ opponent modelling. One match = ~180 sequential tool calls, 14 state-conditioned tools, a Markov ball-outcome engine trained on **1.65M cricsheet deliveries**, and a 4-rubric composite reward.

The Hugging Face Space exposes the OpenEnv server and a Gradio demo UI at `/web`.

For the full problem statement, env design, reward architecture, and tool list, see [`docs/benchmark_explainer.md`](docs/benchmark_explainer.md). For the practical training/eval recipe, see [`docs/experiment_workflow.md`](docs/experiment_workflow.md).

---

## Train → Eval framing

Multi-turn agentic RL with **partial-trajectory training and full-task eval generalisation** — the same recipe coding-agent RL papers (SWE-RL, AgentR) use.

|                          | TRAINING (warmup)                                                                                        | TRAINING (main)                                                    | EVALUATION                                          |
| ------------------------ | -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------- |
| max overs                | 2–3 (curriculum)                                                                                          | 5 (end-to-end)                                                     | full T20 (20 overs)                                   |
| steps                    | 30                                                                                                        | 100                                                                | n/a                                                   |
| token budget per rollout | 16k (≈ 80–180 turns)                                                                                      | 24k (≈ 120–240 turns)                                              | unlimited (full match plays out)                      |
| reward signal            | composite (`r_result` + `r_cricket` + `r_behavior` + `r_validity`) — see `server/reward_calculator.py`   | same                                                               | **headline metric: win rate vs heuristic baseline**   |
| script                   | `python train.py train --config configs/cricket_train_qwen3_warmup.yaml`                                  | `python train.py train --config configs/cricket_train_qwen3.yaml`  | `inference.py` (Round 1 OpenEnv benchmark runner)     |

The trained policy generalises to full matches at inference because it learns _good per-state decisions_, not specific trajectory lengths. The whole chain runs via `bash scripts/run_warmup_then_main.sh`.
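For concreteness, below is a minimal sketch of how four rubrics can be combined into the single scalar the table refers to. The weights and type signatures here are illustrative assumptions only; the actual logic lives in `server/reward_calculator.py`, with weights in `configs/game_knowledge.yaml`.

```python
# Illustrative only: a 4-rubric composite reward in the spirit of
# r_result + r_cricket + r_behavior + r_validity.
# Weights below are assumptions, not the repo's real values; see
# server/reward_calculator.py and configs/game_knowledge.yaml for the actual ones.
from dataclasses import dataclass


@dataclass
class RubricScores:
    r_result: float    # sparse terminal win/loss signal
    r_cricket: float   # dense cricket-quality shaping (run rate, wickets in hand, ...)
    r_behavior: float  # plan/action coherence rubric
    r_validity: float  # tool-call format and schema validity


WEIGHTS = {"r_result": 1.0, "r_cricket": 0.3, "r_behavior": 0.3, "r_validity": 0.1}  # assumed


def composite_reward(scores: RubricScores) -> float:
    """Weighted sum of the four rubrics; deliberately not clipped to [-1, 1]
    (see the engineering findings below on why the clip was removed)."""
    return (WEIGHTS["r_result"] * scores.r_result
            + WEIGHTS["r_cricket"] * scores.r_cricket
            + WEIGHTS["r_behavior"] * scores.r_behavior
            + WEIGHTS["r_validity"] * scores.r_validity)
```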
---

## Results

Training stack: Qwen3-4B-Instruct-2507 (Branch A) and Qwen3-0.6B (sanity ablation) + LoRA r=64 + TRL GRPO, vLLM colocate, 1× H200.

### Two formulations evaluated

We evaluated CricketCaptain under two task formulations to isolate where the difficulty lives:

|                                  | **Branch A — per-ball multi-turn** (`main` branch)        | **Branch B — per-over heuristic-reward** (`per-over` branch)             |
| -------------------------------- | ---------------------------------------------------------- | -------------------------------------------------------------------------- |
| Captain decisions                | every ball (~180/match)                                     | once per over (~35/match)                                                   |
| Effective horizon                | full 5-over match                                           | 5× shorter                                                                  |
| Training loop                    | TRL `environment_factory` multi-turn rollouts (live env)    | Stateless GRPO on snapshot prompts + 12-step rollout reward                 |
| **Theme #2 fit (long-horizon)**  | direct match                                                | weaker (smaller horizon, simplified)                                        |
| Step time                        | ~95 s/step (4B)                                             | ~7 s/step                                                                   |
| Run code                         | `train.py` + `server/cricket_environment.py`                | `train.py` (per-over branch) + `server/cricket_environment_per_over.py`     |

Branch A is the headline submission for Theme #2. Branch B is preserved as a comparison: an architecturally simpler formulation that produces cleaner reward curves at the cost of horizon depth.

### Branch A — training curves (Qwen3-4B-Instruct-2507, 99 steps)

W&B run: https://wandb.ai/ptnv-s-research/huggingface/runs/aeo4hzs1 · Raw CSV: [docs/plots/4B_per_ball_main_run.csv](docs/plots/4B_per_ball_main_run.csv)

> **Outlier policy** — per-step GRPO metrics have natural spikes (one all-rollout-aborted step pushes `reward_std` to 1.14 vs median 0.30; one OOM-recovery step pushes `grad_norm` to 0.006 vs median 0.000). Each metric is winsorized to [p2, p98] and overlaid with a 5-step centered rolling mean; the faint per-step trace stays drawn so the variance is honest. 0–4 points are clipped per metric out of 60–100 steps. Regen: [scripts/regen_plots_clean_outliers.py](scripts/regen_plots_clean_outliers.py).

| (A) Per-step reward                | (A) Training loss (GRPO surrogate) |
| ---------------------------------- | ---------------------------------- |
| ![](docs/plots/4B_reward_mean.png) | ![](docs/plots/4B_loss.png)        |

| (A) Gradient magnitude (log)     | (A) Match completion rate              |
| -------------------------------- | -------------------------------------- |
| ![](docs/plots/4B_grad_norm.png) | ![](docs/plots/4B_completion_rate.png) |

Honest read: the gradient is real and `reward/std` shows GRPO has differential signal (group spread 0.3–1.0), but `reward/mean` is flat across 100 steps — the model isn't beating the heuristic on 5-over matches yet within this compute budget. The flat curve is the honest "100 steps isn't enough at this difficulty" finding; the engineering findings (below) document the structural fixes that actually moved the needle.

#### Qualitative sample — best rollout from step 99

A single 47-tool-call match from late training (env_reward `+0.068`, GRPO advantage `+1.353`). Edited for length; `rationale` fields preserved verbatim.
| #   | Tool                   | Args (key fields)                                                                                                        | Outcome                                                                                                                  |
| --- | ---------------------- | -------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| 1   | `call_toss`            | `call=heads, decision=bat`                                                                                               | Toss heads, agent chose to bat                                                                                           |
| 2   | `set_match_plan`       | powerplay=`rotate_strike_and_build_score`, middle=`conservative_play_to_save_wickets`, death=`aggressive_boundary_push` | rationale: _"Balanced field and pace bowler require cautious powerplay to avoid early wickets while building momentum"_ |
| 3   | `set_strategy`         | `phase_intent=rotate_strike_and_build_score, aggression=0.4`                                                             | rationale: _"Powerplay requires balance to avoid early wickets while rotating strike and building a steady score"_      |
| 4   | `play_delivery`        | `shot_intent=boundary, risk=balanced`                                                                                    | **Worked off the hips — a FOUR**                                                                                         |
| 5   | `update_match_plan`    | `reason="Loss of wicket — shift to low risk_budget"`                                                                     | **Turned to leg — sharp catch at short fine leg. OUT!**                                                                  |
| 6   | `set_bowling_strategy` | `bowler_type=pace, line=outside_off, length=good`                                                                        | rationale: _"Maintain consistent stock delivery to exploit line and length against aggressive finisher"_                |
| 7   | `bowl_delivery`        | `line=outside_off, length=good, delivery_type=stock`                                                                     | **Driven through the covers — a SIX!**                                                                                   |

What this rollout shows: a phase-aware match plan, coherent text rationales referencing the _actual_ phase / field / bowler observed, plan adaptation after a wicket loss, 9 of 14 tools used, 47/47 turns valid `` XML. The match was lost (the heuristic chased 29/4) — but the _behaviour_ is strategically coherent, exactly what `r_result` alone can't reward and what the dense `r_behavior` rubric exists to train.

### Branch B — per-over results

The per-over branch trades horizon depth for trainability: each ball collapses to a single deterministic action, only ~35 strategic decisions per match, stateless reward computation. Curves climb faster because the task is structurally simpler. See [github.com/pratinavseth/cricket-captain-llm/tree/per-over](https://github.com/pratinavseth/cricket-captain-llm/tree/per-over) for that branch's training code. The B port also lives on `main` via `CRICKET_ENV_VARIANT=per_over` + [`configs/cricket_train_qwen3_06b_perover.yaml`](configs/cricket_train_qwen3_06b_perover.yaml) for a controlled side-by-side run on the same model.

**Run:** Qwen/Qwen3.5-4B + LoRA (r=16, alpha=32), 70 steps Stage 1 + 70 steps Stage 2, 1× H200. Stage 1 final reward: **0.73**. Stage 2 final reward: **0.57**. `frac_reward_zero_std=0` throughout both stages (healthy GRPO gradient signal).
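The controlled side-by-side on `main` can be launched roughly as follows (a sketch; the `train.py` invocation mirrors the Quickstart commands below, combined with the env variant flag named above):

```bash
# Branch B (per-over) port on the main branch -- controlled comparison on the same 0.6B model.
# Assumes the same `train.py train --config ...` entrypoint used for the Branch A runs.
CRICKET_ENV_VARIANT=per_over \
python train.py train --config configs/cricket_train_qwen3_06b_perover.yaml
```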
Plot generation script: [`scripts/plot_per_over.py`](scripts/plot_per_over.py).

| (B) Training summary — Stage 1 vs Stage 2     |
| --------------------------------------------- |
| ![](docs/plots/per_over/per_over_summary.png) |

| (B) Per-step reward                                | (B) Within-group reward spread                    |
| -------------------------------------------------- | -------------------------------------------------- |
| ![](docs/plots/per_over/per_over_reward_mean.png)  | ![](docs/plots/per_over/per_over_reward_std.png)   |

| (B) Gradient norm                                | (B) Training loss                           |
| ------------------------------------------------ | -------------------------------------------- |
| ![](docs/plots/per_over/per_over_grad_norm.png)  | ![](docs/plots/per_over/per_over_loss.png)   |

### Key engineering findings during Branch A iteration

Three structural bugs that the rich tool-RL setup surfaced — all caught by reading the per-step traces, all fixed.

| Issue surfaced                                                                         | Fix                                                                                                                                                                                                                                                                  | Effect                                                                                             |
| --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| Only ~19% of rollouts reached `done` naturally; the rest aborted mid-match              | Restrict the system prompt to a single `...` XML format. The original prompt advertised "XML or JSON — both accepted", but TRL's response-schema parser only recognises the XML wrapper. Bare JSON in any later turn aborted the rollout.                            | `tools/call_frequency` 9 → 73; rollouts 5–8× longer; matches play out; `r_result` actually fires    |
| GRPO `reward/std` collapsed toward 0 once matches completed                             | Remove the `[-1, 1]` reward clip. With completed matches the dense per-ball signal saturated the ceiling, group std → 0, advantages → 0, gradients → 0. Let GRPO standardise the advantage itself.                                                                    | `reward/std` 0.0 → 1.5; gradient signal restored                                                    |
| Composite reward stayed positive even on dominant losses (chase-to-zero with all-out)   | Add an explicit `outcome_bonus = -1.0` for clear losses (was `0.0`). Reduce the always-positive `progress_bonus` cap from 0.25 → 0.10. Allow the first-innings bowling reward to be signed.                                                                           | Composite spans negative AND positive — the model has a real reason to win                          |

These are the kinds of failure modes long-horizon agentic RL tends to hide. Logging them honestly is more valuable than a polished but uninterrogated curve.

---

## Quickstart

System requirements: Python 3.10+, CUDA 12.x for training. Inference, random baselines, and the Gradio UI work CPU-only. Versions in [pyproject.toml](pyproject.toml) are pinned to the lowest combination that supports TRL multi-turn `environment_factory` AND vLLM colocate AND transformers v5 chat templates (transformers 5.6.2 + trl 1.2.0 + vllm 0.19.1 + torch 2.10.0).

### Install

```bash
git clone https://github.com/pratinavseth/cricket-captain-llm
cd cricket-captain-llm

# Full install (training + eval plots)
uv venv .venv --python 3.10
uv pip install --python .venv/bin/python -e ".[train,eval]"
source .venv/bin/activate

# OR inference-only (no GPU)
uv venv .venv --python 3.10
uv pip install --python .venv/bin/python -e .

# HuggingFace login (only for gated models / Hub uploads)
export HF_TOKEN=hf_...
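
# Optional sanity check (not part of the documented steps; assumes the full
# [train,eval] install): confirm the pinned training stack imports cleanly.
# python -c "import torch, transformers, trl, vllm; print(torch.__version__, trl.__version__)"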
```

### Train (warmup → main)

```bash
bash scripts/run_warmup_then_main.sh
# Logs: /tmp/train_warmup.log → /tmp/train_main.log
# Final adapter: ./checkpoints/stage2_final/
```

Internally:

- **Warmup** (2–3-over curriculum, 30 steps, ~50–60 min on H200): bootstraps the LoRA adapter from base Qwen3-4B-Instruct-2507.
- **Main** (5-over end-to-end, 100 steps, ~5–7 hr): resumes from the warmup adapter and trains on full 5-over matches with `r_result` as the dominant gradient driver. Final adapter at `./checkpoints/stage2_final/`.

Run individually:

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3_warmup.yaml

PYTORCH_ALLOC_CONF=expandable_segments:True \
python train.py train --config configs/cricket_train_qwen3.yaml
```

### Inference / OpenEnv benchmark

```bash
# Start the env server (port 8000)
PYTHONPATH=. python server/app.py --port 8000 --config configs/extras/default.yaml

# Random captain (no API key needed)
export CRICKET_CAPTAIN_ENV_URL="ws://localhost:8000"
python inference.py --model random --episodes 5 --task easy

# Live HF captain via OpenAI-compat router
export HF_TOKEN=hf_...
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct python inference.py --episodes 10 --task medium

# Trained adapter served locally via model_server.py
python model_server.py --checkpoint ./checkpoints/stage2_final --port 8080 &
python inference.py --model local --api-base http://localhost:8080/v1 --episodes 20 --task easy
```

`inference.py` emits Round 1 `[START]` / `[STEP]` / `[END]` stdout markers per the OpenEnv benchmark spec. The three task IDs (`easy`/`medium`/`hard`) map to 5/20/50-over matches via `openenv.yaml`.

### Configuration

YAML is the single source of truth. Pass `--config configs/cricket_train_qwen3.yaml` and override on the CLI as needed. Key files:

- [`configs/cricket_train_qwen3_warmup.yaml`](configs/cricket_train_qwen3_warmup.yaml) — warmup (2–3 over, 30 steps).
- [`configs/cricket_train_qwen3.yaml`](configs/cricket_train_qwen3.yaml) — main (5-over, 100 steps), resumes from warmup.
- [`configs/cricket_train_qwen3_06b.yaml`](configs/cricket_train_qwen3_06b.yaml) — 0.6B 1-hour ablation with the KL anchor on.
- [`configs/cricket_train_qwen3_06b_perover.yaml`](configs/cricket_train_qwen3_06b_perover.yaml) — Branch B port (per-over) on the same 0.6B model.
- [`configs/game_knowledge.yaml`](configs/game_knowledge.yaml) — reward weights and game constants.

GPU sizing, opponent modes, smoke tests, and detailed sub-rubric definitions live in [`docs/experiment_workflow.md`](docs/experiment_workflow.md) and [`docs/benchmark_explainer.md`](docs/benchmark_explainer.md).
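For programmatic use outside `inference.py`, a client-side episode loop could look like the hypothetical sketch below. The class, method, and field names here (`CricketCaptainEnv`, `reset`, `step`, `obs.done`, `obs.reward`, the `CricketAction` fields) are assumptions for illustration only; the real interface is defined by the `EnvClient` subclass in `client.py` and the Pydantic types in `models.py`.

```python
# Hypothetical sketch only -- names below are assumed, not taken from the repo.
# Consult client.py and models.py for the actual client class and schemas.
from client import CricketCaptainEnv   # assumed name of the EnvClient subclass
from models import CricketAction       # Pydantic action type (models.py)

env = CricketCaptainEnv("ws://localhost:8000")   # same URL as CRICKET_CAPTAIN_ENV_URL above
obs = env.reset(task="easy")                     # task IDs map to match length via openenv.yaml

while not obs.done:
    # A real captain policy would be an LLM picking one of the 14 tools based on the
    # observation; here a single batting action is hard-coded to show the loop shape.
    action = CricketAction(tool="play_delivery",
                           args={"shot_intent": "boundary", "risk": "balanced"})
    obs = env.step(action)

print("episode reward:", obs.reward)
```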
---

## Hackathon submission materials

| Material                          | Link                                                                  |
| --------------------------------- | ---------------------------------------------------------------------- |
| **Live HF Space** (env runs here) | https://huggingface.co/spaces/pratinavseth/cricket-captain-llm        |
| **GitHub repo**                   | https://github.com/pratinavseth/cricket-captain-llm                   |
| **W&B project** (training runs)   | https://wandb.ai/ptnv-s-research/huggingface                          |
| **Trained adapter (HF Hub)**      | https://huggingface.co/pratinavseth/cricket-captain-qwen3-06b-stage2  |
| **Mini-blog**                     | [blog.md](blog.md)                                                    |
| **Demo video (≤2 min, YouTube)**  | https://youtu.be/Cqsq6MaeNUg                                          |

### Submission checklist

- [x] OpenEnv (latest release): `openenv-core[core]>=0.2.2` ([pyproject.toml](pyproject.toml))
- [x] Working training script (TRL GRPO): [train.py](train.py)
- [x] Reward + training pipeline coherent: 4-rubric composite (`r_result` / `r_cricket` / `r_behavior` / `r_validity`), documented in `server/reward_calculator.py`
- [x] HF Space pushed and discoverable
- [x] Loss + reward + grad-norm + match-completion plots from real run, with explicit outlier policy
- [x] Mini-blog: [blog.md](blog.md)
- [x] README motivates problem, explains env, links materials
- [x] OpenEnv compliance: `Environment` base class, valid `openenv.yaml`, no reserved tool names
- [x] ≤2 min demo video: https://youtu.be/Cqsq6MaeNUg

---

## Project structure

```
cricket-captain-llm/
├── Dockerfile                         Container image (multi-stage, openenv-base)
├── openenv.yaml                       OpenEnv manifest (runtime: fastapi, port 8000)
├── pyproject.toml                     Pinned deps
├── client.py                          EnvClient subclass — WebSocket client for the env
├── models.py                          Pydantic types (CricketAction / CricketObservation / CricketState)
├── train.py                           GRPO training script (TRL + vLLM colocate + LoRA)
├── eval.py                            Reward-curve / coherence-heatmap visualisation from training logs
├── inference.py                       Round 1 OpenEnv benchmark runner ([START]/[STEP]/[END] markers)
├── model_server.py                    OpenAI-compat server backed by local transformers + PEFT
├── validate-submission.sh             HF Space + Docker + openenv-validate health check
├── configs/                           cricket_train_qwen3{,_warmup,_06b,_06b_perover}.yaml + game_knowledge.yaml
├── data/                              Markov tables, DLS par, format rules, eval packs, player profiles
├── docs/
│   ├── benchmark_explainer.md         Full problem + design + reward walkthrough
│   ├── experiment_workflow.md         Step-by-step training + eval recipe
│   ├── SUBMISSION.md                  Operator runbook for HF Space + endpoint + secrets
│   └── plots/                         A real / B placeholder training-dynamics PNGs + 4B CSV
├── scripts/
│   ├── run_warmup_then_main.sh        Chain warmup → main
│   ├── regen_plots_clean_outliers.py  Re-render A plots from CSV with [p2, p98] winsorize
│   └── make_b_placeholder.py          Re-render B placeholders
└── server/
    ├── app.py                             FastAPI entrypoint (create_app + Gradio UI)
    ├── cricket_environment.py             Env state machine (TOSS → BAT → BOWL → FINISHED)
    ├── cricket_environment_per_over.py    Per-over (Branch B) variant
    ├── markov_engine.py                   Per-ball outcome sampling (cricsheet-trained)
    ├── opponent_policy.py                 heuristic / cricsheet / llm_live / llm_cached
    ├── coherence_grader.py                Plan-action coherence rubric
    ├── reward_calculator.py               4-rubric composite reward
    └── ui.py                              Custom Gradio dashboard for the Space
```

---

## Team

- **Pratinav Seth** - [GitHub](https://github.com/pratinavseth) | [LinkedIn](https://www.linkedin.com/in/pratinav-seth/)
- **Divyansh Kulshreshtha** - [GitHub](https://github.com/divyanshkul) | [LinkedIn](https://www.linkedin.com/in/divyanshkul)
- **Siddhant Bharadwaj** - [GitHub](https://github.com/ignoreandfly) | [LinkedIn](https://www.linkedin.com/in/siddhant0701)

## Citation

```bibtex
@misc{cricketcaptain2025,
  title={CricketCaptain-LLM: Adaptive Strategic Decision-Making for Language Model Agents},
  year={2025},
  note={OpenEnv Hackathon submission}
}

@article{wdct2025,
  title={Large Language Models Often Say One Thing and Do Another},
  year={2025},
  url={https://arxiv.org/abs/2503.07003}
}
```