---
title: CaptainRL — Captain's Cockpit
emoji: 🏏
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
base_path: /web
pinned: false
license: mit
---

# CaptainRL — Captain's Cockpit

An OpenEnv benchmark for long-horizon multi-turn agentic RL: train an LLM to captain a full cricket match.

- 🔗 Live Hugging Face Space: https://huggingface.co/spaces/pratinavseth/cricket-captain-llm
- 📊 Training runs (W&B): https://wandb.ai/ptnv-s-research/huggingface
- 🤖 Trained adapter (HF Hub): https://huggingface.co/pratinavseth/cricket-captain-qwen3-06b-stage2
- 📝 Mini-blog: blog.md
- 🎬 Demo video (≤2 min): https://youtu.be/Cqsq6MaeNUg

Hackathon theme alignment — Theme #2: (Super) Long-Horizon Planning & Instruction Following. A T20 match is up to ~240 legal balls of strategic decision-making across both offensive and defensive roles — exactly the regime where current LLM agents struggle: deep multi-step reasoning, sparse terminal reward (win/loss arrives 100+ turns after the early decisions that caused it), recovery from early mistakes, and two-sided opponent modelling.

One match = ~180 sequential tool calls, 14 state-conditioned tools, a Markov ball-outcome engine trained on 1.65M cricsheet deliveries, and a 4-rubric composite reward. The Hugging Face Space exposes the OpenEnv server and a Gradio demo UI at /web.
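As a mental model, here is a minimal sketch of how a four-rubric composite might be combined. The weights and field semantics below are illustrative assumptions, not the repo's actual values; the real logic lives in server/reward_calculator.py.

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    r_result: float    # sparse terminal win/loss signal
    r_cricket: float   # cricket-quality shaping (dense, per-ball)
    r_behavior: float  # plan-action coherence of the captain's decisions
    r_validity: float  # well-formed tool-call usage

def composite_reward(s: RubricScores) -> float:
    # Hypothetical weighting: terminal result dominates, dense rubrics shape.
    return 0.5 * s.r_result + 0.2 * s.r_cricket + 0.2 * s.r_behavior + 0.1 * s.r_validity

print(composite_reward(RubricScores(1.0, 0.3, 0.5, 1.0)))  # 0.76
```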

For the full problem statement, env design, reward architecture, and tool list, see `docs/benchmark_explainer.md`. For the practical training/eval recipe, see `docs/experiment_workflow.md`.


## Train → Eval framing

Multi-turn agentic RL with partial-trajectory training and full-task eval generalisation — the same recipe coding-agent RL papers (SWE-RL, AgentR) use.

| | TRAINING (warmup) | TRAINING (main) | EVALUATION |
|---|---|---|---|
| Max overs | 2–3 (curriculum) | 5 (end-to-end) | full T20 (20 overs) |
| Steps | 30 | 100 | n/a |
| Token budget per rollout | 16k (≈ 80–180 turns) | 24k (≈ 120–240 turns) | unlimited (full match plays out) |
| Reward signal | composite (r_result + r_cricket + r_behavior + r_validity) — see `server/reward_calculator.py` | same | headline metric: win rate vs heuristic baseline |
| Script | `python train.py train --config configs/cricket_train_qwen3_warmup.yaml` | `python train.py train --config configs/cricket_train_qwen3.yaml` | `inference.py` (Round 1 OpenEnv benchmark runner) |

The trained policy generalises to full matches at inference because it learns good per-state decisions, not specific trajectory lengths. The whole chain runs via `bash scripts/run_warmup_then_main.sh`.


## Results

Training stack: Qwen3-4B-Instruct-2507 (Branch A) and Qwen3-0.6B (sanity ablation) + LoRA r=64 + TRL GRPO, vLLM colocate, 1× H200.

### Two formulations evaluated

We evaluated CricketCaptain under two task formulations to isolate where the difficulty lives:

| | Branch A — per-ball multi-turn (main branch) | Branch B — per-over heuristic-reward (per-over branch) |
|---|---|---|
| Captain decisions | every ball (~180/match) | once per over (~35/match) |
| Effective horizon | full 5-over match | 5× shorter |
| Training loop | TRL `environment_factory` multi-turn rollouts (live env) | stateless GRPO on snapshot prompts + 12-step rollout reward |
| Theme #2 fit (long-horizon) | direct match | weaker (smaller horizon, simplified) |
| Step time | ~95 s/step (4B) | ~7 s/step |
| Run code | `train.py` + `server/cricket_environment.py` | `train.py` (per-over branch) + `server/cricket_environment_per_over.py` |

Branch A is the headline submission for Theme #2. Branch B is preserved as a comparison: an architecturally simpler formulation that produces cleaner reward curves at the cost of horizon depth.
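To make the Branch A loop concrete, here is a minimal sketch of a per-ball rollout against the env. The client interface (the `reset`/`step` names and return shapes) is a hypothetical stand-in for the real EnvClient subclass in client.py.

```python
from typing import Any, Protocol

class CaptainEnv(Protocol):
    # Hypothetical interface; the actual WebSocket client lives in client.py.
    def reset(self) -> dict[str, Any]: ...
    def step(self, tool: str, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]: ...

def play_match(env: CaptainEnv, policy) -> float:
    """One Branch A match: sequential tool calls (~180 for a full T20) until done."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        tool, args = policy(obs)                  # the LLM picks one of the 14 tools
        obs, reward, done = env.step(tool, args)  # env advances one captain decision
        total += reward                           # dense composite accrues per turn
    return total
```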

### Branch A — training curves (Qwen3-4B-Instruct-2507, 99 steps)

W&B run: https://wandb.ai/ptnv-s-research/huggingface/runs/aeo4hzs1 · Raw CSV: `docs/plots/4B_per_ball_main_run.csv`

**Outlier policy** — per-step GRPO metrics have natural spikes (one all-rollouts-aborted step pushes reward_std to 1.14 vs a median of 0.30; one OOM-recovery step pushes grad_norm to 0.006 vs a median of 0.000). Each metric is winsorized to [p2, p98] and overlaid with a 5-step centered rolling mean; the faint per-step trace stays drawn so the variance remains visible. 0–4 points are clipped per metric out of 60–100 steps. Regenerate with `scripts/regen_plots_clean_outliers.py`.
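A minimal pandas/matplotlib sketch of that policy (the CSV column name is an assumption; the canonical script is `scripts/regen_plots_clean_outliers.py`):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("docs/plots/4B_per_ball_main_run.csv")
raw = df["reward_mean"]                          # hypothetical column name
lo, hi = raw.quantile(0.02), raw.quantile(0.98)
wins = raw.clip(lower=lo, upper=hi)              # winsorize to [p2, p98]
smooth = wins.rolling(window=5, center=True).mean()

plt.plot(raw, alpha=0.25, label="per-step (raw)")   # faint trace keeps variance honest
plt.plot(smooth, label="5-step centered rolling mean")
plt.xlabel("step"); plt.legend(); plt.show()
```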

*(Figure: Branch A training dynamics — per-step reward, training loss (GRPO surrogate), gradient magnitude (log scale), and match completion rate.)*

Honest read: the gradient is real and `reward/std` shows GRPO has differential signal (group spread 0.3–1.0), but `reward/mean` is flat across 100 steps — the model isn't beating the heuristic on 5-over matches yet within this compute budget. The flat curve is the honest "100 steps isn't enough at this difficulty" finding; the engineering findings below document the structural fixes that actually moved the needle.

### Qualitative sample — best rollout from step 99

A single 47-tool-call match from late training (env_reward +0.068, GRPO advantage +1.353). Edited for length; rationale fields preserved verbatim.

| # | Tool | Args (key fields) | Outcome |
|---|---|---|---|
| 1 | `call_toss` | call=heads, decision=bat | Toss heads, agent chose to bat |
| 2 | `set_match_plan` | powerplay=rotate_strike_and_build_score, middle=conservative_play_to_save_wickets, death=aggressive_boundary_push | rationale: "Balanced field and pace bowler require cautious powerplay to avoid early wickets while building momentum" |
| 3 | `set_strategy` | phase_intent=rotate_strike_and_build_score, aggression=0.4 | rationale: "Powerplay requires balance to avoid early wickets while rotating strike and building a steady score" |
| 4 | `play_delivery` | shot_intent=boundary, risk=balanced | Worked off the hips — a FOUR |
| 5 | `update_match_plan` | reason="Loss of wicket — shift to low risk_budget" | Turned to leg — sharp catch at short fine leg. OUT! |
| 6 | `set_bowling_strategy` | bowler_type=pace, line=outside_off, length=good | rationale: "Maintain consistent stock delivery to exploit line and length against aggressive finisher" |
| 7 | `bowl_delivery` | line=outside_off, length=good, delivery_type=stock | Driven through the covers — a SIX! |

What this rollout shows: a phase-aware match plan, coherent text rationales referencing the actual phase / field / bowler observed, plan adaptation after a wicket loss, 9 of 14 tools used, and 47/47 turns of valid `<tool_call>` XML. The match was lost (the heuristic chased 29/4) — but the behaviour is strategically coherent, exactly what r_result alone can't reward and what the dense r_behavior rubric exists to train.
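For reference, a minimal sketch of extracting the single accepted `<tool_call>` wrapper format. The payload schema shown is an assumption, and in the actual pipeline the parsing is done by TRL's response-schema parser, not by a helper like this:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(completion: str) -> dict | None:
    """Return the tool-call payload, or None (which would abort the rollout)."""
    m = TOOL_CALL_RE.search(completion)
    if m is None:
        return None            # bare JSON or malformed XML: no match
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return None

print(parse_tool_call('<tool_call>{"name": "play_delivery", '
                      '"arguments": {"shot_intent": "boundary"}}</tool_call>'))
```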

### Branch B — per-over results

The per-over branch trades horizon depth for trainability: each ball collapses to a single deterministic action, there are only ~35 strategic decisions per match, and reward computation is stateless. Curves climb faster because the task is structurally simpler. See github.com/pratinavseth/cricket-captain-llm/tree/per-over for that branch's training code. The Branch B port also lives on main via `CRICKET_ENV_VARIANT=per_over` + `configs/cricket_train_qwen3_06b_perover.yaml` for a controlled side-by-side run on the same model.

Run: Qwen/Qwen3-0.6B + LoRA (r=16, alpha=32), 70 steps Stage 1 + 70 steps Stage 2, 1× H200. Stage 1 final reward: 0.73; Stage 2 final reward: 0.57. `frac_reward_zero_std=0` throughout both stages (healthy GRPO gradient signal). Plot generation script: `scripts/plot_per_over.py`.

*(Figure: Branch B training dynamics — Stage 1 vs Stage 2 summary, per-step reward, within-group reward spread, gradient norm, and training loss.)*

### Key engineering findings during Branch A iteration

Three structural bugs the rich tool-RL setup surfaced — all caught by reading the per-step traces, all fixed.

| Issue surfaced | Fix | Effect |
|---|---|---|
| Only ~19% of rollouts reached `done` naturally; the rest aborted mid-match | Restrict the system prompt to a single `<tool_call>...</tool_call>` XML format. The original prompt advertised "XML or JSON — both accepted", but TRL's response-schema parser only recognises the XML wrapper, so bare JSON in any later turn aborted the rollout. | `tools/call_frequency` 9 → 73; rollouts 5–8× longer; matches play out; r_result actually fires |
| GRPO `reward/std` collapsed toward 0 once matches completed | Remove the [-1, 1] reward clip. With completed matches the dense per-ball signal saturated the ceiling, so group std → 0, advantages → 0, gradients → 0. Let GRPO standardise the advantage itself. | `reward/std` 0.0 → 1.5; gradient signal restored |
| Composite reward stayed positive even on dominant losses (chase-to-zero with all-out) | Add an explicit `outcome_bonus = -1.0` for clear losses (was 0.0); reduce the always-positive `progress_bonus` cap from 0.25 to 0.10; allow the first-innings bowling reward to be signed. | Composite spans negative AND positive — the model has a real reason to win |

These are the kind of failure modes long-horizon agentic RL tends to hide. Logging them honestly is more valuable than a polished but uninterrogated curve.
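The second finding is easy to demonstrate numerically. A toy sketch of GRPO's group-relative advantage (the reward values here are made up for illustration) shows how a saturating clip zeroes the gradient signal:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Group-relative standardisation: each rollout's advantage is its reward's
    # z-score within the generation group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

group = np.array([1.8, 2.3, 1.5, 2.9])             # dense per-ball returns, unclipped
print(grpo_advantages(group))                       # signed, informative advantages
print(grpo_advantages(np.clip(group, -1.0, 1.0)))  # clip saturates the group: all zeros
```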


## Quickstart

System requirements: Python 3.10+, CUDA 12.x for training. Inference / random baselines / Gradio UI work CPU-only. Versions in pyproject.toml are pinned to the lowest combination that supports TRL multi-turn environment_factory AND vLLM colocate AND transformers v5 chat templates (transformers 5.6.2 + trl 1.2.0 + vllm 0.19.1 + torch 2.10.0).

### Install

```bash
git clone <this-repo> cricket-captain-llm
cd cricket-captain-llm

# Full install (training + eval plots)
uv venv .venv --python 3.10
uv pip install --python .venv/bin/python -e ".[train,eval]"
source .venv/bin/activate

# OR inference-only (no GPU)
uv venv .venv --python 3.10
uv pip install --python .venv/bin/python -e .

# HuggingFace login (only for gated models / Hub uploads)
export HF_TOKEN=hf_...
```

### Train (warmup → main)

```bash
bash scripts/run_warmup_then_main.sh
# Logs:           /tmp/train_warmup.log → /tmp/train_main.log
# Final adapter:  ./checkpoints/stage2_final/
```

Internally:

- Warmup (2–3 over curriculum, 30 steps, ~50–60 min on H200): bootstraps the LoRA adapter from base Qwen3-4B-Instruct-2507.
- Main (5-over end-to-end, 100 steps, ~5–7 hr): resumes the warmup adapter and trains on full 5-over matches with r_result as the dominant gradient driver. Final adapter at `./checkpoints/stage2_final/`.

Run individually:

```bash
PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3_warmup.yaml

PYTORCH_ALLOC_CONF=expandable_segments:True \
  python train.py train --config configs/cricket_train_qwen3.yaml
```

### Inference / OpenEnv benchmark

```bash
# Start the env server (port 8000)
PYTHONPATH=. python server/app.py --port 8000 --config configs/extras/default.yaml

# Random captain (no API key needed)
export CRICKET_CAPTAIN_ENV_URL="ws://localhost:8000"
python inference.py --model random --episodes 5 --task easy

# Live HF captain via OpenAI-compat router
export HF_TOKEN=hf_... MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py --episodes 10 --task medium

# Trained adapter served locally via model_server.py
python model_server.py --checkpoint ./checkpoints/stage2_final --port 8080 &
python inference.py --model local --api-base http://localhost:8080/v1 --episodes 20 --task easy
```

`inference.py` emits Round 1 `[START]` / `[STEP]` / `[END]` stdout markers per the OpenEnv benchmark spec. The three task IDs (`easy`/`medium`/`hard`) map to 5/20/50-over matches via `openenv.yaml`.
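A minimal sketch of consuming those markers downstream, e.g. to compute a win rate over a batch of episodes. The exact `[END]` payload format is an assumption; only the marker names come from the spec as described above.

```python
import subprocess

# Assumes the env server from the Quickstart is already running on port 8000.
proc = subprocess.run(
    ["python", "inference.py", "--model", "random", "--episodes", "5", "--task", "easy"],
    capture_output=True, text=True,
)
ends = [ln for ln in proc.stdout.splitlines() if ln.startswith("[END]")]
wins = sum("win" in ln.lower() for ln in ends)  # assumption: [END] lines name the result
print(f"{wins}/{len(ends)} episodes won")
```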

### Configuration

YAML is the single source of truth: pass `--config configs/cricket_train_qwen3.yaml` and override individual keys on the CLI as needed. The key config files live under `configs/` (see the project structure below).
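A minimal sketch of that config pattern (the key names and the `--override` flag are hypothetical, not the repo's actual CLI):

```python
import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True)
parser.add_argument("--override", nargs="*", default=[], metavar="KEY=VALUE")
args = parser.parse_args()

with open(args.config) as f:
    cfg = yaml.safe_load(f)          # YAML file is the single source of truth

for pair in args.override:           # e.g. --override max_overs=5
    key, value = pair.split("=", 1)
    cfg[key] = yaml.safe_load(value)  # reuse YAML parsing to get typed values

print(cfg)
```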

GPU sizing, opponent modes, smoke tests, and detailed sub-rubric definitions live in `docs/experiment_workflow.md` and `docs/benchmark_explainer.md`.


## Hackathon submission materials

### Submission checklist

- OpenEnv (latest release): `openenv-core[core]>=0.2.2` (`pyproject.toml`)
- Working training script (TRL GRPO): `train.py`
- Reward + training pipeline coherent: 4-rubric composite (r_result / r_cricket / r_behavior / r_validity), documented in `server/reward_calculator.py`
- HF Space pushed and discoverable
- Loss + reward + grad-norm + match-completion plots from a real run, with an explicit outlier policy
- Mini-blog: blog.md
- README motivates the problem, explains the env, and links materials
- OpenEnv compliance: `Environment` base class, valid `openenv.yaml`, no reserved tool names
- ≤2 min demo video: https://youtu.be/Cqsq6MaeNUg

### Project structure

```text
cricket-captain-llm/
├── Dockerfile               Container image (multi-stage, openenv-base)
├── openenv.yaml             OpenEnv manifest (runtime: fastapi, port 8000)
├── pyproject.toml           Pinned deps
├── client.py                EnvClient subclass — WebSocket client for the env
├── models.py                Pydantic types (CricketAction / CricketObservation / CricketState)
├── train.py                 GRPO training script (TRL + vLLM colocate + LoRA)
├── eval.py                  Reward-curve / coherence-heatmap visualisation from training logs
├── inference.py             Round 1 OpenEnv benchmark runner ([START]/[STEP]/[END] markers)
├── model_server.py          OpenAI-compat server backed by local transformers + PEFT
├── validate-submission.sh   HF Space + Docker + openenv-validate health check
├── configs/                 cricket_train_qwen3{,_warmup,_06b,_06b_perover}.yaml + game_knowledge.yaml
├── data/                    Markov tables, DLS par, format rules, eval packs, player profiles
├── docs/
│   ├── benchmark_explainer.md    Full problem + design + reward walkthrough
│   ├── experiment_workflow.md    Step-by-step training + eval recipe
│   ├── SUBMISSION.md             Operator runbook for HF Space + endpoint + secrets
│   └── plots/                    A real / B placeholder training-dynamics PNGs + 4B CSV
├── scripts/
│   ├── run_warmup_then_main.sh         Chain warmup → main
│   ├── regen_plots_clean_outliers.py   Re-render A plots from CSV with [p2, p98] winsorize
│   └── make_b_placeholder.py           Re-render B placeholders
└── server/
    ├── app.py                    FastAPI entrypoint (create_app + Gradio UI)
    ├── cricket_environment.py    Env state machine (TOSS → BAT → BOWL → FINISHED)
    ├── cricket_environment_per_over.py   Per-over (Branch B) variant
    ├── markov_engine.py          Per-ball outcome sampling (cricsheet-trained)
    ├── opponent_policy.py        heuristic / cricsheet / llm_live / llm_cached
    ├── coherence_grader.py       Plan-action coherence rubric
    ├── reward_calculator.py      4-rubric composite reward
    └── ui.py                     Custom Gradio dashboard for the Space
```

## Team

## Citation

```bibtex
@misc{cricketcaptain2025,
  title={CricketCaptain-LLM: Adaptive Strategic Decision-Making for Language Model Agents},
  year={2025},
  note={OpenEnv Hackathon submission}
}

@article{wdct2025,
  title={Large Language Models Often Say One Thing and Do Another},
  year={2025},
  url={https://arxiv.org/abs/2503.07003}
}
```