# Submission Operations
Everything an operator (you, a teammate, a judge) needs to reproduce or
re-deploy this submission. Source of truth for HF Space secrets, the
trained-model deployment recipe, and the runtime topology.
## Topology
```
             ┌─────────────────────────┐
judge ───►   │ HF Space                │
browser      │ pratinavseth/           │
             │ cricket-captain-llm     │
             │ (Docker, cpu-basic)     │
             └────────┬────────────────┘
                      │ outbound HTTPS to opponents/captain
     ┌────────────────┼──────────────────────┐
     ▼                ▼                      ▼
┌──────────┐   ┌──────────────┐   ┌──────────────────────┐
│ HF Router│   │ HF Inference │   │ Self-hosted          │
│ (free)   │   │ Endpoint     │   │ model_server.py      │
│          │   │ (paused)     │   │ on H200 + ngrok      │
│ Gemma 4  │   │              │   │                      │
└──────────┘   └──────────────┘   └──────────────────────┘
                      └─► trained adapter:
                          pratinavseth/
                          cricket-captain-warmup-stage2
```
The Space's auto-play in the **Custom** tab calls one of these endpoints
based on Space secrets. The code path on the Space is identical in every case;
only the endpoint URL and model id secrets change.
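The exact wiring lives in `ui.py`; the sketch below is illustrative only (the
helper name, the prompt, and the use of the standard `openai` client are
assumptions), but it shows the idea: the captain client is built entirely from
the `CRICKET_CAPTAIN_*` secrets, so retargeting it is a secrets-plus-restart
change, never a code change.

```python
# Illustrative sketch, not the actual ui.py code: build the captain's
# OpenAI-compatible client purely from Space secrets.
import os
from openai import OpenAI

def captain_client() -> tuple[OpenAI, str]:
    """Return a client plus the model id to pass on each request."""
    client = OpenAI(
        base_url=os.environ["CRICKET_CAPTAIN_API_BASE"],  # e.g. https://router.huggingface.co/v1
        api_key=os.environ["CRICKET_CAPTAIN_API_KEY"],
    )
    return client, os.environ["CRICKET_CAPTAIN_MODEL"]

client, model = captain_client()
resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Over 14, 7 wickets in hand: who bowls next?"}],
)
print(resp.choices[0].message.content)
```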
## Required HF Space secrets
Set at https://huggingface.co/spaces/pratinavseth/cricket-captain-llm/settings
(Variables and secrets → New secret). All of these are currently set programmatically.
| Name | Current value | Purpose |
|---|---|---|
| `HF_TOKEN` | `hf_*` (rotation recommended) | Auth for HF Router calls and for the self-hosted endpoint when proxied through ngrok |
| `API_KEY` | same as `HF_TOKEN` | Round 1 alias |
| `CRICKET_CAPTAIN_MODEL` | `google/gemma-4-26B-A4B-it` | Captain auto-play uses this (will flip to trained adapter id after deployment) |
| `CRICKET_CAPTAIN_API_BASE` | `https://router.huggingface.co/v1` | OpenAI-compatible base URL |
| `CRICKET_CAPTAIN_API_KEY` | same as `HF_TOKEN` | |
| `CRICKET_OPPONENT_MODE` | `llm_live` | Forces env to use live LLM opponent at reset |
| `CRICKET_OPPONENT_MODEL` | `google/gemma-4-26B-A4B-it` | Opponent (currently same as captain — Gemma vs Gemma demo) |
| `CRICKET_OPPONENT_API_BASE` | `https://router.huggingface.co/v1` | |
| `CRICKET_OPPONENT_API_KEY` | same as `HF_TOKEN` | |
| `MODEL_NAME` | `google/gemma-4-26B-A4B-it` | Round 1 spec env var |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | Round 1 spec env var |
After updating any secret: trigger a Space restart (HfApi `restart_space()`
or Settings → "Restart this Space").
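A minimal sketch of that programmatic route, using `huggingface_hub` (the
token is assumed to be write-scoped; the values shown are the ones from the
table above):

```python
# Update a Space secret and restart the Space so it re-reads its env.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])  # write-scoped token
space = "pratinavseth/cricket-captain-llm"

api.add_space_secret(space, "CRICKET_CAPTAIN_API_BASE", "https://router.huggingface.co/v1")
api.add_space_secret(space, "CRICKET_CAPTAIN_MODEL", "google/gemma-4-26B-A4B-it")
api.restart_space(space)  # secrets are only picked up on restart
```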
## Trained-captain deployment recipe (post-main-training)
Goal: replace the Gemma-vs-Gemma demo with **trained captain (your adapter)
vs Gemma 4 (HF Router)**. Two live-deployment paths (A and B) follow, plus a
documentation-only fallback (C).
### Path A — Self-hosted model_server.py + ngrok (free)
```bash
# 1. After main run completes, push fresh adapter to Hub
HF_TOKEN=... python -c "
from huggingface_hub import HfApi
# HfApi() picks the token up from the HF_TOKEN env var set above
HfApi().upload_folder(
    folder_path='./checkpoints/stage2_final',
    repo_id='pratinavseth/cricket-captain-warmup-stage2',
    repo_type='model',
)
"
# 2. Run the OpenAI-compatible adapter server on the GPU box
.venv-qwen3/bin/python model_server.py \
--checkpoint ./checkpoints/stage2_final \
--port 8080
# 3. Tunnel to a public URL
ngrok http 8080 # → https://<random>.ngrok.io
# 4. Update Space secrets:
# CRICKET_CAPTAIN_MODEL = local
# CRICKET_CAPTAIN_API_BASE = https://<random>.ngrok.io/v1
# Restart the Space.
```
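Before flipping the Space secrets (step 4), it is worth smoke-testing the
tunnel from any machine. A sketch, assuming `model_server.py` exposes the
standard `/v1/chat/completions` route, registers the model id `local`, and
accepts the same `HF_TOKEN` used elsewhere (replace the ngrok hostname with
the real one):

```python
# Smoke test of the tunnelled adapter server before pointing the Space at it.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://<random>.ngrok.io/v1",        # the URL printed by ngrok
    api_key=os.environ.get("HF_TOKEN", "unused"),   # only needed if the server enforces auth
)
resp = client.chat.completions.create(
    model="local",                                  # matches CRICKET_CAPTAIN_MODEL above
    messages=[{"role": "user", "content": "Death overs, 9 an over needed: set the field."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```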
Trade-offs: free and serves the real trained model, but it requires the H200
to stay up and tunneled while judges click around.
### Path B — HF Inference Endpoint with custom container (paid)
The default HF Inference text-generation pipeline container ships with an
older transformers that hits a known bug on Qwen3-4B-Instruct-2507's
tokenizer (`AttributeError: 'list' object has no attribute 'keys'` in
`_set_model_specific_special_tokens`). Workarounds:
1. Build a custom container based on `ghcr.io/huggingface/text-generation-inference:latest`
   (TGI v3+, supports LoRA via `--lora-adapters`), or use a vLLM-based custom container.
2. Re-deploy with `framework="custom"` and `custom_image={"url": ...}` (see the sketch below).
Cost: nvidia-l4 ≈ $0.80/hr while running. **Pause the endpoint when not
demoing** to stop billing — preserves the model so resume is fast.
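For reference, a hedged sketch of the `framework="custom"` re-deploy via
`huggingface_hub.create_inference_endpoint`. The vendor/region, instance size,
base repository, and TGI environment variables are untested assumptions to
adapt, not a verified recipe:

```python
# Sketch only: recreate the endpoint with a custom TGI image, then pause it
# between demos to stop billing. Adapt vendor/region/env before running.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])

api.create_inference_endpoint(
    "cricket-captain-v1",
    repository="Qwen/Qwen3-4B-Instruct-2507",   # assumed base model; adapter loaded via TGI LoRA
    framework="custom",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                                # assumption
    region="us-east-1",                          # assumption
    instance_type="nvidia-l4",
    instance_size="x1",                          # assumption
    custom_image={
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "health_route": "/health",
        "env": {
            "MODEL_ID": "/repository",
            "LORA_ADAPTERS": "pratinavseth/cricket-captain-warmup-stage2",  # TGI LoRA flag as env var
        },
    },
)

api.pause_inference_endpoint("cricket-captain-v1")   # resume_inference_endpoint() when demoing
```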
### Path C — Skip trained-captain demo, document via plots
Use Gemma vs Gemma as the Space demo. Keep the *trained-vs-baseline*
evidence in `compare_eval.py` numbers + W&B plots embedded in the README.
Judges see: "Cricket env runs with two LLMs in real time" + "training plots
prove the agent learned." Less impressive than a live trained-captain demo
but submission-complete.
## Current status (mid-deploy)
| Component | State |
|---|---|
| GitHub repo `origin/main` | in sync, latest pushed |
| `pyproject.toml` + `uv.lock` | in sync; `uv sync --extra train` reproduces exactly |
| HF model `pratinavseth/cricket-captain-warmup-stage2` | warmup-v7 adapter uploaded; tokenizer_config patched for HF Inference compat |
| HF Inference Endpoint `cricket-captain-v1` | `paused` — fails on default container, requires custom container or pivot to Path A |
| HF Space `pratinavseth/cricket-captain-llm` | `RUNNING`, secrets set, configured for Gemma-vs-Gemma until trained-captain wired |
| HF Space visibility | **private** — flip to public before submission |
| Main training run | in progress (~step 47 of 100) |
## Background screens (for session durability)
`screen -ls` shows:
| Screen | Purpose |
|---|---|
| `cc-keepalive` | 60-sec heartbeat (Lightning instance idle protection) |
| `cc-monitor` | `tail -F /tmp/qwen3_main.log` (live training output) |
| `cc-endpoint` | polls inference endpoint status every 5 min |
Attach: `screen -r cc-monitor`. Detach: `Ctrl-A` then `D`.
⚠️ The training process *itself* is NOT in screen — it's parented to the
Bash session that started it. If that session dies, training dies with it
(SIGHUP). Acceptable for the current run because we're 50% in; on the next
run, launch via `screen -dmS cc-train .venv-qwen3/bin/python train.py train ...`
to make it survive any disconnect.
## Submission checklist mapped to hackathon criteria
### Round 2 — minimum requirements
- [x] OpenEnv (latest): `openenv-core[core]>=0.2.2`
- [x] Working training script (TRL GRPO): `train.py`
- [x] HF Space deployed: `pratinavseth/cricket-captain-llm`
- [x] README motivates problem + explains env + links materials
- [x] README links to HF Space + W&B + (placeholder for blog/video)
- [ ] Loss + reward plots embedded in README (waiting on main run)
- [ ] Mini-blog or ≤2 min video (writing after results land)
- [ ] HF Space made public
### Round 1 — additional spec
- [x] 3 graded tasks (`easy` / `medium` / `hard` in `openenv.yaml`)
- [x] Score in [0, 1] per task (win=1.0 / tie=0.5 / loss=0.0)
- [x] `[START]` / `[STEP]` / `[END]` STDOUT markers in `inference.py` (format sketched below)
- [x] `API_BASE_URL` / `MODEL_NAME` / `HF_TOKEN` env-var contract
- [x] Inference runs on vCPU=2 / 8 GB RAM (HF Router default; no local model load)
- [x] Pydantic typed Action / Observation / State
- [x] `validate-submission.sh` exists
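For context on the marker and score items above, an illustrative sketch. Only
the `[START]` / `[STEP]` / `[END]` tags and the [0, 1] score convention come
from the spec; the payload fields here are made up, and the real
implementation lives in `inference.py`:

```python
# Illustrative only: shows the STDOUT contract, not the real inference.py.
import json

def emit(tag: str, payload: dict) -> None:
    # One marker per line so a harness can parse STDOUT reliably.
    print(f"[{tag}] {json.dumps(payload)}", flush=True)

emit("START", {"task": "medium"})
for over in range(1, 4):                       # placeholder loop over decisions
    emit("STEP", {"over": over, "decision": "example"})
emit("END", {"score": 0.5})                    # win=1.0 / tie=0.5 / loss=0.0
```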
### Judging weight (Round 2)
| Criterion | Weight | Status |
|---|---|---|
| Environment Innovation | 40% | strong: an uncommon domain with a two-sided, real-data Markov engine |
| Storytelling | 30% | README + Theme #2 alignment good; pending the mini-blog |
| Showing Improvement in Rewards | 20% | warmup quartile data already shows monotonic improvement; full plots after main |
| Reward & Pipeline | 10% | 4-rubric composite, documented signal flow, no obvious gaming path |
## Pending tasks (post-training)
1. Run `compare_eval.py` baseline vs trained → head-to-head numbers.
2. Export W&B panels as PNG → `docs/plots/` → embed in README with captions.
3. Replace README "Results" placeholders with real numbers.
4. Write the mini-blog (600–800 words HF blog post or ≤2 min YouTube).
5. Re-upload final main-run adapter to HF Hub.
6. Wire trained captain into Space (Path A above).
7. Flip Space visibility to public.
8. Run `validate-submission.sh` end-to-end.