Submission Operations

Everything an operator (you, a teammate, a judge) needs to reproduce or re-deploy this submission. This document is the source of truth for the HF Space secrets, the trained-model deployment recipe, and the runtime topology.

Topology

                ┌─────────────────────────┐
   judge ───►   │  HF Space               │
   browser      │  pratinavseth/          │
                │  cricket-captain-llm    │
                │  (Docker, cpu-basic)    │
                └────────┬────────────────┘
                         │ outbound HTTPS to opponents/captain
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
  ┌──────────┐   ┌──────────────┐  ┌──────────────────────┐
  │ HF Router│   │ HF Inference │  │ Self-hosted          │
  │ (free)   │   │ Endpoint     │  │ model_server.py      │
  │          │   │ (paused)     │  │ on H200 + ngrok      │
  │ Gemma 4  │   │              │  │                      │
  └──────────┘   └──────────────┘  └──────────────────────┘
                                    │
                                    └─► trained adapter:
                                       pratinavseth/
                                       cricket-captain-warmup-stage2

The Space's auto-play in the Custom tab calls one of these endpoints based on the Space secrets below. The same code path runs on the Space in every case; only the endpoint configuration changes.
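For illustration only, a minimal sketch of the OpenAI-compatible client an operator could build from those secrets (the real wiring lives in ui.py; the helper name and prompt here are assumptions):

```python
import os
from openai import OpenAI

def captain_client() -> OpenAI:
    # Whichever backend is active (HF Router, Inference Endpoint, or the
    # self-hosted model_server.py), the caller only sees an OpenAI-compatible
    # base URL, API key, and model id taken from the Space secrets.
    return OpenAI(
        base_url=os.environ["CRICKET_CAPTAIN_API_BASE"],
        api_key=os.environ["CRICKET_CAPTAIN_API_KEY"],
    )

resp = captain_client().chat.completions.create(
    model=os.environ["CRICKET_CAPTAIN_MODEL"],
    messages=[{"role": "user", "content": "Set a field for over 12."}],
)
print(resp.choices[0].message.content)
```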

Required HF Space secrets

Set at https://huggingface.co/spaces/pratinavseth/cricket-captain-llm/settings (Variables and secrets → New secret). All currently set programmatically.

| Name | Current value | Purpose |
| --- | --- | --- |
| HF_TOKEN | hf_* (rotating recommended) | Auth for HF Router calls and self-served endpoint when proxied through ngrok |
| API_KEY | same as HF_TOKEN | Round 1 alias |
| CRICKET_CAPTAIN_MODEL | google/gemma-4-26B-A4B-it | Captain auto-play uses this (will flip to trained adapter id after deployment) |
| CRICKET_CAPTAIN_API_BASE | https://router.huggingface.co/v1 | OpenAI-compatible base URL |
| CRICKET_CAPTAIN_API_KEY | same as HF_TOKEN | |
| CRICKET_OPPONENT_MODE | llm_live | Forces env to use live LLM opponent at reset |
| CRICKET_OPPONENT_MODEL | google/gemma-4-26B-A4B-it | Opponent (currently same as captain — Gemma vs Gemma demo) |
| CRICKET_OPPONENT_API_BASE | https://router.huggingface.co/v1 | |
| CRICKET_OPPONENT_API_KEY | same as HF_TOKEN | |
| MODEL_NAME | google/gemma-4-26B-A4B-it | Round 1 spec env var |
| API_BASE_URL | https://router.huggingface.co/v1 | Round 1 spec env var |

After updating any secret: trigger a Space restart (HfApi restart_space() or Settings → "Restart this Space").
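For programmatic updates, a minimal sketch using huggingface_hub (the secret value shown and the token are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # a token with write access to the Space
space_id = "pratinavseth/cricket-captain-llm"

# Add or overwrite a secret, then restart so the running Space picks it up.
api.add_space_secret(space_id, key="CRICKET_CAPTAIN_API_BASE",
                     value="https://router.huggingface.co/v1")
api.restart_space(space_id)
```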

Trained-captain deployment recipe (post-main-training)

Goal: replace the Gemma-vs-Gemma demo with the trained captain (your adapter) vs Gemma 4 (HF Router). Two deployment paths are viable, plus a documentation-only fallback (Path C).

Path A — Self-hosted model_server.py + ngrok (free)

# 1. After the main run completes, push the fresh adapter to the Hub
HF_TOKEN=... python -c "
from huggingface_hub import HfApi
HfApi().upload_folder(  # token is read from the HF_TOKEN env var
    folder_path='./checkpoints/stage2_final',
    repo_id='pratinavseth/cricket-captain-warmup-stage2',
    repo_type='model',
)
"

# 2. Run the OpenAI-compatible adapter server on the GPU box
.venv-qwen3/bin/python model_server.py \
    --checkpoint ./checkpoints/stage2_final \
    --port 8080

# 3. Tunnel to a public URL
ngrok http 8080   # → https://<random>.ngrok.io

# 4. Update Space secrets:
#      CRICKET_CAPTAIN_MODEL     = local
#      CRICKET_CAPTAIN_API_BASE  = https://<random>.ngrok.io/v1
#    Restart the Space.

Trade-offs: free, real trained model, requires the H200 to stay up + tunneled while judges click around.
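Before flipping the Space secrets in step 4, it is worth confirming the tunnel answers OpenAI-style chat requests. A minimal sketch (the ngrok URL is a placeholder, and whether model_server.py enforces auth is an assumption):

```python
import os
from openai import OpenAI

# Point an OpenAI-compatible client straight at the ngrok tunnel.
client = OpenAI(
    base_url="https://<random>.ngrok.io/v1",
    api_key=os.environ.get("HF_TOKEN", "unused"),  # only matters if the server checks auth
)

resp = client.chat.completions.create(
    model="local",  # the model id the Space will be configured with
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)  # any sane reply means the tunnel is live
```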

Path B — HF Inference Endpoint with custom container (paid)

The default HF Inference text-generation pipeline container ships with an older transformers that hits a known bug on Qwen3-4B-Instruct-2507's tokenizer (AttributeError: 'list' object has no attribute 'keys' in _set_model_specific_special_tokens). Workarounds:

  1. Build a custom container based on ghcr.io/huggingface/text-generation-inference:latest (TGI v3+, supports LoRA via --lora-adapters).
  2. Or use vLLM-based custom container.
  3. Re-deploy with framework="custom" and custom_image={"url": ...}.

Cost: nvidia-l4 ≈ $0.80/hr while running. Pause the endpoint when not demoing to stop billing — preserves the model so resume is fast.
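Combining workarounds 1 and 3 above, a minimal sketch of the re-deploy via huggingface_hub (endpoint name, vendor/region/instance details, image tag, and the LoRA env wiring are assumptions to adapt, not the verified configuration):

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "cricket-captain-v2",
    repository="Qwen/Qwen3-4B-Instruct-2507",   # base model; adapter loaded via env below
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-l4",
    custom_image={
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "health_route": "/health",
        "env": {
            "MODEL_ID": "/repository",
            "LORA_ADAPTERS": "pratinavseth/cricket-captain-warmup-stage2",
        },
    },
)
endpoint.wait()  # block until the endpoint reports running
```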

Path C — Skip trained-captain demo, document via plots

Use Gemma vs Gemma as the Space demo. Keep the trained-vs-baseline evidence in compare_eval.py numbers + W&B plots embedded in the README. Judges see: "Cricket env runs with two LLMs in real time" + "training plots prove the agent learned." Less impressive than a live trained-captain demo but submission-complete.

Current status (mid-deploy)

| Component | State |
| --- | --- |
| GitHub repo | origin/main in sync, latest pushed |
| pyproject.toml + uv.lock | in sync; uv sync --extra train reproduces exactly |
| HF model pratinavseth/cricket-captain-warmup-stage2 | warmup-v7 adapter uploaded; tokenizer_config patched for HF Inference compat |
| HF Inference Endpoint cricket-captain-v1 | paused — fails on default container, requires custom container or pivot to Path A |
| HF Space pratinavseth/cricket-captain-llm | RUNNING, secrets set, configured for Gemma-vs-Gemma until trained-captain wired |
| HF Space visibility | private — flip to public before submission |
| Main training run | in progress (~step 47 of 100) |

Background screens (for session durability)

screen -ls shows:

| Screen | Purpose |
| --- | --- |
| cc-keepalive | 60-sec heartbeat (Lightning instance idle protection) |
| cc-monitor | tail -F /tmp/qwen3_main.log (live training output) |
| cc-endpoint | polls inference endpoint status every 5 min |

Attach: screen -r cc-monitor. Detach: Ctrl-A then D.

⚠️ The training process itself is NOT in screen — it's parented to the Bash session that started it. If that session dies, training dies with it (SIGHUP). Acceptable for the current run because we're 50% in; on the next run, launch via screen -dmS cc-train .venv-qwen3/bin/python train.py train ... to make it survive any disconnect.

Submission checklist mapped to hackathon criteria

Round 2 — minimum requirements

  • OpenEnv (latest): openenv-core[core]>=0.2.2
  • Working training script (TRL GRPO): train.py
  • HF Space deployed: pratinavseth/cricket-captain-llm
  • README motivates problem + explains env + links materials
  • README links to HF Space + W&B + (placeholder for blog/video)
  • Loss + reward plots embedded in README (waiting on main run)
  • Mini-blog or ≤2 min video (writing after results land)
  • HF Space made public

Round 1 — additional spec

  • 3 graded tasks (easy / medium / hard in openenv.yaml)
  • Score in [0, 1] per task (win=1.0 / tie=0.5 / loss=0.0)
  • [START] / [STEP] / [END] STDOUT markers in inference.py (marker and score format sketched after this list)
  • API_BASE_URL / MODEL_NAME / HF_TOKEN env-var contract
  • Inference runs on vCPU=2 / 8 GB RAM (HF Router default; no local model load)
  • Pydantic typed Action / Observation / State
  • validate-submission.sh exists
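As a minimal sketch of the marker and scoring contract (the exact marker fields and the runner shape are assumptions; the real implementation is inference.py):

```python
SCORE = {"win": 1.0, "tie": 0.5, "loss": 0.0}  # required [0, 1] mapping per graded task

def emit_markers(task_id: str, result: str, n_steps: int) -> float:
    # One [START] and one [END] per graded task, a [STEP] per env step.
    print(f"[START] task={task_id}", flush=True)
    for step in range(n_steps):
        print(f"[STEP] task={task_id} step={step}", flush=True)
    score = SCORE[result]
    print(f"[END] task={task_id} score={score}", flush=True)
    return score

if __name__ == "__main__":
    emit_markers("easy", result="win", n_steps=3)
```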

Judging weight (Round 2)

| Criterion | Weight | Status |
| --- | --- | --- |
| Environment Innovation | 40% | strong — cricket captaincy is uncommon, two-sided, real-data Markov engine |
| Storytelling | 30% | README + Theme #2 alignment good; pending the mini-blog |
| Showing Improvement in Rewards | 20% | warmup quartile data already shows monotonic improvement; full plots after main |
| Reward & Pipeline | 10% | 4-rubric composite, documented signal flow, no obvious gaming path |

Pending tasks (post-training)

  1. Run compare_eval.py baseline vs trained → head-to-head numbers.
  2. Export W&B panels as PNG → docs/plots/ → embed in README with captions.
  3. Replace README "Results" placeholders with real numbers.
  4. Write the mini-blog (600–800 words HF blog post or ≤2 min YouTube).
  5. Re-upload final main-run adapter to HF Hub.
  6. Wire trained captain into Space (Path A above).
  7. Flip Space visibility to public.
  8. Run validate-submission.sh end-to-end.