## Submission Operations
Everything an operator (you, a teammate, a judge) needs to reproduce or re-deploy this submission. Source of truth for HF Space secrets, the trained-model deployment recipe, and the runtime topology.
### Topology
```
              ┌─────────────────────────┐
 judge   ───► │        HF Space         │
 browser      │      pratinavseth/      │
              │   cricket-captain-llm   │
              │   (Docker, cpu-basic)   │
              └────────────┬────────────┘
                           │ outbound HTTPS to opponents/captain
                           │
          ┌────────────────┼─────────────────────┐
          ▼                ▼                      ▼
    ┌──────────┐   ┌──────────────┐   ┌──────────────────────┐
    │ HF Router│   │ HF Inference │   │ Self-hosted          │
    │ (free)   │   │ Endpoint     │   │ model_server.py      │
    │          │   │ (paused)     │   │ on H200 + ngrok      │
    │ Gemma 4  │   │              │   │                      │
    └──────────┘   └──────────────┘   └──────────────────────┘
                           │
                           └─► trained adapter:
                               pratinavseth/
                               cricket-captain-warmup-stage2
```
The Space's auto-play in the Custom tab calls one of these endpoints based on Space secrets. Same code path on the Space — only the URLs change.
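The wiring is small enough to show. A minimal sketch, assuming the Space builds an OpenAI-compatible client from the secrets in the table below (illustrative, not the repo's exact code):

```python
# Illustrative sketch: the captain client is fully determined by three
# Space secrets, so swapping endpoints never touches the Space's code,
# only its configuration.
import os

from openai import OpenAI

captain = OpenAI(
    base_url=os.environ["CRICKET_CAPTAIN_API_BASE"],  # HF Router or ngrok URL
    api_key=os.environ["CRICKET_CAPTAIN_API_KEY"],
)
captain_model = os.environ["CRICKET_CAPTAIN_MODEL"]

reply = captain.chat.completions.create(
    model=captain_model,
    messages=[{"role": "user", "content": "Pick a bowler for over 18."}],
    max_tokens=32,
)
print(reply.choices[0].message.content)
```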
### Required HF Space secrets
Set at https://huggingface.co/spaces/pratinavseth/cricket-captain-llm/settings (Variables and secrets → New secret). All currently set programmatically.
| Name | Current value | Purpose |
|---|---|---|
| `HF_TOKEN` | `hf_*` (rotation recommended) | Auth for HF Router calls and the self-hosted endpoint when proxied through ngrok |
| `API_KEY` | same as `HF_TOKEN` | Round 1 alias |
| `CRICKET_CAPTAIN_MODEL` | `google/gemma-4-26B-A4B-it` | Captain auto-play uses this (will flip to the trained adapter id after deployment) |
| `CRICKET_CAPTAIN_API_BASE` | `https://router.huggingface.co/v1` | OpenAI-compatible base URL |
| `CRICKET_CAPTAIN_API_KEY` | same as `HF_TOKEN` | |
| `CRICKET_OPPONENT_MODE` | `llm_live` | Forces the env to use a live LLM opponent at reset |
| `CRICKET_OPPONENT_MODEL` | `google/gemma-4-26B-A4B-it` | Opponent (currently same as captain — Gemma-vs-Gemma demo) |
| `CRICKET_OPPONENT_API_BASE` | `https://router.huggingface.co/v1` | |
| `CRICKET_OPPONENT_API_KEY` | same as `HF_TOKEN` | |
| `MODEL_NAME` | `google/gemma-4-26B-A4B-it` | Round 1 spec env var |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | Round 1 spec env var |
After updating any secret, trigger a Space restart (`HfApi.restart_space()` or Settings → "Restart this Space").
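Both steps can be scripted. A minimal sketch with `huggingface_hub` (the secret name and value here are just an example):

```python
# Update a Space secret and restart the Space programmatically.
# Both helpers are standard huggingface_hub HfApi methods.
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # token with write access to the Space

api.add_space_secret(
    repo_id="pratinavseth/cricket-captain-llm",
    key="CRICKET_CAPTAIN_API_BASE",
    value="https://router.huggingface.co/v1",
)
api.restart_space(repo_id="pratinavseth/cricket-captain-llm")
```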
### Trained-captain deployment recipe (post-main-training)

Goal: replace the Gemma-vs-Gemma demo with the trained captain (your adapter) vs Gemma 4 (HF Router). Three paths follow: A and B serve the trained model live; C is the documented fallback.
#### Path A — Self-hosted `model_server.py` + ngrok (free)
```bash
# 1. After the main run completes, push the fresh adapter to the Hub
HF_TOKEN=... python -c "
from huggingface_hub import HfApi
HfApi(token='...').upload_folder(
    folder_path='./checkpoints/stage2_final',
    repo_id='pratinavseth/cricket-captain-warmup-stage2',
    repo_type='model',
)
"

# 2. Run the OpenAI-compatible adapter server on the GPU box
.venv-qwen3/bin/python model_server.py \
    --checkpoint ./checkpoints/stage2_final \
    --port 8080

# 3. Tunnel to a public URL
ngrok http 8080   # → https://<random>.ngrok.io

# 4. Update Space secrets:
#    CRICKET_CAPTAIN_MODEL    = local
#    CRICKET_CAPTAIN_API_BASE = https://<random>.ngrok.io/v1
#    Restart the Space.
```
Trade-offs: free and serves the real trained model, but the H200 must stay up and tunneled while judges click around.
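Before flipping the Space secrets, it's worth smoke-testing the tunnel directly. A minimal check, assuming `model_server.py` exposes the standard OpenAI-compatible chat route (the ngrok URL is a placeholder):

```python
# Verify the tunneled server answers OpenAI-style chat requests before
# pointing the Space at it. Replace the URL with the real ngrok address.
import requests

resp = requests.post(
    "https://<random>.ngrok.io/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "New batter in. Field setting?"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```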
#### Path B — HF Inference Endpoint with custom container (paid)
The default HF Inference text-generation pipeline container ships with an older transformers that hits a known bug on Qwen3-4B-Instruct-2507's tokenizer (`AttributeError: 'list' object has no attribute 'keys'` in `_set_model_specific_special_tokens`). Workarounds:

- Build a custom container based on `ghcr.io/huggingface/text-generation-inference:latest` (TGI v3+, supports LoRA via `--lora-adapters`).
- Or use a vLLM-based custom container.
- Re-deploy with `framework="custom"` and `custom_image={"url": ...}` (sketched below).
Cost: nvidia-l4 ≈ $0.80/hr while running. Pause the endpoint when not demoing to stop billing — preserves the model so resume is fast.
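Pausing and resuming can be scripted too; both are standard `HfApi` methods:

```python
# Stop billing between demos without destroying the endpoint.
from huggingface_hub import HfApi

api = HfApi(token="hf_...")
api.pause_inference_endpoint("cricket-captain-v1")   # billing stops, config kept
# ... later, before a judging session:
api.resume_inference_endpoint("cricket-captain-v1")  # fast restart from pause
```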
#### Path C — Skip trained-captain demo, document via plots

Use Gemma vs Gemma as the Space demo. Keep the trained-vs-baseline evidence in the `compare_eval.py` numbers plus the W&B plots embedded in the README. Judges see: "cricket env runs with two LLMs in real time" and "training plots prove the agent learned." Less impressive than a live trained-captain demo, but submission-complete.
### Current status (mid-deploy)
| Component | State |
|---|---|
| GitHub repo `origin/main` | in sync, latest pushed |
| `pyproject.toml` + `uv.lock` | in sync; `uv sync --extra train` reproduces exactly |
| HF model `pratinavseth/cricket-captain-warmup-stage2` | warmup-v7 adapter uploaded; `tokenizer_config` patched for HF Inference compat |
| HF Inference Endpoint `cricket-captain-v1` | paused — fails on the default container, requires a custom container or a pivot to Path A |
| HF Space `pratinavseth/cricket-captain-llm` | RUNNING, secrets set, configured for Gemma-vs-Gemma until the trained captain is wired in |
| HF Space visibility | private — flip to public before submission |
| Main training run | in progress (~step 47 of 100) |
### Background screens (for session durability)

`screen -ls` shows:
| Screen | Purpose |
|---|---|
| `cc-keepalive` | 60-second heartbeat (Lightning instance idle protection) |
| `cc-monitor` | `tail -F /tmp/qwen3_main.log` (live training output) |
| `cc-endpoint` | polls the inference endpoint status every 5 min |
Attach: `screen -r cc-monitor`. Detach: `Ctrl-A` then `D`.
⚠️ The training process itself is NOT in screen — it's parented to the Bash session that started it. If that session dies, training dies with it (SIGHUP). Acceptable for the current run because we're ~50% in; on the next run, launch via `screen -dmS cc-train .venv-qwen3/bin/python train.py train ...` so it survives any disconnect.
### Submission checklist mapped to hackathon criteria

#### Round 2 — minimum requirements
- OpenEnv (latest): `openenv-core[core]>=0.2.2`
- Working training script (TRL GRPO): `train.py`
- HF Space deployed: `pratinavseth/cricket-captain-llm`
- README motivates problem + explains env + links materials
- README links to HF Space + W&B + (placeholder for blog/video)
- Loss + reward plots embedded in README (waiting on main run)
- Mini-blog or ≤2 min video (writing after results land)
- HF Space made public
#### Round 1 — additional spec
- 3 graded tasks (`easy`/`medium`/`hard` in `openenv.yaml`)
- Score in [0, 1] per task (win=1.0 / tie=0.5 / loss=0.0)
- `[START]`/`[STEP]`/`[END]` STDOUT markers in `inference.py` (sketched after this list)
- `API_BASE_URL`/`MODEL_NAME`/`HF_TOKEN` env-var contract
- Inference runs on vCPU=2 / 8 GB RAM (HF Router default; no local model load)
- Pydantic-typed Action / Observation / State (sketched after this list)
- `validate-submission.sh` exists
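To make the marker and typing items concrete, a minimal sketch of both. Class and field names are illustrative placeholders, not the repo's actual definitions:

```python
# Illustrative only: the [START]/[STEP]/[END] STDOUT contract plus
# Pydantic-typed messages. Field names are hypothetical placeholders.
from pydantic import BaseModel

class CaptainAction(BaseModel):
    bowler: str          # who bowls the next over
    field_setting: str   # e.g. "attacking" / "defensive"

class CaptainObservation(BaseModel):
    over: int
    runs_conceded: int
    wickets: int

print("[START]")                  # grader looks for this before the episode
for over in range(1, 21):         # 20-over innings
    print(f"[STEP] over={over}")  # one marker per env step
print("[END] score=1.0")          # final score in [0, 1]: win=1.0/tie=0.5/loss=0.0
```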
#### Judging weight (Round 2)
| Criterion | Weight | Status |
|---|---|---|
| Environment Innovation | 40% | strong — cricket captaincy is uncommon, two-sided, real-data Markov engine |
| Storytelling | 30% | README + Theme #2 alignment good; pending the mini-blog |
| Showing Improvement in Rewards | 20% | warmup quartile data already shows monotonic improvement; full plots after main |
| Reward & Pipeline | 10% | 4-rubric composite, documented signal flow, no obvious gaming path |
### Pending tasks (post-training)
- Run `compare_eval.py` baseline vs trained → head-to-head numbers.
- Export W&B panels as PNG → `docs/plots/` → embed in README with captions.
- Replace the README "Results" placeholders with real numbers.
- Write the mini-blog (600–800 words HF blog post or ≤2 min YouTube).
- Re-upload the final main-run adapter to HF Hub.
- Wire the trained captain into the Space (Path A above).
- Flip Space visibility to public.
- Run `validate-submission.sh` end-to-end.