# Submission Operations
Everything an operator (you, a teammate, a judge) needs to reproduce or
re-deploy this submission. Source of truth for HF Space secrets, the trained-
model deployment recipe, and the runtime topology.
## Topology

```
                 ┌─────────────────────────┐
  judge ───────► │ HF Space                │
  browser        │ pratinavseth/           │
                 │ cricket-captain-llm     │
                 │ (Docker, cpu-basic)     │
                 └────────┬────────────────┘
                          │ outbound HTTPS to opponents/captain
                          │
         ┌────────────────┼──────────────────────┐
         ▼                ▼                      ▼
   ┌──────────┐   ┌──────────────┐     ┌──────────────────────┐
   │ HF Router│   │ HF Inference │     │ Self-hosted          │
   │ (free)   │   │ Endpoint     │     │ model_server.py      │
   │          │   │ (paused)     │     │ on H200 + ngrok      │
   │ Gemma 4  │   │              │     │                      │
   └──────────┘   └──────────────┘     └──────────────────────┘
                          │
                          └─► trained adapter:
                              pratinavseth/
                              cricket-captain-warmup-stage2
```
The Space's auto-play in the **Custom** tab calls one of these endpoints
based on Space secrets. Same code path on the Space — only the URLs change.
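That secret-driven switch can be sketched as below. This is an illustrative sketch, not the actual `ui.py` code; the function name is made up, but the environment variable names and defaults match the secrets table that follows.

```python
import os

def captain_endpoint_config():
    """Resolve the captain's OpenAI-compatible endpoint from Space secrets.

    Illustrative sketch: reads the same CRICKET_CAPTAIN_* variables the
    Space defines, falling back to the HF Router defaults.
    """
    return {
        "model": os.environ.get(
            "CRICKET_CAPTAIN_MODEL", "google/gemma-4-26B-A4B-it"),
        "api_base": os.environ.get(
            "CRICKET_CAPTAIN_API_BASE", "https://router.huggingface.co/v1"),
        "api_key": os.environ.get(
            "CRICKET_CAPTAIN_API_KEY", os.environ.get("HF_TOKEN", "")),
    }
```

Pointing the captain at a different backend is then only a matter of changing two secrets and restarting; no code change is needed.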
## Required HF Space secrets

Set at https://huggingface.co/spaces/pratinavseth/cricket-captain-llm/settings
(Variables and secrets → New secret). All of these are currently set programmatically.

| Name | Current value | Purpose |
|---|---|---|
| `HF_TOKEN` | `hf_*` (rotation recommended) | Auth for HF Router calls and for the self-hosted endpoint when proxied through ngrok |
| `API_KEY` | same as `HF_TOKEN` | Round 1 alias |
| `CRICKET_CAPTAIN_MODEL` | `google/gemma-4-26B-A4B-it` | Captain auto-play uses this (will flip to the trained adapter id after deployment) |
| `CRICKET_CAPTAIN_API_BASE` | `https://router.huggingface.co/v1` | OpenAI-compatible base URL |
| `CRICKET_CAPTAIN_API_KEY` | same as `HF_TOKEN` | |
| `CRICKET_OPPONENT_MODE` | `llm_live` | Forces the env to use a live LLM opponent at reset |
| `CRICKET_OPPONENT_MODEL` | `google/gemma-4-26B-A4B-it` | Opponent (currently the same as the captain — Gemma-vs-Gemma demo) |
| `CRICKET_OPPONENT_API_BASE` | `https://router.huggingface.co/v1` | |
| `CRICKET_OPPONENT_API_KEY` | same as `HF_TOKEN` | |
| `MODEL_NAME` | `google/gemma-4-26B-A4B-it` | Round 1 spec env var |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | Round 1 spec env var |
After updating any secret, trigger a Space restart (HfApi `restart_space()`
or Settings → "Restart this Space").
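Both steps can be scripted with `huggingface_hub`. `add_space_secret()` and `restart_space()` are real `HfApi` methods; the helper below is a sketch that takes the API client as an argument so the call flow can be checked without a token.

```python
# Assumes huggingface_hub's HfApi, whose add_space_secret() and
# restart_space() methods this helper calls; the client is injected so
# the flow can be exercised with a stub in tests.

SPACE = "pratinavseth/cricket-captain-llm"

def set_secret_and_restart(api, key, value, repo_id=SPACE):
    """Update one Space secret, then restart the Space so it takes effect."""
    api.add_space_secret(repo_id=repo_id, key=key, value=value)
    api.restart_space(repo_id=repo_id)

# Real usage (requires a write token):
#   from huggingface_hub import HfApi
#   set_secret_and_restart(HfApi(token="hf_..."), "CRICKET_CAPTAIN_MODEL", "local")
```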
## Trained-captain deployment recipe (post-main-training)

Goal: replace the Gemma-vs-Gemma demo with **trained captain (your adapter)
vs Gemma 4 (HF Router)**. Two viable live paths, plus a fallback (Path C).
### Path A — Self-hosted model_server.py + ngrok (free)

```bash
# 1. After the main run completes, push the fresh adapter to the Hub
HF_TOKEN=... python -c "
from huggingface_hub import HfApi
HfApi(token='...').upload_folder(
    folder_path='./checkpoints/stage2_final',
    repo_id='pratinavseth/cricket-captain-warmup-stage2',
    repo_type='model',
)
"

# 2. Run the OpenAI-compatible adapter server on the GPU box
.venv-qwen3/bin/python model_server.py \
    --checkpoint ./checkpoints/stage2_final \
    --port 8080

# 3. Tunnel to a public URL
ngrok http 8080   # → https://<random>.ngrok.io

# 4. Update Space secrets:
#      CRICKET_CAPTAIN_MODEL    = local
#      CRICKET_CAPTAIN_API_BASE = https://<random>.ngrok.io/v1
#    then restart the Space.
```
Trade-offs: free and serves the real trained model, but requires the H200 to
stay up and tunneled while judges click around.
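Before flipping the secrets, it is worth smoke-testing the tunnel. A standard-library sketch follows; the `/chat/completions` route is assumed from the server being OpenAI-compatible, and `build_chat_request` is an illustrative helper, not part of `model_server.py`.

```python
import json
import urllib.request

def build_chat_request(base_url, model="local", prompt="ping", token=None):
    """Build an OpenAI-compatible chat-completion request for the tunneled server."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }).encode()
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# Real usage once ngrok is up:
#   with urllib.request.urlopen(build_chat_request("https://<random>.ngrok.io/v1")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```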
### Path B — HF Inference Endpoint with custom container (paid)

The default HF Inference text-generation pipeline container ships with an
older `transformers` that hits a known bug on Qwen3-4B-Instruct-2507's
tokenizer (`AttributeError: 'list' object has no attribute 'keys'` in
`_set_model_specific_special_tokens`). Workarounds:

1. Build a custom container based on `ghcr.io/huggingface/text-generation-inference:latest`
   (TGI v3+, supports LoRA via `--lora-adapters`).
2. Or use a vLLM-based custom container.
3. Re-deploy with `framework="custom"` and `custom_image={"url": ...}`.

Cost: nvidia-l4 ≈ $0.80/hr while running. **Pause the endpoint when not
demoing** to stop billing — pausing preserves the model so resume is fast.
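A sketch of step 3 via `huggingface_hub`. The parameter names follow `HfApi.create_inference_endpoint`; the vendor, region, and instance values are placeholders to adapt, the `custom_image` schema is assumed from HF's endpoint docs, and the client is injected so the call shape can be checked offline.

```python
# Assumes huggingface_hub's HfApi.create_inference_endpoint(); the
# custom_image keys (url / health_route / env) and the LORA_ADAPTERS
# variable are assumptions to verify against the TGI and endpoint docs.

ADAPTER_REPO = "pratinavseth/cricket-captain-warmup-stage2"

def deploy_custom_tgi(api, name="cricket-captain-v1"):
    """Re-create the endpoint on a custom TGI image that serves the adapter."""
    return api.create_inference_endpoint(
        name,
        repository=ADAPTER_REPO,
        framework="custom",
        task="text-generation",
        accelerator="gpu",
        vendor="aws",           # placeholder: pick your provider
        region="us-east-1",     # placeholder
        instance_size="x1",     # placeholder sizing
        instance_type="nvidia-l4",
        custom_image={
            "url": "ghcr.io/huggingface/text-generation-inference:latest",
            "health_route": "/health",
            "env": {"LORA_ADAPTERS": ADAPTER_REPO},
        },
    )
```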
### Path C — Skip trained-captain demo, document via plots

Use Gemma vs Gemma as the Space demo. Keep the *trained-vs-baseline*
evidence in `compare_eval.py` numbers + W&B plots embedded in the README.
Judges see: "Cricket env runs with two LLMs in real time" + "training plots
prove the agent learned." Less impressive than a live trained-captain demo,
but submission-complete.
## Current status (mid-deploy)

| Component | State |
|---|---|
| GitHub repo `origin/main` | in sync, latest pushed |
| `pyproject.toml` + `uv.lock` | in sync; `uv sync --extra train` reproduces the env exactly |
| HF model `pratinavseth/cricket-captain-warmup-stage2` | warmup-v7 adapter uploaded; tokenizer_config patched for HF Inference compat |
| HF Inference Endpoint `cricket-captain-v1` | `paused` — fails on the default container; needs a custom container or a pivot to Path A |
| HF Space `pratinavseth/cricket-captain-llm` | `RUNNING`, secrets set, configured for Gemma-vs-Gemma until the trained captain is wired in |
| HF Space visibility | **private** — flip to public before submission |
| Main training run | in progress (~step 47 of 100) |
## Background screens (for session durability)

`screen -ls` shows:

| Screen | Purpose |
|---|---|
| `cc-keepalive` | 60-second heartbeat (Lightning instance idle protection) |
| `cc-monitor` | `tail -F /tmp/qwen3_main.log` (live training output) |
| `cc-endpoint` | polls the inference endpoint status every 5 min |

Attach with `screen -r cc-monitor`; detach with `Ctrl-A` then `D`.
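The `cc-keepalive` heartbeat needs nothing more than a timestamp loop. A minimal sketch, where the file path, interval, and function name are illustrative rather than the actual script:

```python
import datetime
import pathlib
import time

def heartbeat(path="/tmp/cc_keepalive", interval=60, beats=None):
    """Touch `path` with a fresh timestamp every `interval` seconds.

    `beats=None` runs forever (the screen-session case); a finite count
    is handy for testing.
    """
    p = pathlib.Path(path)
    done = 0
    while beats is None or done < beats:
        p.write_text(datetime.datetime.now().isoformat() + "\n")
        done += 1
        if beats is None or done < beats:
            time.sleep(interval)

# In the screen session: heartbeat()
```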
⚠️ The training process *itself* is NOT in screen — it is parented to the
bash session that started it. If that session dies, training dies with it
(SIGHUP). Acceptable for the current run because we're ~50% in; on the next
run, launch via `screen -dmS cc-train .venv-qwen3/bin/python train.py train ...`
to make it survive any disconnect.
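A quick way to check whether a running process would survive the shell dying is to look at its parent. A standard-library sketch, assuming a Linux `/proc` layout:

```python
import os

def parent_pid(pid):
    """Read PPid from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("PPid:"):
                return int(line.split()[1])
    raise ValueError(f"no PPid found for {pid}")

# A process reparented to PID 1 (or to the screen/tmux server) survives
# the shell; one whose parent is your interactive bash gets SIGHUP.
```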
## Submission checklist mapped to hackathon criteria

### Round 2 — minimum requirements

- [x] OpenEnv (latest): `openenv-core[core]>=0.2.2`
- [x] Working training script (TRL GRPO): `train.py`
- [x] HF Space deployed: `pratinavseth/cricket-captain-llm`
- [x] README motivates the problem, explains the env, and links materials
- [x] README links to the HF Space + W&B + (placeholder for blog/video)
- [ ] Loss + reward plots embedded in the README (waiting on the main run)
- [ ] Mini-blog or ≤2 min video (writing after results land)
- [ ] HF Space made public

### Round 1 — additional spec

- [x] 3 graded tasks (`easy` / `medium` / `hard` in `openenv.yaml`)
- [x] Score in [0, 1] per task (win = 1.0 / tie = 0.5 / loss = 0.0)
- [x] `[START]` / `[STEP]` / `[END]` STDOUT markers in `inference.py`
- [x] `API_BASE_URL` / `MODEL_NAME` / `HF_TOKEN` env-var contract
- [x] Inference runs on vCPU=2 / 8 GB RAM (HF Router default; no local model load)
- [x] Pydantic-typed Action / Observation / State
- [x] `validate-submission.sh` exists
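The score mapping and STDOUT marker contract above can be sketched together. The exact marker payloads are assumptions (the actual `inference.py` may format them differently); only the `[START]`/`[STEP]`/`[END]` prefixes and the win/tie/loss scores come from the checklist.

```python
def task_score(outcome):
    """Map a match outcome to the required [0, 1] score."""
    return {"win": 1.0, "tie": 0.5, "loss": 0.0}[outcome]

def emit_markers(task, steps, outcome):
    """Yield the STDOUT lines a graded run is expected to print."""
    yield f"[START] task={task}"
    for i, step in enumerate(steps):
        yield f"[STEP] {i}: {step}"
    yield f"[END] score={task_score(outcome)}"

# for line in emit_markers("easy", ["bowl spinner", "set field"], "win"):
#     print(line)
```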
### Judging weight (Round 2)

| Criterion | Weight | Status |
|---|---|---|
| Environment Innovation | 40% | strong — cricket captaincy is uncommon, two-sided, and backed by a real-data Markov engine |
| Storytelling | 30% | README + Theme #2 alignment is good; pending the mini-blog |
| Showing Improvement in Rewards | 20% | warmup quartile data already shows monotonic improvement; full plots after the main run |
| Reward & Pipeline | 10% | 4-rubric composite, documented signal flow, no obvious gaming path |
## Pending tasks (post-training)

1. Run `compare_eval.py` baseline vs trained → head-to-head numbers.
2. Export W&B panels as PNGs → `docs/plots/` → embed in the README with captions.
3. Replace the README "Results" placeholders with real numbers.
4. Write the mini-blog (600–800 words on the HF blog, or a ≤2 min YouTube video).
5. Re-upload the final main-run adapter to the HF Hub.
6. Wire the trained captain into the Space (Path A above).
7. Flip the Space visibility to public.
8. Run `validate-submission.sh` end-to-end.