# Submission Operations

Everything an operator (you, a teammate, a judge) needs to reproduce or re-deploy this submission. Source of truth for the HF Space secrets, the trained-model deployment recipe, and the runtime topology.

## Topology

```
            ┌─────────────────────────┐
judge  ───► │ HF Space                │
browser     │ pratinavseth/           │
            │ cricket-captain-llm     │
            │ (Docker, cpu-basic)     │
            └────────┬────────────────┘
                     │ outbound HTTPS to opponents/captain
                     │
    ┌────────────────┼────────────────┐
    ▼                ▼                ▼
┌──────────┐  ┌──────────────┐  ┌──────────────────────┐
│ HF Router│  │ HF Inference │  │ Self-hosted          │
│ (free)   │  │ Endpoint     │  │ model_server.py      │
│          │  │ (paused)     │  │ on H200 + ngrok      │
│ Gemma 4  │  │              │  │                      │
└──────────┘  └──────────────┘  └──────────────────────┘
                     │
                     └─► trained adapter: pratinavseth/
                         cricket-captain-warmup-stage2
```

The Space's auto-play in the **Custom** tab calls one of these endpoints based on Space secrets. It is the same code path on the Space — only the URLs change.

## Required HF Space secrets

Set at https://huggingface.co/spaces/pratinavseth/cricket-captain-llm/settings (Variables and secrets → New secret). All are currently set programmatically.

| Name | Current value | Purpose |
|---|---|---|
| `HF_TOKEN` | `hf_*` (rotation recommended) | Auth for HF Router calls and for the self-served endpoint when proxied through ngrok |
| `API_KEY` | same as `HF_TOKEN` | Round 1 alias |
| `CRICKET_CAPTAIN_MODEL` | `google/gemma-4-26B-A4B-it` | Captain auto-play uses this (will flip to the trained adapter id after deployment) |
| `CRICKET_CAPTAIN_API_BASE` | `https://router.huggingface.co/v1` | OpenAI-compatible base URL |
| `CRICKET_CAPTAIN_API_KEY` | same as `HF_TOKEN` | |
| `CRICKET_OPPONENT_MODE` | `llm_live` | Forces the env to use a live LLM opponent at reset |
| `CRICKET_OPPONENT_MODEL` | `google/gemma-4-26B-A4B-it` | Opponent (currently same as captain — Gemma vs Gemma demo) |
| `CRICKET_OPPONENT_API_BASE` | `https://router.huggingface.co/v1` | |
| `CRICKET_OPPONENT_API_KEY` | same as `HF_TOKEN` | |
| `MODEL_NAME` | `google/gemma-4-26B-A4B-it` | Round 1 spec env var |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | Round 1 spec env var |

After updating any secret, trigger a Space restart (HfApi `restart_space()` or Settings → "Restart this Space").

## Trained-captain deployment recipe (post-main-training)

Goal: replace the Gemma-vs-Gemma demo with **trained captain (your adapter) vs Gemma 4 (HF Router)**. Three paths below: two live-deployment options (A and B) and a documentation-only fallback (C).

### Path A — Self-hosted model_server.py + ngrok (free)

```bash
# 1. After the main run completes, push the fresh adapter to the Hub
HF_TOKEN=... python -c "
from huggingface_hub import HfApi
HfApi(token='...').upload_folder(
    folder_path='./checkpoints/stage2_final',
    repo_id='pratinavseth/cricket-captain-warmup-stage2',
    repo_type='model',
)
"

# 2. Run the OpenAI-compatible adapter server on the GPU box
.venv-qwen3/bin/python model_server.py \
    --checkpoint ./checkpoints/stage2_final \
    --port 8080

# 3. Tunnel to a public URL
ngrok http 8080   # → https://.ngrok.io

# 4. Update Space secrets:
#      CRICKET_CAPTAIN_MODEL    = local
#      CRICKET_CAPTAIN_API_BASE = https://.ngrok.io/v1
#    Restart the Space.
```

Trade-offs: free and serves the real trained model, but the H200 must stay up and tunneled while judges click around.
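Step 4 of Path A can also be done programmatically rather than through the Settings UI. Below is a minimal sketch using `huggingface_hub` (`add_space_secret` and `restart_space`); the tunnel URL is a placeholder for whatever ngrok printed in step 3, and the smoke-test payload is illustrative rather than the Space's actual prompt.

```python
# Sketch: smoke-test the tunneled server, then point the Space at it and restart.
# Assumptions: HF_TOKEN in the environment is a write-scoped token, NGROK_BASE is
# the URL from step 3 (placeholder below), and model_server.py serves the
# OpenAI-compatible /chat/completions route described above.
import os

import requests
from huggingface_hub import HfApi

SPACE_ID = "pratinavseth/cricket-captain-llm"
NGROK_BASE = "https://<your-tunnel>.ngrok.io/v1"  # placeholder, not a real URL

# 1. Verify the tunnel answers before touching the Space.
resp = requests.post(
    f"{NGROK_BASE}/chat/completions",
    json={"model": "local", "messages": [{"role": "user", "content": "ping"}]},
    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
    timeout=60,
)
resp.raise_for_status()

# 2. Flip the captain secrets and restart so the new values are picked up.
api = HfApi(token=os.environ["HF_TOKEN"])
api.add_space_secret(SPACE_ID, "CRICKET_CAPTAIN_MODEL", "local")
api.add_space_secret(SPACE_ID, "CRICKET_CAPTAIN_API_BASE", NGROK_BASE)
api.restart_space(SPACE_ID)
```

If ngrok restarts and hands out a new URL, re-run the secret update and the restart.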
### Path B — HF Inference Endpoint with custom container (paid)

The default HF Inference text-generation pipeline container ships with an older transformers that hits a known bug on Qwen3-4B-Instruct-2507's tokenizer (`AttributeError: 'list' object has no attribute 'keys'` in `_set_model_specific_special_tokens`). Workarounds:

1. Build a custom container based on `ghcr.io/huggingface/text-generation-inference:latest` (TGI v3+, supports LoRA via `--lora-adapters`), or
2. use a vLLM-based custom container; then
3. re-deploy with `framework="custom"` and `custom_image={"url": ...}`.

Cost: nvidia-l4 ≈ $0.80/hr while running. **Pause the endpoint when not demoing** to stop billing — pausing preserves the model, so resume is fast.

### Path C — Skip the trained-captain demo, document via plots

Use Gemma vs Gemma as the Space demo. Keep the *trained-vs-baseline* evidence in the `compare_eval.py` numbers plus the W&B plots embedded in the README. Judges see "the cricket env runs with two LLMs in real time" plus "the training plots prove the agent learned". Less impressive than a live trained-captain demo, but submission-complete.

## Current status (mid-deploy)

| Component | State |
|---|---|
| GitHub repo `origin/main` | in sync, latest pushed |
| `pyproject.toml` + `uv.lock` | in sync; `uv sync --extra train` reproduces the environment exactly |
| HF model `pratinavseth/cricket-captain-warmup-stage2` | warmup-v7 adapter uploaded; tokenizer_config patched for HF Inference compat |
| HF Inference Endpoint `cricket-captain-v1` | `paused` — fails on the default container; requires a custom container or a pivot to Path A |
| HF Space `pratinavseth/cricket-captain-llm` | `RUNNING`, secrets set, configured for Gemma-vs-Gemma until the trained captain is wired in |
| HF Space visibility | **private** — flip to public before submission |
| Main training run | in progress (~step 47 of 100) |
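The three hosted components in this table can be re-checked from a shell without opening the web UI. A minimal sketch, assuming a recent `huggingface_hub`, a logged-in token (the Space is private), and the endpoint name `cricket-captain-v1` from the table; the attribute names are the library's current ones and may shift between versions.

```python
# Sketch: print the live state of the Space, the Inference Endpoint, and the
# adapter repo listed in the "Current status" table.
from huggingface_hub import HfApi

api = HfApi()  # picks up the cached token from `huggingface-cli login` or HF_TOKEN

space = api.get_space_runtime("pratinavseth/cricket-captain-llm")
print("Space stage:          ", space.stage)      # expect RUNNING

endpoint = api.get_inference_endpoint("cricket-captain-v1")
print("Endpoint status:      ", endpoint.status)  # expect paused

adapter = api.model_info("pratinavseth/cricket-captain-warmup-stage2")
print("Adapter last modified:", adapter.last_modified)
```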
## Background screens (for session durability)

`screen -ls` shows:

| Screen | Purpose |
|---|---|
| `cc-keepalive` | 60-second heartbeat (Lightning instance idle protection) |
| `cc-monitor` | `tail -F /tmp/qwen3_main.log` (live training output) |
| `cc-endpoint` | polls the inference endpoint status every 5 min |

Attach: `screen -r cc-monitor`. Detach: `Ctrl-A` then `D`.

⚠️ The training process *itself* is NOT in screen — it is parented to the Bash session that started it. If that session dies, training dies with it (SIGHUP). Acceptable for the current run because we're roughly halfway in; on the next run, launch via `screen -dmS cc-train .venv-qwen3/bin/python train.py train ...` so it survives any disconnect.

## Submission checklist mapped to hackathon criteria

### Round 2 — minimum requirements

- [x] OpenEnv (latest): `openenv-core[core]>=0.2.2`
- [x] Working training script (TRL GRPO): `train.py`
- [x] HF Space deployed: `pratinavseth/cricket-captain-llm`
- [x] README motivates the problem, explains the env, and links materials
- [x] README links to the HF Space + W&B + (placeholder for blog/video)
- [ ] Loss + reward plots embedded in the README (waiting on the main run)
- [ ] Mini-blog or ≤2 min video (writing after results land)
- [ ] HF Space made public

### Round 1 — additional spec

- [x] 3 graded tasks (`easy` / `medium` / `hard` in `openenv.yaml`)
- [x] Score in [0, 1] per task (win = 1.0 / tie = 0.5 / loss = 0.0)
- [x] `[START]` / `[STEP]` / `[END]` STDOUT markers in `inference.py`
- [x] `API_BASE_URL` / `MODEL_NAME` / `HF_TOKEN` env-var contract
- [x] Inference runs on vCPU=2 / 8 GB RAM (HF Router default; no local model load)
- [x] Pydantic typed Action / Observation / State
- [x] `validate-submission.sh` exists

### Judging weight (Round 2)

| Criterion | Weight | Status |
|---|---|---|
| Environment Innovation | 40% | strong — cricket captaincy is uncommon, two-sided, and built on a real-data Markov engine |
| Storytelling | 30% | README + Theme #2 alignment is good; pending the mini-blog |
| Showing Improvement in Rewards | 20% | warmup quartile data already shows monotonic improvement; full plots after the main run |
| Reward & Pipeline | 10% | 4-rubric composite, documented signal flow, no obvious gaming path |

## Pending tasks (post-training)

1. Run `compare_eval.py` baseline vs trained → head-to-head numbers.
2. Export W&B panels as PNG → `docs/plots/` → embed in the README with captions (sketched after this list).
3. Replace the README "Results" placeholders with real numbers.
4. Write the mini-blog (600–800 words HF blog post or ≤2 min YouTube).
5. Re-upload the final main-run adapter to the HF Hub.
6. Wire the trained captain into the Space (Path A above).
7. Flip the Space visibility to public (one-liner sketched after this list).
8. Run `validate-submission.sh` end-to-end.
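Pending task 2 amounts to pulling the metric histories from the W&B run and writing PNGs under `docs/plots/`. A minimal sketch with the public `wandb` API; the run path and metric keys are placeholders to be read off the W&B workspace, not values taken from this repo, and it assumes `wandb` is logged in (`WANDB_API_KEY`).

```python
# Sketch for pending task 2: export reward/loss curves from W&B into docs/plots/.
# RUN_PATH and METRICS are placeholders; substitute the real run path and the
# metric keys shown in the W&B workspace.
from pathlib import Path

import matplotlib.pyplot as plt
import wandb

RUN_PATH = "<entity>/<project>/<run_id>"   # placeholder
METRICS = ["train/reward", "train/loss"]   # assumed metric names

out_dir = Path("docs/plots")
out_dir.mkdir(parents=True, exist_ok=True)

run = wandb.Api().run(RUN_PATH)
history = run.history(keys=METRICS)        # sampled history as a DataFrame

for metric in METRICS:
    fig, ax = plt.subplots()
    ax.plot(history["_step"], history[metric])
    ax.set_xlabel("step")
    ax.set_ylabel(metric)
    ax.set_title(metric)
    fig.savefig(out_dir / f"{metric.replace('/', '_')}.png", dpi=150, bbox_inches="tight")
    plt.close(fig)
```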
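Pending task 7 (and the "flip to public before submission" row in the status table) can be done from the Space Settings page or programmatically. One possible route via `huggingface_hub`, assuming a write-scoped token; newer releases expose the same operation as `update_repo_settings(..., private=False)`.

```python
# Sketch for pending task 7: make the Space public right before submission.
import os

from huggingface_hub import HfApi

HfApi(token=os.environ["HF_TOKEN"]).update_repo_visibility(
    "pratinavseth/cricket-captain-llm", private=False, repo_type="space"
)
```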