# PhysiX RLVR — Cloud Training Launcher

This folder contains the scripts that launch SFT → GRPO training for the
[PhysiX OpenEnv](../) on **Hugging Face Jobs**, plus a self-contained
**Colab notebook** that judges can re-run.

## Files

| File | What it does |
|------|--------------|
| [`physix_train_colab.ipynb`](physix_train_colab.ipynb) | End-to-end SFT → GRPO in one notebook. Built on **OpenEnv + Unsloth + TRL**. Use a T4 or L4 for the `1.5b` profile and an L4 or A100 for the `3b` profile. |
| [`submit.py`](submit.py) | Submit a job to HF Jobs via `HfApi.run_uv_job`. |
| [`job_train.py`](job_train.py) | Training driver covering the 3 trained systems. Runs *inside* the HF Jobs container. Dependencies are declared inline (PEP 723). |
| [`job_train_single.py`](job_train_single.py) | Single-system variant (defaults to `damped_spring`). Gives a more focused reward signal and curves that are easier to read. |
| [`sync-plots.sh`](sync-plots.sh) | Pull committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |

## Required secrets

| Secret | Source |
|--------|--------|
| `HF_TOKEN` | [hf.co/settings/tokens](https://huggingface.co/settings/tokens) (write) |
| `WANDB_API_KEY` | [wandb.ai/authorize](https://wandb.ai/authorize) (optional — Colab notebook can run with W&B disabled) |
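
A pre-flight check can catch a missing secret before a job is submitted. The sketch below is not part of `submit.py`; the helper names are hypothetical, and treating `WANDB_API_KEY` as optional for cloud jobs (not only for Colab) is an assumption.

```python
# Hypothetical pre-flight helpers; not taken from submit.py.
REQUIRED_SECRETS = ("HF_TOKEN",)
OPTIONAL_SECRETS = ("WANDB_API_KEY",)  # assumption: optional everywhere, not only in Colab


def missing_secrets(env, required=REQUIRED_SECRETS):
    """Return the names of required secrets that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def wandb_enabled(env):
    """W&B logging is only possible when an API key is present."""
    return bool(env.get("WANDB_API_KEY"))


# Example with a plain dict standing in for os.environ.
sample_env = {"HF_TOKEN": "hf_example"}
problems = missing_secrets(sample_env)
logging_on = wandb_enabled(sample_env)
```

Passing `os.environ` instead of `sample_env` checks the real shell environment.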

## Submit a cloud job

```bash
export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py
```

Defaults: `l40sx1` flavor, 3 h timeout. At job start, `_stage_physix_live()`
fetches the source directly from this Hugging Face Space repo.
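
For orientation, a submission through `HfApi.run_uv_job` could look roughly like the sketch below. The exact arguments `submit.py` passes are not shown in this README, so `build_job_kwargs`, `submit`, and the choice of `job_train.py` as the default script are illustrative assumptions. It also assumes a `huggingface_hub` release recent enough to include `run_uv_job`.

```python
import os


def build_job_kwargs(env, script="job_train.py", flavor="l40sx1", timeout="3h"):
    """Assemble keyword arguments for HfApi.run_uv_job (hypothetical helper)."""
    secrets = {"HF_TOKEN": env["HF_TOKEN"]}
    # Forward the W&B key only when it is set.
    if env.get("WANDB_API_KEY"):
        secrets["WANDB_API_KEY"] = env["WANDB_API_KEY"]
    return {"script": script, "flavor": flavor, "timeout": timeout, "secrets": secrets}


def submit(env=None):
    """Submit the job; the import is deferred so the helper above has no dependencies."""
    from huggingface_hub import HfApi

    env = os.environ if env is None else env
    api = HfApi(token=env["HF_TOKEN"])
    return api.run_uv_job(**build_job_kwargs(env))
```

Keeping the argument assembly separate from the network call makes the defaults easy to inspect or override without submitting anything.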

## Run in Colab

Open [`physix_train_colab.ipynb`](physix_train_colab.ipynb) on a Colab
GPU runtime. The notebook installs the same dependency set as the cloud
job, fetches the source from this Hugging Face Space, runs SFT then
GRPO, and plots loss + reward curves at the end.

## Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
|------|------|------|
| SFT warm-start (3 epochs, 192 examples)            | ~5 min  | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub                    | ~1 min  | ~$0.04 |

W&B project: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live).
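
Summing the rows of the cost table gives the budget for one full run: about 46 minutes and about $1.39, well inside the 3 h default timeout. The short sketch below reproduces that arithmetic. The hourly rate is derived from the GRPO row and is an inference, not a quoted price.

```python
# (step, minutes, cost in USD), copied from the table above.
STEPS = [
    ("SFT warm-start", 5, 0.15),
    ("GRPO", 40, 1.20),
    ("Push merged 16-bit model", 1, 0.04),
]

total_minutes = sum(minutes for _, minutes, _ in STEPS)
total_cost = round(sum(cost for _, _, cost in STEPS), 2)

# Hourly rate implied by the GRPO row (an inference from the table).
implied_hourly_rate = round(1.20 / 40 * 60, 2)

# The default job timeout is 3 hours.
fits_in_timeout = total_minutes < 3 * 60
```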

## W&B metrics to watch

| Metric | Expected trajectory |
|--------|---------------------|
| `train/loss` (GRPO) | **stays near zero by design**: the logged value mostly reflects the KL term, not the policy-gradient signal, so a flat curve is expected |
| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
| `train/kl` | rises slowly, stays below ~0.5 |
| `train/grad_norm` | bounded, typically 0.3–1.5 |
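
One common explanation for the flat `train/loss` curve: GRPO normalizes rewards within each group of generations, so the advantages average to zero. When the sampling policy and the trained policy coincide, the surrogate term therefore contributes almost nothing to the loss *value*, even though its gradient is non-zero. The sketch below illustrates this under simplifying assumptions: per-sample rather than per-token terms, a probability ratio of exactly 1, and hypothetical values for `beta`, the rewards, and the KL estimates. It is not the TRL implementation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group: subtract the mean, divide by the std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def on_policy_loss(rewards, kl_per_sample, beta):
    """Loss value when the probability ratio is 1: -advantage + beta * KL, averaged."""
    advantages = group_advantages(rewards)
    return mean(-a + beta * k for a, k in zip(advantages, kl_per_sample))


# One group of 4 generations (the group size from the GRPO row of the cost table).
rewards = [0.2, 1.5, 0.9, 1.1]    # hypothetical rewards
kl = [0.02, 0.03, 0.01, 0.04]     # hypothetical per-sample KL estimates
beta = 0.04                       # hypothetical KL coefficient

loss = on_policy_loss(rewards, kl, beta)
kl_only = beta * mean(kl)         # matches `loss` up to floating-point error
```

Because the advantage terms cancel, `loss` reduces to `beta * mean(kl)`, which is why the reward and KL curves are more informative than the loss.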