PhysiX RLVR — Cloud Training Launcher
This folder contains the scripts that launch SFT → GRPO training for the PhysiX OpenEnv on Hugging Face Jobs, plus a self-contained Colab notebook judges can re-run.
This used to be a separate physix-train/ repo and training Space (Dockerfile + train.sh). We migrated to HF Jobs because it queues jobs, does not bill for idle time, and reuses the upstream Unsloth image directly. The Docker artifacts have been removed and the launcher moved into the env repo, so there is now one repo and one Space.
Files
| File | What it does |
|---|---|
| physix_train_colab.ipynb | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the 1.5b profile, A100 for 3b. |
| submit.py | Submits a job to HF Jobs via HfApi.run_uv_job (the CLI hangs intermittently on whoami; this path is reliable). |
| job_train.py | Multi-system training driver (6 in-distribution systems). Runs inside the HF Jobs container. PEP 723 inline deps. |
| job_train_single.py | Single-system variant (defaults to damped_spring). Gives a focused reward signal and easier-to-read curves. |
| sync-plots.sh | Pulls committed loss/reward PNGs from the model repo into ../docs/plots/ so they ship with the env Space. |
Required secrets
| Secret | Source |
|---|---|
HF_TOKEN |
hf.co/settings/tokens (write) |
WANDB_API_KEY |
wandb.ai/authorize (optional — Colab notebook can run with W&B disabled) |
Submit a cloud job
export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py
Defaults: l40sx1 ($1.80/hr), 3 h timeout, source mounted from hf://datasets/Pratyush-01/physix-live-src:/physix-live.
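As a rough illustration of what a submission through `HfApi.run_uv_job` might look like, here is a minimal sketch. It is not the contents of submit.py: `build_job_kwargs` and `submit` are hypothetical helpers, the keyword names (`flavor`, `timeout`, `secrets`) reflect the huggingface_hub Jobs API as I understand it and should be checked against your installed version, and the source-dataset mount is deliberately left out.

```python
import os


def build_job_kwargs(script: str = "job_train.py",
                     flavor: str = "l40sx1",
                     timeout: str = "3h") -> dict:
    """Collect keyword arguments for a job submission (hypothetical helper)."""
    # HF_TOKEN is required; WANDB_API_KEY is forwarded only when set.
    secrets = {"HF_TOKEN": os.environ["HF_TOKEN"]}
    wandb_key = os.environ.get("WANDB_API_KEY")
    if wandb_key:
        secrets["WANDB_API_KEY"] = wandb_key
    return {
        "script": script,
        "flavor": flavor,
        "timeout": timeout,
        "secrets": secrets,
    }


def submit(**overrides):
    """Submit the job. Needs network access and a recent huggingface_hub."""
    # Imported lazily so the argument builder can be used without the package.
    from huggingface_hub import HfApi

    api = HfApi(token=os.environ["HF_TOKEN"])
    return api.run_uv_job(**build_job_kwargs(**overrides))
```

Keeping the argument construction separate from the network call makes it possible to inspect the flavor, timeout, and forwarded secrets before anything is billed.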
Run in Colab
Open physix_train_colab.ipynb on a Colab
GPU runtime. The notebook installs the same dependency set as the cloud
job, fetches the source from the HF dataset, runs SFT then GRPO, and
plots loss + reward + per-component reward curves at the end.
Pipeline cost (l40sx1, 3B profile)
| Step | Time | Cost |
|---|---|---|
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
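The costs in the table follow directly from the l40sx1 hourly rate quoted above. The short sketch below reproduces the arithmetic; the step durations are the approximate figures from the table, and the helper names are made up for illustration.

```python
RATE_USD_PER_HOUR = 1.80  # l40sx1 rate quoted in this README

# Approximate wall-clock minutes per step, copied from the table.
STEP_MINUTES = {
    "SFT warm-start": 5,
    "GRPO": 40,
    "Push merged model": 1,
}


def step_cost(minutes: float, rate: float = RATE_USD_PER_HOUR) -> float:
    """Cost in USD of running for `minutes` at `rate` USD per hour."""
    return minutes / 60 * rate


def pipeline_cost(steps: dict, rate: float = RATE_USD_PER_HOUR) -> float:
    """Total cost in USD across all steps."""
    return sum(step_cost(m, rate) for m in steps.values())


if __name__ == "__main__":
    for name, minutes in STEP_MINUTES.items():
        print(f"{name}: ${step_cost(minutes):.2f}")
    print(f"total: ${pipeline_cost(STEP_MINUTES):.2f}")
```

A full run therefore comes to roughly $1.40 at this rate, well inside the 3 h default timeout.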
W&B project: pratyush01/physix-live.
W&B metrics to watch
| Metric | Expected trajectory |
|---|---|
| train/loss (GRPO) | stays near zero by design: it reflects only the KL term, not the policy gradient |
| train/reward (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| train/rewards/reward_match/mean | climbs from ~0.05 to ~0.3–0.5 |
| train/rewards/reward_format/mean | climbs from ~0.5 to ~0.9+ |
| train/rewards/reward_match_dense/mean | climbs alongside reward_match |
| train/kl | rises slowly, stays below ~0.5 |
| train/grad_norm | bounded, typically 0.3–1.5 |
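The expected ranges above can be turned into a quick sanity check on end-of-run values. This is a sketch only: the bounds are the approximate figures from the table treated as hard limits, and `out_of_range` is a hypothetical helper, not part of the training scripts.

```python
# (low, high) bounds for late-training values; None means unbounded on that side.
# Taken from the approximate figures in the table, so treat them as soft guides.
EXPECTED_END_RANGES = {
    "train/reward": (1.5, 2.0),
    "train/rewards/reward_match/mean": (0.3, 0.5),
    "train/rewards/reward_format/mean": (0.9, None),
    "train/kl": (None, 0.5),
    "train/grad_norm": (0.3, 1.5),
}


def out_of_range(metrics: dict) -> list:
    """Return the names of logged metrics that fall outside the expected range."""
    flagged = []
    for name, (low, high) in EXPECTED_END_RANGES.items():
        if name not in metrics:
            continue  # metrics that were not logged are ignored, not flagged
        value = metrics[name]
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    final = {"train/reward": 1.7, "train/kl": 0.8, "train/grad_norm": 0.9}
    print(out_of_range(final))
```

A metric landing outside its range is a prompt to look at the curve in W&B rather than proof that the run failed.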