# PhysiX RLVR — Cloud Training Launcher
This folder contains the scripts that launch SFT → GRPO training for the PhysiX OpenEnv on Hugging Face Jobs, plus a self-contained Colab notebook judges can re-run.
## Files
| File | What it does |
|---|---|
| `physix_train_colab.ipynb` | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the 1.5B profile, L4/A100 for the 3B profile. |
| `submit.py` | Submit a job to HF Jobs via `HfApi.run_uv_job`. |
| `job_train.py` | Training driver across the 3 trained systems. Runs inside the HF Jobs container. PEP 723 inline deps. |
| `job_train_single.py` | Single-system variant (defaults to `damped_spring`) — focused reward signal, easier-to-read curves. |
| `sync-plots.sh` | Pull committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |
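For context, "PEP 723 inline deps" means the training script carries its own dependency metadata, so the `uv` runner used by HF Jobs can resolve everything before execution. A minimal sketch of what such a header looks like (the actual package list lives in `job_train.py` and may differ):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "trl",      # GRPO trainer (assumed from the stack named above)
#     "unsloth",  # fast fine-tuning backend (assumed)
# ]
# ///
# uv reads the block above and installs these deps into an ephemeral env
# before running the script, so no per-dependency Dockerfile changes are needed.
```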
## Required secrets
| Secret | Source |
|---|---|
| `HF_TOKEN` | hf.co/settings/tokens (write) |
| `WANDB_API_KEY` | wandb.ai/authorize (optional — Colab notebook can run with W&B disabled) |
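Secrets passed to HF Jobs are exposed to the job container as environment variables, so the training driver can read them the usual way. A minimal sketch, assuming exactly the two names above:

```python
import os

# HF Jobs injects secrets as env vars inside the container.
hf_token = os.environ["HF_TOKEN"]            # required: Hub pushes
wandb_key = os.environ.get("WANDB_API_KEY")  # optional: W&B logging
if wandb_key is None:
    print("WANDB_API_KEY not set; running with W&B disabled.")
```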
## Submit a cloud job
```bash
export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py
```
Defaults: `l40sx1`, 3 h timeout. Source is fetched at job start by `_stage_physix_live()` directly from this Hugging Face Space repo.
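For orientation, the submission boils down to a single `HfApi.run_uv_job` call. A hedged sketch, not `submit.py` itself; the keyword arguments follow `huggingface_hub`'s Jobs API, but the exact values and the fields of the returned object depend on your installed version:

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])
job = api.run_uv_job(
    "job_train.py",          # PEP 723 script; uv resolves its inline deps
    flavor="l40sx1",         # default GPU flavor (see above)
    timeout="3h",            # default timeout (see above)
    secrets={
        "HF_TOKEN": os.environ["HF_TOKEN"],
        "WANDB_API_KEY": os.environ.get("WANDB_API_KEY", ""),
    },
)
print(job.id)  # job identifier; inspect the returned JobInfo for status/URL
```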
## Run in Colab

Open `physix_train_colab.ipynb` on a Colab GPU runtime. The notebook installs the same dependency set as the cloud job, fetches the source from this Hugging Face Space, runs SFT then GRPO, and plots loss + reward curves at the end.
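If you want to fetch the Space's source in your own Colab cell rather than rely on the notebook's built-in step, `huggingface_hub.snapshot_download` handles it. A sketch; the `repo_id` below is a placeholder for this Space's actual id:

```python
from huggingface_hub import snapshot_download

# Download the Space repo (scripts, env code) to a local folder.
src_dir = snapshot_download(
    repo_id="<user>/<this-space>",  # placeholder: substitute the real Space id
    repo_type="space",
)
print(src_dir)  # e.g. add this path to sys.path before importing the env
```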
## Pipeline cost (`l40sx1`, 3B profile)
| Step | Time | Cost |
|---|---|---|
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
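End to end, that is roughly 46 minutes and about $1.39 per full run.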
W&B project: `pratyush01/physix-live`.
## W&B metrics to watch
| Metric | Expected trajectory |
|---|---|
| `train/loss` (GRPO) | stays near zero by design — only the KL term, not the policy gradient |
| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
| `train/kl` | rises slowly, stays below ~0.5 |
| `train/grad_norm` | bounded, typically 0.3–1.5 |
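To check these curves programmatically instead of in the dashboard, the W&B public API can pull run histories. A sketch against the project named above; the run selection and metric availability are assumptions:

```python
import wandb

api = wandb.Api()
runs = api.runs("pratyush01/physix-live")  # project from this README
run = runs[0]                              # assumption: most recent run

# history() returns a pandas DataFrame of logged steps for the given keys.
df = run.history(keys=["train/reward", "train/kl", "train/grad_norm"])
print(df.tail())
```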