# PhysiX RLVR: Cloud Training Launcher

This folder contains the scripts that launch SFT → GRPO training for the
[PhysiX OpenEnv](../) on **Hugging Face Jobs**, plus a self-contained
**Colab notebook** that judges can re-run.

> This used to be a separate `physix-train/` repo / training Space
> (Dockerfile + `train.sh`). We migrated to HF Jobs because it queues jobs,
> doesn't bill for idle time, and reuses the upstream Unsloth image
> directly. The Docker artifacts have been removed and the launcher
> moved into the env repo, so there's now one repo, one Space.

## Files

| File | What it does |
|------|--------------|
| [`physix_train_colab.ipynb`](physix_train_colab.ipynb) | End-to-end SFT → GRPO in one notebook. Built on **OpenEnv + Unsloth + TRL**. Use a T4/L4 for the `1.5b` profile and an A100 for `3b`. |
| [`submit.py`](submit.py) | Submits a job to HF Jobs via `HfApi.run_uv_job` (the CLI hangs intermittently on whoami; this path is reliable). |
| [`job_train.py`](job_train.py) | Multi-system training driver (6 in-distribution systems). Runs *inside* the HF Jobs container. Dependencies are declared inline (PEP 723). |
| [`job_train_single.py`](job_train_single.py) | Single-system variant (defaults to `damped_spring`): a focused reward signal and easier-to-read curves. |
| [`sync-plots.sh`](sync-plots.sh) | Pulls committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |

## Required secrets

| Secret | Source |
|--------|--------|
| `HF_TOKEN` | [hf.co/settings/tokens](https://huggingface.co/settings/tokens) (write) |
| `WANDB_API_KEY` | [wandb.ai/authorize](https://wandb.ai/authorize) (optional; the Colab notebook can run with W&B disabled) |
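
Before launching anything, it can help to confirm that the secrets listed above are actually set. The following is a minimal sketch, not part of this repo: the helper name `check_secrets` is hypothetical, and it assumes `HF_TOKEN` is required while `WANDB_API_KEY` is optional.

```python
import os

REQUIRED = ("HF_TOKEN",)
OPTIONAL = ("WANDB_API_KEY",)


def check_secrets(env=None):
    """Return (missing_required, missing_optional) for an environment mapping.

    Hypothetical helper: an unset or empty variable counts as missing.
    """
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional
```

Calling `check_secrets()` with no arguments inspects the current process environment; a non-empty first list means the job should not be submitted yet.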

## Submit a cloud job

```bash
export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py
```

Defaults: `l40sx1` ($1.80/hr), 3 h timeout, source mounted from
`hf://datasets/Pratyush-01/physix-live-src:/physix-live`.
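
For readers curious what a `HfApi.run_uv_job` submission might look like, here is a rough sketch using the defaults quoted above. It is not the contents of `submit.py`: the helper names `build_job_kwargs` and `submit` are hypothetical, the choice of `job_train.py` as the default script is an assumption, and the source mount is not modeled.

```python
def build_job_kwargs(secrets, script="job_train.py", flavor="l40sx1", timeout="3h"):
    """Collect keyword arguments for HfApi.run_uv_job (hypothetical helper).

    The default script is an assumption; the flavor and timeout mirror the
    README defaults.
    """
    if not secrets.get("HF_TOKEN"):
        raise ValueError("HF_TOKEN is required to submit a job")
    # Drop unset optional secrets (e.g. WANDB_API_KEY) instead of passing empty strings.
    forwarded = {name: value for name, value in secrets.items() if value}
    return {"script": script, "flavor": flavor, "timeout": timeout, "secrets": forwarded}


def submit(job_kwargs):
    """Submit the job; imported lazily so the helper above works without huggingface_hub."""
    from huggingface_hub import HfApi

    return HfApi().run_uv_job(**job_kwargs)
```

Keeping the argument assembly separate from the network call makes the defaults easy to inspect or override before anything is billed.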

## Run in Colab

Open [`physix_train_colab.ipynb`](physix_train_colab.ipynb) on a Colab
GPU runtime. The notebook installs the same dependency set as the cloud
job, fetches the source from the HF dataset, runs SFT then GRPO, and
plots loss + reward + per-component reward curves at the end.

## Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
|------|------|------|
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
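
The per-step costs above follow from the step duration and the $1.80/hr rate. A small sketch of that arithmetic, assuming cost scales linearly with wall-clock minutes (the actual billing granularity is not stated here) and using hypothetical helper names:

```python
HOURLY_RATE_USD = 1.80  # l40sx1 rate quoted in this README

# Approximate durations from the table above, in minutes.
STEP_MINUTES = {"SFT warm-start": 5, "GRPO": 40, "Push merged model": 1}


def step_cost(minutes, hourly_rate=HOURLY_RATE_USD):
    """Cost of one step, assuming linear per-minute billing (an assumption)."""
    return minutes / 60 * hourly_rate


def pipeline_cost(step_minutes, hourly_rate=HOURLY_RATE_USD):
    """Total cost across all steps."""
    return sum(step_cost(m, hourly_rate) for m in step_minutes.values())


if __name__ == "__main__":
    for name, minutes in STEP_MINUTES.items():
        print(f"{name}: ~${step_cost(minutes):.2f}")
    print(f"Total: ~${pipeline_cost(STEP_MINUTES):.2f}")
```

Because the listed times are approximate, the computed figures can differ from the table by a cent or so per step.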

W&B project: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live).

## W&B metrics to watch

| Metric | Expected trajectory |
|--------|---------------------|
| `train/loss` (GRPO) | **stays near zero by design**: it reflects only the KL term, not the policy gradient |
| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
| `train/kl` | rises slowly, stays below ~0.5 |
| `train/grad_norm` | bounded, typically 0.3–1.5 |
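
The expectations above can be turned into a quick end-of-run sanity check. The sketch below is illustrative only: the function name `out_of_range` is hypothetical, and treating the approximate end-of-run values from the table as hard bounds is an assumption, not something the training scripts enforce.

```python
# (lower, upper) bounds taken from the table; None means unbounded on that side.
EXPECTED_FINAL = {
    "train/reward": (1.5, None),
    "train/rewards/reward_match/mean": (0.3, None),
    "train/rewards/reward_format/mean": (0.9, None),
    "train/kl": (None, 0.5),
    "train/grad_norm": (0.3, 1.5),
}


def out_of_range(final_metrics, expected=EXPECTED_FINAL):
    """Return a message for each logged metric outside its expected range.

    Metrics missing from `final_metrics` are skipped rather than reported.
    """
    problems = []
    for name, (low, high) in expected.items():
        if name not in final_metrics:
            continue
        value = final_metrics[name]
        if low is not None and value < low:
            problems.append(f"{name}={value} is below {low}")
        if high is not None and value > high:
            problems.append(f"{name}={value} is above {high}")
    return problems
```

Passing a dict of final values, for example one exported from a W&B run summary, returns an empty list when everything falls within the ranges above.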