PhysiX RLVR — Cloud Training Launcher

This folder contains the scripts that launch SFT → GRPO training for the PhysiX OpenEnv on Hugging Face Jobs, plus a self-contained Colab notebook judges can re-run.

The launcher used to live in a separate physix-train/ repo and training Space (Dockerfile + train.sh). We migrated to HF Jobs because it queues jobs, costs nothing while idle, and reuses the upstream Unsloth image directly. The Docker artifacts have been removed and the launcher has moved into the env repo, so there is now one repo and one Space.

Files

| File | What it does |
| --- | --- |
| physix_train_colab.ipynb | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the 1.5b profile, A100 for 3b. |
| submit.py | Submit a job to HF Jobs via HfApi.run_uv_job (the CLI hangs intermittently on whoami; this path is reliable). |
| job_train.py | Multi-system training driver (6 in-distribution systems). Runs inside the HF Jobs container. PEP 723 inline deps. |
| job_train_single.py | Single-system variant (defaults to damped_spring): focused reward signal, easier-to-read curves. |
| sync-plots.sh | Pull committed loss/reward PNGs from the model repo into ../docs/plots/ so they ship with the env Space. |

Required secrets

| Secret | Source |
| --- | --- |
| HF_TOKEN | hf.co/settings/tokens (write) |
| WANDB_API_KEY | wandb.ai/authorize (optional; the Colab notebook can run with W&B disabled) |
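A preflight check along the following lines can catch a missing token before a job is queued. This is a minimal sketch, not code from submit.py: the function names are hypothetical, and disabling W&B through the `WANDB_MODE` environment variable is an assumption about how one might handle an absent key, not necessarily what the notebook does.

```python
import os
from typing import Mapping

REQUIRED_SECRETS = ("HF_TOKEN",)
OPTIONAL_SECRETS = ("WANDB_API_KEY",)


def check_secrets(env: Mapping[str, str]) -> dict:
    """Report which required and optional secrets are absent or empty."""
    return {
        "missing_required": [n for n in REQUIRED_SECRETS if not env.get(n)],
        "missing_optional": [n for n in OPTIONAL_SECRETS if not env.get(n)],
    }


def preflight(env: Mapping[str, str]) -> dict:
    """Return extra environment settings, or raise if a required secret is absent."""
    status = check_secrets(env)
    if status["missing_required"]:
        raise RuntimeError(f"missing required secrets: {status['missing_required']}")
    extra = {}
    if "WANDB_API_KEY" in status["missing_optional"]:
        # Assumption: turn W&B logging off via its WANDB_MODE variable.
        extra["WANDB_MODE"] = "disabled"
    return extra
```

Calling `preflight(os.environ)` at the top of a launcher would fail fast on a missing `HF_TOKEN` and return the settings needed to run without W&B.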

Submit a cloud job

export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py

Defaults: l40sx1 ($1.80/hr), 3 h timeout, source mounted from hf://datasets/Pratyush-01/physix-live-src:/physix-live.
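For orientation, a submission through `HfApi.run_uv_job` might look roughly like the sketch below. It is not the contents of submit.py: the helper names are hypothetical, the timeout is expressed in seconds, and the source mount and container image are omitted because this README does not show how submit.py configures them.

```python
import os

# Defaults quoted from this README.
DEFAULT_FLAVOR = "l40sx1"
DEFAULT_TIMEOUT_SECONDS = 3 * 60 * 60  # 3 h


def build_job_kwargs(script: str, env: dict) -> dict:
    """Assemble keyword arguments for HfApi.run_uv_job from environment values."""
    secrets = {"HF_TOKEN": env["HF_TOKEN"]}
    if env.get("WANDB_API_KEY"):
        # W&B is optional, so only forward the key when it is set.
        secrets["WANDB_API_KEY"] = env["WANDB_API_KEY"]
    return {
        "script": script,
        "flavor": DEFAULT_FLAVOR,
        "timeout": DEFAULT_TIMEOUT_SECONDS,
        "secrets": secrets,
    }


def submit(script: str = "job_train.py"):
    """Submit the training script as a UV job and return the job handle."""
    # Imported lazily; requires a huggingface_hub release that ships Jobs support.
    from huggingface_hub import HfApi

    api = HfApi(token=os.environ["HF_TOKEN"])
    return api.run_uv_job(**build_job_kwargs(script, dict(os.environ)))
```

Keeping the argument assembly in a pure function makes the defaults easy to inspect or override without touching the network.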

Run in Colab

Open physix_train_colab.ipynb on a Colab GPU runtime. The notebook installs the same dependency set as the cloud job, fetches the source from the HF dataset, runs SFT then GRPO, and plots loss + reward + per-component reward curves at the end.

Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
| --- | --- | --- |
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
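The costs follow from the $1.80/hr l40sx1 rate multiplied by the wall-clock time of each step. The sketch below reproduces that arithmetic; the step keys are illustrative labels. At exactly one minute the push step works out to $0.03, so the ~$0.04 listed above presumably reflects a step that runs slightly longer than a minute.

```python
HOURLY_RATE_USD = 1.80  # l40sx1 rate quoted in this README

# Approximate durations from the table, in minutes.
STEP_MINUTES = {
    "sft_warm_start": 5,
    "grpo": 40,
    "push_merged_model": 1,
}


def step_cost(minutes: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of running for `minutes` at `hourly_rate`, rounded to cents."""
    return round(hourly_rate * minutes / 60, 2)


costs = {step: step_cost(minutes) for step, minutes in STEP_MINUTES.items()}
total_minutes = sum(STEP_MINUTES.values())
total_cost = step_cost(total_minutes)
```

With these durations the whole pipeline takes about 46 minutes and costs roughly $1.38, well inside the 3 h default timeout.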

W&B project: pratyush01/physix-live.

W&B metrics to watch

| Metric | Expected trajectory |
| --- | --- |
| train/loss (GRPO) | stays near zero by design: the logged value reflects only the KL term, not the policy gradient |
| train/reward (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| train/rewards/reward_match/mean | climbs from ~0.05 to ~0.3–0.5 |
| train/rewards/reward_format/mean | climbs from ~0.5 to ~0.9+ |
| train/rewards/reward_match_dense/mean | climbs alongside reward_match |
| train/kl | rises slowly, stays below ~0.5 |
| train/grad_norm | bounded, typically 0.3–1.5 |
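These expectations can be turned into a quick sanity check on the last logged values of a run. The sketch below is one way to do that, with the lower ends of the ranges above treated as soft thresholds; the function name and the example values are made up for illustration and are not results from a real run.

```python
# Expected end-of-run ranges (low, high) transcribed from the table above.
# None means the bound is open on that side.
EXPECTED_FINAL = {
    "train/reward": (1.5, None),
    "train/rewards/reward_match/mean": (0.3, None),
    "train/rewards/reward_format/mean": (0.9, None),
    "train/kl": (None, 0.5),
    "train/grad_norm": (0.3, 1.5),
}


def out_of_range(final_metrics: dict) -> list:
    """Return the names of logged metrics that fall outside their expected range."""
    flagged = []
    for name, (low, high) in EXPECTED_FINAL.items():
        value = final_metrics.get(name)
        if value is None:
            continue  # metric not logged; nothing to check
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(name)
    return flagged


# Made-up values for illustration only.
example = {"train/reward": 1.7, "train/kl": 0.62, "train/grad_norm": 0.8}
flagged = out_of_range(example)
```

A flagged metric is a prompt to look at the curve, not proof of a failed run, since the ranges above are approximate.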