PhysiX RLVR — Cloud Training Launcher

This folder contains the scripts that launch SFT → GRPO training for the PhysiX OpenEnv on Hugging Face Jobs, plus a self-contained Colab notebook that judges can re-run.

Files

| File | What it does |
|---|---|
| physix_train_colab.ipynb | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the 1.5b profile, L4/A100 for 3b. |
| submit.py | Submit a job to HF Jobs via HfApi.run_uv_job. |
| job_train.py | Training driver across the 3 trained systems. Runs inside the HF Jobs container. PEP 723 inline deps. |
| job_train_single.py | Single-system variant (defaults to damped_spring): focused reward signal, easier-to-read curves. |
| sync-plots.sh | Pull committed loss/reward PNGs from the model repo into ../docs/plots/ so they ship with the env Space. |

Required secrets

| Secret | Source |
|---|---|
| HF_TOKEN | hf.co/settings/tokens (write) |
| WANDB_API_KEY | wandb.ai/authorize (optional; the Colab notebook can run with W&B disabled) |
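
A minimal preflight sketch of how the required/optional split above could be handled before training starts. The helper name `plan_logging` is hypothetical and not part of this repo; `"wandb"` and `"none"` are values accepted by the `report_to` option of TRL/transformers training configs, but whether the PhysiX scripts wire them up this way is an assumption.

```python
import os


def plan_logging(environ=None):
    """Fail fast on a missing HF_TOKEN; pick a report_to value for W&B.

    Hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    if not environ.get("HF_TOKEN"):
        raise RuntimeError(
            "HF_TOKEN is not set; create a write token at hf.co/settings/tokens"
        )
    # WANDB_API_KEY is optional: without it, logging to W&B is skipped.
    return "wandb" if environ.get("WANDB_API_KEY") else "none"


# Example with an explicit mapping instead of the real environment.
print(plan_logging({"HF_TOKEN": "hf_dummy"}))
```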

Submit a cloud job

export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py

Defaults: l40sx1 hardware flavor, 3 h timeout. At job start, _stage_physix_live() fetches the source directly from this Hugging Face Space repo.
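
For orientation, here is a sketch of what a submission through `HfApi.run_uv_job` with these defaults might look like. It is not the actual contents of submit.py: the helper names `build_job_request` and `submit` are hypothetical, and forwarding both tokens as job secrets is an assumption based on the "Required secrets" section.

```python
import os


def build_job_request(script="job_train.py", flavor="l40sx1",
                      timeout="3h", environ=None):
    """Assemble keyword arguments for HfApi.run_uv_job (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    secrets = {"HF_TOKEN": environ["HF_TOKEN"]}
    # W&B is optional, so only forward the key when it is present.
    if environ.get("WANDB_API_KEY"):
        secrets["WANDB_API_KEY"] = environ["WANDB_API_KEY"]
    return {
        "script": script,
        "flavor": flavor,
        "timeout": timeout,
        "secrets": secrets,
    }


def submit(request):
    """Send the request to HF Jobs; requires a recent huggingface_hub."""
    from huggingface_hub import HfApi  # imported lazily; needs network access

    api = HfApi(token=request["secrets"]["HF_TOKEN"])
    return api.run_uv_job(**request)


# Build (but do not send) a request using a placeholder token.
request = build_job_request(environ={"HF_TOKEN": "hf_placeholder"})
print(request["flavor"], request["timeout"])
```

Keeping request construction separate from the network call makes it possible to inspect the flavor, timeout, and secrets before anything is billed.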

Run in Colab

Open physix_train_colab.ipynb on a Colab GPU runtime. The notebook installs the same dependency set as the cloud job, fetches the source from this Hugging Face Space, runs SFT then GRPO, and plots loss + reward curves at the end.

Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
|---|---|---|
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
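
The rows above can be totalled with a few lines of arithmetic. This sketch uses only the approximate figures from the table and the 3 h default timeout; the implied hourly rates are derived from those rounded figures, not taken from an official price list.

```python
# (step, minutes, cost in USD) copied from the approximate table values.
STEPS = [
    ("SFT warm-start", 5, 0.15),
    ("GRPO", 40, 1.20),
    ("Push merged model", 1, 0.04),
]
TIMEOUT_MINUTES = 3 * 60  # default job timeout of 3 h

total_minutes = sum(minutes for _, minutes, _ in STEPS)
total_cost = round(sum(cost for _, _, cost in STEPS), 2)
headroom_minutes = TIMEOUT_MINUTES - total_minutes

for name, minutes, cost in STEPS:
    # Hourly rate implied by each row; short rows are dominated by rounding.
    print(f"{name}: ~${cost / minutes * 60:.2f}/h implied")

print(f"Total: ~{total_minutes} min, ~${total_cost:.2f}")
print(f"Headroom under the timeout: ~{headroom_minutes} min")
```

Under these figures a full run takes roughly 46 minutes and costs about $1.39, which leaves well over two hours of slack under the default timeout.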

W&B project: pratyush01/physix-live.

W&B metrics to watch

| Metric | Expected trajectory |
|---|---|
| train/loss (GRPO) | stays near zero by design: the logged value reflects only the KL term, not the policy-gradient signal |
| train/reward (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| train/rewards/reward_match/mean | climbs from ~0.05 to ~0.3–0.5 |
| train/rewards/reward_format/mean | climbs from ~0.5 to ~0.9+ |
| train/rewards/reward_match_dense/mean | climbs alongside reward_match |
| train/kl | rises slowly, stays below ~0.5 |
| train/grad_norm | bounded, typically 0.3–1.5 |
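
Why train/loss sits near zero can be seen in a simplified numerical sketch. It assumes the probability ratio between the current and sampling policy is exactly 1 (a single update per batch of generations) and that completions are weighted equally; the rewards, `beta`, and `kl` values below are made up for illustration. Because advantages are normalized within each group, they sum to zero, so the policy term contributes nothing to the loss value even though its gradient is nonzero.

```python
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of generations (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_loss_value(advantages, ratios, beta, kl):
    """Simplified loss value: negated mean surrogate plus the KL penalty."""
    surrogate = statistics.fmean([r * a for r, a in zip(ratios, advantages)])
    return -surrogate + beta * kl


# Hypothetical rewards for one prompt with 4 generations.
rewards = [0.2, 1.5, 0.7, 1.0]
advantages = group_advantages(rewards)
ratios = [1.0] * len(rewards)  # on-policy: ratio is 1 for every completion

beta, kl = 0.04, 0.1  # illustrative values only
loss = grpo_loss_value(advantages, ratios, beta, kl)

print(f"sum of advantages: {sum(advantages):.2e}")
print(f"loss value: {loss:.4f} (beta * kl = {beta * kl:.4f})")

# The gradient with respect to each ratio is -A_i / G, which is nonzero,
# so learning still happens while the logged loss stays tiny.
grads = [-a / len(advantages) for a in advantages]
```

This is why the reward and KL curves, rather than train/loss, are the ones to watch for progress.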