PhysiX RLVR — Cloud Training Launcher

This folder contains the scripts that launch SFT → GRPO training for the PhysiX OpenEnv on Hugging Face Jobs, plus a self-contained Colab notebook judges can re-run.

This launcher used to live in a separate physix-train/ repo and training Space (Dockerfile + train.sh). We migrated to HF Jobs because it queues runs, doesn't bill for idle time, and reuses the upstream Unsloth image directly. The Docker artifacts have been removed and the launcher moved into the env repo, so there is now one repo and one Space.

Files

| File | What it does |
| --- | --- |
| `physix_train_colab.ipynb` | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the `1.5b` profile, A100 for `3b`. |
| `submit.py` | Submits a job to HF Jobs via `HfApi.run_uv_job` (the CLI hangs intermittently on whoami; this path is reliable). |
| `job_train.py` | Multi-system training driver (6 in-distribution systems). Runs inside the HF Jobs container. PEP 723 inline deps. |
| `job_train_single.py` | Single-system variant (defaults to `damped_spring`): focused reward signal, easier-to-read curves. |
| `sync-plots.sh` | Pulls committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |
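The "PEP 723 inline deps" note means the driver carries its own dependency list in a comment header, which is what lets a uv-based runner execute it as a single file. A minimal sketch of that layout follows; the Python version, the package list, and the `wandb_enabled` helper are illustrative assumptions, not the contents of the real `job_train.py`.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "trl",
#     "unsloth",
#     "wandb",
# ]
# ///
"""Hypothetical skeleton of a PEP 723 single-file training driver."""
import os


def wandb_enabled() -> bool:
    # Assumption for this sketch: a missing WANDB_API_KEY means "run without W&B".
    return bool(os.environ.get("WANDB_API_KEY"))


print("W&B logging enabled:", wandb_enabled())
```

With uv installed, a file laid out like this can be run locally with `uv run <file>.py`, which resolves the declared dependencies before executing the script.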

Required secrets

| Secret | Source |
| --- | --- |
| `HF_TOKEN` | hf.co/settings/tokens (write) |
| `WANDB_API_KEY` | wandb.ai/authorize (optional; the Colab notebook can run with W&B disabled) |

Submit a cloud job

export HF_TOKEN=hf_...
export WANDB_API_KEY=wandb_v1_...
python submit.py

Defaults: l40sx1 ($1.80/hr), 3 h timeout, source mounted from hf://datasets/Pratyush-01/physix-live-src:/physix-live.
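For orientation, here is a rough sketch of how a launcher could pass these defaults to `HfApi.run_uv_job`. It assumes a recent `huggingface_hub` release whose `run_uv_job` accepts the `flavor`, `timeout`, and `secrets` keyword arguments; `build_job_kwargs` and `submit` are hypothetical names, and the dataset mount is left out because this sketch does not know how the real `submit.py` configures it.

```python
import os


def build_job_kwargs(script="job_train.py", flavor="l40sx1", timeout="3h"):
    """Collect keyword arguments for HfApi.run_uv_job (hypothetical helper)."""
    secrets = {"HF_TOKEN": os.environ["HF_TOKEN"]}
    # W&B is optional: forward the key only when it is set.
    if os.environ.get("WANDB_API_KEY"):
        secrets["WANDB_API_KEY"] = os.environ["WANDB_API_KEY"]
    return {"script": script, "flavor": flavor, "timeout": timeout, "secrets": secrets}


def submit(**overrides):
    """Queue the job on HF Jobs and return whatever run_uv_job returns."""
    # Imported lazily so build_job_kwargs can be inspected without the library.
    from huggingface_hub import HfApi

    kwargs = build_job_kwargs(**overrides)
    api = HfApi(token=kwargs["secrets"]["HF_TOKEN"])
    return api.run_uv_job(**kwargs)
```

Calling `submit()` with `HF_TOKEN` exported would queue the multi-system driver; `submit(script="job_train_single.py")` would queue the single-system variant instead.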

Run in Colab

Open physix_train_colab.ipynb on a Colab GPU runtime. The notebook installs the same dependency set as the cloud job, fetches the source from the HF dataset, runs SFT then GRPO, and plots loss + reward + per-component reward curves at the end.

Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
| --- | --- | --- |
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
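The per-step costs follow from the $1.80/hr `l40sx1` rate applied to each duration. A small sketch of that arithmetic, using the approximate minutes from the table (the helper name `step_cost` is made up for illustration):

```python
HOURLY_RATE_USD = 1.80  # l40sx1 rate quoted in this README

# Approximate step durations in minutes, taken from the table above.
STEP_MINUTES = {
    "SFT warm-start": 5,
    "GRPO": 40,
    "Push merged model": 1,
}


def step_cost(minutes, hourly_rate=HOURLY_RATE_USD):
    """Cost in USD of running for `minutes` at a per-hour rate."""
    return minutes / 60 * hourly_rate


total_minutes = sum(STEP_MINUTES.values())
total_cost = sum(step_cost(m) for m in STEP_MINUTES.values())

for name, minutes in STEP_MINUTES.items():
    print(f"{name}: ~{minutes} min, ~${step_cost(minutes):.2f}")
print(f"Total: ~{total_minutes} min, ~${total_cost:.2f}")
```

At exactly one minute the push step works out to about $0.03, slightly under the table's rounded figure, since the listed durations are approximate. The full pipeline comes to roughly 46 minutes and about $1.40, well inside the 3 h timeout.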

W&B project: pratyush01/physix-live.

W&B metrics to watch

| Metric | Expected trajectory |
| --- | --- |
| `train/loss` (GRPO) | stays near zero by design: only the KL term, not the policy gradient |
| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
| `train/kl` | rises slowly, stays below ~0.5 |
| `train/grad_norm` | bounded, typically 0.3–1.5 |
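These expectations can be turned into a quick end-of-run sanity check. The sketch below transcribes the table's rough bounds into a lookup and flags any final metric that falls outside them; the `EXPECTED` mapping and the `out_of_range` helper are hypothetical, and the bounds are the approximate values above, not hard pass/fail criteria.

```python
# Rough end-of-run expectations, transcribed from the table above.
# Each entry is (low, high); None means unbounded on that side.
EXPECTED = {
    "train/reward": (1.5, None),
    "train/rewards/reward_match/mean": (0.3, None),
    "train/rewards/reward_format/mean": (0.9, None),
    "train/kl": (None, 0.5),
    "train/grad_norm": (0.3, 1.5),
}


def out_of_range(metrics):
    """Return {name: value} for logged metrics outside their expected bounds."""
    flagged = {}
    for name, (low, high) in EXPECTED.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not logged; nothing to check
        if (low is not None and value < low) or (high is not None and value > high):
            flagged[name] = value
    return flagged


# Example with made-up final values: only the KL exceeds its bound.
final = {"train/reward": 1.7, "train/kl": 0.62, "train/grad_norm": 0.8}
print(out_of_range(final))  # {'train/kl': 0.62}
```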