PhysiX RLVR — Cloud Training Launcher

This folder contains the scripts that launch SFT → GRPO training for the PhysiX OpenEnv on Hugging Face Jobs, plus a self-contained Colab notebook judges can re-run.

Files

| File | What it does |
| --- | --- |
| `physix_train_colab.ipynb` | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the `1.5b` profile, L4/A100 for `3b`. |
| `submit.py` | Submits a job to HF Jobs via `HfApi.run_uv_job`. |
| `job_train.py` | Training driver across the 3 trained systems. Runs inside the HF Jobs container. PEP 723 inline deps. |
| `job_train_single.py` | Single-system variant (defaults to `damped_spring`): a focused reward signal and easier-to-read curves. |
| `sync-plots.sh` | Pulls committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |
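
`job_train.py` declares its dependencies inline using PEP 723. As a rough illustration of what such a header looks like and how a runner can locate it, the sketch below embeds a sample script and extracts its metadata block with the reference regular expression from the PEP. The dependency names and Python version are placeholders, not the actual contents of `job_train.py`.

```python
import re
from typing import Optional

# Hypothetical script text; the real job_train.py header may list different deps.
SAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "trl",
#   "wandb",
# ]
# ///
print("training would start here")
'''

# Reference regex from PEP 723 for locating inline metadata blocks.
PEP723_REGEX = (
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s"
    r"(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(script: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None if absent."""
    for match in re.finditer(PEP723_REGEX, script):
        if match.group("type") != "script":
            continue
        # Strip the leading "# " (or bare "#") from every content line.
        return "".join(
            line[2:] if line.startswith("# ") else line[1:]
            for line in match.group("content").splitlines(keepends=True)
        )
    return None


print(read_script_metadata(SAMPLE_SCRIPT))
```

The returned text is plain TOML, so a runner can parse it with any TOML library to decide what to install before executing the script.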

Required secrets

| Secret | Source |
| --- | --- |
| `HF_TOKEN` | hf.co/settings/tokens (write) |
| `WANDB_API_KEY` | wandb.ai/authorize (optional; the Colab notebook can run with W&B disabled) |
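
A small preflight check can catch a missing secret before any job is submitted. This is a minimal sketch, not part of the repository: the function name `check_secrets` is hypothetical, and treating `WANDB_API_KEY` as optional follows the table above.

```python
from typing import List, Mapping, Tuple

REQUIRED_SECRETS = ("HF_TOKEN",)       # write-scoped token, per the table above
OPTIONAL_SECRETS = ("WANDB_API_KEY",)  # listed as optional above


def check_secrets(env: Mapping[str, str]) -> Tuple[List[str], List[str]]:
    """Return (missing required, missing optional) secret names.

    A secret counts as missing when it is unset or set to an empty string.
    """
    missing_required = [name for name in REQUIRED_SECRETS if not env.get(name)]
    missing_optional = [name for name in OPTIONAL_SECRETS if not env.get(name)]
    return missing_required, missing_optional
```

Calling `check_secrets(os.environ)` at the top of a launcher lets it fail fast on the first list and merely warn on the second.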

Submit a cloud job

# Write-scoped token from hf.co/settings/tokens
export HF_TOKEN=hf_...
# API key from wandb.ai/authorize
export WANDB_API_KEY=wandb_v1_...
python submit.py

Defaults: `l40sx1`, 3 h timeout. At job start, `_stage_physix_live()` fetches the source directly from this Hugging Face Space repo.
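
For orientation, a submission with these defaults could be assembled roughly as follows. This is a sketch under assumptions, not the contents of `submit.py`: the helper names, the choice of `job_train.py` as the submitted script, and the decision to forward `WANDB_API_KEY` only when set are all illustrative. `HfApi.run_uv_job` and its `flavor`, `timeout`, and `secrets` parameters come from `huggingface_hub`.

```python
from typing import Any, Dict, Mapping

DEFAULT_FLAVOR = "l40sx1"             # default hardware named above
DEFAULT_TIMEOUT_SECONDS = 3 * 60 * 60  # 3 h default timeout, expressed in seconds


def build_job_kwargs(
    env: Mapping[str, str],
    script: str = "job_train.py",      # assumption: which script gets submitted
    flavor: str = DEFAULT_FLAVOR,
    timeout: int = DEFAULT_TIMEOUT_SECONDS,
) -> Dict[str, Any]:
    """Assemble keyword arguments for HfApi.run_uv_job."""
    secrets = {"HF_TOKEN": env["HF_TOKEN"]}
    if env.get("WANDB_API_KEY"):       # forward W&B credentials only when present
        secrets["WANDB_API_KEY"] = env["WANDB_API_KEY"]
    return {"script": script, "flavor": flavor, "timeout": timeout, "secrets": secrets}


def submit(env: Mapping[str, str]):
    """Submit the job; needs network access and a valid write token."""
    from huggingface_hub import HfApi  # imported lazily so the builder works offline

    api = HfApi(token=env["HF_TOKEN"])
    return api.run_uv_job(**build_job_kwargs(env))
```

Keeping argument assembly separate from the network call makes the defaults easy to inspect or override without launching a paid job.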

Run in Colab

Open `physix_train_colab.ipynb` on a Colab GPU runtime. The notebook installs the same dependency set as the cloud job, fetches the source from this Hugging Face Space, runs SFT then GRPO, and plots loss and reward curves at the end.
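
Fetching the source from a Space can be done with `snapshot_download` from `huggingface_hub`. The sketch below shows one way to do it; the function name `stage_space_source` and the local directory are hypothetical, and the notebook's own fetch step may work differently.

```python
from pathlib import Path

SPACE_REPO_ID = "Pratyush-01/physix-live"  # owner/name of this Space


def stage_space_source(local_dir: str = "physix-live") -> Path:
    """Download a snapshot of the Space repo and return its local path.

    Requires network access; private Spaces also need HF_TOKEN in the environment.
    """
    from huggingface_hub import snapshot_download  # lazy import: only needed here

    path = snapshot_download(
        repo_id=SPACE_REPO_ID,
        repo_type="space",   # Spaces are a distinct repo type from models and datasets
        local_dir=local_dir,
    )
    return Path(path)
```

After staging, the returned directory can be added to `sys.path` or installed so the training code can import the environment.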

Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
| --- | --- | --- |
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
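
Summing the rows gives a feel for the whole run. The snippet below only redoes the arithmetic on the approximate figures in the table; it does not query any pricing API, and the per-hour rates are implied by the table rather than quoted from a price list.

```python
# (step, minutes, dollars) copied from the table above; all values are approximate.
STEPS = [
    ("SFT warm-start", 5, 0.15),
    ("GRPO", 40, 1.20),
    ("Push merged model", 1, 0.04),
]

total_minutes = sum(minutes for _, minutes, _ in STEPS)
total_cost = round(sum(cost for _, _, cost in STEPS), 2)

# Hourly rate implied by each row: cost per minute scaled to 60 minutes.
implied_hourly = {
    name: round(cost / minutes * 60, 2) for name, minutes, cost in STEPS
}

print(f"total: ~{total_minutes} min, ~${total_cost:.2f}")
for name, rate in implied_hourly.items():
    print(f"{name}: ~${rate:.2f}/h implied")
```

The total comes to roughly 46 minutes and $1.39, well inside the 3 h default timeout. The two training rows imply about $1.80 per hour; the higher figure for the one-minute push step is most likely a rounding effect on a very small amount.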

W&B project: pratyush01/physix-live.

W&B metrics to watch

| Metric | Expected trajectory |
| --- | --- |
| `train/loss` (GRPO) | stays near zero by design; the logged value reflects only the KL term, not the policy-gradient signal |
| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
| `train/kl` | rises slowly, stays below ~0.5 |
| `train/grad_norm` | bounded, typically 0.3–1.5 |
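
The expected ranges above can be turned into a quick end-of-run sanity check. This is a minimal sketch: the thresholds are read off the approximate values in the table, the function name `out_of_band` is hypothetical, and the input is assumed to be a plain dict of final metric values (for example, copied from the W&B run summary).

```python
from typing import Dict, List, Mapping, Optional, Tuple

# (low, high) bounds for end-of-run values; None means no bound on that side.
EXPECTED: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "train/reward": (1.5, None),
    "train/rewards/reward_match/mean": (0.3, None),
    "train/rewards/reward_format/mean": (0.9, None),
    "train/kl": (None, 0.5),
    "train/grad_norm": (0.3, 1.5),
}


def out_of_band(metrics: Mapping[str, float]) -> List[str]:
    """Return the names of reported metrics that fall outside their expected band.

    Metrics missing from `metrics` are skipped rather than flagged.
    """
    flagged = []
    for name, (low, high) in EXPECTED.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if (low is not None and value < low) or (high is not None and value > high):
            flagged.append(name)
    return flagged


print(out_of_band({"train/reward": 1.8, "train/kl": 0.7}))
```

Since the table values are approximate, a flagged metric is a prompt to look at the curve, not proof that the run failed.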