	# PhysiX RLVR — Cloud Training Launcher

	This folder contains the scripts that launch SFT → GRPO training for the
	[PhysiX OpenEnv](../) on Hugging Face Jobs, plus a self-contained
	Colab notebook judges can re-run.

> This used to be a separate `physix-train/` repo and training Space
> (Dockerfile + `train.sh`). We migrated to HF Jobs because it queues
> runs, does not bill for idle time, and reuses the upstream Unsloth image
> directly. The Docker artifacts have been removed and the launcher
> moved into the env repo, so there is now one repo and one Space.

	## Files

| File | What it does |
|------|--------------|
| [`physix_train_colab.ipynb`](physix_train_colab.ipynb) | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. T4/L4 for the `1.5b` profile, A100 for `3b`. |
| [`submit.py`](submit.py) | Submits a job to HF Jobs via `HfApi.run_uv_job` (the CLI hangs intermittently on whoami; this path is reliable). |
| [`job_train.py`](job_train.py) | Multi-system training driver (6 in-distribution systems). Runs inside the HF Jobs container. PEP 723 inline deps. |
| [`job_train_single.py`](job_train_single.py) | Single-system variant (defaults to `damped_spring`): a focused reward signal and easier-to-read curves. |
| [`sync-plots.sh`](sync-plots.sh) | Pulls committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |
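For readers unfamiliar with PEP 723, the "inline deps" mentioned for `job_train.py` are a comment block at the top of the script that tools such as `uv` read before running it. The sketch below shows the general shape of such a block and how it can be extracted, using the regular expression from the PEP's reference implementation. The dependency list and Python version are placeholders for illustration, not the actual contents of `job_train.py`.

```python
import re
from typing import Optional

# Hypothetical script header for illustration only; the real dependency
# list lives in job_train.py and may differ.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "trl",
#     "unsloth",
# ]
# ///

print("training driver body goes here")
'''

# Regular expression taken from the PEP 723 reference implementation.
PEP723_BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_inline_metadata(source: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None if absent."""
    for match in PEP723_BLOCK.finditer(source):
        if match.group("type") != "script":
            continue
        lines = match.group("content").splitlines(keepends=True)
        # Strip the leading "# " (or a bare "#") from every comment line.
        return "".join(
            line[2:] if line.startswith("# ") else line[1:] for line in lines
        )
    return None


if __name__ == "__main__":
    print(read_inline_metadata(EXAMPLE_SCRIPT))
```

The returned text is plain TOML, so it can be passed to a TOML parser if the individual dependency strings are needed.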

	## Required secrets

| Secret | Source |
|--------|--------|
| `HF_TOKEN` | [hf.co/settings/tokens](https://huggingface.co/settings/tokens) (write) |
| `WANDB_API_KEY` | [wandb.ai/authorize](https://wandb.ai/authorize) (optional; the Colab notebook can run with W&B disabled) |
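A quick preflight check can catch a missing secret before a job is queued. The sketch below is not part of the repo; `check_secrets` is a hypothetical helper that treats `HF_TOKEN` as required and `WANDB_API_KEY` as optional, matching the table above.

```python
import os
from typing import List, Mapping, Tuple

REQUIRED = ("HF_TOKEN",)
OPTIONAL = ("WANDB_API_KEY",)


def check_secrets(env: Mapping[str, str] = os.environ) -> Tuple[List[str], List[str]]:
    """Return (missing required names, missing optional names)."""
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_secrets()
    if required:
        print("Missing required secrets:", ", ".join(required))
    if optional:
        print("Missing optional secrets (W&B logging may be off):", ", ".join(optional))
    if not required and not optional:
        print("All secrets are set.")
```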

	## Submit a cloud job

	```bash
	export HF_TOKEN=hf_...
	export WANDB_API_KEY=wandb_v1_...
	python submit.py
	```

Defaults: `l40sx1` hardware flavor ($1.80/hr), a 3 h timeout, and source mounted from
`hf://datasets/Pratyush-01/physix-live-src:/physix-live`.
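For orientation, here is a minimal sketch of what a submission through `HfApi.run_uv_job` might look like with the defaults listed above. It is not the contents of `submit.py`: the choice of `job_train.py` as the script, the helper names, and the exact keyword arguments (`script`, `flavor`, `timeout`, `secrets`) are assumptions based on the `huggingface_hub` Jobs API and should be checked against the installed version. Mounting the source dataset is deliberately left out.

```python
import os
from typing import Any, Dict, Mapping


def build_job_kwargs(
    script: str = "job_train.py",  # assumption: which driver gets submitted
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Any]:
    """Collect the keyword arguments for a job submission (hypothetical helper)."""
    secrets = {"HF_TOKEN": env["HF_TOKEN"]}
    # W&B is optional, so only forward the key when it is set.
    if env.get("WANDB_API_KEY"):
        secrets["WANDB_API_KEY"] = env["WANDB_API_KEY"]
    return {
        "script": script,
        "flavor": "l40sx1",  # default hardware from this README
        "timeout": "3h",     # default timeout from this README
        "secrets": secrets,
    }


def submit(kwargs: Dict[str, Any]) -> Any:
    """Submit the job. Imported lazily so the builder works without the package."""
    from huggingface_hub import HfApi

    api = HfApi(token=kwargs["secrets"]["HF_TOKEN"])
    return api.run_uv_job(**kwargs)


if __name__ == "__main__":
    if os.environ.get("HF_TOKEN"):
        print(submit(build_job_kwargs()))
    else:
        print("Set HF_TOKEN before submitting.")
```

Keeping the argument construction separate from the network call makes it possible to inspect or log the job configuration without queuing anything.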

	## Run in Colab

	Open [`physix_train_colab.ipynb`](physix_train_colab.ipynb) on a Colab
	GPU runtime. The notebook installs the same dependency set as the cloud
	job, fetches the source from the HF dataset, runs SFT then GRPO, and
	plots loss + reward + per-component reward curves at the end.

	## Pipeline cost (l40sx1, 3B profile)

| Step | Time | Cost |
|------|------|------|
| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
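The costs above follow from the $1.80/hr `l40sx1` rate multiplied by each step's wall-clock time. The sketch below reproduces that arithmetic; the step names are labels chosen here for illustration. Because the listed times are approximate, small differences from the table are expected (exactly one minute works out to about $0.03, slightly under the ~$0.04 listed for the push step).

```python
from typing import Dict

HOURLY_RATE_USD = 1.80  # l40sx1 rate quoted in this README

# Approximate step durations in minutes, taken from the table above.
STEP_MINUTES: Dict[str, float] = {
    "sft_warm_start": 5,
    "grpo": 40,
    "push_merged_model": 1,
}


def step_cost(minutes: float, rate: float = HOURLY_RATE_USD) -> float:
    """Cost in USD of running for `minutes` at `rate` dollars per hour."""
    return minutes / 60.0 * rate


def pipeline_cost(steps: Dict[str, float] = STEP_MINUTES) -> float:
    """Total cost in USD across all steps."""
    return sum(step_cost(m) for m in steps.values())


if __name__ == "__main__":
    for name, minutes in STEP_MINUTES.items():
        print(f"{name}: ~${step_cost(minutes):.2f}")
    print(f"total: ~${pipeline_cost():.2f}")  # prints "total: ~$1.38"
```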

	W&B project: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live).

	## W&B metrics to watch

| Metric | Expected trajectory |
|--------|---------------------|
| `train/loss` (GRPO) | stays near zero by design: it reflects only the KL term, not the policy gradient |
| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
| `train/kl` | rises slowly, stays below ~0.5 |
| `train/grad_norm` | bounded, typically 0.3–1.5 |
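The ranges above can be turned into a simple sanity check on the latest logged values. The sketch below is a hypothetical helper, not part of the training scripts. It takes a plain dict of metric names to values (however you export them from W&B) and uses the approximate bounds from the table as thresholds, so treat any warning as a prompt to look at the curves rather than a hard failure.

```python
from typing import Dict, List

# Approximate bounds taken from the table above.
KL_MAX = 0.5
GRAD_NORM_RANGE = (0.3, 1.5)
FORMAT_REWARD_START = 0.5


def check_metrics(latest: Dict[str, float]) -> List[str]:
    """Return human-readable warnings for metrics outside the expected ranges."""
    warnings: List[str] = []

    kl = latest.get("train/kl")
    if kl is not None and kl >= KL_MAX:
        warnings.append(f"train/kl={kl:.3f} is at or above ~{KL_MAX}")

    grad_norm = latest.get("train/grad_norm")
    low, high = GRAD_NORM_RANGE
    if grad_norm is not None and not (low <= grad_norm <= high):
        warnings.append(f"train/grad_norm={grad_norm:.3f} is outside {low}-{high}")

    fmt = latest.get("train/rewards/reward_format/mean")
    if fmt is not None and fmt < FORMAT_REWARD_START:
        warnings.append(
            f"reward_format mean={fmt:.3f} is below its ~{FORMAT_REWARD_START} starting point"
        )

    return warnings


if __name__ == "__main__":
    sample = {"train/kl": 0.12, "train/grad_norm": 0.8}
    print(check_metrics(sample) or "all tracked metrics within expected ranges")
```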
