	# PhysiX RLVR — Cloud Training Launcher

	This folder contains the scripts that launch SFT → GRPO training for the
	[PhysiX OpenEnv](../) on Hugging Face Jobs, plus a self-contained
	Colab notebook that judges can re-run.

	## Files

	| File | What it does |
	|------|--------------|
	| [`physix_train_colab.ipynb`](physix_train_colab.ipynb) | End-to-end SFT → GRPO in one notebook. Built on OpenEnv + Unsloth + TRL. Use a T4/L4 GPU for the `1.5b` profile and an L4/A100 for `3b`. |
	| [`submit.py`](submit.py) | Submits a job to HF Jobs via `HfApi.run_uv_job`. |
	| [`job_train.py`](job_train.py) | Training driver across the 3 trained systems. Runs inside the HF Jobs container. Dependencies are declared inline (PEP 723). |
	| [`job_train_single.py`](job_train_single.py) | Single-system variant (defaults to `damped_spring`): focused reward signal and easier-to-read curves. |
	| [`sync-plots.sh`](sync-plots.sh) | Pulls committed loss/reward PNGs from the model repo into `../docs/plots/` so they ship with the env Space. |
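
For readers unfamiliar with PEP 723, the sketch below shows what an inline dependency header looks like and how a tool can read it. The regex follows the PEP 723 reference implementation. The example header is hypothetical: the package names and the Python version are assumptions, not the actual contents of `job_train.py`.

```python
import re

# Regex from the PEP 723 reference implementation.
PEP723 = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# Hypothetical header; the real dependency list lives in job_train.py.
EXAMPLE_SCRIPT = '''\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "trl",
#     "unsloth",
# ]
# ///
print("training would start here")
'''


def inline_metadata(script: str) -> str:
    """Return the TOML text of the `script` metadata block, or '' if absent."""
    for match in PEP723.finditer(script):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or bare "#") from each comment line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return ""


print(inline_metadata(EXAMPLE_SCRIPT))
```

A runner such as `uv` reads this block and installs the listed packages before executing the script, which is why the job needs no separate requirements file.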

	## Required secrets

	| Secret | Source |
	|--------|--------|
	| `HF_TOKEN` | [hf.co/settings/tokens](https://huggingface.co/settings/tokens) (write access) |
	| `WANDB_API_KEY` | [wandb.ai/authorize](https://wandb.ai/authorize) (optional; the Colab notebook can run with W&B disabled) |
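
A small preflight check can catch a missing token before a job is submitted. This is a minimal sketch, not part of the repo: `check_secrets` is a hypothetical helper, and it treats `WANDB_API_KEY` as optional, as the table above does.

```python
from typing import Mapping

REQUIRED = ("HF_TOKEN",)
OPTIONAL = ("WANDB_API_KEY",)


def check_secrets(env: Mapping[str, str]) -> list[str]:
    """Return the optional secrets that are unset; raise if a required one is."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required secret(s): {', '.join(missing)}")
    return [name for name in OPTIONAL if not env.get(name)]


# Example with an explicit mapping; pass os.environ in real use.
print(check_secrets({"HF_TOKEN": "hf_example"}))  # ['WANDB_API_KEY']
```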

	## Submit a cloud job

	```bash
	export HF_TOKEN=hf_...
	export WANDB_API_KEY=wandb_v1_...
	python submit.py
	```

	Defaults: `l40sx1` hardware flavor, 3 h timeout. At job start,
	`_stage_physix_live()` fetches the source directly from this Hugging Face Space repo.
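
For orientation, a submission along these lines could be written as follows. This is a sketch under stated assumptions, not the contents of `submit.py`: `build_job_kwargs` and `submit` are hypothetical names, and the choice of `job_train.py` as the submitted script is assumed. The `flavor` and `timeout` values are the defaults stated above.

```python
import os
from typing import Any, Mapping


def build_job_kwargs(env: Mapping[str, str], script: str = "job_train.py") -> dict[str, Any]:
    """Assemble keyword arguments for HfApi.run_uv_job (illustrative helper)."""
    secrets = {"HF_TOKEN": env["HF_TOKEN"]}
    if env.get("WANDB_API_KEY"):  # optional secret, forwarded only when set
        secrets["WANDB_API_KEY"] = env["WANDB_API_KEY"]
    return {
        "script": script,
        "flavor": "l40sx1",  # default hardware flavor
        "timeout": "3h",     # default timeout
        "secrets": secrets,
    }


def submit(env: Mapping[str, str] = os.environ):
    """Submit the job; needs huggingface_hub and network access."""
    # Imported lazily so build_job_kwargs can be inspected without the library.
    from huggingface_hub import HfApi

    return HfApi(token=env["HF_TOKEN"]).run_uv_job(**build_job_kwargs(env))
```

Keeping the argument assembly separate from the network call makes the defaults easy to inspect or override, for example `build_job_kwargs(os.environ, script="job_train_single.py")`.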

	## Run in Colab

	Open [`physix_train_colab.ipynb`](physix_train_colab.ipynb) on a Colab
	GPU runtime. The notebook installs the same dependency set as the cloud
	job, fetches the source from this Hugging Face Space, runs SFT then
	GRPO, and plots loss + reward curves at the end.
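
The source-fetch step might look roughly like the sketch below. It is an assumption about the approach, not the code in the notebook or in `_stage_physix_live()`: `stage_space` is a hypothetical helper built on `huggingface_hub.snapshot_download`, and the Space id in the usage comment is assumed.

```python
from pathlib import Path
from typing import Callable, Optional


def stage_space(
    repo_id: str,
    dest: str,
    download: Optional[Callable[..., str]] = None,
) -> Path:
    """Download a snapshot of a Space repo into `dest` and return its path."""
    if download is None:
        # Imported lazily; requires huggingface_hub and network access.
        from huggingface_hub import snapshot_download

        download = snapshot_download
    local = download(repo_id=repo_id, repo_type="space", local_dir=dest)
    return Path(local)


# Usage (Space id is an assumption):
# src = stage_space("Pratyush-01/physix-live", "/tmp/physix-live")
```

The `download` parameter exists only so the helper can be exercised offline with a stand-in function.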

	## Pipeline cost (l40sx1, 3B profile)

	| Step | Time | Cost |
	|------|------|------|
	| SFT warm-start (3 epochs, 192 examples) | ~5 min | ~$0.15 |
	| GRPO (200 steps, 4 generations, max_completion=384) | ~40 min | ~$1.20 |
	| Push merged 16-bit model to Hub | ~1 min | ~$0.04 |
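
Summing the rows gives the end-to-end budget. The snippet below only restates the approximate figures from the table; the hourly rate is inferred from the SFT and GRPO rows rather than taken from a price list.

```python
# (step, minutes, USD) copied from the table above; all values are approximate.
STEPS = [
    ("SFT warm-start", 5, 0.15),
    ("GRPO", 40, 1.20),
    ("Push merged model", 1, 0.04),
]

total_minutes = sum(minutes for _, minutes, _ in STEPS)
total_cost = round(sum(cost for _, _, cost in STEPS), 2)

# Hourly rate implied by the two dominant steps (SFT and GRPO).
implied_rate = round((0.15 + 1.20) / ((5 + 40) / 60), 2)

print(f"total: ~{total_minutes} min, ~${total_cost:.2f}")  # total: ~46 min, ~$1.39
print(f"implied rate: ~${implied_rate:.2f}/h")             # implied rate: ~$1.80/h
```

A full run therefore takes roughly 46 minutes, well inside the 3 h default timeout.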

	W&B project: [pratyush01/physix-live](https://wandb.ai/pratyush01/physix-live).

	## W&B metrics to watch

	| Metric | Expected trajectory |
	|--------|---------------------|
	| `train/loss` (GRPO) | stays near zero by design; it reflects only the KL term, not the policy gradient |
	| `train/reward` (aggregate) | climbs from ~0.5 to ~1.5–2.0 |
	| `train/rewards/reward_match/mean` | climbs from ~0.05 to ~0.3–0.5 |
	| `train/rewards/reward_format/mean` | climbs from ~0.5 to ~0.9+ |
	| `train/rewards/reward_match_dense/mean` | climbs alongside `reward_match` |
	| `train/kl` | rises slowly, stays below ~0.5 |
	| `train/grad_norm` | bounded, typically 0.3–1.5 |
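
The near-zero `train/loss` can look alarming, so here is a small illustration of why it happens. It is a simplified sketch, not the trainer's code: it assumes advantages are standardized within each group of completions, one on-policy update per batch (probability ratio of 1), and per-sequence averaging. The reward values are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Standardize rewards within one group of completions for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def surrogate_loss(advantages: list[float], ratios: list[float]) -> float:
    """Unclipped policy-gradient surrogate: -mean(ratio * advantage)."""
    return -mean([r * a for r, a in zip(ratios, advantages)])


rewards = [0.2, 1.5, 0.9, 0.4]                 # made-up rewards for 4 generations
adv = group_advantages(rewards)
loss = surrogate_loss(adv, [1.0] * len(adv))   # on-policy: ratio == 1

print(abs(loss) < 1e-9)  # True
```

Because the advantages sum to zero within a group, the surrogate's value is zero whatever the rewards are, even though its gradient is not. Under these assumptions only the KL penalty moves the logged loss, so the reward curves are the ones to use for judging progress.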
