
TheJackBright

Deploy PolyGuard OpenEnv Space

877add7 verified 12 days ago


	# Training

	## End-to-End Loop

	1. Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
	2. Train SFT adapter with TRL and optional Unsloth.
	3. Train GRPO policy with environment-backed verifier reward.
	4. Run policy-stack ablations.
	5. Merge/export adapters safely.
	6. Validate post-save inference from saved artifacts.
	7. Generate plots and benchmark reports.

	## TRL Source Of Truth

	- https://huggingface.co/docs/trl/index
	- https://huggingface.co/docs/trl/grpo_trainer
	- https://huggingface.co/docs/trl/openenv

	Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via `--allow-fallback` or `POLYGUARD_ALLOW_TRAIN_FALLBACK=true`.
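
As an illustration of the opt-in rule above, here is a minimal sketch of how an entrypoint might decide whether a fallback backend is allowed. The function name `fallback_allowed` and the set of accepted truthy strings are assumptions made for this example; only the flag `--allow-fallback` and the variable `POLYGUARD_ALLOW_TRAIN_FALLBACK` come from this document, and the real scripts may resolve them differently.

```python
import argparse
import os

# Assumption: besides "true", a few common truthy spellings are accepted.
TRUTHY = {"1", "true", "yes", "on"}


def fallback_allowed(argv=None, env=None):
    """Return True only when fallback was explicitly requested.

    Hypothetical helper: checks the --allow-fallback flag first, then the
    POLYGUARD_ALLOW_TRAIN_FALLBACK environment variable.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # parse_known_args ignores the other training flags (e.g. --max-steps).
    args, _unknown = parser.parse_known_args(argv)
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "")
    return args.allow_fallback or env_value.strip().lower() in TRUTHY
```

With neither the flag nor the variable set, the helper returns `False`, which matches the default of requiring TRL.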

	## Local Smoke Commands

	```bash
	.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
	.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
	.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
	.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
	.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
	.venv/bin/python scripts/test_inference_postsave.py --samples 3
	```
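
To run the smoke commands above as one chain that stops at the first failure, a small driver along these lines could be used. This is a sketch, not part of the repository: `run_pipeline` and `SMOKE_COMMANDS` are hypothetical names, and the command lines are copied verbatim from the block above.

```python
import shlex
import subprocess

# Command lines copied from the smoke commands listed in this document.
SMOKE_COMMANDS = [shlex.split(line) for line in """
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
""".strip().splitlines()]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; raise on the first non-zero exit code.

    `runner` is injectable so the chain can be dry-run without training.
    Returns the list of script paths that completed successfully.
    """
    completed = []
    for cmd in commands:
        result = runner(cmd)
        if result.returncode != 0:
            raise RuntimeError(
                f"step failed ({result.returncode}): {' '.join(cmd)}"
            )
        completed.append(cmd[1])  # cmd[1] is the script path
    return completed
```

Calling `run_pipeline(SMOKE_COMMANDS)` from the repository root would execute the six steps in the documented order.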

	## Final Judge-Ready Criteria

	The final accepted reports must satisfy:

	- `outputs/reports/sft_trl_run.json`: backend is `trl_unsloth` or `trl_transformers`.
	- `outputs/reports/grpo_trl_run.json`: `status == "ok"`, the backend is an accepted one, and `artifact_path` is non-empty.
	- `outputs/reports/postsave_inference.json`: `model_source` is not `fallback_policy`.
	- `outputs/reports/improvement_report.json`: `improved == true`.
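
The criteria above can be checked mechanically before running the gate. The sketch below is illustrative only and is not the logic of `scripts/acceptance_gate.py`. It assumes each report is a JSON object with the named fields at the top level, that the GRPO report accepts the same two backends as the SFT report, and that a missing `model_source` counts as a failure; `check_reports` and `load_reports` are hypothetical names.

```python
import json
from pathlib import Path

# Assumption: the GRPO report accepts the same backends as the SFT report.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}
REPORT_NAMES = [
    "sft_trl_run.json",
    "grpo_trl_run.json",
    "postsave_inference.json",
    "improvement_report.json",
]


def check_reports(reports):
    """Return a list of failure messages; an empty list means all criteria pass.

    `reports` maps a report file name to its parsed JSON object.
    """
    failures = []
    sft = reports.get("sft_trl_run.json", {})
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft: backend is not an accepted TRL backend")
    grpo = reports.get("grpo_trl_run.json", {})
    if grpo.get("status") != "ok":
        failures.append("grpo: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo: backend is not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo: artifact_path is empty")
    post = reports.get("postsave_inference.json", {})
    # A missing model_source is treated as a failure (assumption).
    if post.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave: model_source is missing or fallback_policy")
    improvement = reports.get("improvement_report.json", {})
    if improvement.get("improved") is not True:
        failures.append("improvement: improved is not true")
    return failures


def load_reports(report_dir="outputs/reports"):
    """Load whichever of the four reports exist under report_dir."""
    loaded = {}
    for name in REPORT_NAMES:
        path = Path(report_dir) / name
        if path.exists():
            loaded[name] = json.loads(path.read_text())
    return loaded
```

For example, `check_reports(load_reports())` returns one message per unmet criterion, which makes it easy to see which smoke artifact still needs replacing.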

	Run the strict gate after replacing the smoke-run artifacts with the final ones:

	```bash
	POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
	```

	## Scaling Guidance

	Start with small profiles and short max steps. Once reset, step, reward, and logging all behave stably, increase prompt count, GRPO steps, generation count, and environment diversity. Before each scale-up, inspect sampled generations for signs of reward hacking.
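
One way to make the inspection step concrete is to pull a small review set from logged generations: the highest-reward samples, where reward hacking tends to show up as degenerate but well-scored outputs, plus a few random ones for comparison. The sketch below assumes each logged sample is a dict with `completion` and `reward` keys; that record shape and the name `pick_for_review` are hypothetical and not taken from the repository.

```python
import random


def pick_for_review(samples, top_k=5, random_k=5, seed=0):
    """Select samples for manual inspection.

    Returns the top_k highest-reward samples followed by up to random_k
    samples drawn at random from the remainder (seeded for repeatability).
    """
    ranked = sorted(samples, key=lambda s: s["reward"], reverse=True)
    top = ranked[:top_k]
    rest = ranked[top_k:]
    rng = random.Random(seed)
    extra = rng.sample(rest, min(random_k, len(rest)))
    return top + extra
```

Reading the top-reward completions next to the random ones helps show whether high scores reflect better behavior or an exploit in the verifier.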