# Training
## End-to-End Loop
1. Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Train SFT adapter with TRL and optional Unsloth.
3. Train GRPO policy with environment-backed verifier reward.
4. Run policy-stack ablations.
5. Merge/export adapters safely.
6. Validate post-save inference from saved artifacts.
7. Generate plots and benchmark reports.
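
Step 3 relies on a reward computed by the environment rather than by a learned reward model. A minimal sketch of that idea follows, assuming TRL's custom reward function convention for `GRPOTrainer` (a callable that receives `completions` plus extra keyword arguments and returns one float per completion) and plain-string completions. `ToyEnv`, its action set, and `env_reward` are hypothetical stand-ins, not PolyGuard's actual environment or reward.

```python
from typing import Any


class ToyEnv:
    """Hypothetical stand-in for the real environment (not PolyGuard's API)."""

    VALID_ACTIONS = {"accept", "reject", "escalate"}  # assumed action set

    def reset(self) -> str:
        # Return a fixed observation; a real environment would sample an episode.
        return "toy-observation"

    def step(self, action: str) -> float:
        # Verifier-style reward: 1.0 for a well-formed action, 0.0 otherwise.
        return 1.0 if action in self.VALID_ACTIONS else 0.0


def env_reward(completions: list[str], **kwargs: Any) -> list[float]:
    """Score each completion by stepping a fresh environment with it."""
    rewards = []
    for completion in completions:
        env = ToyEnv()
        env.reset()
        rewards.append(env.step(completion.strip().lower()))
    return rewards


print(env_reward(completions=["ACCEPT", "delete everything"]))
```

A function of this shape could be passed through the `reward_funcs` argument of `GRPOTrainer`; the real verifier would replace the toy `step` logic.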
## TRL Source Of Truth
- https://huggingface.co/docs/trl/index
- https://huggingface.co/docs/trl/grpo_trainer
- https://huggingface.co/docs/trl/openenv
Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via `--allow-fallback` or `POLYGUARD_ALLOW_TRAIN_FALLBACK=true`.
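
The opt-in rule above can be sketched as a small helper. This is an illustration only: the function name `fallback_allowed` is hypothetical, and the assumption that the environment variable is compared case-insensitively against `true` may not match what the actual scripts do.

```python
import argparse


def fallback_allowed(argv: list[str], env: dict[str, str]) -> bool:
    """Return True only when fallback backends were explicitly requested."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # Ignore unrelated flags such as --model-id so the helper can be reused.
    args, _unknown = parser.parse_known_args(argv)

    # Assumption: only the literal value "true" (any case) enables the fallback.
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "").strip().lower()
    return args.allow_fallback or env_value == "true"


print(fallback_allowed(["--max-steps", "20"], {}))
print(fallback_allowed(["--allow-fallback"], {}))
print(fallback_allowed([], {"POLYGUARD_ALLOW_TRAIN_FALLBACK": "true"}))
```

In a real entrypoint the arguments would typically be `sys.argv[1:]` and `os.environ`; passing them in explicitly keeps the helper easy to test.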
## Local Smoke Commands
```bash
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
```
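
When pasted into a shell without `set -e`, the commands above keep running after a failed step. One possible alternative is a small Python driver that runs the same commands in order and stops at the first non-zero exit. The names `SMOKE_STEPS` and `run_steps` are hypothetical; the commands themselves are copied from the block above.

```python
import subprocess
from typing import Callable, Sequence

PYTHON = ".venv/bin/python"
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

# Same commands as the shell block, one argument list per step.
SMOKE_STEPS: list[list[str]] = [
    [PYTHON, "scripts/build_training_corpus.py", "--profile", "small",
     "--with-local", "--with-synthetic", "--with-hf"],
    [PYTHON, "scripts/train_sft_trl.py", "--model-id", MODEL,
     "--epochs", "1", "--max-steps", "20", "--use-unsloth"],
    [PYTHON, "scripts/train_grpo_trl.py", "--model-id", MODEL,
     "--max-steps", "20", "--num-generations", "2", "--use-unsloth"],
    [PYTHON, "scripts/evaluate_policy_ablations.py", "--episodes", "6"],
    [PYTHON, "scripts/merge_adapters_safe.py",
     "--adapter-dir", "checkpoints/sft_adapter",
     "--output-dir", "checkpoints/merged"],
    [PYTHON, "scripts/test_inference_postsave.py", "--samples", "3"],
]


def run_steps(
    steps: Sequence[Sequence[str]],
    runner: Callable[..., object] = subprocess.run,
) -> int:
    """Run each step in order; check=True raises on the first failure."""
    for index, cmd in enumerate(steps, start=1):
        print(f"[{index}/{len(steps)}] {' '.join(cmd)}")
        runner(list(cmd), check=True)
    return len(steps)
```

Calling `run_steps(SMOKE_STEPS)` from the repository root would execute the full smoke sequence; the `runner` parameter exists only so the driver can be exercised without launching the scripts.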
## Final Judge-Ready Criteria
The final accepted reports must satisfy:
- `outputs/reports/sft_trl_run.json`: backend is `trl_unsloth` or `trl_transformers`.
- `outputs/reports/grpo_trl_run.json`: `status == "ok"`, accepted backend, non-empty `artifact_path`.
- `outputs/reports/postsave_inference.json`: `model_source` is not `fallback_policy`.
- `outputs/reports/improvement_report.json`: `improved == true`.
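
The criteria above can be checked mechanically. The sketch below is not the project's `acceptance_gate.py`; it only illustrates the listed conditions. The JSON key `backend` is an assumed name for the backend field, and the GRPO report is assumed to accept the same two backends as the SFT report. A missing `model_source` is treated as a failure, which is a stricter reading than the text requires.

```python
import json
from pathlib import Path

# Assumption: GRPO accepts the same backends that the SFT criterion lists.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def check_reports(report_dir: Path) -> list[str]:
    """Return a list of human-readable failures; empty means all checks pass."""

    def load(name: str) -> dict:
        return json.loads((report_dir / name).read_text())

    failures: list[str] = []

    sft = load("sft_trl_run.json")
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft_trl_run.json: backend not accepted")

    grpo = load("grpo_trl_run.json")
    if grpo.get("status") != "ok":
        failures.append("grpo_trl_run.json: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo_trl_run.json: backend not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo_trl_run.json: artifact_path is empty")

    postsave = load("postsave_inference.json")
    if postsave.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave_inference.json: fallback or missing model_source")

    improvement = load("improvement_report.json")
    if improvement.get("improved") is not True:
        failures.append("improvement_report.json: improved is not true")

    return failures
```

For example, `check_reports(Path("outputs/reports"))` would return an empty list when every criterion holds.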
Run the strict gate after replacing the smoke-run artifacts with the final-run outputs:
```bash
POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
```
## Scaling Guidance
Start with small profiles and a low `--max-steps` value. Once the environment's reset, step, reward, and logging paths are stable, increase the prompt count, GRPO steps, generation count (`--num-generations`), and environment diversity. Before scaling, inspect sampled generations for signs of reward hacking.
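
One way to make the inspection step concrete is to surface high-reward generations that look suspicious. The heuristics below (very short text, or the same text repeated among high-reward samples) are illustrative assumptions, not checks taken from the project, and `flag_suspicious` is a hypothetical helper. Manual reading of the samples is still needed.

```python
from collections import Counter


def flag_suspicious(
    samples: list[tuple[str, float]],
    reward_threshold: float = 0.9,
    min_chars: int = 20,
) -> list[tuple[str, float, list[str]]]:
    """Flag high-reward completions that are very short or duplicated."""
    # Keep only samples whose reward meets the threshold.
    high = [(text.strip(), reward) for text, reward in samples
            if reward >= reward_threshold]
    counts = Counter(text for text, _ in high)

    flagged = []
    for text, reward in high:
        reasons = []
        if len(text) < min_chars:
            reasons.append("very short")
        if counts[text] > 1:
            reasons.append("duplicate")
        if reasons:
            flagged.append((text, reward, reasons))
    return flagged


samples = [
    ("ok", 1.0),
    ("A detailed answer that explains the chosen action.", 0.95),
    ("ok", 0.1),
]
for text, reward, reasons in flag_suspicious(samples):
    print(f"{reward:.2f} {reasons} {text!r}")
```

Thresholds such as `reward_threshold` and `min_chars` would need tuning to the reward scale and expected output length of the actual task.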