| # Training |
|
|
| ## End-to-End Loop |
|
|
| 1. Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback. |
| 2. Train SFT adapter with TRL and optional Unsloth. |
| 3. Train GRPO policy with environment-backed verifier reward. |
| 4. Run policy-stack ablations. |
| 5. Merge/export adapters safely. |
| 6. Validate post-save inference from saved artifacts. |
| 7. Generate plots and benchmark reports. |
|
|
| ## TRL Source Of Truth |
|
|
| - https://huggingface.co/docs/trl/index |
| - https://huggingface.co/docs/trl/grpo_trainer |
| - https://huggingface.co/docs/trl/openenv |
|
| Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via `--allow-fallback` or `POLYGUARD_ALLOW_TRAIN_FALLBACK=true`. |
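The opt-in rule above could be checked along the lines of the following minimal sketch. Only the flag `--allow-fallback` and the variable `POLYGUARD_ALLOW_TRAIN_FALLBACK` come from this document. The helper name `fallback_allowed` and the choice to accept only the string `true` (case-insensitive) are assumptions for illustration, not the project's actual code.

```python
import argparse
import os


def fallback_allowed(argv, env=None):
    """Return True only when fallback training was explicitly requested.

    Hypothetical helper: the real entrypoints may parse these differently.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # Ignore every other training flag; only the opt-in flag matters here.
    args, _unknown = parser.parse_known_args(argv)
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "")
    # Assumption: only the literal string "true" (any case) enables fallback.
    return args.allow_fallback or env_value.strip().lower() == "true"
```

With neither the flag nor the variable set, the function returns `False`, which matches the default of requiring TRL.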
|
|
| ## Shared Submission Artifacts |
|
|
| The environment files, training scripts, notebooks, and logs/results required |
| for review are indexed in [Submission Artifact Index](submission_artifacts.md). |
|
|
| Key shared files: |
|
|
| - Environment/runtime: `openenv.yaml`, `pyproject.toml`, `uv.lock`, `requirements*.txt`, `Dockerfile*`, `app/env/`, `server/app.py`, and `app/hf_space/Dockerfile`. |
| - Training scripts: `scripts/train_sft_trl.py`, `scripts/train_grpo_trl.py`, `scripts/deploy_training_space.py`, `app/hf_space/training_runner.py`, and `app/training/`. |
| - Training notebooks: `PolyGuard_SFT_GRPO_One_Run_Runner.ipynb` and `notebooks/09_training_loop.ipynb`. |
| - Training logs/results: `docs/results/final_submission_evidence/reports/`, `docs/results/sweeps/`, `docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/`, and `docs/results/qwen_completed_runs/reports/`. |
|
|
| ## Local Smoke Commands |
|
|
| ```bash |
| .venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf |
| .venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth |
| .venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth |
| .venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6 |
| .venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged |
| .venv/bin/python scripts/test_inference_postsave.py --samples 3 |
| ``` |
|
|
| ## Full HF Space Sweep |
|
|
| The final GPU path is a Hugging Face Docker Space, not local Ollama or local GPU training. |
|
|
| The root-level one-run notebook is: |
|
|
| ```text |
| PolyGuard_SFT_GRPO_One_Run_Runner.ipynb |
| ``` |
|
|
| Run it top to bottom for the complete data build, SFT baseline, GRPO training, |
| artifact pull, post-save inference validation, report/chart generation, and |
| product HF Space deployment path. Any required Hugging Face credentials are |
| provided by the runner environment or Space secret, not stored in the repo. |
|
|
| The training runner builds the full corpus with `--profile massive --with-local --with-synthetic --with-hf`, trains SFT as the baseline and GRPO as the improved environment-backed policy for each Qwen model, then writes isolated sweep artifacts under `outputs/reports/sweeps/<model>/` and `checkpoints/sweeps/<model>/`. |
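The per-model layout described above can be sketched as follows. The two directory roots come from this document. The naming rule in `run_id_for` is an assumption: it lowercases the model id and replaces runs of non-alphanumeric characters with `-`, which reproduces the run id `qwen-qwen2-5-0-5b-instruct` used elsewhere on this page, but the runner's real rule is not shown here. Both function names are hypothetical.

```python
import re
from pathlib import Path


def run_id_for(model_id):
    """Lowercase the model id and collapse non-alphanumeric runs to '-'.

    Assumption: this mirrors how ids such as 'qwen-qwen2-5-0-5b-instruct'
    are derived; the runner's actual naming rule may differ.
    """
    return re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")


def sweep_dirs(model_id, root="."):
    """Return the (reports, checkpoints) directories for one model."""
    run_id = run_id_for(model_id)
    base = Path(root)
    return (
        base / "outputs" / "reports" / "sweeps" / run_id,
        base / "checkpoints" / "sweeps" / run_id,
    )
```

Keeping one directory per model under each root is what isolates the sweep artifacts, so a rerun for one model cannot overwrite another model's reports or checkpoints.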
|
|
| Intermediate Space status output no longer counts as final public evidence. Use |
| `docs/results/final_submission_evidence/` for the completed Qwen 0.5B/1.5B SFT |
| reports and the completed Qwen 3B SFT+GRPO reports, charts, post-save |
| inference, ablations, and artifact manifest. |
|
|
| Final comparison and safety artifacts: |
|
|
| - `hf_sweep_summary.json` |
| - `anti_hacking_overfit_report.json` |
| - `sft_vs_grpo_reward.png` |
| - `sft_loss_curves.png` |
| - `grpo_reward_curves.png` |
| - `qwen_model_grpo_reward.png` |
| - `reward_component_bars.png` |
| - `anti_cheat_failure_rates.png` |
| - `train_holdout_gap.png` |
| - `inference_validity_reward.png` |
| - `inference_latency_validity.png` |
|
|
| Completed runs must use `trl_unsloth` or `trl_transformers`; fallback SFT/GRPO or fallback post-save inference fails the pull-time checks. |
|
|
| ## Active Product Model |
|
|
| After a sweep run has been pulled, activate it for the API/UI: |
|
|
| ```bash |
| .venv/bin/python scripts/activate_sweep_model.py \ |
| --source sweep \ |
| --run-id qwen-qwen2-5-0-5b-instruct \ |
| --preferred-artifact grpo_adapter |
| ``` |
|
|
| While the remote full sweep is still running, the app can be tested with the local Qwen 0.5B smoke artifact: |
|
|
| ```bash |
| .venv/bin/python scripts/activate_sweep_model.py \ |
| --source top-level \ |
| --run-id qwen-qwen2-5-0-5b-instruct \ |
| --preferred-artifact grpo_adapter \ |
| --label local-qwen-0.5b-active-smoke |
| ``` |
|
|
| This writes `checkpoints/active/active_model_manifest.json`, mirrors the manifest to `docs/results/active_model_manifest.json`, and lets `/policy/model_status` report which artifact is active. The provider load order is GRPO adapter first, merged SFT artifact second, then SFT adapter. |
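The load order above amounts to a first-match search over a fixed priority list. The sketch below illustrates that idea only. The name `grpo_adapter` appears in the commands on this page, while `merged` and `sft_adapter` are borrowed from the smoke-command directory names and may not match the provider's real artifact keys. `pick_artifact` is a hypothetical helper.

```python
# Priority order from the text: GRPO adapter, merged SFT artifact, SFT adapter.
# Assumption: these string keys are illustrative, not the provider's real ones.
LOAD_ORDER = ("grpo_adapter", "merged", "sft_adapter")


def pick_artifact(available, order=LOAD_ORDER):
    """Return the first artifact in priority order that is available, else None."""
    for name in order:
        if name in available:
            return name
    return None
```

For example, if only the merged SFT artifact and the SFT adapter exist, the merged artifact is chosen because it ranks higher.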
|
|
| ## Final Judge-Ready Criteria |
|
|
| The final accepted reports must satisfy: |
|
|
| - `outputs/reports/sft_trl_run.json`: backend is `trl_unsloth` or `trl_transformers`. |
| - `outputs/reports/grpo_trl_run.json`: `status == "ok"`, accepted backend, non-empty `artifact_path`. |
| - `outputs/reports/postsave_inference.json`: `model_source` is not `fallback_policy`. |
| - `outputs/reports/improvement_report.json`: `improved == true`. |
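The four criteria above can be expressed as a small checker. This is a sketch, not the logic of `scripts/acceptance_gate.py`. The keys `status`, `artifact_path`, `model_source`, and `improved` are named in this document, while the key `backend` and the function `check_reports` are assumptions.

```python
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def check_reports(sft, grpo, postsave, improvement):
    """Return a list of failure messages; an empty list means every check passed.

    Each argument is the parsed JSON object (a dict) of one report.
    Assumption: the backend is stored under a key named "backend".
    """
    failures = []
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft: backend not accepted")
    if grpo.get("status") != "ok":
        failures.append("grpo: status is not ok")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo: backend not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo: artifact_path is empty")
    if postsave.get("model_source") == "fallback_policy":
        failures.append("postsave: model_source is fallback_policy")
    if improvement.get("improved") is not True:
        failures.append("improvement: improved is not true")
    return failures
```

In practice each dict would come from `json.load` on the corresponding file under `outputs/reports/`. Collecting all failures instead of stopping at the first one makes a single run show everything that still needs replacing.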
|
|
| Run the strict gate after replacing smoke artifacts: |
|
|
| ```bash |
| POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py |
| ``` |
|
|
| ## Scaling Guidance |
|
|
| Start with small profiles and short max steps. Once reset, step, reward, and logging are stable, use `max_steps <= 0` for full-epoch SFT/GRPO over the selected corpus. Inspect sampled generations, candidate diversity, legality, train-holdout reward gap, and anti-cheat rates before treating a run as final. |
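One of the checks above, the train-holdout reward gap, can be sketched as a difference of means. This assumes the gap is defined as mean train reward minus mean holdout reward; the project's reports may compute it differently. The function names and the `max_gap` threshold are hypothetical.

```python
from statistics import mean


def train_holdout_gap(train_rewards, holdout_rewards):
    """Mean train reward minus mean holdout reward.

    A large positive value suggests the policy fits the training episodes
    better than unseen ones.
    """
    return mean(train_rewards) - mean(holdout_rewards)


def gap_is_suspicious(train_rewards, holdout_rewards, max_gap=0.1):
    # max_gap is a placeholder threshold, not a value taken from the project.
    return train_holdout_gap(train_rewards, holdout_rewards) > max_gap
```

A run that exceeds whatever threshold is chosen is worth inspecting by hand before scaling up, alongside the sampled generations and anti-cheat rates.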
|
|