# Training

## End-to-End Loop

1. Build the training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Train the SFT adapter with TRL and optional Unsloth.
3. Train the GRPO policy with the environment-backed verifier reward.
4. Run policy-stack ablations.
5. Merge/export adapters safely.
6. Validate post-save inference from saved artifacts.
7. Generate plots and benchmark reports.

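The loop above is strictly ordered, so a failed stage should stop everything after it. A minimal sketch of a fail-fast driver follows. The script paths and flags are the ones listed under Local Smoke Commands. The `STAGES` table, `run_pipeline`, and the idea of a single driver are assumptions for illustration and not an existing project script. Step 7 is omitted because this document names no script for it.

```python
import subprocess
import sys

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

# Hypothetical stage table: names are illustrative, script paths and flags
# are copied from the smoke commands in this document.
STAGES = [
    ("build_corpus", ["scripts/build_training_corpus.py", "--profile", "small",
                      "--with-local", "--with-synthetic", "--with-hf"]),
    ("train_sft", ["scripts/train_sft_trl.py", "--model-id", MODEL_ID,
                   "--epochs", "1", "--max-steps", "20"]),
    ("train_grpo", ["scripts/train_grpo_trl.py", "--model-id", MODEL_ID,
                    "--max-steps", "20", "--num-generations", "2"]),
    ("ablations", ["scripts/evaluate_policy_ablations.py", "--episodes", "6"]),
    ("merge", ["scripts/merge_adapters_safe.py",
               "--adapter-dir", "checkpoints/sft_adapter",
               "--output-dir", "checkpoints/merged"]),
    ("postsave", ["scripts/test_inference_postsave.py", "--samples", "3"]),
]


def run_pipeline(stages=STAGES, python=sys.executable, runner=subprocess.run):
    """Run each stage in order and stop at the first non-zero exit code."""
    completed = []
    for name, args in stages:
        result = runner([python, *args], check=False)
        if result.returncode != 0:
            raise RuntimeError(
                f"stage {name!r} failed with exit code {result.returncode}"
            )
        completed.append(name)
    return completed
```

Passing `runner` as a parameter keeps the ordering logic testable without launching any training job.
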
## TRL Source Of Truth

- https://huggingface.co/docs/trl/index
- https://huggingface.co/docs/trl/grpo_trainer
- https://huggingface.co/docs/trl/openenv

Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via `--allow-fallback` or `POLYGUARD_ALLOW_TRAIN_FALLBACK=true`.

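The opt-in rule can be pictured as a small gate that runs before any backend is chosen. The sketch below is illustrative only: `fallback_allowed` and `select_backend` are hypothetical names, and treating the environment value case-insensitively is an assumption. Only the flag name and the variable name come from this document.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "POLYGUARD_ALLOW_TRAIN_FALLBACK"


def fallback_allowed(cli_flag: bool,
                     env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when fallback was explicitly requested.

    cli_flag stands for the parsed value of --allow-fallback. Accepting
    "True"/"TRUE" as well as "true" is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    return cli_flag or env.get(ENV_VAR, "").strip().lower() == "true"


def select_backend(trl_available: bool, allow_fallback: bool) -> str:
    """Prefer TRL; use a fallback backend only when it was opted into."""
    if trl_available:
        return "trl"
    if allow_fallback:
        return "fallback"
    raise RuntimeError(
        "TRL is not importable and fallback was not enabled; "
        f"pass --allow-fallback or set {ENV_VAR}=true"
    )
```

Failing loudly when TRL is missing keeps a silent fallback run from being mistaken for a real training run.
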
## Local Smoke Commands

```bash
# 1. Build a small corpus
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
# 2. Short SFT run
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth
# 3. Short GRPO run
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
# 4. Policy-stack ablations
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
# 5. Merge the SFT adapter
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
# 6. Validate inference from the saved artifacts
.venv/bin/python scripts/test_inference_postsave.py --samples 3
```

## Full HF Space Sweep

The final GPU path is a Hugging Face Docker Space, not local Ollama or local GPU training.

```bash
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_training_space.py \
  --repo-id TheJackBright/polyguard-openenv-training-full \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --hardware a10g-large \
  --model-sweep Qwen/Qwen2.5-0.5B-Instruct,Qwen/Qwen2.5-1.5B-Instruct,Qwen/Qwen2.5-3B-Instruct \
  --sft-epochs 2 \
  --grpo-epochs 1 \
  --sft-max-steps 0 \
  --grpo-max-steps 0 \
  --grpo-max-prompts 0
```

The training runner builds the full corpus with `--profile massive --with-local --with-synthetic --with-hf`. For each Qwen model it trains SFT as the baseline and GRPO as the improved environment-backed policy, then writes isolated sweep artifacts under `outputs/reports/sweeps/<model>/` and `checkpoints/sweeps/<model>/`.

Status snapshot from April 26, 2026:

- `TheJackBright/polyguard-openenv-training-full` was running on `a10g-large`.
- Qwen 0.5B SFT and GRPO had completed inside the Space.
- Qwen 1.5B SFT had completed and Qwen 1.5B GRPO was running.
- The sweep had not been interrupted, so Qwen 3B was expected to run after 1.5B.
- `TheJackBright/polyguard-openenv-training-full-artifacts` had not received the exported files yet, so run files could not be pulled until the Space reached the upload stage.

The run-specific pull command is:

```bash
.venv/bin/python scripts/pull_sweep_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --run-id qwen-qwen2-5-0-5b-instruct
```

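The run id in the pull command looks like a slug of the model id: lowercased, with each run of non-alphanumeric characters collapsed to a hyphen. The sketch below reproduces that pattern. Both helper names are hypothetical, the slug rule is inferred from the single example in this document, and using the slug as the `<model>` path segment is an assumption.

```python
import re
from pathlib import Path
from typing import Dict


def model_run_id(model_id: str) -> str:
    """Slugify a Hub model id (rule inferred from the example run id)."""
    return re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")


def sweep_dirs(model_id: str) -> Dict[str, Path]:
    """Per-model output directories, assuming <model> is the run id slug."""
    run_id = model_run_id(model_id)
    return {
        "reports": Path("outputs/reports/sweeps") / run_id,
        "checkpoints": Path("checkpoints/sweeps") / run_id,
    }


if __name__ == "__main__":
    print(model_run_id("Qwen/Qwen2.5-0.5B-Instruct"))
    # prints: qwen-qwen2-5-0-5b-instruct
```
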
Final comparison and safety artifacts:

- `hf_sweep_summary.json`
- `anti_hacking_overfit_report.json`
- `sft_vs_grpo_reward.png`
- `sft_loss_curves.png`
- `grpo_reward_curves.png`
- `qwen_model_grpo_reward.png`
- `reward_component_bars.png`
- `anti_cheat_failure_rates.png`
- `train_holdout_gap.png`
- `inference_validity_reward.png`
- `inference_latency_validity.png`

Completed runs must use `trl_unsloth` or `trl_transformers`; fallback SFT/GRPO or fallback post-save inference fails the pull-time checks.

## Active Product Model

After a sweep run has been pulled, activate it for the API/UI:

```bash
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter
```

While the remote full sweep is still running, the app can be tested with the local Qwen 0.5B smoke artifact:

```bash
.venv/bin/python scripts/activate_sweep_model.py \
  --source top-level \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter \
  --label local-qwen-0.5b-active-smoke
```

This writes `checkpoints/active/active_model_manifest.json`, mirrors the manifest to `docs/results/active_model_manifest.json`, and lets `/policy/model_status` report which artifact is active. The provider load order is GRPO adapter first, merged SFT artifact second, then SFT adapter.

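The load order described above amounts to "take the first artifact that exists". The following is a sketch of that selection under assumed names. Only `grpo_adapter` appears in this document (as a `--preferred-artifact` value). The directory names `merged_sft` and `sft_adapter`, the one-directory-per-artifact layout, and the function itself are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple

# Priority order from this document; the last two names are hypothetical.
LOAD_ORDER = ("grpo_adapter", "merged_sft", "sft_adapter")


def pick_artifact(run_dir: Path) -> Optional[Tuple[str, Path]]:
    """Return the highest-priority artifact directory present in run_dir."""
    for name in LOAD_ORDER:
        candidate = run_dir / name
        if candidate.is_dir():
            return name, candidate
    # Nothing trained is available; the caller decides what to do next.
    return None
```

Returning `None` instead of raising leaves the "no trained artifact" decision to the caller, which matters here because fallback inference is rejected by the acceptance checks.
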
## Final Judge-Ready Criteria

The final accepted reports must satisfy:

- `outputs/reports/sft_trl_run.json`: backend is `trl_unsloth` or `trl_transformers`.
- `outputs/reports/grpo_trl_run.json`: `status == "ok"`, accepted backend, non-empty `artifact_path`.
- `outputs/reports/postsave_inference.json`: `model_source` is not `fallback_policy`.
- `outputs/reports/improvement_report.json`: `improved == true`.

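These four conditions can be checked mechanically before running the full gate. The sketch below is not `scripts/acceptance_gate.py`. It is a hypothetical pre-check that assumes each report is a flat JSON object and that the backend is stored under a `backend` key. The keys `status`, `artifact_path`, `model_source`, and `improved` are taken from the list above.

```python
import json
from pathlib import Path
from typing import Dict, List

ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}
REPORT_NAMES = ("sft_trl_run", "grpo_trl_run",
                "postsave_inference", "improvement_report")


def load_reports(report_dir: Path = Path("outputs/reports")) -> Dict[str, dict]:
    """Read whichever of the four reports exist; missing ones are skipped."""
    reports = {}
    for name in REPORT_NAMES:
        path = report_dir / f"{name}.json"
        if path.exists():
            reports[name] = json.loads(path.read_text())
    return reports


def check_reports(reports: Dict[str, dict]) -> List[str]:
    """Return human-readable failures; an empty list means all checks pass."""
    failures = []
    sft = reports.get("sft_trl_run", {})
    grpo = reports.get("grpo_trl_run", {})
    post = reports.get("postsave_inference", {})
    imp = reports.get("improvement_report", {})

    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft_trl_run: backend not accepted")
    if grpo.get("status") != "ok":
        failures.append("grpo_trl_run: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo_trl_run: backend not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo_trl_run: artifact_path is empty")
    # A missing model_source is treated as a failure, not as a pass.
    if post.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave_inference: fallback or missing model_source")
    if imp.get("improved") is not True:
        failures.append("improvement_report: improved is not true")
    return failures
```

A missing report fails every check that depends on it, so an incomplete run cannot pass by omission.
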
Run the strict gate after replacing smoke artifacts:

```bash
POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
```

## Scaling Guidance

Start with small profiles and short max steps. Once the environment's reset, step, reward, and logging paths are stable, use `max_steps <= 0` for full-epoch SFT/GRPO over the selected corpus. Before treating a run as final, inspect sampled generations, candidate diversity, legality, the train-holdout reward gap, and anti-cheat rates.

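For context on the `max_steps <= 0` convention: in Hugging Face `transformers`, `TrainingArguments.max_steps` defaults to `-1`, and a positive value overrides `num_train_epochs`. One way a script could translate its own step budget into that convention is sketched below. `resolve_max_steps` is a hypothetical helper, and whether the training scripts here do exactly this is an assumption.

```python
def resolve_max_steps(requested: int) -> int:
    """Map a script-level step budget to a trainer-style max_steps value.

    A positive budget caps the run at that many optimizer steps. Zero or
    a negative value becomes -1, which leaves run length to the epoch count.
    """
    return requested if requested > 0 else -1


if __name__ == "__main__":
    print(resolve_max_steps(20))  # prints: 20 (smoke run, capped)
    print(resolve_max_steps(0))   # prints: -1 (full-epoch run)
```
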