Training
End-to-End Loop
- Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
- Train SFT adapter with TRL and optional Unsloth.
- Train GRPO policy with environment-backed verifier reward.
- Run policy-stack ablations.
- Merge/export adapters safely.
- Validate post-save inference from saved artifacts.
- Generate plots and benchmark reports.
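The verifier reward in the GRPO step can be pictured as a plain Python function. TRL's GRPOTrainer accepts custom reward functions that receive the sampled completions (plus other keyword arguments) and return one float per completion. The sketch below is illustrative only: the JSON action format, the ALLOWED_ACTIONS values, and the reward magnitudes are assumptions, not the project's actual environment verifier.

```python
import json

# Hypothetical action vocabulary; the real environment defines its own.
ALLOWED_ACTIONS = ("approve", "flag", "reject")


def verifier_reward(completions, **kwargs):
    """Score each completion: -1.0 if unparseable, 0.0 if illegal, 1.0 if legal.

    Assumes standard (plain string) completions, one JSON object per completion.
    """
    rewards = []
    for text in completions:
        try:
            parsed = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            rewards.append(-1.0)  # not valid JSON at all
            continue
        if isinstance(parsed, dict) and parsed.get("action") in ALLOWED_ACTIONS:
            rewards.append(1.0)  # well-formed and legal action
        else:
            rewards.append(0.0)  # parseable but not a legal action
    return rewards
```

A function of this shape would be passed through the reward_funcs argument of GRPOTrainer; a real verifier would step the environment rather than check a fixed vocabulary.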
TRL Source Of Truth
- https://huggingface.co/docs/trl/index
- https://huggingface.co/docs/trl/grpo_trainer
- https://huggingface.co/docs/trl/openenv
Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via --allow-fallback or POLYGUARD_ALLOW_TRAIN_FALLBACK=true.
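A minimal sketch of how such an opt-in check could be resolved, assuming the flag and the environment variable are simply OR-ed together. The function name fallback_allowed is hypothetical, and treating only the literal value true (case-insensitive) as enabling is an assumption; the actual scripts may parse the variable differently.

```python
import argparse
import os


def fallback_allowed(argv, environ=None):
    """Return True only when fallback was explicitly requested.

    Hypothetical helper: checks --allow-fallback on the command line and
    POLYGUARD_ALLOW_TRAIN_FALLBACK=true in the environment.
    """
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # Ignore every other training flag; only the opt-in matters here.
    args, _unknown = parser.parse_known_args(argv)
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "")
    return args.allow_fallback or env_value.strip().lower() == "true"
```

With neither the flag nor the variable set, the function returns False, which matches the TRL-by-default behavior described above.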
Local Smoke Commands
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
Full HF Space Sweep
The final GPU path is a Hugging Face Docker Space, not local Ollama or local GPU training.
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_training_space.py \
--repo-id TheJackBright/polyguard-openenv-training-full \
--artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
--hardware a10g-large \
--model-sweep Qwen/Qwen2.5-0.5B-Instruct,Qwen/Qwen2.5-1.5B-Instruct,Qwen/Qwen2.5-3B-Instruct \
--sft-epochs 2 \
--grpo-epochs 1 \
--sft-max-steps 0 \
--grpo-max-steps 0 \
--grpo-max-prompts 0
The training runner builds the full corpus with --profile massive --with-local --with-synthetic --with-hf. For each Qwen model it trains SFT as the baseline and GRPO as the improved environment-backed policy, then writes isolated sweep artifacts under outputs/reports/sweeps/<model>/ and checkpoints/sweeps/<model>/.
Status snapshot from April 26, 2026:
- TheJackBright/polyguard-openenv-training-full is running on a10g-large.
- Qwen 0.5B SFT and GRPO completed inside the Space.
- Qwen 1.5B SFT completed and Qwen 1.5B GRPO was running.
- Qwen 3B was not interrupted and should continue after 1.5B.
- TheJackBright/polyguard-openenv-training-full-artifacts had not received the exported files yet, so run files cannot be pulled until the Space reaches the upload stage.
The run-specific pull command is:
.venv/bin/python scripts/pull_sweep_artifacts.py \
--artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
--run-id qwen-qwen2-5-0-5b-instruct
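The run id in the command above looks like a lowercase slug of the model id Qwen/Qwen2.5-0.5B-Instruct. The sketch below reproduces that mapping and the per-run directories. Both helper names are hypothetical, and the assumption that the <model> path segment equals the run id has not been confirmed against the runner.

```python
import re
from pathlib import Path


def run_id_from_model(model_id: str) -> str:
    """Lowercase the model id and collapse non-alphanumeric runs to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")


def sweep_dirs(run_id: str) -> dict:
    """Return the report and checkpoint directories for one sweep run.

    Assumes the <model> path segment is the run id slug.
    """
    return {
        "reports": Path("outputs/reports/sweeps") / run_id,
        "checkpoints": Path("checkpoints/sweeps") / run_id,
    }


run_id = run_id_from_model("Qwen/Qwen2.5-0.5B-Instruct")
print(run_id)
print(sweep_dirs(run_id)["reports"].as_posix())
```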
Final comparison and safety artifacts:
- hf_sweep_summary.json
- anti_hacking_overfit_report.json
- sft_vs_grpo_reward.png
- sft_loss_curves.png
- grpo_reward_curves.png
- qwen_model_grpo_reward.png
- reward_component_bars.png
- anti_cheat_failure_rates.png
- train_holdout_gap.png
- inference_validity_reward.png
- inference_latency_validity.png
Completed runs must use trl_unsloth or trl_transformers; fallback SFT/GRPO or fallback post-save inference fails the pull-time checks.
Active Product Model
After a sweep run has been pulled, activate it for the API/UI:
.venv/bin/python scripts/activate_sweep_model.py \
--source sweep \
--run-id qwen-qwen2-5-0-5b-instruct \
--preferred-artifact grpo_adapter
While the remote full sweep is still running, the app can be tested with the local Qwen 0.5B smoke artifact:
.venv/bin/python scripts/activate_sweep_model.py \
--source top-level \
--run-id qwen-qwen2-5-0-5b-instruct \
--preferred-artifact grpo_adapter \
--label local-qwen-0.5b-active-smoke
This writes checkpoints/active/active_model_manifest.json, mirrors the manifest to docs/results/active_model_manifest.json, and lets /policy/model_status report which artifact is active. The provider load order is GRPO adapter first, merged SFT artifact second, then SFT adapter.
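The load order described above amounts to a first-match search over candidate artifacts. A minimal sketch, assuming each artifact is a path on disk: the function name pick_artifact and the key names in LOAD_ORDER are hypothetical, and the real provider may check more than existence.

```python
from pathlib import Path

# Priority order from the text: GRPO adapter, merged SFT, then SFT adapter.
LOAD_ORDER = ("grpo_adapter", "merged_sft", "sft_adapter")


def pick_artifact(candidates, order=LOAD_ORDER):
    """Return (kind, path) for the first candidate that exists, else None.

    candidates maps an artifact kind to a filesystem path (str or Path).
    """
    for kind in order:
        path = candidates.get(kind)
        if path is not None and Path(path).exists():
            return kind, Path(path)
    return None
```

Returning None when nothing exists leaves the caller to decide whether to fail or fall back, so that decision stays explicit.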
Final Judge-Ready Criteria
The final accepted reports must satisfy:
- outputs/reports/sft_trl_run.json: backend is trl_unsloth or trl_transformers.
- outputs/reports/grpo_trl_run.json: status == "ok", accepted backend, non-empty artifact_path.
- outputs/reports/postsave_inference.json: model_source is not fallback_policy.
- outputs/reports/improvement_report.json: improved == true.
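The four report criteria can be expressed as a small checker over the parsed JSON reports. This is a sketch, not the logic of scripts/acceptance_gate.py: the function name check_reports is hypothetical, and the assumption that the backend is stored under a top-level "backend" key is a guess based on the wording of the criteria.

```python
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def check_reports(sft, grpo, postsave, improvement):
    """Return a list of failure messages; an empty list means all criteria pass.

    Each argument is the already-parsed dict for one report file.
    """
    failures = []
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft_trl_run.json: backend not accepted")
    if grpo.get("status") != "ok":
        failures.append("grpo_trl_run.json: status is not ok")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo_trl_run.json: backend not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo_trl_run.json: artifact_path is empty")
    if postsave.get("model_source") == "fallback_policy":
        failures.append("postsave_inference.json: model_source is fallback_policy")
    if improvement.get("improved") is not True:
        failures.append("improvement_report.json: improved is not true")
    return failures
```

Collecting every failure instead of stopping at the first makes a rejected run easier to diagnose in one pass.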
Run the strict gate after replacing smoke artifacts:
POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
Scaling Guidance
Start with small profiles and short max steps. After reset/step/reward/logging is stable, use max_steps <= 0 for full-epoch SFT/GRPO over the selected corpus. Inspect sampled generations, candidate diversity, legality, train-holdout reward gap, and anti-cheat rates before treating a run as final.
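To make the max_steps <= 0 convention concrete, the sketch below shows one way a positive cap and a full-epoch run could resolve to a step count. The function name and the batch-size arithmetic are assumptions; the training scripts may delegate this to the trainer, and gradient accumulation is ignored here.

```python
import math


def resolve_total_steps(max_steps, num_examples, epochs, batch_size):
    """Return the number of optimizer steps a run would take.

    A positive max_steps caps the run (smoke tests); zero or negative
    means train for the full number of epochs over the corpus.
    """
    if max_steps > 0:
        return max_steps
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


print(resolve_total_steps(20, 1000, 1, 8))  # capped smoke run
print(resolve_total_steps(0, 1000, 2, 8))   # full two-epoch run
```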