Training

End-to-End Loop

  1. Build the training corpus from local structured data and synthetic episodes, plus three optional sources: HF instruction data, DDI API augmentation, and a web fallback.
  2. Train SFT adapter with TRL and optional Unsloth.
  3. Train GRPO policy with environment-backed verifier reward.
  4. Run policy-stack ablations.
  5. Merge/export adapters safely.
  6. Validate post-save inference from saved artifacts.
  7. Generate plots and benchmark reports.
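Step 3 relies on a reward that comes from the environment rather than from a static label. The sketch below shows one way such a reward could be shaped as a plain callable that takes `prompts` and `completions` and returns one float per completion, which is the general form TRL's GRPO trainer accepts for custom reward functions. `StubEnv`, its `reset`/`step` methods, and the keyword check inside it are hypothetical stand-ins, not the project's environment or verifier, and the sketch assumes completions arrive as plain strings.

```python
from typing import Any, Callable, List


class StubEnv:
    """Hypothetical stand-in for the real environment (assumption)."""

    def reset(self, prompt: str) -> None:
        # Start a fresh episode for this prompt.
        self.prompt = prompt

    def step(self, action: str) -> float:
        # Toy verifier: 1.0 if the action mentions "monitor", else 0.0.
        return 1.0 if "monitor" in action.lower() else 0.0


def make_env_reward(env_factory: Callable[[], Any]) -> Callable[..., List[float]]:
    """Wrap an environment factory as a per-completion reward callable."""

    def env_reward(prompts: List[str], completions: List[str], **kwargs: Any) -> List[float]:
        rewards = []
        for prompt, completion in zip(prompts, completions):
            env = env_factory()  # fresh environment per sample, no shared state
            env.reset(prompt)
            rewards.append(float(env.step(completion)))
        return rewards

    return env_reward


reward_fn = make_env_reward(StubEnv)
```

Creating a new environment per sample keeps rewards independent across the generations in a group, at the cost of extra setup time per call.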

TRL Source Of Truth

Training entrypoints require Hugging Face TRL by default. Fallback backends are strictly opt-in: pass --allow-fallback or set POLYGUARD_ALLOW_TRAIN_FALLBACK=true.
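A minimal sketch of how this opt-in rule could be resolved, assuming the flag and the environment variable are combined with a logical OR and that only the literal value `true` (case-insensitive) enables the fallback. The function names and the returned backend labels are hypothetical; the real entrypoints may parse these differently.

```python
import os
from typing import Mapping, Optional


def fallback_allowed(cli_flag: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the fallback was explicitly requested."""
    env = os.environ if env is None else env
    # Assumption: any value other than "true" leaves the fallback disabled.
    value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "").strip().lower()
    return cli_flag or value == "true"


def select_backend(trl_available: bool, cli_flag: bool,
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Prefer TRL; use a fallback only on explicit opt-in; otherwise fail loudly."""
    if trl_available:
        return "trl"
    if fallback_allowed(cli_flag, env):
        return "fallback"
    raise RuntimeError(
        "TRL is not available and no fallback was requested; "
        "pass --allow-fallback or set POLYGUARD_ALLOW_TRAIN_FALLBACK=true."
    )
```

Failing with an error instead of silently degrading keeps a missing TRL install from producing reports that look like real training runs.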

Local Smoke Commands

.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
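The commands above depend on each other's outputs, so it can help to run them in order and stop at the first failure. The following is a small illustrative wrapper, not part of the repository: the command lines are copied from the list above, while `run_smoke` and its `runner` parameter are hypothetical names introduced here.

```python
import subprocess
from typing import Callable, List, Optional

PYTHON = ".venv/bin/python"
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

# Same commands and order as the smoke list above.
SMOKE_STEPS: List[List[str]] = [
    [PYTHON, "scripts/build_training_corpus.py", "--profile", "small",
     "--with-local", "--with-synthetic", "--with-hf"],
    [PYTHON, "scripts/train_sft_trl.py", "--model-id", MODEL_ID,
     "--epochs", "1", "--max-steps", "20", "--use-unsloth"],
    [PYTHON, "scripts/train_grpo_trl.py", "--model-id", MODEL_ID,
     "--max-steps", "20", "--num-generations", "2", "--use-unsloth"],
    [PYTHON, "scripts/evaluate_policy_ablations.py", "--episodes", "6"],
    [PYTHON, "scripts/merge_adapters_safe.py", "--adapter-dir",
     "checkpoints/sft_adapter", "--output-dir", "checkpoints/merged"],
    [PYTHON, "scripts/test_inference_postsave.py", "--samples", "3"],
]


def run_smoke(steps: Optional[List[List[str]]] = None,
              runner: Optional[Callable[[List[str]], int]] = None) -> int:
    """Run steps in order; return the index of the first failure, or -1."""
    steps = SMOKE_STEPS if steps is None else steps
    if runner is None:
        # Default runner executes the command and reports its exit code.
        runner = lambda cmd: subprocess.run(cmd, check=False).returncode
    for index, cmd in enumerate(steps):
        print("running:", " ".join(cmd))
        if runner(cmd) != 0:
            return index
    return -1
```

Calling `run_smoke()` from the repository root executes the full sequence; passing a custom `runner` allows a dry run without launching any training.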

Final Judge-Ready Criteria

The final accepted reports must satisfy:

  • outputs/reports/sft_trl_run.json: backend is trl_unsloth or trl_transformers.
  • outputs/reports/grpo_trl_run.json: status == "ok", backend is an accepted value, and artifact_path is non-empty.
  • outputs/reports/postsave_inference.json: model_source is not fallback_policy.
  • outputs/reports/improvement_report.json: improved == true.
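The criteria above can be checked mechanically before running the full gate. The sketch below is an illustration only and is not the logic of `scripts/acceptance_gate.py`, which may check more. It assumes each report is a flat JSON object with the keys named above, and that the GRPO report accepts the same two backend values as the SFT report (an assumption; the list above does not spell them out).

```python
import json
from pathlib import Path
from typing import Dict, List

# The GRPO set is assumed to match the SFT set named in the criteria.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}
REPORT_NAMES = ["sft_trl_run.json", "grpo_trl_run.json",
                "postsave_inference.json", "improvement_report.json"]


def load_reports(report_dir: str = "outputs/reports") -> Dict[str, dict]:
    """Load each report; a missing file becomes an empty dict and fails checks."""
    reports = {}
    for name in REPORT_NAMES:
        path = Path(report_dir) / name
        reports[name] = json.loads(path.read_text()) if path.exists() else {}
    return reports


def check_reports(reports: Dict[str, dict]) -> List[str]:
    """Return a list of failure messages; an empty list means all criteria hold."""
    failures = []
    sft = reports.get("sft_trl_run.json", {})
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft: backend is not trl_unsloth or trl_transformers")
    grpo = reports.get("grpo_trl_run.json", {})
    if grpo.get("status") != "ok":
        failures.append("grpo: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo: backend is not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo: artifact_path is empty")
    post = reports.get("postsave_inference.json", {})
    if post.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave: model_source is missing or fallback_policy")
    improvement = reports.get("improvement_report.json", {})
    if improvement.get("improved") is not True:
        failures.append("improvement: improved is not true")
    return failures
```

A typical use is `print(check_reports(load_reports()))` from the repository root; any non-empty result names the criterion that still fails.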

After replacing the smoke-run artifacts with the final ones, run the strict gate:

POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py

Scaling Guidance

Start with small profiles and a low max-steps value. Once the environment's reset, step, reward, and logging paths are stable, increase the prompt count, GRPO steps, generation count, and environment diversity. Before each increase, inspect sampled generations for signs of reward hacking.
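One rough way to triage sampled generations before scaling is to flag completions that score highly yet look degenerate. The sketch below is a heuristic illustration, not a check the project ships: the record shape (`completion`, `reward`) and all thresholds are assumptions, and flagged items still need to be read by a person.

```python
from collections import Counter
from typing import Dict, List


def flag_suspicious(samples: List[Dict], reward_threshold: float = 0.8,
                    min_chars: int = 20, max_repeats: int = 3) -> List[Dict]:
    """Flag high-reward completions that are very short or repeated verbatim."""
    counts = Counter(sample["completion"].strip() for sample in samples)
    flagged = []
    for sample in samples:
        text = sample["completion"].strip()
        if sample["reward"] < reward_threshold:
            continue  # only high-reward samples are candidates for reward hacking
        reasons = []
        if len(text) < min_chars:
            reasons.append("very short")
        if counts[text] >= max_repeats:
            reasons.append("repeated verbatim")
        if reasons:
            flagged.append({"completion": text, "reward": sample["reward"],
                            "reasons": reasons})
    return flagged
```

A rising share of flagged samples across training steps suggests the policy is optimizing the verifier rather than the task.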