Training

End-to-End Loop

  1. Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
  2. Train SFT adapter with TRL and optional Unsloth.
  3. Train GRPO policy with environment-backed verifier reward.
  4. Run policy-stack ablations.
  5. Merge/export adapters safely.
  6. Validate post-save inference from saved artifacts.
  7. Generate plots and benchmark reports.

TRL Source Of Truth

Training entrypoints require Hugging Face TRL by default. Fallback backends are strictly opt-in: pass --allow-fallback or set POLYGUARD_ALLOW_TRAIN_FALLBACK=true.
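
The opt-in rule can be sketched as a small gate. This is an illustration under stated assumptions, not the project's code: the helper name fallback_allowed and the case-insensitive comparison are hypothetical; only the flag and environment variable names come from the text.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "POLYGUARD_ALLOW_TRAIN_FALLBACK"  # name taken from the text above


def fallback_allowed(cli_flag: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when a fallback backend was explicitly requested.

    Hypothetical helper: cli_flag stands for the parsed --allow-fallback
    switch. Comparing the env value case-insensitively is an assumption.
    """
    env = os.environ if env is None else env
    return cli_flag or env.get(ENV_VAR, "").strip().lower() == "true"
```

With neither the flag nor the variable set, the function returns False, which matches the TRL-by-default behavior described above.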

Local Smoke Commands

.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
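
To run the smoke commands above as one fail-fast sequence, a small driver along these lines may help. It is a sketch only: build_smoke_commands and run_all are hypothetical names, and the argument lists are copied from the commands above rather than from any project API.

```python
import subprocess
from typing import Callable, List, Sequence

PYTHON = ".venv/bin/python"
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"


def build_smoke_commands(python: str = PYTHON, model: str = MODEL) -> List[List[str]]:
    """Return the six smoke commands as argv lists, in pipeline order."""
    return [
        [python, "scripts/build_training_corpus.py", "--profile", "small",
         "--with-local", "--with-synthetic", "--with-hf"],
        [python, "scripts/train_sft_trl.py", "--model-id", model, "--epochs", "1",
         "--max-steps", "20", "--report-path", "outputs/reports/sft_trl_run.json",
         "--use-unsloth"],
        [python, "scripts/train_grpo_trl.py", "--model-id", model,
         "--max-steps", "20", "--num-generations", "2", "--use-unsloth"],
        [python, "scripts/evaluate_policy_ablations.py", "--episodes", "6"],
        [python, "scripts/merge_adapters_safe.py", "--adapter-dir",
         "checkpoints/sft_adapter", "--output-dir", "checkpoints/merged"],
        [python, "scripts/test_inference_postsave.py", "--samples", "3"],
    ]


def run_all(commands: Sequence[Sequence[str]],
            runner: Callable[..., object] = subprocess.run) -> int:
    """Run commands in order; check=True raises on the first non-zero exit."""
    for argv in commands:
        runner(list(argv), check=True)
    return len(commands)
```

Calling run_all(build_smoke_commands()) from the repository root stops at the first failing step, so a broken SFT run never feeds a stale adapter into the merge or post-save checks.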

Full HF Space Sweep

The final GPU path is a Hugging Face Docker Space, not local Ollama or local GPU training.

export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_training_space.py \
  --repo-id TheJackBright/polyguard-openenv-training-full \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --hardware a10g-large \
  --model-sweep Qwen/Qwen2.5-0.5B-Instruct,Qwen/Qwen2.5-1.5B-Instruct,Qwen/Qwen2.5-3B-Instruct \
  --sft-epochs 2 \
  --grpo-epochs 1 \
  --sft-max-steps 0 \
  --grpo-max-steps 0 \
  --grpo-max-prompts 0

The training runner builds the full corpus with --profile massive --with-local --with-synthetic --with-hf. For each Qwen model it trains SFT as the baseline and GRPO as the improved environment-backed policy, then writes isolated sweep artifacts under outputs/reports/sweeps/<model>/ and checkpoints/sweeps/<model>/.
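
The per-model layout can be sketched as follows. The slug rule is an assumption inferred from the run id qwen-qwen2-5-0-5b-instruct used in the pull command in this document, and it is also an assumption that <model> in the paths is that same slug. Both function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict


def run_id_for(model_id: str) -> str:
    """Lowercase the model id and collapse non-alphanumeric runs to '-'.

    Assumed rule, inferred from the example run id in this document.
    """
    return re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")


def sweep_dirs(model_id: str) -> Dict[str, Path]:
    """Return the per-model report and checkpoint directories."""
    run_id = run_id_for(model_id)
    return {
        "reports": Path("outputs/reports/sweeps") / run_id,
        "checkpoints": Path("checkpoints/sweeps") / run_id,
    }
```

Keeping one directory per run id means a later model in the sweep cannot overwrite the reports or checkpoints of an earlier one.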

Status snapshot from April 26, 2026:

  • TheJackBright/polyguard-openenv-training-full was running on a10g-large.
  • Qwen 0.5B SFT and GRPO had completed inside the Space.
  • Qwen 1.5B SFT had completed and Qwen 1.5B GRPO was running.
  • Qwen 3B had not been interrupted and was expected to continue after 1.5B.
  • TheJackBright/polyguard-openenv-training-full-artifacts had not yet received the exported files, so run files could not be pulled until the Space reached the upload stage.

The run-specific pull command is:

.venv/bin/python scripts/pull_sweep_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --run-id qwen-qwen2-5-0-5b-instruct

Final comparison and safety artifacts:

  • hf_sweep_summary.json
  • anti_hacking_overfit_report.json
  • sft_vs_grpo_reward.png
  • sft_loss_curves.png
  • grpo_reward_curves.png
  • qwen_model_grpo_reward.png
  • reward_component_bars.png
  • anti_cheat_failure_rates.png
  • train_holdout_gap.png
  • inference_validity_reward.png
  • inference_latency_validity.png

Completed runs must use trl_unsloth or trl_transformers; fallback SFT/GRPO or fallback post-save inference fails the pull-time checks.

Active Product Model

After a sweep run has been pulled, activate it for the API/UI:

.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter

While the remote full sweep is still running, the app can be tested with the local Qwen 0.5B smoke artifact:

.venv/bin/python scripts/activate_sweep_model.py \
  --source top-level \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter \
  --label local-qwen-0.5b-active-smoke

This writes checkpoints/active/active_model_manifest.json, mirrors the manifest to docs/results/active_model_manifest.json, and lets /policy/model_status report which artifact is active. The provider load order is GRPO adapter first, merged SFT artifact second, then SFT adapter.
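
The load order can be illustrated with a short selection function. This is a sketch, not the provider's implementation: grpo_adapter is the artifact name used in the commands above, while merged_sft and sft_adapter are hypothetical labels chosen here for the other two artifacts.

```python
from typing import Iterable, Optional, Sequence

# Order stated in the text: GRPO adapter, then merged SFT, then SFT adapter.
# "grpo_adapter" appears in the commands above; the other two keys are
# hypothetical labels chosen for this sketch.
LOAD_ORDER: Sequence[str] = ("grpo_adapter", "merged_sft", "sft_adapter")


def pick_artifact(available: Iterable[str],
                  order: Sequence[str] = LOAD_ORDER) -> Optional[str]:
    """Return the highest-priority artifact that is available, else None."""
    present = set(available)
    for name in order:
        if name in present:
            return name
    return None
```

For example, if only the merged SFT artifact and the SFT adapter exist on disk, the merged artifact is chosen.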

Final Judge-Ready Criteria

The final accepted reports must satisfy:

  • outputs/reports/sft_trl_run.json: backend is trl_unsloth or trl_transformers.
  • outputs/reports/grpo_trl_run.json: status == "ok", accepted backend, non-empty artifact_path.
  • outputs/reports/postsave_inference.json: model_source is not fallback_policy.
  • outputs/reports/improvement_report.json: improved == true.
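
The four criteria above can be checked in one pass once the reports are loaded. The sketch below assumes each report is a flat JSON object with the listed keys at the top level, and that the GRPO report names its backend field backend like the SFT report does. The function name check_reports is hypothetical, and this is not the logic of scripts/acceptance_gate.py.

```python
from typing import Any, Dict, List

ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def check_reports(sft: Dict[str, Any], grpo: Dict[str, Any],
                  postsave: Dict[str, Any], improvement: Dict[str, Any]) -> List[str]:
    """Return failure messages; an empty list means every criterion passed."""
    failures = []
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft: backend not accepted")
    if grpo.get("status") != "ok":
        failures.append("grpo: status is not ok")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:  # assumed key name
        failures.append("grpo: backend not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo: artifact_path is empty")
    if postsave.get("model_source") == "fallback_policy":
        failures.append("postsave: model_source is fallback_policy")
    if improvement.get("improved") is not True:
        failures.append("improvement: improved is not true")
    return failures
```

Each argument would typically come from json.load on the matching file under outputs/reports/. Returning all failures at once, rather than stopping at the first, makes a single run enough to see everything that still blocks acceptance.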

Run the strict gate after replacing smoke artifacts:

POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py

Scaling Guidance

Start with small profiles and short max-step limits. Once reset, step, reward, and logging are stable, use max_steps <= 0 to run full-epoch SFT/GRPO over the selected corpus. Before treating a run as final, inspect sampled generations, candidate diversity, legality, the train-holdout reward gap, and anti-cheat rates.
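
As a rough guide to what max_steps <= 0 implies for run length, the step budget can be estimated as below. This is a simplified sketch under assumptions: it treats a non-positive value as "no cap, train for all epochs", ignores gradient accumulation and multi-device batching, and resolve_total_steps is a hypothetical name.

```python
import math


def resolve_total_steps(max_steps: int, num_examples: int,
                        batch_size: int, epochs: int) -> int:
    """Return the step budget: the cap if positive, else full epochs."""
    if max_steps > 0:
        return max_steps
    # A partial final batch still counts as one step per epoch.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs
```

With 1000 examples, batch size 8, and 2 epochs, a cap of 20 yields 20 steps, while a cap of 0 yields 250 steps, which shows how much longer a full-epoch run is than a smoke run.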