polyguard-openenv / docs /training.md

TheJackBright

Deploy PolyGuard OpenEnv Space

877add7 verified 12 days ago

Training

End-to-End Loop

Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
Train SFT adapter with TRL and optional Unsloth.
Train GRPO policy with environment-backed verifier reward.
Run policy-stack ablations.
Merge/export adapters safely.
Validate post-save inference from saved artifacts.
Generate plots and benchmark reports.
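Step 3 takes its reward from the environment rather than from a learned reward model. The sketch below illustrates that shape only. `ToyVerifierEnv`, its JSON action format, and the `ALLOWED_ACTIONS` set are hypothetical stand-ins, not taken from the PolyGuard code, and completions are assumed to be plain strings. The `reward_fn(completions, **kwargs)` signature returning one float per completion is the form TRL's `GRPOTrainer` accepts for custom reward functions.

```python
import json

# Hypothetical action set; the real environment defines its own.
ALLOWED_ACTIONS = {"approve", "flag", "escalate"}


class ToyVerifierEnv:
    """Hypothetical stand-in for the real environment (reset/step only)."""

    def reset(self, prompt: str) -> str:
        self.prompt = prompt
        return prompt

    def step(self, completion: str) -> float:
        # Reward 1.0 for a well-formed JSON object naming an allowed action.
        try:
            action = json.loads(completion).get("action")
        except (json.JSONDecodeError, AttributeError):
            return 0.0
        if isinstance(action, str) and action in ALLOWED_ACTIONS:
            return 1.0
        return 0.0


def reward_fn(completions, prompts=None, **kwargs):
    """Score each completion by stepping the environment once."""
    env = ToyVerifierEnv()
    prompts = prompts or [""] * len(completions)
    rewards = []
    for prompt, completion in zip(prompts, completions):
        env.reset(prompt)
        rewards.append(env.step(completion))
    return rewards
```

Because the reward is a pure function of the environment's verdict, the same `step` logic can be reused for evaluation and for spot-checking sampled generations.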

TRL Source Of Truth

Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via --allow-fallback or POLYGUARD_ALLOW_TRAIN_FALLBACK=true.
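One way the opt-in rule could be resolved is sketched below. Only the `--allow-fallback` flag and the `POLYGUARD_ALLOW_TRAIN_FALLBACK` variable come from this document; the function names, the set of truthy spellings, and the `"trl"`/`"fallback"` labels are illustrative assumptions, not the scripts' actual implementation.

```python
import argparse
import os

# Assumed truthy spellings; the document only shows "true".
TRUTHY = {"1", "true", "yes"}


def fallback_allowed(argv, env=None):
    """Return True only if the flag or the environment variable opts in."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # parse_known_args ignores the other training flags on the command line.
    args, _unknown = parser.parse_known_args(argv)
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "")
    return args.allow_fallback or env_value.strip().lower() in TRUTHY


def select_backend(trl_available, argv, env=None):
    """Prefer TRL; use a fallback only when explicitly allowed (hypothetical labels)."""
    if trl_available:
        return "trl"
    if fallback_allowed(argv, env):
        return "fallback"
    raise RuntimeError(
        "TRL is required; pass --allow-fallback to opt in to a fallback backend"
    )
```

Failing loudly when TRL is missing keeps a silently degraded run from producing reports that look like real training results.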

Local Smoke Commands

.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
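The six commands above can also be chained from a small driver that stops at the first failure. This is a convenience sketch, not a script shipped with the repository: `run_smoke` and its `dry_run` parameter are hypothetical, while the script paths and flags are copied from the commands above.

```python
import subprocess

PYTHON = ".venv/bin/python"
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

# Script paths and flags mirror the smoke commands listed above.
SMOKE_STEPS = [
    ["scripts/build_training_corpus.py", "--profile", "small",
     "--with-local", "--with-synthetic", "--with-hf"],
    ["scripts/train_sft_trl.py", "--model-id", MODEL_ID,
     "--epochs", "1", "--max-steps", "20", "--use-unsloth"],
    ["scripts/train_grpo_trl.py", "--model-id", MODEL_ID,
     "--max-steps", "20", "--num-generations", "2", "--use-unsloth"],
    ["scripts/evaluate_policy_ablations.py", "--episodes", "6"],
    ["scripts/merge_adapters_safe.py", "--adapter-dir", "checkpoints/sft_adapter",
     "--output-dir", "checkpoints/merged"],
    ["scripts/test_inference_postsave.py", "--samples", "3"],
]


def run_smoke(dry_run=True):
    """Print each command in order; execute it unless dry_run is set."""
    commands = [[PYTHON, *step] for step in SMOKE_STEPS]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, halting on the first failure.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_smoke(dry_run=False)` from the repository root would execute the steps in sequence; the default dry run only prints them.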

Final Judge-Ready Criteria

The final accepted reports must satisfy:

outputs/reports/sft_trl_run.json: backend is trl_unsloth or trl_transformers.
outputs/reports/grpo_trl_run.json: status == "ok", backend is an accepted backend, and artifact_path is non-empty.
outputs/reports/postsave_inference.json: model_source is not fallback_policy.
outputs/reports/improvement_report.json: improved == true.
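The four criteria can be checked programmatically before running the gate. The sketch below is not the logic of `scripts/acceptance_gate.py`; `check_reports` and `load_reports` are hypothetical helpers. It assumes the GRPO report accepts the same backend values as the SFT report, and it treats a missing `model_source` as a failure.

```python
import json
from pathlib import Path

# Assumption: GRPO accepts the same backends as listed for the SFT report.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def check_reports(reports):
    """reports maps report file name -> parsed JSON dict. Returns failure messages."""
    failures = []

    sft = reports.get("sft_trl_run.json", {})
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft_trl_run.json: backend not accepted")

    grpo = reports.get("grpo_trl_run.json", {})
    if grpo.get("status") != "ok":
        failures.append("grpo_trl_run.json: status is not ok")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo_trl_run.json: backend not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo_trl_run.json: artifact_path is empty")

    post = reports.get("postsave_inference.json", {})
    # A missing model_source is treated as a failure (assumption).
    if post.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave_inference.json: model_source is missing or fallback_policy")

    improvement = reports.get("improvement_report.json", {})
    if improvement.get("improved") is not True:
        failures.append("improvement_report.json: improved is not true")

    return failures


def load_reports(report_dir="outputs/reports"):
    """Read every JSON report in report_dir into a dict keyed by file name."""
    return {p.name: json.loads(p.read_text()) for p in Path(report_dir).glob("*.json")}
```

An empty list from `check_reports(load_reports())` means all four criteria hold under these assumptions.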

Run the strict gate after replacing smoke artifacts:

POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py

Scaling Guidance

Start with small profiles and a low --max-steps value. Once reset, step, reward, and logging are stable, increase prompt count, GRPO steps, generation count, and environment diversity. Before scaling, inspect sampled generations for reward hacking.
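A lightweight way to inspect generations is to pull the highest-reward samples and flag patterns that often accompany reward hacking. The heuristics below (very short completions, exact duplicates) and the `flag_suspicious` helper are illustrative assumptions; a manual read of the flagged samples is still needed.

```python
from collections import Counter


def flag_suspicious(samples, top_k=5, min_chars=20):
    """samples: list of (completion, reward) pairs.

    Returns the top_k highest-reward samples, each with heuristic flags.
    """
    counts = Counter(text.strip() for text, _ in samples)
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)[:top_k]
    report = []
    for text, reward in ranked:
        flags = []
        if len(text.strip()) < min_chars:
            flags.append("very_short")  # high reward for almost no content
        if counts[text.strip()] > 1:
            flags.append("duplicate")   # same output repeated across samples
        report.append({"completion": text, "reward": reward, "flags": flags})
    return report
```

If most top-reward samples carry flags, the reward is probably being satisfied by a shortcut, and scaling up would amplify it.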