Training
End-to-End Loop
- Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
- Train SFT adapter with TRL and optional Unsloth.
- Train GRPO policy with environment-backed verifier reward.
- Run policy-stack ablations.
- Merge/export adapters safely.
- Validate post-save inference from saved artifacts.
- Generate plots and benchmark reports.
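The verifier-reward step in the loop above can be sketched as a plain Python callable. TRL's GRPO trainer accepts custom reward functions that receive the sampled completions as a keyword argument and return one float per completion. The names `make_verifier_reward` and `toy_verify` are hypothetical and are not taken from this repository; the real environment-backed verifier will be more involved than this string check.

```python
def make_verifier_reward(verify):
    """Wrap a verifier callable (text -> bool) into a TRL-style reward function."""

    def reward_func(completions, **kwargs):
        rewards = []
        for completion in completions:
            # Plain-text datasets yield strings; conversational datasets yield
            # a list of message dicts, where the last message is the reply.
            if isinstance(completion, str):
                text = completion
            else:
                text = completion[-1]["content"]
            rewards.append(1.0 if verify(text) else 0.0)
        return rewards

    return reward_func


def toy_verify(text):
    # Hypothetical stand-in for the environment check: accept completions
    # whose final token is the literal verdict "SAFE".
    return text.strip().endswith("SAFE")


reward_func = make_verifier_reward(toy_verify)
```

A function of this shape could be passed to `GRPOTrainer` through its `reward_funcs` argument. A binary 0/1 reward is the simplest choice; a graded score from the environment would give the policy a denser signal.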
TRL Source Of Truth
- https://huggingface.co/docs/trl/index
- https://huggingface.co/docs/trl/grpo_trainer
- https://huggingface.co/docs/trl/openenv
Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via --allow-fallback or POLYGUARD_ALLOW_TRAIN_FALLBACK=true.
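As an illustration of the opt-in rule above, the following minimal sketch resolves the fallback setting from either the flag or the environment variable. The helper name `fallback_allowed` is hypothetical, and treating only the value "true" (case-insensitively) as enabling is an assumption; the actual entrypoints may parse the variable differently.

```python
import argparse
import os


def fallback_allowed(argv=None, env=None):
    """Return True only when a fallback backend was explicitly requested."""
    env = os.environ if env is None else env
    # allow_abbrev=False so a shortened flag such as --allow is not accepted.
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # parse_known_args ignores the other training flags on the command line.
    args, _unknown = parser.parse_known_args(argv)
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "")
    # Assumption: only the value "true" enables the fallback.
    return args.allow_fallback or env_value.strip().lower() == "true"
```

With neither the flag nor the variable set, the function returns False, which matches the default of requiring TRL.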
Local Smoke Commands
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
Final Judge-Ready Criteria
The final accepted reports must satisfy:
- outputs/reports/sft_trl_run.json: backend is trl_unsloth or trl_transformers.
- outputs/reports/grpo_trl_run.json: status == "ok", accepted backend, non-empty artifact_path.
- outputs/reports/postsave_inference.json: model_source is not fallback_policy.
- outputs/reports/improvement_report.json: improved == true.
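The criteria above can be checked by hand with a short script before running the gate. This is a sketch under assumptions, not the logic of scripts/acceptance_gate.py: the helper names are hypothetical, the GRPO report is assumed to store its backend under a `backend` key with the same accepted values as the SFT report, and a missing `model_source` is treated as a failure.

```python
import json
from pathlib import Path

# Assumption: GRPO accepts the same backends as SFT.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}
REPORT_FILES = (
    "sft_trl_run.json",
    "grpo_trl_run.json",
    "postsave_inference.json",
    "improvement_report.json",
)


def load_reports(report_dir="outputs/reports"):
    """Read each report file; a missing file becomes an empty dict."""
    reports = {}
    for name in REPORT_FILES:
        path = Path(report_dir) / name
        reports[name] = json.loads(path.read_text()) if path.exists() else {}
    return reports


def check_reports(reports):
    """Return a list of failure messages; an empty list means all criteria hold."""
    failures = []
    sft = reports.get("sft_trl_run.json", {})
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft_trl_run.json: backend is not an accepted TRL backend")
    grpo = reports.get("grpo_trl_run.json", {})
    if grpo.get("status") != "ok":
        failures.append("grpo_trl_run.json: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo_trl_run.json: backend is not accepted")
    if not grpo.get("artifact_path"):
        failures.append("grpo_trl_run.json: artifact_path is empty")
    post = reports.get("postsave_inference.json", {})
    if post.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave_inference.json: model_source missing or fallback")
    if reports.get("improvement_report.json", {}).get("improved") is not True:
        failures.append("improvement_report.json: improved is not true")
    return failures
```

Calling `check_reports(load_reports())` from the repository root prints nothing by itself; inspect the returned list, which is empty when every criterion is met.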
Run the strict gate after replacing smoke artifacts:
POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
Scaling Guidance
Start with small profiles and a low max-steps setting. Once reset, step, reward, and logging are all stable, increase prompt count, GRPO steps, generation count, and environment diversity. Inspect sampled generations for reward hacking before scaling.
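One way to carry out the inspection step is to pull the highest-reward generations, where reward hacking tends to show up first, together with a few random ones for contrast. The sketch below assumes each logged generation is a dict with `completion` and `reward` keys; that record shape and the helper name `sample_for_review` are hypothetical and not taken from the project's logging format.

```python
import random


def sample_for_review(records, k=3, seed=0):
    """Return the top-k generations by reward plus up to k random others."""
    ranked = sorted(records, key=lambda r: r["reward"], reverse=True)
    top = ranked[:k]
    rest = ranked[k:]
    # A fixed seed keeps the review sample reproducible between runs.
    rng = random.Random(seed)
    extra = rng.sample(rest, min(k, len(rest)))
    return top + extra
```

Reading the top-ranked completions by eye is the point: a completion that scores well while ignoring the task is a sign that the verifier, not the policy, needs work before step counts go up.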