# Training

## End-to-End Loop

1. Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Train SFT adapter with TRL and optional Unsloth.
3. Train GRPO policy with environment-backed verifier reward.
4. Run policy-stack ablations.
5. Merge/export adapters safely.
6. Validate post-save inference from saved artifacts.
7. Generate plots and benchmark reports.
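
Step 3 relies on a reward function that scores each completion through the environment. As a rough sketch only: TRL's `GRPOTrainer` accepts custom reward functions that receive the batch of completions and return one float per completion. The `verify_action` helper, the JSON action format, and the `allow`/`flag`/`block` decision names below are hypothetical stand-ins, not the project's actual verifier.

```python
import json


def verify_action(action: dict) -> float:
    """Hypothetical stand-in for the environment-backed verifier.

    A real verifier would step the environment; this one only checks
    that the action names a known decision, purely for illustration.
    """
    return 1.0 if action.get("decision") in {"allow", "flag", "block"} else 0.0


def _completion_text(completion) -> str:
    # TRL passes plain strings for standard prompts and a list of
    # message dicts for conversational prompts.
    if isinstance(completion, str):
        return completion
    return completion[-1]["content"]


def verifier_reward(completions, **kwargs) -> list[float]:
    """Return one reward per completion, the shape GRPOTrainer expects."""
    rewards = []
    for completion in completions:
        try:
            action = json.loads(_completion_text(completion))
        except (json.JSONDecodeError, TypeError):
            rewards.append(-1.0)  # unparseable output is penalized
            continue
        if not isinstance(action, dict):
            rewards.append(-1.0)  # valid JSON, but not an action object
            continue
        rewards.append(verify_action(action))
    return rewards
```

A function of this shape would typically be passed to the trainer as `reward_funcs=[verifier_reward]`.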

## TRL Source Of Truth

- https://huggingface.co/docs/trl/index
- https://huggingface.co/docs/trl/grpo_trainer
- https://huggingface.co/docs/trl/openenv

Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only, via `--allow-fallback` or `POLYGUARD_ALLOW_TRAIN_FALLBACK=true`.
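
The opt-in rule can be illustrated with a minimal sketch. The flag and environment variable names come from this document; the helper names, the case-insensitive match on `true`, and the `"trl"`/`"fallback"` return values are assumptions made for illustration and may differ from what the training scripts actually do.

```python
import argparse
import os


def fallback_allowed(argv=None, env=None) -> bool:
    """Return True only when a fallback backend was explicitly requested."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--allow-fallback", action="store_true")
    # parse_known_args ignores the other training flags on the command line.
    args, _unknown = parser.parse_known_args(argv)
    env_value = env.get("POLYGUARD_ALLOW_TRAIN_FALLBACK", "")
    # Assumption: the variable is matched case-insensitively against "true".
    return args.allow_fallback or env_value.strip().lower() == "true"


def select_backend(trl_available: bool, allow_fallback: bool) -> str:
    """Prefer TRL; use a fallback only on explicit opt-in; otherwise fail loudly."""
    if trl_available:
        return "trl"  # illustrative label, not a real backend name
    if allow_fallback:
        return "fallback"
    raise RuntimeError(
        "TRL is required; pass --allow-fallback or set "
        "POLYGUARD_ALLOW_TRAIN_FALLBACK=true to opt in to a fallback backend."
    )
```

Failing loudly when TRL is missing keeps a silent fallback run from being mistaken for a real training run.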

## Local Smoke Commands

```bash
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
```
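
If it is more convenient to run the smoke sequence as one unit, a small wrapper along these lines could run the commands in order and stop at the first failure. The script paths and flags are copied from the commands above; the `run_smoke` helper and its `runner` parameter are hypothetical and not part of the repository.

```python
import subprocess

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"

# Script paths and flags copied from the smoke commands above.
SMOKE_STEPS = [
    ["scripts/build_training_corpus.py", "--profile", "small",
     "--with-local", "--with-synthetic", "--with-hf"],
    ["scripts/train_sft_trl.py", "--model-id", MODEL_ID,
     "--epochs", "1", "--max-steps", "20", "--use-unsloth"],
    ["scripts/train_grpo_trl.py", "--model-id", MODEL_ID,
     "--max-steps", "20", "--num-generations", "2", "--use-unsloth"],
    ["scripts/evaluate_policy_ablations.py", "--episodes", "6"],
    ["scripts/merge_adapters_safe.py", "--adapter-dir", "checkpoints/sft_adapter",
     "--output-dir", "checkpoints/merged"],
    ["scripts/test_inference_postsave.py", "--samples", "3"],
]


def run_smoke(steps=SMOKE_STEPS, python=".venv/bin/python", runner=subprocess.run):
    """Run each step in order; check=True aborts on the first non-zero exit."""
    for step in steps:
        cmd = [python, *step]
        print("running:", " ".join(cmd))
        runner(cmd, check=True)
```

Calling `run_smoke()` from the repository root would execute the six steps sequentially. The `runner` parameter exists only so the sequence can be dry-run in tests.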

## Final Judge-Ready Criteria

The final accepted reports must satisfy:

- `outputs/reports/sft_trl_run.json`: backend is `trl_unsloth` or `trl_transformers`.
- `outputs/reports/grpo_trl_run.json`: `status == "ok"`, accepted backend, non-empty `artifact_path`.
- `outputs/reports/postsave_inference.json`: `model_source` is not `fallback_policy`.
- `outputs/reports/improvement_report.json`: `improved == true`.

Run the strict gate after replacing smoke artifacts:

```bash
POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
```
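
For a quick local check before running the gate, the criteria above can be expressed as a small function. This is a sketch, not the logic of `scripts/acceptance_gate.py`. It assumes each report is a flat JSON object with top-level `backend`, `status`, `artifact_path`, `model_source`, and `improved` keys, and that the GRPO report accepts the same two backends as the SFT report.

```python
import json
from pathlib import Path

# Assumption: GRPO accepts the same backends listed for the SFT report.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}
REPORT_NAMES = ["sft_trl_run", "grpo_trl_run", "postsave_inference", "improvement_report"]


def load_reports(report_dir="outputs/reports") -> dict:
    """Load each report as a dict; a missing file becomes an empty dict."""
    reports = {}
    for name in REPORT_NAMES:
        path = Path(report_dir) / f"{name}.json"
        reports[name] = json.loads(path.read_text()) if path.exists() else {}
    return reports


def check_reports(reports: dict) -> list[str]:
    """Return a list of failed criteria; an empty list means all passed."""
    failures = []
    sft = reports.get("sft_trl_run", {})
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft_trl_run: backend is not an accepted TRL backend")
    grpo = reports.get("grpo_trl_run", {})
    if grpo.get("status") != "ok":
        failures.append("grpo_trl_run: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo_trl_run: backend is not an accepted TRL backend")
    if not grpo.get("artifact_path"):
        failures.append("grpo_trl_run: artifact_path is empty")
    post = reports.get("postsave_inference", {})
    if post.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave_inference: model_source is missing or fallback_policy")
    if reports.get("improvement_report", {}).get("improved") is not True:
        failures.append("improvement_report: improved is not true")
    return failures
```

Running `check_reports(load_reports())` from the repository root lists every unmet criterion at once, which can be easier to act on than stopping at the first failure.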

## Scaling Guidance

Start with small profiles and short max steps. Once the environment reset, step, reward, and logging paths are stable, increase prompt count, GRPO steps, generation count, and environment diversity. Inspect sampled generations for reward hacking before scaling.
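
One way to make the inspection step concrete is to review the highest-reward generations alongside a random sample, since reward hacking tends to surface among completions that score well for the wrong reasons. The sketch below assumes generations have been logged as dicts with `prompt`, `completion`, and `reward` keys. That record format and the `pick_for_review` helper are illustrative, not something the training scripts are known to produce.

```python
import random


def pick_for_review(records, top_k=5, random_k=5, seed=0):
    """Pick the top-reward generations plus a random sample of the rest."""
    ranked = sorted(records, key=lambda r: r["reward"], reverse=True)
    top = ranked[:top_k]
    rest = ranked[top_k:]
    # A fixed seed keeps the random sample reproducible between reviews.
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(random_k, len(rest)))
    return top + sampled


def print_for_review(records):
    """Print each picked record so a person can read the completion itself."""
    for record in records:
        print(f"reward={record['reward']:.2f}")
        print("prompt:", record["prompt"])
        print("completion:", record["completion"])
        print("-" * 40)
```

Reading the completions themselves, rather than only the reward curve, is the point: a rising reward with degenerate or templated outputs is a signal to fix the verifier before increasing steps or generation count.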