# Training
# Training

## End-to-End Loop

1. Build training corpus from local structured data, synthetic episodes, optional HF instruction data, optional DDI API augmentation, and optional web fallback.
2. Train SFT adapter with TRL and optional Unsloth.
3. Train GRPO policy with environment-backed verifier reward.
4. Run policy-stack ablations.
5. Merge/export adapters safely.
6. Validate post-save inference from saved artifacts.
7. Generate plots and benchmark reports.

## TRL Source Of Truth

- https://huggingface.co/docs/trl/index
- https://huggingface.co/docs/trl/grpo_trainer
- https://huggingface.co/docs/trl/openenv

Training entrypoints require Hugging Face TRL by default. Fallback backends are opt-in only via `--allow-fallback` or `POLYGUARD_ALLOW_TRAIN_FALLBACK=true`.
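
As an illustration of the opt-in rule above, here is a minimal sketch of how an entrypoint might decide whether a fallback backend is permitted. The function name `fallback_allowed` and the case-insensitive handling of `true` are assumptions for illustration, not the repository's actual implementation; only the flag and environment variable names come from this document.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "POLYGUARD_ALLOW_TRAIN_FALLBACK"


def fallback_allowed(cli_flag: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when fallback was explicitly requested.

    cli_flag: True if --allow-fallback was passed on the command line.
    env: environment mapping; defaults to os.environ (hypothetical helper).
    """
    env = os.environ if env is None else env
    # Assumption: only the literal value "true" (any case) enables fallback.
    from_env = env.get(ENV_VAR, "").strip().lower() == "true"
    return bool(cli_flag) or from_env
```

With neither the flag nor the variable set, the helper returns `False`, which matches the TRL-by-default behaviour described above.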

## Shared Submission Artifacts

The environment files, training scripts, notebooks, and logs/results required
for review are indexed in [Submission Artifact Index](submission_artifacts.md).

Key shared files:

- Environment/runtime: `openenv.yaml`, `pyproject.toml`, `uv.lock`, `requirements*.txt`, `Dockerfile*`, `app/env/`, `server/app.py`, and `app/hf_space/Dockerfile`.
- Training scripts: `scripts/train_sft_trl.py`, `scripts/train_grpo_trl.py`, `scripts/deploy_training_space.py`, `app/hf_space/training_runner.py`, and `app/training/`.
- Training notebooks: `PolyGuard_SFT_GRPO_One_Run_Runner.ipynb` and `notebooks/09_training_loop.ipynb`.
- Training logs/results: `docs/results/final_submission_evidence/reports/`, `docs/results/sweeps/`, `docs/results/submission_evidence_qwen_0_5b_1_5b_3b/reports/`, and `docs/results/qwen_completed_runs/reports/`.

## Local Smoke Commands

```bash
.venv/bin/python scripts/build_training_corpus.py --profile small --with-local --with-synthetic --with-hf
.venv/bin/python scripts/train_sft_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --epochs 1 --max-steps 20 --report-path outputs/reports/sft_trl_run.json --use-unsloth
.venv/bin/python scripts/train_grpo_trl.py --model-id Qwen/Qwen2.5-1.5B-Instruct --max-steps 20 --num-generations 2 --use-unsloth
.venv/bin/python scripts/evaluate_policy_ablations.py --episodes 6
.venv/bin/python scripts/merge_adapters_safe.py --adapter-dir checkpoints/sft_adapter --output-dir checkpoints/merged
.venv/bin/python scripts/test_inference_postsave.py --samples 3
```

## Full HF Space Sweep

The final GPU path is a Hugging Face Docker Space, not local Ollama or local GPU training.

The root-level one-run notebook is:

```text
PolyGuard_SFT_GRPO_One_Run_Runner.ipynb
```

Run it top to bottom for the complete data build, SFT baseline, GRPO training,
artifact pull, post-save inference validation, report/chart generation, and
product HF Space deployment path. Any required Hugging Face credentials are
provided by the runner environment or Space secret, not stored in the repo.

The training runner builds the full corpus with `--profile massive --with-local --with-synthetic --with-hf`, trains SFT as the baseline and GRPO as the improved environment-backed policy for each Qwen model, then writes isolated sweep artifacts under `outputs/reports/sweeps/<model>/` and `checkpoints/sweeps/<model>/`.
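
The `<model>` path segment is not defined in this document. The sketch below assumes it is a lowercase slug of the model ID, which is consistent with the run ID `qwen-qwen2-5-0-5b-instruct` used in the activation commands; the helper names `model_slug` and `sweep_dirs` are hypothetical.

```python
import re
from typing import Tuple


def model_slug(model_id: str) -> str:
    """Lowercase the model ID and collapse non-alphanumeric runs into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", model_id.lower()).strip("-")


def sweep_dirs(model_id: str) -> Tuple[str, str]:
    """Return (reports_dir, checkpoints_dir) for one model in the sweep.

    Assumption: the <model> directory name equals the slug above.
    """
    slug = model_slug(model_id)
    return f"outputs/reports/sweeps/{slug}", f"checkpoints/sweeps/{slug}"
```

For `Qwen/Qwen2.5-0.5B-Instruct` this slug rule yields `qwen-qwen2-5-0-5b-instruct`.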

The intermediate Space status no longer serves as the final public evidence.
Use `docs/results/final_submission_evidence/` instead for the completed Qwen
0.5B/1.5B SFT reports and the completed Qwen 3B SFT+GRPO reports, charts,
post-save inference, ablations, and artifact manifest.

Final comparison and safety artifacts:

- `hf_sweep_summary.json`
- `anti_hacking_overfit_report.json`
- `sft_vs_grpo_reward.png`
- `sft_loss_curves.png`
- `grpo_reward_curves.png`
- `qwen_model_grpo_reward.png`
- `reward_component_bars.png`
- `anti_cheat_failure_rates.png`
- `train_holdout_gap.png`
- `inference_validity_reward.png`
- `inference_latency_validity.png`

Completed runs must use `trl_unsloth` or `trl_transformers`; fallback SFT/GRPO or fallback post-save inference fails the pull-time checks.

## Active Product Model

After a sweep run has been pulled, activate it for the API/UI:

```bash
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter
```

While the remote full sweep is still running, the app can be tested with the local Qwen 0.5B smoke artifact:

```bash
.venv/bin/python scripts/activate_sweep_model.py \
  --source top-level \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter \
  --label local-qwen-0.5b-active-smoke
```

This writes `checkpoints/active/active_model_manifest.json`, mirrors the manifest to `docs/results/active_model_manifest.json`, and lets `/policy/model_status` report which artifact is active. The provider load order is GRPO adapter first, merged SFT artifact second, then SFT adapter.
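
The load order can be pictured as a first-match search over candidate artifact directories. This is a sketch only: the directory names `grpo_adapter`, `merged`, and `sft_adapter` are borrowed from the commands in this document, while the on-disk layout and the function `pick_artifact` are assumptions rather than the provider's actual code.

```python
from pathlib import Path
from typing import Optional, Sequence

# Priority order stated above: GRPO adapter, merged SFT artifact, SFT adapter.
LOAD_ORDER = ("grpo_adapter", "merged", "sft_adapter")


def pick_artifact(root: Path, order: Sequence[str] = LOAD_ORDER) -> Optional[Path]:
    """Return the first existing artifact directory under root, or None."""
    for name in order:
        candidate = root / name
        if candidate.is_dir():
            return candidate
    return None
```

Returning `None` leaves the caller to decide what happens when no trained artifact is present.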

## Final Judge-Ready Criteria

The final accepted reports must satisfy:

- `outputs/reports/sft_trl_run.json`: backend is `trl_unsloth` or `trl_transformers`.
- `outputs/reports/grpo_trl_run.json`: `status == "ok"`, accepted backend, non-empty `artifact_path`.
- `outputs/reports/postsave_inference.json`: `model_source` is not `fallback_policy`.
- `outputs/reports/improvement_report.json`: `improved == true`.
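
For a quick manual check, the criteria above can be expressed as a small validator over the four parsed reports. This is a hedged sketch, not a substitute for `scripts/acceptance_gate.py`: the key name `backend` and the treatment of a missing `model_source` as a failure are assumptions, and `check_reports` and `load_report` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def load_report(path: str) -> Dict[str, Any]:
    """Read one JSON report from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def check_reports(sft: Dict[str, Any], grpo: Dict[str, Any],
                  postsave: Dict[str, Any], improvement: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable failures; an empty list means all pass."""
    failures = []
    if sft.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("sft: backend is not an accepted TRL backend")
    if grpo.get("status") != "ok":
        failures.append("grpo: status is not 'ok'")
    if grpo.get("backend") not in ACCEPTED_BACKENDS:
        failures.append("grpo: backend is not an accepted TRL backend")
    if not grpo.get("artifact_path"):
        failures.append("grpo: artifact_path is empty")
    # Assumption: a missing model_source counts as a failure as well.
    if postsave.get("model_source") in (None, "fallback_policy"):
        failures.append("postsave: model_source is missing or fallback_policy")
    if improvement.get("improved") is not True:
        failures.append("improvement: improved is not true")
    return failures
```

Each argument would come from `load_report(...)` on the corresponding file under `outputs/reports/`.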

Run the strict gate after replacing smoke artifacts:

```bash
POLYGUARD_ENFORCE_SUBMISSION_LINKS=true .venv/bin/python scripts/acceptance_gate.py
```

## Scaling Guidance

Start with small corpus profiles and a low `--max-steps` value. Once reset, step, reward, and logging are stable, use `max_steps <= 0` for full-epoch SFT/GRPO over the selected corpus. Before treating a run as final, inspect sampled generations, candidate diversity, legality, the train-holdout reward gap, and anti-cheat rates.
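
One way to read the `max_steps <= 0` convention is as a normalization step before the trainer config is built. In Hugging Face `TrainingArguments`, `max_steps` defaults to `-1`, and a positive value overrides `num_train_epochs`. The helper below is a sketch under the assumption that the training scripts map any non-positive value onto that default; `resolve_max_steps` is a hypothetical name.

```python
def resolve_max_steps(cli_max_steps: int) -> int:
    """Map a CLI step budget onto the trainer's max_steps convention.

    Non-positive values become -1 (epoch-driven training); positive
    values are passed through as a hard step cap.
    """
    return cli_max_steps if cli_max_steps > 0 else -1
```

A smoke run with `--max-steps 20` therefore keeps its 20-step cap, while `0` or a negative value lets the epoch count decide the run length.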