# Submission Checklist

## Required Narrative

- Problem statement clearly states the capability gap: safe long-horizon polypharmacy action selection.
- Environment describes observation, action, state, episode termination, and OpenEnv endpoints.
- Agent capabilities cover medication reconciliation, evidence, graph safety, dosing, candidate generation, planning, critique, and explanation.
- Tasks cover DDI risk, safer substitutions, taper/deprescribing, precision dosing, missing-data recovery, and new-drug decomposition.
- Reward/evaluation logic documents the 13 reward columns, 4 primary channels, anti-cheat checks, timeouts, and offline evaluation.
- Post-training/self-improvement strategy documents SFT warm start, GRPO with environment rewards, ablations, adapter export, and post-save inference validation.

## Required Deliverables

- GitHub repo with all required links in README.
- Hugging Face Space URL.
- Colab notebook URL.
- YouTube video URL or Hugging Face blog URL. The blog URL currently in the README is the intended target, but it returns 404 until the post is published.
- Tracked plots and compact reports under `docs/results/`.
- Successful `docs/results/hf_space_verification.json` with `passed: true`.
- Participant-guide traceability map in `docs/participant_guide_traceability.md`.
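The file-based items in this list can be checked mechanically before submission. The following is a minimal sketch and not part of the repository: the helper name `missing_deliverables` and the choice to treat an unparsable JSON file as a failure are assumptions. Only the two paths and the `passed` key come from this checklist.

```python
import json
from pathlib import Path

# Paths taken from the checklist; the helper itself is hypothetical.
VERIFICATION_REPORT = "docs/results/hf_space_verification.json"
TRACEABILITY_MAP = "docs/participant_guide_traceability.md"


def missing_deliverables(repo_root):
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(repo_root)
    problems = []

    if not (root / TRACEABILITY_MAP).is_file():
        problems.append(f"missing {TRACEABILITY_MAP}")

    report_path = root / VERIFICATION_REPORT
    if not report_path.is_file():
        problems.append(f"missing {VERIFICATION_REPORT}")
    else:
        try:
            report = json.loads(report_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            report = {}  # assumption: unreadable JSON counts as a failure
        # Assumption: `passed` is a top-level boolean in the report.
        if not isinstance(report, dict) or report.get("passed") is not True:
            problems.append(f"{VERIFICATION_REPORT} does not have passed: true")

    return problems
```

Calling `missing_deliverables(".")` from the repository root returns an empty list when both files are present and the verification report passed.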

## Commands To Validate Before Submission

```bash
uv run pytest
uv run openenv validate .
bash scripts/bootstrap_openenv.sh --runtime-check
(cd app/ui/frontend && npm run build)
.venv/bin/python scripts/evaluate_baselines.py
.venv/bin/python scripts/evaluate_all.py
.venv/bin/python scripts/evaluate_compare_runs.py --baseline outputs/reports/baselines.json --candidate outputs/reports/benchmark_report.json --output outputs/reports/improvement_report.json
.venv/bin/python scripts/acceptance_gate.py
```

After the story artifact is published, run the opt-in live link checker:

```bash
uv run python scripts/validate_submission_links.py
```
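A live link check amounts to requesting each URL and treating an unreachable host or a status of 400 and above as a failure. The sketch below is illustrative only and does not reproduce `scripts/validate_submission_links.py`; `http_status`, `check_links`, and the injectable `fetch` parameter are hypothetical names.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the final HTTP status code for `url`, or None if unreachable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a blog post that is not yet published
    except (OSError, ValueError):
        # Covers DNS failures, timeouts, and malformed URLs.
        return None


def check_links(urls, fetch=http_status):
    """Map each URL to True when `fetch` reports a status below 400."""
    results = {}
    for url in urls:
        status = fetch(url)
        results[url] = status is not None and status < 400
    return results
```

Passing `fetch` as a parameter keeps the check testable offline. Some servers reject `HEAD` requests, so a real checker may need to fall back to `GET`.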

## Full Remote Training Evidence

```bash
export HF_TOKEN="<write-token>"
.venv/bin/python scripts/deploy_training_space.py \
  --repo-id TheJackBright/polyguard-openenv-training-full \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --hardware a10g-large \
  --model-sweep Qwen/Qwen2.5-0.5B-Instruct,Qwen/Qwen2.5-1.5B-Instruct,Qwen/Qwen2.5-3B-Instruct \
  --sft-epochs 2 \
  --grpo-epochs 1 \
  --sft-max-steps 0 \
  --grpo-max-steps 0 \
  --grpo-max-prompts 0
.venv/bin/python scripts/pull_training_artifacts.py \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts
.venv/bin/python scripts/activate_sweep_model.py \
  --source sweep \
  --run-id qwen-qwen2-5-0-5b-instruct \
  --preferred-artifact grpo_adapter
```

Final public artifacts should include `hf_sweep_summary.json`, `anti_hacking_overfit_report.json`, post-save inference reports, adapter evidence, `active_model_manifest.json`, and all relevant charts under `docs/results/` and `outputs/plots/`. Current tracked evidence includes a 3-model SFT-baseline sweep plus a top-level environment-backed GRPO run. Claim a full public per-model GRPO sweep only after the per-model GRPO artifacts, which are currently private, have been pulled, mirrored, and documented.
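The `--run-id` value passed to `scripts/activate_sweep_model.py` looks like a slug of the model name given to `--model-sweep`. Assuming the rule is "lowercase, then replace every run of non-alphanumeric characters with a hyphen" (an inference from the identifiers in this checklist, not a documented contract), a run ID can be derived as follows. `model_to_run_id` is a hypothetical helper.

```python
import re


def model_to_run_id(model_name):
    """Slugify a Hugging Face model name into a run ID (assumed rule)."""
    slug = re.sub(r"[^a-z0-9]+", "-", model_name.lower())
    return slug.strip("-")


for name in ("Qwen/Qwen2.5-0.5B-Instruct", "Qwen/Qwen2.5-1.5B-Instruct"):
    print(model_to_run_id(name))
# prints:
# qwen-qwen2-5-0-5b-instruct
# qwen-qwen2-5-1-5b-instruct
```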

## Qwen 0.5B/1.5B Submission Evidence

```bash
.venv/bin/python scripts/generate_submission_evidence.py \
  --models qwen-qwen2-5-0-5b-instruct,qwen-qwen2-5-1-5b-instruct \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --training-space-url https://thejackbright-polyguard-openenv-training-full.hf.space \
  --episodes 8
```

The generated files live in:

- `docs/results/submission_evidence_qwen_0_5b_1_5b/`
- `outputs/reports/submission_evidence/qwen_0_5b_1_5b/`
- `outputs/plots/submission_evidence/qwen_0_5b_1_5b/`
- `submission_bundle/qwen_0_5b_1_5b_evidence.zip`

The current live evidence confirms remote completion of 0.5B/1.5B SFT, GRPO, GRPO post-save inference, and policy ablations. It marks the per-run GRPO files and checkpoints as pending because they have not yet been uploaded to the private artifact repo.

The implementation-ready active model bundle is available separately:

```text
https://huggingface.co/TheJackBright/polyguard-openenv-training-full-artifacts/tree/main/usable_model_bundles/local-qwen-0-5b-active-smoke
submission_bundle/model_artifacts/local-qwen-0-5b-active-smoke/
```

It includes the local active Qwen 0.5B `grpo_adapter`, `sft_adapter`, `merged` model, manifests, and reports for immediate app integration while the full per-run remote sweep artifacts remain pending.

Deploy the evaluation-only HF Space without interrupting the training Space:

```bash
.venv/bin/python scripts/deploy_evidence_space.py \
  --repo-id TheJackBright/polyguard-openenv-evidence \
  --artifact-repo-id TheJackBright/polyguard-openenv-training-full-artifacts \
  --training-space-url https://thejackbright-polyguard-openenv-training-full.hf.space \
  --models qwen-qwen2-5-0-5b-instruct,qwen-qwen2-5-1-5b-instruct \
  --hardware cpu-basic
```

## Strict Final Gate

```bash
export POLYGUARD_ENFORCE_SUBMISSION_LINKS=true
.venv/bin/python scripts/acceptance_gate.py
```

Strict mode should pass only when all of the following hold:

- README links are not placeholders.
- `docs/results/avg_reward.png` and `docs/results/policy_stack_avg_reward.png` exist.
- `docs/results/hf_space_verification.json` has `passed: true`.
- `outputs/reports/sft_trl_run.json` has `status: ok`, non-zero examples, a non-empty artifact path, and uses `trl_unsloth` or `trl_transformers`.
- `outputs/reports/grpo_trl_run.json` has `status: ok`, accepted backend, and non-empty `artifact_path`.
- `outputs/reports/postsave_inference.json` does not use `fallback_policy`.
- `outputs/reports/improvement_report.json` has `improved: true`.
- `outputs/reports/hf_sweep_summary.json` has at least one completed non-fallback model row.
- `outputs/reports/anti_hacking_overfit_report.json` has `passed: true`.
- `GET /policy/model_status` reports the intended active run and artifact availability.
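The JSON-based conditions in this list can be expressed as simple predicates over the parsed reports. This is a sketch under stated assumptions, not the logic of `scripts/acceptance_gate.py`: the keys `status`, `artifact_path`, `improved`, and `passed` appear in this checklist, while the `backend` key, the reuse of the SFT backend set for GRPO, and the function name `gate_failures` are hypothetical.

```python
# Backends named in the checklist for the SFT run; assumed here to be the
# accepted set for the GRPO run as well.
ACCEPTED_BACKENDS = {"trl_unsloth", "trl_transformers"}


def gate_failures(reports):
    """Return the names of failed conditions.

    `reports` maps a report name (file stem) to its parsed JSON dict.
    """
    failures = []

    def require(name, ok):
        if not ok:
            failures.append(name)

    for name in ("sft_trl_run", "grpo_trl_run"):
        run = reports.get(name, {})
        require(f"{name}.status", run.get("status") == "ok")
        require(f"{name}.artifact_path", bool(run.get("artifact_path")))
        # Assumption: the training backend is stored under a `backend` key.
        require(f"{name}.backend", run.get("backend") in ACCEPTED_BACKENDS)

    for name, key in (
        ("improvement_report", "improved"),
        ("anti_hacking_overfit_report", "passed"),
        ("hf_space_verification", "passed"),
    ):
        require(f"{name}.{key}", reports.get(name, {}).get(key) is True)

    return failures
```

Returning the list of failed condition names, rather than a single boolean, makes it clear which report to regenerate before rerunning the gate.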

Strict mode passed during the April 26, 2026 audit. It does not perform live HTTP status checks, so the final blog/video URL still needs explicit validation.

## HF Auth Commands

```bash
./.venv/bin/hf auth login
./.venv/bin/hf auth whoami
export HF_SPACE_REPO_ID="TheJackBright/polyguard-openenv-workbench"
```

Use `./.venv/bin/hf`, not the global `hf` binary.

Private HF training artifact repositories require authentication and should not be used as judge-facing public links unless they are made public or mirrored into the repository/Space documentation.