# Evaluation
Evaluation metrics are computed from simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.
## Run
```bash
# 1. Compute baseline metrics (writes outputs/reports/baselines.json).
.venv/bin/python scripts/evaluate_baselines.py

# 2. Run the full evaluation suite.
.venv/bin/python scripts/evaluate_all.py

# 3. Compare the candidate report against the baselines and write the improvement report.
.venv/bin/python scripts/evaluate_compare_runs.py \
  --baseline outputs/reports/baselines.json \
  --candidate outputs/reports/benchmark_report.json \
  --output outputs/reports/improvement_report.json
```
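The schema of the comparison output is project-specific; a minimal sketch for inspecting it, assuming only that `improvement_report.json` is a JSON object:

```python
import json
from pathlib import Path

# Assumed path from the comparison step above. The schema is not specified here,
# so this only prints top-level keys and any numeric values.
report_path = Path("outputs/reports/improvement_report.json")
report = json.loads(report_path.read_text())

for key, value in report.items():
    if isinstance(value, (int, float)):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {type(value).__name__}")
```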
## Main Artifacts
- `outputs/reports/benchmark_report.json`
- `outputs/reports/baselines.json`
- `outputs/reports/grpo_ablation_report.json`
- `outputs/reports/improvement_report.json`
- `outputs/plots/*.png`
- tracked mirrors under `docs/results/`
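How the tracked mirrors are produced is not specified here; a minimal sketch, assuming mirroring is a plain copy of report JSONs and plots into `docs/results/`:

```python
import shutil
from pathlib import Path

# Assumption: mirroring is a simple file copy; adjust if the project uses a dedicated script.
SOURCES = ["outputs/reports", "outputs/plots"]
DEST = Path("docs/results")

for src in SOURCES:
    for path in Path(src).glob("*"):
        if path.suffix in {".json", ".png"}:
            target = DEST / path.name
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
```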
## Metric Families
- Offline policy quality: `avg_reward`, `legal_rate`, `success_rate` (see the metric sketch after this list).
- Robustness under perturbations.
- Dosing-specific target attainment and toxicity avoidance.
- Calibration and abstention.
- Process fidelity and invalid-action behavior.
- Subgroup and explainability summaries.
- Failure visibility and anti-cheat counts.
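The exact metric definitions live in the evaluation scripts; the sketch below illustrates the offline policy-quality family only, assuming each rollout records a total reward, per-step legality flags, and a success flag (the field names are illustrative, not the project's schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    # Illustrative fields; the real rollout records are defined by the simulator.
    total_reward: float
    step_legal: List[bool]
    success: bool

def offline_policy_quality(rollouts: List[Rollout]) -> dict:
    """Compute avg_reward, legal_rate, and success_rate over a batch of rollouts (sketch)."""
    n = len(rollouts)
    total_steps = sum(len(r.step_legal) for r in rollouts)
    return {
        "avg_reward": sum(r.total_reward for r in rollouts) / n,
        "legal_rate": sum(sum(r.step_legal) for r in rollouts) / total_steps,
        "success_rate": sum(r.success for r in rollouts) / n,
    }

# Example with two toy rollouts.
print(offline_policy_quality([
    Rollout(total_reward=1.2, step_legal=[True, True, False], success=True),
    Rollout(total_reward=0.4, step_legal=[True, True, True], success=False),
]))
```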
## Improvement Gate
The final comparison must show improvement, or at least no regression, on each of the following (see the gate-check sketch after this list):
- average reward
- legality rate
- success or justified-safe-resolution rate
- process fidelity
- timeout rate
- failure visibility
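The gate itself is applied by the comparison script; the sketch below is an illustration rather than the project's implementation, and assumes the baseline and candidate reports expose these metrics as flat key/value pairs. For the timeout rate, lower is better, so the comparison direction is flipped.

```python
# Illustrative gate check; metric names and report layout are assumptions.
HIGHER_IS_BETTER = {
    "avg_reward": True,
    "legal_rate": True,
    "success_rate": True,       # or justified-safe-resolution rate
    "process_fidelity": True,
    "timeout_rate": False,      # a regression means this goes up
    "failure_visibility": True,
}

def passes_gate(baseline: dict, candidate: dict, tolerance: float = 0.0) -> bool:
    """Return True if every gated metric improves or stays within tolerance."""
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        if not higher_is_better:
            delta = -delta
        if delta < -tolerance:
            return False
    return True
```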
Older smoke artifacts are retained for auditability, but final claims should use
the curated bundle under `docs/results/final_submission_evidence/`. The root
repository README is the canonical narrative and evidence map.