Evaluation
Evaluation is computed from simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.
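At a high level, rollouts run the policy through the simulator across a set of perturbation settings and record per-episode outcomes. A minimal sketch of that loop, assuming a gym-style environment and a `policy.act` interface (all names here are hypothetical, not the project's actual API):

```python
# Hypothetical rollout loop; the real environment, policy, and perturbation
# interfaces live in the project's scripts and may differ from this sketch.
def run_rollouts(env_factory, policy, perturbations, episodes_per_setting=20):
    records = []
    for perturbation in [None, *perturbations]:
        for _ in range(episodes_per_setting):
            env = env_factory(perturbation=perturbation)
            obs, done, total_reward, illegal = env.reset(), False, 0.0, 0
            while not done:
                action = policy.act(obs)
                obs, reward, done, info = env.step(action)
                total_reward += reward
                illegal += int(info.get("illegal_action", False))
            records.append({
                "perturbation": getattr(perturbation, "name", "none"),
                "reward": total_reward,
                "legal": illegal == 0,
                "success": bool(info.get("success", False)),
            })
    return records
```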
Run
.venv/bin/python scripts/evaluate_baselines.py
.venv/bin/python scripts/evaluate_all.py
.venv/bin/python scripts/evaluate_compare_runs.py --baseline outputs/reports/baselines.json --candidate outputs/reports/benchmark_report.json --output outputs/reports/improvement_report.json
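The comparison step reads the baseline and candidate reports and emits per-metric deltas. A minimal sketch in the spirit of `evaluate_compare_runs.py`, assuming both reports expose a flat `metrics` mapping (the actual schema may differ):

```python
import json

def compare_reports(baseline_path, candidate_path, output_path):
    # Load both reports; the "metrics" key and its flat layout are assumptions.
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)

    improvement = {}
    for metric, base_value in baseline.get("metrics", {}).items():
        cand_value = candidate.get("metrics", {}).get(metric)
        if cand_value is None:
            continue
        improvement[metric] = {
            "baseline": base_value,
            "candidate": cand_value,
            "delta": cand_value - base_value,
        }

    # Write the per-metric delta table to the improvement report path.
    with open(output_path, "w") as f:
        json.dump(improvement, f, indent=2)
    return improvement
```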
Main Artifacts
- `outputs/reports/benchmark_report.json`
- `outputs/reports/baselines.json`
- `outputs/reports/grpo_ablation_report.json`
- `outputs/reports/improvement_report.json`
- `outputs/plots/*.png`
- tracked mirrors under `docs/results/`
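To inspect any of these reports interactively, the JSON can be loaded directly. A small sketch (nested values will simply print as dictionaries):

```python
import json
from pathlib import Path

# Print the top-level contents of a generated report.
report_path = Path("outputs/reports/benchmark_report.json")
report = json.loads(report_path.read_text())
for key, value in sorted(report.items()):
    print(f"{key}: {value}")
```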
Metric Families
- Offline policy quality: `avg_reward`, `legal_rate`, `success_rate` (see the metric sketch after this list).
- Robustness under perturbations.
- Dosing-specific target attainment and toxicity avoidance.
- Calibration and abstention.
- Process fidelity and invalid-action behavior.
- Subgroup and explainability summaries.
- Failure visibility and anti-cheat counts.
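A minimal sketch of how the offline policy-quality metrics could be aggregated from per-episode rollout records; the record fields are assumptions rather than the project's actual schema:

```python
def offline_policy_metrics(records):
    # Each record is assumed to carry "reward", "legal", and "success" fields.
    n = len(records)
    if n == 0:
        return {"avg_reward": 0.0, "legal_rate": 0.0, "success_rate": 0.0}
    return {
        "avg_reward": sum(r["reward"] for r in records) / n,
        "legal_rate": sum(r["legal"] for r in records) / n,
        "success_rate": sum(r["success"] for r in records) / n,
    }
```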
Improvement Gate
Final comparison must show positive or non-regressing behavior on the following metrics (a gate-check sketch follows the list):
- average reward
- legality rate
- success or justified-safe-resolution rate
- process fidelity
- timeout rate
- failure visibility
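One way to encode "positive or non-regressing" is a direction-aware comparison, where lower-is-better metrics such as timeout rate are inverted. This is a sketch only; the metric keys and tolerance are assumptions, and the authoritative check is the comparison script:

```python
# Direction of improvement for each gated metric (assumed key names).
HIGHER_IS_BETTER = {
    "avg_reward": True,
    "legal_rate": True,
    "success_rate": True,
    "process_fidelity": True,
    "timeout_rate": False,      # lower is better
    "failure_visibility": True,
}

def passes_gate(baseline, candidate, tolerance=0.0):
    """Return (passed, regressed_metrics) for a baseline/candidate pair."""
    regressions = []
    for metric, higher_better in HIGHER_IS_BETTER.items():
        base, cand = baseline.get(metric), candidate.get(metric)
        if base is None or cand is None:
            continue
        delta = (cand - base) if higher_better else (base - cand)
        if delta < -tolerance:  # regression beyond the allowed tolerance
            regressions.append(metric)
    return len(regressions) == 0, regressions
```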
Older smoke artifacts are retained for auditability, but final claims should use
the curated bundle under `docs/results/final_submission_evidence/`. The root
repository README is the canonical narrative and evidence map.