# Evaluation

Evaluation is computed from simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.

## Run

```bash
.venv/bin/python scripts/evaluate_baselines.py
.venv/bin/python scripts/evaluate_all.py
.venv/bin/python scripts/evaluate_compare_runs.py \
  --baseline outputs/reports/baselines.json \
  --candidate outputs/reports/benchmark_report.json \
  --output outputs/reports/improvement_report.json
```

## Main Artifacts

- `outputs/reports/benchmark_report.json`
- `outputs/reports/baselines.json`
- `outputs/reports/grpo_ablation_report.json`
- `outputs/reports/improvement_report.json`
- `outputs/plots/*.png`
- tracked mirrors under `docs/results/`

## Metric Families

- Offline policy quality: `avg_reward`, `legal_rate`, `success_rate` (see the aggregation sketch at the end of this section).
- Robustness under perturbations.
- Dosing-specific target attainment and toxicity avoidance.
- Calibration and abstention.
- Process fidelity and invalid-action behavior.
- Subgroup and explainability summaries.
- Failure visibility and anti-cheat counts.

## Improvement Gate

Final comparison must show positive or non-regressing behavior on the following (a gate-check sketch appears at the end of this section):

- average reward
- legality rate
- success or justified-safe-resolution rate
- process fidelity
- timeout rate
- failure visibility

Current tracked smoke artifacts are not final evidence: `docs/results/improvement_report.json` currently records `improved: false`. Replace it after real SFT/GRPO training.
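For reference, the offline policy-quality numbers are plain aggregates over rollout records. A minimal sketch, assuming each rollout is a dict with `reward`, `legal`, and `success` fields (hypothetical names; the evaluator's actual record schema may differ):

```python
from statistics import mean


def policy_quality(rollouts: list[dict]) -> dict:
    """Aggregate offline policy-quality metrics over simulator rollouts.

    Assumes each rollout dict carries `reward` (float), `legal` (bool),
    and `success` (bool) -- hypothetical field names, not the repo schema.
    """
    if not rollouts:
        return {"avg_reward": 0.0, "legal_rate": 0.0, "success_rate": 0.0}
    n = len(rollouts)
    return {
        "avg_reward": mean(r["reward"] for r in rollouts),
        "legal_rate": sum(r["legal"] for r in rollouts) / n,
        "success_rate": sum(r["success"] for r in rollouts) / n,
    }
```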
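The improvement gate itself can be checked mechanically against the baseline and candidate reports. A minimal sketch, assuming both JSONs expose the gated metrics as flat top-level keys (the real schema consumed by `scripts/evaluate_compare_runs.py` may differ, and `timeout_rate` is assumed lower-is-better):

```python
import json

# Direction of "better" per gated metric. Key names and directions are
# assumptions: higher is better everywhere except timeout rate.
HIGHER_IS_BETTER = [
    "avg_reward",         # average reward
    "legal_rate",         # legality rate
    "success_rate",       # success or justified-safe-resolution rate
    "process_fidelity",
    "failure_visibility",
]
LOWER_IS_BETTER = ["timeout_rate"]


def passes_gate(baseline_path: str, candidate_path: str) -> bool:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(candidate_path) as f:
        candidate = json.load(f)
    up_ok = all(candidate[k] >= baseline[k] for k in HIGHER_IS_BETTER)
    down_ok = all(candidate[k] <= baseline[k] for k in LOWER_IS_BETTER)
    return up_ok and down_ok


if __name__ == "__main__":
    improved = passes_gate(
        "outputs/reports/baselines.json",
        "outputs/reports/benchmark_report.json",
    )
    print(json.dumps({"improved": improved}, indent=2))
```

Requiring non-regression on every gated metric, rather than a weighted composite, keeps a single regression from being masked by gains elsewhere, which matches the all-must-hold phrasing of the gate above.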