| # Evaluation |
|
|
Evaluation combines simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.
|
|
| ## Run |
|
|
```bash
# 1. Build the baseline reference report (written to outputs/reports/baselines.json).
.venv/bin/python scripts/evaluate_baselines.py
# 2. Run the full benchmark suite (written to outputs/reports/benchmark_report.json).
.venv/bin/python scripts/evaluate_all.py
# 3. Compare the candidate benchmark against the baselines and emit the improvement report.
.venv/bin/python scripts/evaluate_compare_runs.py --baseline outputs/reports/baselines.json --candidate outputs/reports/benchmark_report.json --output outputs/reports/improvement_report.json
```
|
|
| ## Main Artifacts |
|
|
| - `outputs/reports/benchmark_report.json` |
| - `outputs/reports/baselines.json` |
| - `outputs/reports/grpo_ablation_report.json` |
| - `outputs/reports/improvement_report.json` |
| - `outputs/plots/*.png` |
| - tracked mirrors under `docs/results/` |
|
|
| ## Metric Families |
|
|
- Offline policy quality: `avg_reward`, `legal_rate`, `success_rate` (see the sketch after this list).
| - Robustness under perturbations. |
| - Dosing-specific target attainment and toxicity avoidance. |
| - Calibration and abstention. |
| - Process fidelity and invalid-action behavior. |
| - Subgroup and explainability summaries. |
| - Failure visibility and anti-cheat counts. |
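
As a rough illustration of the offline policy quality family, the sketch below computes `avg_reward`, `legal_rate`, and `success_rate` from a list of rollout records. The record field names (`reward`, `legal`, `success`) are assumptions for illustration, not the repository's actual rollout schema.

```python
# Hypothetical sketch: offline policy quality metrics over rollout records.
# Field names ("reward", "legal", "success") are assumed, not taken from the repo.
from statistics import mean
from typing import Iterable, Mapping


def offline_policy_quality(rollouts: Iterable[Mapping]) -> dict:
    rollouts = list(rollouts)
    if not rollouts:
        return {"avg_reward": 0.0, "legal_rate": 0.0, "success_rate": 0.0}
    return {
        "avg_reward": mean(r["reward"] for r in rollouts),
        "legal_rate": mean(1.0 if r["legal"] else 0.0 for r in rollouts),
        "success_rate": mean(1.0 if r["success"] else 0.0 for r in rollouts),
    }


# Example usage with two toy rollouts.
print(offline_policy_quality([
    {"reward": 0.8, "legal": True, "success": True},
    {"reward": -0.2, "legal": False, "success": False},
]))
```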
|
|
| ## Improvement Gate |
|
|
The final baseline-vs-candidate comparison must show improvement, or at least no regression, on each of the following (a gating sketch follows the list):
|
|
| - average reward |
| - legality rate |
| - success or justified-safe-resolution rate |
| - process fidelity |
| - timeout rate |
| - failure visibility |
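
A minimal sketch of such a gate is shown below. It assumes the baseline and candidate reports expose these metrics as top-level numeric JSON fields; the exact key names, and the choice to treat `timeout_rate` as lower-is-better and everything else as higher-is-better, are assumptions for illustration.

```python
# Hypothetical sketch of the improvement gate; report key names are assumed.
import json

HIGHER_IS_BETTER = [
    "avg_reward",
    "legal_rate",
    "success_rate",
    "process_fidelity",
    "failure_visibility",
]
LOWER_IS_BETTER = ["timeout_rate"]


def passes_gate(baseline_path: str, candidate_path: str, tol: float = 1e-9) -> bool:
    """Return True if the candidate does not regress relative to the baseline."""
    with open(baseline_path) as f:
        base = json.load(f)
    with open(candidate_path) as f:
        cand = json.load(f)
    return (
        all(cand[k] >= base[k] - tol for k in HIGHER_IS_BETTER)
        and all(cand[k] <= base[k] + tol for k in LOWER_IS_BETTER)
    )


if __name__ == "__main__":
    print(passes_gate("outputs/reports/baselines.json",
                      "outputs/reports/benchmark_report.json"))
```

In the repository itself, this comparison is performed by `scripts/evaluate_compare_runs.py`, which writes `outputs/reports/improvement_report.json`.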
|
|
| Older smoke artifacts are retained for auditability, but final claims should use |
| the curated bundle under `docs/results/final_submission_evidence/`. The root |
| repository README is the canonical narrative and evidence map. |
|
|