# Evaluation

Evaluation is computed from simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.
## Run

```bash
.venv/bin/python scripts/evaluate_baselines.py
.venv/bin/python scripts/evaluate_all.py
.venv/bin/python scripts/evaluate_compare_runs.py --baseline outputs/reports/baselines.json --candidate outputs/reports/benchmark_report.json --output outputs/reports/improvement_report.json
```
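After the comparison run, the improvement report can be checked programmatically. A minimal sketch, assuming only the `improved` flag that the tracked smoke report is stated to record; every other field in the JSON is ignored here.

```python
import json
from pathlib import Path

def is_improved(report: dict) -> bool:
    """Return the report's `improved` flag, defaulting to False when absent."""
    return bool(report.get("improved", False))

# Typical use after evaluate_compare_runs.py has written the report:
# report = json.loads(Path("outputs/reports/improvement_report.json").read_text())
# if not is_improved(report):
#     raise SystemExit("candidate did not improve on the baseline")
```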
## Main Artifacts

- `outputs/reports/benchmark_report.json`
- `outputs/reports/baselines.json`
- `outputs/reports/grpo_ablation_report.json`
- `outputs/reports/improvement_report.json`
- `outputs/plots/*.png`
- tracked mirrors under `docs/results/`
## Metric Families

- Offline policy quality: `avg_reward`, `legal_rate`, `success_rate`.
- Robustness under perturbations.
- Dosing-specific target attainment and toxicity avoidance.
- Calibration and abstention.
- Process fidelity and invalid-action behavior.
- Subgroup and explainability summaries.
- Failure visibility and anti-cheat counts.
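The offline policy-quality family above can be sketched as a simple aggregation over rollout records. The per-rollout field names (`reward`, `legal`, `success`) are illustrative assumptions, not the repo's actual rollout schema.

```python
from statistics import mean

def offline_metrics(rollouts: list[dict]) -> dict:
    """Aggregate avg_reward, legal_rate, and success_rate over rollouts.

    Assumes each record carries a numeric `reward` and boolean
    `legal`/`success` flags (hypothetical field names).
    """
    n = len(rollouts)
    return {
        "avg_reward": mean(r["reward"] for r in rollouts),
        "legal_rate": sum(1 for r in rollouts if r["legal"]) / n,
        "success_rate": sum(1 for r in rollouts if r["success"]) / n,
    }
```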
## Improvement Gate

Final comparison must show positive or non-regressing behavior on:

- average reward
- legality rate
- success or justified-safe-resolution rate
- process fidelity
- timeout rate
- failure visibility
Current tracked smoke artifacts are not final evidence: `docs/results/improvement_report.json` currently records `improved: false`. Replace it after real SFT/GRPO training.
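The gate above can be sketched as a direction-aware metric comparison. The flat metric dicts, snake_case key names, the higher-is-better/lower-is-better split (timeout rate regresses when it goes up), and the zero default tolerance are all illustrative assumptions, not the repo's actual gate logic.

```python
# Hypothetical metric directions, derived from the gate list above.
HIGHER_IS_BETTER = {"avg_reward", "legal_rate", "success_rate",
                    "process_fidelity", "failure_visibility"}
LOWER_IS_BETTER = {"timeout_rate"}

def passes_gate(baseline: dict, candidate: dict, tol: float = 0.0) -> bool:
    """True when the candidate is non-regressing on every gated metric."""
    ok_up = all(candidate[m] >= baseline[m] - tol for m in HIGHER_IS_BETTER)
    ok_down = all(candidate[m] <= baseline[m] + tol for m in LOWER_IS_BETTER)
    return ok_up and ok_down
```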