# Evaluation
Evaluation is computed from simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.
## Run
```bash
.venv/bin/python scripts/evaluate_baselines.py
.venv/bin/python scripts/evaluate_all.py
.venv/bin/python scripts/evaluate_compare_runs.py --baseline outputs/reports/baselines.json --candidate outputs/reports/benchmark_report.json --output outputs/reports/improvement_report.json
```
## Main Artifacts
- `outputs/reports/benchmark_report.json`
- `outputs/reports/baselines.json`
- `outputs/reports/grpo_ablation_report.json`
- `outputs/reports/improvement_report.json`
- `outputs/plots/*.png`
- tracked mirrors under `docs/results/`
## Metric Families
- Offline policy quality: `avg_reward`, `legal_rate`, `success_rate`.
- Robustness under perturbations.
- Dosing-specific target attainment and toxicity avoidance.
- Calibration and abstention.
- Process fidelity and invalid-action behavior.
- Subgroup and explainability summaries.
- Failure visibility and anti-cheat counts.
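The offline policy-quality family can be illustrated with a minimal sketch. This is a hypothetical aggregation over rollout records, not the actual schema used by the evaluation scripts; the field names `reward`, `legal`, and `success` are assumptions.

```python
# Hypothetical sketch: aggregating avg_reward, legal_rate, and success_rate
# from a list of simulator rollouts. The rollout field names are assumed.

def policy_quality(rollouts):
    """Compute the offline policy-quality metrics over a rollout batch."""
    n = len(rollouts)
    return {
        "avg_reward": sum(r["reward"] for r in rollouts) / n,
        "legal_rate": sum(r["legal"] for r in rollouts) / n,
        "success_rate": sum(r["success"] for r in rollouts) / n,
    }

rollouts = [
    {"reward": 1.0, "legal": True, "success": True},
    {"reward": 0.5, "legal": True, "success": False},
    {"reward": -1.0, "legal": False, "success": False},
]
print(policy_quality(rollouts))
```

The other metric families (robustness, calibration, subgroup summaries) would layer additional slicing and perturbation logic on top of the same per-rollout records.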
## Improvement Gate
The final comparison must show improvement, or at minimum no regression, on each of:
- average reward
- legality rate
- success or justified-safe-resolution rate
- process fidelity
- timeout rate (lower is better)
- failure visibility
The currently tracked smoke artifacts are not final evidence: `docs/results/improvement_report.json` records `improved: false`. Replace it with a report from real SFT/GRPO training runs.
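The gate logic can be sketched as a non-regression check over the two report files. This is a hypothetical implementation: the metric keys and report schema below are assumptions, not the actual output format of `evaluate_compare_runs.py`.

```python
# Hypothetical sketch of the improvement gate: the candidate run must not
# regress on any gated metric. Metric names mirror the list above but are
# assumed, as is the flat {metric: value} report schema.

# Direction of improvement per metric (assumed).
HIGHER_IS_BETTER = [
    "avg_reward", "legal_rate", "success_rate",
    "process_fidelity", "failure_visibility",
]
LOWER_IS_BETTER = ["timeout_rate"]

def gate(baseline, candidate, tol=0.0):
    """Return (passed, failures): candidate must not regress beyond tol."""
    failures = []
    for key in HIGHER_IS_BETTER:
        if candidate[key] < baseline[key] - tol:
            failures.append(key)
    for key in LOWER_IS_BETTER:
        if candidate[key] > baseline[key] + tol:
            failures.append(key)
    return (not failures, failures)

# Usage (schema assumed):
#   import json
#   baseline = json.load(open("outputs/reports/baselines.json"))
#   candidate = json.load(open("outputs/reports/benchmark_report.json"))
#   passed, failures = gate(baseline, candidate)
```

A small tolerance (`tol`) keeps the gate from failing on noise-level differences between runs; set it to 0 for a strict check.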