Evaluation

Evaluation aggregates simulator rollouts, perturbation suites, baseline comparisons, and post-save inference checks.
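
As a rough sketch of the rollout-plus-perturbation flow, the loop below scores a policy on a clean setting and on each perturbation; run_episode and the PERTURBATIONS table are hypothetical stand-ins, not the repo's simulator API:

from statistics import mean
import random

def run_episode(noise: float = 0.0) -> float:
    """Placeholder for a real simulator rollout; returns one episode's reward."""
    return max(0.0, 1.0 - noise + random.uniform(-0.05, 0.05))

# Hypothetical perturbation suite: name -> injected noise level.
PERTURBATIONS = {"clean": 0.0, "obs_noise": 0.1, "action_delay": 0.2}

report = {name: mean(run_episode(noise) for _ in range(100))
          for name, noise in PERTURBATIONS.items()}
print(report)  # robustness = how far perturbed rewards fall below clean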

Run

# Baseline metrics (writes outputs/reports/baselines.json)
.venv/bin/python scripts/evaluate_baselines.py
# Full benchmark suite (writes outputs/reports/benchmark_report.json)
.venv/bin/python scripts/evaluate_all.py
# Compare the candidate run against the baseline and write the improvement report
.venv/bin/python scripts/evaluate_compare_runs.py --baseline outputs/reports/baselines.json --candidate outputs/reports/benchmark_report.json --output outputs/reports/improvement_report.json
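
To check the comparison outcome programmatically, the report can be read back as in this sketch; the improved flag is the only field documented here, so other keys are left unspecified:

import json

# Read back the comparison output; `improved` is the only field this
# doc documents, so no other keys are assumed.
with open("outputs/reports/improvement_report.json") as f:
    report = json.load(f)
print("gate passed:", report["improved"])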

Main Artifacts

  • outputs/reports/benchmark_report.json
  • outputs/reports/baselines.json
  • outputs/reports/grpo_ablation_report.json
  • outputs/reports/improvement_report.json
  • outputs/plots/*.png
  • tracked mirrors under docs/results/

Metric Families

  • Offline policy quality: avg_reward, legal_rate, success_rate (a minimal aggregation sketch follows this list).
  • Robustness under perturbations.
  • Dosing-specific target attainment and toxicity avoidance.
  • Calibration and abstention.
  • Process fidelity and invalid-action behavior.
  • Subgroup and explainability summaries.
  • Failure visibility and anti-cheat counts.
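
A minimal sketch of the offline policy-quality family, assuming hypothetical per-episode record fields (reward, legal, success) rather than the repo's actual schema:

def policy_quality(records: list[dict]) -> dict:
    """Average per-episode metrics; the field names here are hypothetical."""
    n = len(records)
    return {
        "avg_reward": sum(r["reward"] for r in records) / n,
        "legal_rate": sum(r["legal"] for r in records) / n,
        "success_rate": sum(r["success"] for r in records) / n,
    }

records = [
    {"reward": 0.8, "legal": True, "success": True},
    {"reward": 0.2, "legal": True, "success": False},
]
print(policy_quality(records))
# -> {'avg_reward': 0.5, 'legal_rate': 1.0, 'success_rate': 0.5}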

Improvement Gate

The final comparison must improve, or at least not regress, on each of the following (a direction-aware gate sketch follows the list):

  • average reward
  • legality rate
  • success or justified-safe-resolution rate
  • process fidelity
  • timeout rate
  • failure visibility
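
A direction-aware check of this gate might look like the sketch below; the metric keys mirror the list above, but the direction table and tolerance are assumptions of the sketch (timeout rate is treated as lower-is-better):

# The candidate must not regress on any tracked metric. The direction
# table and tolerance are assumptions of this sketch; timeout_rate is
# treated as lower-is-better.
HIGHER_IS_BETTER = {
    "avg_reward": True,
    "legal_rate": True,
    "success_rate": True,
    "process_fidelity": True,
    "failure_visibility": True,
    "timeout_rate": False,
}

def passes_gate(baseline: dict, candidate: dict, tol: float = 1e-6) -> bool:
    for metric, higher in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        if (higher and delta < -tol) or (not higher and delta > tol):
            return False  # a regression on any metric fails the gate
    return True

evaluate_compare_runs.py records the aggregate outcome as the improved flag in improvement_report.json.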

The currently tracked smoke artifacts are not final evidence: docs/results/improvement_report.json currently records improved: false. Replace them after a real SFT/GRPO training run.