Reward Design

All reward outputs are clamped to [0.001, 0.999] with 3-decimal precision. The reward model is intentionally decomposed into separate components so that both training progress and reward hacking can be inspected.
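As a minimal sketch of the clamping rule described above, the helper below bounds a raw score and rounds it to three decimals. The function name `clamp_reward` and its signature are assumptions made for illustration; they are not taken from the PolyGuard code.

```python
def clamp_reward(raw: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Clamp a raw score into [lo, hi], then round to 3 decimal places.

    Hypothetical helper: the name and defaults mirror the documented range,
    not a confirmed runtime function.
    """
    bounded = min(max(raw, lo), hi)
    return round(bounded, 3)
```

Keeping the bounds strictly inside (0, 1) means a reward is never exactly 0 or 1, which may help when downstream code takes logs or normalizes by the reward.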

Component Rewards

The runtime keeps 13 reward columns:

format_compliance_score
candidate_alignment_score
legality_score
safety_delta_score
burden_improvement_score
disease_stability_score
dosing_quality_score
abstention_quality_score
efficiency_score
process_fidelity_score
explanation_grounding_score
anti_cheat_score
uncertainty_calibration_score

Primary Channels

The component columns map into 4 judge-friendly reward channels:

safety_legality
clinical_improvement
dosing_quality
process_integrity

These channels are exposed in info.primary_reward_channels, logged during GRPO verification, and plotted in evaluation reports.
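The sketch below shows one way 13 component columns could be folded into the 4 channels. The grouping in `CHANNEL_COMPONENTS` and the use of an unweighted mean are assumptions chosen for illustration; this document does not state the actual mapping or weights, so treat both as placeholders.

```python
from statistics import fmean

# Hypothetical grouping: the real column-to-channel mapping is not
# specified in this document.
CHANNEL_COMPONENTS = {
    "safety_legality": ["legality_score", "safety_delta_score"],
    "clinical_improvement": [
        "burden_improvement_score",
        "disease_stability_score",
    ],
    "dosing_quality": ["dosing_quality_score"],
    "process_integrity": [
        "format_compliance_score",
        "candidate_alignment_score",
        "abstention_quality_score",
        "efficiency_score",
        "process_fidelity_score",
        "explanation_grounding_score",
        "anti_cheat_score",
        "uncertainty_calibration_score",
    ],
}


def primary_reward_channels(components: dict[str, float]) -> dict[str, float]:
    """Average each channel's components (assumed aggregation), then clamp."""
    channels = {}
    for channel, names in CHANNEL_COMPONENTS.items():
        mean = fmean(components[name] for name in names)
        # Same [0.001, 0.999] clamp and 3-decimal rounding as the rewards.
        channels[channel] = round(min(max(mean, 0.001), 0.999), 3)
    return channels
```

A caller would pass the per-step component scores as a dictionary and store the result under `info.primary_reward_channels`.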

Anti-Hacking Checks

The environment explicitly penalizes repeated action loops, keep-regimen abuse, review abuse, candidate ID mismatch, illegal candidate selection, no-op behavior in the presence of a known high-risk drug-drug interaction (DDI), parser exploit patterns, and retrying a failed no-op action.
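To make two of these checks concrete, here is a small sketch that flags a repeated action loop and a candidate ID mismatch. The function name, the reason strings, and the three-step loop window are all hypothetical; the actual detection logic and thresholds are not given in this document.

```python
def anti_cheat_reasons(
    action_history: list[str],
    selected_candidate_id: str,
    offered_candidate_ids: set[str],
    loop_window: int = 3,  # assumed threshold; must be >= 1
) -> list[str]:
    """Return hypothetical anti-cheat reason codes for one step."""
    reasons = []

    # Repeated action loop: the last `loop_window` actions are identical.
    recent = action_history[-loop_window:]
    if len(recent) == loop_window and len(set(recent)) == 1:
        reasons.append("repeated_action_loop")

    # Candidate ID mismatch: the chosen ID was never offered this step.
    if selected_candidate_id not in offered_candidate_ids:
        reasons.append("candidate_id_mismatch")

    return reasons
```

Returning reason codes rather than a bare penalty keeps each triggered check visible in the logs, in line with the inspectability goal stated at the top of this document.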

Failure Visibility

Per-step payloads include failure_reasons, invalid_action_count, checks, timeout flags, safety report, anti-cheat reasons, transition delta, reward breakdown, and primary reward channels.
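As an illustration of how these payloads might be consumed, the sketch below picks out the steps that recorded a failure. It assumes each payload is a dictionary keyed by `failure_reasons` and `invalid_action_count` as named above; the helper `failing_steps` is hypothetical, and the exact payload layout (including whether the count is per-step or cumulative) is not confirmed here.

```python
def failing_steps(payloads: list[dict]) -> list[int]:
    """Return indices of steps with failure reasons or invalid actions.

    Assumes dict payloads; a missing key is treated as "no failure".
    """
    flagged = []
    for index, payload in enumerate(payloads):
        has_reasons = bool(payload.get("failure_reasons"))
        has_invalid = payload.get("invalid_action_count", 0) > 0
        if has_reasons or has_invalid:
            flagged.append(index)
    return flagged
```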