Reward Design
All reward outputs are clamped to [0.001, 0.999] with 3-decimal precision. The reward model is intentionally decomposed so training progress and reward hacking are inspectable.
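The clamping rule can be illustrated with a minimal sketch. The function name `clamp_reward` is hypothetical, and the choice to clamp before rounding is an assumption; the source only states the bounds and the 3-decimal precision.

```python
def clamp_reward(value: float, low: float = 0.001, high: float = 0.999) -> float:
    """Clamp a raw reward into [low, high], then round to 3 decimal places.

    Hypothetical helper: the bounds come from the text above, while the
    clamp-then-round order is an assumption made for this sketch.
    """
    return round(min(max(value, low), high), 3)


if __name__ == "__main__":
    for raw in (-2.0, 0.12345, 1.5):
        print(raw, "->", clamp_reward(raw))
```

Keeping the output strictly inside the open interval (0, 1) means no component ever reports an exact 0 or 1, so a saturated score is easy to spot in logs.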
Component Rewards
The runtime keeps 13 reward columns:
format_compliance_score
candidate_alignment_score
legality_score
safety_delta_score
burden_improvement_score
disease_stability_score
dosing_quality_score
abstention_quality_score
efficiency_score
process_fidelity_score
explanation_grounding_score
anti_cheat_score
uncertainty_calibration_score
Primary Channels
The component columns map into 4 judge-friendly reward channels:
safety_legality
clinical_improvement
dosing_quality
process_integrity
These channels are exposed in info.primary_reward_channels, logged during GRPO verification, and plotted in evaluation reports.
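As a rough illustration of how component columns could roll up into channels, the sketch below averages a group of columns per channel. The grouping in `ASSUMED_CHANNEL_MAP`, the use of a plain mean, and the function name are all assumptions; the source names the columns and the channels but does not say which column feeds which channel or how they are combined.

```python
from statistics import fmean

# Hypothetical grouping, for illustration only. The actual column-to-channel
# assignment and weighting are not stated in the source.
ASSUMED_CHANNEL_MAP = {
    "safety_legality": ["legality_score", "safety_delta_score"],
    "clinical_improvement": ["burden_improvement_score", "disease_stability_score"],
    "dosing_quality": ["dosing_quality_score"],
    "process_integrity": [
        "format_compliance_score",
        "process_fidelity_score",
        "anti_cheat_score",
    ],
}


def primary_reward_channels(components: dict) -> dict:
    """Average the available component scores for each channel.

    Channels with no available component are omitted rather than guessed.
    """
    channels = {}
    for channel, columns in ASSUMED_CHANNEL_MAP.items():
        present = [components[name] for name in columns if name in components]
        if present:
            channels[channel] = round(fmean(present), 3)
    return channels
```

A caller would pass the per-step component scores as a dictionary and store the result under `info.primary_reward_channels`.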
Anti-Hacking Checks
The environment explicitly penalizes repeated action loops, keep-regimen abuse, review abuse, candidate ID mismatch, illegal candidate selection, known high-risk DDI no-op behavior, parser exploit patterns, and retrying a failed no-op action.
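One of these checks, the repeated action loop, can be sketched as a sliding-window detector. The class name, the window size of 3, and the representation of actions as strings are assumptions for illustration; the environment's actual detection rule is not described in the source.

```python
from collections import deque


class RepeatLoopDetector:
    """Flag a loop when the last `window` actions are all identical.

    Hypothetical sketch: the window size and the exact-equality test are
    assumptions, not the environment's documented rule.
    """

    def __init__(self, window: int = 3) -> None:
        self.window = window
        # deque with maxlen drops the oldest action automatically.
        self.recent = deque(maxlen=window)

    def observe(self, action: str) -> bool:
        """Record an action and return True if it completes a repeat loop."""
        self.recent.append(action)
        return len(self.recent) == self.window and len(set(self.recent)) == 1
```

A penalty could then be applied on any step where `observe` returns `True`, with the reason recorded alongside the other anti-cheat reasons.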
Failure Visibility
Per-step payloads include failure_reasons, invalid_action_count, checks, timeout flags, safety report, anti-cheat reasons, transition delta, reward breakdown, and primary reward channels.
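To show how these payloads might be inspected after an episode, the sketch below tallies `failure_reasons` across steps. It assumes each payload is a dictionary and that `failure_reasons` is a list of strings; the function name and the reason strings in the usage example are placeholders, not values taken from the environment.

```python
from collections import Counter


def tally_failure_reasons(step_payloads: list) -> dict:
    """Count how often each failure reason appears across step payloads.

    Assumes `failure_reasons` is a list of strings and may be absent
    on steps that had no failures.
    """
    counts = Counter()
    for payload in step_payloads:
        counts.update(payload.get("failure_reasons", []))
    return dict(counts)


if __name__ == "__main__":
    # Placeholder payloads; the reason strings are invented for the example.
    episode = [
        {"failure_reasons": ["example_reason_a"]},
        {"failure_reasons": ["example_reason_a", "example_reason_b"]},
        {},
    ]
    print(tally_failure_reasons(episode))
```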