# Reward Design

All reward outputs are clamped to `[0.001, 0.999]` and rounded to three decimal places. The reward model is intentionally decomposed into separate components so that both training progress and reward hacking can be inspected.
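
The clamping rule can be sketched as follows. This is a minimal illustration rather than the runtime's actual code: the names `REWARD_MIN`, `REWARD_MAX`, and `clamp_reward` are hypothetical, and the order of operations (clamp first, then round) is an assumption.

```python
# Hypothetical names; only the bounds and the 3-decimal precision come from the text.
REWARD_MIN = 0.001
REWARD_MAX = 0.999


def clamp_reward(value: float) -> float:
    """Clamp a raw score to [REWARD_MIN, REWARD_MAX], then round to three decimals."""
    clamped = min(max(value, REWARD_MIN), REWARD_MAX)
    # Both bounds already have three decimals, so rounding cannot leave the range.
    return round(clamped, 3)
```

Keeping the bounds strictly inside `(0, 1)` means a reward is never exactly 0 or 1, which may be convenient if downstream code takes logarithms or ratios of the scores.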

## Component Rewards

The runtime keeps 13 reward columns:

- `format_compliance_score`
- `candidate_alignment_score`
- `legality_score`
- `safety_delta_score`
- `burden_improvement_score`
- `disease_stability_score`
- `dosing_quality_score`
- `abstention_quality_score`
- `efficiency_score`
- `process_fidelity_score`
- `explanation_grounding_score`
- `anti_cheat_score`
- `uncertainty_calibration_score`

## Primary Channels

The component columns map into 4 judge-friendly reward channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

These channels are exposed in `info.primary_reward_channels`, logged during GRPO verification, and plotted in evaluation reports.
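
One way such a mapping could look is sketched below. The text does not state which column feeds which channel or how columns are combined, so the grouping in `CHANNEL_COLUMNS` and the unweighted mean are guesses for illustration only; `primary_reward_channels` is a hypothetical helper name.

```python
from statistics import fmean

# ASSUMPTION: this grouping is illustrative; the real column-to-channel
# assignment is not given in this document.
CHANNEL_COLUMNS = {
    "safety_legality": [
        "legality_score",
        "safety_delta_score",
        "candidate_alignment_score",
    ],
    "clinical_improvement": [
        "burden_improvement_score",
        "disease_stability_score",
        "abstention_quality_score",
    ],
    "dosing_quality": ["dosing_quality_score"],
    "process_integrity": [
        "format_compliance_score",
        "efficiency_score",
        "process_fidelity_score",
        "explanation_grounding_score",
        "anti_cheat_score",
        "uncertainty_calibration_score",
    ],
}


def primary_reward_channels(components: dict[str, float]) -> dict[str, float]:
    """Collapse the 13 component columns into 4 channel scores."""
    channels = {}
    for channel, columns in CHANNEL_COLUMNS.items():
        # ASSUMPTION: unweighted mean; the runtime may weight columns differently.
        mean = fmean(components[name] for name in columns)
        channels[channel] = round(min(max(mean, 0.001), 0.999), 3)
    return channels
```

Whatever the real assignment is, keeping it in a single table like this makes it easy to check that every column is counted exactly once.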

## Anti-Hacking Checks

The environment explicitly penalizes repeated action loops, keep-regimen abuse, review abuse, candidate ID mismatch, illegal candidate selection, known high-risk DDI no-op behavior, parser exploit patterns, and retrying a failed no-op action.
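
As an illustration of the first check, a repeated-action loop could be detected roughly as below. This is a sketch under assumptions, not the environment's implementation: the function name, the `window` parameter, and the idea that actions are comparable strings are all hypothetical.

```python
from typing import Sequence


def is_repeated_action_loop(actions: Sequence[str], window: int = 3) -> bool:
    """Return True when the last `window` actions are all identical.

    ASSUMPTION: actions are represented as strings and a loop means the same
    action repeated `window` times in a row; the real rule may differ.
    """
    if window < 2 or len(actions) < window:
        return False
    recent = actions[-window:]
    return all(action == recent[0] for action in recent)
```

For example, `is_repeated_action_loop(["keep_regimen"] * 3)` is true, while a history that alternates between two actions is not flagged by this simple rule.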

## Failure Visibility

Per-step payloads include `failure_reasons`, `invalid_action_count`, `checks`, timeout flags, safety report, anti-cheat reasons, transition delta, reward breakdown, and primary reward channels.
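
A small helper for inspecting these payloads across an episode might look like the sketch below. Only the keys `failure_reasons` and `invalid_action_count` come from the text; their types (a list of strings and a running integer count), the helper name `summarize_failures`, and the example reason strings are assumptions.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_failures(step_infos: Iterable[Mapping[str, Any]]) -> dict:
    """Aggregate failure information from a sequence of per-step payloads."""
    reasons: Counter = Counter()
    invalid_actions = 0
    for info in step_infos:
        # ASSUMPTION: `failure_reasons` is a list of short reason strings.
        reasons.update(info.get("failure_reasons", []))
        # ASSUMPTION: `invalid_action_count` is a running total, so keep the max.
        invalid_actions = max(invalid_actions, info.get("invalid_action_count", 0))
    return {
        "failure_reasons": dict(reasons),
        "invalid_action_count": invalid_actions,
    }


# Hypothetical payloads, for illustration only.
example = summarize_failures([
    {"failure_reasons": ["illegal_candidate"], "invalid_action_count": 1},
    {"failure_reasons": ["illegal_candidate", "timeout"], "invalid_action_count": 2},
])
```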