# Reward Design
|
|
All reward outputs are clamped to `[0.001, 0.999]` and reported with 3-decimal precision. The reward model is intentionally decomposed so that both training progress and reward hacking are inspectable.
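A minimal sketch of this clamping rule, assuming a standalone helper (the name `clamp_reward` is illustrative, not the environment's actual function):

```python
def clamp_reward(value: float) -> float:
    """Clamp a raw reward into [0.001, 0.999] and round to 3 decimals."""
    return round(min(max(value, 0.001), 0.999), 3)


assert clamp_reward(1.7) == 0.999      # saturates at the upper bound
assert clamp_reward(-0.2) == 0.001     # saturates at the lower bound
assert clamp_reward(0.51234) == 0.512  # rounded to 3-decimal precision
```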
|
|
## Component Rewards
|
|
The runtime keeps 13 reward columns (a container sketch follows the list):
|
|
- `format_compliance_score`
- `candidate_alignment_score`
- `legality_score`
- `safety_delta_score`
- `burden_improvement_score`
- `disease_stability_score`
- `dosing_quality_score`
- `abstention_quality_score`
- `efficiency_score`
- `process_fidelity_score`
- `explanation_grounding_score`
- `anti_cheat_score`
- `uncertainty_calibration_score`
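These columns can be carried as one flat record per step. A minimal sketch, assuming a dataclass-style container (the `RewardBreakdown` name and the zero defaults are illustrative):

```python
from dataclasses import dataclass, asdict


@dataclass
class RewardBreakdown:
    """Flat record of the 13 component reward columns for one step."""
    format_compliance_score: float = 0.0
    candidate_alignment_score: float = 0.0
    legality_score: float = 0.0
    safety_delta_score: float = 0.0
    burden_improvement_score: float = 0.0
    disease_stability_score: float = 0.0
    dosing_quality_score: float = 0.0
    abstention_quality_score: float = 0.0
    efficiency_score: float = 0.0
    process_fidelity_score: float = 0.0
    explanation_grounding_score: float = 0.0
    anti_cheat_score: float = 0.0
    uncertainty_calibration_score: float = 0.0

    def as_columns(self) -> dict[str, float]:
        """Return the breakdown as a column-name -> value mapping."""
        return asdict(self)
```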
|
|
## Primary Channels
|
|
The component columns map into 4 judge-friendly reward channels (a possible grouping is sketched after this list):
|
|
- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`
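The exact component-to-channel assignment is not spelled out here, so the grouping and the averaging rule below are assumptions used only to illustrate the roll-up:

```python
# Hypothetical component-to-channel grouping; the real assignment may differ.
CHANNEL_COMPONENTS = {
    "safety_legality": ["legality_score", "safety_delta_score", "anti_cheat_score"],
    "clinical_improvement": ["burden_improvement_score", "disease_stability_score"],
    "dosing_quality": ["dosing_quality_score"],
    "process_integrity": [
        "format_compliance_score",
        "process_fidelity_score",
        "explanation_grounding_score",
    ],
}


def to_channels(columns: dict[str, float]) -> dict[str, float]:
    """Average each channel's component columns into a single channel score."""
    return {
        channel: sum(columns[name] for name in names) / len(names)
        for channel, names in CHANNEL_COMPONENTS.items()
    }
```

Given a full column mapping (for example, `RewardBreakdown().as_columns()` from the sketch above), `to_channels` yields one score per channel.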
|
|
These channels are exposed in `info.primary_reward_channels`, logged during GRPO verification, and plotted in evaluation reports.
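A minimal sketch of pulling the channels out of a step's `info` payload for logging; the flat dictionary shape is an assumption based on the field name above:

```python
def format_primary_channels(info: dict) -> str:
    """Render the four primary channels from an info payload as a log line."""
    channels = info.get("primary_reward_channels", {})
    names = ("safety_legality", "clinical_improvement",
             "dosing_quality", "process_integrity")
    return " ".join(f"{name}={channels.get(name, 0.0):.3f}" for name in names)


example_info = {"primary_reward_channels": {
    "safety_legality": 0.812, "clinical_improvement": 0.455,
    "dosing_quality": 0.999, "process_integrity": 0.730}}
print(format_primary_channels(example_info))
```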
|
|
## Anti-Hacking Checks
|
|
The environment explicitly penalizes the following behaviors (one such check is sketched below):

- repeated action loops
- keep-regimen abuse
- review abuse
- candidate ID mismatch
- illegal candidate selection
- no-op behavior in the presence of a known high-risk DDI
- parser exploit patterns
- retrying a failed no-op action
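A hedged sketch of the first check, repeated action loops; the window size, threshold, and function name are illustrative assumptions rather than the environment's actual parameters:

```python
from collections import Counter


def repeated_action_loop(action_history: list[str], window: int = 6, limit: int = 3) -> bool:
    """Flag a loop when one action repeats more than `limit` times in the recent window."""
    recent = action_history[-window:]
    if not recent:
        return False
    _, count = Counter(recent).most_common(1)[0]
    return count > limit
```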
|
|
## Failure Visibility
|
|
Per-step payloads include `failure_reasons`, `invalid_action_count`, `checks`, timeout flags, the safety report, anti-cheat reasons, the transition delta, the reward breakdown, and the primary reward channels.
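An illustrative shape for such a payload; the field list comes from the description above, but the nested keys and example values are assumptions:

```python
example_step_info = {
    "failure_reasons": ["illegal_candidate_selection"],  # why the step was scored as a failure
    "invalid_action_count": 2,                           # running count of invalid actions
    "checks": {"parser_exploit": False, "repeated_action_loop": False},
    "timeout": False,                                     # timeout flag
    "safety_report": {"high_risk_ddi": []},
    "anti_cheat_reasons": [],
    "transition_delta": {"disease_stability_score": 0.05},
    "reward_breakdown": {"legality_score": 0.001, "safety_delta_score": 0.450},
    "primary_reward_channels": {"safety_legality": 0.120},
}
```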
|
|