# Reward Design

All reward outputs are clamped to `[0.001, 0.999]` and reported at 3-decimal precision. The reward model is intentionally decomposed so that both training progress and reward hacking are inspectable.

## Component Rewards

The runtime keeps 13 reward columns:

- `format_compliance_score`
- `candidate_alignment_score`
- `legality_score`
- `safety_delta_score`
- `burden_improvement_score`
- `disease_stability_score`
- `dosing_quality_score`
- `abstention_quality_score`
- `efficiency_score`
- `process_fidelity_score`
- `explanation_grounding_score`
- `anti_cheat_score`
- `uncertainty_calibration_score`

## Primary Channels

The component columns map into 4 judge-friendly reward channels:

- `safety_legality`
- `clinical_improvement`
- `dosing_quality`
- `process_integrity`

These channels are exposed in `info.primary_reward_channels`, logged during GRPO verification, and plotted in evaluation reports.

## Anti-Hacking Checks

The environment explicitly penalizes:

- repeated action loops
- keep-regimen abuse
- review abuse
- candidate ID mismatch
- illegal candidate selection
- known high-risk DDI (drug-drug interaction) no-op behavior
- parser exploit patterns
- retrying a failed no-op action

## Failure Visibility

Per-step payloads include `failure_reasons`, `invalid_action_count`, `checks`, timeout flags, the safety report, anti-cheat reasons, the transition delta, the reward breakdown, and the primary reward channels.
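The clamping rule and the component-to-channel mapping described above can be sketched as follows. This is a minimal illustration, not the environment's implementation: the function names (`clamp_reward`, `primary_channels`) and the grouping in `CHANNEL_COMPONENTS` are hypothetical; only the `[0.001, 0.999]` range, the 3-decimal precision, the 13 component columns, and the 4 channel names come from the text above.

```python
def clamp_reward(x: float) -> float:
    """Clamp a raw reward to [0.001, 0.999], then round to 3 decimals."""
    return round(min(max(x, 0.001), 0.999), 3)

# Illustrative grouping of component columns into the 4 primary channels.
# The actual weighting/assignment used by the runtime is not specified here.
CHANNEL_COMPONENTS = {
    "safety_legality": ["legality_score", "safety_delta_score", "anti_cheat_score"],
    "clinical_improvement": ["burden_improvement_score", "disease_stability_score"],
    "dosing_quality": ["dosing_quality_score"],
    "process_integrity": ["format_compliance_score", "process_fidelity_score",
                          "explanation_grounding_score"],
}

def primary_channels(components: dict[str, float]) -> dict[str, float]:
    """Average the component scores within each channel, clamped like any reward."""
    return {
        channel: clamp_reward(sum(components[c] for c in cols) / len(cols))
        for channel, cols in CHANNEL_COMPONENTS.items()
    }
```

Keeping the clamp in a single helper means every emitted reward, whether a component column or an aggregated channel, obeys the same `[0.001, 0.999]` / 3-decimal contract.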