	# Reward Design

	All reward outputs are clamped to `[0.001, 0.999]` and rounded to three decimal places. The reward model is deliberately decomposed into named components so that both training progress and reward hacking can be inspected.
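
The clamping rule can be illustrated with a minimal sketch. The function name `clamp_reward` is hypothetical and not taken from the runtime. The sketch also assumes clamping happens before rounding, which the text above does not specify.

```python
def clamp_reward(raw: float, low: float = 0.001, high: float = 0.999) -> float:
    """Clamp a raw score into [low, high], then round to 3 decimal places.

    Hypothetical helper: the bounds come from the design doc, the
    clamp-then-round order is an assumption.
    """
    clamped = min(max(raw, low), high)
    return round(clamped, 3)
```

Under these assumptions no component can report an exact `0.0` or `1.0`: `clamp_reward(1.2)` gives `0.999` and `clamp_reward(-0.5)` gives `0.001`.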

	## Component Rewards

	The runtime keeps 13 reward columns:

	- `format_compliance_score`
	- `candidate_alignment_score`
	- `legality_score`
	- `safety_delta_score`
	- `burden_improvement_score`
	- `disease_stability_score`
	- `dosing_quality_score`
	- `abstention_quality_score`
	- `efficiency_score`
	- `process_fidelity_score`
	- `explanation_grounding_score`
	- `anti_cheat_score`
	- `uncertainty_calibration_score`

	## Primary Channels

	The component columns map into 4 judge-friendly reward channels:

	- `safety_legality`
	- `clinical_improvement`
	- `dosing_quality`
	- `process_integrity`

	These channels are exposed in `info.primary_reward_channels`, logged during GRPO verification, and plotted in evaluation reports.
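
This document does not state which component columns feed which channel, or how they are combined. The sketch below is therefore illustrative only: the grouping in `ASSUMED_CHANNEL_MAP`, the unweighted mean, and the function name `to_primary_channels` are all assumptions, not the runtime's actual mapping.

```python
from statistics import fmean
from typing import Dict, Mapping, Tuple

# ASSUMPTION: this grouping is a guess made for illustration only.
ASSUMED_CHANNEL_MAP: Dict[str, Tuple[str, ...]] = {
    "safety_legality": ("legality_score", "safety_delta_score"),
    "clinical_improvement": ("burden_improvement_score", "disease_stability_score"),
    "dosing_quality": ("dosing_quality_score",),
    "process_integrity": (
        "format_compliance_score",
        "candidate_alignment_score",
        "abstention_quality_score",
        "efficiency_score",
        "process_fidelity_score",
        "explanation_grounding_score",
        "anti_cheat_score",
        "uncertainty_calibration_score",
    ),
}


def to_primary_channels(components: Mapping[str, float]) -> Dict[str, float]:
    """Collapse the 13 component columns into 4 channels.

    ASSUMPTION: each channel is the unweighted mean of its columns,
    clamped to [0.001, 0.999] and rounded to 3 decimals.
    """
    channels: Dict[str, float] = {}
    for channel, columns in ASSUMED_CHANNEL_MAP.items():
        mean = fmean(components[name] for name in columns)
        channels[channel] = round(min(max(mean, 0.001), 0.999), 3)
    return channels
```

Keeping the mapping in a single table makes it easy to check that every component column belongs to exactly one channel.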

	## Anti-Hacking Checks

	The environment explicitly penalizes repeated action loops, keep-regimen abuse, review abuse, candidate ID mismatch, illegal candidate selection, known high-risk DDI no-op behavior, parser exploit patterns, and retrying a failed no-op action.
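
As one example, the repeated-action-loop check could look like the sketch below. The class name `LoopDetector`, the window of three steps, and the string action keys are hypothetical. The environment's actual detection logic is not described here.

```python
from collections import deque
from typing import Deque


class LoopDetector:
    """Flag an episode when the same action repeats for `window` steps in a row.

    Hypothetical sketch: the window size and the use of a string key
    to identify an action are assumptions.
    """

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.recent: Deque[str] = deque(maxlen=window)

    def observe(self, action_key: str) -> bool:
        """Record an action and return True if the window holds one repeated action."""
        self.recent.append(action_key)
        window_full = len(self.recent) == self.window
        return window_full and len(set(self.recent)) == 1
```

A caller would feed each step's action into `observe` and, when it returns `True`, lower the relevant score and record a reason for the step payload.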

	## Failure Visibility

	Per-step payloads include `failure_reasons`, `invalid_action_count`, `checks`, timeout flags, safety report, anti-cheat reasons, transition delta, reward breakdown, and primary reward channels.
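
A consumer of these payloads might reduce them to a short failure summary, as in the sketch below. It uses only the three keys named in code formatting above. The value shapes are assumptions (`failure_reasons` as a list of strings, `checks` as a name-to-boolean mapping), and `summarize_step` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping


def summarize_step(info: Mapping[str, Any]) -> Dict[str, Any]:
    """Extract failure-related fields from one step's payload.

    ASSUMPTIONS: `failure_reasons` is a list of strings, `checks` maps
    check names to booleans, and missing keys mean "no failure".
    """
    reasons = list(info.get("failure_reasons", []))
    checks = info.get("checks", {})
    return {
        "failed": bool(reasons),
        "failure_reasons": reasons,
        "invalid_action_count": int(info.get("invalid_action_count", 0)),
        "failed_checks": sorted(name for name, ok in checks.items() if not ok),
    }
```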