| # Mathematics Behind PolyGuard Agents |
|
|
| This note is the expert-facing mathematical map of PolyGuard: what the |
| agents optimize, how actions are constrained, how reward is computed, and why |
| the training stack uses SFT plus environment-verified GRPO instead of an |
unconstrained chat policy. It expands on the shorter `docs/math.md`.
|
|
| Source-of-truth implementation files: |
|
|
| - `app/env/env_core.py`: reset, observation, step, traces, OpenEnv state. |
| - `app/models/policy/candidate_builder.py`: constrained candidate set. |
| - `app/env/verifier.py`: hard legality and safety verifier. |
| - `app/env/transition.py`: state transition dynamics. |
| - `app/env/reward_router.py`: reward decomposition and aggregation. |
| - `app/env/reward_scaling.py`: strict reward normalization. |
| - `app/env/anti_cheat.py`: reward-hacking guards. |
| - `app/agents/orchestrator.py`: multi-agent policy stack. |
| - `app/models/baselines/contextual_bandit_policy.py`: LinUCB/Thompson co-policy. |
| - `app/training/sft_trl.py`: supervised warm start. |
| - `app/training/grpo_trl.py`: TRL GRPO with environment reward verification. |
|
|
| ## 1. Problem Formulation |
|
|
| PolyGuard is best read as a finite-horizon constrained POMDP: |
|
|
| ```text |
| M = (S, A, O, T, R, H, C) |
| ``` |
|
|
| where: |
|
|
| - `S` is the latent patient/regimen state. |
| - `A` is the set of medication actions expressible by `PolyGuardAction`. |
| - `O` is the observation emitted to the agent. |
| - `T(s' | s, a)` is the simulator transition. |
| - `R(s, a, s')` is the verifier-backed reward. |
| - `H` is the episode horizon, derived from sub-environment difficulty. |
| - `C(s, a)` is the hard clinical/safety constraint predicate. |
|
|
| The policy objective is: |
|
|
| ```text |
| maximize_pi E_pi [ sum_{t=0}^{H-1} R(s_t, a_t, s_{t+1}) ] |
| subject to C(s_t, a_t) = 1 whenever possible |
| ``` |
|
|
| There is no explicit discount factor in the runtime. Time preference enters |
| through the finite horizon and the efficiency reward: |
|
|
| ```text |
| efficiency_t = q(1 - step_count_t / (max_steps + 1)) |
| ``` |
|
|
| where `q` is PolyGuard's reward clamp and quantizer: |
|
|
| ```text |
| q(x) = round(clip(x, 0.001, 0.999), 3) |
| ``` |
|
|
Why this framing: medication optimization is partially observable, long-horizon,
and safety-constrained. A free-form language model objective would permit
plausible-sounding but illegal actions. PolyGuard instead learns inside a small
legal action set with explicit reward columns, so failures remain auditable.
|
|
| ## 2. State, Observation, And Partial Observability |
|
|
| The latent state `s_t` is represented by `PolyGuardState`: |
|
|
| ```text |
| s_t = ( |
| patient profile, |
| active decision mode, |
| step count, |
| max steps, |
| risk summary, |
| burden score, |
| precision dosing flags, |
| unresolved conflicts, |
| action history, |
| cumulative reward, |
| done flag |
| ) |
| ``` |
|
|
| At reset, the initial risk summary is: |
|
|
| ```text |
| polypharmacy_count = number_of_medications |
| burden_score = min(1, number_of_medications / 12) |
| severe_pair_count = number_of_contraindicated_pairs |
| ``` |
|
|
| The agent does not receive all latent simulator internals. The observation |
| `o_t = O(s_t)` exposes a controlled view: |
|
|
| ```text |
| o_t = ( |
| patient summary, |
| medication table, |
| comorbidities, |
| organ function and labs/vitals, |
| graph safety summary, |
| burden summary, |
| precision dosing flags, |
| unresolved conflicts, |
| candidate action set, |
| step budget, |
| action history, |
| warnings, |
| abstention indicators |
| ) |
| ``` |
|
|
| Uncertainty is a simple observable proxy: |
|
|
| ```text |
| missing = I[egfr missing] + I[ast missing] + I[alt missing] |
| base_uncertainty = missing / 3 |
| conflict_penalty = min(0.3, 0.1 * number_of_unresolved_conflicts) |
| u_t = clip(base_uncertainty + conflict_penalty, 0, 1) |
| ``` |
|
|
| The environment recommends abstention/review when: |
|
|
| ```text |
| u_t > 0.65 |
| ``` |
|
|
| The supervisor uses a stricter routing threshold: |
|
|
| ```text |
| mode_t = REVIEW if u_t > 0.72 |
| mode_t = DOSE_OPT if sub_environment = PRECISION_DOSING or dosing is active |
| mode_t = REGIMEN_OPT otherwise |
| ``` |
|
|
| Why this choice: the observation keeps the agent honest. Missing labs and |
| conflicts are not hidden from reward, but they are presented as uncertainty |
| signals that should change policy behavior rather than invite overconfident |
| recommendations. |
|
|
| ## 3. Constrained Action Model |
|
|
| The runtime action is a strict `PolyGuardAction`: |
|
|
| ```text |
| a_t = ( |
| mode, |
| action_type, |
| target_drug, |
| replacement_drug, |
| dose_bucket, |
| taper_days, |
| monitoring_plan, |
| evidence_query, |
| new_drug_name, |
| candidate_components, |
| candidate_id, |
| confidence, |
| rationale_brief |
| ) |
| ``` |
|
|
| The environment first builds a candidate set: |
|
|
| ```text |
| C_t = B(s_t) |
| ``` |
|
|
| where `B` is `build_candidates`. Candidate generation is rule-seeded and |
| bounded: |
|
|
| ```text |
| 3 <= |C_t| <= 10 |
| ``` |
|
|
| Each candidate carries proxy features: |
|
|
| ```text |
| c = ( |
| candidate_id, |
| mode, |
| action_type, |
| estimated_safety_delta, |
| burden_delta, |
| disease_stability_estimate, |
| uncertainty_score, |
| legality_precheck, |
| rationale_tags |
| ) |
| ``` |
|
|
| The legal candidate set is: |
|
|
| ```text |
| L_t = { c in C_t : verifier(s_t, c).legal = true } |
| ``` |
|
|
| Policy selection is candidate selection, not arbitrary action synthesis: |
|
|
| ```text |
| a_t = to_action(c_t), c_t in C_t |
| ``` |
|
|
| The action type space is intentionally small: |
|
|
| ```text |
| KEEP_REGIMEN |
| STOP_DRUG |
| SUBSTITUTE_WITHIN_CLASS |
| RECOMMEND_ALTERNATIVE |
| REDUCE_DOSE_BUCKET |
| INCREASE_DOSE_BUCKET |
| TAPER_INITIATE |
| TAPER_CONTINUE |
| DOSE_HOLD |
| ORDER_MONITORING_AND_WAIT |
| FETCH_EXTERNAL_EVIDENCE |
| DECOMPOSE_NEW_DRUG |
| REQUEST_SPECIALIST_REVIEW |
| REQUEST_PHARMACIST_REVIEW |
| ``` |
|
|
| Why this choice: most safety failures in clinical LLM tasks come from an |
| unbounded output space. PolyGuard makes the LLM solve ranking and explanation |
| inside a constrained action manifold, then lets the verifier and transition |
| system enforce semantics. |
|
|
| ## 4. Hard Legality Constraints |
|
|
| The verifier computes: |
|
|
| ```text |
| V(s_t, a_t) = (legal, violations, severity, fallback) |
| ``` |
|
|
| Examples of hard constraints: |
|
|
| - The target drug must exist in the current regimen when required. |
| - Substitutions and alternatives must be drawn from allowed substitution rules. |
| - Evidence-fetch URLs must be allowlisted. |
| - New-drug decomposition must include a new drug and components. |
| - Abrupt stopping is illegal when taper rules require tapering. |
| - Renal/hepatic unsafe dose escalation is illegal. |
| - Duplicate therapy and contraindicated substitutions are illegal. |
| - Monitoring/hold actions require a monitoring plan. |
| - Destabilizing deprescribing patterns are illegal. |
|
|
| The environment step uses a two-gate transition: |
|
|
| ```text |
| if V(s_t, a_t).legal and not anti_cheat(s_t, a_t): |
| s_{t+1} = T(s_t, a_t) |
| else: |
| s_{t+1} = rollback_state_with_failed_action_record(s_t, a_t) |
| ``` |
|
|
| Even blocked actions advance the step count and become visible in |
| `action_history`, `failure_reasons`, `invalid_action_count`, and trace logs. |
|
|
| Why this choice: legality is a constraint, not a soft preference. The reward |
| still exposes illegal behavior numerically, but illegal behavior is prevented |
| from mutating patient state. |
|
|
| ## 5. Transition Dynamics |
|
|
| The transition function mutates the regimen and derived risk state. Important |
| deterministic transitions include: |
|
|
| ```text |
| STOP_DRUG: |
| medications' = medications without target_drug |
| |
| SUBSTITUTE_WITHIN_CLASS or RECOMMEND_ALTERNATIVE: |
| target_drug' = replacement_drug |
| |
| REDUCE_DOSE_BUCKET / INCREASE_DOSE_BUCKET: |
| dose_bucket moves one level over [LOW, MEDIUM, HIGH] |
| |
| DOSE_HOLD: |
| dose_bucket' = HOLD |
| |
| ORDER_MONITORING_AND_WAIT: |
| optional hold + unresolved review conflicts cleared |
| |
| REQUEST_*_REVIEW: |
| active_mode' = REVIEW |
| unresolved_conflicts append review marker |
| |
| FETCH_EXTERNAL_EVIDENCE: |
| external mention/component counts update |
| missing-data conflicts can be cleared |
| |
| DECOMPOSE_NEW_DRUG: |
| component count and unknown-risk flags update |
| ``` |
|
|
| After any applied transition, burden is recomputed with dose weights: |
|
|
| ```text |
| w(LOW) = 0.70 |
| w(MEDIUM) = 1.00 |
| w(HIGH) = 1.25 |
| w(HOLD) = 0.45 |
| w(NA) = 1.00 |
| |
| burden_{t+1} = clip( sum_{m in medications_{t+1}} w(dose_bucket_m) / 12, 0, 1 ) |
| ``` |
|
|
| The severe-pair count is recomputed from known contraindicated pairs: |
|
|
| ```text |
| severe_pair_count_{t+1} = |
| |{(i, j): i < j and contraindicated(drug_i, drug_j)}| |
| ``` |
|
|
| Why this choice: transitions are intentionally deterministic and inspectable. |
| That makes reward debugging and training reproducibility easier than a hidden |
| black-box clinical simulator. |
|
|
| ## 6. Multi-Agent Factorization |
|
|
| PolyGuard's "agents" are a policy factorization, not independent RL learners |
| with separate private rewards. Each module emits features, candidates, gates, |
| or explanations consumed by the next stage: |
|
|
| ```text |
| MedRec -> Evidence -> GraphSafety -> Dosing -> Candidate |
| -> Supervisor -> Planner -> Critic -> Env -> Explainer |
| ``` |
|
|
| The orchestrated policy can be written: |
|
|
| ```text |
| pi(a | o) = |
| pi_critic( |
| pi_planner( |
| top_k_bandit( |
| pi_supervisor( |
          z_medrec, z_evid, z_graph, z_dose, C_t
| ) |
| ) |
| ) |
| ) |
| ``` |
|
|
| More concretely: |
|
|
| ```text |
| z_medrec = f_medrec(s_t) |
| z_evid = f_evidence(s_t) |
| z_graph = f_graph(s_t) |
| z_dose = f_dosing(s_t) |
| C_t = f_candidate(s_t) |
| m_t = f_supervisor(s_t, z_dose) |
| K_t = f_bandit(C_t, m_t) |
| a_hat_t = f_planner(K_t, m_t, provider_prompt) |
| a_t = f_critic(s_t, a_hat_t) |
| ``` |
|
|
| Coordination modes change the graph behavior: |
|
|
| - `sequential_pipeline`: one pass through the stack. |
| - `supervisor_routed`: filters candidates by macro mode. |
| - `replan_on_veto`: replans into review mode when the critic rejects. |
| - `lightweight_debate`: allows a small debate/replan signal around vetoes. |
|
|
| Why this choice: the decomposition creates audit points. Experts can inspect |
| whether a failure came from candidate construction, uncertainty routing, |
| planner choice, critic behavior, transition logic, or reward shaping. |
|
|
| ## 7. Graph Safety Mathematics |
|
|
| The graph safety module summarizes regimen risk. In the no-artifact fallback, |
| the encoder maps a regimen to a 24-dimensional vector: |
|
|
| ```text |
| g = encode_regimen(drugs) in R^24 |
| ``` |
|
|
| The vector includes hashed drug identity features, drug-class counts, |
| side-effect tag load, medication count, contraindicated-pair count, and flags |
| for sedative, anticoagulant, and glucose-lowering classes. |
|
|
| Pairwise DDI severity is: |
|
|
| ```text |
| score_pair(a, b) = |
| 0.95 if contraindicated(a, b) |
| 0.15 otherwise |
| ``` |
|
|
| Fallback severe-alert probability is: |
|
|
| ```text |
| p_severe = min(0.99, 0.10 + 0.30 * number_of_risky_pairs) |
| ``` |
|
|
| Side-effect probabilities normalize ontology tag counts: |
|
|
| ```text |
| p(tag) = count(tag across regimen) / sum_tag count(tag) |
| ``` |
|
|
| If a trained graph artifact exists, learned heads may override the fallback |
| severe-alert and side-effect estimates. |
|
|
| Why this choice: the graph model supplies dense safety features while the |
| verifier still enforces hard contraindication rules. Learned risk can help |
| ranking, but it is not trusted as the only safety barrier. |
|
|
| ## 8. Dosing Mathematics |
|
|
Dose-sensitive drugs are currently selected from three sensitive drug classes:
|
|
| ```text |
| {anticoagulant, sedative, glucose_lowering} |
| ``` |
|
|
| Dose features include interaction load and organ stress: |
|
|
| ```text |
| interaction_load = min(1, number_of_medications / 12) |
| |
| organ_stress = min( |
| 1, |
| max(0, (35 - egfr) / 35) |
| + max(0, (ast - 80) / 80) |
| + max(0, (alt - 80) / 80) |
| ) |
| ``` |
|
|
| The surrogate PK/PD state is: |
|
|
| ```text |
| x = ( |
| effect_level, |
| toxicity_level, |
| underdose_risk, |
| organ_stress, |
| interaction_load |
| ) |
| ``` |
|
|
| Initial proxies: |
|
|
| ```text |
| effect_0 = min(1, 0.35 + 0.45 * adherence) |
| toxicity_0 = min(1, 0.08 + 0.40 * organ_stress) |
| underdose_0 = max(0, 1 - effect_0) |
| ``` |
|
|
| For a dose change `d`: |
|
|
| ```text |
| effective_delta = d * (1 - min(0.6, 0.4 * organ_stress)) |
| |
| effect' = |
| clip(effect + 0.28 * effective_delta - 0.05 * interaction_load, 0, 1) |
| |
| toxicity_gain = |
| max(0, d) * (0.35 + 0.25 * organ_stress + 0.20 * interaction_load) |
| |
| toxicity' = |
| clip(0.85 * toxicity + toxicity_gain, 0, 1) |
| |
| underdose' = |
| clip(1 - effect' + 0.15 * max(0, -d), 0, 1) |
| ``` |
|
|
| Dosing quality proxies: |
|
|
| ```text |
| target_attainment = clip(1 - |effect_level - 0.62|, 0, 1) |
| toxicity_proxy = min(1, toxicity + 0.20 * organ_stress + 0.12 * interaction_load) |
| underdose_proxy = min(1, underdose_risk + max(0, 0.30 - effect_level)) |
| measurement_need = max(toxicity_proxy, underdose_proxy) |
| ``` |
|
|
| The runtime reward currently uses a coarse dose-mode reward: |
|
|
| ```text |
| dosing_quality_score = 0.75 if action.mode = DOSE_OPT else 0.50 |
| ``` |
|
|
| The detailed PK/PD analysis is still useful because it influences the agent |
| stack and evaluation, even when the scalar reward channel remains deliberately |
| simple. |
|
|
| Why this choice: dose optimization needs its own state features, but dense |
| dosing reward must not overpower legality and safety in early RL training. |
|
|
| ## 9. Contextual Bandit Co-Policy |
|
|
| The bandit proposes a top-k shortlist before the planner finalizes an action. |
| Each candidate becomes an 8-dimensional feature vector: |
|
|
| ```text |
| x(c) = [ |
| 1, |
| I[legality_precheck], |
| estimated_safety_delta, |
| burden_delta, |
| disease_stability_estimate, |
| 1 - uncertainty_score, |
| I[mode = DOSE_OPT], |
| I[mode = REVIEW] |
| ] |
| ``` |
|
|
| An arm is keyed by macro mode and action type: |
|
|
| ```text |
| arm(c) = mode(c) || ":" || action_type(c) |
| ``` |
|
|
| ### LinUCB |
|
|
| For each arm `a`, PolyGuard maintains: |
|
|
| ```text |
| A_a = I + sum x x^T |
| b_a = sum r x |
| theta_a = A_a^{-1} b_a |
| ``` |
|
|
| The score is: |
|
|
| ```text |
| score_a(x) = |
| theta_a^T x + alpha * sqrt(x^T A_a^{-1} x) |
| ``` |
|
|
where `alpha` is read from `POLYGUARD_BANDIT_ALPHA` and defaults to `0.55`.
|
|
| ### Thompson Sampling Variant |
|
|
| The alternate score is: |
|
|
| ```text |
| score_a(x) = theta_a^T x + Normal(0, alpha) |
| ``` |
|
|
| The absolute sampled noise is logged as the exploration bonus. |
|
|
| ### Explicit Exploration |
|
|
| With probability `epsilon`, default `0.1`, the policy swaps the top candidate |
| with another candidate in the sorted list: |
|
|
| ```text |
| if Uniform(0, 1) < epsilon: |
| swap(scored[0], scored[random_non_top_index]) |
| ``` |
|
|
| After the environment step: |
|
|
| ```text |
| A_a <- A_a + x x^T |
| b_a <- b_a + r x |
| ``` |
|
|
| Why this choice: the bandit gives a sample-efficient, inspectable exploration |
| layer. It can improve candidate ordering without allowing the LLM to leave the |
| safe candidate space. |
|
|
| ## 10. Planner Policy |
|
|
| The planner receives candidates, a supervisor mode, and optional provider |
| context. It filters candidates by mode when possible: |
|
|
| ```text |
| C_t^m = { c in C_t : mode(c) = m_t } |
| ``` |
|
|
| Then the provider selects a candidate id: |
|
|
| ```text |
| y_t ~ pi_theta(. | prompt(C_t^m, o_t)) |
| candidate_id = parse(y_t) |
| a_hat_t = to_action(candidate_id) |
| ``` |
|
|
If an active Transformers/adapter artifact is available, the model generates a
completion and the runtime extracts the emitted `cand_NN` identifier. If no
artifact is available or loading fails, the deterministic safety ranker chooses:
|
|
| ```text |
| argmax_c (legality_precheck(c), estimated_safety_delta(c), -uncertainty_score(c)) |
| ``` |
|
|
| The planner confidence is: |
|
|
| ```text |
| confidence = max(0.45, 1 - uncertainty_score(candidate)) |
| ``` |
|
|
Why this choice: the learned policy is used where language models add value,
namely contextual judgment over a compact candidate set plus rationale
generation. Ranking fallbacks keep the product path deterministic and testable
when model artifacts are unavailable.
|
|
| ## 11. Critic And Safety Veto |
|
|
| The critic re-runs the verifier: |
|
|
| ```text |
| report = V(s_t, a_hat_t) |
| ``` |
|
|
| If the report is legal: |
|
|
| ```text |
| a_t = a_hat_t |
| ``` |
|
|
| Otherwise, the critic returns a review-style fallback action. The environment |
| still subjects that final action to the same legality and anti-cheat gates, so |
| critic output is not privileged over the environment. |
|
|
| Why this choice: the planner is allowed to be probabilistic, but state mutation |
| is not. The critic provides an additional audit point before the environment |
| transition. |
|
|
| ## 12. Anti-Cheat And Reward-Hacking Guards |
|
|
| The anti-cheat detector computes an exploit predicate: |
|
|
| ```text |
| E(s_t, a_t) in {0, 1} |
| ``` |
|
|
| It fires on: |
|
|
| - repeated candidate loops over the last `MAX_REPEATED_ACTIONS = 3` actions; |
| - excessive keep-regimen behavior after at least 3 actions; |
| - excessive review behavior after at least 3 actions; |
| - malformed candidate ids; |
| - candidate ids outside the legal candidate set; |
| - repeated no-op retries after failed actions; |
| - parser exploit patterns in rationale text; |
| - repeated no-op behavior on a hidden high-risk DDI holdout pair. |
|
|
| The configured ratio thresholds are: |
|
|
| ```text |
| MAX_KEEP_REGIMEN_RATIO = 0.6 |
| MAX_REVIEW_RATIO = 0.5 |
| ``` |
|
|
| Reward impact: |
|
|
| ```text |
| anti_cheat_score = 0.001 if E(s_t, a_t) else 0.999 |
| ``` |
|
|
| Termination impact: |
|
|
| ```text |
| done = true, reason = "exploit_detection" if E(s_t, a_t) |
| ``` |
|
|
| Why this choice: RL policies exploit reward functions. PolyGuard makes common |
| shortcuts explicit, penalized, and visible in traces instead of treating them |
| as silent bad luck. |
|
|
| ## 13. Reward Components |
|
|
| PolyGuard computes 13 reward columns. Every component is clamped by `q`. |
|
|
| Let: |
|
|
| ```text |
| u_t = overall uncertainty |
| legal = V(s_t, a_t).legal |
| exploit = E(s_t, a_t) |
| pre_burden, post_burden = burden before/after step |
| pre_pairs, post_pairs = severe-pair count before/after step |
| ``` |
|
|
| Risk-like deltas become rewards through: |
|
|
| ```text |
| delta_reward(pre, post) = q(0.5 + 0.6 * (pre - post)) |
| ``` |
|
|
| So: |
|
|
| ```text |
| burden_reward = delta_reward(pre_burden, post_burden) |
| pair_reward = delta_reward(pre_pairs, post_pairs) |
| |
| safety_delta_score = |
| q(0.65 * pair_reward + 0.35 * burden_reward) if legal |
| 0.001 otherwise |
| ``` |
|
|
| The current component formulas are: |
|
|
| | Component | Formula | |
| | --- | --- | |
| | `format_compliance_score` | `0.999` after schema validation | |
| | `candidate_alignment_score` | `0.999` if `candidate_id` starts with `cand_`, else `0.001` | |
| | `legality_score` | `0.999` if legal, else `0.001` | |
| | `safety_delta_score` | weighted pair/burden improvement if legal, else `0.001` | |
| | `burden_improvement_score` | `burden_reward` if legal, else `0.001` | |
| `disease_stability_score` | `0.90`, or `0.58` for `STOP_DRUG` and `INCREASE_DOSE_BUCKET` |
| | `dosing_quality_score` | `0.75` if action mode is `DOSE_OPT`, else `0.50` | |
| | `abstention_quality_score` | `0.82` for review action with `u_t > 0.6`, else `0.56` | |
| | `efficiency_score` | `q(1 - step_count / (max_steps + 1))` | |
| | `process_fidelity_score` | `0.92` if legal, else `0.08` | |
| | `explanation_grounding_score` | `0.80` if rationale exists, else `0.20` | |
| | `anti_cheat_score` | `0.001` if exploit detected, else `0.999` | |
| | `uncertainty_calibration_score` | `q(1 - |confidence - (1 - u_t)|)` | |
|
|
| Sub-environment modifiers: |
|
|
| ```text |
| WEB_SEARCH_MISSING_DATA: |
| FETCH_EXTERNAL_EVIDENCE: |
| process_fidelity_score >= 0.90 |
| explanation_grounding_score >= 0.85 |
| otherwise: |
| process_fidelity_score *= 0.75 |
| |
| ALTERNATIVE_SUGGESTION: |
| RECOMMEND_ALTERNATIVE or SUBSTITUTE_WITHIN_CLASS: |
| safety_delta_score >= 0.88 |
| burden_improvement_score >= 0.76 |
| otherwise: |
| safety_delta_score *= 0.82 |
| |
| NEW_DRUG_DECOMPOSITION: |
| DECOMPOSE_NEW_DRUG with components: |
| explanation_grounding_score >= 0.90 |
| process_fidelity_score >= 0.88 |
| uncertainty_calibration_score >= 0.82 |
| otherwise: |
| explanation_grounding_score *= 0.70 |
| ``` |
|
|
| Why this choice: dense reward reduces sparse-credit problems, but the columns |
| are semantically separated so experts can detect when total reward improves |
| for the wrong reason. |
|
|
| ## 14. Primary Reward Channels |
|
|
| The 13 columns roll up into four primary channels: |
|
|
| ```text |
| safety_legality = |
| avg( |
| legality_score, |
| candidate_alignment_score, |
| anti_cheat_score, |
| uncertainty_calibration_score |
| ) |
| |
| clinical_improvement = |
| avg( |
| safety_delta_score, |
| burden_improvement_score, |
| disease_stability_score |
| ) |
| |
| dosing_quality = |
| avg( |
| dosing_quality_score, |
| abstention_quality_score |
| ) |
| |
| process_integrity = |
| avg( |
| format_compliance_score, |
| efficiency_score, |
| process_fidelity_score, |
| explanation_grounding_score |
| ) |
| ``` |
|
|
| Each average is clamped through `q`. These channels are emitted in |
| `info.primary_reward_channels`, GRPO logs, reports, plots, and ablation |
| summaries. |
|
|
| Why this choice: primary channels make the reward legible to judges and domain |
| experts without hiding the lower-level reward columns needed for debugging. |
|
|
| ## 15. Total Reward |
|
|
| The scalar environment reward is a weighted average: |
|
|
| ```text |
| R_env(s_t, a_t, s_{t+1}) = |
| q( sum_i w_i c_i / sum_i w_i ) |
| ``` |
|
|
| Current weights sum to 1: |
|
|
| | Component | Weight | |
| | --- | ---: | |
| | `format_compliance_score` | `0.08` | |
| | `candidate_alignment_score` | `0.08` | |
| | `legality_score` | `0.12` | |
| | `safety_delta_score` | `0.15` | |
| | `burden_improvement_score` | `0.08` | |
| | `disease_stability_score` | `0.10` | |
| | `dosing_quality_score` | `0.08` | |
| | `abstention_quality_score` | `0.06` | |
| | `efficiency_score` | `0.06` | |
| | `process_fidelity_score` | `0.06` | |
| | `explanation_grounding_score` | `0.03` | |
| | `anti_cheat_score` | `0.06` | |
| | `uncertainty_calibration_score` | `0.04` | |
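

A sketch of the aggregation, with `q` repeated for self-containment; the weight table is transcribed verbatim:


```python
def q(x: float) -> float:
    return round(min(max(x, 0.001), 0.999), 3)


WEIGHTS = {
    "format_compliance_score": 0.08,
    "candidate_alignment_score": 0.08,
    "legality_score": 0.12,
    "safety_delta_score": 0.15,
    "burden_improvement_score": 0.08,
    "disease_stability_score": 0.10,
    "dosing_quality_score": 0.08,
    "abstention_quality_score": 0.06,
    "efficiency_score": 0.06,
    "process_fidelity_score": 0.06,
    "explanation_grounding_score": 0.03,
    "anti_cheat_score": 0.06,
    "uncertainty_calibration_score": 0.04,
}

assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def total_reward(components: dict) -> float:
    """Weighted average of the 13 columns, clamped by q."""
    num = sum(w * components[name] for name, w in WEIGHTS.items())
    return q(num / sum(WEIGHTS.values()))
```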
|
|
| Safety-related terms have the largest combined mass: |
|
|
| ```text |
| legality + safety_delta + burden + disease_stability + anti_cheat |
| = 0.12 + 0.15 + 0.08 + 0.10 + 0.06 |
| = 0.51 |
| ``` |
|
|
| That does not include candidate alignment or calibration, which also affect |
| safety behavior. |
|
|
| Why this choice: the scalar reward is needed by RL algorithms, but the weights |
| make safety and clinical improvement dominate style, speed, and explanation. |
|
|
| ## 16. Episode Termination |
|
|
| Termination is deterministic: |
|
|
| ```text |
| done = true if: |
| exploit_detected |
| or step_count >= max_steps |
| or at least 3 recent invalid actions |
| or severe_pair_count >= 2 after enough steps |
| or burden_score > 0.92 after step 2 |
| or burden_score < 0.25 and no unresolved conflicts |
| or wall-clock/step timeout |
| ``` |
|
|
| The main success-like terminal condition is: |
|
|
| ```text |
| safe_resolution: |
| burden_score < 0.25 and unresolved_conflicts = empty |
| ``` |
|
|
| Why this choice: the environment needs both positive endings and explicit |
| failure endings. Otherwise an RL policy could learn to loop, delay, or avoid |
| difficult decisions. |
|
|
| ## 17. SFT Warm Start |
|
|
| SFT trains the model to emit the target candidate id for curated examples. A |
| record is serialized as: |
|
|
| ```text |
| { |
| instruction: "Select the safest legal medication action candidate_id.", |
| medications: ..., |
| candidates: ..., |
| answer: target_candidate_id |
| } |
| ``` |
|
|
| The mathematical objective is standard token-level negative log likelihood: |
|
|
| ```text |
| L_SFT(theta) = |
| - sum_{(x, y*) in D} log pi_theta(y* | x) |
| ``` |
|
|
| where `y*` includes the target candidate id. |
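

TRL's `SFTTrainer` owns the actual loss computation; the following PyTorch sketch only illustrates the masked token-level NLL, with prompt tokens excluded from the loss:


```python
import torch
import torch.nn.functional as F


def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor,
             answer_mask: torch.Tensor) -> torch.Tensor:
    """Masked token-level NLL. Shapes: logits (B, T, V),
    target_ids and answer_mask (B, T); mask is 1 on answer tokens."""
    nll = F.cross_entropy(logits.transpose(1, 2), target_ids,
                          reduction="none")
    return (nll * answer_mask).sum() / answer_mask.sum()
```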
|
|
| Why this choice: SFT gives the policy the output format and obvious clinical |
| priors before RL. Without SFT, GRPO would spend too much budget learning to |
| name a valid candidate id. |
|
|
| ## 18. GRPO With Environment-Backed Reward |
|
|
| GRPO prompts are built from patient/candidate records. For each prompt, the |
| model emits one or more completions containing a candidate id: |
|
|
| ```text |
| y_i ~ pi_theta(. | x), i = 1..G |
| ``` |
|
|
| The environment verifier parses each completion, resets a deterministic |
| PolyGuard environment using the recorded seed/difficulty/sub-environment, maps |
| the candidate id to an action, takes one environment step, and returns a reward. |
|
|
| The training reward used by the GRPO reward function is: |
|
|
| ```text |
| legal_bonus = 0.95 if action is legal else 0.05 |
| |
| R_GRPO = |
| q(0.80 * R_env + 0.20 * legal_bonus) |
| ``` |
|
|
| The reward function logs: |
|
|
| ```text |
| generated_candidate_id |
| selected_candidate_id |
| legal |
| reward |
| reward_breakdown |
| primary_reward_channels |
| termination_reason |
| ``` |
|
|
| Conceptually, group-relative policy optimization forms a within-prompt |
| advantage: |
|
|
| ```text |
| A_i = (R_i - mean_j R_j) / (std_j R_j + epsilon) |
| ``` |
|
|
| and updates the policy with a clipped policy-ratio objective: |
|
|
| ```text |
| rho_i(theta) = pi_theta(y_i | x) / pi_old(y_i | x) |
| |
| J_GRPO(theta) = |
| E[ (1/G) * sum_i min( |
| rho_i(theta) * A_i, |
| clip(rho_i(theta), 1 - eps, 1 + eps) * A_i |
| ) |
| - beta * KL(pi_theta || pi_ref) |
| ] |
| ``` |
|
|
| The exact optimizer mechanics are owned by TRL's `GRPOTrainer`; PolyGuard's |
| critical contribution is the reward function that executes verifier-backed |
| environment transitions instead of scoring completions with a text-only judge. |
|
|
| Why this choice: GRPO avoids training a separate value model, works naturally |
| with multiple completions per prompt, and lets the environment supply rewards |
| that are grounded in legality, transition effects, and anti-cheat checks. |
|
|
| ## 19. Evaluation Metrics |
|
|
| Rollout metrics are sample means over environment steps or episodes: |
|
|
| ```text |
| avg_reward = mean_t R_t |
| legality_rate = mean_t I[action_t legal] |
| success_rate = mean_episode I[termination_reason = safe_resolution] |
| abstention_rate = mean_t I[action_type starts with REQUEST_] |
| timeout_rate = timeout_count / number_of_rewards |
| ``` |
|
|
| Reward components and primary channels are averaged column-wise: |
|
|
| ```text |
| avg_component_k = mean_t c_{t,k} |
| avg_channel_j = mean_t channel_{t,j} |
| ``` |
|
|
| Policy-stack ablations compare: |
|
|
| ```text |
| bandit-only |
| llm-only |
| llm+bandit |
| ``` |
|
|
| Baselines include: |
|
|
| ```text |
| no-change: |
| always KEEP_REGIMEN |
| |
| rules-only: |
| argmax_c (legality_precheck, estimated_safety_delta) |
| |
| greedy: |
| argmax_c (estimated_safety_delta, burden_delta) |
| ``` |
|
|
| Why this choice: average reward alone is not trustworthy. PolyGuard also |
| reports legality, success, process fidelity, anti-cheat counts, invalid |
| actions, timeouts, and failure visibility. |
|
|
| ## 20. What Experts Should Watch |
|
|
| High-quality behavior should show: |
|
|
| - High legality without collapsing into review-only actions. |
| - Lower severe-pair and burden metrics over transitions. |
| - Good uncertainty calibration: confidence near `1 - uncertainty`. |
| - High process fidelity in special sub-environments. |
| - Low exploit detection and low invalid-action counts. |
| - GRPO reward improvements that are visible in primary channels, not just in |
| one easy component. |
|
|
| Potential failure signatures: |
|
|
| - Reward rises while `safety_legality` falls. |
| - `abstention_quality_score` rises with review abuse. |
| - Candidate alignment is high but `candidate_not_in_legal_set` appears in |
| anti-cheat logs. |
| - Dosing mode is selected often without better target/toxicity metrics. |
| - The policy exploits deterministic first-candidate fallbacks instead of |
| actually emitting candidate ids. |
|
|
| The intended expert reading is therefore not "the scalar reward went up". |
| The intended reading is: |
|
|
| ```text |
| policy improved iff |
| scalar reward improves |
| and safety_legality does not regress |
| and clinical_improvement improves or stays justified |
| and process_integrity remains high |
| and anti-cheat/failure logs remain acceptable |
| ``` |
|
|
| ## 21. Design Summary |
|
|
| PolyGuard chooses: |
|
|
| - A constrained POMDP/CMDP framing because free-form medication actions are |
| unsafe and hard to evaluate. |
| - A hierarchical multi-agent policy because clinical medication decisions have |
| separable routing, candidate generation, critique, and explanation stages. |
| - A contextual bandit shortlist because it is transparent, online-updateable, |
| and sample efficient. |
| - SFT first because candidate-id format and clinical priors should not be |
| discovered from sparse RL reward. |
| - GRPO next because group-relative rewards fit verifier-backed completion |
| scoring without a separate critic/value model. |
| - Decomposed reward because safety-critical RL must be debuggable by reward |
| channel, not only by total return. |
| - Hard verifier gates because some actions should be impossible to apply even |
| when a learned policy assigns them high probability. |
|
|
| This is a research environment and simulator. The mathematics describes how |
| PolyGuard trains and evaluates agents inside this controlled OpenEnv setting; |
| it is not a clinical decision rule for patient care. |
|
|