Upload docs/REWARD_HACKING.md with huggingface_hub
docs/REWARD_HACKING.md
ADDED
@@ -0,0 +1,120 @@
# Reward Hacking Defense

## Threat model

GRPO optimizes a policy directly against the scalar reward signal, so any
exploitable gap between "what the reward measures" and "what the task
actually requires" becomes a high-gradient attractor: the policy will
collapse into the cheapest hack the verifier cannot see. Because our stack
is verifier-style (Stim ground truth + PyMatching reference frame + a text
parser), every reward must be a *physical invariant* or a *cross-checkable
auxiliary*, not a regex match on the model's own output.

## The 5-reward composite

All five rewards are pure functions
`(parsed_action, sample, layout) -> float in [0, 1]`, evaluated independently
and combined as a weighted sum clamped to `[0, 1]`.

| Name | Weight | What it rewards | What it cannot reward |
|------|--------|-----------------|-----------------------|
| `logical_correction` | 0.40 | Predicted Pauli frame, when applied at end-of-circuit, induces the same logical-Z observable flip Stim recorded as ground truth. | Anything not derivable from Stim's observable trace. Cannot be reverse-engineered from the prompt alone. |
| `syndrome_consistency` | 0.20 | Hamming similarity between *predicted* final-round detector parities (induced by the predicted X errors) and the *observed* final-round detector parities. | Earlier-round detectors are intentionally unscored; partial credit on bit-flipped hallucinations. |
| `hamming_overlap` | 0.20 | Mean of set-aware Jaccard(X_pred, X_ref) and Jaccard(Z_pred, Z_ref) against PyMatching's reference Pauli frame. | Symmetric "predict-empty-when-empty" hacks: the set-aware rule scores 0.0 for missed errors and 0.0 for false alarms. |
| `format_compliance` | 0.10 | 1.0 only when the strict canonical `X_ERRORS=[...] / Z_ERRORS=[...]` form parses BOTH lists cleanly (lenient/partial parses score 0.5; nothing parseable scores 0.0). | Cannot be earned by whitespace tricks alone: the parser validates that every integer is in `[0, num_data_qubits)` and de-duplicates. |
| `pymatching_beat` | 0.10 | 1.0 iff PyMatching got this syndrome wrong AND the model got it right. | Imitation of PyMatching: matching its output exactly forfeits the bonus on every syndrome PyMatching also gets right (most of them). |

Weights sum to 1.00. Source of truth: `openenv.yaml` and
`qubit_medic.config.REWARD_WEIGHTS`. Implementations live in
`qubit_medic/server/rewards.py`.

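
The weighted sum described above can be sketched as follows. This is a minimal illustration, not the code in `qubit_medic/server/rewards.py`: the function name `combine_rewards` and the local `WEIGHTS` dict are hypothetical, and the values are simply copied from the table in this section.

```python
from typing import Mapping

# Hypothetical local copy of the weights listed in the table above.
WEIGHTS = {
    "logical_correction": 0.40,
    "syndrome_consistency": 0.20,
    "hamming_overlap": 0.20,
    "format_compliance": 0.10,
    "pymatching_beat": 0.10,
}


def combine_rewards(components: Mapping[str, float],
                    weights: Mapping[str, float] = WEIGHTS) -> float:
    """Weighted sum of the per-channel rewards, clamped to [0, 1]."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    total = sum(weights[name] * components[name] for name in weights)
    return min(1.0, max(0.0, total))


# A policy that imitates PyMatching perfectly on an easy syndrome:
# every channel is 1.0 except the beat bonus, which cannot fire.
imitator = {name: 1.0 for name in WEIGHTS}
imitator["pymatching_beat"] = 0.0
print(round(combine_rewards(imitator), 2))  # 0.9
```

Under these weights, pure imitation tops out at 0.90 on syndromes PyMatching decodes correctly; the remaining 0.10 is only reachable through `pymatching_beat`.
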
## Attack/defense matrix

| Hack the model could attempt | Channel(s) that catch it |
|---|---|
| Output empty string | `format_compliance = 0` (no strict pattern, no lists) |
| Memorize one canonical Pauli frame across all syndromes | `hamming_overlap` drops on novel syndromes (Jaccard against per-syndrome PyMatching reference); `logical_correction` drops to chance |
| Match PyMatching exactly on every shot | `pymatching_beat = 0` whenever PyMatching is also correct (which is most syndromes), so the 0.10 channel never fires |
| Output a random valid format string | `logical_correction` collapses to ~chance; `syndrome_consistency` and `hamming_overlap` both drop |
| Skip syndrome reasoning, copy the in-prompt example block | The parser slices from the LAST `X_ERRORS=` key (so the prompt's example doesn't win); `syndrome_consistency` then penalises the stale answer |
| Game the format checker with whitespace / capitalisation tricks | `format_compliance` is parseability-based: `_parse_int_list` rejects out-of-range integers and drops duplicates, and `strict_format` requires the canonical `=[...]` form for the full 1.0 |
| Inject extra correction operators ("over-correct") | `hamming_overlap` uses set-aware Jaccard whose union grows with false alarms (precision-aware), so over-correction strictly lowers the score |
| Predict an empty frame when the syndrome is non-empty (FIX 1, 2026-04) | `syndrome_consistency` is **capped at 0.5** when the prediction is empty AND any detector fired, so the empty-everywhere collapse mode can never reach the full 1.0 |
| Output a logical-flipped Pauli frame that *coincidentally* satisfies final-round parities | `logical_correction = 0` because the implied observable flip differs from Stim ground truth; `hamming_overlap` also drops vs PyMatching's reference frame |
| Hallucinate qubit IDs outside `[0, num_data_qubits)` to spoof a long answer | `_parse_int_list` drops out-of-range tokens and flags `parse_success=False`, so `format_compliance` drops to 0.5 or 0.0 |
| Exploit per-axis Jaccard (predict X right, Z empty when Z is empty) | The set-aware rule (`true_set` empty AND `pred_set` empty -> 1.0; exactly one of them empty -> 0.0) plus the equal-weight (0.5 each) mean across axes prevents winning by guessing that one axis is empty |
| Time-stall (delay step beyond `EPISODE_TIMEOUT_SECONDS`) to evade scoring | The env builds a zero-reward `RewardBreakdown` and marks the episode `truncated=True`, so timeouts strictly hurt |

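
The set-aware Jaccard rule that several rows above rely on can be sketched as below. It is an illustration under stated assumptions rather than the project's implementation: the names `set_aware_jaccard` and `hamming_overlap` are chosen here for readability, and the only behaviour taken from this document is the empty-set rule, the ordinary intersection-over-union otherwise, and the equal-weight mean over the X and Z axes.

```python
from typing import Iterable


def set_aware_jaccard(pred: Iterable[int], ref: Iterable[int]) -> float:
    """Jaccard similarity with explicit handling of empty sets."""
    pred_set, ref_set = set(pred), set(ref)
    if not pred_set and not ref_set:
        return 1.0  # both empty: the prediction agrees with the reference
    if not pred_set or not ref_set:
        return 0.0  # exactly one empty: missed errors or pure false alarms
    return len(pred_set & ref_set) / len(pred_set | ref_set)


def hamming_overlap(x_pred, z_pred, x_ref, z_ref) -> float:
    """Equal-weight mean of the per-axis scores (hypothetical signature)."""
    return 0.5 * (set_aware_jaccard(x_pred, x_ref)
                  + set_aware_jaccard(z_pred, z_ref))


# Reference frame: a single X error on qubit 3, no Z errors.
print(hamming_overlap([3], [], [3], []))        # 1.0  exact match
print(hamming_overlap([], [], [3], []))         # 0.5  empty frame misses the X error
print(hamming_overlap([3, 4, 5], [], [3], []))  # over-correction: X axis scores 1/3
```

Because false alarms enlarge the union without enlarging the intersection, every extra predicted qubit lowers the per-axis score.
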
## Hard guarantees

These are physical invariants that hold by construction; no policy can
satisfy them via parser games:

- **Logical-Z preservation (Stim ground truth).** `predicted_observable_flip`
  re-applies the predicted X errors as a Pauli frame at end-of-circuit and
  computes the implied flip on the logical Z observable. `logical_correction`
  is 1.0 iff the implied flip equals `sample.actual_observable_flip` recorded
  by Stim. Beyond guessing this single bit at chance, there is no way to fake
  it without genuinely solving the decoder task on this syndrome.
- **Final-round detector arithmetic.** `_syndrome_from_pauli_frame` computes
  the implied final-round detector bits from the predicted X errors and the
  detector-to-data-qubit incidence map (`final_detector_supports`, derived
  from Euclidean adjacency in the rotated memory_z layout). These bits are
  compared against `sample.syndrome_bits` directly; the model never sees
  the comparison target.
- **PyMatching reference frame.** `sample.pymatching_x_errors` /
  `pymatching_z_errors` are computed by the Sparse Blossom matching decoder
  (PyMatching v2) for this exact syndrome; the model has no access to them
  at action time.
- **Hidden ground truth.** `DecoderState` carries `true_x_errors`,
  `true_z_errors`, `actual_observable_flip`, `pymatching_observable_pred`,
  `circuit_text`, and `dem_text`, but the externally visible `state()`
  endpoint *deliberately omits all of these* (see the `state()` method in
  `qubit_medic/server/environment.py`). Only the reward functions see them.
- **LLM-space → Stim-space conversion.** Predicted qubit IDs are mapped from
  LLM-space (0..N-1, the only IDs the prompt advertises) into Stim's
  internal coordinate system before scoring (`layout.llm_to_stim`). The
  model can't gain anything by guessing Stim's internal numbering.
- **Episode pairing enforcement.** `step()` raises a clean `ValueError` for
  unknown episode IDs (compliance audit 2026-04). A trainer cannot replay
  `step()` against a stale episode to harvest a stale reward.

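
The final-round detector arithmetic and the 0.5 cap on empty predictions can be sketched as follows. This is a toy under loud assumptions: `implied_detector_bits` and `syndrome_consistency` are hypothetical stand-ins for the project's functions, the `supports` argument plays the role of `final_detector_supports`, the cap is applied using only the final-round bits passed in, and the example layout is a five-qubit chain rather than the rotated surface code.

```python
from typing import Iterable, List, Sequence


def implied_detector_bits(x_errors: Iterable[int],
                          supports: Sequence[Iterable[int]]) -> List[int]:
    """Parity of the predicted X errors over each detector's support."""
    flipped = set(x_errors)
    return [len(flipped & set(support)) % 2 for support in supports]


def syndrome_consistency(x_errors: Sequence[int],
                         observed_bits: Sequence[int],
                         supports: Sequence[Iterable[int]]) -> float:
    """Hamming similarity between implied and observed detector bits."""
    if not supports or len(observed_bits) != len(supports):
        raise ValueError("need one observed bit per detector")
    predicted = implied_detector_bits(x_errors, supports)
    matches = sum(p == o for p, o in zip(predicted, observed_bits))
    score = matches / len(observed_bits)
    # Empty prediction while a detector fired: cap the score at 0.5.
    if not x_errors and any(observed_bits):
        score = min(score, 0.5)
    return score


# Toy chain: detector i watches data qubits i and i + 1 (assumed layout).
supports = [(0, 1), (1, 2), (2, 3), (3, 4)]
observed = [1, 0, 0, 0]  # only the first detector fired

print(syndrome_consistency([0], observed, supports))  # 1.0
print(syndrome_consistency([], observed, supports))   # 0.5 (0.75 before the cap)
```

In this toy, the empty answer would earn 0.75 from raw Hamming similarity alone, which is the collapse mode the cap is meant to blunt.
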
## Known weaknesses

Honest accounting of what this composite still does **not** catch:

- **Hamming-similarity is not a strict equality on syndrome consistency.**
  A predicted Pauli frame whose final-round implied bits happen to overlap
  the observed bits on most positions (without being correct) still scores
  partial credit on `syndrome_consistency`. The 0.5 cap on
  empty-prediction-vs-active-syndrome closes the worst case, but a *near*
  empty answer that flips one well-chosen qubit can still earn a high
  consistency score on prompts where most final-round detectors did not fire.
- **`hamming_overlap` treats the PyMatching reference as ground truth.**
  PyMatching is itself near-optimal but not optimal; on syndromes where the
  Stim-true correction differs from PyMatching's, a model that found the
  *true* correction is penalised on Reward 3 (`hamming_overlap`) even though
  it's right on Reward 1 (`logical_correction`). We accept this trade-off
  because Reward 5 (`pymatching_beat`) is the channel that explicitly rewards
  out-performing PyMatching, and it has its own 0.10 weight.
- **No per-round detector scoring.** Earlier-round detectors carry signal
  the LLM could exploit, but we score only the final round to keep the
  Pauli-frame action space tractable. A model could in principle "ignore
  intermediate rounds" without penalty as long as its terminal frame is
  correct, which is the same trade-off AlphaQubit made.
- **Format compliance is coarse (0 / 0.5 / 1).** The 2026-04 spec
  rewrite removed full credit for non-canonical resemblances; this is
  intentional, but it means a model that emits beautiful chain-of-thought
  reasoning *and then forgets the final `X_ERRORS=[...]` line* gets the
  same reduced credit as a near-miss. We trade interpretability for
  anti-gaming.
- **`pymatching_beat` is sparse.** Most syndromes are easy and PyMatching
  wins; the bonus only fires on the hard tail. This is by design (the
  trajectory of its mean is the proof of post-imitation behaviour) but it
  means GRPO sees this signal as near-zero noise for most of training.
- **No protection against constant-stream prompts.** If a future trainer
  modification kept episodes alive across `step()` calls, the active-episode
  bookkeeping could in principle leak via observed reward statistics. The
  current single-step-per-episode design (`done=True` after every `step`)
  prevents this; do not relax it without a fresh hacking-surface review.

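
The three-tier format score discussed in the list above can be illustrated with a small parser sketch. Everything here is an assumption made for illustration: the regular expressions, the helper names `parse_int_list` and `format_compliance`, the choice to treat whitespace as the separator between the two lists, and the definition of a "lenient" parse are not taken from `qubit_medic`. Only the tiers (1.0 / 0.5 / 0.0), the slice from the last `X_ERRORS=` key, the range check, and the de-duplication come from this document.

```python
import re

# Assumed canonical form: both lists, in order, separated by whitespace.
_STRICT = re.compile(r"X_ERRORS=\[([^\]]*)\]\s*Z_ERRORS=\[([^\]]*)\]")
# Assumed lenient form: either key, any case, '=' or ':' before the list.
_LENIENT = {axis: re.compile(axis + r"_ERRORS\s*[=:]\s*\[([^\]]*)\]", re.IGNORECASE)
            for axis in "XZ"}


def parse_int_list(body: str, num_data_qubits: int):
    """Return (de-duplicated in-range ids, True if every token was valid)."""
    ids, clean = [], True
    for token in body.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            value = int(token)
        except ValueError:
            clean = False
            continue
        if not 0 <= value < num_data_qubits:
            clean = False  # out-of-range id: drop it and flag the parse
            continue
        if value not in ids:
            ids.append(value)
    return ids, clean


def format_compliance(completion: str, num_data_qubits: int) -> float:
    """1.0 for a clean strict parse, 0.5 for a partial one, else 0.0."""
    # Slice from the LAST key so an earlier example block cannot win.
    start = completion.rfind("X_ERRORS=")
    tail = completion[start:] if start >= 0 else completion
    strict = _STRICT.search(tail)
    if strict:
        parsed = [parse_int_list(g, num_data_qubits) for g in strict.groups()]
        return 1.0 if all(clean for _, clean in parsed) else 0.5
    found = any(p.search(tail) for p in _LENIENT.values())
    return 0.5 if found else 0.0


print(format_compliance("reasoning...\nX_ERRORS=[1,3] Z_ERRORS=[]", 9))  # 1.0
print(format_compliance("X_ERRORS=[1,42] Z_ERRORS=[]", 9))               # 0.5
print(format_compliance("qubit 3 probably flipped", 9))                  # 0.0
```

In this sketch a completion with careful reasoning but no final answer line lands in the same 0.0 or 0.5 tier as any other non-canonical output, which is the coarseness the format bullet describes.
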