# Architecture - Qubit-Medic

The system has three concentric layers, each behind a clean contract.

```
+-------------------------------------------------------------+
|                         LLM trainer                         |
|                 (TRL GRPOTrainer + Unsloth)                 |
|                                                             |
|  for each step:                                             |
|      prompts = sample(prompt_pool)                          |
|      completions = model.generate(prompts, n=4)             |
|      for c in completions:                                  |
|          rewards = env_client.step(c).info["rewards"]       |
+----------------------------+--------------------------------+
                             | HTTP (or in-process)
                             v
+-------------------------------------------------------------+
|           FastAPI server: qubit_medic.server.app            |
|                                                             |
|  POST /reset  -> DecoderObservation                         |
|  POST /step   -> StepResult (reward + info breakdown)       |
|  GET  /health -> liveness + curriculum stats                |
|  POST /decode -> baseline PyMatching prediction             |
+----------------------------+--------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|     DecoderEnvironment (qubit_medic.server.environment)     |
|                                                             |
|  reset():                                                   |
|    1. CurriculumScheduler.sample()                          |
|    2. cached: stim.Circuit + DEM + pymatching.Matching      |
|    3. compile_detector_sampler().sample(1) -> syndrome      |
|    4. build_prompt(...) -> DecoderObservation               |
|                                                             |
|  step(raw_response):                                        |
|    1. parse_action() -> ParseResult (X/Z error sets)        |
|    2. layout.llm_to_stim() remap to Stim qubit IDs          |
|    3. compute_all_rewards():                                |
|       - logical_correction (Stim ground truth)              |
|       - syndrome_consistency (final-round detectors)        |
|       - hamming_overlap (vs PyMatching reference frame)     |
|       - format_compliance (parser output)                   |
|       - pymatching_beat (LLM right & PM wrong)              |
|    4. CurriculumScheduler.update(level, logical_correct)    |
|    5. return StepResult                                     |
+-------------------------------------------------------------+
```

## Trust boundaries

```
+-----------+    prompt + syndrome     +--------------+
|    LLM    | <----------------------- | Observation  |
+-----------+                          +--------------+
      |
      v  raw text
+-----------+      parse + remap       +-----------+
|  Action   | --> [LLM ID space] ----> |  Stim ID  |
+-----------+                          +-----------+
                                             |
                                             v  scoring
                                       +-----------+
                                       |   State   |
                                       | (server)  |
                                       +-----------+
```

The `DecoderState` (server-side) holds the ground-truth observable flip, the true error pattern (the PyMatching reference frame), and the seed used for sampling. **None** of this is ever returned to the LLM. This is the participant guide's `"avoid unrestricted global state"` discipline made concrete by Pydantic schemas.

## Why a terminal Pauli frame, and what it costs

The LLM emits two integer lists: which data qubits suffered an X error and which suffered a Z error, **at the moment of final measurement** (a terminal Pauli frame). For the rotated `memory_z` task this is sufficient to determine the logical observable: the destructive Z measurement is exactly the Z observable, and an X error on a data qubit in the observable's support flips its measurement outcome.

The trade-off is that an end-of-circuit Pauli frame *only* constrains the final-round detectors (the ones that incorporate the destructive Z measurement results). Earlier-round detectors fire only in response to errors that propagate through the stabilizer rounds, and a terminal frame cannot say anything about them. Reward 2 (syndrome consistency) therefore grades only the final-round detector bits, which matches the representation's expressive power. The remaining detector bits are *available* in the prompt for the LLM to reason about, but unscored.
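The parity argument above can be sketched in a few lines. This is an illustrative stand-in, not the real qubit_medic code: `observable_flip` and the support set are invented for the example.

```python
def observable_flip(x_errors: set[int], observable_support: set[int]) -> int:
    # For memory_z the destructive Z measurement of the support qubits IS the
    # logical observable; each X error on a support qubit flips one measurement
    # outcome, so the logical flip is the parity of the overlap. Z errors
    # commute with Z measurements and never contribute.
    return len(x_errors & observable_support) % 2

support = {0, 1, 2}  # hypothetical observable support for a small rotated code
assert observable_flip({1}, support) == 1      # one X inside the support: flip
assert observable_flip({1, 2}, support) == 0   # two flips cancel
assert observable_flip({7}, support) == 0      # X outside the support: no effect
```

This is why Z errors in the LLM's answer can never change `logical_correction` on the `memory_z` task; they only matter through the syndrome- and overlap-based rewards.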
## Why five rewards instead of one

The participant guide is emphatic: *"use multiple independent reward functions, not just one."* Each of our five rewards is independently verifiable in well under a millisecond, and each disagrees with at least one other on degenerate inputs:

* An all-zeros agent on a syndrome with a logical-but-undetectable error: `logical_correction = 0` but `syndrome_consistency = 1`. The R2 vs. R1 disagreement exposes the failure case.
* A random-qubit agent that lands on the right observable parity by luck: `logical_correction = 1` but `syndrome_consistency` and `hamming_overlap` are both low. R1 alone over-rewards; the others expose the lack of understanding.

This decomposition is what the guide calls *"hard to game by construction."*
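The first disagreement is easy to reproduce with toy implementations. The two functions below mirror the reward names from the list but are simplified stand-ins invented for this example, not the real scoring code.

```python
def logical_correction(pred_flip: int, true_flip: int) -> float:
    # R1: did the predicted observable flip match Stim's ground truth?
    return 1.0 if pred_flip == true_flip else 0.0

def syndrome_consistency(pred_final_bits: list[int],
                         observed_final_bits: list[int]) -> float:
    # R2: fraction of final-round detector bits the prediction reproduces.
    matches = sum(p == o for p, o in zip(pred_final_bits, observed_final_bits))
    return matches / len(observed_final_bits)

# Undetectable logical error: the final-round detectors are all quiet,
# but the ground-truth observable flipped.
observed_final = [0, 0, 0, 0]
true_flip = 1

# The all-zeros agent predicts no errors: no flip, no fired detectors.
r1 = logical_correction(pred_flip=0, true_flip=true_flip)  # 0.0
r2 = syndrome_consistency([0, 0, 0, 0], observed_final)    # 1.0
```

R2 rewards exactly the silence that R1 punishes, which is the R2 vs. R1 disagreement the first bullet describes.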