# Architecture - Qubit-Medic
The system has three concentric layers, each behind a clean contract.
```
+-------------------------------------------------------------+
|  LLM trainer                                                |
|  (TRL GRPOTrainer + Unsloth)                                |
|                                                             |
|  for each step:                                             |
|    prompts = sample(prompt_pool)                            |
|    completions = model.generate(prompts, n=4)               |
|    for c in completions:                                    |
|      rewards = env_client.step(c).info["rewards"]           |
+----------------------------+--------------------------------+
                             | HTTP (or in-process)
                             v
+-------------------------------------------------------------+
|  FastAPI server: qubit_medic.server.app                     |
|                                                             |
|  POST /reset  -> DecoderObservation                         |
|  POST /step   -> StepResult (reward + info breakdown)       |
|  GET  /health -> liveness + curriculum stats                |
|  POST /decode -> baseline PyMatching prediction             |
+----------------------------+--------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|  DecoderEnvironment (qubit_medic.server.environment)        |
|                                                             |
|  reset():                                                   |
|    1. CurriculumScheduler.sample()                          |
|    2. cached: stim.Circuit + DEM + pymatching.Matching      |
|    3. compile_detector_sampler().sample(1) -> syndrome      |
|    4. build_prompt(...) -> DecoderObservation               |
|                                                             |
|  step(raw_response):                                        |
|    1. parse_action() -> ParseResult (X/Z error sets)        |
|    2. layout.llm_to_stim() remap to Stim qubit IDs          |
|    3. compute_all_rewards():                                |
|       - logical_correction (Stim ground truth)              |
|       - syndrome_consistency (final-round detectors)        |
|       - hamming_overlap (vs PyMatching reference frame)     |
|       - format_compliance (parser output)                   |
|       - pymatching_beat (LLM right & PM wrong)              |
|    4. CurriculumScheduler.update(level, logical_correct)    |
|    5. return StepResult                                     |
+-------------------------------------------------------------+
```
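Steps 1 and 2 of `step()` can be illustrated with a small sketch. The completion format (`X: [...] Z: [...]`), the `ParseResult` fields, and the ID mapping used here are assumptions made for illustration; the real `parse_action()` and `layout.llm_to_stim()` may look quite different.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseResult:
    """Hypothetical stand-in for the parser output."""
    x_errors: frozenset
    z_errors: frozenset
    ok: bool  # True only if both lists were found and well-formed


# A bracketed, comma-separated list of non-negative integers (may be empty).
_LIST = r"\[\s*((?:\d+\s*(?:,\s*\d+\s*)*)?)\]"
_PATTERN = re.compile(rf"X:\s*{_LIST}\s*Z:\s*{_LIST}")


def parse_action(raw: str) -> ParseResult:
    """Extract X/Z error sets from raw text (assumed format: 'X: [1, 3] Z: []')."""
    match = _PATTERN.search(raw)
    if match is None:
        return ParseResult(frozenset(), frozenset(), ok=False)
    x, z = (
        frozenset(int(tok) for tok in group.split(",") if tok.strip())
        for group in match.groups()
    )
    return ParseResult(x, z, ok=True)


def llm_to_stim(ids, mapping):
    """Remap prompt-facing qubit IDs to Stim qubit IDs, dropping unknown IDs."""
    return frozenset(mapping[i] for i in ids if i in mapping)
```

In this sketch a malformed completion yields `ok=False` rather than an exception, so a format reward can score it as zero while the remaining rewards still receive empty error sets.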
## Trust boundaries
```
+-----------+      prompt + syndrome      +--------------+
|    LLM    | <-------------------------- | Observation  |
+-----------+                             +--------------+
      |
      v raw text
+-----------+        parse + remap        +-----------+
|  Action   |  --> [LLM ID space] ----->  |  Stim ID  |
+-----------+                             +-----------+
                                                |
                                                v scoring
                                          +-----------+
                                          |   State   |
                                          | (server)  |
                                          +-----------+
```
The `DecoderState` (server-side) holds the ground-truth observable flip,
the true error pattern (PyMatching reference frame), and the seed used for
sampling. **None** of this is ever returned to the LLM. This is the
participant guide's `"avoid unrestricted global state"` discipline made
concrete by Pydantic schemas.
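The separation between server-side state and the returned observation can be sketched as follows. This uses standard-library dataclasses rather than Pydantic to stay dependency-free, and every field name is an illustrative assumption, not the project's actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecoderState:
    """Server-side only; field names are illustrative."""
    seed: int
    syndrome: tuple
    true_observable_flip: bool
    reference_x_errors: frozenset
    reference_z_errors: frozenset


@dataclass(frozen=True)
class DecoderObservation:
    """The only object serialized back to the LLM side."""
    prompt: str
    syndrome: tuple


def to_observation(state: DecoderState, prompt: str) -> DecoderObservation:
    # Copy fields explicitly (an allowlist) so that a secret field added to
    # DecoderState later cannot leak by default.
    return DecoderObservation(prompt=prompt, syndrome=state.syndrome)
```

Because the observation type has no slot for the seed, the observable flip, or the reference errors, serializing it with `asdict()` cannot expose them.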
## Why a terminal Pauli frame, and what it costs
The LLM emits two integer lists: which data qubits suffered an X error and
which suffered a Z error, **at the moment of final measurement** (a
terminal Pauli frame). For the rotated `memory_z` task this is sufficient
for the logical observable - the destructive Z measurement is exactly the
Z observable, and an X error on a data qubit in the observable's support
flips its measurement outcome.
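The argument above reduces to a parity check, sketched below. The function names and the three-qubit observable support in the usage note are hypothetical; the real support comes from the circuit, and the real `logical_correction` reward may be computed differently.

```python
def predicted_observable_flip(x_errors, observable_support):
    """Parity of X errors on the qubits that define the Z observable.

    Z errors are ignored: they commute with a Z-basis measurement and so
    cannot change its outcome.
    """
    return len(set(x_errors) & set(observable_support)) % 2 == 1


def logical_correction(x_errors, observable_support, true_flip):
    """1.0 if the predicted flip matches the ground-truth flip, else 0.0."""
    predicted = predicted_observable_flip(x_errors, observable_support)
    return 1.0 if predicted == true_flip else 0.0
```

With an assumed support of `{0, 1, 2}`, a single X error on qubit 1 predicts a flip, while X errors on both qubits 1 and 2 cancel and predict none.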
The trade-off is that an end-of-circuit Pauli frame *only* constrains the
final-round detectors (the ones that incorporate the destructive Z
measurement results). Earlier-round detectors fire only in response to
errors that propagate through the stabilizer rounds, and a terminal frame
cannot say anything about them. Reward 2 (syndrome consistency)
explicitly grades only the final-round detector bits, which matches the
representation's expressive power. The remaining detector bits stay
visible in the prompt for the LLM to reason about, but they are
unscored.
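A sketch of how final-round consistency could be scored, assuming each final-round detector corresponds to one Z stabilizer with a known data-qubit support. The function names, the toy stabilizer layout in the usage note, and the fraction-of-matching-bits score are illustrative assumptions, not the project's actual `compute_all_rewards()`.

```python
def implied_final_detectors(x_errors, z_stabilizers):
    """Final-round detector bits implied by a terminal X-error set.

    Each final-round detector is tied to one Z stabilizer; it fires when an
    odd number of that stabilizer's data qubits carry an X error.
    """
    x = set(x_errors)
    return [len(x & set(support)) % 2 for support in z_stabilizers]


def syndrome_consistency(x_errors, z_stabilizers, observed_final_bits):
    """Fraction of final-round detector bits reproduced by the frame."""
    implied = implied_final_detectors(x_errors, z_stabilizers)
    if not implied:
        return 1.0  # nothing to contradict
    matches = sum(a == b for a, b in zip(implied, observed_final_bits))
    return matches / len(implied)
```

For a toy layout `[(0, 1), (1, 2)]` with observed final bits `[1, 0]`, an X error on qubit 0 reproduces both bits, while the empty frame reproduces only the second.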
## Why five rewards instead of one
The participant guide is emphatic: *"use multiple independent reward
functions, not just one."* Each of our five rewards is independently
verifiable in well under a millisecond and disagrees with at least one
other on degenerate inputs:
* All-zeros agent on a syndrome with a logical-but-undetectable error:
`logical_correction = 0` but `syndrome_consistency = 1`. The disagreement
between R1 (`logical_correction`) and R2 (`syndrome_consistency`) exposes
the failure case.
* Random-qubit agent that lands on the right observable parity by luck:
`logical_correction = 1` but `syndrome_consistency` and
`hamming_overlap` are both low. R1 alone over-rewards; the others
expose the lack of understanding.
This decomposition is what the guide calls *"hard to game by
construction."*
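The two degenerate cases can be reproduced on a toy code. Everything in this sketch is an assumption made for illustration: the five-qubit chain, the observable read from qubit 0, the Jaccard-style `hamming_overlap`, and the use of the true error set as the reference frame. It covers three of the five rewards only.

```python
def parity(errors, support):
    """Number of errors on the given support, modulo 2."""
    return len(set(errors) & set(support)) % 2


def reward_vector(guess, reference, stabilizers, observable):
    """Three of the five rewards on a toy code (definitions are assumptions)."""
    observed = [parity(reference, s) for s in stabilizers]
    implied = [parity(guess, s) for s in stabilizers]
    union = set(guess) | set(reference)
    overlap = len(set(guess) & set(reference)) / len(union) if union else 1.0
    return {
        "logical_correction": float(
            parity(guess, observable) == parity(reference, observable)
        ),
        "syndrome_consistency": sum(
            a == b for a, b in zip(implied, observed)
        ) / len(stabilizers),
        "hamming_overlap": overlap,
    }


# Toy 5-qubit chain: neighbouring-pair checks, observable read from qubit 0.
STABILIZERS = [(0, 1), (1, 2), (2, 3), (3, 4)]
OBSERVABLE = (0,)

# Case 1: undetectable logical error (X on every qubit), all-zeros guess.
all_zeros = reward_vector(set(), {0, 1, 2, 3, 4}, STABILIZERS, OBSERVABLE)

# Case 2: single X error on qubit 0, guess with the right parity by luck.
lucky = reward_vector({0, 2, 4}, {0}, STABILIZERS, OBSERVABLE)
```

In the first case `logical_correction` is 0 while `syndrome_consistency` is 1; in the second `logical_correction` is 1 while `syndrome_consistency` is 0.25 and `hamming_overlap` is one third, so no single component hides the failure.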