Architecture - Qubit-Medic
The system has three concentric layers, each behind a clean contract.
+-------------------------------------------------------------+
| LLM trainer                                                 |
| (TRL GRPOTrainer + Unsloth)                                 |
|                                                             |
| for each step:                                              |
|   prompts = sample(prompt_pool)                             |
|   completions = model.generate(prompts, n=4)                |
|   for c in completions:                                     |
|     rewards = env_client.step(c).info["rewards"]            |
+----------------------------+--------------------------------+
                             |  HTTP (or in-process)
                             v
+-------------------------------------------------------------+
| FastAPI server: qubit_medic.server.app                      |
|                                                             |
| POST /reset  -> DecoderObservation                          |
| POST /step   -> StepResult (reward + info breakdown)        |
| GET /health  -> liveness + curriculum stats                 |
| POST /decode -> baseline PyMatching prediction              |
+----------------------------+--------------------------------+
                             |
                             v
+-------------------------------------------------------------+
| DecoderEnvironment (qubit_medic.server.environment)         |
|                                                             |
| reset():                                                    |
|   1. CurriculumScheduler.sample()                           |
|   2. cached: stim.Circuit + DEM + pymatching.Matching       |
|   3. compile_detector_sampler().sample(1) -> syndrome       |
|   4. build_prompt(...) -> DecoderObservation                |
|                                                             |
| step(raw_response):                                         |
|   1. parse_action() -> ParseResult (X/Z error sets)         |
|   2. layout.llm_to_stim() remap to Stim qubit IDs           |
|   3. compute_all_rewards():                                 |
|      - logical_correction   (Stim ground truth)             |
|      - syndrome_consistency (final-round detectors)         |
|      - hamming_overlap      (vs PyMatching reference frame) |
|      - format_compliance    (parser output)                 |
|      - pymatching_beat      (LLM right & PM wrong)          |
|   4. CurriculumScheduler.update(level, logical_correct)     |
|   5. return StepResult                                      |
+-------------------------------------------------------------+
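Steps 1 and 2 of step() (parse, then remap) can be sketched roughly as follows. This is an illustration only, not the project's parser: the response format ("X: [1, 3] Z: [2]"), the names parse_action_sketch and llm_to_stim_sketch, the ParseResult fields, and the ID mapping are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    """Hypothetical stand-in for the parser output (X/Z error sets)."""
    x_errors: list = field(default_factory=list)
    z_errors: list = field(default_factory=list)
    ok: bool = False  # True only when both lists parsed cleanly


def _parse_ids(body):
    """Turn the text between brackets, e.g. "1, 3", into sorted unique ints."""
    tokens = [t.strip() for t in body.split(",") if t.strip()]
    if not all(t.isascii() and t.isdigit() for t in tokens):
        return None  # reject anything that is not a plain non-negative integer
    return sorted({int(t) for t in tokens})


def parse_action_sketch(raw):
    """Parse an ASSUMED response format: 'X: [1, 3] Z: [2]'."""
    match_x = re.search(r"X\s*:\s*\[([^\]]*)\]", raw)
    match_z = re.search(r"Z\s*:\s*\[([^\]]*)\]", raw)
    if match_x is None or match_z is None:
        return ParseResult()
    xs, zs = _parse_ids(match_x.group(1)), _parse_ids(match_z.group(1))
    if xs is None or zs is None:
        return ParseResult()
    return ParseResult(xs, zs, True)


def llm_to_stim_sketch(ids, mapping):
    """Remap LLM-facing qubit IDs to Stim qubit IDs, dropping unknown IDs."""
    return [mapping[i] for i in ids if i in mapping]


# Hypothetical mapping from LLM-facing IDs to Stim qubit indices.
LLM_TO_STIM = {0: 1, 1: 3, 2: 5, 3: 8}

result = parse_action_sketch("X: [3, 1] Z: []")
stim_x = llm_to_stim_sketch(result.x_errors, LLM_TO_STIM)
```

A malformed response yields ok = False instead of raising, which is one plausible way to let a format_compliance style reward score the failure rather than crash the step.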
Trust boundaries
+-----------+      prompt + syndrome      +--------------+
|    LLM    | <-------------------------- | Observation  |
+-----------+                             +--------------+
      |
      v  raw text
+-----------+        parse + remap        +-----------+
|  Action   | --> [LLM ID space] -------> |  Stim ID  |
+-----------+                             +-----------+
                                                |
                                                v  scoring
                                          +-----------+
                                          |   State   |
                                          | (server)  |
                                          +-----------+
The DecoderState (server-side) holds the ground-truth observable flip,
the true error pattern (PyMatching reference frame), and the seed used for
sampling. None of this is ever returned to the LLM. This is the
participant guide's "avoid unrestricted global state" discipline made
concrete by Pydantic schemas.
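The separation between what the LLM sees and what the server keeps can be sketched as two distinct schemas. The version below uses stdlib dataclasses as a stand-in for the Pydantic models so that it runs without third-party packages; the field names and the to_response helper are hypothetical, chosen only to mirror the description above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DecoderObservation:
    """What the LLM is allowed to see (field names are assumptions)."""
    prompt: str
    syndrome: tuple


@dataclass(frozen=True)
class DecoderState:
    """Server-side only: never serialized into a response."""
    observable_flip: int
    true_error_pattern: tuple
    seed: int


def to_response(obs):
    """Serialize an observation; refuse anything else, including state."""
    if not isinstance(obs, DecoderObservation):
        raise TypeError("only DecoderObservation may cross the trust boundary")
    return asdict(obs)


state = DecoderState(observable_flip=1, true_error_pattern=(4,), seed=1234)
obs = DecoderObservation(prompt="Decode this syndrome.", syndrome=(0, 1, 0))
payload = to_response(obs)
```

Because the ground-truth fields live on a different type, leaking them would require an explicit code change rather than an accidental extra key in a shared dictionary.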
Why a terminal Pauli frame, and what it costs
The LLM emits two integer lists: which data qubits suffered an X error and
which suffered a Z error, at the moment of final measurement (a
terminal Pauli frame). For the rotated memory_z task this is sufficient
for the logical observable - the destructive Z measurement is exactly the
Z observable, and an X error on a data qubit in the observable's support
flips its measurement outcome.
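The parity argument above is small enough to write down directly. The sketch below is a toy: the observable support {0, 1, 2} and the function names are assumptions for illustration, and the real reward is computed against Stim's ground truth rather than by this helper.

```python
def observable_flip(x_errors, observable_support):
    """Parity of X errors inside the Z observable's support (1 = flipped)."""
    return len(set(x_errors) & set(observable_support)) % 2


def logical_correction_sketch(predicted_x, observable_support, true_flip):
    """1.0 when the predicted frame reproduces the true observable flip."""
    predicted_flip = observable_flip(predicted_x, observable_support)
    return 1.0 if predicted_flip == true_flip else 0.0


# Hypothetical support: data qubits 0, 1, 2 carry the logical Z observable.
SUPPORT = {0, 1, 2}

one_error = observable_flip([1], SUPPORT)      # one X error inside the support
two_errors = observable_flip([1, 2], SUPPORT)  # two X errors cancel in parity
outside = observable_flip([5], SUPPORT)        # X error outside the support
```

Z errors never enter the computation, since they commute with the destructive Z measurement and so leave the memory_z observable unchanged.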
The trade-off is that an end-of-circuit Pauli frame only constrains the final-round detectors (the ones that incorporate the destructive Z measurement results). Earlier-round detectors fire only in response to errors that propagate through the stabilizer rounds, and a terminal frame cannot say anything about them. Reward 2 (syndrome consistency) explicitly grades only the final-round detector bits, which matches the representation's expressive power. The remaining detector bits are implicitly available in the prompt for the LLM to reason about, but unscored.
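One way to picture the final-round check: each final-round detector is tied to a Z stabilizer, and a terminal X error flips every such detector whose support contains the affected qubit. The sketch below scores the fraction of final-round bits a predicted frame reproduces. The check supports are a made-up toy layout, and the fraction-of-matches formula is an assumption, since the actual scoring rule is not spelled out here.

```python
def implied_final_detectors(x_errors, z_check_supports):
    """Final-round detector bits implied by a terminal X-error set."""
    xs = set(x_errors)
    return [len(xs & set(support)) % 2 for support in z_check_supports]


def syndrome_consistency_sketch(x_errors, z_check_supports, observed_final_bits):
    """Fraction of final-round detector bits the predicted frame reproduces."""
    implied = implied_final_detectors(x_errors, z_check_supports)
    if len(implied) != len(observed_final_bits):
        raise ValueError("one observed bit is required per final-round detector")
    if not implied:
        return 1.0  # nothing to disagree with
    matches = sum(a == b for a, b in zip(implied, observed_final_bits))
    return matches / len(implied)


# Toy layout (hypothetical, not the real rotated-code geometry):
# each set lists the data qubits touched by one Z check.
Z_CHECKS = [{0, 1, 3, 4}, {1, 2}, {4, 5, 7, 8}, {6, 7}]
observed = [1, 1, 0, 0]  # final-round detector bits from the syndrome

score_match = syndrome_consistency_sketch([1], Z_CHECKS, observed)
score_empty = syndrome_consistency_sketch([], Z_CHECKS, observed)
```

Earlier-round detector bits are deliberately absent from the inputs, mirroring the point that a terminal frame has nothing to say about them.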
Why five rewards instead of one
The participant guide is emphatic: "use multiple independent reward functions, not just one." Each of our five rewards is independently verifiable in well under a millisecond and disagrees with at least one other on degenerate inputs:
- All-zeros agent on a syndrome with a logical-but-undetectable error: logical_correction = 0 but syndrome_consistency = 1. The R2 - R1 disagreement exposes the failure case.
- Random-qubit agent that lands on the right observable parity by luck: logical_correction = 1 but syndrome_consistency and hamming_overlap are both low. R1 alone over-rewards; the others expose the lack of understanding.
This decomposition is what the guide calls "hard to game by construction."
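Both degenerate cases can be reproduced on a toy three-qubit repetition code, which is much smaller than the rotated surface code but has the same parity structure. Everything in this sketch (the check supports, the observable support, the rewards helper and its scoring formulas) is an assumption for illustration; only the two reward names come from the list above.

```python
# Toy code: Z checks on qubit pairs, logical Z observable read off qubit 0.
Z_CHECKS = [{0, 1}, {1, 2}]
OBSERVABLE = {0}


def parity(x_errors, support):
    """Parity of X errors that land inside a given support."""
    return len(set(x_errors) & support) % 2


def rewards(predicted_x, true_x):
    """Score a predicted terminal X frame against the true X errors."""
    true_flip = parity(true_x, OBSERVABLE)
    observed = [parity(true_x, check) for check in Z_CHECKS]
    implied = [parity(predicted_x, check) for check in Z_CHECKS]
    r1 = 1.0 if parity(predicted_x, OBSERVABLE) == true_flip else 0.0
    r2 = sum(a == b for a, b in zip(implied, observed)) / len(Z_CHECKS)
    return {"logical_correction": r1, "syndrome_consistency": r2}


# Case 1: X on all three qubits is a logical error with an all-zero syndrome.
# The all-zeros agent matches the syndrome but misses the logical flip.
all_zeros = rewards(predicted_x=[], true_x=[0, 1, 2])

# Case 2: a single X error on qubit 0. The guess [0, 1] happens to get the
# observable parity right while contradicting both detector bits.
lucky = rewards(predicted_x=[0, 1], true_x=[0])
```

In the first case only the logical reward catches the failure, and in the second only the syndrome reward does, which is the disagreement pattern the two bullets describe.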