ronitraj committed · Commit 7e06782 · verified · Parent(s): 4693f9a

Upload docs/architecture.md with huggingface_hub

# Architecture - Qubit-Medic

The system has three concentric layers, each behind a clean contract.

```
+-------------------------------------------------------------+
|                         LLM trainer                         |
|                 (TRL GRPOTrainer + Unsloth)                 |
|                                                             |
|  for each step:                                             |
|      prompts = sample(prompt_pool)                          |
|      completions = model.generate(prompts, n=4)             |
|      for c in completions:                                  |
|          rewards = env_client.step(c).info["rewards"]       |
+----------------------------+--------------------------------+
                             | HTTP (or in-process)
                             v
+-------------------------------------------------------------+
|            FastAPI server: qubit_medic.server.app           |
|                                                             |
|  POST /reset   -> DecoderObservation                        |
|  POST /step    -> StepResult (reward + info breakdown)      |
|  GET  /health  -> liveness + curriculum stats               |
|  POST /decode  -> baseline PyMatching prediction            |
+----------------------------+--------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|     DecoderEnvironment (qubit_medic.server.environment)     |
|                                                             |
|  reset():                                                   |
|    1. CurriculumScheduler.sample()                          |
|    2. cached: stim.Circuit + DEM + pymatching.Matching      |
|    3. compile_detector_sampler().sample(1) -> syndrome      |
|    4. build_prompt(...) -> DecoderObservation               |
|                                                             |
|  step(raw_response):                                        |
|    1. parse_action() -> ParseResult (X/Z error sets)        |
|    2. layout.llm_to_stim() remap to Stim qubit IDs          |
|    3. compute_all_rewards():                                |
|         - logical_correction (Stim ground truth)            |
|         - syndrome_consistency (final-round detectors)      |
|         - hamming_overlap (vs PyMatching reference frame)   |
|         - format_compliance (parser output)                 |
|         - pymatching_beat (LLM right & PM wrong)            |
|    4. CurriculumScheduler.update(level, logical_correct)    |
|    5. return StepResult                                     |
+-------------------------------------------------------------+
```
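
A minimal trainer-side helper for the loop above might look like the sketch below. The endpoint paths come from the diagram; `BASE_URL`, the payload fields (`"episode_id"`, `"response"`), and the `StepResult` layout are assumptions made for this sketch, not the actual qubit_medic schemas.

```python
import json
from urllib import request

# BASE_URL and all field names below are illustrative assumptions,
# not the real qubit_medic Pydantic schemas.
BASE_URL = "http://localhost:8000"

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON body to the environment server and decode the reply."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def extract_rewards(step_result: dict) -> dict:
    """Pull the per-component reward breakdown out of a StepResult dict,
    mirroring rewards = env_client.step(c).info["rewards"] above."""
    return step_result.get("info", {}).get("rewards", {})
```

The trainer only ever sees what `POST /step` returns; the reward breakdown arrives inside `info`, never the ground truth used to compute it.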

## Trust boundaries

```
+-----------+      prompt + syndrome      +--------------+
|    LLM    | <-------------------------- |  Observation |
+-----------+                             +--------------+
      |
      v  raw text
+-----------+       parse + remap       +-----------+
|  Action   | --> [LLM ID space] -----> |  Stim ID  |
+-----------+                           +-----------+
      |
      v  scoring
+-----------+
|   State   |
| (server)  |
+-----------+
```

The `DecoderState` (server-side) holds the ground-truth observable flip,
the true error pattern (PyMatching reference frame), and the seed used for
sampling. **None** of this is ever returned to the LLM. This is the
participant guide's `"avoid unrestricted global state"` discipline made
concrete by Pydantic schemas.
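
One way to picture the boundary is the split below, with plain dataclasses standing in for the actual Pydantic models (the field names are illustrative): everything that lives in `DecoderState` but outside `DecoderObservation` never leaves the server.

```python
from dataclasses import dataclass, asdict

# Plain dataclasses standing in for the real Pydantic models; field names
# are illustrative. The invariant they sketch: only the observation is
# ever serialized toward the LLM.

@dataclass
class DecoderObservation:
    """What the LLM is allowed to see."""
    prompt: str
    syndrome: list

@dataclass
class DecoderState:
    """Server-only episode state; holds the ground truth."""
    observation: DecoderObservation
    observable_flip: bool   # ground-truth logical flip
    true_error: list        # PyMatching reference frame
    seed: int               # sampling seed

def to_wire(state: DecoderState) -> dict:
    """Serialize only the observation; ground truth never crosses the
    trust boundary."""
    return asdict(state.observation)
```

Because `to_wire` takes the whole state but returns only the observation, the "never returned to the LLM" guarantee is enforced by the schema rather than by discipline at each call site.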

## Why a terminal Pauli frame, and what it costs

The LLM emits two integer lists: which data qubits suffered an X error and
which suffered a Z error, **at the moment of final measurement** (a
terminal Pauli frame). For the rotated `memory_z` task this is sufficient
for the logical observable - the destructive Z measurement is exactly the
Z observable, and an X error on a data qubit in the observable's support
flips its measurement outcome.

The trade-off is that an end-of-circuit Pauli frame *only* constrains the
final-round detectors (the ones that incorporate the destructive Z
measurement results). Earlier-round detectors fire only in response to
errors that propagate through the stabilizer rounds, and a terminal frame
cannot say anything about them. Reward 2 (syndrome consistency)
explicitly grades only the final-round detector bits, which matches the
representation's expressive power. The remaining detector bits are
implicitly *available* in the prompt for the LLM to reason about, but
unscored.
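
The claim about X errors and the Z observable is plain parity arithmetic, which can be sketched as follows (the qubit indexing is an assumption for illustration, not the project's actual layout):

```python
# Parity sketch of the claim above. The logical Z observable is the parity
# of destructive Z-measurement outcomes over its support; a terminal-frame
# X error on a supported qubit flips that qubit's outcome, and hence the
# observable's parity. Indexing is illustrative.

def logical_z(z_outcomes: dict, support: frozenset, x_errors: set) -> int:
    """Parity of the Z observable after applying terminal-frame X errors."""
    parity = 0
    for q in support:
        parity ^= z_outcomes[q] ^ (1 if q in x_errors else 0)
    return parity
```

An X error outside the support leaves the parity unchanged, which is why a terminal frame can be graded against the final-round detectors and the observable, but not against earlier rounds.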

## Why five rewards instead of one

The participant guide is emphatic: *"use multiple independent reward
functions, not just one."* Each of our five rewards is independently
verifiable in well under a millisecond and disagrees with at least one
other on degenerate inputs:

* All-zeros agent on a syndrome with a logical-but-undetectable error:
  `logical_correction = 0` but `syndrome_consistency = 1`. The R2 vs R1
  disagreement exposes the failure case.
* Random-qubit agent that lands on the right observable parity by luck:
  `logical_correction = 1` but `syndrome_consistency` and
  `hamming_overlap` are both low. R1 alone over-rewards; the others
  expose the lack of understanding.

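The two degenerate agents can be sketched numerically. The component names match the architecture diagram; the reward values below are made-up stand-ins for the sketch, not outputs of the real scorers.

```python
# Illustrative reward vectors for the two degenerate agents above.
# Values are invented for the sketch; only the disagreement pattern
# matters: no single component separates both failure modes, but the
# R1/R2 pair does.

COMPONENTS = ("logical_correction", "syndrome_consistency",
              "hamming_overlap", "format_compliance", "pymatching_beat")

all_zeros_agent = dict(zip(COMPONENTS, (0.0, 1.0, 0.2, 1.0, 0.0)))
lucky_random_agent = dict(zip(COMPONENTS, (1.0, 0.1, 0.1, 1.0, 0.0)))

def r1_r2_gap(rewards: dict) -> float:
    """Disagreement between logical correctness (R1) and syndrome
    consistency (R2): large for both degenerate agents, near zero for a
    decoder that actually explains the syndrome it was shown."""
    return abs(rewards["logical_correction"] - rewards["syndrome_consistency"])
```
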
This decomposition is what the guide calls *"hard to game by
construction."*