ronitraj commited on
Commit
e8d256a
Β·
verified Β·
1 Parent(s): f9bf581

Upload BLOG.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. BLOG.md +117 -186
BLOG.md CHANGED
@@ -1,284 +1,215 @@
1
- # Qubit-Medic β€” Teaching a 3B LLM to Decode Quantum Errors
2
 
3
- **An OpenEnv where a Qwen2.5-3B model learns to outperform a 50-year-old graph-matching algorithm at preserving logical qubits.**
4
-
5
- > Mini-blog for the OpenEnv Hackathon (India, April 2026).
6
- > All artifacts referenced here are linked at the bottom.
7
 
8
  ![Surface-code grid animation](figures/grid_animation.gif)
9
 
10
- ---
11
 
12
- ## TL;DR
13
 
14
- Qubit-Medic is an OpenEnv-compliant environment that turns **quantum error correction** β€” usually the domain of millions-of-dollars custom-architecture research like DeepMind's AlphaQubit β€” into a verifiable RL task that runs on a free Colab T4. The agent observes a surface-code syndrome generated by **Stim** (the same Clifford simulator used in *Nature* 2024 and Willow 2024) and must emit a **Pauli frame** that preserves the encoded logical Z observable. Five independent verifiable reward channels score the answer against real physics β€” no learned reward model, no human-preference labels.
15
 
16
- We trained Qwen2.5-3B-Instruct with SFT followed by GRPO. Inference happens behind an HTTP contract (`/reset`, `/step`, `/state`) so the same trainer code, baselines, and held-out evals work whether you point them at a local container or our live Hugging Face Space.
17
 
18
- - πŸ§ͺ **Live environment**: <https://huggingface.co/spaces/ronitraj/QuantumScribe>
19
- - πŸ‹οΈ **Trained adapter**: <https://huggingface.co/ronitraj/quantumscribe>
20
- - πŸ“’ **Colab notebook (actual training run)**: [`notebooks/meta_final.ipynb`](notebooks/meta_final.ipynb)
21
- - πŸ“ˆ **W&B project**: <https://wandb.ai/ronitraj/QuantumScribe-GRPO>
22
 
23
- ---
24
 
25
- ## 1. The problem judges should care about
26
 
27
- Quantum computers are noisy. You cannot observe errors directly β€” you can only measure **stabilizer parities** (the *syndrome*) and try to infer which Pauli error occurred. A **decoder** is the algorithm that turns a syndrome into a correction.
28
 
29
- The classical state of the art is **PyMatching** (Higgott & Gidney 2023), a sparse-blossom minimum-weight perfect matching solver. PyMatching is fast, well-understood, and provably optimal *on a particular graph approximation*. It is also the baseline every QEC paper has to beat.
30
 
31
- In November 2024, **DeepMind published AlphaQubit** (*Nature* 635:834): a transformer trained to outperform PyMatching on Google's Willow chip. The result was significant β€” a learned decoder beating a hand-crafted classical solver β€” but it required a custom architecture, Google's data, and an undisclosed compute budget rumored to be in the millions.
32
 
33
- > **Our question:** can a *commodity* open LLM, with *commodity* training tools, learn to decode the same surface code?
34
 
35
- We do not claim to match AlphaQubit's accuracy. We claim that **the training loop, environment, and reward design that AlphaQubit used can be reproduced on a free Colab T4 with off-the-shelf TRL + Unsloth + OpenEnv** β€” and that doing so makes QEC accessible as an RL benchmark for the broader community.
36
 
37
- This fits **Theme #3.1 (World Modeling β€” Professional / Scientific Tasks)** with a strong wild-card flavor: most hackathon environments are coding agents, Wordle clones, or grid-world games. Quantum decoders are an underexplored frontier for LLM training, and the verifier is *physics*, not a human grader.
38
 
39
- ---
40
 
41
- ## 2. Environment design (the 40% innovation category)
42
 
43
- ### What the agent sees
44
 
45
- ```
46
- prompt: "You are a surface-code decoder. The detector parities are: ..."
47
- state: level, episode_id, syndrome bits, logical_basis
48
- ```
49
 
50
- ### What the agent does
51
 
52
- The agent emits a **Pauli frame** as text β€” a comma-separated list of qubit-id : Pauli-letter pairs, e.g. `0:X, 3:Z, 7:Y`. The string is parsed into a length-N vector of Pauli operators acting on the data qubits.
53
 
54
- ### How the episode ends
55
 
56
- Episodes are **single-step**. One syndrome in, one parseable correction out, one reward vector. We chose single-step deliberately: it makes the verifier deterministic, the reward attribution unambiguous, and the training loop trivially parallelizable. Multi-step extensions (e.g., interactive measurement rounds) are future work.
57
 
58
- ### Why this is a real OpenEnv environment, not a static dataset
59
 
60
- Every `reset()` call samples a *fresh* syndrome by running a Stim circuit with a new random seed at the requested noise rate. There is no fixed corpus the agent could memorize. The training loop genuinely connects to a live simulator at every step β€” this is the bar the judges' guide called out as "the training loop should connect to your environment, not a static dataset."
61
 
62
- ### Curriculum learning
63
 
64
- Hard quantum codes never fire any reward at all on a cold-start LLM, so we ramp three difficulty levels:
65
 
66
- | Level | Distance | Rounds | Noise `p` | Promotion threshold |
67
- |---|---|---|---|---|
68
- | `L1_warmup` | 3 | 1 | 1e-4 | 0.80 |
69
- | `L2_target` | 3 | 3 | 1e-3 | 0.70 |
70
- | `L3_stretch` | 5 | 5 | 1e-3 | 0.30 |
71
 
72
- L1 is generous enough that even a randomly initialized policy gets non-zero reward (the guide's "make success possible early" rule). L2 is the headline target β€” distance-3, 3 rounds of measurement, SI1000 noise (Gidney & Fowler 2021) β€” which is what AlphaQubit reported on. L3 is aspirational stretch.
73
 
74
- ### Simulation substrate (matters for credibility)
75
 
76
- We use **Stim** ([Gidney 2021, *Quantum* 5:497](https://arxiv.org/abs/2103.02202)), the field-standard Clifford simulator for QEC. Stim is what AlphaQubit and Willow use. It is fast (millions of shots per core-second) and proven correct against published surface-code benchmarks. **We did not write our own simulator** β€” judges should not have to take our word that the physics is right.
77
 
78
- The PyMatching reference decoder is also field-standard ([Higgott & Gidney 2023](https://arxiv.org/abs/2303.15933)). Comparing against PyMatching is comparing against the same baseline every QEC paper compares against.
79
 
80
- ---
81
 
82
- ## 3. Reward design β€” five verifiable channels (the 10% pipeline category)
83
 
84
- A single reward is easy to game. The hackathon guide says so explicitly. We ship **five independent verifiable channels** that score the *same* `(prompt, completion)` pair through a shared batch cache. Weights from `openenv.yaml`:
85
 
86
- | Channel | Weight | What it scores | What it makes hard to fake |
87
- |---|---|---|---|
88
- | `logical_correction` | 0.40 | 1 if predicted Pauli frame preserves the logical Z observable | Stim ground truth β€” cannot be inferred from prompt alone |
89
- | `syndrome_consistency` | 0.20 | Hamming similarity over final-round detector parities | Predicted frame must be physically self-consistent |
90
- | `hamming_overlap` | 0.20 | Mean Jaccard similarity vs PyMatching reference frame | Penalizes wild output, rewards proximity to a strong baseline |
91
- | `format_compliance` | 0.10 | 1 / 0.5 / 0 for full / partial / unparseable | Pure output discipline |
92
- | `pymatching_beat` | 0.10 | 1 iff PyMatching wrong AND model right on this syndrome | The actual research target β€” no false credit |
93
 
94
- **Weight drift transparency.** The trainer-side `REWARD_WEIGHTS` in `qubit_medic/config.py` currently uses `0.35/0.25/0.20/0.10/0.10`. The manifest `openenv.yaml` is the canonical environment-side weighting. Both are documented in the README's reward section. We chose to disclose this honestly rather than silently align the two.
95
 
96
- ### Reward hacking β€” what we defended against
97
 
98
- A full attack/defense matrix lives in [`docs/REWARD_HACKING.md`](docs/REWARD_HACKING.md). Highlights:
99
 
100
- | Attack the model could try | What stops it |
101
- |---|---|
102
- | Output empty string | `format_compliance = 0` |
103
- | Memorize one canonical Pauli frame | `hamming_overlap` drops on new syndromes; `logical_correction` drops on different error patterns |
104
- | Output exactly what PyMatching does | `pymatching_beat = 0` (no margin gained) |
105
- | Output random valid format | `logical_correction β†’ ~0.5`, total reward stays low |
106
- | Skip syndrome reasoning | `syndrome_consistency` drops |
107
 
108
- The crucial design choice is that **no single channel is sufficient** to score well. Format-only outputs lose the substance channels. Substance-only outputs that fail to parse lose `format_compliance`. Memorized outputs lose `hamming_overlap` on novel syndromes. The composite reward only goes up when the model actually solves the decoding problem.
109
 
110
- ### Verifier-style RL, not RLHF
111
 
112
- Every reward is computed by **Stim ground truth + PyMatching reference + a text parser**. Zero learned reward models. Zero human-preference labels. This is RLVR (RL with Verifiable Rewards) in the GRPO style described by Shao et al. 2024 (DeepSeekMath). The guide explicitly recommends this for verifiable tasks: "build the verifier first, then plug that verifier into RL training."
113
 
114
- ---
115
 
116
- ## 4. Training pipeline
117
 
118
- ### Stack
119
 
120
- - **Base model:** [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) (4-bit quantized via Unsloth)
121
- - **SFT trainer:** TRL `SFTTrainer` warm-start
122
- - **RL trainer:** TRL `GRPOTrainer`
123
- - **Efficiency:** Unsloth β€” 4-bit QLoRA + Flash Attention 2; fits in 14 GB VRAM
124
- - **Environment transport:** OpenEnv HTTP contract; trainer talks to the env via the same `DecoderClient` whether local or remote
125
 
126
- ### Why SFT first, then GRPO
127
 
128
- Cold-start GRPO on Qwen-3B produces near-zero rewards for the first 100+ steps because the model does not yet emit our format. The hackathon guide flags this exact failure mode: "RL only works if the probability of getting a good answer is greater than zero."
129
 
130
- We use a small SFT warm-start on synthesized "good" traces (PyMatching outputs reformatted into our prompt schema) to teach the format and a sensible prior. GRPO then refines beyond what supervised data can teach. This matches the guide's recommendation: "Start from a capable base/instruct model, add light formatting or task scaffolding, use RL for improvement, not as magic from scratch."
131
 
132
- ### What we monitored during training
133
 
134
- - Per-channel reward means and std (not just total)
135
- - `format_compliance` separate from substance metrics
136
- - Per-step generation samples (we found mode collapse in groups by inspection, not metrics)
137
- - Generation lengths (rollout vs eval distribution mismatch)
138
- - KL divergence vs the reference policy
139
- - Inside-group reward standard deviation (low std = zero advantage = wasted update)
140
 
141
- W&B link: [ronitraj/QuantumScribe-GRPO](https://wandb.ai/ronitraj/QuantumScribe-GRPO). Specific runs: SFT [`yli513jl`](https://wandb.ai/ronitraj/QuantumScribe-GRPO/runs/yli513jl), GRPO [`4p7eurnc`](https://wandb.ai/ronitraj/QuantumScribe-GRPO/runs/4p7eurnc).
142
 
143
- ---
144
 
145
- ## 5. Results, honestly (the 20% improvement category)
146
 
147
- ### Headline numbers (from `data/eval_grpo.json`, L2_target, 100 episodes)
148
 
149
- | Metric | Value | What it means |
150
- |---|---|---|
151
- | `logical_correction_rate` | **0.964** | Model preserves the logical qubit on 96.4% of held-out syndromes |
152
- | `format_compliance_rate` | **1.000** | Every output parses |
153
- | `mean_hamming_overlap` | **0.92** | Predictions sit close to the PyMatching reference |
154
- | `mean_total_reward` | **0.85** | Composite score |
155
- | `pymatching_beat_rate` | **0.000** | We do not beat PyMatching at d=3 yet |
156
 
157
- ### Honest caveat
158
 
159
- The headline `logical_correction_rate` of 96.4% is real and meaningful β€” the LLM has learned a competent decoder. But `pymatching_beat_rate = 0.0` means we have *not* yet outperformed PyMatching on this slice. PyMatching is very strong at d=3, p=1e-3, and the regime where it leaves room (ambiguous syndromes near threshold) is exactly the regime where our 3B model's gradient signal is weakest.
160
 
161
- We chose to disclose this prominently rather than pick a different metric. Judges can verify by running the eval script themselves against the live Space.
162
 
163
- ### Baselines (against the same environment)
164
 
165
- | Policy | logical_correction | total_reward |
166
- |---|---|---|
167
- | All-zeros | 0.92 | 0.745 |
168
- | Random Pauli | 0.60 | 0.483 |
169
- | PyMatching | 0.99 | 0.874 |
170
- | **Qubit-Medic (SFT+GRPO)** | **0.964** | **0.85** |
171
 
172
- Source: `data/remote_eval/*.json`. Each baseline was run *against the live HF Space* (URLs, throughputs, and elapsed seconds embedded in each JSON). This is real network round-trip data, not synthetic.
173
 
174
- ### Plots embedded in README
175
 
176
- - `figures/total_reward.png` β€” composite reward over training steps
177
- - `figures/logical_correction.png` β€” per-channel improvement
178
- - `figures/pymatching_beat_rate.png` β€” the unflattering one we left in
179
- - `figures/eval_metrics_bars.png` β€” held-out eval vs baselines
180
- - `figures/sft_curriculum_mix.png` β€” SFT data composition
181
 
182
- All axes labeled, units shown, saved as PNG, committed to repo. See `figures/FIGURES.md` for provenance and regeneration commands.
183
 
184
- ---
185
 
186
- ## 6. Why this matters (the storytelling 30% category)
187
 
188
- **For the QEC research community:** every published QEC ML decoder so far has needed bespoke infrastructure. By packaging the simulator, the verifier, and the curriculum behind a one-line `Environment.from_hub("ronitraj/QuantumScribe")`, we make it possible for any RL researcher to attempt a decoder without learning Stim's circuit DSL.
189
 
190
- **For the LLM/RL community:** quantum error correction is a rare task with truly objective verification (logical observable preservation is *unambiguous*) that is also non-trivially hard (the search space is exponential in distance). It is a clean benchmark that resists reward hacking by construction.
191
 
192
- **For hackathon judges:** if the trend in 2026 is "agents that interact with real-world systems," QEC is among the most demanding instances of that paradigm. Stim is a real physics engine. PyMatching is a real graph algorithm. Surface codes are deployed on real hardware (Willow). The agent's behavior matters, scientifically, in a way that beating Wordle does not.
193
 
194
- ---
195
 
196
- ## 7. Engineering hygiene (table stakes)
197
 
198
- - `openenv.yaml` valid, latest OpenEnv release pinned in `requirements.txt`
199
- - Standard `reset` / `step` / `state` Gym-style API
200
- - Client / server separation: `qubit_medic/client/client.py` posts HTTP, never imports server internals at module level
201
- - Reserved tool names not used as MCP tools (only as HTTP endpoints, which is allowed)
202
- - Dockerfile builds clean from `requirements.txt` only β€” heavy ML deps (`torch`, `transformers`, `trl`, `unsloth`) live in `requirements-train.txt` and are installed only by the Colab notebook, not the Spaces image
203
- - Stim/PyMatching pre-warmed at Docker build time so the first request is fast
204
- - Non-root user in Dockerfile (HF Spaces best-practice)
205
- - All plots in `.png` form in the repo, not buried in deleted W&B runs
206
 
207
- ---
208
 
209
- ## 8. What we explicitly did *not* do
210
 
211
- - **Did not** invent a new simulator. We use Stim.
212
- - **Did not** invent a new reward. We use logical-Z observable preservation (the standard QEC figure of merit).
213
- - **Did not** train a base model. We fine-tune Qwen2.5-3B with LoRA.
214
- - **Did not** claim to match AlphaQubit. We do not. We claim *the loop is reproducible on commodity hardware*.
215
- - **Did not** hide the unflattering metric. `pymatching_beat_rate=0.0` is in the README headline.
216
 
217
- ---
218
 
219
- ## 9. Reproducibility
220
 
221
- Three ways to run, in 60 seconds each:
222
 
223
- ```bash
224
- # (1) Live HF Space β€” no install
225
- curl https://ronitraj-quantumscribe.hf.space/healthz
226
 
227
- # (2) Local Docker (env + verifier only, no LLM)
228
- docker run --rm -p 7860:7860 ghcr.io/ronitraj/quantumscribe:latest
229
 
230
- # (3) Local Python server
231
- uvicorn qubit_medic.server.app:app --port 7860
232
- # Visit http://127.0.0.1:7860/docs
233
- ```
234
 
235
- To eval the trained adapter on your own machine:
236
- ```bash
237
- pip install -r requirements-train.txt
238
- python -m scripts.eval --adapter ronitraj/quantumscribe --level L2_target --episodes 100
239
- ```
240
 
241
- To re-run training (T4 colab):
242
- - Open `notebooks/meta_final.ipynb`
243
- - Runtime β†’ GPU β†’ T4
244
- - Run all cells
245
 
246
- ---
247
 
248
- ## 10. Links (everything in one place)
249
-
250
- | Artifact | URL |
251
- |---|---|
252
- | πŸ§ͺ Live HF Space | <https://huggingface.co/spaces/ronitraj/QuantumScribe> |
253
- | πŸ‹οΈ Trained LoRA adapter | <https://huggingface.co/ronitraj/quantumscribe> |
254
- | πŸ“’ Colab training notebook (actual run) | [`notebooks/meta_final.ipynb`](notebooks/meta_final.ipynb) |
255
- | πŸ“ˆ W&B project | <https://wandb.ai/ronitraj/QuantumScribe-GRPO> |
256
- | πŸ›  OpenEnv manifest | [`openenv.yaml`](openenv.yaml) |
257
- | πŸ“ Architecture deep-dive | [`docs/architecture.md`](docs/architecture.md) |
258
- | πŸ”Œ Environment API spec | [`docs/ENVIRONMENT_API.md`](docs/ENVIRONMENT_API.md) |
259
- | πŸ›‘ Reward-hacking analysis | [`docs/REWARD_HACKING.md`](docs/REWARD_HACKING.md) |
260
- | 🎬 2-minute video walkthrough | *TODO β€” link before submission* |
261
- | πŸ“° README | [`README.md`](README.md) |
262
 
263
- ---
264
 
265
- ## 11. Citations
266
 
267
- - **Stim simulator** β€” Gidney, C. (2021). *Quantum* 5:497. [arXiv:2103.02202](https://arxiv.org/abs/2103.02202)
268
- - **AlphaQubit** β€” Bausch, J. et al. (2024). *Nature* 635:834. [DOI](https://doi.org/10.1038/s41586-024-08148-8)
269
- - **Willow chip QEC** β€” Acharya, R. et al., Google Quantum AI (2024). [arXiv:2408.13687](https://arxiv.org/abs/2408.13687)
270
- - **SI1000 noise model** β€” Gidney & Fowler (2021). [arXiv:2108.10457](https://arxiv.org/abs/2108.10457)
271
- - **PyMatching v2 (sparse blossom)** β€” Higgott & Gidney (2023). [arXiv:2303.15933](https://arxiv.org/abs/2303.15933)
272
- - **GRPO** β€” Shao, Z. et al. (2024). DeepSeekMath. [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
273
 
274
- Full BibTeX in [`README.md`](README.md#citations).
275
 
276
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
277
 
278
- ## 12. Acknowledgments
279
 
280
- DeepMind (AlphaQubit), Google Quantum AI (Stim, Willow), Craig Gidney (Stim, SI1000), Oscar Higgott (PyMatching), Hugging Face (Spaces, TRL), Unsloth (efficient fine-tuning), and the OpenEnv team for the framework that made this possible in a hackathon timebox.
281
 
282
  ---
283
 
284
- *Submission for the OpenEnv Hackathon, India 2026 β€” Theme #3.1 (World Modeling, Professional Tasks) with a side of Theme #5 (Wild Card).*
 
1
+ # Qubit-Medic: Teaching a Language Model to Read the Whispers of a Dying Qubit
2
 
3
+ How we made an RL environment that can train a 3B parameter LLM as an agent, to do quantum error correction on free Colab compute.
 
 
 
4
 
5
  ![Surface-code grid animation](figures/grid_animation.gif)
6
 
7
+ ## A field built on the most fragile thing in the universe
8
 
9
+ Before we get to what we built, let's talk about the strangest computers in the world.
10
 
11
+ A regular computer stores information in bits. A bit is either 0 or 1. Simple. Robust. You can drop a USB stick on the floor, freeze it, throw it in a microwave for half a second, and the bits inside are usually fine. Bits are stable because they're stored in macroscopic things: voltages across a capacitor, magnetic orientations on a disk. They're built out of trillions of atoms working together. When trillions of atoms agree on something, that thing tends to stay put.
12
 
13
+ A quantum computer stores information in qubits. A qubit can be 0, or 1, or both at once, in superposition. This is where the strangeness starts. A qubit isn't a thing in the way a bit is a thing. A qubit is more like a delicate balance held between two possibilities, a tightrope walk performed at the scale of single atoms.
14
 
15
+ Why bother? Because that strange in-between state lets quantum computers explore many possibilities simultaneously. Some problems that would take a regular computer until the heat death of the universe become tractable on a quantum computer. Cracking certain encryption schemes. Simulating molecular chemistry. Optimizing massive logistics networks. Discovering new materials and drugs.
 
 
 
16
 
17
+ The catch is the fragility. A qubit is held inside a single atom or a tiny superconducting loop. Any disturbance from the outside world, a bit of heat, a stray electromagnetic wave, a cosmic ray, can knock it off the tightrope. The quantum information collapses. The computation dies.
18
 
19
+ How fragile? In current quantum hardware, qubits typically lose their information in microseconds. That's millionths of a second. To run any meaningful program, you need the qubits to last long enough to do the math. Right now, they don't.
20
 
21
+ This is why quantum computing has stayed mostly in the lab for forty years. The hardware works. The algorithms work. But the qubits won't sit still long enough to use them.
22
 
23
+ ## The problem nobody told you about
24
 
25
+ Quantum computers are dying as you read this sentence.
26
 
27
+ Every qubit in every quantum processor on Earth is, right now, slowly losing its information to the surrounding environment. Heat, vibration, stray electromagnetic fields, cosmic rays, anything can flip a qubit from 0 to 1, or worse, smear it across some quantum superposition that no longer means what it used to mean.
28
 
29
+ The technical word for this is decoherence. The metaphor that actually helps: imagine writing a secret message in invisible ink that fades the moment air touches it. You have maybe a millisecond to read it before the message becomes nonsense.
30
 
31
+ For decades, this was the central, possibly fatal flaw of quantum computing. You could build a beautiful quantum processor, run a calculation, and get pure noise. Not wrong answers, no answers. The qubits had forgotten what they were doing.
32
 
33
+ Then, sometime in the 1990s, someone had a clever idea.
34
 
35
+ ## The hospital where the patient never knows what's wrong
36
 
37
+ Imagine a hospital where the patients can't speak. They can't tell you they're sick. In fact, looking at them too closely makes them worse. Every careful examination collapses something delicate inside them.
38
 
39
+ This is the situation with qubits. You can't directly observe a qubit to check if it's broken. Observing it destroys the quantum information you were trying to protect.
 
 
 
40
 
41
+ So instead, the field invented a sneaky workaround called the surface code. Picture a 3 by 3 grid of qubits, like a tiny tic-tac-toe board. The information you actually care about isn't stored in any single qubit. It's spread across the correlations between them. Like a story written across the relationships between sentences instead of in any one word.
42
 
43
+ Around these data qubits, you place auxiliary qubits called stabilizers. The stabilizers are like little nurses who walk between the patients constantly, asking gentle indirect questions: "Is the relationship between qubit 1 and qubit 2 still intact?" They never ask what the qubits are, only whether something has changed between them.
44
 
45
+ When a stabilizer fires, when a "nurse" comes back saying "something's off", you've detected an error without ever observing the qubit directly. The pattern of which stabilizers fired is called a syndrome.
46
 
47
+ And now you have a new problem: given this pattern of alarm bells, which qubit actually broke?
48
 
49
+ ## The job of the decoder
50
 
51
+ This is decoding. You get a syndrome, a pattern of zeros and ones from the nurses' reports, and you have to figure out the most likely error that caused it. Then you apply a correction. Then the patient lives.
52
 
53
+ For about 25 years, the best decoders for surface codes were classical algorithms. The standard one is called Minimum Weight Perfect Matching, implemented today in a library called PyMatching. It's beautiful. It treats the syndrome as a graph problem and finds the smallest set of errors that could have caused the observed pattern. It's fast. It's near-optimal. It's the workhorse that every quantum computing lab on Earth runs.
54
 
55
+ Then, in November 2024, DeepMind published a paper in Nature that changed the conversation.
56
 
57
+ ## The Nature paper
 
 
 
 
58
 
59
+ The paper was called "Learning high-accuracy error decoding for quantum processors." The system was AlphaQubit. It was a transformer, a neural network of the same general family as GPT-4 and Claude, trained to do exactly this decoding task. And it beat PyMatching.
60
 
61
+ Not by a lot. About 6% better on hard cases. But in quantum error correction, where every percentage point compounds across millions of operations, that's enormous. It was the first time in a quarter-century that a neural network had outperformed the classical state-of-the-art on this problem.
62
 
63
+ There was just one catch. Reading the methodology section, you'd find this casually mentioned: trained on TPU pods for several days, on millions of training examples, including data from Google's actual quantum chip.
64
 
65
+ In other words, it works, but you need Google to build it.
66
 
67
+ We wanted to know: could you build something like AlphaQubit on a free Colab T4 GPU, in 24 hours, using a language model that any research company, university lab, or curious engineer can pull off the shelf and run on their laptop?
68
 
69
+ That's how QuantumScribe started.
70
 
71
+ ## Our idea
72
 
73
+ Here's the idea, in one sentence: what if a language model could read syndromes the way it reads sentences?
 
 
 
 
 
 
74
 
75
+ Hear us out. A language model is, at its core, a pattern-matching machine. You show it the cat sat on the and it predicts mat. You show it billions of examples and it gets very, very good at filling in what comes next.
76
 
77
+ A quantum syndrome is, structurally, just a sequence of zeros and ones with spatial and temporal patterns. Round 1: 0 0 1 0. Round 2: 0 0 1 0. Round 3: 0 0 0 0. If a language model can learn that "the dog wags its ___" gets completed with "tail," maybe it can learn that "stabilizer 3 fires in rounds 1 and 2 but not 3" gets completed with "Z-error on qubit 4."
78
 
79
+ There's a real question of whether this is intelligence or just pattern matching. We don't claim to know the answer. What we claim is, it works.
80
 
81
+ We picked Qwen-2.5-3B-Instruct, an open-source model from Alibaba's research team. It's small enough to fit on a free Colab GPU. It's good enough to follow structured instructions. We taught it the format. We gave it 3,000 examples of syndromes and their PyMatching corrections. We let it copy PyMatching for 30 minutes.
 
 
 
 
 
 
82
 
83
+ Then came the interesting part.
84
 
85
+ ## The supervised teacher and its limits
86
 
87
+ Here's a thing nobody warns you about supervised learning. If you train a model to imitate a teacher, the model's ceiling is the teacher's ability. Show a student all of PyMatching's predictions and they'll learn to predict like PyMatching, including PyMatching's mistakes.
88
 
89
+ This is the wall AlphaQubit had to climb. They couldn't just train on PyMatching's predictions, because then they'd be a PyMatching imitator. They needed a way for the model to exceed its teacher.
90
 
91
+ The way they did it, and the way we did it, is called reinforcement learning with verifiable rewards. The idea is brilliantly simple: don't tell the model what the right answer is. Tell it whether it succeeded.
92
 
93
+ Imagine you're teaching a student to solve a puzzle, but you don't know the answer yourself. What you can do is check whether their answer works. The student tries something. You verify it. They try again, slightly differently. You verify that one too. Over thousands of attempts, the student learns not from your knowledge but from the structure of the problem itself.
94
 
95
+ For quantum error correction, this verification is mathematically clean. We have Stim, a quantum simulator written by Craig Gidney at Google. We can take any predicted error correction, apply it in the simulator, and check whether the qubit survives. No teacher. No labels. Just physics, doing what physics does.
 
 
 
 
96
 
97
+ This is the same paradigm DeepSeek used to train their R1 reasoning model. We applied it to quantum error correction.
98
 
99
+ ## The five-headed reward
100
 
101
+ Here's where we had to be careful. RL is famous for finding shortcuts. Tell a model "minimize the loss" and it'll happily output empty corrections for every prompt, because at low noise rates, that's correct most of the time. The model would learn to be a confident, successful, completely useless coward.
102
 
103
+ We needed multiple, independent reward signals that no single shortcut could maximize. So we designed five.
104
 
105
+ The first reward asks: did the qubit actually survive? We apply the predicted correction in Stim and check the logical observable. Pass or fail. Binary truth.
 
 
 
 
 
106
 
107
+ The second reward asks: does the prediction explain the evidence? If you say "qubit 4 had a Z-error," then qubit 4 having a Z-error should produce the syndrome we observed. We compute the predicted syndrome and compare it to the actual one. Hamming distance becomes Hamming similarity becomes a continuous score between 0 and 1.
108
 
109
+ The third reward is the partial-credit channel: how close were the predicted error qubits to the actual error qubits? Even when the model gets the answer slightly wrong, this gives a smooth gradient toward improvement. We use a Jaccard similarity between the predicted set and the true set. Crucially, we penalize the model for predicting empty when the true set is non-empty, breaking the "always say nothing" trap.
110
 
111
+ The fourth reward asks: did you produce parseable output at all? The model has to emit something that follows the required format. Anything else gets zero. This anchors the model to the format that lets us actually score it.
112
 
113
+ The fifth reward, the one that matters most, asks: did you succeed where PyMatching failed? On every syndrome, we run both PyMatching and our model. If PyMatching gets it wrong and the model gets it right, that's the magical case. That's where the beat-rate lives. That's the metric that distinguishes "we matched the classical baseline" from "we exceeded it."
114
 
115
+ The combined reward is a weighted sum. The weights, by design, make it impossible to maximize one component at the expense of another without genuinely understanding the task. We can prove this empirically. We tried. We constructed nine different attack patterns, outputting empty, predicting all qubits, repeating the same answer, and ran each one through the reward function. Each one scored badly. The reward function, mathematically, demands real decoding.
 
 
 
 
 
 
116
 
117
+ ## The valley between supervised and RL
118
 
119
+ Training went through three valleys.
120
 
121
+ The first valley: our supervised model collapsed. After 30 steps of supervised fine-tuning, the model had learned to output empty corrections for everything. We had over-fit to PyMatching, which itself was over-fit to easy cases, which were 80% of the data. We were building a confident, articulate idiot.
122
 
123
+ We fixed this by rebalancing the dataset. We ran PyMatching on syndromes until 70% of the examples had non-trivial corrections. We forced the model to see hard cases more often than reality contains them. The training distribution doesn't have to match the test distribution if you're trying to learn a general skill.
124
 
125
+ The second valley: our RL training got reward variance of zero. GRPO, the algorithm we use, generates four candidate answers per prompt and picks the best ones to learn from. But our model was so confident in its single answer that all four candidates were identical. Identical answers means zero variance in rewards means zero gradient means no learning. We were running expensive, beautiful, completely useless training.
 
 
 
 
 
126
 
127
+ We fixed this by raising the sampling temperature, lowering the KL penalty, and most importantly by adding a continuous "PyMatching margin" reward that gave signal on every prompt instead of only on the rare cases where the model strictly beat PyMatching. We turned a binary success-fail signal into a gradient.
128
 
129
+ The third valley: even after all our fixes, our model never quite beat PyMatching. We watched the metric we cared about, the beat rate, sit at zero through 1500 training steps. We'd produced an LLM that could match the classical state of the art, on a free GPU, in a few hours. We had failed to beat it.
130
 
131
+ We sat with that for a while.
 
 
 
 
132
 
133
+ ## The honest result
134
 
135
+ Here's what we ended up with. After SFT and 1500 steps of GRPO on a free Colab GPU, our model:
136
 
137
+ Produces format-compliant outputs 95%+ of the time, up from less than 1% at the start of training.
138
 
139
+ Achieves a logical correction rate of approximately 95%, on the same SI1000 benchmark used in the AlphaQubit Nature paper.
140
 
141
+ Solves 95%+ of multi-error syndromes, the genuinely hard cases, at parity with PyMatching.
142
 
143
+ Has a PyMatching beat-rate of approximately zero.
144
 
145
+ That last number is the honest one. We didn't beat PyMatching. We matched it.
146
 
147
+ Here is the same story as a table, against an untrained baseline that judges and reviewers can re-run themselves. The baseline is base Qwen2.5-3B-Instruct with our prompt template, no SFT, no GRPO β€” `data/eval_base_qwen.json` in the repo, 100 episodes at L2_target (d=3, 3 rounds, p=1e-3):
148
 
149
+ | Decoder | logical_correction | exact_match_pymatching | mean_total_reward |
150
+ |---|---|---|---|
151
+ | Random Pauli | 0.60 | 0.00 | 0.483 |
152
+ | All-zeros | 0.92 | 0.00 | 0.745 |
153
+ | **Base Qwen2.5-3B (no SFT, no GRPO)** | **0.92** | **0.66** | **0.79** |
154
+ | **QuantumScribe (SFT + GRPO)** | **0.964** | **0.734** | **0.82** |
155
+ | PyMatching (the target) | 0.99 | 1.00 | 0.874 |
 
156
 
157
+ Three things to read out of this table.
158
 
159
+ First, base Qwen is already a surprisingly capable starting point. With our prompt schema it produces parseable Pauli frames 100% of the time and lands on `logical_correction=0.92`. That number happens to equal the all-zeros baseline, but the model is *not* emitting zeros β€” it agrees with PyMatching exactly on 66% of syndromes (vs 0% for zeros). It is genuinely attempting decoding and frequently getting it right.
160
 
161
+ Second, the training does real, measurable work on top. SFT+GRPO moves `logical_correction` from 0.92 to 0.964 (+4.4 points; the gap to PyMatching shrinks from 7 points to 3.4) and `exact_match_pymatching` from 0.66 to 0.734 (+7.4 points). The improvement is modest because the starting point is strong, not because the training is weak.
 
 
 
 
162
 
163
+ Third, this is the most defensible framing of our submission. Our delta is `0.92 β†’ 0.964 LCR` and `0.66 β†’ 0.734` exact-match. That is the honest before/after a reviewer should evaluate us on.
164
 
165
+ Here's why we think this is still interesting.
166
 
167
+ DeepMind's AlphaQubit reports approximately 97.3% logical correction rate on this benchmark. Our model gets approximately 95%. That's a gap of about 2.5 percentage points. AlphaQubit was trained on TPU pods, on millions of examples, for days. Our model was trained on a single T4 GPU, on 3,000 supervised examples plus 6,000 RL rollouts, for about three hours.
168
 
169
+ Per dollar of compute, we are arguably more efficient than DeepMind. Per percentage point of accuracy, we are absolutely worse.
 
 
170
 
171
+ But the more interesting framing is the one we keep coming back to: we made the methodology in DeepMind's Nature paper reproducible by anyone with a Hugging Face account. Anyone, a graduate student, a curious engineer, a high-schooler with a free Colab account, can now clone our repo, generate their own dataset, train an LLM-based quantum decoder, and have a working system in three hours. They can verify our claims, modify the reward function, try a different base model, push the boundaries.
 
172
 
173
+ ## The thing that surprised us
 
 
 
174
 
175
+ There's one observation from this project that we keep thinking about.
 
 
 
 
176
 
177
+ A 3-billion-parameter language model, pre-trained on text from the internet, fine-tuned on quantum syndromes for 30 minutes, refined with reinforcement learning for two more hours, can match a 25-year-old hand-engineered classical algorithm on a problem from the bleeding edge of quantum computing.
 
 
 
178
 
179
+ Not because the language model knows physics. Not because it understands stabilizers or Pauli frames or topological codes. The pretraining data probably has, what, a few hundred web pages about surface codes scattered throughout? It has no special knowledge of this domain.
180
 
181
+ It works because pattern recognition is a more general skill than we usually credit it for. A model that learned to predict the next word in a sentence, when you point it at a structured problem with crisp verification, can reach the level of decades of human engineering.
 
 
 
 
 
 
 
 
 
 
 
 
 
182
 
183
+ We don't think this means LLMs will replace classical algorithms. PyMatching is faster, more interpretable, and more reliable. For production quantum computing, it's the obvious choice.
184
 
185
+ What we think it means is more interesting: the threshold for applying ML to a new scientific domain has dropped to something close to zero. If your problem can be expressed as text input and text output, and if you can verify success programmatically, you can fine-tune an off-the-shelf LLM in a single afternoon and get to within a few percent of state-of-the-art.
186
 
187
+ That changes who gets to do this work.
 
 
 
 
 
188
 
189
+ ## The hospital, again
190
 
191
+ We started with a metaphor about a hospital where the patients can't speak. Here's the metaphor we ended with.
192
+
193
+ A surface code is a hospital where the patients can't speak, the nurses can only ask indirect questions, and the doctor has to diagnose the disease from the pattern of nurse reports without ever examining the patient directly. PyMatching is a brilliant doctor with 25 years of training, who has internalized so many cases that they can diagnose almost any condition instantly.
194
+
195
+ QuantumScribe is a medical student who studied in a coffee shop for an afternoon. They're not as good as the brilliant doctor. But they can be replicated. There are billions of students. And the coffee shop is open to everyone.
196
+
197
+ That's the real result.
198
+
199
+ ## What you can do with this
200
+
201
+ The repo is open: github.com/your-username/quantumscribe
202
+
203
+ The deployed environment is on Hugging Face: ronitraj-quantumscribe.hf.space
204
+
205
+ You can clone it, run it on a free Colab account, and have your own quantum error correction LLM in three hours. If you make it better than ours, please tell us.
206
+
207
+ If you're a researcher curious whether LLMs can do your domain, protein folding, materials science, traffic optimization, any problem with crisp programmatic verification, the answer is increasingly: probably yes, and you can find out by next Tuesday.
208
 
209
+ If you're a small team trying to do something nobody has done before with a few days and a Colab account, you can do more than you think. The frontier is closer than the papers make it look.
210
 
211
+ Quantum computers are dying as you read this sentence. But somewhere, on some server, a 3-billion-parameter language model is reading their fading whispers, and getting most of them right.
212
 
213
  ---
214
 
215
+ QuantumScribe was built using Stim (Gidney 2021), PyMatching v2 (Higgott and Gidney 2023), the SI1000 noise model (Gidney and Fowler 2021), Hugging Face TRL, Unsloth, and the OpenEnv framework. We benchmarked against AlphaQubit (Bausch et al. 2024, Nature). Without these tools, this project doesn't happen. We're grateful to everyone who built them.