title: Qubit-Medic
emoji: 🩺
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: true
tags:
  - openenv
  - reinforcement-learning
  - quantum-error-correction
  - stim
  - pymatching
  - grpo
  - trl
  - llm
license: mit
short_description: OpenEnv RL env that teaches an LLM to decode quantum errors.

Qubit-Medic: An LLM Decoder for Quantum Error Correction

An LLM (Qwen2.5-3B-Instruct) learning to decode quantum surface-code syndromes, with the goal of outperforming PyMatching, a minimum-weight perfect matching decoder built on the decades-old blossom algorithm. Training uses verifiable physics rewards, not human preferences. DeepMind's AlphaQubit (Nature 2024, Bausch et al.) showed a transformer can beat strong classical decoders, but it relied on a custom architecture and lab-scale resources. We ship a 3B-parameter open model trained on a free Colab T4 with SFT + GRPO against a real Stim simulator behind an OpenEnv HTTP contract.

Qubit-Medic decoding a syndrome on the rotated surface code

Quick links


What the agent learns

The agent observes a surface-code syndrome (detector parities from a surface_code:rotated_memory_z Stim circuit) and must emit a Pauli frame that preserves the encoded logical Z observable. Episodes are single-step: one syndrome in, one parseable correction out, scored by Stim's real physics, not a learned reward model. Across the curriculum, the policy moves from clean distance-3 codes to noisier multi-round circuits where PyMatching starts to fail.

We generate synthetic surface-code syndromes using Stim (Gidney 2021), the same Clifford simulator used by the AlphaQubit and Willow papers. This ensures our training data is drawn from the same physical model as the published benchmarks, not a homemade simulator.

Surface-code grid animation

Environment

| Field | Value |
| --- | --- |
| Observation | QubitMedicObservation: prompt (text), syndrome bits, level, episode_id, curriculum metadata (see qubit_medic/server/openenv_adapter.py) |
| Action | QubitMedicAction: text field containing the model's parseable Pauli-frame completion |
| Episode end | Single-step: terminates after one step() call; reward + per-component info returned to trainer |
| Curriculum | L1_warmup (d=3, 1 round, p=1e-4) → L2_target (d=3, 3 rounds, p=1e-3) → L3_stretch (d=5, 5 rounds, p=1e-3) with promotion thresholds 0.80 / 0.70 / 0.30 |

Server endpoints (FastAPI, port 7860): /reset, /step, /state, /schema, /metadata, /health, /healthz, /decode (PyMatching baseline). See openenv.yaml.

Reward design

Five independent verifiable channels (no learned reward model). Weights come from openenv.yaml and sum to 1.0:

| Component | Weight | What it measures | What gaming attempt it blocks |
| --- | --- | --- | --- |
| logical_correction | 0.40 | 1 iff predicted Pauli frame preserves the logical Z observable (Stim ground truth) | Outputs that pass syntax checks but flip the logical qubit |
| syndrome_consistency | 0.20 | Hamming similarity of implied final-round detectors vs. observed syndrome | Memorising a popular frame regardless of input syndrome |
| hamming_overlap | 0.20 | Mean Jaccard similarity vs. PyMatching reference frame | Random / sparse outputs that occasionally hit logical correctness |
| format_compliance | 0.10 | 1 / 0.5 / 0 for full / partial / unparseable output | Free-text "thinking" with no decodable answer |
| pymatching_beat | 0.10 | 1 iff PyMatching is wrong and the LLM is right on this syndrome | Copying PyMatching: matching it gives 0 here, you have to actually beat it |
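
The two similarity-based components can be sketched in a few lines. This is an illustrative reading of the table, not the code in qubit_medic/server/rewards.py: the function names are hypothetical, and treating "mean" as the average of the X-part and Z-part similarities, and two empty sets as a perfect match, are assumptions.

```python
from typing import Iterable, Sequence


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard similarity |A & B| / |A | B| of two qubit-index sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty corrections count as a perfect match.
        return 1.0
    return len(set_a & set_b) / len(union)


def mean_jaccard(pred_x, pred_z, ref_x, ref_z) -> float:
    """Average of the X-part and Z-part similarities (assumed meaning of 'mean')."""
    return 0.5 * (jaccard(pred_x, ref_x) + jaccard(pred_z, ref_z))


def hamming_similarity(implied: Sequence[int], observed: Sequence[int]) -> float:
    """Fraction of detector bits that agree: 1 - HammingDistance / length."""
    if len(implied) != len(observed):
        raise ValueError("bit strings must have equal length")
    if not implied:
        return 1.0
    agree = sum(int(i == o) for i, o in zip(implied, observed))
    return agree / len(implied)
```

Both functions return values in [0, 1], so they can be mixed into a weighted total alongside the binary components.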

GRPO uses a shared batch cache so all five components score the same (prompt, completion) pair; details in qubit_medic/server/rewards.py and qubit_medic/wandb_utils.py. Note: trainer-side weights in qubit_medic/config.py currently differ: 0.35 logical_correction, 0.25 hamming_overlap, 0.20 syndrome_consistency, 0.10 format_compliance, 0.10 pymatching_beat. The manifest (openenv.yaml) is the canonical environment-side weighting.
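
To make the difference between the two weightings concrete, here is a small sketch that applies both to the same component scores. The dictionary names ENV_WEIGHTS and TRAINER_WEIGHTS and the helper total_reward are hypothetical (the repo's trainer-side name is REWARD_WEIGHTS in qubit_medic/config.py); the numbers are the ones quoted in this README.

```python
import math

# Environment-side weights, as listed for openenv.yaml in this README.
ENV_WEIGHTS = {
    "logical_correction": 0.40,
    "syndrome_consistency": 0.20,
    "hamming_overlap": 0.20,
    "format_compliance": 0.10,
    "pymatching_beat": 0.10,
}

# Trainer-side weights, as listed for qubit_medic/config.py in this README.
TRAINER_WEIGHTS = {
    "logical_correction": 0.35,
    "hamming_overlap": 0.25,
    "syndrome_consistency": 0.20,
    "format_compliance": 0.10,
    "pymatching_beat": 0.10,
}


def total_reward(components: dict, weights: dict) -> float:
    """Weighted sum of per-component scores; every weighted key must be present."""
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical episode: logically correct, well formatted, half overlap, no beat.
example = {
    "logical_correction": 1.0,
    "syndrome_consistency": 1.0,
    "hamming_overlap": 0.5,
    "format_compliance": 1.0,
    "pymatching_beat": 0.0,
}

assert math.isclose(sum(ENV_WEIGHTS.values()), 1.0)
assert math.isclose(sum(TRAINER_WEIGHTS.values()), 1.0)
env_total = total_reward(example, ENV_WEIGHTS)
trainer_total = total_reward(example, TRAINER_WEIGHTS)
```

The same completion scores differently under the two weightings, so it helps to state which one a reported total refers to.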


Results

Held-out eval on 1000 episodes at L2_target (data/eval_grpo.json, source-of-truth):

| Metric | Value |
| --- | --- |
| logical_correction_rate | 0.964 |
| format_compliance_rate | 1.000 |
| mean_hamming_overlap | 0.8405 |
| mean_total_reward | ~0.821 |
| exact_match_pymatching | 0.734 |
| pymatching_beat_rate | 0.000 |

Figure: Mean episode reward over GRPO training. Mean total episode reward across GRPO steps; x = step, y = mean reward (illustrative trajectory).

Figure: PyMatching beat rate over training. Fraction of episodes where the LLM is right and PyMatching is wrong; x = step, y = beat rate.

Honest caveat. On this slice pymatching_beat = 0.0, i.e. zero "beats" of PyMatching on the held-out set. High logical correction (96.4%) and overlap with the PM frame remain meaningful signals, but we are not yet claiming to outperform PyMatching at d=3. See qubit_medic/server/rewards.py for definitions.

Before / after comparison

Placeholder: a before/after comparison (base Qwen2.5-3B vs. SFT-only vs. SFT+GRPO) will land here after the next training run. The current eval bars and SFT curriculum mix are below in the deep-dive.


Try it

# Live HF Space (no install)
curl https://ronitraj-quantumscribe.hf.space/healthz

# Local Docker (OpenEnv server only: physics + reward, no LLM)
docker build -t qubit-medic . && docker run -p 7860:7860 qubit-medic

# Or run the Python server directly
pip install -r requirements.txt && python -m qubit_medic.server.app
# Docs at http://127.0.0.1:7860/docs

# Eval the trained adapter (needs GPU + requirements-train.txt)
pip install -r requirements-train.txt
python -m scripts.eval --adapter ronitraj/quantumscribe --episodes 50 --level L2_target

How it works (deep dive)

The problem (in one story)

Qubits are noisy. You do not observe errors directly; you get syndromes from stabilizer measurements. A decoder turns syndromes into a Pauli correction. PyMatching (sparse blossom, arXiv:2303.15933) is a strong classical baseline. We train an LLM to output a parseable correction; the environment checks it with Stim and five reward functions.

The environment (architecture)

A FastAPI app exposes an OpenEnv-style flow (see qubit_medic/server/app.py and qubit_medic/server/openenv_adapter.py):

  • reset(seed): sample a syndrome (curriculum), return a prompt.
  • step(text): parse, score rewards, return reward + per-component info.

Episodes are single-step: one completion per episode. The trainer and W&B see each reward component separately.

+----------+  reset / step  +---------------------------+
| TRL/     | ------------>  | Qubit-Medic (Stim+PM)     |
| Unsloth  |  observation  | parse, 5 rewards, return   |
+----------+ <------------  +---------------------------+
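
A minimal client sketch for the reset/step flow above, using only the Python standard library. The JSON payload shapes ({"seed": ...} for /reset and {"action": {"text": ...}} for /step) are assumptions, not taken from the server code; check /schema or /docs on a running server for the actual contract. build_request, call, and run_episode are hypothetical helper names.

```python
import json
import urllib.request
from typing import Callable


def build_request(base_url: str, path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(base_url: str, policy: Callable[[dict], str], seed: int = 0) -> dict:
    """One single-step episode: reset, let the policy answer, then step once."""
    # Assumed payload shape for /reset.
    observation = call(build_request(base_url, "/reset", {"seed": seed}))
    completion = policy(observation)
    # Assumed payload shape for /step: the action carries a single text field.
    return call(build_request(base_url, "/step", {"action": {"text": completion}}))
```

With a server running locally, run_episode("http://127.0.0.1:7860", my_policy) would perform one episode, where my_policy is any function that maps the reset response to a completion string.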

Elevator pitch (technical)

DeepMind's AlphaQubit showed a transformer can beat a strong PyMatching baseline. We reimplement the idea with a commodity stack:

  • 3B instruction-tuned Qwen2.5 in 4-bit (Unsloth) + LoRA
  • SFT then GRPO (reward from a real Stim environment, not offline labels)
  • OpenEnv-compatible server: /reset, /step, /state, and /schema
  • Five logged reward components (aggregate is weighted)

| Dimension | This project (typical) | AlphaQubit (reference) |
| --- | --- | --- |
| Decoder | 3B LM + LoRA (off-the-shelf) | Custom architecture, lab-scale data mix |
| Training signal | SFT + GRPO on env reward | Proprietary + SI1000 / Sycamore |
| Baseline | PyMatching (sparse blossom) | Same class of MWPM decoder |
| Open source | This repo + Hub weights | Research partial |

Methodology checklist

| Concern | Status | Pointer |
| --- | --- | --- |
| Realistic noise (SI1000) | Used | Gidney & Fowler arXiv:2108.10457 |
| Real code family | Stim surface_code:rotated_memory_z | Stim |
| Strong classical baseline | PyMatching v2 | arXiv:2303.15933 |
| Policy optimisation | GRPO | arXiv:2402.03300 |
| OOD / Willow (optional) | scripts/willow_validation.py + data/willow_d3.dem | Zenodo |

Latest measured eval (JSON)

These numbers come from a held-out run written to data/eval_grpo.json (1000 episodes, L2 target, adapter path recorded in the file). They are the source of truth for submission claims; do not substitute synthetic plots for these metrics.

pymatching_beat is 1 only when PyMatching is wrong on the observable and the LLM is right. On this eval it is 0.0 (no "beats" on that slice), so do not claim to outperform PM here without a separate run where that rate is non-zero. High logical correction and overlap with the PM frame remain meaningful; interpret them against the reward definitions.
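
The definition above is a simple indicator, sketched here for clarity. The function names and the (llm_correct, pymatching_correct) pair representation are illustrative assumptions, not the repo's API.

```python
from typing import Iterable, Tuple


def pymatching_beat(llm_correct: bool, pymatching_correct: bool) -> float:
    """1.0 only when the LLM preserves the observable and PyMatching does not."""
    return 1.0 if (llm_correct and not pymatching_correct) else 0.0


def beat_rate(outcomes: Iterable[Tuple[bool, bool]]) -> float:
    """Mean of the indicator over (llm_correct, pymatching_correct) pairs."""
    pairs = list(outcomes)
    if not pairs:
        return 0.0
    return sum(pymatching_beat(llm, pm) for llm, pm in pairs) / len(pairs)


# Four hypothetical episodes: only the second one counts as a beat.
demo_rate = beat_rate([(True, True), (True, False), (False, False), (False, True)])
```

Matching PyMatching on an episode it gets right contributes nothing, so a policy that agrees with PyMatching everywhere has a beat rate of zero regardless of its logical correction rate.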

Reproduce:

python -m scripts.eval --adapter /path/to/grpo/adapter --episodes 1000 --out data/eval_grpo.json

(Adjust --adapter to your checkpoint, e.g. a downloaded ronitraj/quantumscribe adapter.)

Data in data/

| File | Purpose |
| --- | --- |
| data/eval_grpo.json | Primary eval: single JSON summary (episodes, logical_correction_rate, pymatching_beat_rate, overlaps, level, etc.) from scripts.eval. |
| data/grpo_validation.jsonl | GRPO validation prompts / episodes (one JSON object per line; curriculum, syndrome, seeds). |
| data/sft_dataset_analysis.json | SFT dataset report: stats (completion lengths, level mix, train/val overlap, eval_windows). |
| data/sft_validation.jsonl | SFT held-out set used during training. |
| data/sft_dataset_sample.jsonl | Small sample of SFT training rows (prompt + metadata). |
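
A short sketch of how the primary eval summary might be read before quoting numbers from it. It assumes the keys episodes, logical_correction_rate, and pymatching_beat_rate appear at the top level of data/eval_grpo.json under exactly those names; confirm against the file itself. load_summary and headline are hypothetical helpers.

```python
import json
from pathlib import Path


def load_summary(path: str) -> dict:
    """Read a single-JSON eval summary such as data/eval_grpo.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def headline(summary: dict) -> str:
    """One-line summary that refuses to imply a PyMatching win when there is none."""
    episodes = int(summary["episodes"])
    lcr = float(summary["logical_correction_rate"])
    beat = float(summary["pymatching_beat_rate"])
    if beat > 0.0:
        claim = "beats PyMatching on some episodes"
    else:
        claim = "no PyMatching beats; do not claim outperformance"
    return f"{episodes} episodes, LCR={lcr:.3f}, beat_rate={beat:.3f} ({claim})"
```

For example, headline(load_summary("data/eval_grpo.json")) produces a line suitable for a results section once the file exists locally.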

Generated on demand (not always committed) after make baselines / SFT / Willow runs, per .gitignore:

  • data/baseline_results.json: random / zeros / PyMatching baselines
  • data/sft_dataset.jsonl: full SFT train (from make sft-data or generate_sft_data)
  • data/willow_validation.json, data/willow_d3.dem: cross-distribution checks

Figures in figures/

Provenance and regeneration: figures/FIGURES.md. The trajectory plots above are illustrative (from make plots / baseline-anchored synthetic mode), not a raw W&B export. Replace them with scripts/plot_results.py and real logs when you have them.

Reward & metrics from data (reproducible): not time-series; single-run summaries from data/eval_grpo.json and data/sft_dataset_analysis.json. Regenerate: python -m scripts.plot_data_figures

Figures: Eval metrics (held-out) and SFT curriculum mix (train split).

Note: For per-reward time series and KL during GRPO, use the main GRPO run, runs/4p7eurnc (e.g. rl/reward/total_mean, rl/reward/logical_correction_mean, alarms/kl_alarm_value).

Baselines (no LLM)

make baselines writes data/baseline_results.json (random, all-zeros, PyMatching). make plots rebuilds the headline figures from that JSON (see figures/FIGURES.md).

make baselines
make plots

Reward design (config-driven)

Trainer-side weights live in qubit_medic/config.py as REWARD_WEIGHTS (sum 1.0):

total = 0.35 * logical_correction
      + 0.25 * hamming_overlap
      + 0.20 * syndrome_consistency
      + 0.10 * format_compliance
      + 0.10 * pymatching_beat

Details: qubit_medic/server/rewards.py. GRPO uses a shared batch cache so all five components score the same (prompt, completion) (see qubit_medic/wandb_utils.py and trainer).

Weights & Biases

Defaults: WANDB_ENTITY=ronitraj, WANDB_PROJECT=QuantumScribe-GRPO. Trainers use qubit_medic/wandb_utils.py. Disable: WANDB_DISABLED=1 or QUBIT_MEDIC_WANDB=0.

Reference runs (2026-04-26, Colab / server)

| Stage | Run name | Direct link |
| --- | --- | --- |
| Project | n/a | wandb.ai/ronitraj/QuantumScribe-GRPO |
| SFT | sft-20260426-045056 | runs/yli513jl |
| GRPO | grpo-20260426-045324 | runs/4p7eurnc |

The GRPO run includes training curves, in-loop eval/*, alarms/kl_alarm_value, best checkpoint metadata (best/step ≈ 1300), and logged artifacts.

pip install -r requirements-train.txt
wandb login
GROUP=my-exp make train-sft
GROUP=my-exp make train-grpo
GROUP=my-exp make eval

Reproducibility (qubit_medic/config.py)

| Item | Value |
| --- | --- |
| Stim / PyMatching | Pinned in requirements*.txt |
| SFT default base | Qwen/Qwen2.5-3B-Instruct via Unsloth |
| GRPO default base | unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit |
| LoRA | r=16, alpha=32, dropout=0.1, q/k/v/o |
| GRPO | 1500 steps, short completions (max_completion 50), KL coeff 0.02, temperature=1.2 rollouts, etc. |
| Seeds | 42, 1337, 2024 |

Import from qubit_medic.config; do not duplicate magic numbers in scripts.

Train and eval (local)

python3 -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt
make validate

make sft-data
make baselines
make tests

python -m scripts.train_sft --output checkpoints/sft_warmup
python -m scripts.train_grpo \
  --sft-checkpoint checkpoints/sft_warmup/checkpoint-50 \
  --output checkpoints/grpo

python -m scripts.eval --adapter checkpoints/grpo --episodes 1000 --out data/eval_grpo.json

End-to-end: notebooks/meta_final.ipynb. Makefile shortcuts: make train-sft, make train-grpo, make eval (see Makefile).

Local dev: run everything (no Docker)

1. Base environment (CPU OK) for OpenEnv / Stim / tests:

cd /path/to/errorCorrection
python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
make validate
make tests

2. OpenEnv HTTP server (no LLM; physics + reward only), good for API checks with curl or a browser:

# default: 0.0.0.0:7860 (or set QUBIT_MEDIC_PORT)
python -m qubit_medic.server.app
# dev reload:
uvicorn qubit_medic.server.app:app --reload --host 0.0.0.0 --port 7860

3. Gradio grid demo (Stim + PyMatching only). It does not load the trained LLM in code today; it visualises the classical decoder.

pip install "gradio>=4"
PORT=7860 python app_gradio.py
# open http://127.0.0.1:7860 (if the OpenEnv server is already on 7860, use e.g. PORT=7861)

4. Run with the real model (Unsloth + LoRA). This is the supported path; it needs a GPU and the training deps. The eval harness loads the adapter and uses LocalDecoderClient (in-process env, no separate server).

pip install -r requirements-train.txt
# optional: export HF_TOKEN=...  for gated/private Hub repos
python -m scripts.eval \
  --adapter ronitraj/quantumscribe \
  --episodes 50 \
  --level L2_target \
  --max-new-tokens 160
  • Use a local LoRA folder the same way: --adapter /path/to/checkpoints/grpo/final (the directory that contains adapter_model.safetensors).
  • The script calls FastLanguageModel.from_pretrained(model_name=adapter, …); for Hub PEFT repos, Unsloth/transformers should resolve the base from adapter_config.json. If loading fails, run hf download ronitraj/quantumscribe and point --adapter at the local folder.
  • Shorter run first (e.g. --episodes 5) to confirm VRAM, then increase.

5. What is not wired: the Docker Space image does not install torch/Unsloth, and although the Gradio app's markdown mentions QUBIT_MEDIC_ADAPTER, there is no LLM inference in app_gradio.py yet. Use scripts.eval for the trained policy.

Publish the adapter to the Hub

Released weights: ronitraj/quantumscribe. Load as PEFT on the same base used for training:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, "ronitraj/quantumscribe")
tokenizer = AutoTokenizer.from_pretrained("ronitraj/quantumscribe")

Re-upload (requires Hub authentication): hf upload ronitraj/quantumscribe /path/to/final .

Space deployment

Cross-distribution (optional)

python -m scripts.willow_validation (see scripts/willow_validation.py).

Repository layout

qubit_medic/
  config.py, models.py, prompts.py, wandb_utils.py
  client/
  server/   (app, environment, rewards, curriculum, physics, openenv_adapter)
scripts/
  validate_env.py, generate_sft_data.py, train_sft.py, train_grpo.py, eval.py
  baseline_policies.py, plot_results.py, plot_data_figures.py, animate_grid.py, willow_validation.py
  format_test.py, diversity_preflight.py, deploy_to_space.py, sync_kaggle_bundle.py
tests/     data/     figures/     checkpoints/     notebooks/meta_final.ipynb
app_gradio.py   Dockerfile   openenv.yaml   Makefile

Evaluation Protocol

End-to-end evaluation protocol used for the figures in results/comparison_table.md. To reproduce, see "Reproducibility commands" below.

Episode budget

| Cohort | Cells | Episodes / cell | Total |
| --- | --- | --- | --- |
| Trained model (SFT-only + SFT+RL × 4 levels) | 8 | 500 | 4,000 |
| Baselines (zeros / random / pymatching × 4 levels) | 12 | 100 | 1,200 |
| Total | 20 | n/a | 5,200 evaluation episodes |

(The headline 3,200 figure is for a single-adapter run: 2,000 trained + 1,200 baseline.)

Random seeds

Eval seed range: 5000 – 7199 (held out from training seeds 1–4999 and SFT-validation seeds 4242 + offset). Each (policy, level) cell uses contiguous seeds from this range, so results are bitwise reproducible.

Confidence intervals

At 500 episodes per cell, a 95% Wilson CI on a 0.85-LCR estimate is approximately ±3.1 percentage points. Baseline cells at 100 episodes carry a wider CI of roughly ±5 points. They are deliberately cheaper because the metrics there (≥90% LCR for PyMatching, ~95%+ on L1/L2) are well separated from the trained-model regime where the improvement is tested.
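
The interval widths can be checked with the standard Wilson score formula. The sketch below is a plain-Python implementation of that formula; wilson_interval is a hypothetical helper, not a function from this repo.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return center - half_width, center + half_width


# 0.85 LCR measured on 500 episodes (425 successes).
low, high = wilson_interval(425, 500)
half_width_500 = (high - low) / 2.0
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly near rates of 0 or 1, which matters for cells where a baseline is close to perfect.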

Hard-syndrome subset definition

A "hard syndrome" is an evaluation episode where the simulated true error pattern contains ≥ 2 qubits with an X or Z error. Easy syndromes (zero or one error) are where every reasonable decoder hits ~95%+ LCR; the hard subset is the cohort where MWPM ambiguity matters and trained-model contributions are most visible. The subset metric is reported as hard_syndrome_lcr in each per-cell JSON.
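
As a sketch of how the subset metric could be computed from per-episode records: the field names num_error_qubits and logical_correct are hypothetical placeholders for whatever the eval script records, and returning None for an empty subset is an assumption.

```python
from typing import Iterable, Mapping, Optional

HARD_THRESHOLD = 2  # at least two error qubits, per the definition above


def is_hard(episode: Mapping) -> bool:
    """True when the simulated error pattern touches at least two qubits."""
    return episode["num_error_qubits"] >= HARD_THRESHOLD


def hard_syndrome_lcr(episodes: Iterable[Mapping]) -> Optional[float]:
    """Logical correction rate restricted to hard episodes; None if there are none."""
    hard = [e for e in episodes if is_hard(e)]
    if not hard:
        return None
    return sum(1 for e in hard if e["logical_correct"]) / len(hard)


# Hypothetical records: two hard episodes, one of them decoded correctly.
demo = [
    {"num_error_qubits": 0, "logical_correct": True},
    {"num_error_qubits": 1, "logical_correct": True},
    {"num_error_qubits": 2, "logical_correct": True},
    {"num_error_qubits": 3, "logical_correct": False},
]
demo_lcr = hard_syndrome_lcr(demo)
```

Because hard episodes are a minority at low physical error rates, the subset's confidence interval is wider than the per-cell one and should be reported alongside its episode count.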

Curriculum levels (noise-model parameters)

Defined in qubit_medic/config.py:CURRICULUM. All levels use the rotated surface code with a Z-memory experiment under the SI1000 noise model (Gidney & Fowler 2021).

| Level | Distance | Rounds | Physical error rate p | Notes |
| --- | --- | --- | --- | --- |
| L1_warmup | 3 | 1 | 0.0005 | trivial; warmup |
| L2_target | 3 | 3 | 0.001 | primary benchmark (AlphaQubit Fig. 2b geometry) |
| L3_stretch | 5 | 5 | 0.001 | distance-5 stretch goal |
| L4_stress | 5 | 5 | 0.005 | 5× higher noise; eval-only stress test where baselines drop and headroom opens |

Deployed environment

Live OpenEnv server: https://ronitraj-quantumscribe.hf.space (health probe at /healthz). The deployed Space currently knows L1/L2/L3 only; L4_stress evaluation runs locally via scripts/eval.py against the in-process DecoderEnvironment.

Reproducibility commands

End-to-end (12 baseline cells + 4 trained-model cells + table generation), run from the repo root:

SPACE_URL=https://ronitraj-quantumscribe.hf.space \
ADAPTER=checkpoints/grpo_v2 \
TRAINED_EPISODES=500 BASELINE_EPISODES=100 \
bash scripts/run_full_eval.sh

Outputs:

  • data/remote_eval/eval_remote_{policy}_{level}.json: 12 baseline cells
  • data/trained_eval/eval_trained_{level}.json: 4 trained-model cells
  • results/comparison_table.md: final pivot table

Individual steps if you only need to refresh part of the matrix:

# Remote baselines on L1/L2/L3 only (Space-known levels)
python -m scripts.eval_remote --url https://ronitraj-quantumscribe.hf.space \
    --episodes 100 --levels L1_warmup L2_target L3_stretch \
    --all-policies --out-dir data/remote_eval/

# L4_stress baselines (local; Space rejects forced_level=L4_stress until redeployed)
for policy in zeros random pymatching; do
    python -m scripts.eval --policy $policy --episodes 100 \
        --level L4_stress \
        --out data/remote_eval/eval_remote_${policy}_L4_stress.json
done

# Trained-model evaluation (local; needs GPU)
for level in L1_warmup L2_target L3_stretch L4_stress; do
    python -m scripts.eval --adapter checkpoints/grpo_v2 \
        --episodes 500 --level $level \
        --out data/trained_eval/eval_trained_${level}.json
done

# Build the comparison table from whatever cells are present
python -m scripts.comparison_table_full \
    --remote-eval-dir data/remote_eval/ \
    --trained-eval-dir data/trained_eval/ \
    --output results/comparison_table.md

The runner is idempotent: SKIP_BASELINES=1 reuses existing baseline JSONs; SKIP_TRAINED=1 reuses existing trained-model JSONs.


Citations

@article{gidney_stim_2021,
  title   = {Stim: a fast stabilizer circuit simulator},
  author  = {Gidney, Craig},
  journal = {Quantum},
  volume  = {5},
  pages   = {497},
  year    = {2021},
  doi     = {10.22331/q-2021-07-06-497},
  note    = {arXiv:2103.02202}
}
@article{bausch_alphaqubit_2024,
  title   = {Learning high-accuracy error decoding for quantum processors},
  author  = {Bausch, Johannes and others},
  journal = {Nature},
  volume  = {635},
  pages   = {834},
  year    = {2024},
  doi     = {10.1038/s41586-024-08148-8}
}
@article{acharya_willow_2024,
  title   = {Quantum error correction below the surface code threshold},
  author  = {Acharya, R. and others (Google Quantum AI)},
  journal = {arXiv:2408.13687},
  year    = {2024}
}
@article{gidney_si1000_2021,
  title   = {A fault-tolerant honeycomb memory},
  author  = {Gidney, Craig and Fowler, Austin G.},
  journal = {arXiv:2108.10457},
  year    = {2021}
}
@article{higgott_pymatching_2023,
  title   = {Sparse Blossom: correcting a million errors per core second
             with minimum-weight matching},
  author  = {Higgott, Oscar and Gidney, Craig},
  journal = {arXiv:2303.15933},
  year    = {2023}
}
@article{shao_grpo_2024,
  title   = {DeepSeekMath: pushing the limits of mathematical reasoning
             in open language models},
  author  = {Shao, Zhihong and others},
  journal = {arXiv:2402.03300},
  year    = {2024}
}

Acknowledgments

DeepMind (AlphaQubit), Google Quantum AI (Stim, Willow data), Gidney (SI1000), Higgott (PyMatching), Hugging Face, Unsloth, OpenEnv.


License

MIT. See LICENSE.