QuantumScribe / README.md
ronitraj's picture
deploy via scripts/deploy_to_space.py
d6715d9 verified
|
raw
history blame
18 kB
metadata
title: Qubit-Medic
emoji: 🩺
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: true
tags:
  - openenv
  - reinforcement-learning
  - quantum-error-correction
  - stim
  - pymatching
  - grpo
  - trl
  - llm
license: mit
short_description: OpenEnv RL env that teaches an LLM to decode quantum errors.

Qubit-Medic

An LLM trained to decode quantum surface-code syndromes. We follow the AlphaQubit-style recipe (Nature 2024): a language model as decoder with verifiable rewards—implemented on Stim + PyMatching, an OpenEnv-style HTTP contract, SFT warm-up + GRPO (TRL/Unsloth), and multi-component rewards that are hard to game.

Qubit-Medic decoding a syndrome on the rotated surface code

Hugging Face

Weights & Biases (this experiment)


Quick links

Resource URL
Hugging Face Space (live demo + API) ronitraj/QuantumScribe — health: /healthz
Trained LoRA on the Hub ronitraj/quantumscribe (PEFT adapter + tokenizer)
W&B project ronitraj/QuantumScribe-GRPO
W&B — SFT run runs/yli513jl
W&B — GRPO run runs/4p7eurnc
Colab training notebooks/colab_train.ipynb
Local Gradio python app_gradio.py
OpenEnv manifest openenv.yaml

What this repo does (elevator pitch)

Quantum computers need a decoder: classical software that maps syndromes (detector results) to corrections. DeepMind’s AlphaQubit showed a transformer can beat a strong PyMatching baseline. We reimplement the idea with a commodity stack:

  • 3B instruction-tuned Qwen2.5 in 4-bit (Unsloth) + LoRA
  • SFT then GRPO (reward from a real Stim environment, not offline labels)
  • OpenEnv-compatible server: /reset / /step / state & schema
  • Five logged reward components (aggregate is weighted)
Dimension This project (typical) AlphaQubit (reference)
Decoder 3B LM + LoRA (off-the-shelf) Custom architecture, lab-scale data mix
Training signal SFT + GRPO on env reward Proprietary + SI1000 / Sycamore
Baseline PyMatching (sparse blossom) Same class of MWM decoder
Open source This repo + Hub weights Research partial

Latest measured eval (JSON)

These numbers come from a held-out run written to data/eval_grpo.json (1000 episodes, L2 target, adapter path recorded in the file). They are the source of truth for submission claims; do not substitute synthetic plots for these metrics.

Metric Value
logical_correction_rate 0.964
pymatching_beat_rate 0.0
format_compliance_rate 1.0
mean_hamming_overlap 0.8405
mean_total_reward ~0.821
exact_match_pymatching 0.734

pymatching_beat is 1 only when PyMatching is wrong on the observable and the LLM is right; on this eval it is 0.0—i.e. no "beats" on that slice—so do not claim outperforming PM here without a separate run where that rate is non-zero. High logical correction and overlap with the PM frame remain meaningful; interpret with reward definitions.

Reproduce:

python -m scripts.eval --adapter /path/to/grpo/adapter --episodes 1000 --out data/eval_grpo.json

(Adjust --adapter to your checkpoint, e.g. a downloaded ronitraj/quantumscribe adapter.)


Data in data/

File Purpose
data/eval_grpo.json Primary eval — single JSON summary (episodes, logical_correction_rate, pymatching_beat_rate, overlaps, level, etc.) from scripts.eval.
data/grpo_validation.jsonl GRPO validation prompts / episodes (one JSON object per line; curriculum, syndrome, seeds).
data/sft_dataset_analysis.json SFT dataset report — stats (completion lengths, level mix, train/val overlap, eval_windows).
data/sft_validation.jsonl SFT held-out set used during training.
data/sft_dataset_sample.jsonl Small sample of SFT training rows (prompt + metadata).

Generated on demand (not always committed) after make baselines / SFT / Willow runs, per .gitignore:

  • data/baseline_results.json — random / zeros / PyMatching baselines
  • data/sft_dataset.jsonl — full SFT train (from make sft-data or generate_sft_data)
  • data/willow_validation.json, data/willow_d3.dem — cross-distribution checks

Figures in figures/

Provenance and regeneration: figures/FIGURES.md. The three trajectory plots below are illustrative (from make plots / baseline-anchored synthetic mode), not a raw W&B export—replace with scripts/plot_results.py and real logs when you have them.

Training trajectories (illustrative)

Mean episode reward Logical correction rate PyMatching beat rate
Total reward Logical correction PyMatching beat

Grid animation (Stim + layout demo)

Surface-code grid animation

Reward & metrics from data (reproducible) — not time-series; single-run summaries from data/eval_grpo.json and data/sft_dataset_analysis.json. Regenerate: python -m scripts.plot_data_figures

Eval metrics (held-out) SFT curriculum mix (train split)
Eval metrics bars SFT curriculum mix

Note: For per-reward time series and KL during GRPO, use the main GRPO run: runs/4p7eurnc — e.g. rl/reward/total_mean, rl/reward/logical_correction_mean, alarms/kl_alarm_value.


The problem (in one story)

Qubits are noisy. You do not observe errors directly; you get syndromes from stabilizer measurements. A decoder turns syndromes into a Pauli correction. PyMatching is a strong classical baseline. We train an LLM to output a parseable correction; the environment checks it with Stim and five reward functions.


The environment

A FastAPI app exposes an OpenEnv-style flow (see qubit_medic/server/app.py and qubit_medic/server/openenv_adapter.py):

  • reset(seed) — sample a syndrome (curriculum), return a prompt.
  • step(text) — parse, score rewards, return reward + per-component info.

Episodes are single-step: one completion per episode. The trainer and W&B see each reward component separately.

+----------+  reset / step  +---------------------------+
| TRL/     | ------------>  | Qubit-Medic (Stim+PM)     |
| Unsloth  |  observation  | parse, 5 rewards, return   |
+----------+ <------------  +---------------------------+

Methodology checklist

Concern Status Pointer
Realistic noise (SI1000) Used Gidney & Fowler arXiv:2108.10457
Real code family Stim surface_code:rotated_memory_z Stim
Strong classical baseline PyMatching v2 arXiv:2303.15933
Policy optimisation GRPO arXiv:2402.03300
OOD / Willow (optional) scripts/willow_validation.py + data/willow_d3.dem Zenodo

Baselines (no LLM)

make baselines writes data/baseline_results.json (random, all-zeros, PyMatching). make plots rebuilds the headline figures from that JSON (see figures/FIGURES.md).

make baselines
make plots

Reward design (config-driven)

Weights are qubit_medic/config.pyREWARD_WEIGHTS (sum 1.0):

total = 0.35 * logical_correction
      + 0.25 * hamming_overlap
      + 0.20 * syndrome_consistency
      + 0.10 * format_compliance
      + 0.10 * pymatching_beat
Component Role
logical_correction 1 if the implied correction matches logical observable (Stim).
hamming_overlap Dense credit vs the PyMatching reference frame.
syndrome_consistency Implied final detectors vs observed syndrome.
format_compliance Parse success / partial / fail.
pymatching_beat 1 only if PM wrong and LLM right (rare; headline for beating PM).

Details: qubit_medic/server/rewards.py. GRPO uses a shared batch cache so all five components score the same (prompt, completion) (see W&B section in previous docs—qubit_medic/wandb_utils.py and trainer).


Weights & Biases

Defaults: WANDB_ENTITY=ronitraj, WANDB_PROJECT=QuantumScribe-GRPO. Trainers use qubit_medic/wandb_utils.py. Disable: WANDB_DISABLED=1 or QUBIT_MEDIC_WANDB=0.

Reference runs (2026-04-26, Colab / server)

Stage Run name Direct link
Project wandb.ai/ronitraj/QuantumScribe-GRPO
SFT sft-20260426-045056 runs/yli513jl
GRPO grpo-20260426-045324 runs/4p7eurnc

The GRPO run includes training curves, in-loop eval/*, alarms/kl_alarm_value, best checkpoint metadata (best/step ≈ 1300), and logged artifacts.

pip install -r requirements-train.txt
wandb login
GROUP=my-exp make train-sft
GROUP=my-exp make train-grpo
GROUP=my-exp make eval

Reproducibility (qubit_medic/config.py)

Item Value
Stim / PyMatching Pinned in requirements*.txt
SFT default base Qwen/Qwen2.5-3B-Instruct via Unsloth
GRPO default base unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
LoRA r=16, alpha=32, dropout=0.1, q/k/v/o
GRPO 1500 steps, short completions (max_completion 50), KL coeff 0.02, temperature=1.2 rollouts, etc.
Seeds 42, 1337, 2024

**Import from qubit_medic.config**—do not duplicate magic numbers in scripts.


Train and eval (local)

python3 -m venv .venv && . .venv/bin/activate
pip install -r requirements.txt
make validate

make sft-data
make baselines
make tests

python -m scripts.train_sft --output checkpoints/sft_warmup
python -m scripts.train_grpo \
  --sft-checkpoint checkpoints/sft_warmup/checkpoint-50 \
  --output checkpoints/grpo

python -m scripts.eval --adapter checkpoints/grpo --episodes 1000 --out data/eval_grpo.json

End-to-end: notebooks/colab_train.ipynb. Makefile shortcuts: make train-sft, make train-grpo, make eval (see Makefile).

Local dev: run everything (no Docker)

1. Base environment (CPU OK) — OpenEnv / Stim / tests:

cd /path/to/errorCorrection
python3 -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
make validate
make tests

2. OpenEnv HTTP server (no LLM — physics + reward only) — good for API checks and curl / a browser:

# default: 0.0.0.0:7860 (or set QUBIT_MEDIC_PORT)
python -m qubit_medic.server.app
# dev reload:
uvicorn qubit_medic.server.app:app --reload --host 0.0.0.0 --port 7860

3. Gradio grid demo (Stim + PyMatching only)does not load the trained LLM in code today; it visualises the classical decoder.

pip install "gradio>=4"
PORT=7860 python app_gradio.py
# open http://127.0.0.1:7860 — if the OpenEnv server is already on 7860, use e.g. PORT=7861

4. Run with the real model (Unsloth + LoRA) — this is the supported path — needs a GPU and training deps. The eval harness loads the adapter and uses LocalDecoderClient (in-process env, no separate server).

pip install -r requirements-train.txt
# optional: export HF_TOKEN=...  for gated/private Hub repos
python -m scripts.eval \
  --adapter ronitraj/quantumscribe \
  --episodes 50 \
  --level L2_target \
  --max-new-tokens 160
  • Use a local LoRA folder the same way: --adapter /path/to/checkpoints/grpo/final (the directory that contains adapter_model.safetensors).
  • The script calls FastLanguageModel.from_pretrained(model_name=adapter, …); for Hub PEFT repos, Unsloth/transformers should resolve the base from adapter_config.json. If loading fails, run hf download ronitraj/quantumscribe and point --adapter at the local folder.
  • Shorter run first (e.g. --episodes 5) to confirm VRAM, then increase.

5. What is not wired — the Docker Space image does not install torch/Unsloth; the Gradio app’s markdown mentions QUBIT_MEDIC_ADAPTER but there is no LLM inference in app_gradio.py yet—use scripts.eval for the trained policy.


Publish the adapter to the Hub

Released weights: ronitraj/quantumscribe. Load as PEFT on the same base used for training:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto", trust_remote_code=True)
model = PeftModel.from_pretrained(model, "ronitraj/quantumscribe")
tokenizer = AutoTokenizer.from_pretrained("ronitraj/quantumscribe")

Re-upload: hf upload ronitraj/quantumscribe /path/to/final . with Hub authentication.


Space deployment


Cross-distribution (optional)

python -m scripts.willow_validation — see scripts/willow_validation.py.


Repository layout

qubit_medic/
  config.py, models.py, prompts.py, wandb_utils.py
  client/
  server/   (app, environment, rewards, curriculum, physics, openenv_adapter)
scripts/
  validate_env.py, generate_sft_data.py, train_sft.py, train_grpo.py, eval.py
  baseline_policies.py, plot_results.py, plot_data_figures.py, animate_grid.py, willow_validation.py
  format_test.py, diversity_preflight.py, deploy_to_space.py, sync_kaggle_bundle.py
tests/     data/     figures/     checkpoints/     notebooks/colab_train.ipynb
app_gradio.py   Dockerfile   openenv.yaml   Makefile

Citations

@article{bausch_alphaqubit_2024,
  title   = {Learning high-accuracy error decoding for quantum processors},
  author  = {Bausch, Johannes and others},
  journal = {Nature},
  volume  = {635},
  pages   = {834},
  year    = {2024},
  doi     = {10.1038/s41586-024-08148-8}
}
@article{acharya_willow_2024,
  title   = {Quantum error correction below the surface code threshold},
  author  = {Acharya, R. and others (Google Quantum AI)},
  journal = {arXiv:2408.13687},
  year    = {2024}
}
@article{gidney_si1000_2021,
  title   = {A fault-tolerant honeycomb memory},
  author  = {Gidney, Craig and Fowler, Austin G.},
  journal = {arXiv:2108.10457},
  year    = {2021}
}
@article{higgott_pymatching_2023,
  title   = {Sparse Blossom: correcting a million errors per core second
             with minimum-weight matching},
  author  = {Higgott, Oscar and Gidney, Craig},
  journal = {arXiv:2303.15933},
  year    = {2023}
}
@article{shao_grpo_2024,
  title   = {DeepSeekMath: pushing the limits of mathematical reasoning
             in open language models},
  author  = {Shao, Zhihong and others},
  journal = {arXiv:2402.03300},
  year    = {2024}
}

Acknowledgments

DeepMind (AlphaQubit), Google Quantum AI (Stim, Willow data), Gidney (SI1000), Higgott (PyMatching), Hugging Face, Unsloth, OpenEnv.


License

MIT — LICENSE.