# Evaluation Protocol

This document describes the held-out evaluation protocol used to populate [`results/comparison_table.md`](../results/comparison_table.md). It was split out of the main README for readability; see the project [README](../README.md) for the high-level overview and headline numbers.

## Episode budget, seeds, and reproducibility commands

This section defines the end-to-end protocol behind the figures in [results/comparison_table.md](../results/comparison_table.md); to reproduce them, see "Reproducibility commands" below.

### Episode budget

| Cohort | Cells | Episodes / cell | Total |
|---|---|---|---|
| Trained model (SFT-only + SFT+RL × 4 levels) | 8 | 500 | **4,000** |
| Baselines (zeros / random / pymatching × 4 levels) | 12 | 100 | **1,200** |
| **Total** | 20 | — | **5,200 evaluation episodes** |

(The headline 3,200 figure is for a single-adapter run: 2,000 trained episodes plus the 1,200 baseline episodes.)
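
Both totals follow directly from the table; a quick sanity check in plain Python (no repo code involved):

```python
# Full matrix: 2 trained policies x 4 levels at 500 eps, 3 baselines x 4 levels at 100 eps.
trained_cells = 2 * 4
baseline_cells = 3 * 4
assert trained_cells * 500 + baseline_cells * 100 == 5_200

# Headline figure: one adapter across 4 levels, plus all baselines.
assert 4 * 500 + baseline_cells * 100 == 3_200
```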

### Random seeds

Eval seed range: **5000 – 7199** (held out from training seeds 1–4999 and SFT-validation seeds 4242 + offset). Each (policy, level) cell uses a contiguous block of seeds from this range, so results are bitwise reproducible.
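
To make the contiguous-block invariant concrete, here is a hypothetical sketch; it is **not** the repo's actual allocator (which presumably lives in `scripts/eval.py`), and the assumption that each level's block is shared across policies, so every decoder replays identical episodes, is ours:

```python
# Hypothetical allocator: each level owns a fixed 500-seed block (the widest
# cell size), and a cell takes the first `episodes` seeds of its level's
# block. Sharing the block across policies gives a paired comparison.
EVAL_SEED_LO, EVAL_SEED_HI = 5000, 7199   # training uses 1-4999
LEVELS = ["L1_warmup", "L2_target", "L3_stretch", "L4_stress"]
BLOCK = 500

def cell_seeds(level: str, episodes: int) -> range:
    start = EVAL_SEED_LO + LEVELS.index(level) * BLOCK
    assert start + episodes - 1 <= EVAL_SEED_HI, "ran past the held-out range"
    return range(start, start + episodes)

cell_seeds("L2_target", 500)   # trained cell  -> seeds 5500..5999
cell_seeds("L2_target", 100)   # baseline cell -> seeds 5500..5599
```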

### Confidence intervals

At 500 episodes per cell, a 95% Wilson CI on a 0.85-LCR estimate is approximately **±3%**. Baseline cells at 100 episodes carry a wider ~±6% CI; they are deliberately cheaper because the metrics there (≥90% LCR for PyMatching, ~95%+ on L1/L2) are well-separated from the trained-model regime where the improvement is tested.
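
Both half-widths come straight from the Wilson score interval; a self-contained check:

```python
import math

def wilson_halfwidth(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    return (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))

print(f"{wilson_halfwidth(0.85, 500):.3f}")  # ~0.031 -> about +/-3% per trained cell
print(f"{wilson_halfwidth(0.90, 100):.3f}")  # ~0.060 -> about +/-6% per baseline cell
```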

### Hard-syndrome subset definition

A "hard syndrome" is an evaluation episode whose **simulated true error pattern contains two or more qubits with X or Z errors**. Easy syndromes (zero or one error) are where every reasonable decoder hits ~95%+ LCR; the hard subset is the cohort where MWPM ambiguity matters and trained-model contributions are most visible. The subset metric is reported as `hard_syndrome_lcr` in each per-cell JSON.

### Curriculum levels (noise-model parameters)

Defined in [`qubit_medic/config.py:CURRICULUM`](../qubit_medic/config.py). All levels use the rotated surface code in a Z-memory experiment under the SI1000 noise model (Gidney & Fowler 2021).

| Level | Distance | Rounds | Physical error rate `p` | Notes |
|---|---|---|---|---|
| `L1_warmup` | 3 | 1 | 0.0005 | trivial; warmup |
| `L2_target` | 3 | 3 | 0.001 | primary benchmark (AlphaQubit Fig. 2b geometry) |
| `L3_stretch` | 5 | 5 | 0.001 | distance-5 stretch goal |
| `L4_stress` | 5 | 5 | 0.005 | 5× higher noise; eval-only stress test where baselines drop and headroom opens |
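
For orientation, the geometry of a level can be reproduced with stim's built-in circuit generator. This is only a sketch: stim's generator applies uniform circuit-level depolarizing noise, **not** the SI1000 channel weighting, which the repo presumably implements itself in `qubit_medic`:

```python
import stim

# L2_target geometry: distance 3, 3 rounds, p = 0.001 (uniform-noise
# approximation, not SI1000's per-channel scaling).
circuit = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=3,
    rounds=3,
    after_clifford_depolarization=0.001,
    before_measure_flip_probability=0.001,
    after_reset_flip_probability=0.001,
)
print(circuit.num_detectors, circuit.num_observables)
```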

### Deployed environment

Live OpenEnv server: **[https://ronitraj-quantumscribe.hf.space](https://ronitraj-quantumscribe.hf.space)**, with a health probe at `/healthz`. The deployed Space currently supports only L1/L2/L3; `L4_stress` evaluation runs locally via `scripts/eval.py` against the in-process `DecoderEnvironment`.
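
Before launching a remote run it can be worth hitting the health endpoint first; a stdlib-only sketch (the 60 s timeout is our arbitrary allowance for a sleeping Space to wake):

```python
from urllib.request import urlopen

# Probe the documented /healthz endpoint of the deployed Space.
with urlopen("https://ronitraj-quantumscribe.hf.space/healthz", timeout=60) as resp:
    print(resp.status, resp.read().decode())
```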

### Reproducibility commands

End-to-end (12 baseline cells + 4 trained-model cells + table generation); run from the repo root:

```bash
SPACE_URL=https://ronitraj-quantumscribe.hf.space \
ADAPTER=checkpoints/grpo_v2 \
TRAINED_EPISODES=500 BASELINE_EPISODES=100 \
bash scripts/run_full_eval.sh
```

Outputs:
- `data/remote_eval/eval_remote_{policy}_{level}.json` – 12 baseline cells
- `data/trained_eval/eval_trained_{level}.json` – 4 trained-model cells
- `results/comparison_table.md` – final pivot table

Individual steps, if you only need to refresh part of the matrix:

```bash
# Remote baselines on L1/L2/L3 only (Space-known levels)
python -m scripts.eval_remote --url https://ronitraj-quantumscribe.hf.space \
  --episodes 100 --levels L1_warmup L2_target L3_stretch \
  --all-policies --out-dir data/remote_eval/

# L4_stress baselines (local; Space rejects forced_level=L4_stress until redeployed)
for policy in zeros random pymatching; do
  python -m scripts.eval --policy "$policy" --episodes 100 \
    --level L4_stress \
    --out "data/remote_eval/eval_remote_${policy}_L4_stress.json"
done

# Trained-model evaluation (local; needs GPU)
for level in L1_warmup L2_target L3_stretch L4_stress; do
  python -m scripts.eval --adapter checkpoints/grpo_v2 \
    --episodes 500 --level "$level" \
    --out "data/trained_eval/eval_trained_${level}.json"
done

# Build the comparison table from whatever cells are present
python -m scripts.comparison_table_full \
  --remote-eval-dir data/remote_eval/ \
  --trained-eval-dir data/trained_eval/ \
  --output results/comparison_table.md
```

The runner is idempotent: `SKIP_BASELINES=1` reuses existing baseline JSONs, and `SKIP_TRAINED=1` reuses existing trained-model JSONs.