ronitraj committed on
Commit 0ed122f · verified · 1 Parent(s): 78263dc

deploy via scripts/deploy_to_space.py

Files changed (1):
  1. docs/EVALUATION.md +92 -0

docs/EVALUATION.md ADDED
# Evaluation Protocol

This document describes the held-out evaluation protocol used to populate [`results/comparison_table.md`](../results/comparison_table.md). It was split out of the main README for readability; see the project [README](../README.md) for the high-level overview and headline numbers.

## Episode budget, seeds, and reproducibility commands

This section specifies the end-to-end protocol behind the figures in [`results/comparison_table.md`](../results/comparison_table.md); the "Reproducibility commands" subsection below shows how to rerun it.

### Episode budget

| Cohort | Cells | Episodes / cell | Total |
|---|---|---|---|
| Trained model (SFT-only + SFT+RL × 4 levels) | 8 | 500 | **4,000** |
| Baselines (zeros / random / pymatching × 4 levels) | 12 | 100 | **1,200** |
| **Total** | 20 | — | **5,200 evaluation episodes** |

(The headline 3,200 figure is for a single-adapter run: 2,000 trained + 1,200 baseline episodes.)

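As a quick sanity check on this arithmetic (an illustrative snippet, not a repo script):

```python
# Recompute the episode budget from the table above (illustrative only).
trained_cells, trained_eps = 8, 500      # (SFT-only + SFT+RL) x 4 levels
baseline_cells, baseline_eps = 12, 100   # (zeros, random, pymatching) x 4 levels

print(trained_cells * trained_eps + baseline_cells * baseline_eps)  # 5200: full matrix
print(4 * trained_eps + baseline_cells * baseline_eps)              # 3200: single-adapter run
```
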
### Random seeds

Eval seed range: **5000–7199** (2,200 seeds, held out from training seeds 1–4999 and SFT-validation seeds 4242 + offset). Each (policy, level) cell uses a contiguous block of seeds from this range, so results are bitwise reproducible; a sketch of the block layout follows.

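A minimal sketch of how contiguous per-cell blocks can be carved out of that range. The helper name and the block layout are illustrative assumptions, not the repo's actual assignment (which lives in the eval scripts):

```python
# Illustrative seed-block helper; the real assignment is in the eval scripts.
EVAL_SEED_START, EVAL_SEED_END = 5000, 7199  # inclusive held-out range

def seed_block(cell_index: int, episodes: int) -> range:
    """Contiguous seeds for the cell_index-th cell (hypothetical layout)."""
    start = EVAL_SEED_START + cell_index * episodes
    assert start + episodes - 1 <= EVAL_SEED_END, "cell overruns the held-out range"
    return range(start, start + episodes)

block = seed_block(3, 100)          # e.g. the 4th 100-episode baseline cell
print(block.start, block.stop - 1)  # 5300 5399
```
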
### Confidence intervals

At 500 episodes per cell, a 95% Wilson CI on a 0.85-LCR estimate is approximately **±3.1%**. Baseline cells at 100 episodes carry a wider CI of roughly ±5%; they are deliberately cheaper because the metrics there (≥90% LCR for PyMatching, ~95%+ on L1/L2) are well separated from the trained-model regime where the improvement is tested.

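The half-widths above can be reproduced with a standard Wilson score interval (self-contained helper, not a repo utility):

```python
from math import sqrt

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(0.85, 500))  # ~(0.816, 0.879): half-width ~0.031 at 500 episodes
print(wilson_ci(0.95, 100))  # ~(0.888, 0.978): roughly +/-5% at 100 episodes
```
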
### Hard-syndrome subset definition

A "hard syndrome" is an evaluation episode where the **simulated true error pattern contains ≥ 2 X|Z error qubits**. Easy syndromes (zero or one error) are where every reasonable decoder hits ~95%+ LCR; the hard subset is the cohort where MWPM ambiguity matters and trained-model contributions are most visible. The subset metric is reported as `hard_syndrome_lcr` in each per-cell JSON.

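A schematic of that threshold as code. The representation of the error pattern (sets of qubit indices) is an assumption for illustration; only the ≥ 2 cutoff comes from the definition above:

```python
# Hard-syndrome split, assuming the true error pattern is available as sets
# of qubit indices carrying X and Z errors. Only the >= 2 threshold is from
# the protocol; the data representation here is hypothetical.
def is_hard_syndrome(x_error_qubits: set, z_error_qubits: set) -> bool:
    return len(x_error_qubits | z_error_qubits) >= 2

assert not is_hard_syndrome(set(), set())   # zero errors: easy
assert not is_hard_syndrome({4}, set())     # one error qubit: easy
assert is_hard_syndrome({4}, {7})           # two error qubits: hard
```
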
### Curriculum levels (noise-model parameters)

Defined in [`qubit_medic/config.py:CURRICULUM`](../qubit_medic/config.py). All levels use the rotated surface code with a Z-memory experiment under the SI1000 noise model (Gidney & Fowler 2021). A schematic of the config shape follows the table.

| Level | Distance | Rounds | Physical error rate `p` | Notes |
|---|---|---|---|---|
| `L1_warmup` | 3 | 1 | 0.0005 | trivial; warmup |
| `L2_target` | 3 | 3 | 0.001 | primary benchmark (AlphaQubit Fig. 2b geometry) |
| `L3_stretch` | 5 | 5 | 0.001 | distance-5 stretch goal |
| `L4_stress` | 5 | 5 | 0.005 | 5× higher noise; eval-only stress test where baselines drop and headroom opens |

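Mirroring the table, the `CURRICULUM` mapping plausibly looks like the sketch below; the per-level key names are guesses, and only the level names and numeric values come from this document:

```python
# Schematic reconstruction of qubit_medic/config.py:CURRICULUM from the table
# above. Key names inside each level are assumptions about the schema.
CURRICULUM = {
    "L1_warmup":  {"distance": 3, "rounds": 1, "p": 0.0005},
    "L2_target":  {"distance": 3, "rounds": 3, "p": 0.001},
    "L3_stretch": {"distance": 5, "rounds": 5, "p": 0.001},
    "L4_stress":  {"distance": 5, "rounds": 5, "p": 0.005},
}
```
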
### Deployed environment

Live OpenEnv server: **[https://ronitraj-quantumscribe.hf.space](https://ronitraj-quantumscribe.hf.space)**, with a health probe at `/healthz`. The deployed Space currently serves L1/L2/L3 only; `L4_stress` evaluation runs locally via `scripts/eval.py` against the in-process `DecoderEnvironment`.

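A quick liveness check against that probe (standard-library sketch; the response body is not documented here, so only the HTTP status is inspected):

```python
# Ping the deployed Space's health probe; a 2xx status means it is up.
import urllib.request

URL = "https://ronitraj-quantumscribe.hf.space/healthz"
with urllib.request.urlopen(URL, timeout=10) as resp:
    print(resp.status)  # expect 200 when the Space is healthy
```
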
### Reproducibility commands

End-to-end run (12 baseline cells + 4 trained-model cells + table generation), from the repo root:

```bash
SPACE_URL=https://ronitraj-quantumscribe.hf.space \
ADAPTER=checkpoints/grpo_v2 \
TRAINED_EPISODES=500 BASELINE_EPISODES=100 \
bash scripts/run_full_eval.sh
```

Outputs (see the spot-check sketch after this list):
- `data/remote_eval/eval_remote_{policy}_{level}.json` (12 baseline cells)
- `data/trained_eval/eval_trained_{level}.json` (4 trained-model cells)
- `results/comparison_table.md` (final pivot table)

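To inspect one cell without rebuilding the table, the per-cell JSONs can be loaded directly. Only the `hard_syndrome_lcr` key is named in this document; the rest of the schema is not assumed:

```python
# List each baseline cell's hard-syndrome LCR. Only the hard_syndrome_lcr
# key is documented above; other keys are left untouched.
import json
from pathlib import Path

for path in sorted(Path("data/remote_eval").glob("eval_remote_*.json")):
    cell = json.loads(path.read_text())
    print(path.name, "->", cell.get("hard_syndrome_lcr"))
```
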
Individual steps if you only need to refresh part of the matrix:

```bash
# Remote baselines on L1/L2/L3 only (Space-known levels)
python -m scripts.eval_remote --url https://ronitraj-quantumscribe.hf.space \
    --episodes 100 --levels L1_warmup L2_target L3_stretch \
    --all-policies --out-dir data/remote_eval/

# L4_stress baselines (local; Space rejects forced_level=L4_stress until redeployed)
for policy in zeros random pymatching; do
  python -m scripts.eval --policy $policy --episodes 100 \
      --level L4_stress \
      --out data/remote_eval/eval_remote_${policy}_L4_stress.json
done

# Trained-model evaluation (local; needs GPU)
for level in L1_warmup L2_target L3_stretch L4_stress; do
  python -m scripts.eval --adapter checkpoints/grpo_v2 \
      --episodes 500 --level $level \
      --out data/trained_eval/eval_trained_${level}.json
done

# Build the comparison table from whatever cells are present
python -m scripts.comparison_table_full \
    --remote-eval-dir data/remote_eval/ \
    --trained-eval-dir data/trained_eval/ \
    --output results/comparison_table.md
```

The runner is idempotent: `SKIP_BASELINES=1` reuses existing baseline JSONs, and `SKIP_TRAINED=1` reuses existing trained-model JSONs (e.g. `SKIP_BASELINES=1 bash scripts/run_full_eval.sh` to rerun only the trained cells and the table).