ronitraj committed on
Commit 0139454 · verified · 1 Parent(s): ae72eb9

Upload docs/ENVIRONMENT_API.md with huggingface_hub

  1. docs/ENVIRONMENT_API.md +256 -0
docs/ENVIRONMENT_API.md ADDED
@@ -0,0 +1,256 @@
+ # Environment API
+
+ Qubit-Medic exposes an OpenEnv-compliant HTTP server built on top of
+ `openenv.core.create_fastapi_app`. The server wraps an internal
+ `DecoderEnvironment` (Stim + PyMatching) through the standard
+ `Action` / `Observation` / `State` Pydantic shapes.
+
+ > **Simulation substrate.** Surface-code syndromes are generated with
+ > **Stim** ([Gidney 2021](https://arxiv.org/abs/2103.02202), *Quantum*
+ > 5:497), the field-standard Clifford simulator for quantum error
+ > correction. This is the same simulation engine used by AlphaQubit
+ > (Bausch et al., *Nature* 2024) and Willow (Acharya et al., 2024), so
+ > training data is drawn from the same physical model as the published
+ > benchmarks, not a homemade simulator.
+
+ Source files:
+
+ - `qubit_medic/server/openenv_adapter.py`
+ - `qubit_medic/server/app.py`
+ - `qubit_medic/server/environment.py`
+
+ ## OpenEnv contract
+
+ | Method | Path | Request model | Response model |
+ |--------|------|---------------|----------------|
+ | POST | `/reset` | `openenv.core.types.ResetRequest` | `openenv.core.types.ResetResponse` |
+ | POST | `/step` | `openenv.core.types.StepRequest` | `openenv.core.types.StepResponse` |
+ | GET | `/state` | (none) | `qubit_medic.server.openenv_adapter.QubitMedicState` |
+ | POST | `/state` | (none) | `dict` (mirror of GET; compliance audit 2026-04) |
+ | POST | `/close` | (none) | `{"ok": True, "closed": True}` |
+ | GET | `/schema` | (none) | JSON Schema for action/observation models |
+ | GET | `/metadata` | (none) | `EnvironmentMetadata` |
+ | GET | `/health` | (none) | liveness payload |
+ | GET | `/healthz` | (none) | versions probe (Stim, PyMatching, openenv, Python) |
+ | POST | `/decode` | `{"syndrome": [int], "level": str}` | PyMatching baseline result |
+
+ The OpenEnv canonical routes (`/reset`, `/step`, `/state`, `/health`,
+ `/schema`, `/metadata`, `/mcp`) are wired automatically by
+ `create_fastapi_app`. The `/healthz`, `/decode`, `POST /state`,
+ `POST /close`, and `/` (HTML landing) routes are mounted on top by
+ `qubit_medic/server/app.py`.
+
+ Server entry point: `python -m qubit_medic.server.app` or
+ `uvicorn qubit_medic.server.app:app --host 0.0.0.0 --port 7860`.
+
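As a rough illustration of the routes above, the sketch below builds (but does not send) requests for `GET /healthz` and `POST /decode` using only the standard library. The `/decode` body follows the `{"syndrome": [int], "level": str}` shape from the table; the base URL, the helper name `build_request`, and the syndrome values are assumptions made for this example, not part of the project.

```python
import json
import urllib.request

# Assumption: the server runs locally on the port from the uvicorn command.
BASE_URL = "http://localhost:7860"


def build_request(path, payload=None):
    """Return a urllib Request: GET when payload is None, JSON POST otherwise."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up syndrome bits; real ones would come from an observation's syndrome_bits.
decode_req = build_request("/decode", {"syndrome": [0, 1, 0, 0], "level": "L1_warmup"})
health_req = build_request("/healthz")

# To send a request against a running server:
#   with urllib.request.urlopen(decode_req) as resp:
#       print(json.load(resp))
```

Building the request separately from sending it keeps the example usable without a live server; the response shape of `/decode` is not specified here beyond "PyMatching baseline result".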
+ ## Action dataclass
+
+ ```python
+ # Imports this excerpt relies on; the OpenEnv base class `Action` is
+ # imported in the source file and is not shown here.
+ from typing import Optional
+
+ from pydantic import Field
+
+
+ class QubitMedicAction(Action):
+     """LLM-emitted action: the raw text the model generated."""
+
+     raw_response: str = Field(
+         default="",
+         description="Raw LLM completion text. Server parses to x/z error lists.",
+     )
+     parsed_x_errors: Optional[list[int]] = Field(
+         default=None,
+         description="Optional pre-parsed X-error qubit ids (LLM-space). "
+                     "When provided, the server skips text parsing.",
+     )
+     parsed_z_errors: Optional[list[int]] = Field(
+         default=None,
+         description="Optional pre-parsed Z-error qubit ids (LLM-space).",
+     )
+     episode_id: Optional[int] = Field(
+         default=None,
+         description="Server-assigned episode id from the matching reset(). "
+                     "If omitted, the most-recent active episode is used.",
+     )
+ ```
+
+ Field-level notes:
+
+ - `raw_response`: the canonical wire format. The server runs
+   `qubit_medic.prompts.parse_action(raw_response, num_data_qubits)` to
+   recover both error lists. Keeping the wire format as raw text means the
+   server retains full control over parsing, and unparseable outputs surface
+   cleanly via `format_compliance = 0`.
+ - `parsed_x_errors` / `parsed_z_errors`: a trainer-only escape hatch for
+   baseline policies and unit tests. When set, the server formats a
+   synthetic `<answer>X: ... | Z: ...</answer>` string before parsing; the
+   same parser path runs either way, so reward semantics are identical.
+ - `episode_id`: must match the `episode_id` returned by the matching
+   `reset()` call. If `None`, the adapter falls back to the most recent
+   active episode (`self._last_episode_id`). Stale or unknown ids raise
+   `ValueError` from `DecoderEnvironment.step` (compliance audit 2026-04).
+
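The `episode_id` rules in the notes above can be sketched with a toy registry. This is a minimal illustration, not the adapter's code: the class `EpisodeRegistry` and its method names are hypothetical, and only the behaviour stated in the notes (fallback to the most recent episode, `ValueError` on stale or unknown ids) is modelled.

```python
class EpisodeRegistry:
    """Toy stand-in for the adapter's episode bookkeeping (hypothetical class)."""

    def __init__(self):
        self._active = set()          # ids of episodes awaiting their single step
        self._last_episode_id = None  # most recent id handed out by reset()
        self._counter = 0

    def reset(self):
        """Open a new episode and remember it as the most recent one."""
        self._counter += 1
        self._active.add(self._counter)
        self._last_episode_id = self._counter
        return self._counter

    def resolve(self, episode_id=None):
        """Pick the episode a step applies to, or raise ValueError."""
        if episode_id is None:
            episode_id = self._last_episode_id  # fall back to the latest episode
        if episode_id is None or episode_id not in self._active:
            raise ValueError(f"unknown or stale episode_id: {episode_id!r}")
        self._active.discard(episode_id)  # single-step: the episode ends here
        return episode_id


registry = EpisodeRegistry()
first = registry.reset()
second = registry.reset()
resolved = registry.resolve(None)  # falls back to `second`
```

Passing the id explicitly, as the rollout example later in this document does, avoids relying on the fallback when several episodes are open at once.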
+ ## Observation dataclass
+
+ ```python
+ # Imports this excerpt relies on; the OpenEnv base class `Observation` is
+ # imported in the source file and is not shown here.
+ from typing import Any
+
+ from pydantic import ConfigDict, Field
+
+
+ class QubitMedicObservation(Observation):
+     """OpenEnv observation - mirrors DecoderObservation plus done/reward."""
+
+     model_config = ConfigDict(extra="forbid", validate_assignment=True,
+                               arbitrary_types_allowed=True)
+
+     prompt: str = Field(default="", description="Pre-formatted LLM prompt.")
+     syndrome_bits: list[int] = Field(default_factory=list,
+                                      description="Detector activations (0/1).")
+     distance: int = Field(default=0, description="Code distance for this episode.")
+     rounds: int = Field(default=0, description="Number of stabilizer rounds.")
+     p: float = Field(default=0.0, description="SI1000 base error rate.")
+     curriculum_level: str = Field(default="",
+                                   description="Curriculum level name.")
+     episode_id: int = Field(default=0,
+                             description="Server-assigned episode counter.")
+     dem_digest: str = Field(default="",
+                             description="Short hash of the detector error model.")
+     info: dict[str, Any] = Field(default_factory=dict,
+                                  description="Per-step extras (reward "
+                                              "breakdown, ground-truth flip, "
+                                              "PyMatching baseline, etc.).")
+ ```
+
+ Plus the standard inherited OpenEnv fields:
+
+ - `done: bool`: `True` after every `step` (single-step episodes).
+ - `reward: Optional[float]`: `None` on `reset`, the weighted total in
+   `[0, 1]` after `step`.
+
+ The `info` payload (after `step`) carries:
+
+ | Key | Type | Meaning |
+ |-----|------|---------|
+ | `rewards` | `dict[str, float]` | Per-component breakdown (`logical_correction`, `syndrome_consistency`, `hamming_overlap`, `format_compliance`, `pymatching_beat`, `total`) |
+ | `parsed_action` | `dict` | Deserialised `DecoderAction` (parsed x/z lists, `parse_success`) |
+ | `actual_observable_flip` | `int` | Stim ground-truth flip of the logical Z observable |
+ | `pymatching_observable_pred` | `int` | PyMatching's predicted observable flip |
+ | `pymatching_x_errors` | `list[int]` | PyMatching reference Pauli frame, X axis |
+ | `pymatching_z_errors` | `list[int]` | PyMatching reference Pauli frame, Z axis |
+ | `elapsed_seconds` | `float` | Wall time between `reset` and `step` |
+ | `timed_out` | `bool` | `True` iff `elapsed_seconds > EPISODE_TIMEOUT_SECONDS` |
+ | `curriculum_stats` | `dict` | Live promotion-tracker counters |
+
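To make the `info` table concrete, here is a small sketch that condenses a post-step payload into a few headline fields. The helper name `summarize_step` is hypothetical, and the example payload is hand-written with invented numbers; only the key names are taken from the table above.

```python
def summarize_step(info):
    """Condense a post-step `info` dict into a few headline fields."""
    baseline_correct = (
        info["pymatching_observable_pred"] == info["actual_observable_flip"]
    )
    return {
        "total": info["rewards"]["total"],
        "parsed_ok": bool(info["parsed_action"]["parse_success"]),
        "baseline_correct": baseline_correct,
        "timed_out": info["timed_out"],
    }


# Hand-written payload for illustration; the values are not real server output.
example_info = {
    "rewards": {"logical_correction": 1.0, "syndrome_consistency": 0.5,
                "hamming_overlap": 1.0, "format_compliance": 1.0,
                "pymatching_beat": 0.0, "total": 0.8},
    "parsed_action": {"parse_success": True},
    "actual_observable_flip": 0,
    "pymatching_observable_pred": 0,
    "timed_out": False,
}
summary = summarize_step(example_info)
```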
+ ## State dataclass
+
+ ```python
+ # Imports this excerpt relies on; the OpenEnv base class `State` is
+ # imported in the source file and is not shown here.
+ from typing import Any, Optional
+
+ from pydantic import ConfigDict, Field
+
+
+ class QubitMedicState(State):
+     """Externally-visible state. Physics-truth fields stay server-side."""
+
+     model_config = ConfigDict(extra="allow", validate_assignment=True,
+                               arbitrary_types_allowed=True)
+
+     episodes_started: int = 0
+     active_episodes: int = 0
+     cached_levels: list[str] = Field(default_factory=list)
+     curriculum: dict[str, Any] = Field(default_factory=dict)
+     last_reward_breakdown: Optional[dict[str, float]] = None
+ ```
+
+ The adapter populates a few inherited base-class fields too: `episode_id`
+ (stringified) and `step_count` (which equals `episodes_started`).
+
+ Crucially, `QubitMedicState` deliberately omits the ground-truth fields
+ held by the inner `DecoderState`: `true_x_errors`, `true_z_errors`,
+ `actual_observable_flip`, `pymatching_observable_pred`, `circuit_text`,
+ `dem_text`. Those are visible only inside the reward functions; see
+ `docs/REWARD_HACKING.md`.
+
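The idea of keeping ground-truth fields server-side can be illustrated with a plain-dict filter. This is only a sketch of the principle: the helper `public_view` is hypothetical, the example values are invented, and the real adapter may build `QubitMedicState` field by field instead of filtering.

```python
# Ground-truth field names listed in the text as server-side only.
HIDDEN_FIELDS = frozenset({
    "true_x_errors", "true_z_errors", "actual_observable_flip",
    "pymatching_observable_pred", "circuit_text", "dem_text",
})


def public_view(inner_state):
    """Return a copy of a state dict with the ground-truth fields removed."""
    return {k: v for k, v in inner_state.items() if k not in HIDDEN_FIELDS}


# Invented inner state, for illustration only.
inner = {
    "episodes_started": 3,
    "active_episodes": 1,
    "true_x_errors": [4],
    "actual_observable_flip": 1,
}
visible = public_view(inner)
```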
+ ## Episode lifecycle
+
+ Single-step episodes (`done=True` after every `step`):
+
+ ```
+ client                             server
+ ------                             ------
+ POST /reset          ────────────► scheduler.sample(level)
+                                    _cache_for(level)       (compile Stim circuit
+                                                             and PyMatching matrix
+                                                             once per level)
+                                    sample_episode(seed)    (Stim shot ->
+                                                             syndrome bits +
+                                                             observable flip)
+                                    build_prompt(...)
+                      ◄──────────── Observation { prompt,
+                                                  syndrome_bits,
+                                                  distance, rounds, p,
+                                                  curriculum_level,
+                                                  episode_id,
+                                                  dem_digest,
+                                                  done=False,
+                                                  reward=None }
+
+ POST /step (action)  ────────────► parse_action(raw_response)
+                                    compute_all_rewards(...)
+                                    scheduler.update(...)   (curriculum promotion)
+                      ◄──────────── Observation { ..., done=True,
+                                                  reward=total,
+                                                  info={rewards: {...},
+                                                        ...} }
+ ```
+
+ Calling `step()` with an unknown `episode_id` raises `ValueError` (turned
+ into HTTP 400). Calling `step()` more than `EPISODE_TIMEOUT_SECONDS` after
+ the matching `reset()` returns all-zero rewards and
+ `info["timed_out"] = True`.
+
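The timeout rule can be sketched as a small pure function. This is an illustration under stated assumptions: the helper `score_or_timeout`, its parameters, and the 30-second limit are invented for the example (the actual value of `EPISODE_TIMEOUT_SECONDS` is not given here); only the all-zero rewards and the `timed_out` flag come from the text.

```python
REWARD_KEYS = ("logical_correction", "syndrome_consistency", "hamming_overlap",
               "format_compliance", "pymatching_beat", "total")


def score_or_timeout(started_at, now, timeout_seconds, score_fn):
    """Return (rewards, info); rewards are all zero once the deadline has passed."""
    elapsed = now - started_at
    timed_out = elapsed > timeout_seconds
    # Skip scoring entirely on timeout so a late answer cannot earn reward.
    rewards = dict.fromkeys(REWARD_KEYS, 0.0) if timed_out else score_fn()
    return rewards, {"elapsed_seconds": elapsed, "timed_out": timed_out}


# Fake clock values and a 30 s limit, chosen only for the example.
late_rewards, late_info = score_or_timeout(0.0, 31.0, 30.0, lambda: {"total": 1.0})
ok_rewards, ok_info = score_or_timeout(0.0, 2.5, 30.0, lambda: {"total": 1.0})
```

Passing the clock readings in as arguments keeps the function deterministic, which makes the timeout branch easy to exercise in a unit test.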
+ ## Reward computation
+
+ After parsing, the env converts predicted qubit IDs from LLM-space
+ (`0..num_data_qubits-1`) into Stim's internal coordinate system via
+ `layout.llm_to_stim`, then runs `compute_all_rewards`
+ (`qubit_medic/server/rewards.py`). Each of the five rewards is a pure
+ function over `(parsed, sample, layout, final_detector_supports)`; the
+ combined total is a weighted sum (weights in
+ `qubit_medic.config.REWARD_WEIGHTS`, mirrored in `openenv.yaml`) clamped
+ to `[0, 1]`. The breakdown is exposed in `info["rewards"]`, the curriculum
+ scheduler is updated using only `logical_correction`, and the episode
+ bookkeeping is dropped (`self._active.pop(episode_id)`). See
+ `docs/REWARD_HACKING.md` for the per-reward semantics.
+
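A minimal sketch of the "weighted sum clamped to `[0, 1]`" step described above. The helper name `combine_rewards` is hypothetical and the weights below are placeholders, not the values in `qubit_medic.config.REWARD_WEIGHTS`.

```python
def combine_rewards(components, weights):
    """Weighted sum of reward components, clamped to the interval [0, 1]."""
    total = sum(weights[name] * components[name] for name in weights)
    return max(0.0, min(1.0, total))


# Placeholder weights for illustration; NOT the project's REWARD_WEIGHTS.
weights = {
    "logical_correction": 0.5,
    "syndrome_consistency": 0.25,
    "format_compliance": 0.25,
}
components = {
    "logical_correction": 1.0,
    "syndrome_consistency": 0.5,
    "format_compliance": 1.0,
}
total = combine_rewards(components, weights)  # 0.5 + 0.125 + 0.25 = 0.875
```

The clamp only matters when the weights do not sum to one or a component falls outside `[0, 1]`; with well-formed inputs it leaves the weighted sum unchanged.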
+ ## Curriculum
+
+ Source: `openenv.yaml` (`curriculum:` block) plus
+ `qubit_medic.server.curriculum.CurriculumScheduler`.
+
+ | Level | Distance | Rounds | p (SI1000) | Promotion threshold |
+ |-------|----------|--------|------------|---------------------|
+ | `L1_warmup` | 3 | 1 | 0.0001 | 0.80 |
+ | `L2_target` | 3 | 3 | 0.001 | 0.70 |
+ | `L3_stretch` | 5 | 5 | 0.001 | 0.30 |
+
+ The scheduler samples a level on each `reset()`. Promotion thresholds
+ gate progression via the running `logical_correction` rate at the current
+ level. Levels `L1_warmup` and `L2_target` are pre-warmed at server boot
+ (`_get_shared_inner` in the adapter calls `_cache_for` on both);
+ `L3_stretch` compiles lazily on first selection.
+
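One way threshold-gated promotion could work is sketched below. This is not `CurriculumScheduler`: the class `PromotionTracker`, the sliding-window rule, and the window size are assumptions made for illustration, and the sketch does not model how the real scheduler samples among levels. Only the level names and thresholds come from the table above.

```python
from collections import deque

# Level names and promotion thresholds copied from the table above.
LEVELS = [("L1_warmup", 0.80), ("L2_target", 0.70), ("L3_stretch", 0.30)]


class PromotionTracker:
    """Toy promotion gate over a sliding window of logical_correction outcomes."""

    def __init__(self, window=100):  # window size is an assumption
        self.index = 0
        self.window = window
        self.history = deque(maxlen=window)

    @property
    def level(self):
        return LEVELS[self.index][0]

    def update(self, logical_correction):
        """Record one outcome; promote when a full window clears the threshold."""
        self.history.append(float(logical_correction))
        window_full = len(self.history) == self.window
        rate = sum(self.history) / len(self.history)
        threshold = LEVELS[self.index][1]
        if window_full and rate >= threshold and self.index < len(LEVELS) - 1:
            self.index += 1
            self.history.clear()  # start a fresh window at the new level
        return self.level


tracker = PromotionTracker(window=4)
levels_seen = [tracker.update(outcome) for outcome in (1, 1, 1, 1, 1, 1, 0, 1)]
```

Requiring a full window before promoting avoids advancing on a lucky first episode; whether the actual scheduler does this is not stated in this document.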
+ ## Local rollout example
+
+ ```python
+ from qubit_medic.server.openenv_adapter import (
+     QubitMedicAction,
+     QubitMedicEnvironment,
+ )
+
+ env = QubitMedicEnvironment()
+ obs = env.reset(seed=42)  # QubitMedicObservation
+ print("level:", obs.curriculum_level, "syndrome bits:", len(obs.syndrome_bits))
+ print("prompt preview:", obs.prompt[:120], "...")
+
+ # Pretend the LLM emitted nothing useful: the parser will return empty
+ # lists, format_compliance = 0, syndrome_consistency capped at 0.5.
+ action = QubitMedicAction(
+     raw_response="X_ERRORS=[]\nZ_ERRORS=[]",
+     episode_id=obs.episode_id,
+ )
+ result = env.step(action)
+ print("reward:", result.reward, "done:", result.done)
+ print("breakdown:", result.info["rewards"])
+ print("pymatching reference frame:", result.info["pymatching_x_errors"],
+       result.info["pymatching_z_errors"])
+ ```
+
+ For HTTP usage, hit the live server with `curl` against `/reset` then
+ `/step` (see the Swagger UI at `/docs`), or use any OpenEnv-compatible
+ client.