# Environment API

Qubit-Medic exposes an OpenEnv-compliant HTTP server built on top of
`openenv.core.create_fastapi_app`. The server wraps an internal
`DecoderEnvironment` (Stim + PyMatching) through the standard
`Action` / `Observation` / `State` Pydantic shapes.

> **Simulation substrate.** Surface-code syndromes are generated with
> **Stim** ([Gidney 2021](https://arxiv.org/abs/2103.02202), *Quantum*
> 5:497), the field-standard Clifford simulator for quantum error
> correction. This is the same simulation engine used by AlphaQubit
> (Bausch et al., *Nature* 2024) and Willow (Acharya et al., 2024) —
> training data is drawn from the same physical model as the published
> benchmarks, not a homemade simulator.

Source files:

- `qubit_medic/server/openenv_adapter.py`
- `qubit_medic/server/app.py`
- `qubit_medic/server/environment.py`

## OpenEnv contract

| Method | Path | Request model | Response model |
|--------|------|---------------|----------------|
| POST | `/reset` | `openenv.core.types.ResetRequest` | `openenv.core.types.ResetResponse` |
| POST | `/step` | `openenv.core.types.StepRequest` | `openenv.core.types.StepResponse` |
| GET | `/state` | (none) | `qubit_medic.server.openenv_adapter.QubitMedicState` |
| POST | `/state` | (none) | `dict` (mirror of GET; compliance audit 2026-04) |
| POST | `/close` | (none) | `{"ok": true, "closed": true}` |
| GET | `/schema` | (none) | JSON Schema for action/observation models |
| GET | `/metadata` | (none) | `EnvironmentMetadata` |
| GET | `/health` | (none) | liveness payload |
| GET | `/healthz` | (none) | versions probe (Stim, PyMatching, openenv, Python) |
| POST | `/decode` | `{"syndrome": [int], "level": str}` | PyMatching baseline result |

The OpenEnv canonical routes (`/reset`, `/step`, `/state`, `/health`,
`/schema`, `/metadata`, `/mcp`) are wired automatically by
`create_fastapi_app`. The `/healthz`, `/decode`, `POST /state`,
`POST /close`, and `/` (HTML landing) routes are mounted on top by
`qubit_medic/server/app.py`.

Server entry point: `python -m qubit_medic.server.app` or
`uvicorn qubit_medic.server.app:app --host 0.0.0.0 --port 7860`.

## Action dataclass

```python
class QubitMedicAction(Action):
    """LLM-emitted action: the raw text the model generated."""

    raw_response: str = Field(
        default="",
        description="Raw LLM completion text. Server parses to x/z error lists.",
    )
    parsed_x_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed X-error qubit ids (LLM-space). "
                    "When provided, the server skips text parsing.",
    )
    parsed_z_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed Z-error qubit ids (LLM-space).",
    )
    episode_id: Optional[int] = Field(
        default=None,
        description="Server-assigned episode id from the matching reset(). "
                    "If omitted, the most-recent active episode is used.",
    )
```

Field-level notes:

- `raw_response`: the canonical wire format. The server runs
  `qubit_medic.prompts.parse_action(raw_response, num_data_qubits)` to
  recover both error lists. Keeping the wire format as raw text means the
  server retains full control over parsing, and unparseable outputs surface
  cleanly via `format_compliance = 0`.
- `parsed_x_errors` / `parsed_z_errors`: a trainer-only escape hatch for
  baseline policies and unit tests. When set, the server formats a
  synthetic `<answer>X: ... | Z: ...</answer>` string before parsing — the
  same parser path runs either way, so reward semantics are identical.
- `episode_id`: must match the `episode_id` returned by the matching
  `reset()` call. If `None`, the adapter falls back to the most recent
  active episode (`self._last_episode_id`). Stale or unknown ids raise
  `ValueError` from `DecoderEnvironment.step` (compliance audit 2026-04).
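
To make the "unparseable outputs surface cleanly" behaviour concrete, here is a minimal sketch of what parsing the `<answer>X: ... | Z: ...</answer>` wire format can look like. The function name `parse_answer_tag` and the exact grammar assumed here (comma- or space-separated integer ids, `none` for an empty list) are assumptions for illustration; the real parser is `qubit_medic.prompts.parse_action`.

```python
import re
from typing import Optional

def parse_answer_tag(text: str, num_data_qubits: int) -> Optional[tuple[list[int], list[int]]]:
    """Extract X/Z error lists from an <answer>X: ... | Z: ...</answer> block.

    Returns None when the tag is missing or malformed, mirroring the
    "unparseable -> format_compliance = 0" behaviour described above.
    Hypothetical helper; the real parser is qubit_medic.prompts.parse_action.
    """
    m = re.search(r"<answer>\s*X:\s*(.*?)\s*\|\s*Z:\s*(.*?)\s*</answer>",
                  text, re.DOTALL)
    if m is None:
        return None

    def to_ids(chunk: str) -> Optional[list[int]]:
        chunk = chunk.strip()
        if not chunk or chunk.lower() == "none":
            return []
        try:
            ids = [int(tok) for tok in re.split(r"[,\s]+", chunk)]
        except ValueError:
            return None
        # Reject out-of-range qubit ids (LLM-space is 0..num_data_qubits-1).
        if any(q < 0 or q >= num_data_qubits for q in ids):
            return None
        return sorted(set(ids))

    xs, zs = to_ids(m.group(1)), to_ids(m.group(2))
    if xs is None or zs is None:
        return None
    return xs, zs
```

Because the pre-parsed escape hatch is re-serialised into the same tag format, a single parser like this is the only code path that ever assigns error lists, which keeps reward semantics identical for text and pre-parsed actions.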

## Observation dataclass

```python
class QubitMedicObservation(Observation):
    """OpenEnv observation - mirrors DecoderObservation plus done/reward."""

    model_config = ConfigDict(extra="forbid", validate_assignment=True,
                              arbitrary_types_allowed=True)

    prompt: str = Field(default="", description="Pre-formatted LLM prompt.")
    syndrome_bits: list[int] = Field(default_factory=list,
                                     description="Detector activations (0/1).")
    distance: int = Field(default=0, description="Code distance for this episode.")
    rounds: int = Field(default=0, description="Number of stabilizer rounds.")
    p: float = Field(default=0.0, description="SI1000 base error rate.")
    curriculum_level: str = Field(default="",
                                  description="Curriculum level name.")
    episode_id: int = Field(default=0,
                            description="Server-assigned episode counter.")
    dem_digest: str = Field(default="",
                            description="Short hash of the detector error model.")
    info: dict[str, Any] = Field(default_factory=dict,
                                 description="Per-step extras (reward "
                                             "breakdown, ground-truth flip, "
                                             "PyMatching baseline, etc.).")
```

Plus the standard inherited OpenEnv fields:

- `done: bool` — `True` after every `step` (single-step episodes).
- `reward: Optional[float]` — `None` on `reset`, the weighted total in
  `[0, 1]` after `step`.

`info` payload (after `step`) carries:

| Key | Type | Meaning |
|-----|------|---------|
| `rewards` | `dict[str, float]` | Per-component breakdown (`logical_correction`, `syndrome_consistency`, `hamming_overlap`, `format_compliance`, `pymatching_beat`, `total`) |
| `parsed_action` | `dict` | Deserialised `DecoderAction` (parsed x/z lists, `parse_success`) |
| `actual_observable_flip` | `int` | Stim ground-truth flip of the logical Z observable |
| `pymatching_observable_pred` | `int` | PyMatching's predicted observable flip |
| `pymatching_x_errors` | `list[int]` | PyMatching reference Pauli frame, X axis |
| `pymatching_z_errors` | `list[int]` | PyMatching reference Pauli frame, Z axis |
| `elapsed_seconds` | `float` | Wall time between `reset` and `step` |
| `timed_out` | `bool` | `True` iff `elapsed > EPISODE_TIMEOUT_SECONDS` |
| `curriculum_stats` | `dict` | Live promotion-tracker counters |

## State dataclass

```python
class QubitMedicState(State):
    """Externally-visible state. Physics-truth fields stay server-side."""

    model_config = ConfigDict(extra="allow", validate_assignment=True,
                              arbitrary_types_allowed=True)

    episodes_started: int = 0
    active_episodes: int = 0
    cached_levels: list[str] = Field(default_factory=list)
    curriculum: dict[str, Any] = Field(default_factory=dict)
    last_reward_breakdown: Optional[dict[str, float]] = None
```

The adapter populates a few inherited base-class fields too: `episode_id`
(stringified) and `step_count` (which equals `episodes_started`).

Crucially, `QubitMedicState` deliberately omits the ground-truth fields
held by the inner `DecoderState`: `true_x_errors`, `true_z_errors`,
`actual_observable_flip`, `pymatching_observable_pred`, `circuit_text`,
`dem_text`. Those are visible only inside the reward functions — see
`docs/REWARD_HACKING.md`.
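
The information-hiding rule can be pictured as a filter over the inner state. The sketch below is illustrative only — the real adapter builds a `QubitMedicState` model directly rather than filtering a dict, and `to_public_state` is a hypothetical helper; the field names are the ones listed above.

```python
# Fields that must never leave the server (names from the section above).
_SECRET_FIELDS = {
    "true_x_errors", "true_z_errors", "actual_observable_flip",
    "pymatching_observable_pred", "circuit_text", "dem_text",
}

def to_public_state(inner_state: dict) -> dict:
    """Strip physics-truth fields before serving GET /state.

    Hypothetical helper: the real adapter constructs a QubitMedicState
    model, which simply has no attributes for the secret fields.
    """
    return {k: v for k, v in inner_state.items() if k not in _SECRET_FIELDS}
```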

## Episode lifecycle

Single-step episodes (`done=True` after every `step`):

```
client                                 server
------                                 ------
POST /reset            ────────────►   scheduler.sample(level)
                                       _cache_for(level)            (compile Stim circuit
                                                                     and PyMatching matrix
                                                                     once per level)
                                       sample_episode(seed)         (Stim shot ->
                                                                     syndrome bits +
                                                                     observable flip)
                                       build_prompt(...)
                       ◄────────────   Observation { prompt,
                                                     syndrome_bits,
                                                     distance, rounds, p,
                                                     curriculum_level,
                                                     episode_id,
                                                     dem_digest,
                                                     done=False,
                                                     reward=None }

POST /step (action)    ────────────►   parse_action(raw_response)
                                       compute_all_rewards(...)
                                       scheduler.update(...)        (curriculum promotion)
                       ◄────────────   Observation { ..., done=True,
                                                     reward=total,
                                                     info={rewards: {...},
                                                           ...} }
```

Calling `step()` with an unknown `episode_id` raises `ValueError` (turned
into HTTP 400). Calling `step()` after `EPISODE_TIMEOUT_SECONDS` returns
all-zero rewards and `info["timed_out"] = True`.

## Reward computation

After parsing, the env converts predicted qubit IDs from LLM-space
(`0..num_data_qubits-1`) into Stim's internal coordinate system via
`layout.llm_to_stim`, then runs `compute_all_rewards`
(`qubit_medic/server/rewards.py`). Each of the five rewards is a pure
function over `(parsed, sample, layout, final_detector_supports)`; the
combined total is a weighted sum (weights in
`qubit_medic.config.REWARD_WEIGHTS`, mirrored in `openenv.yaml`) clamped
to `[0, 1]`. The breakdown is exposed in `info["rewards"]`, the curriculum
scheduler is updated using only `logical_correction`, and the episode
bookkeeping is dropped (`self._active.pop(episode_id)`). See
`docs/REWARD_HACKING.md` for the per-reward semantics.
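
The weighted-sum-and-clamp step reduces to a few lines. The weight values below are placeholders for illustration, not the real `qubit_medic.config.REWARD_WEIGHTS`.

```python
# Placeholder weights: the real values come from
# qubit_medic.config.REWARD_WEIGHTS (mirrored in openenv.yaml).
REWARD_WEIGHTS = {
    "logical_correction": 0.5,
    "syndrome_consistency": 0.2,
    "hamming_overlap": 0.15,
    "format_compliance": 0.1,
    "pymatching_beat": 0.05,
}

def combine_rewards(components: dict[str, float]) -> dict[str, float]:
    """Weighted sum of per-component rewards, clamped to [0, 1].

    Returns the breakdown with the clamped total added, matching the
    shape exposed in info["rewards"].
    """
    raw = sum(REWARD_WEIGHTS[name] * components.get(name, 0.0)
              for name in REWARD_WEIGHTS)
    total = min(1.0, max(0.0, raw))
    return {**components, "total": total}
```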

## Curriculum

Source: `openenv.yaml` (`curriculum:` block) plus
`qubit_medic.server.curriculum.CurriculumScheduler`.

| Level | Distance | Rounds | p (SI1000) | Promotion threshold |
|-------|----------|--------|------------|---------------------|
| `L1_warmup` | 3 | 1 | 0.0001 | 0.80 |
| `L2_target` | 3 | 3 | 0.001 | 0.70 |
| `L3_stretch` | 5 | 5 | 0.001 | 0.30 |

The scheduler samples a level on each `reset()`. Promotion thresholds
gate progression via the running `logical_correction` rate at the current
level. Levels `L1_warmup` and `L2_target` are pre-warmed at server boot
(`_get_shared_inner` in the adapter calls `_cache_for` on both);
`L3_stretch` compiles lazily on first selection.
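
The promotion rule can be sketched as a running-rate tracker per level. `PromotionTracker` and its `min_episodes` warm-up are assumptions made for this sketch; the real `CurriculumScheduler` may window its statistics and sample levels stochastically rather than always serving the current one.

```python
LEVELS = [  # (name, promotion threshold on the logical_correction rate)
    ("L1_warmup", 0.80),
    ("L2_target", 0.70),
    ("L3_stretch", 0.30),
]

class PromotionTracker:
    """Running logical_correction rate per level; promote past a threshold.

    Minimal sketch of the gating behaviour described above; the real
    scheduler lives in qubit_medic.server.curriculum.
    """

    def __init__(self, min_episodes: int = 20):
        self.level_idx = 0
        self.min_episodes = min_episodes  # assumed warm-up before promoting
        self.hits = 0.0
        self.count = 0

    @property
    def level(self) -> str:
        return LEVELS[self.level_idx][0]

    def update(self, logical_correction: float) -> None:
        self.hits += logical_correction
        self.count += 1
        threshold = LEVELS[self.level_idx][1]
        at_top = self.level_idx == len(LEVELS) - 1
        if (not at_top and self.count >= self.min_episodes
                and self.hits / self.count >= threshold):
            self.level_idx += 1
            self.hits, self.count = 0.0, 0  # fresh stats at the new level
```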

## Local rollout example

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=42)                 # QubitMedicObservation
print("level:", obs.curriculum_level, "syndrome bits:", len(obs.syndrome_bits))
print("prompt preview:", obs.prompt[:120], "...")

# Pretend the LLM emitted nothing useful: the parser will return empty
# lists, format_compliance = 0, syndrome_consistency capped at 0.5.
action = QubitMedicAction(
    raw_response="X_ERRORS=[]\nZ_ERRORS=[]",
    episode_id=obs.episode_id,
)
result = env.step(action)
print("reward:", result.reward, "done:", result.done)
print("breakdown:", result.info["rewards"])
print("pymatching reference frame:", result.info["pymatching_x_errors"],
      result.info["pymatching_z_errors"])
```

For HTTP usage, hit the live server with `curl` against `/reset` then
`/step` (see the swagger UI at `/docs`), or use any OpenEnv-compatible
client.