# Environment API

Qubit-Medic exposes an OpenEnv-compliant HTTP server built on top of `openenv.core.create_fastapi_app`. The server wraps an internal `DecoderEnvironment` (Stim + PyMatching) through the standard `Action` / `Observation` / `State` Pydantic shapes.

> **Simulation substrate.** Surface-code syndromes are generated with
> **Stim** ([Gidney 2021](https://arxiv.org/abs/2103.02202), *Quantum*
> 5:497), the field-standard Clifford simulator for quantum error
> correction. This is the same simulation engine used by AlphaQubit
> (Bausch et al., *Nature* 2024) and Willow (Acharya et al., 2024) —
> training data is drawn from the same physical model as the published
> benchmarks, not a homemade simulator.

Source files:

- `qubit_medic/server/openenv_adapter.py`
- `qubit_medic/server/app.py`
- `qubit_medic/server/environment.py`

## OpenEnv contract

| Method | Path | Request model | Response model |
|--------|------|---------------|----------------|
| POST | `/reset` | `openenv.core.types.ResetRequest` | `openenv.core.types.ResetResponse` |
| POST | `/step` | `openenv.core.types.StepRequest` | `openenv.core.types.StepResponse` |
| GET | `/state` | (none) | `qubit_medic.server.openenv_adapter.QubitMedicState` |
| POST | `/state` | (none) | `dict` (mirror of GET; compliance audit 2026-04) |
| POST | `/close` | (none) | `{"ok": true, "closed": true}` |
| GET | `/schema` | (none) | JSON Schema for action/observation models |
| GET | `/metadata` | (none) | `EnvironmentMetadata` |
| GET | `/health` | (none) | liveness payload |
| GET | `/healthz` | (none) | versions probe (Stim, PyMatching, openenv, Python) |
| POST | `/decode` | `{"syndrome": [int], "level": str}` | PyMatching baseline result |

The OpenEnv canonical routes (`/reset`, `/step`, `/state`, `/health`, `/schema`, `/metadata`, `/mcp`) are wired automatically by `create_fastapi_app`. The `/healthz`, `/decode`, `POST /state`, `POST /close`, and `/` (HTML landing) routes are mounted on top by `qubit_medic/server/app.py`.

Server entry point: `python -m qubit_medic.server.app` or `uvicorn qubit_medic.server.app:app --host 0.0.0.0 --port 7860`.

## Action dataclass

```python
class QubitMedicAction(Action):
    """LLM-emitted action: the raw text the model generated."""

    raw_response: str = Field(
        default="",
        description="Raw LLM completion text. Server parses to x/z error lists.",
    )
    parsed_x_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed X-error qubit ids (LLM-space). "
        "When provided, the server skips text parsing.",
    )
    parsed_z_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed Z-error qubit ids (LLM-space).",
    )
    episode_id: Optional[int] = Field(
        default=None,
        description="Server-assigned episode id from the matching reset(). "
        "If omitted, the most-recent active episode is used.",
    )
```

Field-level notes:

- `raw_response`: the canonical wire format. The server runs `qubit_medic.prompts.parse_action(raw_response, num_data_qubits)` to recover both error lists. Keeping the wire format as raw text means the server retains full control over parsing, and unparseable outputs surface cleanly via `format_compliance = 0`.
- `parsed_x_errors` / `parsed_z_errors`: a trainer-only escape hatch for baseline policies and unit tests (see the sketch after this list). When set, the server formats a synthetic `X: ... | Z: ...` string before parsing — the same parser path runs either way, so reward semantics are identical.
- `episode_id`: must match the `episode_id` returned by the matching `reset()` call. If `None`, the adapter falls back to the most recent active episode (`self._last_episode_id`). Stale or unknown ids raise `ValueError` from `DecoderEnvironment.step` (compliance audit 2026-04).
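To make the two wire routes concrete, here is a minimal sketch. The qubit ids and `episode_id=0` are illustrative, and the `X_ERRORS=[...]` text syntax is borrowed from the rollout example at the bottom of this page; check `qubit_medic.prompts.parse_action` for the authoritative grammar.

```python
from qubit_medic.server.openenv_adapter import QubitMedicAction

# Canonical route: raw completion text, parsed server-side.
text_action = QubitMedicAction(
    raw_response="X_ERRORS=[2, 5]\nZ_ERRORS=[7]",
    episode_id=0,  # must match the id from the paired reset()
)

# Trainer-only escape hatch: pre-parsed lists. The server renders a
# synthetic "X: ... | Z: ..." string and runs the same parser path,
# so reward semantics are identical to the text route.
baseline_action = QubitMedicAction(
    parsed_x_errors=[2, 5],
    parsed_z_errors=[7],
    episode_id=0,
)
```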
## Observation dataclass

```python
class QubitMedicObservation(Observation):
    """OpenEnv observation - mirrors DecoderObservation plus done/reward."""

    model_config = ConfigDict(
        extra="forbid",
        validate_assignment=True,
        arbitrary_types_allowed=True,
    )

    prompt: str = Field(default="", description="Pre-formatted LLM prompt.")
    syndrome_bits: list[int] = Field(default_factory=list, description="Detector activations (0/1).")
    distance: int = Field(default=0, description="Code distance for this episode.")
    rounds: int = Field(default=0, description="Number of stabilizer rounds.")
    p: float = Field(default=0.0, description="SI1000 base error rate.")
    curriculum_level: str = Field(default="", description="Curriculum level name.")
    episode_id: int = Field(default=0, description="Server-assigned episode counter.")
    dem_digest: str = Field(default="", description="Short hash of the detector error model.")
    info: dict[str, Any] = Field(default_factory=dict,
                                 description="Per-step extras (reward "
                                             "breakdown, ground-truth flip, "
                                             "PyMatching baseline, etc.).")
```

Plus the standard inherited OpenEnv fields:

- `done: bool` — `True` after every `step` (single-step episodes).
- `reward: Optional[float]` — `None` on `reset`, the weighted total in `[0, 1]` after `step`.

`info` payload (after `step`) carries:

| Key | Type | Meaning |
|-----|------|---------|
| `rewards` | `dict[str, float]` | Per-component breakdown (`logical_correction`, `syndrome_consistency`, `hamming_overlap`, `format_compliance`, `pymatching_beat`, `total`) |
| `parsed_action` | `dict` | Deserialised `DecoderAction` (parsed x/z lists, `parse_success`) |
| `actual_observable_flip` | `int` | Stim ground-truth flip of the logical Z observable |
| `pymatching_observable_pred` | `int` | PyMatching's predicted observable flip |
| `pymatching_x_errors` | `list[int]` | PyMatching reference Pauli frame, X axis |
| `pymatching_z_errors` | `list[int]` | PyMatching reference Pauli frame, Z axis |
| `elapsed_seconds` | `float` | Wall time between `reset` and `step` |
| `timed_out` | `bool` | `True` iff `elapsed > EPISODE_TIMEOUT_SECONDS` |
| `curriculum_stats` | `dict` | Live promotion-tracker counters |

## State dataclass

```python
class QubitMedicState(State):
    """Externally-visible state. Physics-truth fields stay server-side."""

    model_config = ConfigDict(
        extra="allow",
        validate_assignment=True,
        arbitrary_types_allowed=True,
    )

    episodes_started: int = 0
    active_episodes: int = 0
    cached_levels: list[str] = Field(default_factory=list)
    curriculum: dict[str, Any] = Field(default_factory=dict)
    last_reward_breakdown: Optional[dict[str, float]] = None
```

The adapter also populates a few inherited base-class fields: `episode_id` (stringified) and `step_count` (which equals `episodes_started`).

Crucially, `QubitMedicState` deliberately omits the ground-truth fields held by the inner `DecoderState`: `true_x_errors`, `true_z_errors`, `actual_observable_flip`, `pymatching_observable_pred`, `circuit_text`, `dem_text`. Those are visible only inside the reward functions — see `docs/REWARD_HACKING.md`.
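To see what actually crosses the wire, a quick probe sketch against a locally running server. The port assumes the default `uvicorn` entry point above, `requests` stands in for any HTTP client, and the response is assumed to mirror the `QubitMedicState` field names:

```python
import requests  # illustrative; any HTTP client works

BASE = "http://localhost:7860"  # default port from the entry point above

# Externally-visible counters only; ground-truth fields never appear here.
state = requests.get(f"{BASE}/state").json()
print(state["episodes_started"], state["active_episodes"], state["cached_levels"])

# Versions probe: Stim, PyMatching, openenv, and Python.
print(requests.get(f"{BASE}/healthz").json())
```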
## Episode lifecycle

Single-step episodes (`done=True` after every `step`):

```
client                          server
------                          ------
POST /reset  ────────────────►  scheduler.sample(level)
                                _cache_for(level)
                                  (compile Stim circuit and PyMatching
                                   matrix once per level)
                                sample_episode(seed)
                                  (Stim shot -> syndrome bits +
                                   observable flip)
                                build_prompt(...)
             ◄────────────────  Observation { prompt, syndrome_bits,
                                  distance, rounds, p, curriculum_level,
                                  episode_id, dem_digest,
                                  done=False, reward=None }

POST /step (action) ─────────►  parse_action(raw_response)
                                compute_all_rewards(...)
                                scheduler.update(...)
                                  (curriculum promotion)
             ◄────────────────  Observation { ..., done=True, reward=total,
                                  info={rewards: {...}, ...} }
```

Calling `step()` with an unknown `episode_id` raises `ValueError` (turned into HTTP 400). Calling `step()` after `EPISODE_TIMEOUT_SECONDS` returns all-zero rewards and `info["timed_out"] = True`.

## Reward computation

After parsing, the env converts predicted qubit IDs from LLM-space (`0..num_data_qubits-1`) into Stim's internal coordinate system via `layout.llm_to_stim`, then runs `compute_all_rewards` (`qubit_medic/server/rewards.py`). Each of the five rewards is a pure function over `(parsed, sample, layout, final_detector_supports)`; the combined total is a weighted sum (weights in `qubit_medic.config.REWARD_WEIGHTS`, mirrored in `openenv.yaml`) clamped to `[0, 1]`. The breakdown is exposed in `info["rewards"]`, the curriculum scheduler is updated using only `logical_correction`, and the episode bookkeeping is dropped (`self._active.pop(episode_id)`). See `docs/REWARD_HACKING.md` for the per-reward semantics.

## Curriculum

Source: `openenv.yaml` (`curriculum:` block) plus `qubit_medic.server.curriculum.CurriculumScheduler`.

| Level | Distance | Rounds | p (SI1000) | Promotion threshold |
|-------|----------|--------|------------|---------------------|
| `L1_warmup` | 3 | 1 | 0.0001 | 0.80 |
| `L2_target` | 3 | 3 | 0.001 | 0.70 |
| `L3_stretch` | 5 | 5 | 0.001 | 0.30 |

The scheduler samples a level on each `reset()`. Promotion thresholds gate progression via the running `logical_correction` rate at the current level. Levels `L1_warmup` and `L2_target` are pre-warmed at server boot (`_get_shared_inner` in the adapter calls `_cache_for` on both); `L3_stretch` compiles lazily on first selection.

## Local rollout example

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=42)  # QubitMedicObservation
print("level:", obs.curriculum_level, "syndrome bits:", len(obs.syndrome_bits))
print("prompt preview:", obs.prompt[:120], "...")

# Pretend the LLM emitted nothing useful: the parser will return empty
# lists, format_compliance = 0, syndrome_consistency capped at 0.5.
action = QubitMedicAction(
    raw_response="X_ERRORS=[]\nZ_ERRORS=[]",
    episode_id=obs.episode_id,
)
result = env.step(action)
print("reward:", result.reward, "done:", result.done)
print("breakdown:", result.info["rewards"])
print("pymatching reference frame:",
      result.info["pymatching_x_errors"],
      result.info["pymatching_z_errors"])
```

For HTTP usage, hit the live server with `curl` against `/reset` then `/step` (see the Swagger UI at `/docs`), use any OpenEnv-compatible client, or adapt the sketch below.
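A hedged HTTP version of the same rollout. The request envelopes (empty JSON for `/reset`, an `action` wrapper for `/step`) and the response nesting are assumptions about `openenv.core.types.ResetRequest` / `StepRequest`; treat `GET /schema` as the authoritative contract.

```python
import requests  # illustrative; any HTTP client works

BASE = "http://localhost:7860"

# ASSUMPTION: /reset accepts an empty body and returns the observation
# either at the top level or under "observation". Verify via GET /schema.
reset = requests.post(f"{BASE}/reset", json={}).json()
obs = reset.get("observation", reset)

# ASSUMPTION: /step wraps the action payload under an "action" key.
step = requests.post(
    f"{BASE}/step",
    json={"action": {
        "raw_response": "X_ERRORS=[]\nZ_ERRORS=[]",
        "episode_id": obs["episode_id"],
    }},
).json()
print(step)
```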