Environment API
QuantumScribe exposes an OpenEnv-compliant HTTP server built on top of
`openenv.core.create_fastapi_app`. The server wraps an internal
`DecoderEnvironment` (Stim + PyMatching) through the standard
Action / Observation / State Pydantic shapes.
Simulation substrate. Surface-code syndromes are generated with Stim (Gidney 2021, Quantum 5:497), the field-standard Clifford simulator for quantum error correction. This is the same simulation engine used by AlphaQubit (Bausch et al., Nature 2024) and Willow (Acharya et al., 2024), so training data is drawn from the same physical model as the published benchmarks, not a homemade simulator.
Source files:
- `qubit_medic/server/openenv_adapter.py`
- `qubit_medic/server/app.py`
- `qubit_medic/server/environment.py`
OpenEnv contract
| Method | Path | Request model | Response model |
|---|---|---|---|
| POST | `/reset` | `openenv.core.types.ResetRequest` | `openenv.core.types.ResetResponse` |
| POST | `/step` | `openenv.core.types.StepRequest` | `openenv.core.types.StepResponse` |
| GET | `/state` | (none) | `qubit_medic.server.openenv_adapter.QubitMedicState` |
| POST | `/state` | (none) | `dict` (mirror of GET; compliance audit 2026-04) |
| POST | `/close` | (none) | `{"ok": True, "closed": True}` |
| GET | `/schema` | (none) | JSON Schema for action/observation models |
| GET | `/metadata` | (none) | `EnvironmentMetadata` |
| GET | `/health` | (none) | liveness payload |
| GET | `/healthz` | (none) | versions probe (Stim, PyMatching, openenv, Python) |
| POST | `/decode` | `{"syndrome": [int], "level": str}` | PyMatching baseline result |
The OpenEnv canonical routes (`/reset`, `/step`, `/state`, `/health`,
`/schema`, `/metadata`, `/mcp`) are wired automatically by
`create_fastapi_app`. The `/healthz`, `/decode`, `POST /state`,
`POST /close`, and `/` (HTML landing) routes are mounted on top by
`qubit_medic/server/app.py`.
Server entry point: `python -m qubit_medic.server.app` or
`uvicorn qubit_medic.server.app:app --host 0.0.0.0 --port 7860`.
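A minimal HTTP smoke test against a locally running server, as a sketch: it assumes the default host/port from the entry point above and the `httpx` client, and the all-zero syndrome is only a placeholder (its length presumably has to match the level's detector count).

```python
import httpx

BASE = "http://localhost:7860"

# Version probe: reports Stim, PyMatching, openenv, and Python versions.
print(httpx.get(f"{BASE}/healthz").json())

# PyMatching baseline decode. The payload shape follows the table above;
# the all-zero syndrome is a placeholder, not a meaningful shot.
resp = httpx.post(f"{BASE}/decode",
                  json={"syndrome": [0] * 8, "level": "L2_target"})
print(resp.json())
```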
Action dataclass
```python
class QubitMedicAction(Action):
    """LLM-emitted action: the raw text the model generated."""

    raw_response: str = Field(
        default="",
        description="Raw LLM completion text. Server parses to x/z error lists.",
    )
    parsed_x_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed X-error qubit ids (LLM-space). "
        "When provided, the server skips text parsing.",
    )
    parsed_z_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed Z-error qubit ids (LLM-space).",
    )
    episode_id: Optional[int] = Field(
        default=None,
        description="Server-assigned episode id from the matching reset(). "
        "If omitted, the most-recent active episode is used.",
    )
```
Field-level notes:

- `raw_response`: the canonical wire format. The server runs `qubit_medic.prompts.parse_action(raw_response, num_data_qubits)` to recover both error lists. Keeping the wire format as raw text means the server retains full control over parsing, and unparseable outputs surface cleanly via `format_compliance = 0`.
- `parsed_x_errors` / `parsed_z_errors`: a trainer-only escape hatch for baseline policies and unit tests. When set, the server formats a synthetic `<answer>X: ... | Z: ...</answer>` string before parsing; the same parser path runs either way, so reward semantics are identical (see the sketch after this list).
- `episode_id`: must match the `episode_id` returned by the matching `reset()` call. If `None`, the adapter falls back to the most recent active episode (`self._last_episode_id`). Stale or unknown ids raise `ValueError` from `DecoderEnvironment.step` (compliance audit 2026-04).
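A minimal sketch of the escape hatch, reusing `env` and `obs` from the local rollout example below; the qubit ids here are arbitrary placeholders.

```python
# Trainer-only path: skip text parsing by supplying pre-parsed error lists.
# The server synthesizes "<answer>X: ... | Z: ...</answer>" and runs the
# same parser, so the reward path is identical to raw-text actions.
action = QubitMedicAction(
    parsed_x_errors=[0, 4],   # arbitrary placeholder qubit ids (LLM-space)
    parsed_z_errors=[],
    episode_id=obs.episode_id,
)
result = env.step(action)
```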
Observation dataclass
```python
class QubitMedicObservation(Observation):
    """OpenEnv observation - mirrors DecoderObservation plus done/reward."""

    model_config = ConfigDict(extra="forbid", validate_assignment=True,
                              arbitrary_types_allowed=True)

    prompt: str = Field(default="", description="Pre-formatted LLM prompt.")
    syndrome_bits: list[int] = Field(default_factory=list,
                                     description="Detector activations (0/1).")
    distance: int = Field(default=0, description="Code distance for this episode.")
    rounds: int = Field(default=0, description="Number of stabilizer rounds.")
    p: float = Field(default=0.0, description="SI1000 base error rate.")
    curriculum_level: str = Field(default="",
                                  description="Curriculum level name.")
    episode_id: int = Field(default=0,
                            description="Server-assigned episode counter.")
    dem_digest: str = Field(default="",
                            description="Short hash of the detector error model.")
    info: dict[str, Any] = Field(default_factory=dict,
                                 description="Per-step extras (reward "
                                 "breakdown, ground-truth flip, "
                                 "PyMatching baseline, etc.).")
```
Plus the standard inherited OpenEnv fields:

- `done: bool`: `True` after every `step` (single-step episodes).
- `reward: Optional[float]`: `None` on `reset`, the weighted total in `[0, 1]` after `step`.
The `info` payload (after `step`) carries:

| Key | Type | Meaning |
|---|---|---|
| `rewards` | `dict[str, float]` | Per-component breakdown (`logical_correction`, `syndrome_consistency`, `hamming_overlap`, `format_compliance`, `pymatching_beat`, `total`) |
| `parsed_action` | `dict` | Deserialised `DecoderAction` (parsed x/z lists, `parse_success`) |
| `actual_observable_flip` | `int` | Stim ground-truth flip of the logical Z observable |
| `pymatching_observable_pred` | `int` | PyMatching's predicted observable flip |
| `pymatching_x_errors` | `list[int]` | PyMatching reference Pauli frame, X axis |
| `pymatching_z_errors` | `list[int]` | PyMatching reference Pauli frame, Z axis |
| `elapsed_seconds` | `float` | Wall time between `reset` and `step` |
| `timed_out` | `bool` | `True` iff elapsed > `EPISODE_TIMEOUT_SECONDS` |
| `curriculum_stats` | `dict` | Live promotion-tracker counters |
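A short sketch of consuming this payload, reusing `result` from the local rollout example below; key names follow the table above.

```python
# Inspect the per-component reward breakdown and the timeout flag.
breakdown = result.info["rewards"]
print(f"total={breakdown['total']:.3f} "
      f"logical={breakdown['logical_correction']:.3f}")
if result.info["timed_out"]:
    print("episode exceeded EPISODE_TIMEOUT_SECONDS; rewards were zeroed")

# Compare the PyMatching baseline against Stim's ground truth on this shot.
agree = (result.info["pymatching_observable_pred"]
         == result.info["actual_observable_flip"])
print("pymatching matched ground truth:", agree)
```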
State dataclass
```python
class QubitMedicState(State):
    """Externally-visible state. Physics-truth fields stay server-side."""

    model_config = ConfigDict(extra="allow", validate_assignment=True,
                              arbitrary_types_allowed=True)

    episodes_started: int = 0
    active_episodes: int = 0
    cached_levels: list[str] = Field(default_factory=list)
    curriculum: dict[str, Any] = Field(default_factory=dict)
    last_reward_breakdown: Optional[dict[str, float]] = None
```
The adapter populates a few inherited base-class fields too: `episode_id`
(stringified) and `step_count` (which equals `episodes_started`).
Crucially, `QubitMedicState` deliberately omits the ground-truth fields
held by the inner `DecoderState`: `true_x_errors`, `true_z_errors`,
`actual_observable_flip`, `pymatching_observable_pred`, `circuit_text`,
`dem_text`. Those are visible only inside the reward functions; see
`docs/REWARD_HACKING.md`.
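An illustrative check over HTTP, as a sketch assuming a server on the default port; the field names follow the dataclass above.

```python
import httpx

state = httpx.get("http://localhost:7860/state").json()
print(state["episodes_started"], state["cached_levels"])

# Ground-truth fields held by the inner DecoderState must never appear
# in the externally-visible state.
assert "true_x_errors" not in state
assert "actual_observable_flip" not in state
```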
Episode lifecycle
Single-step episodes (`done=True` after every `step`):

```
client                              server
------                              ------
POST /reset ─────────────────►  scheduler.sample(level)
                                _cache_for(level)      (compile Stim circuit
                                                        and PyMatching matrix
                                                        once per level)
                                sample_episode(seed)   (Stim shot ->
                                                        syndrome bits +
                                                        observable flip)
                                build_prompt(...)
            ◄─────────────────  Observation { prompt,
                                              syndrome_bits,
                                              distance, rounds, p,
                                              curriculum_level,
                                              episode_id,
                                              dem_digest,
                                              done=False,
                                              reward=None }

POST /step (action) ─────────►  parse_action(raw_response)
                                compute_all_rewards(...)
                                scheduler.update(...)  (curriculum promotion)
            ◄─────────────────  Observation { ..., done=True,
                                              reward=total,
                                              info={rewards: {...},
                                                    ...} }
```
Calling `step()` with an unknown `episode_id` raises `ValueError` (turned
into HTTP 400). Calling `step()` after `EPISODE_TIMEOUT_SECONDS` returns
all-zero rewards and `info["timed_out"] = True`.
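An illustration of the first failure mode, reusing `env` from the local rollout example below; `10**9` is simply an id assumed never to have been issued.

```python
# An unknown episode_id is rejected locally with ValueError; over HTTP
# the same condition surfaces as a 400 response.
stale = QubitMedicAction(raw_response="<answer>X: | Z: </answer>",
                         episode_id=10**9)
try:
    env.step(stale)
except ValueError as err:
    print("rejected:", err)
```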
Reward computation
After parsing, the env converts predicted qubit IDs from LLM-space
(`0..num_data_qubits-1`) into Stim's internal coordinate system via
`layout.llm_to_stim`, then runs `compute_all_rewards`
(`qubit_medic/server/rewards.py`). Each of the five rewards is a pure
function over `(parsed, sample, layout, final_detector_supports)`; the
combined total is a weighted sum (weights in
`qubit_medic.config.REWARD_WEIGHTS`, mirrored in `openenv.yaml`) clamped
to `[0, 1]`. The breakdown is exposed in `info["rewards"]`, the curriculum
scheduler is updated using only `logical_correction`, and the episode
bookkeeping is dropped (`self._active.pop(episode_id)`). See
`docs/REWARD_HACKING.md` for the per-reward semantics. A sketch of the
weighting step follows.
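A minimal sketch of the combination step, assuming `REWARD_WEIGHTS` maps component names (the keys of `info["rewards"]`, minus `total`) to floats; the actual implementation lives in `qubit_medic/server/rewards.py`.

```python
def weighted_total(components: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted sum of per-component rewards, clamped to [0, 1]."""
    total = sum(weights[name] * components[name] for name in weights)
    return max(0.0, min(1.0, total))
```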
Curriculum
Source: `openenv.yaml` (`curriculum:` block) plus
`qubit_medic.server.curriculum.CurriculumScheduler`.

| Level | Distance | Rounds | p (SI1000) | Promotion threshold |
|---|---|---|---|---|
| `L1_warmup` | 3 | 1 | 0.0001 | 0.80 |
| `L2_target` | 3 | 3 | 0.001 | 0.70 |
| `L3_stretch` | 5 | 5 | 0.001 | 0.30 |
The scheduler samples a level on each `reset()`. Promotion thresholds
gate progression via the running `logical_correction` rate at the current
level. Levels `L1_warmup` and `L2_target` are pre-warmed at server boot
(`_get_shared_inner` in the adapter calls `_cache_for` on both);
`L3_stretch` compiles lazily on first selection. An illustrative model of
the gating logic is sketched below.
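Not the scheduler's exact code, only a sketch: the thresholds come from the table above, and `min_episodes` is a hypothetical warm-up floor so a single lucky episode cannot trigger promotion.

```python
THRESHOLDS = {"L1_warmup": 0.80, "L2_target": 0.70, "L3_stretch": 0.30}

def should_promote(level: str, successes: int, episodes: int,
                   min_episodes: int = 20) -> bool:
    """Promote when the running logical_correction rate clears the bar."""
    if episodes < min_episodes:   # hypothetical warm-up floor
        return False
    return successes / episodes >= THRESHOLDS[level]
```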
Local rollout example
```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=42)  # QubitMedicObservation
print("level:", obs.curriculum_level, "syndrome bits:", len(obs.syndrome_bits))
print("prompt preview:", obs.prompt[:120], "...")

# Pretend the LLM emitted nothing useful: the parser will return empty
# lists, format_compliance = 0, syndrome_consistency capped at 0.5.
action = QubitMedicAction(
    raw_response="X_ERRORS=[]\nZ_ERRORS=[]",
    episode_id=obs.episode_id,
)
result = env.step(action)
print("reward:", result.reward, "done:", result.done)
print("breakdown:", result.info["rewards"])
print("pymatching reference frame:", result.info["pymatching_x_errors"],
      result.info["pymatching_z_errors"])
```
For HTTP usage, hit the live server with curl against `/reset` then
`/step` (see the Swagger UI at `/docs`), or use any OpenEnv-compatible
client. A Python equivalent is sketched below.
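A hedged Python equivalent of that curl flow, assuming the default port; the exact wire shapes of `ResetRequest`/`StepRequest` and the `action` key are assumptions here, so consult `GET /schema` for the authoritative layout.

```python
import httpx

BASE = "http://localhost:7860"

# Reset: an empty JSON body is assumed sufficient (verify via GET /schema).
obs = httpx.post(f"{BASE}/reset", json={}).json()

# Step: the "action" payload key is an assumption; check GET /schema for
# the authoritative StepRequest layout before relying on it.
step = httpx.post(f"{BASE}/step", json={
    "action": {
        "raw_response": "X_ERRORS=[]\nZ_ERRORS=[]",
        "episode_id": obs.get("episode_id"),
    },
}).json()
print(step)
```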