
# Environment API

QuantumScribe exposes an OpenEnv-compliant HTTP server built on top of `openenv.core.create_fastapi_app`. The server wraps an internal `DecoderEnvironment` (Stim + PyMatching) through the standard `Action` / `Observation` / `State` Pydantic shapes.

**Simulation substrate.** Surface-code syndromes are generated with Stim (Gidney 2021, Quantum 5:497), the field-standard Clifford simulator for quantum error correction. This is the same simulation engine used by AlphaQubit (Bausch et al., Nature 2024) and Willow (Acharya et al., 2024); training data is drawn from the same physical model as the published benchmarks, not a homemade simulator.

Source files:

- `qubit_medic/server/openenv_adapter.py`
- `qubit_medic/server/app.py`
- `qubit_medic/server/environment.py`

## OpenEnv contract

| Method | Path | Request model | Response model |
| --- | --- | --- | --- |
| POST | `/reset` | `openenv.core.types.ResetRequest` | `openenv.core.types.ResetResponse` |
| POST | `/step` | `openenv.core.types.StepRequest` | `openenv.core.types.StepResponse` |
| GET | `/state` | (none) | `qubit_medic.server.openenv_adapter.QubitMedicState` |
| POST | `/state` | (none) | `dict` (mirror of GET; compliance audit 2026-04) |
| POST | `/close` | (none) | `{"ok": True, "closed": True}` |
| GET | `/schema` | (none) | JSON Schema for action/observation models |
| GET | `/metadata` | (none) | `EnvironmentMetadata` |
| GET | `/health` | (none) | liveness payload |
| GET | `/healthz` | (none) | versions probe (Stim, PyMatching, openenv, Python) |
| POST | `/decode` | `{"syndrome": [int], "level": str}` | PyMatching baseline result |

The OpenEnv canonical routes (`/reset`, `/step`, `/state`, `/health`, `/schema`, `/metadata`, `/mcp`) are wired automatically by `create_fastapi_app`. The `/healthz`, `/decode`, `POST /state`, `POST /close`, and `/` (HTML landing) routes are mounted on top by `qubit_medic/server/app.py`.

Server entry point: `python -m qubit_medic.server.app` or `uvicorn qubit_medic.server.app:app --host 0.0.0.0 --port 7860`.
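
As a quick smoke test of the extra routes, here is a minimal sketch of calling `POST /decode` against a locally running server. This assumes `requests` is installed; the syndrome bits are placeholder values, not a real Stim shot, and the length must match the detector count for the chosen level.

```python
# Minimal sketch: query the PyMatching baseline route on a local server.
# The request shape follows the contract table above. NB: the syndrome
# length must match the level's detector count; the bits here are
# placeholders, not a real shot.
import requests

BASE = "http://localhost:7860"  # port from the uvicorn command above

resp = requests.post(
    f"{BASE}/decode",
    json={"syndrome": [0, 1, 0, 0, 1, 0, 0, 0], "level": "L1_warmup"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # PyMatching baseline result (shape is server-defined)
```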

## Action dataclass

```python
from typing import Optional

from pydantic import Field

# Assumed import path: the base shapes live alongside ResetRequest /
# StepRequest in the contract table above.
from openenv.core.types import Action


class QubitMedicAction(Action):
    """LLM-emitted action: the raw text the model generated."""

    raw_response: str = Field(
        default="",
        description="Raw LLM completion text. Server parses to x/z error lists.",
    )
    parsed_x_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed X-error qubit ids (LLM-space). "
                    "When provided, the server skips text parsing.",
    )
    parsed_z_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed Z-error qubit ids (LLM-space).",
    )
    episode_id: Optional[int] = Field(
        default=None,
        description="Server-assigned episode id from the matching reset(). "
                    "If omitted, the most-recent active episode is used.",
    )
```

Field-level notes:

- `raw_response`: the canonical wire format. The server runs `qubit_medic.prompts.parse_action(raw_response, num_data_qubits)` to recover both error lists. Keeping the wire format as raw text means the server retains full control over parsing, and unparseable outputs surface cleanly via `format_compliance = 0`.
- `parsed_x_errors` / `parsed_z_errors`: a trainer-only escape hatch for baseline policies and unit tests (see the sketch after this list). When set, the server formats a synthetic `<answer>X: ... | Z: ...</answer>` string before parsing; the same parser path runs either way, so reward semantics are identical.
- `episode_id`: must match the `episode_id` returned by the matching `reset()` call. If `None`, the adapter falls back to the most recent active episode (`self._last_episode_id`). Stale or unknown ids raise `ValueError` from `DecoderEnvironment.step` (compliance audit 2026-04).
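
A minimal local sketch of the escape hatch. The qubit ids below are illustrative, not derived from the sampled syndrome:

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=0)  # grab a live episode_id to echo back

# Trainer-side escape hatch: supply pre-parsed error lists instead of raw
# text. The server renders a synthetic <answer>X: ... | Z: ...</answer>
# string and runs the same parser, so reward semantics are identical.
action = QubitMedicAction(
    parsed_x_errors=[2, 7],     # LLM-space data-qubit ids flagged as X errors
    parsed_z_errors=[],         # no Z errors predicted
    episode_id=obs.episode_id,  # must match the episode from reset()
)
result = env.step(action)
print(result.reward, result.info["rewards"]["format_compliance"])
```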

## Observation dataclass

```python
from typing import Any

from pydantic import ConfigDict, Field

# Assumed import path; see note on the Action dataclass above.
from openenv.core.types import Observation


class QubitMedicObservation(Observation):
    """OpenEnv observation - mirrors DecoderObservation plus done/reward."""

    model_config = ConfigDict(extra="forbid", validate_assignment=True,
                              arbitrary_types_allowed=True)

    prompt: str = Field(default="", description="Pre-formatted LLM prompt.")
    syndrome_bits: list[int] = Field(default_factory=list,
                                     description="Detector activations (0/1).")
    distance: int = Field(default=0, description="Code distance for this episode.")
    rounds: int = Field(default=0, description="Number of stabilizer rounds.")
    p: float = Field(default=0.0, description="SI1000 base error rate.")
    curriculum_level: str = Field(default="",
                                  description="Curriculum level name.")
    episode_id: int = Field(default=0,
                            description="Server-assigned episode counter.")
    dem_digest: str = Field(default="",
                            description="Short hash of the detector error model.")
    info: dict[str, Any] = Field(default_factory=dict,
                                 description="Per-step extras (reward "
                                             "breakdown, ground-truth flip, "
                                             "PyMatching baseline, etc.).")
```

Plus the standard inherited OpenEnv fields:

- `done` (`bool`): `True` after every step (single-step episodes).
- `reward` (`Optional[float]`): `None` on reset, the weighted total in [0, 1] after step.

The `info` payload (after `step`) carries:

| Key | Type | Meaning |
| --- | --- | --- |
| `rewards` | `dict[str, float]` | Per-component breakdown (`logical_correction`, `syndrome_consistency`, `hamming_overlap`, `format_compliance`, `pymatching_beat`, `total`) |
| `parsed_action` | `dict` | Deserialised `DecoderAction` (parsed x/z lists, `parse_success`) |
| `actual_observable_flip` | `int` | Stim ground-truth flip of the logical Z observable |
| `pymatching_observable_pred` | `int` | PyMatching's predicted observable flip |
| `pymatching_x_errors` | `list[int]` | PyMatching reference Pauli frame, X axis |
| `pymatching_z_errors` | `list[int]` | PyMatching reference Pauli frame, Z axis |
| `elapsed_seconds` | `float` | Wall time between `reset` and `step` |
| `timed_out` | `bool` | `True` iff elapsed > `EPISODE_TIMEOUT_SECONDS` |
| `curriculum_stats` | `dict` | Live promotion-tracker counters |
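
A short sketch of walking these keys after a local step. The empty `raw_response` simply exercises the parse-failure path:

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=0)
result = env.step(QubitMedicAction(raw_response="", episode_id=obs.episode_id))

# Per-component breakdown, keyed as in the table above.
for name, value in result.info["rewards"].items():
    print(f"{name:22s} {value:.3f}")
print("ground-truth flip:", result.info["actual_observable_flip"])
print("PyMatching prediction:", result.info["pymatching_observable_pred"])
print("timed out:", result.info["timed_out"])
```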

## State dataclass

```python
from typing import Any, Optional

from pydantic import ConfigDict, Field

# Assumed import path; see note on the Action dataclass above.
from openenv.core.types import State


class QubitMedicState(State):
    """Externally-visible state. Physics-truth fields stay server-side."""

    model_config = ConfigDict(extra="allow", validate_assignment=True,
                              arbitrary_types_allowed=True)

    episodes_started: int = 0
    active_episodes: int = 0
    cached_levels: list[str] = Field(default_factory=list)
    curriculum: dict[str, Any] = Field(default_factory=dict)
    last_reward_breakdown: Optional[dict[str, float]] = None
```

The adapter populates a few inherited base-class fields too: `episode_id` (stringified) and `step_count` (which equals `episodes_started`).

Crucially, `QubitMedicState` deliberately omits the ground-truth fields held by the inner `DecoderState`: `true_x_errors`, `true_z_errors`, `actual_observable_flip`, `pymatching_observable_pred`, `circuit_text`, `dem_text`. Those are visible only inside the reward functions; see docs/REWARD_HACKING.md.
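
A quick sketch of checking this over HTTP against a local server; the assertions just document the omission:

```python
import requests

state = requests.get("http://localhost:7860/state", timeout=10).json()
print(state.get("episodes_started"), state.get("active_episodes"))
print(state.get("last_reward_breakdown"))

# Physics truth never crosses the wire.
assert "true_x_errors" not in state
assert "actual_observable_flip" not in state
```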

## Episode lifecycle

Single-step episodes (`done=True` after every step):

```text
client                                 server
------                                 ------
POST /reset            ────────────►   scheduler.sample(level)
                                       _cache_for(level)            (compile Stim circuit
                                                                     and PyMatching matrix
                                                                     once per level)
                                       sample_episode(seed)         (Stim shot ->
                                                                     syndrome bits +
                                                                     observable flip)
                                       build_prompt(...)
                       ◄────────────   Observation { prompt,
                                                     syndrome_bits,
                                                     distance, rounds, p,
                                                     curriculum_level,
                                                     episode_id,
                                                     dem_digest,
                                                     done=False,
                                                     reward=None }

POST /step (action)    ────────────►   parse_action(raw_response)
                                       compute_all_rewards(...)
                                       scheduler.update(...)        (curriculum promotion)
                       ◄────────────   Observation { ..., done=True,
                                                     reward=total,
                                                     info={rewards: {...},
                                                           ...} }
```

Calling `step()` with an unknown `episode_id` raises `ValueError` (turned into HTTP 400). Calling `step()` after `EPISODE_TIMEOUT_SECONDS` returns all-zero rewards and `info["timed_out"] = True`.
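
Both failure modes are easy to exercise locally; a sketch of the unknown-id path, with a deliberately bogus id:

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
env.reset(seed=0)

try:
    env.step(QubitMedicAction(raw_response="", episode_id=999_999))
except ValueError as exc:
    print("rejected:", exc)  # over HTTP this surfaces as a 400
```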

## Reward computation

After parsing, the env converts predicted qubit IDs from LLM-space (`0..num_data_qubits-1`) into Stim's internal coordinate system via `layout.llm_to_stim`, then runs `compute_all_rewards` (`qubit_medic/server/rewards.py`). Each of the five rewards is a pure function over `(parsed, sample, layout, final_detector_supports)`; the combined total is a weighted sum (weights in `qubit_medic.config.REWARD_WEIGHTS`, mirrored in `openenv.yaml`) clamped to [0, 1]. The breakdown is exposed in `info["rewards"]`, the curriculum scheduler is updated using only `logical_correction`, and the episode bookkeeping is dropped (`self._active.pop(episode_id)`). See docs/REWARD_HACKING.md for the per-reward semantics.
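
A sketch of the arithmetic, assuming `REWARD_WEIGHTS` maps the component names above to float weights (the exact structure lives in `qubit_medic.config`):

```python
from qubit_medic.config import REWARD_WEIGHTS  # assumed shape: {component: weight}


def weighted_total(rewards: dict[str, float]) -> float:
    """Recompute info["rewards"]["total"] from the per-component breakdown."""
    raw = sum(REWARD_WEIGHTS[name] * rewards[name] for name in REWARD_WEIGHTS)
    return min(1.0, max(0.0, raw))  # clamped to [0, 1], as described above
```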

## Curriculum

Source: `openenv.yaml` (`curriculum:` block) plus `qubit_medic.server.curriculum.CurriculumScheduler`.

| Level | Distance | Rounds | p (SI1000) | Promotion threshold |
| --- | --- | --- | --- | --- |
| `L1_warmup` | 3 | 1 | 0.0001 | 0.80 |
| `L2_target` | 3 | 3 | 0.001 | 0.70 |
| `L3_stretch` | 5 | 5 | 0.001 | 0.30 |

The scheduler samples a level on each `reset()`. Promotion thresholds gate progression via the running `logical_correction` rate at the current level. Levels `L1_warmup` and `L2_target` are pre-warmed at server boot (`_get_shared_inner` in the adapter calls `_cache_for` on both); `L3_stretch` compiles lazily on first selection.
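
Illustratively, the promotion rule looks like the sketch below. This is a paraphrase of the description above, not the actual `CurriculumScheduler` code; in particular, the running-rate window is an assumption.

```python
# Thresholds from the curriculum table above.
THRESHOLDS = {"L1_warmup": 0.80, "L2_target": 0.70, "L3_stretch": 0.30}


def should_promote(level: str, logical_correction_history: list[float]) -> bool:
    """Promote once the running logical_correction rate clears the threshold."""
    if not logical_correction_history:
        return False
    rate = sum(logical_correction_history) / len(logical_correction_history)
    return rate >= THRESHOLDS[level]
```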

## Local rollout example

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=42)                 # QubitMedicObservation
print("level:", obs.curriculum_level, "syndrome bits:", len(obs.syndrome_bits))
print("prompt preview:", obs.prompt[:120], "...")

# Pretend the LLM emitted nothing useful: the parser will return empty
# lists, format_compliance = 0, syndrome_consistency capped at 0.5.
action = QubitMedicAction(
    raw_response="X_ERRORS=[]\nZ_ERRORS=[]",
    episode_id=obs.episode_id,
)
result = env.step(action)
print("reward:", result.reward, "done:", result.done)
print("breakdown:", result.info["rewards"])
print("pymatching reference frame:", result.info["pymatching_x_errors"],
      result.info["pymatching_z_errors"])
```

For HTTP usage, hit the live server with curl against `/reset` then `/step` (see the Swagger UI at `/docs`), or use any OpenEnv-compatible client.
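
A minimal HTTP sketch that avoids hard-coding the openenv request envelopes by pulling them from `/schema` first (assumes a local server and `requests`):

```python
import json

import requests

BASE = "http://localhost:7860"

# Discover the exact ResetRequest/StepRequest shapes instead of guessing.
schema = requests.get(f"{BASE}/schema", timeout=10).json()
print(json.dumps(schema, indent=2)[:500])  # action/observation models

print(requests.get(f"{BASE}/health", timeout=10).json())   # liveness payload
print(requests.get(f"{BASE}/healthz", timeout=10).json())  # version probe
```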