# Environment API

Qubit-Medic exposes an OpenEnv-compliant HTTP server built on top of `openenv.core.create_fastapi_app`. The server wraps an internal `DecoderEnvironment` (Stim + PyMatching) through the standard `Action` / `Observation` / `State` Pydantic shapes.

> **Simulation substrate.** Surface-code syndromes are generated with
> **Stim** ([Gidney 2021](https://arxiv.org/abs/2103.02202), *Quantum*
> 5:497), the field-standard Clifford simulator for quantum error
> correction. This is the same simulation engine used by AlphaQubit
> (Bausch et al., *Nature* 2024) and Willow (Acharya et al., 2024) —
> training data is drawn from the same physical model as the published
> benchmarks, not a homemade simulator.

Source files:

- `qubit_medic/server/openenv_adapter.py`
- `qubit_medic/server/app.py`
- `qubit_medic/server/environment.py`

## OpenEnv contract

| Method | Path | Request model | Response model |
|--------|------|---------------|----------------|
| POST | `/reset` | `openenv.core.types.ResetRequest` | `openenv.core.types.ResetResponse` |
| POST | `/step` | `openenv.core.types.StepRequest` | `openenv.core.types.StepResponse` |
| GET | `/state` | (none) | `qubit_medic.server.openenv_adapter.QubitMedicState` |
| POST | `/state` | (none) | `dict` (mirror of GET; compliance audit 2026-04) |
| POST | `/close` | (none) | `{"ok": true, "closed": true}` |
| GET | `/schema` | (none) | JSON Schema for action/observation models |
| GET | `/metadata` | (none) | `EnvironmentMetadata` |
| GET | `/health` | (none) | liveness payload |
| GET | `/healthz` | (none) | versions probe (Stim, PyMatching, openenv, Python) |
| POST | `/decode` | `{"syndrome": [int], "level": str}` | PyMatching baseline result |

The OpenEnv canonical routes (`/reset`, `/step`, `/state`, `/health`, `/schema`, `/metadata`, `/mcp`) are wired automatically by `create_fastapi_app`. The `/healthz`, `/decode`, `POST /state`, `POST /close`, and `/` (HTML landing) routes are mounted on top by `qubit_medic/server/app.py`.

Server entry point: `python -m qubit_medic.server.app` or `uvicorn qubit_medic.server.app:app --host 0.0.0.0 --port 7860`.

## Action dataclass

```python
class QubitMedicAction(Action):
    """LLM-emitted action: the raw text the model generated."""

    raw_response: str = Field(
        default="",
        description="Raw LLM completion text. Server parses to x/z error lists.",
    )
    parsed_x_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed X-error qubit ids (LLM-space). "
        "When provided, the server skips text parsing.",
    )
    parsed_z_errors: Optional[list[int]] = Field(
        default=None,
        description="Optional pre-parsed Z-error qubit ids (LLM-space).",
    )
    episode_id: Optional[int] = Field(
        default=None,
        description="Server-assigned episode id from the matching reset(). "
        "If omitted, the most-recent active episode is used.",
    )
```

Field-level notes:

- `raw_response`: the canonical wire format. The server runs `qubit_medic.prompts.parse_action(raw_response, num_data_qubits)` to recover both error lists. Keeping the wire format as raw text means the server retains full control over parsing, and unparseable outputs surface cleanly via `format_compliance = 0`.
- `parsed_x_errors` / `parsed_z_errors`: a trainer-only escape hatch for baseline policies and unit tests (see the sketch after this list). When set, the server formats a synthetic `X: ... | Z: ...` string before parsing — the same parser path runs either way, so reward semantics are identical.
- `episode_id`: must match the `episode_id` returned by the matching `reset()` call. If `None`, the adapter falls back to the most recent active episode (`self._last_episode_id`). Stale or unknown ids raise `ValueError` from `DecoderEnvironment.step` (compliance audit 2026-04).
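To make the two wire routes concrete, here is a minimal sketch. The qubit ids and `episode_id=0` are illustrative, and the `X_ERRORS=[...]` text syntax is borrowed from the rollout example at the bottom of this page; check `qubit_medic.prompts.parse_action` for the authoritative grammar.

```python
from qubit_medic.server.openenv_adapter import QubitMedicAction

# Canonical route: raw completion text, parsed server-side.
text_action = QubitMedicAction(
    raw_response="X_ERRORS=[2, 5]\nZ_ERRORS=[7]",
    episode_id=0,  # must match the id from the paired reset()
)

# Trainer-only escape hatch: pre-parsed lists. The server renders a
# synthetic "X: ... | Z: ..." string and runs the same parser path,
# so reward semantics are identical to the text route.
baseline_action = QubitMedicAction(
    parsed_x_errors=[2, 5],
    parsed_z_errors=[7],
    episode_id=0,
)
```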
## Observation dataclass

```python
class QubitMedicObservation(Observation):
    """OpenEnv observation - mirrors DecoderObservation plus done/reward."""

    model_config = ConfigDict(
        extra="forbid",
        validate_assignment=True,
        arbitrary_types_allowed=True,
    )

    prompt: str = Field(default="", description="Pre-formatted LLM prompt.")
    syndrome_bits: list[int] = Field(default_factory=list, description="Detector activations (0/1).")
    distance: int = Field(default=0, description="Code distance for this episode.")
    rounds: int = Field(default=0, description="Number of stabilizer rounds.")
    p: float = Field(default=0.0, description="SI1000 base error rate.")
    curriculum_level: str = Field(default="", description="Curriculum level name.")
    episode_id: int = Field(default=0, description="Server-assigned episode counter.")
    dem_digest: str = Field(default="", description="Short hash of the detector error model.")
    info: dict[str, Any] = Field(default_factory=dict,
                                 description="Per-step extras (reward "
                                             "breakdown, ground-truth flip, "
                                             "PyMatching baseline, etc.).")
```

Plus the standard inherited OpenEnv fields:

- `done: bool` — `True` after every `step` (single-step episodes).
- `reward: Optional[float]` — `None` on `reset`, the weighted total in `[0, 1]` after `step`.

`info` payload (after `step`) carries:

| Key | Type | Meaning |
|-----|------|---------|
| `rewards` | `dict[str, float]` | Per-component breakdown (`logical_correction`, `syndrome_consistency`, `hamming_overlap`, `format_compliance`, `pymatching_beat`, `total`) |
| `parsed_action` | `dict` | Deserialised `DecoderAction` (parsed x/z lists, `parse_success`) |
| `actual_observable_flip` | `int` | Stim ground-truth flip of the logical Z observable |
| `pymatching_observable_pred` | `int` | PyMatching's predicted observable flip |
| `pymatching_x_errors` | `list[int]` | PyMatching reference Pauli frame, X axis |
| `pymatching_z_errors` | `list[int]` | PyMatching reference Pauli frame, Z axis |
| `elapsed_seconds` | `float` | Wall time between `reset` and `step` |
| `timed_out` | `bool` | `True` iff `elapsed > EPISODE_TIMEOUT_SECONDS` |
| `curriculum_stats` | `dict` | Live promotion-tracker counters |

## State dataclass

```python
class QubitMedicState(State):
    """Externally-visible state. Physics-truth fields stay server-side."""

    model_config = ConfigDict(
        extra="allow",
        validate_assignment=True,
        arbitrary_types_allowed=True,
    )

    episodes_started: int = 0
    active_episodes: int = 0
    cached_levels: list[str] = Field(default_factory=list)
    curriculum: dict[str, Any] = Field(default_factory=dict)
    last_reward_breakdown: Optional[dict[str, float]] = None
```

The adapter also populates a few inherited base-class fields: `episode_id` (stringified) and `step_count` (which equals `episodes_started`).

Crucially, `QubitMedicState` deliberately omits the ground-truth fields held by the inner `DecoderState`: `true_x_errors`, `true_z_errors`, `actual_observable_flip`, `pymatching_observable_pred`, `circuit_text`, `dem_text`. Those are visible only inside the reward functions — see `docs/REWARD_HACKING.md`.
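To see what actually crosses the wire, a quick probe sketch against a locally running server. The port assumes the default `uvicorn` entry point above, `requests` stands in for any HTTP client, and the response is assumed to mirror the `QubitMedicState` field names:

```python
import requests  # illustrative; any HTTP client works

BASE = "http://localhost:7860"  # default port from the entry point above

# Externally-visible counters only; ground-truth fields never appear here.
state = requests.get(f"{BASE}/state").json()
print(state["episodes_started"], state["active_episodes"], state["cached_levels"])

# Versions probe: Stim, PyMatching, openenv, and Python.
print(requests.get(f"{BASE}/healthz").json())
```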
## Episode lifecycle

Single-step episodes (`done=True` after every `step`):

```
client                          server
------                          ------
POST /reset  ────────────────►  scheduler.sample(level)
                                _cache_for(level)
                                  (compile Stim circuit and PyMatching
                                   matrix once per level)
                                sample_episode(seed)
                                  (Stim shot -> syndrome bits +
                                   observable flip)
                                build_prompt(...)
             ◄────────────────  Observation { prompt, syndrome_bits,
                                  distance, rounds, p, curriculum_level,
                                  episode_id, dem_digest,
                                  done=False, reward=None }

POST /step (action) ─────────►  parse_action(raw_response)
                                compute_all_rewards(...)
                                scheduler.update(...)
                                  (curriculum promotion)
             ◄────────────────  Observation { ..., done=True, reward=total,
                                  info={rewards: {...}, ...} }
```

Calling `step()` with an unknown `episode_id` raises `ValueError` (turned into HTTP 400). Calling `step()` after `EPISODE_TIMEOUT_SECONDS` returns all-zero rewards and `info["timed_out"] = True`.

## Reward computation

After parsing, the env converts predicted qubit IDs from LLM-space (`0..num_data_qubits-1`) into Stim's internal coordinate system via `layout.llm_to_stim`, then runs `compute_all_rewards` (`qubit_medic/server/rewards.py`). Each of the five rewards is a pure function over `(parsed, sample, layout, final_detector_supports)`; the combined total is a weighted sum (weights in `qubit_medic.config.REWARD_WEIGHTS`, mirrored in `openenv.yaml`) clamped to `[0, 1]`. The breakdown is exposed in `info["rewards"]`, the curriculum scheduler is updated using only `logical_correction`, and the episode bookkeeping is dropped (`self._active.pop(episode_id)`). See `docs/REWARD_HACKING.md` for the per-reward semantics.

## Curriculum

Source: `openenv.yaml` (`curriculum:` block) plus `qubit_medic.server.curriculum.CurriculumScheduler`.

| Level | Distance | Rounds | p (SI1000) | Promotion threshold |
|-------|----------|--------|------------|---------------------|
| `L1_warmup` | 3 | 1 | 0.0001 | 0.80 |
| `L2_target` | 3 | 3 | 0.001 | 0.70 |
| `L3_stretch` | 5 | 5 | 0.001 | 0.30 |

The scheduler samples a level on each `reset()`. Promotion thresholds gate progression via the running `logical_correction` rate at the current level. Levels `L1_warmup` and `L2_target` are pre-warmed at server boot (`_get_shared_inner` in the adapter calls `_cache_for` on both); `L3_stretch` compiles lazily on first selection.

## Local rollout example

```python
from qubit_medic.server.openenv_adapter import (
    QubitMedicAction,
    QubitMedicEnvironment,
)

env = QubitMedicEnvironment()
obs = env.reset(seed=42)  # QubitMedicObservation
print("level:", obs.curriculum_level, "syndrome bits:", len(obs.syndrome_bits))
print("prompt preview:", obs.prompt[:120], "...")

# Pretend the LLM emitted nothing useful: the parser will return empty
# lists, format_compliance = 0, syndrome_consistency capped at 0.5.
action = QubitMedicAction(
    raw_response="X_ERRORS=[]\nZ_ERRORS=[]",
    episode_id=obs.episode_id,
)
result = env.step(action)
print("reward:", result.reward, "done:", result.done)
print("breakdown:", result.info["rewards"])
print("pymatching reference frame:",
      result.info["pymatching_x_errors"],
      result.info["pymatching_z_errors"])
```

For HTTP usage, hit the live server with `curl` against `/reset` then `/step` (see the Swagger UI at `/docs`), use any OpenEnv-compatible client, or adapt the sketch below.
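A hedged HTTP version of the same rollout. The request envelopes (empty JSON for `/reset`, an `action` wrapper for `/step`) and the response nesting are assumptions about `openenv.core.types.ResetRequest` / `StepRequest`; treat `GET /schema` as the authoritative contract.

```python
import requests  # illustrative; any HTTP client works

BASE = "http://localhost:7860"

# ASSUMPTION: /reset accepts an empty body and returns the observation
# either at the top level or under "observation". Verify via GET /schema.
reset = requests.post(f"{BASE}/reset", json={}).json()
obs = reset.get("observation", reset)

# ASSUMPTION: /step wraps the action payload under an "action" key.
step = requests.post(
    f"{BASE}/step",
    json={"action": {
        "raw_response": "X_ERRORS=[]\nZ_ERRORS=[]",
        "episode_id": obs["episode_id"],
    }},
).json()
print(step)
```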