OpenEnv 0.2.3 conformance: mount /openenv sub-app, add adapter + tests + example client
Browse files- README.md +57 -0
- example_client.py +80 -0
- openenv.yaml +20 -4
- opensleuth_env/openenv_adapter.py +267 -0
- requirements.txt +10 -3
- server.py +58 -2
- tests/__init__.py +0 -0
- tests/test_env.py +334 -0
- tests/test_openenv_conformance.py +257 -0
README.md
CHANGED
|
@@ -111,6 +111,55 @@ additive. `/functions` returns the same shape as before (with one *additive*
|
|
| 111 |
`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
|
| 112 |
endpoint so older clients aren't surprised.
|
| 113 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 114 |
## Hardware
|
| 115 |
|
| 116 |
CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.
|
|
@@ -120,4 +169,12 @@ CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.
|
|
| 120 |
```bash
|
| 121 |
pip install -r requirements.txt
|
| 122 |
uvicorn server:app --port 7860 --reload
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
```
|
|
|
|
| 111 |
`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
|
| 112 |
endpoint so older clients aren't surprised.
|
| 113 |
|
| 114 |
+
## OpenEnv conformance
|
| 115 |
+
|
| 116 |
+
This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv)
|
| 117 |
+
v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant
|
| 118 |
+
surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy
|
| 119 |
+
endpoints listed above so the in-flight trainer keeps working unchanged.
|
| 120 |
+
|
| 121 |
+
| OpenEnv route | Path | Notes |
|
| 122 |
+
|--------------------------|-----------------------|----------------------------------------------------------|
|
| 123 |
+
| `GET /health` | `/openenv/health` | `{"status": "healthy"}` |
|
| 124 |
+
| `GET /metadata` | `/openenv/metadata` | `EnvironmentMetadata` (name, description, version, ...) |
|
| 125 |
+
| `GET /schema` | `/openenv/schema` | JSON schemas for `action`, `observation`, `state` |
|
| 126 |
+
| `GET /state` | `/openenv/state` | Episode `State` (episode_id, step_count, ...) |
|
| 127 |
+
| `POST /reset` | `/openenv/reset` | Returns `{"observation", "reward", "done"}` envelope |
|
| 128 |
+
| `POST /step` | `/openenv/step` | Body: `{"action": {"action_type": "probe"|"submit", ...}}` |
|
| 129 |
+
| `WS /ws` | `/openenv/ws` | Persistent session: `reset` → `step`* → `state` → `close` |
|
| 130 |
+
|
| 131 |
+
`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses
|
| 132 |
+
`openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware
|
| 133 |
+
harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations,
|
| 134 |
+
LightningAI Studio, ...) can pick it up via standard introspection.
|
| 135 |
+
|
| 136 |
+
### Talking to it as an OpenEnv client
|
| 137 |
+
|
| 138 |
+
```python
|
| 139 |
+
import asyncio
|
| 140 |
+
from openenv import GenericEnvClient, GenericAction
|
| 141 |
+
|
| 142 |
+
async def main():
|
| 143 |
+
base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
|
| 144 |
+
async with GenericEnvClient(base_url=base) as env:
|
| 145 |
+
result = await env.reset(target_name="fibonacci", max_steps=8)
|
| 146 |
+
result = await env.step(GenericAction(action_type="probe", input_repr="10"))
|
| 147 |
+
print(result.observation["probe_history"][-1])
|
| 148 |
+
|
| 149 |
+
asyncio.run(main())
|
| 150 |
+
```
|
| 151 |
+
|
| 152 |
+
A runnable end-to-end example lives in [`example_client.py`](example_client.py).
|
| 153 |
+
|
| 154 |
+
### What is *not* yet conformant
|
| 155 |
+
|
| 156 |
+
* No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not
|
| 157 |
+
MCP tools, because the underlying probe/submit semantics map cleanly to a
|
| 158 |
+
single `OpenSleuthAction` discriminator. Adding MCP would be additive.
|
| 159 |
+
* No Rubric/EvalHarness integration (RFC 004) — reward shaping lives in
|
| 160 |
+
`opensleuth_env/env.py` and is intentionally not split into a separate
|
| 161 |
+
rubric for now.
|
| 162 |
+
|
| 163 |
## Hardware
|
| 164 |
|
| 165 |
CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.
|
|
|
|
| 169 |
```bash
|
| 170 |
pip install -r requirements.txt
|
| 171 |
uvicorn server:app --port 7860 --reload
|
| 172 |
+
# legacy contract: http://localhost:7860/{health,reset,step,state/{eid}}
|
| 173 |
+
# OpenEnv-conformant surface: http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
To run only the OpenEnv conformance tests:
|
| 177 |
+
|
| 178 |
+
```bash
|
| 179 |
+
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
|
| 180 |
```
|
example_client.py
ADDED
|
@@ -0,0 +1,80 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Example: talk to the OpenSleuth env via the upstream OpenEnv client.
|
| 2 |
+
|
| 3 |
+
This script connects to the deployed Space using the canonical OpenEnv
|
| 4 |
+
``GenericEnvClient`` (HTTP+WebSocket) and runs one episode end-to-end.
|
| 5 |
+
|
| 6 |
+
Usage::
|
| 7 |
+
|
| 8 |
+
pip install openenv-core==0.2.3
|
| 9 |
+
python example_client.py # hits the deployed Space
|
| 10 |
+
python example_client.py http://localhost:7860 # against a local server
|
| 11 |
+
|
| 12 |
+
We hit the ``/openenv`` sub-app rather than the legacy bare routes, because
|
| 13 |
+
the OpenEnv client requires an OpenEnv-conformant ``/ws`` WebSocket. The
|
| 14 |
+
legacy ``/reset`` and ``/step`` endpoints used by the in-flight trainer are
|
| 15 |
+
preserved unchanged at the root.
|
| 16 |
+
"""
|
| 17 |
+
|
| 18 |
+
from __future__ import annotations
|
| 19 |
+
|
| 20 |
+
import asyncio
|
| 21 |
+
import sys
|
| 22 |
+
|
| 23 |
+
DEFAULT_BASE = (
|
| 24 |
+
"https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
|
| 25 |
+
)
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
async def main(base_url: str) -> None:
|
| 29 |
+
from openenv import GenericEnvClient, GenericAction
|
| 30 |
+
|
| 31 |
+
print(f"Connecting to {base_url} ...")
|
| 32 |
+
async with GenericEnvClient(base_url=base_url) as env:
|
| 33 |
+
# Reset with the default ('fibonacci') target. Pass any of the legacy
|
| 34 |
+
# OpenSleuth reset kwargs as extra fields; OpenEnv ResetRequest has
|
| 35 |
+
# extra='allow', so target_name / target_code / max_steps / etc. all
|
| 36 |
+
# flow through.
|
| 37 |
+
result = await env.reset(target_name="fibonacci", max_steps=8, seed=42)
|
| 38 |
+
obs = result.observation
|
| 39 |
+
print("\n[reset]")
|
| 40 |
+
print(f" episode_id = {obs['episode_id']}")
|
| 41 |
+
print(f" target = {obs['target_function_name']} ({obs['difficulty']})")
|
| 42 |
+
|
| 43 |
+
# Probe a few inputs.
|
| 44 |
+
for repr_input in ("1", "5", "10", "-1", "'oops'"):
|
| 45 |
+
result = await env.step(
|
| 46 |
+
GenericAction(action_type="probe", input_repr=repr_input)
|
| 47 |
+
)
|
| 48 |
+
last = result.observation["probe_history"][-1]
|
| 49 |
+
print(
|
| 50 |
+
f"[probe {repr_input!s:>8}] -> output={last['output_repr']!r:>30} "
|
| 51 |
+
f"reward={result.reward:+.2f} done={result.done}"
|
| 52 |
+
)
|
| 53 |
+
|
| 54 |
+
# Submit a perfect implementation.
|
| 55 |
+
code = (
|
| 56 |
+
"def fibonacci(n):\n"
|
| 57 |
+
" if not isinstance(n, int) or isinstance(n, bool) or n <= 0 or n > 90:\n"
|
| 58 |
+
" raise ValueError('bad')\n"
|
| 59 |
+
" a, b = 0, 1\n"
|
| 60 |
+
" for _ in range(n - 1):\n"
|
| 61 |
+
" a, b = b, a + b\n"
|
| 62 |
+
" return b\n"
|
| 63 |
+
)
|
| 64 |
+
result = await env.step(GenericAction(action_type="submit", code=code))
|
| 65 |
+
info = result.observation.get("info", {})
|
| 66 |
+
print("\n[submit reference impl]")
|
| 67 |
+
print(f" reward = {result.reward:.2f}")
|
| 68 |
+
print(f" done = {result.done}")
|
| 69 |
+
print(f" info = {info}")
|
| 70 |
+
|
| 71 |
+
# State endpoint sanity check.
|
| 72 |
+
state = await env.state()
|
| 73 |
+
print(f"\n[state] {state}")
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
if __name__ == "__main__":
|
| 77 |
+
base = sys.argv[1] if len(sys.argv) > 1 else DEFAULT_BASE
|
| 78 |
+
if not base.rstrip("/").endswith("/openenv"):
|
| 79 |
+
base = base.rstrip("/") + "/openenv"
|
| 80 |
+
asyncio.run(main(base))
|
openenv.yaml
CHANGED
|
@@ -1,5 +1,21 @@
|
|
|
|
|
| 1 |
name: opensleuth
|
| 2 |
-
version: 0.
|
| 3 |
-
description:
|
| 4 |
-
|
| 5 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
spec_version: 1
|
| 2 |
name: opensleuth
|
| 3 |
+
version: "0.5.0"
|
| 4 |
+
description: >-
|
| 5 |
+
OpenSleuth: an OpenEnv-conformant environment that trains LLMs to
|
| 6 |
+
reverse-engineer hidden Python functions by probing them and submitting code
|
| 7 |
+
that reproduces them. Used for GRPO RL post-training.
|
| 8 |
+
author: anugrah55
|
| 9 |
+
type: space
|
| 10 |
+
runtime: fastapi
|
| 11 |
+
app: server:app
|
| 12 |
+
port: 7860
|
| 13 |
+
action: OpenSleuthAction
|
| 14 |
+
observation: OpenSleuthObservation
|
| 15 |
+
documentation_url: https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli
|
| 16 |
+
tags:
|
| 17 |
+
- rl
|
| 18 |
+
- grpo
|
| 19 |
+
- code
|
| 20 |
+
- openenv
|
| 21 |
+
- openenv-conformant
|
opensleuth_env/openenv_adapter.py
ADDED
|
@@ -0,0 +1,267 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""OpenEnv-conformant adapter for OpenSleuthEnv.
|
| 2 |
+
|
| 3 |
+
Wraps the existing multi-episode :class:`OpenSleuthEnv` registry as a
|
| 4 |
+
single-episode-per-session :class:`openenv.core.env_server.interfaces.Environment`
|
| 5 |
+
so the canonical OpenEnv HTTP / WebSocket protocol can be served alongside
|
| 6 |
+
the legacy ``/reset`` + ``/step`` endpoints the in-flight trainer uses.
|
| 7 |
+
|
| 8 |
+
This module is *additive*. It does not touch the legacy server contract;
|
| 9 |
+
``server.py`` mounts the OpenEnv-style sub-application at ``/openenv/*`` so the
|
| 10 |
+
trainer (which talks to the bare ``/reset`` and ``/step``) is unaffected.
|
| 11 |
+
|
| 12 |
+
The adapter conforms to OpenEnv 0.2.x:
|
| 13 |
+
|
| 14 |
+
* ``Environment.reset(seed, episode_id, **kwargs) -> Observation``
|
| 15 |
+
* ``Environment.step(action, timeout_s, **kwargs) -> Observation``
|
| 16 |
+
* ``Environment.state -> State``
|
| 17 |
+
* ``Environment.get_metadata() -> EnvironmentMetadata``
|
| 18 |
+
|
| 19 |
+
See https://github.com/meta-pytorch/OpenEnv (v0.2.3, BSD-3) for the spec.
|
| 20 |
+
"""
|
| 21 |
+
|
| 22 |
+
from __future__ import annotations
|
| 23 |
+
|
| 24 |
+
from typing import Any, List, Literal, Optional
|
| 25 |
+
from uuid import uuid4
|
| 26 |
+
|
| 27 |
+
from pydantic import Field
|
| 28 |
+
|
| 29 |
+
try:
|
| 30 |
+
from openenv.core.env_server.interfaces import Environment
|
| 31 |
+
from openenv.core.env_server.types import (
|
| 32 |
+
Action as OEAction,
|
| 33 |
+
EnvironmentMetadata,
|
| 34 |
+
Observation as OEObservation,
|
| 35 |
+
State as OEState,
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
OPENENV_AVAILABLE = True
|
| 39 |
+
except ImportError: # pragma: no cover - openenv is required at runtime in the Space
|
| 40 |
+
OPENENV_AVAILABLE = False
|
| 41 |
+
OEAction = object # type: ignore[assignment, misc]
|
| 42 |
+
OEObservation = object # type: ignore[assignment, misc]
|
| 43 |
+
OEState = object # type: ignore[assignment, misc]
|
| 44 |
+
Environment = object # type: ignore[assignment, misc]
|
| 45 |
+
EnvironmentMetadata = object # type: ignore[assignment, misc]
|
| 46 |
+
|
| 47 |
+
from .env import OpenSleuthEnv
|
| 48 |
+
from .models import ProbeAction, SubmitAction
|
| 49 |
+
|
| 50 |
+
|
| 51 |
+
if OPENENV_AVAILABLE:
|
| 52 |
+
|
| 53 |
+
class OpenSleuthAction(OEAction):
|
| 54 |
+
"""Unified OpenEnv-style action.
|
| 55 |
+
|
| 56 |
+
The OpenEnv spec wants a single concrete Action subclass per
|
| 57 |
+
environment; we encode the probe / submit choice via the
|
| 58 |
+
``action_type`` discriminator field. Internally we still translate
|
| 59 |
+
to the original :class:`ProbeAction` / :class:`SubmitAction` so the
|
| 60 |
+
legacy reward shaping is preserved bit-for-bit.
|
| 61 |
+
"""
|
| 62 |
+
|
| 63 |
+
action_type: Literal["probe", "submit"] = Field(
|
| 64 |
+
..., description="Either 'probe' (with input_repr) or 'submit' (with code)."
|
| 65 |
+
)
|
| 66 |
+
input_repr: Optional[str] = Field(
|
| 67 |
+
default=None,
|
| 68 |
+
description="Python literal repr of the probe input. Required when action_type='probe'.",
|
| 69 |
+
)
|
| 70 |
+
code: Optional[str] = Field(
|
| 71 |
+
default=None,
|
| 72 |
+
description="Python source defining the target function. Required when action_type='submit'.",
|
| 73 |
+
)
|
| 74 |
+
|
| 75 |
+
class OpenSleuthObservation(OEObservation):
|
| 76 |
+
"""OpenEnv observation wrapper.
|
| 77 |
+
|
| 78 |
+
OpenEnv's ``Observation`` base class supplies ``done``, ``reward``,
|
| 79 |
+
and ``metadata``. We add OpenSleuth-specific fields for the agent
|
| 80 |
+
(target signature, probe history, etc.). Trainer-facing structured
|
| 81 |
+
info is also surfaced via ``info`` for backwards compat.
|
| 82 |
+
"""
|
| 83 |
+
|
| 84 |
+
episode_id: str = Field(default="", description="Per-session episode id.")
|
| 85 |
+
target_function_name: str = Field(default="")
|
| 86 |
+
target_function_signature: str = Field(
|
| 87 |
+
default="", description="Public signature + docstring for the target."
|
| 88 |
+
)
|
| 89 |
+
probe_history: List[dict] = Field(
|
| 90 |
+
default_factory=list,
|
| 91 |
+
description="Recent probe records (input_repr, output_repr, is_error, ...).",
|
| 92 |
+
)
|
| 93 |
+
last_error: str = Field(default="", description="Last error string, if any.")
|
| 94 |
+
steps_taken: int = Field(default=0)
|
| 95 |
+
max_steps: int = Field(default=25)
|
| 96 |
+
difficulty: Optional[str] = Field(default=None)
|
| 97 |
+
coverage_buckets_seen: int = Field(default=0)
|
| 98 |
+
seen_outputs_count: int = Field(default=0)
|
| 99 |
+
seen_error_types_count: int = Field(default=0)
|
| 100 |
+
info: dict = Field(
|
| 101 |
+
default_factory=dict,
|
| 102 |
+
description="Structured info from the underlying step (matches the legacy info dict).",
|
| 103 |
+
)
|
| 104 |
+
|
| 105 |
+
class OpenSleuthState(OEState):
|
| 106 |
+
"""OpenEnv-style episode state."""
|
| 107 |
+
|
| 108 |
+
target_function_name: Optional[str] = Field(default=None)
|
| 109 |
+
max_steps: int = Field(default=25)
|
| 110 |
+
finished: bool = Field(default=False)
|
| 111 |
+
|
| 112 |
+
class OpenSleuthEnvironment(Environment):
|
| 113 |
+
"""OpenEnv-conformant adapter around :class:`OpenSleuthEnv`.
|
| 114 |
+
|
| 115 |
+
One adapter instance == one episode (one WebSocket session). Inside,
|
| 116 |
+
we keep a single :class:`OpenSleuthEnv` registry but only ever populate
|
| 117 |
+
a single episode at a time.
|
| 118 |
+
|
| 119 |
+
``SUPPORTS_CONCURRENT_SESSIONS = True`` is safe because each WebSocket
|
| 120 |
+
connection in OpenEnv's :class:`HTTPEnvServer` instantiates its own
|
| 121 |
+
:class:`OpenSleuthEnvironment`, and our underlying registries are
|
| 122 |
+
per-instance.
|
| 123 |
+
"""
|
| 124 |
+
|
| 125 |
+
SUPPORTS_CONCURRENT_SESSIONS = True
|
| 126 |
+
|
| 127 |
+
def __init__(self) -> None:
|
| 128 |
+
super().__init__()
|
| 129 |
+
self._env = OpenSleuthEnv()
|
| 130 |
+
self._episode_id: Optional[str] = None
|
| 131 |
+
self._target_function_name: Optional[str] = None
|
| 132 |
+
self._max_steps: int = 25
|
| 133 |
+
self._step_count: int = 0
|
| 134 |
+
self._done: bool = False
|
| 135 |
+
|
| 136 |
+
def reset( # type: ignore[override]
|
| 137 |
+
self,
|
| 138 |
+
seed: Optional[int] = None,
|
| 139 |
+
episode_id: Optional[str] = None,
|
| 140 |
+
target_name: Optional[str] = None,
|
| 141 |
+
target_code: Optional[str] = None,
|
| 142 |
+
target_function_name: Optional[str] = None,
|
| 143 |
+
max_steps: int = 25,
|
| 144 |
+
edge_cases: Optional[list] = None,
|
| 145 |
+
fuzz_spec: Optional[dict] = None,
|
| 146 |
+
**kwargs: Any,
|
| 147 |
+
) -> "OpenSleuthObservation":
|
| 148 |
+
# Default to a builtin so a bare reset() still produces a valid
|
| 149 |
+
# episode (per OpenEnv spec, reset() with no args must work).
|
| 150 |
+
if not target_name and not target_code:
|
| 151 |
+
target_name = "fibonacci"
|
| 152 |
+
obs = self._env.reset(
|
| 153 |
+
target_name=target_name,
|
| 154 |
+
seed=seed if seed is not None else 0,
|
| 155 |
+
max_steps=max_steps,
|
| 156 |
+
target_code=target_code,
|
| 157 |
+
target_function_name=target_function_name,
|
| 158 |
+
edge_cases=edge_cases,
|
| 159 |
+
fuzz_spec=fuzz_spec,
|
| 160 |
+
)
|
| 161 |
+
self._episode_id = episode_id or obs.episode_id
|
| 162 |
+
self._target_function_name = obs.target_function_name
|
| 163 |
+
self._max_steps = max_steps
|
| 164 |
+
self._step_count = 0
|
| 165 |
+
self._done = False
|
| 166 |
+
return self._wrap_obs(obs, reward=None, done=False, info={})
|
| 167 |
+
|
| 168 |
+
def step( # type: ignore[override]
|
| 169 |
+
self,
|
| 170 |
+
action: "OpenSleuthAction",
|
| 171 |
+
timeout_s: Optional[float] = None,
|
| 172 |
+
**kwargs: Any,
|
| 173 |
+
) -> "OpenSleuthObservation":
|
| 174 |
+
if self._episode_id is None:
|
| 175 |
+
# Auto-reset on first step with the default target so HTTP /step
|
| 176 |
+
# smoke tests don't 500 just because /reset wasn't called first.
|
| 177 |
+
self.reset()
|
| 178 |
+
|
| 179 |
+
internal_action: Any
|
| 180 |
+
if action.action_type == "probe":
|
| 181 |
+
if action.input_repr is None:
|
| 182 |
+
raise ValueError(
|
| 183 |
+
"OpenSleuthAction(action_type='probe') requires input_repr."
|
| 184 |
+
)
|
| 185 |
+
internal_action = ProbeAction(input_repr=action.input_repr)
|
| 186 |
+
elif action.action_type == "submit":
|
| 187 |
+
if action.code is None:
|
| 188 |
+
raise ValueError(
|
| 189 |
+
"OpenSleuthAction(action_type='submit') requires code."
|
| 190 |
+
)
|
| 191 |
+
internal_action = SubmitAction(code=action.code)
|
| 192 |
+
else: # pragma: no cover - Pydantic Literal already constrains this
|
| 193 |
+
raise ValueError(f"Unknown action_type: {action.action_type!r}")
|
| 194 |
+
|
| 195 |
+
assert self._episode_id is not None
|
| 196 |
+
resp = self._env.step(self._episode_id, internal_action)
|
| 197 |
+
self._step_count += 1
|
| 198 |
+
self._done = resp.done
|
| 199 |
+
return self._wrap_obs(
|
| 200 |
+
resp.observation, reward=resp.reward, done=resp.done, info=resp.info
|
| 201 |
+
)
|
| 202 |
+
|
| 203 |
+
@property
|
| 204 |
+
def state(self) -> "OpenSleuthState": # type: ignore[override]
|
| 205 |
+
return OpenSleuthState(
|
| 206 |
+
episode_id=self._episode_id,
|
| 207 |
+
step_count=self._step_count,
|
| 208 |
+
target_function_name=self._target_function_name,
|
| 209 |
+
max_steps=self._max_steps,
|
| 210 |
+
finished=self._done,
|
| 211 |
+
)
|
| 212 |
+
|
| 213 |
+
def get_metadata(self) -> "EnvironmentMetadata": # type: ignore[override]
|
| 214 |
+
return EnvironmentMetadata(
|
| 215 |
+
name="OpenSleuth",
|
| 216 |
+
description=(
|
| 217 |
+
"Algorithmic detective: probe a hidden Python function then submit "
|
| 218 |
+
"code that reproduces it. Used for GRPO RL training on Qwen-2.5."
|
| 219 |
+
),
|
| 220 |
+
version="0.4.1",
|
| 221 |
+
author="OpenSleuth team",
|
| 222 |
+
documentation_url=(
|
| 223 |
+
"https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli"
|
| 224 |
+
),
|
| 225 |
+
)
|
| 226 |
+
|
| 227 |
+
def close(self) -> None: # type: ignore[override]
|
| 228 |
+
self._episode_id = None
|
| 229 |
+
self._target_function_name = None
|
| 230 |
+
self._step_count = 0
|
| 231 |
+
self._done = False
|
| 232 |
+
|
| 233 |
+
def _wrap_obs(
|
| 234 |
+
self,
|
| 235 |
+
internal_obs: Any,
|
| 236 |
+
*,
|
| 237 |
+
reward: Optional[float],
|
| 238 |
+
done: bool,
|
| 239 |
+
info: dict,
|
| 240 |
+
) -> "OpenSleuthObservation":
|
| 241 |
+
return OpenSleuthObservation(
|
| 242 |
+
done=done,
|
| 243 |
+
reward=reward,
|
| 244 |
+
episode_id=internal_obs.episode_id,
|
| 245 |
+
target_function_name=internal_obs.target_function_name,
|
| 246 |
+
target_function_signature=internal_obs.target_function_signature,
|
| 247 |
+
probe_history=[r.model_dump() for r in internal_obs.probe_history],
|
| 248 |
+
last_error=internal_obs.last_error,
|
| 249 |
+
steps_taken=internal_obs.steps_taken,
|
| 250 |
+
max_steps=internal_obs.max_steps,
|
| 251 |
+
difficulty=internal_obs.difficulty,
|
| 252 |
+
coverage_buckets_seen=internal_obs.coverage_buckets_seen,
|
| 253 |
+
seen_outputs_count=internal_obs.seen_outputs_count,
|
| 254 |
+
seen_error_types_count=internal_obs.seen_error_types_count,
|
| 255 |
+
info=info,
|
| 256 |
+
metadata={"info": info},
|
| 257 |
+
)
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
__all__ = ["OPENENV_AVAILABLE"]
|
| 261 |
+
if OPENENV_AVAILABLE:
|
| 262 |
+
__all__ += [
|
| 263 |
+
"OpenSleuthAction",
|
| 264 |
+
"OpenSleuthObservation",
|
| 265 |
+
"OpenSleuthState",
|
| 266 |
+
"OpenSleuthEnvironment",
|
| 267 |
+
]
|
requirements.txt
CHANGED
|
@@ -1,8 +1,15 @@
|
|
| 1 |
-
fastapi==0.
|
| 2 |
-
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
| 4 |
# Level 2: Hub-driven task catalog. We swallow load failures at runtime so
|
| 5 |
# the env still functions if Hub is offline, but the dependency is required
|
| 6 |
# for Hub-backed tasks to be discoverable.
|
| 7 |
datasets>=3.0.0
|
| 8 |
huggingface_hub>=0.25.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# fastapi >=0.118 / starlette >=0.48 are required because openenv-core 0.2.3
|
| 2 |
+
# references status.HTTP_422_UNPROCESSABLE_CONTENT (added in starlette 0.48).
|
| 3 |
+
fastapi>=0.118.0
|
| 4 |
+
starlette>=0.48.0
|
| 5 |
+
uvicorn[standard]>=0.32.1
|
| 6 |
+
pydantic>=2.10.3
|
| 7 |
# Level 2: Hub-driven task catalog. We swallow load failures at runtime so
|
| 8 |
# the env still functions if Hub is offline, but the dependency is required
|
| 9 |
# for Hub-backed tasks to be discoverable.
|
| 10 |
datasets>=3.0.0
|
| 11 |
huggingface_hub>=0.25.0
|
| 12 |
+
# Hackathon conformance: meta-pytorch/OpenEnv 0.2.x -- exposes the canonical
|
| 13 |
+
# /openenv/{reset,step,state,health,metadata,schema,ws} surface alongside our
|
| 14 |
+
# legacy contract. See opensleuth_env/openenv_adapter.py.
|
| 15 |
+
openenv-core==0.2.3
|
server.py
CHANGED
|
@@ -1,4 +1,17 @@
|
|
| 1 |
-
"""FastAPI server exposing the OpenSleuth environment over HTTP.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2 |
|
| 3 |
from __future__ import annotations
|
| 4 |
|
|
@@ -23,10 +36,53 @@ from opensleuth_env.task_catalog import TaskResolutionError
|
|
| 23 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
|
| 24 |
log = logging.getLogger("opensleuth.server")
|
| 25 |
|
| 26 |
-
app = FastAPI(title="OpenSleuth Env", version="0.
|
| 27 |
env = OpenSleuthEnv()
|
| 28 |
|
| 29 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 30 |
@app.get("/health")
|
| 31 |
def health():
|
| 32 |
return {
|
|
|
|
| 1 |
+
"""FastAPI server exposing the OpenSleuth environment over HTTP.
|
| 2 |
+
|
| 3 |
+
Two HTTP surfaces are served from this app:
|
| 4 |
+
|
| 5 |
+
* The legacy OpenSleuth contract (``/health``, ``/functions``, ``/tasks``,
|
| 6 |
+
``/reset``, ``/step``, ``/state/{episode_id}``, ``/probe_once``) used by the
|
| 7 |
+
in-flight trainer and eval harness.
|
| 8 |
+
* The OpenEnv-conformant sub-app mounted at ``/openenv/*`` (added in v0.5.0
|
| 9 |
+
for hackathon conformance) -- exposes ``/openenv/reset``, ``/openenv/step``,
|
| 10 |
+
``/openenv/state``, ``/openenv/health``, ``/openenv/metadata``,
|
| 11 |
+
``/openenv/schema``, and the canonical ``/openenv/ws`` WebSocket. See
|
| 12 |
+
:mod:`opensleuth_env.openenv_adapter` and
|
| 13 |
+
https://github.com/meta-pytorch/OpenEnv (v0.2.3).
|
| 14 |
+
"""
|
| 15 |
|
| 16 |
from __future__ import annotations
|
| 17 |
|
|
|
|
| 36 |
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
|
| 37 |
log = logging.getLogger("opensleuth.server")
|
| 38 |
|
| 39 |
+
app = FastAPI(title="OpenSleuth Env", version="0.5.0")
|
| 40 |
env = OpenSleuthEnv()
|
| 41 |
|
| 42 |
|
| 43 |
+
# ---------------------------------------------------------------------------
|
| 44 |
+
# OpenEnv conformance: mount an upstream-spec sub-app at /openenv.
|
| 45 |
+
# This is kept additive so the existing trainer (which talks to the bare
|
| 46 |
+
# /reset and /step routes above) is completely unaffected.
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
|
| 49 |
+
try:
|
| 50 |
+
from openenv.core.env_server.http_server import HTTPEnvServer
|
| 51 |
+
|
| 52 |
+
from opensleuth_env.openenv_adapter import (
|
| 53 |
+
OPENENV_AVAILABLE,
|
| 54 |
+
OpenSleuthAction,
|
| 55 |
+
OpenSleuthEnvironment,
|
| 56 |
+
OpenSleuthObservation,
|
| 57 |
+
)
|
| 58 |
+
|
| 59 |
+
if OPENENV_AVAILABLE:
|
| 60 |
+
openenv_app = FastAPI(
|
| 61 |
+
title="OpenSleuth (OpenEnv-conformant)",
|
| 62 |
+
version="0.5.0",
|
| 63 |
+
description=(
|
| 64 |
+
"OpenEnv 0.2.x conformant surface for the OpenSleuth environment.\n\n"
|
| 65 |
+
"See https://github.com/meta-pytorch/OpenEnv -- this sub-app implements"
|
| 66 |
+
" the canonical reset/step/state/health/metadata/schema HTTP routes plus"
|
| 67 |
+
" the /ws WebSocket session protocol."
|
| 68 |
+
),
|
| 69 |
+
)
|
| 70 |
+
_openenv_server = HTTPEnvServer(
|
| 71 |
+
env=OpenSleuthEnvironment,
|
| 72 |
+
action_cls=OpenSleuthAction,
|
| 73 |
+
observation_cls=OpenSleuthObservation,
|
| 74 |
+
max_concurrent_envs=8,
|
| 75 |
+
)
|
| 76 |
+
_openenv_server.register_routes(openenv_app)
|
| 77 |
+
app.mount("/openenv", openenv_app)
|
| 78 |
+
log.info("Mounted OpenEnv-conformant sub-app at /openenv (openenv-core %s)",
|
| 79 |
+
_openenv_server.__class__.__module__)
|
| 80 |
+
else: # pragma: no cover
|
| 81 |
+
log.warning("openenv-core not importable; /openenv/* will be unavailable.")
|
| 82 |
+
except Exception as e: # pragma: no cover - fail open so legacy routes keep working
|
| 83 |
+
log.warning("Could not register OpenEnv sub-app: %s: %s", type(e).__name__, e)
|
| 84 |
+
|
| 85 |
+
|
| 86 |
@app.get("/health")
|
| 87 |
def health():
|
| 88 |
return {
|
tests/__init__.py
ADDED
|
File without changes
|
tests/test_env.py
ADDED
|
@@ -0,0 +1,334 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Unit tests for the OpenSleuth env + verifier.
|
| 2 |
+
|
| 3 |
+
Run with `pytest -q` from the env/ directory.
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
from __future__ import annotations
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
|
| 10 |
+
from opensleuth_env import (
|
| 11 |
+
BLACK_BOX_FUNCTIONS,
|
| 12 |
+
OpenSleuthEnv,
|
| 13 |
+
ProbeAction,
|
| 14 |
+
SubmitAction,
|
| 15 |
+
)
|
| 16 |
+
from opensleuth_env.env import _bucket_of, NEW_BUCKET_BONUS, NEW_OUTPUT_BONUS, PROBE_STEP_COST
|
| 17 |
+
from opensleuth_env.verifier import (
|
| 18 |
+
calculate_complexity_penalty,
|
| 19 |
+
generate_fuzz_inputs,
|
| 20 |
+
get_edge_inputs,
|
| 21 |
+
verify_submission,
|
| 22 |
+
_looks_like_reference_import,
|
| 23 |
+
)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
# ---------- env transitions ------------------------------------------------
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def test_reset_returns_episode_id_and_signature():
|
| 30 |
+
env = OpenSleuthEnv()
|
| 31 |
+
obs = env.reset("fibonacci")
|
| 32 |
+
assert obs.episode_id
|
| 33 |
+
assert obs.target_function_name == "fibonacci"
|
| 34 |
+
assert "fibonacci" in obs.target_function_signature
|
| 35 |
+
assert obs.probe_history == []
|
| 36 |
+
assert obs.steps_taken == 0
|
| 37 |
+
# New v0.3 metadata.
|
| 38 |
+
assert obs.difficulty == "easy"
|
| 39 |
+
assert obs.coverage_buckets_seen == 0
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
def test_unknown_target_raises():
    """Resetting to a function not in the catalogue must fail loudly."""
    environment = OpenSleuthEnv()
    with pytest.raises(ValueError):
        environment.reset("not_a_real_function")
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
def test_probe_with_int_input_records_output():
    """A well-formed integer probe executes and records its output."""
    environment = OpenSleuthEnv()
    episode = environment.reset("fibonacci")
    result = environment.step(episode.episode_id, ProbeAction(input_repr="10"))
    assert result.done is False
    last_probe = result.observation.probe_history[-1]
    assert last_probe.is_error is False
    assert last_probe.output_repr == "55"
    # The first successful probe earns both novelty bonuses plus the step cost.
    assert result.reward == pytest.approx(
        NEW_OUTPUT_BONUS + NEW_BUCKET_BONUS + PROBE_STEP_COST
    )
    assert result.info["coverage_bonus"] == pytest.approx(NEW_BUCKET_BONUS)
    assert result.info["bucket"] == "int:medium"
    assert result.observation.coverage_buckets_seen == 1
    assert result.observation.seen_outputs_count == 1
|
| 62 |
+
|
| 63 |
+
|
| 64 |
+
def test_probe_with_invalid_literal_returns_parse_error():
    """Garbage probe text is surfaced as a ParseError, never a crash."""
    environment = OpenSleuthEnv()
    episode = environment.reset("fibonacci")
    outcome = environment.step(
        episode.episode_id, ProbeAction(input_repr="not a literal")
    )
    assert outcome.done is False
    assert outcome.observation.probe_history[-1].error_type == "ParseError"
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
def test_repeated_output_only_pays_intrinsic_once():
    """Re-probing an already-seen input earns only the flat step cost."""
    environment = OpenSleuthEnv()
    episode = environment.reset("fibonacci")
    first = environment.step(episode.episode_id, ProbeAction(input_repr="10"))
    second = environment.step(episode.episode_id, ProbeAction(input_repr="10"))
    assert first.reward > second.reward
    # A second hit on the same bucket+output pays only the per-step cost.
    assert second.reward == pytest.approx(PROBE_STEP_COST)
|
| 80 |
+
|
| 81 |
+
|
| 82 |
+
def test_step_limit_terminates_episode():
    """Reaching max_steps flips `done` on the final transition."""
    environment = OpenSleuthEnv()
    episode = environment.reset("fibonacci", max_steps=2)
    environment.step(episode.episode_id, ProbeAction(input_repr="1"))
    final = environment.step(episode.episode_id, ProbeAction(input_repr="2"))
    assert final.done is True
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
def test_unknown_episode_id_raises():
    """Stepping a nonexistent episode raises KeyError rather than failing silently."""
    environment = OpenSleuthEnv()
    with pytest.raises(KeyError):
        environment.step("does-not-exist", ProbeAction(input_repr="1"))
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
# ---------- coverage bucketing (CovRL-Fuzz inspired) -----------------------
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def test_bucket_of_distinguishes_qualitative_input_classes():
    """Each qualitatively different input lands in its own coverage bucket."""
    expected_buckets = [
        (0, "int:zero"),
        (-1, "int:negative"),
        (5, "int:small"),
        (50, "int:medium"),
        (5000, "int:large"),
        (50_000, "int:huge"),
        ("", "str:empty"),
        ("a", "str:singleton"),
        ([], "list:empty"),
        ((1, 2), "tuple:short"),
        (True, "bool:True"),  # bool must not be folded into the int buckets
        (None, "none"),
    ]
    for value, bucket in expected_buckets:
        assert _bucket_of(value) == bucket
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
def test_probe_distinct_buckets_each_pay_coverage_bonus():
    """Each previously-unseen bucket pays the bonus; revisits pay nothing."""
    environment = OpenSleuthEnv()
    episode = environment.reset("fibonacci")
    # 1 -> int:small (new), 50 -> int:medium (new), 5 -> int:small (revisit).
    small = environment.step(episode.episode_id, ProbeAction(input_repr="1"))
    medium = environment.step(episode.episode_id, ProbeAction(input_repr="50"))
    repeat = environment.step(episode.episode_id, ProbeAction(input_repr="5"))
    assert small.info["coverage_bonus"] == pytest.approx(NEW_BUCKET_BONUS)
    assert medium.info["coverage_bonus"] == pytest.approx(NEW_BUCKET_BONUS)
    assert repeat.info["coverage_bonus"] == pytest.approx(0.0)
    assert repeat.observation.coverage_buckets_seen == 2
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
# ---------- verifier -------------------------------------------------------
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
def test_verifier_perfect_score_on_reference_impl():
    """A correct reimplementation earns the full execution reward, passes
    every edge input, and incurs no penalties."""
    spec = BLACK_BOX_FUNCTIONS["fibonacci"]
    # Iterative fibonacci mirroring the hidden reference, including its
    # input-validation contract (positive int, capped at 90).
    code = (
        "def fibonacci(n):\n"
        "    if not isinstance(n, int) or n <= 0 or n > 90:\n"
        "        raise ValueError('bad')\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n - 1):\n"
        "        a, b = b, a + b\n"
        "    return b\n"
    )
    inputs = generate_fuzz_inputs(spec, count=30, seed=0)
    edges = get_edge_inputs(spec)
    result = verify_submission(code, spec.fn, inputs, target_name="fibonacci", edge_inputs=edges)
    # All 30 random fuzz inputs plus every declared edge input must match.
    assert result.matches == 30 + len(edges)
    assert result.execution_reward == pytest.approx(100.0)
    assert result.edge_pass_rate == pytest.approx(1.0)
    assert result.floor_penalty == 0.0
    assert result.reward_hack_penalty == 0.0
|
| 149 |
+
|
| 150 |
+
|
| 151 |
+
def test_verifier_partial_score_on_buggy_impl():
    """An off-by-one implementation matches nothing and trips the hard floor."""
    spec = BLACK_BOX_FUNCTIONS["fibonacci"]
    # Identical to the reference fibonacci except it returns b + 1,
    # which is wrong for every valid input.
    buggy = (
        "def fibonacci(n):\n"
        "    if not isinstance(n, int) or n <= 0 or n > 90:\n"
        "        raise ValueError('bad')\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n - 1):\n"
        "        a, b = b, a + b\n"
        "    return b + 1\n"
    )
    inputs = generate_fuzz_inputs(spec, count=30, seed=0)
    result = verify_submission(buggy, spec.fn, inputs, target_name="fibonacci")
    assert result.execution_reward == pytest.approx(0.0)
    assert result.matches == 0
    # Sub-50% match rate triggers the hard floor.
    assert result.floor_penalty == 25.0
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
def test_verifier_syntax_error_returns_define_error_and_full_penalty():
    """Unparseable code fails at definition time and takes the full
    complexity + floor penalties with zero execution reward."""
    spec = BLACK_BOX_FUNCTIONS["fibonacci"]
    inputs = generate_fuzz_inputs(spec, count=10, seed=0)
    # "def fib(:" is a SyntaxError — the target can never be defined.
    result = verify_submission("def fib(:\n pass", spec.fn, inputs, target_name="fibonacci")
    assert result.define_error is not None
    assert result.execution_reward == 0.0
    assert result.complexity_penalty == 50.0
    assert result.floor_penalty == 25.0
|
| 178 |
+
|
| 179 |
+
|
| 180 |
+
def test_verifier_missing_target_returns_error():
    """Code that never defines the target function scores zero with an error."""
    spec = BLACK_BOX_FUNCTIONS["fibonacci"]
    fuzz = generate_fuzz_inputs(spec, count=10, seed=0)
    outcome = verify_submission(
        "def other(x): return x", spec.fn, fuzz, target_name="fibonacci"
    )
    assert outcome.define_error is not None
    assert outcome.execution_reward == 0.0
|
| 186 |
+
|
| 187 |
+
|
| 188 |
+
def test_complexity_penalty_low_for_simple_fn():
    """A one-line identity function incurs essentially no complexity penalty."""
    assert calculate_complexity_penalty("def f(x): return x\n") < 1.0
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
def test_complexity_penalty_high_for_branchy_fn():
    """A 100-branch if-chain must be penalised as over-complex."""
    branches = "\n    ".join(f"if x == {i}: return {i}" for i in range(100))
    source = f"def f(x):\n    {branches}\n    return -1\n"
    assert calculate_complexity_penalty(source) > 5.0
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
# ---------- anti-reward-hacking --------------------------------------------
|
| 200 |
+
|
| 201 |
+
|
| 202 |
+
def test_sandbox_blocks_import_of_reference_module():
    """Critical regression: previously an agent could write::

        from opensleuth_env.black_box import _fibonacci
        def fibonacci(n): return _fibonacci(n)

    and reward-hack to a perfect score. The hardened sandbox must block this.
    """
    spec = BLACK_BOX_FUNCTIONS["fibonacci"]
    # The exact hack shape from the original incident report.
    hack = (
        "def fibonacci(n):\n"
        "    from opensleuth_env.black_box import _fibonacci\n"
        "    return _fibonacci(n)\n"
    )
    inputs = generate_fuzz_inputs(spec, count=10, seed=0)
    result = verify_submission(hack, spec.fn, inputs, target_name="fibonacci")
    # Either definition fails (no __import__) or per-call fails. Either way
    # the agent must NOT score positively.
    assert result.execution_reward < 50.0
    # Static detector flagged the import attempt.
    assert result.reward_hack_penalty >= 25.0
|
| 223 |
+
|
| 224 |
+
|
| 225 |
+
def test_static_detector_flags_opensleuth_import():
    """The static scanner flags any mention of the reference package."""
    hacky = "import opensleuth_env\ndef f(x): return x\n"
    assert _looks_like_reference_import(hacky) is True
    assert _looks_like_reference_import("def f(x): return x\n") is False
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
def test_constant_function_collapse_is_penalised():
    """An agent that learns to always return the same value should be
    penalised even if some random inputs happen to match (e.g. for
    `digit_sum`, `lambda x: 0` matches only x=0)."""
    spec = BLACK_BOX_FUNCTIONS["digit_sum"]
    constant_impl = "def digit_sum(n):\n    return 0\n"
    probe_inputs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 999]
    outcome = verify_submission(
        constant_impl, spec.fn, probe_inputs, target_name="digit_sum"
    )
    # Every distinct input yields 0 (one output signature) while the
    # reference produces many — the collapse detector must fire.
    assert outcome.reward_hack_penalty >= 15.0
|
| 241 |
+
|
| 242 |
+
|
| 243 |
+
def test_sandbox_blocks_open_and_eval():
    """Filesystem builtins like ``open`` must be unavailable in the sandbox."""
    spec = BLACK_BOX_FUNCTIONS["fibonacci"]
    bad = (
        "def fibonacci(n):\n"
        "    open('/tmp/x', 'w')\n"
        "    return 0\n"
    )
    inputs = generate_fuzz_inputs(spec, count=5, seed=0)
    result = verify_submission(bad, spec.fn, inputs, target_name="fibonacci")
    # Either the per-call NameError on `open` makes everything mismatch,
    # or it raises at definition time. Either way, low reward.
    assert result.execution_reward < 50.0
|
| 255 |
+
|
| 256 |
+
|
| 257 |
+
# ---------- stratified scoring (edge vs random) ----------------------------
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
def test_edge_cases_are_always_evaluated():
    """Declared edge inputs are scored even when the fuzzer never rolls them."""
    spec = BLACK_BOX_FUNCTIONS["reverse_string"]
    # Submission that fails the empty-string edge case but works for non-empty.
    code = (
        "def reverse_string(s):\n"
        "    if s == '':\n"
        "        return 'OOPS'\n"
        "    return s[::-1]\n"
    )
    inputs = generate_fuzz_inputs(spec, count=20, seed=0)
    edges = get_edge_inputs(spec)
    assert "" in edges
    result = verify_submission(
        code, spec.fn, inputs, target_name="reverse_string", edge_inputs=edges
    )
    # Should pass most random + most edge except the empty-string edge case.
    assert result.matches_by_category["edge"] == len(edges) - 1
    assert result.edge_pass_rate < 1.0
    assert result.matches_by_category["random"] >= 18  # very rare to roll empty
|
| 279 |
+
|
| 280 |
+
|
| 281 |
+
# ---------- end-to-end submission via env ----------------------------------
|
| 282 |
+
|
| 283 |
+
|
| 284 |
+
def test_env_submit_reference_implementation_gives_high_reward():
    """A perfect submission ends the episode with the full bonus stack."""
    environment = OpenSleuthEnv(fuzz_count=20)
    episode = environment.reset("reverse_string")
    submission = "def reverse_string(s):\n    return s[::-1]\n"
    outcome = environment.step(episode.episode_id, SubmitAction(code=submission))
    assert outcome.done is True
    # 100 execution - tiny complexity penalty + 50 perfect bonus.
    assert outcome.reward > 140.0
    assert outcome.info["execution_reward"] == pytest.approx(100.0)
    assert outcome.info["edge_pass_rate"] == pytest.approx(1.0)
    assert outcome.info["floor_penalty"] == 0.0
    assert outcome.info["reward_hack_penalty"] == 0.0
    assert outcome.info["perfect_bonus"] == 50.0
|
| 297 |
+
|
| 298 |
+
|
| 299 |
+
def test_env_submit_buggy_function_lands_clearly_negative():
    """Wrong submissions must end up clearly negative so the trainer's GRPO
    advantage penalises 'just emit any function'."""
    environment = OpenSleuthEnv(fuzz_count=10)
    episode = environment.reset("digit_sum")
    outcome = environment.step(
        episode.episode_id, SubmitAction(code="def digit_sum(n):\n    return -1\n")
    )
    assert outcome.done is True
    assert outcome.info["execution_reward"] < 50.0
    assert outcome.reward < 0.0
    assert outcome.info["floor_penalty"] == 25.0
|
| 310 |
+
|
| 311 |
+
|
| 312 |
+
def test_env_submit_import_hack_scores_clearly_negative():
    """End-to-end: the import-based reward hack submitted through the env
    must terminate the episode with a clearly negative reward."""
    env = OpenSleuthEnv(fuzz_count=10)
    obs = env.reset("fibonacci")
    code = (
        "def fibonacci(n):\n"
        "    from opensleuth_env.black_box import _fibonacci\n"
        "    return _fibonacci(n)\n"
    )
    resp = env.step(obs.episode_id, SubmitAction(code=code))
    assert resp.done is True
    assert resp.reward < 0.0
    assert resp.info["reward_hack_penalty"] >= 25.0
|
| 324 |
+
|
| 325 |
+
|
| 326 |
+
# ---------- spec metadata --------------------------------------------------
|
| 327 |
+
|
| 328 |
+
|
| 329 |
+
def test_all_specs_have_difficulty_and_edge_cases():
    """Every catalogue entry carries a valid difficulty and enough edge cases."""
    allowed = {"easy", "medium", "hard"}
    for fn_name, fn_spec in BLACK_BOX_FUNCTIONS.items():
        assert fn_spec.difficulty in allowed, (
            f"{fn_name} has invalid difficulty {fn_spec.difficulty!r}"
        )
        assert isinstance(fn_spec.edge_cases, list)
        assert len(fn_spec.edge_cases) >= 3, (
            f"{fn_name} should declare >=3 edge cases for robust scoring"
        )
|
tests/test_openenv_conformance.py
ADDED
|
@@ -0,0 +1,257 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""OpenEnv 0.2.x protocol conformance tests for the OpenSleuth env.
|
| 2 |
+
|
| 3 |
+
These tests are *additive* and orthogonal to the existing legacy contract
|
| 4 |
+
covered in ``test_env.py`` / ``test_open_env.py``.
|
| 5 |
+
|
| 6 |
+
What we verify:
|
| 7 |
+
|
| 8 |
+
* The OpenEnv ``Environment`` adapter (:class:`OpenSleuthEnvironment`) implements
|
| 9 |
+
all four required methods (``reset`` / ``step`` / ``state`` / ``get_metadata``)
|
| 10 |
+
and returns instances of OpenEnv's ``Observation`` / ``State`` /
|
| 11 |
+
``EnvironmentMetadata`` base classes (so it would pass any ``isinstance``
|
| 12 |
+
check by an OpenEnv-aware harness).
|
| 13 |
+
* The ``/openenv/*`` HTTP sub-app exposes every endpoint OpenEnv 0.2.x
|
| 14 |
+
promises: ``/health``, ``/metadata``, ``/schema``, ``/state``, ``/reset``,
|
| 15 |
+
``/step``. (The ``/ws`` WebSocket is exercised separately via the
|
| 16 |
+
``smoke_openenv_client.py`` script run against the live Space.)
|
| 17 |
+
* ``/openenv/reset`` returns the canonical ``{"observation", "reward", "done"}``
|
| 18 |
+
envelope (NOT a bare observation, which is the legacy shape).
|
| 19 |
+
* ``/openenv/step`` accepts the canonical ``{"action": {...}}`` envelope (NOT
|
| 20 |
+
``{"episode_id", "action"}``, which is the legacy shape).
|
| 21 |
+
* The legacy bare ``/reset`` and ``/step`` routes the trainer uses are
|
| 22 |
+
untouched.
|
| 23 |
+
"""
|
| 24 |
+
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
import pytest
|
| 28 |
+
|
| 29 |
+
pytest.importorskip(
|
| 30 |
+
"openenv.core.env_server.types",
|
| 31 |
+
reason="openenv-core not installed; conformance tests skipped.",
|
| 32 |
+
)
|
| 33 |
+
|
| 34 |
+
from fastapi.testclient import TestClient
|
| 35 |
+
from openenv.core.env_server.types import (
|
| 36 |
+
EnvironmentMetadata,
|
| 37 |
+
Observation as OEObservation,
|
| 38 |
+
State as OEState,
|
| 39 |
+
)
|
| 40 |
+
|
| 41 |
+
from opensleuth_env.openenv_adapter import (
|
| 42 |
+
OpenSleuthAction,
|
| 43 |
+
OpenSleuthEnvironment,
|
| 44 |
+
OpenSleuthObservation,
|
| 45 |
+
OpenSleuthState,
|
| 46 |
+
)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
# Adapter-level: exercises the Environment subclass directly (no HTTP).
|
| 51 |
+
# ---------------------------------------------------------------------------
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
class TestEnvironmentSubclass:
    """Exercises the OpenEnv ``Environment`` adapter directly (no HTTP)."""

    def test_observation_inherits_openenv_base(self) -> None:
        # The adapter's observation type must be a real OpenEnv Observation
        # subclass, not a lookalike, so isinstance checks in harnesses pass.
        env = OpenSleuthEnvironment()
        obs = env.reset()
        assert isinstance(obs, OEObservation), (
            "OpenSleuthObservation must subclass openenv.core...types.Observation "
            "so OpenEnv tooling (rubrics, evals, web UI) can introspect it."
        )
        # Must expose the OpenEnv-required fields.
        assert obs.done is False
        assert obs.reward is None
        assert isinstance(obs.metadata, dict)

    def test_state_inherits_openenv_base(self) -> None:
        # state is read *after* reset: a fresh episode id, zero steps taken.
        env = OpenSleuthEnvironment()
        env.reset()
        state = env.state
        assert isinstance(state, OEState)
        assert state.episode_id is not None
        assert state.step_count == 0

    def test_metadata_is_openenv_environment_metadata(self) -> None:
        # get_metadata() must return OpenEnv's EnvironmentMetadata with the
        # three fields the /metadata endpoint serializes.
        env = OpenSleuthEnvironment()
        meta = env.get_metadata()
        assert isinstance(meta, EnvironmentMetadata)
        assert meta.name == "OpenSleuth"
        assert meta.description
        assert meta.version

    def test_reset_step_full_loop(self) -> None:
        # Full probe-then-submit episode through the adapter.
        env = OpenSleuthEnvironment()
        env.reset(target_name="fibonacci", max_steps=10, seed=0)

        probe = env.step(
            OpenSleuthAction(action_type="probe", input_repr="10")
        )
        assert probe.done is False
        assert probe.reward is not None and probe.reward > 0
        assert probe.probe_history[-1]["output_repr"] == "55"
        assert env.state.step_count == 1

        # Submitting always terminates the episode and yields some reward
        # (positive or negative), flipping the adapter's finished flag.
        submit = env.step(
            OpenSleuthAction(
                action_type="submit",
                code="def fibonacci(n):\n a,b=0,1\n for _ in range(n-1):\n a,b=b,a+b\n return b\n",
            )
        )
        assert submit.done is True
        assert submit.reward is not None
        assert env.state.finished is True

    def test_reset_with_no_args_uses_safe_default(self) -> None:
        """OpenEnv requires reset() to work with zero arguments. We use
        'fibonacci' as the implicit default so a bare reset always produces
        a valid episode."""
        env = OpenSleuthEnvironment()
        obs = env.reset()
        assert obs.target_function_name == "fibonacci"

    def test_supports_concurrent_sessions_flag(self) -> None:
        """OpenEnv's HTTPEnvServer refuses max_concurrent_envs > 1 unless
        the env opts in via SUPPORTS_CONCURRENT_SESSIONS."""
        assert OpenSleuthEnvironment.SUPPORTS_CONCURRENT_SESSIONS is True

    def test_action_is_extra_forbid(self) -> None:
        """OpenEnv Action base sets extra='forbid' to catch typo'd fields
        early. Our OpenSleuthAction must inherit that behavior."""
        from pydantic import ValidationError

        with pytest.raises(ValidationError):
            OpenSleuthAction(action_type="probe", input_repr="1", made_up_field=1)
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
# ---------------------------------------------------------------------------
|
| 128 |
+
# HTTP-level: verifies the /openenv/* sub-app routes that judges will hit.
|
| 129 |
+
# ---------------------------------------------------------------------------
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
@pytest.fixture(scope="module")
def http_client() -> TestClient:
    """Module-scoped TestClient over the full app (legacy + /openenv routes).

    NOTE(review): as a yield fixture, a more precise annotation would be
    ``Iterator[TestClient]`` — confirm before changing.
    """
    # Imported lazily so collecting this module does not import the server
    # before the importorskip("openenv...") guard at the top has run.
    from server import app

    with TestClient(app) as client:
        yield client
|
| 138 |
+
|
| 139 |
+
|
| 140 |
+
class TestOpenEnvHttpSurface:
    """The endpoints the OpenEnv spec / `openenv validate` look for."""

    def test_health(self, http_client: TestClient) -> None:
        # OpenEnv health contract: exact {"status": "healthy"} payload.
        r = http_client.get("/openenv/health")
        assert r.status_code == 200, r.text
        assert r.json() == {"status": "healthy"}

    def test_metadata(self, http_client: TestClient) -> None:
        r = http_client.get("/openenv/metadata")
        assert r.status_code == 200, r.text
        body = r.json()
        for key in ("name", "description", "version"):
            assert key in body, f"missing {key} in /openenv/metadata"
        assert body["name"] == "OpenSleuth"

    def test_schema(self, http_client: TestClient) -> None:
        # /schema must expose JSON schemas for all three model kinds.
        r = http_client.get("/openenv/schema")
        assert r.status_code == 200, r.text
        body = r.json()
        for key in ("action", "observation", "state"):
            assert key in body, f"missing {key} in /openenv/schema"
            assert "properties" in body[key], (
                f"/openenv/schema {key!r} is not a valid JSON schema"
            )
        # action discriminator should be visible in the schema
        assert "action_type" in body["action"]["properties"]

    def test_state(self, http_client: TestClient) -> None:
        r = http_client.get("/openenv/state")
        assert r.status_code == 200, r.text
        body = r.json()
        assert "episode_id" in body
        assert "step_count" in body

    def test_reset_returns_canonical_envelope(self, http_client: TestClient) -> None:
        r = http_client.post("/openenv/reset", json={"target_name": "fibonacci"})
        assert r.status_code == 200, r.text
        body = r.json()
        # Canonical OpenEnv shape: {"observation": {...}, "reward": ..., "done": ...}
        assert set(body.keys()) == {"observation", "reward", "done"}, (
            f"Expected OpenEnv envelope, got keys: {sorted(body)}"
        )
        assert body["done"] is False
        assert body["observation"]["target_function_name"] == "fibonacci"

    def test_reset_with_no_body_works(self, http_client: TestClient) -> None:
        """OpenEnv ResetRequest defaults to an empty body. Must still work."""
        r = http_client.post("/openenv/reset")
        assert r.status_code == 200, r.text
        body = r.json()
        assert "observation" in body

    def test_step_canonical_envelope_with_probe(self, http_client: TestClient) -> None:
        r = http_client.post(
            "/openenv/step",
            json={"action": {"action_type": "probe", "input_repr": "10"}},
        )
        assert r.status_code == 200, r.text
        body = r.json()
        assert set(body.keys()) == {"observation", "reward", "done"}
        # Note: under HTTP (stateless), each /openenv/step gets a fresh env;
        # we auto-reset so a probe still produces a valid history.
        assert body["observation"]["probe_history"], "probe should produce history"

    def test_step_rejects_unknown_action_field(self, http_client: TestClient) -> None:
        r = http_client.post(
            "/openenv/step",
            json={"action": {"action_type": "probe", "input_repr": "1", "wat": True}},
        )
        # OpenEnv's deserialize_action raises ValidationError -> 422.
        assert r.status_code == 422
|
| 212 |
+
|
| 213 |
+
|
| 214 |
+
# ---------------------------------------------------------------------------
|
| 215 |
+
# Regression: the legacy trainer-facing routes must still work unchanged.
|
| 216 |
+
# ---------------------------------------------------------------------------
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
class TestLegacyContractPreserved:
    """Regression guard: the trainer-facing legacy routes must not change
    while the /openenv/* sub-app exists alongside them."""

    def test_legacy_health(self, http_client: TestClient) -> None:
        # Legacy health payload uses "ok" (OpenEnv uses "healthy").
        r = http_client.get("/health")
        assert r.status_code == 200
        assert r.json()["status"] == "ok"

    def test_legacy_reset_returns_bare_observation(self, http_client: TestClient) -> None:
        """Trainer expects {episode_id, target_function_name, ...} at the top
        level (NOT wrapped in {observation: ...}). Must NOT regress."""
        r = http_client.post(
            "/reset",
            json={"target_name": "fibonacci", "seed": 0, "max_steps": 5},
        )
        assert r.status_code == 200, r.text
        body = r.json()
        assert "episode_id" in body, (
            "Legacy /reset must return a bare observation, not the OpenEnv envelope. "
            "If this fails the trainer will break."
        )
        assert "observation" not in body  # don't accidentally double-wrap

    def test_legacy_step_returns_step_response(self, http_client: TestClient) -> None:
        # Legacy /step is episode-addressed: the client passes the id
        # obtained from /reset along with the action payload.
        reset = http_client.post(
            "/reset",
            json={"target_name": "fibonacci", "seed": 0, "max_steps": 5},
        ).json()
        eid = reset["episode_id"]
        r = http_client.post(
            "/step",
            json={
                "episode_id": eid,
                "action": {"action_type": "probe", "input_repr": "5"},
            },
        )
        assert r.status_code == 200, r.text
        body = r.json()
        # Legacy shape: {observation, reward, done, info}
        assert {"observation", "reward", "done", "info"} <= set(body.keys())
        assert "execution_reward" not in body  # only present on submit info
|