---
title: OpenSleuth Env
emoji: 🕵️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---

# OpenSleuth Environment
|
|
| FastAPI service that exposes an OpenEnv-style `/reset` + `/step` API for the |
| **Algorithmic Detective** task. An agent has to figure out an unknown Python |
| function by probing it, then submit Python source that replicates it. |
|
|
| ## Endpoints |
|
|
| | Method | Path | Body | Notes | |
| |-------:|---------------|----------------------------------------|----------------------------------------| |
| GET    | `/health`     | –                                      | Liveness probe (also reports Hub-catalog status). |
| GET    | `/functions`  | optional `?difficulty=easy\|medium\|hard` | Catalog of the 9 builtin black boxes (back-compat shape). |
| GET    | `/tasks`      | optional `?source=builtin\|hub\|all`   | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST   | `/reset`      | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. A caller-supplied `target_code` wins over `target_name`. |
| POST   | `/step`       | `{"episode_id": "...", "action": {...}}` | One agent action. |
| GET    | `/state/{eid}`| –                                      | Inspect the live state of an episode (debug). |
|
|
| ### Action shapes |
|
|
```jsonc
{"action_type": "probe", "input_repr": "5"}   // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```
|
|
### Reward (v0.3, paper-driven update)
|
|
| Inspired by Masud et al. 2026 (*Reward Engineering for RL in Software Tasks*, |
| arXiv:2601.19100) and Ibrahim et al. 2024 (*Comprehensive Overview of Reward |
| Engineering and Shaping*, arXiv:2408.10215). |
|
|
| * **Probe:** `-1` step cost, plus `+2` per newly-seen output, `+5` per |
| newly-seen exception type, **and `+0.5` per newly-explored input bucket** |
| (CovRL-Fuzz / SimHash-style coverage bonus). |
* **Submit (terminal):**
  `execution_reward − complexity_penalty − reward_hack_penalty − floor_penalty
  (+50 perfect bonus if 100% match)`, where:
| * `execution_reward` β `[0, 100]` is computed over **stratified** fuzz |
| inputs: spec-defined `edge_cases` are *always* tested in addition to the |
| random fuzz batch, and the per-category match counts are returned in |
| `info["matches_by_category"]`. |
| * `floor_penalty` is a hard `-25` for sub-50% match-rate submissions |
| (Vul-R2 style; Wen et al. 2025), preventing agents from learning that |
| emitting *any* function pays out. |
| * `reward_hack_penalty` fires for static import-of-reference attempts |
| (`+25`) and for "constant-output" collapse against a diverse reference |
| (`+15`). The sandbox additionally **blocks** `__import__`, `open`, |
| `eval`, `exec`, `compile`, etc. |
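Putting the submit-time bullets together, here is a toy reconstruction of the terminal reward. Only the constants stated above come from the spec; the function name, argument spelling, and the >= 1.0 perfect-match condition are illustrative assumptions:

```python
def submit_reward(execution_reward: float,
                  complexity_penalty: float,
                  reward_hack_penalty: float,
                  match_rate: float) -> float:
    """Toy sketch of the v0.3 terminal reward (illustrative only).

    execution_reward is in [0, 100]; a hard -25 floor penalty fires for
    sub-50% match rates, and a +50 bonus is added on a perfect match.
    """
    floor_penalty = 25.0 if match_rate < 0.5 else 0.0
    perfect_bonus = 50.0 if match_rate >= 1.0 else 0.0
    return (execution_reward
            - complexity_penalty
            - reward_hack_penalty
            - floor_penalty
            + perfect_bonus)

# A perfect submission with no penalties: 100 - 0 - 0 - 0 + 50 = 150.
print(submit_reward(100.0, 0.0, 0.0, 1.0))
# A 40%-match submission eats the floor penalty: 40 - 0 - 0 - 25 = 15.
print(submit_reward(40.0, 0.0, 0.0, 0.4))
```

The floor penalty is what makes "emit any function and collect partial credit" a losing strategy in expectation.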
| ### Open-ended tasks (Level 2) |
|
|
| The env resolves a target function from three sources, in priority order: |
|
|
1. **Caller-supplied**: `POST /reset` with `target_code` + `target_function_name`
   (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
   same hardened sandbox the verifier uses for agent submissions; a static
   import of `opensleuth_*` is rejected up front. This lets a trainer hand the
   env an arbitrary unseen task per rollout without any redeploy.
|
|
2. **Hub dataset**: [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
| Loaded lazily on first `/reset`, cached in-process. Each row has |
| `{name, target_function_name, signature, description, difficulty, |
| source_code, edge_cases_json, fuzz_spec_json}`. |
3. **Builtin registry**: the original 9 functions in `black_box.py` are kept
   as the safety net so the in-flight trainer keeps working unchanged. Builtins
| *win* by name over Hub copies, so `target_name="fibonacci"` always resolves |
| to the in-process oracle. |
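The resolution rules above (caller-supplied code always wins; builtins shadow Hub rows of the same name) can be sketched as a small resolver. Everything here, including the function and variable names, is hypothetical; the real implementation lives inside the env:

```python
def resolve_target(builtin_registry, hub_rows, target_code=None, target_name=None):
    """Resolve a target function source (hypothetical sketch).

    Caller-supplied target_code wins outright; for name lookups,
    builtins shadow Hub rows of the same name, and the Hub is the fallback.
    """
    if target_code is not None:
        return ("caller", target_code)
    if target_name in builtin_registry:
        return ("builtin", builtin_registry[target_name])
    if target_name in hub_rows:
        return ("hub", hub_rows[target_name])
    raise KeyError(f"unknown target: {target_name!r}")

builtins = {"fibonacci": "def fibonacci(n): ..."}
hub = {"fibonacci": "hub copy", "rle_encode": "def rle_encode(s): ..."}
print(resolve_target(builtins, hub, target_name="fibonacci")[0])   # -> builtin
print(resolve_target(builtins, hub, target_name="rle_encode")[0])  # -> hub
```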
|
|
| #### Adding new tasks |
|
|
| * **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to |
| `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which |
| introspects `inspect.signature` + `typing.get_type_hints`); pass |
| `edge_cases` as a list of Python literal strings and `fuzz_spec` as a |
| per-parameter override map. |
|
|
| * **Persistent**: append a row to the Hub dataset and the env will pick it |
| up on its next process-start. The bootstrap script |
  (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent:
| re-running it overwrites the dataset with the latest builtin + curated |
| rows. |
|
|
| ```bash |
| # Push the curated 9 + 6 = 15-task seed catalog. |
| PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset |
| ``` |
|
|
| ### Backwards compatibility |
|
|
Existing trainer / eval clients read only `info["execution_reward"]`,
`info["matches"]`, `info["fuzz_count"]`, and `resp["reward"]`; all are preserved
| with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`, |
| `matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`, |
| `floor_penalty`, `perfect_bonus`) are additive and ignored by older clients. |
|
|
| `/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0, |
| "max_steps": 25}` works exactly as before. The four new optional fields |
| (`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are |
| additive. `/functions` returns the same shape as before (with one *additive* |
| `source` field). Open-ended/Hub tasks are exposed via the new `/tasks` |
| endpoint so older clients aren't surprised. |
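In client code, the compatibility contract boils down to reading only the stable keys and ignoring additive ones. A minimal sketch, with a fabricated response payload for illustration:

```python
def stable_fields(step_response: dict) -> dict:
    """Extract only the long-standing fields an old client relies on;
    unknown additive keys are simply ignored."""
    info = step_response.get("info", {})
    return {
        "reward": step_response["reward"],
        "execution_reward": info.get("execution_reward"),
        "matches": info.get("matches"),
        "fuzz_count": info.get("fuzz_count"),
    }

resp = {  # fabricated example payload
    "reward": 87.5,
    "info": {"execution_reward": 87.5, "matches": 35, "fuzz_count": 40,
             "edge_pass_rate": 1.0, "floor_penalty": 0.0},  # additive keys
}
print(stable_fields(resp))
```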
|
|
| ## OpenEnv conformance |
|
|
| This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv) |
| v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant |
| surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy |
endpoints listed above so the in-flight trainer keeps working unchanged.

| OpenEnv route | Path | Notes |
| |--------------------------|-----------------------|----------------------------------------------------------| |
| | `GET /health` | `/openenv/health` | `{"status": "healthy"}` | |
| | `GET /metadata` | `/openenv/metadata` | `EnvironmentMetadata` (name, description, version, ...) | |
| | `GET /schema` | `/openenv/schema` | JSON schemas for `action`, `observation`, `state` | |
| | `GET /state` | `/openenv/state` | Episode `State` (episode_id, step_count, ...) | |
| | `POST /reset` | `/openenv/reset` | Returns `{"observation", "reward", "done"}` envelope | |
| `POST /step`             | `/openenv/step`       | Body: `{"action": {"action_type": "probe"\|"submit", ...}}` |
| `WS /ws`                 | `/openenv/ws`         | Persistent session: `reset` → `step`* → `state` → `close` |

`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses
| `openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware |
| harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations, |
| LightningAI Studio, ...) can pick it up via standard introspection. |
| ### Talking to it as an OpenEnv client |
| ```python |
| import asyncio |
from openenv import GenericEnvClient, GenericAction

| async def main(): |
| base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv" |
| async with GenericEnvClient(base_url=base) as env: |
| result = await env.reset(target_name="fibonacci", max_steps=8) |
| result = await env.step(GenericAction(action_type="probe", input_repr="10")) |
| print(result.observation["probe_history"][-1]) |
| |
| asyncio.run(main()) |
| ``` |
| A runnable end-to-end example lives in [`example_client.py`](example_client.py). |
| ### What is *not* yet conformant |
| * No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not |
| MCP tools, because the underlying probe/submit semantics map cleanly to a |
| single `OpenSleuthAction` discriminator. Adding MCP would be additive. |
* No Rubric/EvalHarness integration (RFC 004): reward shaping lives in
| `opensleuth_env/env.py` and is intentionally not split into a separate |
| rubric for now. |
| ## Hardware |
CPU-only: `cpu-basic` is plenty. Do **not** assign GPU to this Space.
|
|
| ## Running locally |
|
|
| ```bash |
| pip install -r requirements.txt |
| uvicorn server:app --port 7860 --reload |
| # legacy contract: http://localhost:7860/{health,reset,step,state/{eid}} |
| # OpenEnv-conformant surface: http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws} |
| ``` |
|
|
| To run only the OpenEnv conformance tests: |
|
|
| ```bash |
| PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v |
| ``` |
|
|