---
title: OpenSleuth Env
emoji: 🕵️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---

# OpenSleuth — Environment

FastAPI service that exposes an OpenEnv-style `/reset` + `/step` API for the **Algorithmic Detective** task. An agent has to figure out an unknown Python function by probing it, then submit Python source that replicates it.

## Endpoints

| Method | Path           | Body | Notes |
|-------:|----------------|------|-------|
| GET    | `/health`      | — | Liveness probe (also reports Hub-catalog status). |
| GET    | `/functions`   | optional `?difficulty=easy\|medium\|hard` | Catalog of the 9 builtin black boxes (back-compat shape). |
| GET    | `/tasks`       | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST   | `/reset`       | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. A caller-supplied `target_code` wins over `target_name`. |
| POST   | `/step`        | `{"episode_id": "...", "action": {...}}` | One agent action. |
| GET    | `/state/{eid}` | — | Inspect the live state of an episode (debug). |

### Action shapes

```json
{"action_type": "probe", "input_repr": "5"}   // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```

### Reward (v0.3 – paper-driven update)

Inspired by Masud et al. 2026 (*Reward Engineering for RL in Software Tasks*, arXiv:2601.19100) and Ibrahim et al. 2024 (*Comprehensive Overview of Reward Engineering and Shaping*, arXiv:2408.10215).

* **Probe:** `-1` step cost, plus `+2` per newly-seen output, `+5` per newly-seen exception type, **and `+0.5` per newly-explored input bucket** (CovRL-Fuzz / SimHash-style coverage bonus).
* **Submit (terminal):** `execution_reward − complexity_penalty − reward_hack_penalty − floor_penalty (+50 perfect bonus if 100% match)`, where:
  * `execution_reward` ∈ `[0, 100]` is computed over **stratified** fuzz inputs: spec-defined `edge_cases` are *always* tested in addition to the random fuzz batch, and the per-category match counts are returned in `info["matches_by_category"]`.
  * `floor_penalty` is a hard `-25` for sub-50% match-rate submissions (Vul-R2 style; Wen et al. 2025), preventing agents from learning that emitting *any* function pays out.
  * `reward_hack_penalty` fires for static import-of-reference attempts (`+25`) and for "constant-output" collapse against a diverse reference (`+15`). The sandbox additionally **blocks** `__import__`, `open`, `eval`, `exec`, `compile`, etc.

### Open-ended tasks (Level 2)

The env resolves a target function from three sources, in priority order:

1. **Caller-supplied** — `POST /reset` with `target_code` + `target_function_name` (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the same hardened sandbox the verifier uses for agent submissions; a static import of `opensleuth_*` is rejected up front. This lets a trainer hand the env an arbitrary unseen task per rollout without any redeploy.
2. **Hub dataset** — [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks). Loaded lazily on first `/reset`, cached in-process. Each row has `{name, target_function_name, signature, description, difficulty, source_code, edge_cases_json, fuzz_spec_json}`.
3. **Builtin registry** — the original 9 functions in `black_box.py` are kept as the safety net so the in-flight trainer keeps working unchanged. Builtins *win* by name over Hub copies, so `target_name="fibonacci"` always resolves to the in-process oracle.

#### Adding new tasks

* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to `/reset`.
  Multi-arg signatures are supported via the auto-fuzzer (which introspects `inspect.signature` + `typing.get_type_hints`); pass `edge_cases` as a list of Python literal strings and `fuzz_spec` as a per-parameter override map.
* **Persistent**: append a row to the Hub dataset and the env will pick it up on its next process start. The bootstrap script (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent — re-running it overwrites the dataset with the latest builtin + curated rows.

```bash
# Push the curated 9 + 6 = 15-task seed catalog.
PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
```

### Backwards compatibility

Existing trainer / eval clients only read `info["execution_reward"]`, `info["matches"]`, `info["fuzz_count"]` and `resp["reward"]` — all preserved with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`, `matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`, `floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.

`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0, "max_steps": 25}` works exactly as before. The four new optional fields (`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are additive.

`/functions` returns the same shape as before (with one *additive* `source` field). Open-ended/Hub tasks are exposed via the new `/tasks` endpoint so older clients aren't surprised.

## OpenEnv conformance

This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv) v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy endpoints listed above, so the in-flight trainer keeps working unchanged.
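To make the two accepted `/reset` body shapes concrete, they can be sketched as plain dicts. The field names come from the endpoint table and backwards-compatibility notes above; the concrete task code and the inner `fuzz_spec` values are invented for illustration:

```python
import json

# Legacy shape: resolve a builtin (or Hub) target by name.
legacy_reset = {"target_name": "fibonacci", "seed": 0, "max_steps": 25}

# Open-ended shape: caller-supplied target. The task itself and the
# per-parameter fuzz_spec entries here are hypothetical examples.
open_ended_reset = {
    "target_code": "def double_or_zero(n):\n    return 2 * n if n > 0 else 0\n",
    "target_function_name": "double_or_zero",
    "edge_cases": ["0", "-3", "7"],                 # Python literal strings
    "fuzz_spec": {"n": {"min": -100, "max": 100}},  # per-parameter override map
    "seed": 0,
    "max_steps": 25,
}

# Either body is POSTed as JSON to /reset; target_code wins over target_name.
body = json.dumps(open_ended_reset)
```
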
| OpenEnv route   | Path                | Notes |
|-----------------|---------------------|-------|
| `GET /health`   | `/openenv/health`   | `{"status": "healthy"}` |
| `GET /metadata` | `/openenv/metadata` | `EnvironmentMetadata` (name, description, version, ...) |
| `GET /schema`   | `/openenv/schema`   | JSON schemas for `action`, `observation`, `state` |
| `GET /state`    | `/openenv/state`    | Episode `State` (episode_id, step_count, ...) |
| `POST /reset`   | `/openenv/reset`    | Returns `{"observation", "reward", "done"}` envelope |
| `POST /step`    | `/openenv/step`     | Body: `{"action": {"action_type": "probe"\|"submit", ...}}` |
| `WS /ws`        | `/openenv/ws`       | Persistent session: `reset` → `step`* → `state` → `close` |

`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses `openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations, LightningAI Studio, ...) can pick it up via standard introspection.

### Talking to it as an OpenEnv client

```python
import asyncio

from openenv import GenericEnvClient, GenericAction


async def main():
    base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
    async with GenericEnvClient(base_url=base) as env:
        result = await env.reset(target_name="fibonacci", max_steps=8)
        result = await env.step(GenericAction(action_type="probe", input_repr="10"))
        print(result.observation["probe_history"][-1])

asyncio.run(main())
```

A runnable end-to-end example lives in [`example_client.py`](example_client.py).

### What is *not* yet conformant

* No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not MCP tools, because the underlying probe/submit semantics map cleanly to a single `OpenSleuthAction` discriminator. Adding MCP would be additive.
* No Rubric/EvalHarness integration (RFC 004) — reward shaping lives in `opensleuth_env/env.py` and is intentionally not split into a separate rubric for now.

## Hardware

CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.

## Running locally

```bash
pip install -r requirements.txt
uvicorn server:app --port 7860 --reload
# legacy contract:            http://localhost:7860/{health,reset,step,state/{eid}}
# OpenEnv-conformant surface: http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
```

To run only the OpenEnv conformance tests:

```bash
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
```
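Once the server is up, both surfaces can be smoke-tested from Python. A minimal stdlib sketch (the URL paths follow the routes listed above; calling `check` assumes the local server is actually running):

```python
import json
import urllib.request

def health_urls(base: str) -> dict:
    # The OpenEnv surface is mounted under /openenv alongside the
    # legacy endpoints, so there are two liveness probes.
    return {"legacy": f"{base}/health", "openenv": f"{base}/openenv/health"}

def check(url: str) -> dict:
    # Requires `uvicorn server:app --port 7860` to be running.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

urls = health_urls("http://localhost:7860")
# e.g. check(urls["openenv"]) should return {"status": "healthy"}
```
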