---
title: OpenSleuth Env
emoji: πŸ•΅οΈ
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---
# OpenSleuth β€” Environment
FastAPI service that exposes an OpenEnv-style `/reset` + `/step` API for the
**Algorithmic Detective** task. An agent must reverse-engineer an unknown Python
function by probing it, then submit Python source that replicates it.
## Endpoints
| Method | Path | Body | Notes |
|-------:|---------------|----------------------------------------|----------------------------------------|
| GET | `/health` | β€” | Liveness probe (also reports Hub-catalog status). |
| GET    | `/functions`  | optional `?difficulty=easy\|medium\|hard` | Catalog of the 9 builtin black boxes (back-compat shape). |
| GET | `/tasks` | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST   | `/reset`      | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied `target_code` wins over `target_name`. |
| POST | `/step` | `{"episode_id": "...", "action": {...}}` | One agent action. |
| GET | `/state/{eid}`| β€” | Inspect the live state of an episode (debug). |
### Action shapes
```json
{"action_type": "probe", "input_repr": "5"} // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```
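The docs above say `input_repr` goes through `ast.literal_eval` before the black box is called. The sketch below mirrors that parse step; the helper name and the tuple-fans-out-to-multiple-arguments convention are our assumptions, not the env's actual code.

```python
import ast

def parse_probe_input(input_repr: str):
    """Hypothetical helper mirroring the documented parse step:
    the env feeds input_repr through ast.literal_eval before the call."""
    value = ast.literal_eval(input_repr)
    # Assumption (not from the docs): a tuple literal fans out to a
    # multi-argument call; anything else is one positional argument.
    return value if isinstance(value, tuple) else (value,)

print(parse_probe_input("5"))          # (5,)
print(parse_probe_input("(3, 'ab')"))  # (3, 'ab')
```

Because `ast.literal_eval` only accepts literals, probes cannot smuggle in arbitrary expressions.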
### Reward (v0.3 – paper-driven update)
Inspired by Masud et al. 2026 (*Reward Engineering for RL in Software Tasks*,
arXiv:2601.19100) and Ibrahim et al. 2024 (*Comprehensive Overview of Reward
Engineering and Shaping*, arXiv:2408.10215).
* **Probe:** `-1` step cost, plus `+2` per newly-seen output, `+5` per
newly-seen exception type, **and `+0.5` per newly-explored input bucket**
(CovRL-Fuzz / SimHash-style coverage bonus).
* **Submit (terminal):**
`execution_reward βˆ’ complexity_penalty βˆ’ reward_hack_penalty βˆ’ floor_penalty
(+50 perfect bonus if 100% match)` where:
* `execution_reward` ∈ `[0, 100]` is computed over **stratified** fuzz
inputs: spec-defined `edge_cases` are *always* tested in addition to the
random fuzz batch, and the per-category match counts are returned in
`info["matches_by_category"]`.
* `floor_penalty` is a hard `-25` for sub-50% match-rate submissions
(Vul-R2 style; Wen et al. 2025), preventing agents from learning that
emitting *any* function pays out.
* `reward_hack_penalty` fires for static import-of-reference attempts
(`+25`) and for "constant-output" collapse against a diverse reference
(`+15`). The sandbox additionally **blocks** `__import__`, `open`,
`eval`, `exec`, `compile`, etc.
### Open-ended tasks (Level 2)
The env resolves a target function from three sources, in priority order:
1. **Caller-supplied** β€” `POST /reset` with `target_code` + `target_function_name`
(and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
same hardened sandbox the verifier uses for agent submissions; static-import
of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
arbitrary unseen task per rollout without any redeploy.
2. **Hub dataset** β€” [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
Loaded lazily on first `/reset`, cached in-process. Each row has
`{name, target_function_name, signature, description, difficulty,
source_code, edge_cases_json, fuzz_spec_json}`.
3. **Builtin registry** β€” the original 9 functions in `black_box.py` are kept
as the safety-net so the in-flight trainer keeps working unchanged. Builtins
*win* by name over Hub copies, so `target_name="fibonacci"` always resolves
to the in-process oracle.
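A caller-supplied reset (source 1) might look like the body below. Field names follow the docs; the target function itself and the exact `fuzz_spec` shape are illustrative assumptions.

```python
import json

# Illustrative one-shot /reset body for a caller-supplied target.
reset_body = {
    "target_code": "def double(x):\n    return 2 * x\n",
    "target_function_name": "double",
    "edge_cases": ["0", "-1", "1000000"],            # Python literal strings
    "fuzz_spec": {"x": {"min": -100, "max": 100}},   # hypothetical per-parameter override
    "seed": 0,
    "max_steps": 25,
}
print(json.dumps(reset_body, indent=2))
```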
#### Adding new tasks
* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
`/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
introspects `inspect.signature` + `typing.get_type_hints`); pass
`edge_cases` as a list of Python literal strings and `fuzz_spec` as a
per-parameter override map.
* **Persistent**: append a row to the Hub dataset and the env will pick it
up on its next process-start. The bootstrap script
(`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent β€”
re-running it overwrites the dataset with the latest builtin + curated
rows.
```bash
# Push the curated 9 + 6 = 15-task seed catalog.
PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
```
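The signature/type-hint introspection the auto-fuzzer relies on can be sketched as follows. This is a deliberate simplification (ints only, uniform draws), not the env's actual fuzzer:

```python
import inspect
import random
import typing

def auto_fuzz_inputs(fn, n=5, seed=0):
    """Sketch of the documented introspection step: read the signature and
    type hints, then draw one value per parameter per fuzz input."""
    rng = random.Random(seed)
    hints = typing.get_type_hints(fn)
    params = list(inspect.signature(fn).parameters)
    def draw(name):
        hint = hints.get(name, int)
        assert hint is int, f"sketch only handles int parameters, got {hint}"
        return rng.randint(-100, 100)
    return [tuple(draw(p) for p in params) for _ in range(n)]

def clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(hi, x))

# Multi-arg signatures are handled because we fuzz each parameter.
print(auto_fuzz_inputs(clamp, n=2))
```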
### Backwards compatibility
Existing trainer / eval clients only read `info["execution_reward"]`,
`info["matches"]`, `info["fuzz_count"]` and `resp["reward"]` β€” all preserved
with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
`matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
`floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.
`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
"max_steps": 25}` works exactly as before. The four new optional fields
(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
additive. `/functions` returns the same shape as before (with one *additive*
`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
endpoint so older clients aren't surprised.
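The forward-compatible client pattern described above amounts to reading only the stable fields and ignoring anything additive. The response dict here is fabricated for illustration:

```python
# Fabricated /step response; only the stable fields are read.
resp = {
    "reward": 87.0,
    "done": True,
    "info": {
        "execution_reward": 92.0,
        "matches": 46,
        "fuzz_count": 50,
        # additive fields an older client never looks at:
        "matches_by_category": {"edge": 6, "fuzz": 40},
        "floor_penalty": 0,
    },
}
stable = {k: resp["info"][k] for k in ("execution_reward", "matches", "fuzz_count")}
print(stable["matches"] / stable["fuzz_count"], resp["reward"])  # 0.92 87.0
```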
## OpenEnv conformance
This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv)
v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant
surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy
endpoints listed above so the in-flight trainer keeps working unchanged.
| OpenEnv route | Path | Notes |
|--------------------------|-----------------------|----------------------------------------------------------|
| `GET /health` | `/openenv/health` | `{"status": "healthy"}` |
| `GET /metadata` | `/openenv/metadata` | `EnvironmentMetadata` (name, description, version, ...) |
| `GET /schema` | `/openenv/schema` | JSON schemas for `action`, `observation`, `state` |
| `GET /state` | `/openenv/state` | Episode `State` (episode_id, step_count, ...) |
| `POST /reset` | `/openenv/reset` | Returns `{"observation", "reward", "done"}` envelope |
| `POST /step`             | `/openenv/step`       | Body: `{"action": {"action_type": "probe"\|"submit", ...}}` |
| `WS /ws` | `/openenv/ws` | Persistent session: `reset` β†’ `step`* β†’ `state` β†’ `close` |
`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses
`openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware
harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations,
LightningAI Studio, ...) can pick it up via standard introspection.
### Talking to it as an OpenEnv client
```python
import asyncio

from openenv import GenericEnvClient, GenericAction


async def main():
    base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
    async with GenericEnvClient(base_url=base) as env:
        result = await env.reset(target_name="fibonacci", max_steps=8)
        result = await env.step(GenericAction(action_type="probe", input_repr="10"))
        print(result.observation["probe_history"][-1])


asyncio.run(main())
```
A runnable end-to-end example lives in [`example_client.py`](example_client.py).
### What is *not* yet conformant
* No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not
MCP tools, because the underlying probe/submit semantics map cleanly to a
single `OpenSleuthAction` discriminator. Adding MCP would be additive.
* No Rubric/EvalHarness integration (RFC 004) β€” reward shaping lives in
`opensleuth_env/env.py` and is intentionally not split into a separate
rubric for now.
## Hardware
CPU-only β€” `cpu-basic` is plenty. Do **not** assign GPU to this Space.
## Running locally
```bash
pip install -r requirements.txt
uvicorn server:app --port 7860 --reload
# legacy contract: http://localhost:7860/{health,reset,step,state/{eid}}
# OpenEnv-conformant surface: http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
```
To run only the OpenEnv conformance tests:
```bash
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
```