---
title: OpenSleuth Env
emoji: πŸ•΅οΈ
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---

# OpenSleuth β€” Environment

FastAPI service that exposes an OpenEnv-style `/reset` + `/step` API for the
**Algorithmic Detective** task. An agent has to figure out an unknown Python
function by probing it, then submit Python source that replicates it.

## Endpoints

| Method | Path          | Body                                   | Notes                                  |
|-------:|---------------|----------------------------------------|----------------------------------------|
| GET    | `/health`     | β€”                                      | Liveness probe (also reports Hub-catalog status). |
| GET    | `/functions`  | optional `?difficulty=easy\|medium\|hard` | Catalogue of the 9 builtin black-boxes (back-compat shape). |
| GET    | `/tasks`      | optional `?source=builtin\|hub\|all`    | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST   | `/reset`      | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied target_code wins over target_name. |
| POST   | `/step`       | `{"episode_id": "...", "action": {...}}` | One agent action.                      |
| GET    | `/state/{eid}`| β€”                                      | Inspect the live state of an episode (debug). |

### Action shapes

```json
{"action_type": "probe",  "input_repr": "5"}             // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```
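Since `input_repr` goes through `ast.literal_eval`, only Python literals (numbers, strings, tuples, lists, dicts, ...) are valid probe inputs; arbitrary expressions are rejected. A minimal sketch of what that implies for callers (the helper name is ours, not the env's):

```python
import ast


def parse_probe_input(input_repr: str):
    """Parse a probe's input_repr the way the README describes:
    ast.literal_eval accepts only Python literals and raises
    ValueError on anything else (calls, names, expressions)."""
    return ast.literal_eval(input_repr)


print(parse_probe_input("5"))         # the "probe 5" action's payload
print(parse_probe_input("(3, 'x')"))  # multi-arg probes can pass tuples
```

This is also why probing with something like `"__import__('os')"` fails at parse time rather than reaching the black-box function.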

### Reward (v0.3 – paper-driven update)

Inspired by Masud et al. 2026 (*Reward Engineering for RL in Software Tasks*,
arXiv:2601.19100) and Ibrahim et al. 2024 (*Comprehensive Overview of Reward
Engineering and Shaping*, arXiv:2408.10215).

* **Probe:** `-1` step cost, plus `+2` per newly-seen output, `+5` per
  newly-seen exception type, **and `+0.5` per newly-explored input bucket**
  (CovRL-Fuzz / SimHash-style coverage bonus).
* **Submit (terminal):**
  `execution_reward βˆ’ complexity_penalty βˆ’ reward_hack_penalty βˆ’ floor_penalty
  (+50 perfect bonus if 100% match)` where:
  * `execution_reward` ∈ `[0, 100]` is computed over **stratified** fuzz
    inputs: spec-defined `edge_cases` are *always* tested in addition to the
    random fuzz batch, and the per-category match counts are returned in
    `info["matches_by_category"]`.
  * `floor_penalty` is a hard `-25` for sub-50% match-rate submissions
    (Vul-R2 style; Wen et al. 2025), preventing agents from learning that
    emitting *any* function pays out.
  * `reward_hack_penalty` fires for static import-of-reference attempts
    (`+25`) and for "constant-output" collapse against a diverse reference
    (`+15`). The sandbox additionally **blocks** `__import__`, `open`,
    `eval`, `exec`, `compile`, etc.
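As a sanity check on the arithmetic, here is a minimal sketch of the terminal submit score. This is our illustration, not the env's code: the complexity and hack penalties are taken as inputs because their exact computation isn't specified above.

```python
def submit_reward(match_rate: float,
                  complexity_penalty: float = 0.0,
                  reward_hack_penalty: float = 0.0) -> float:
    """Sketch of the terminal submit reward described above.

    execution_reward maps the 0..1 match rate into [0, 100]; a hard
    -25 floor penalty applies below a 50% match rate (Vul-R2 style),
    and a +50 bonus applies on a perfect match.
    """
    execution_reward = 100.0 * match_rate
    floor_penalty = 25.0 if match_rate < 0.5 else 0.0
    perfect_bonus = 50.0 if match_rate == 1.0 else 0.0
    return (execution_reward - complexity_penalty
            - reward_hack_penalty - floor_penalty + perfect_bonus)


print(submit_reward(1.0))  # 150.0 — perfect match earns the +50 bonus
print(submit_reward(0.4))  # 15.0  — 40 execution reward minus the -25 floor
```

Note the discontinuity at 50%: a 0.49 match scores 24, a 0.51 match scores 51, which is exactly the gap that stops "emit anything" from paying out.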

### Open-ended tasks (Level 2)

The env resolves a target function from three sources, in priority order:

1. **Caller-supplied** β€” `POST /reset` with `target_code` + `target_function_name`
   (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
   same hardened sandbox the verifier uses for agent submissions; static-import
   of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
   arbitrary unseen task per rollout without any redeploy.

2. **Hub dataset** β€” [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
   Loaded lazily on first `/reset`, cached in-process. Each row has
   `{name, target_function_name, signature, description, difficulty,
   source_code, edge_cases_json, fuzz_spec_json}`.

3. **Builtin registry** β€” the original 9 functions in `black_box.py` are kept
   as the safety-net so the in-flight trainer keeps working unchanged. Builtins
   *win* by name over Hub copies, so `target_name="fibonacci"` always resolves
   to the in-process oracle.
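The resolution order above can be sketched as follows (the function name and catalog shapes are ours, not the env's internals):

```python
def resolve_target(target_code=None, target_name=None,
                   builtins=None, hub_rows=None):
    """Sketch of the three-way target resolution described above:
    caller-supplied code wins outright; otherwise a builtin with the
    requested name beats a Hub copy; otherwise fall back to the Hub."""
    builtins = builtins or {}
    hub_rows = hub_rows or {}
    if target_code is not None:
        return ("caller", target_code)
    if target_name in builtins:
        return ("builtin", builtins[target_name])
    if target_name in hub_rows:
        return ("hub", hub_rows[target_name])
    raise KeyError(f"unknown target: {target_name!r}")
```

So even if the Hub dataset also contains a `fibonacci` row, `target_name="fibonacci"` without `target_code` resolves to the builtin oracle.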

#### Adding new tasks

* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
  `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
  introspects `inspect.signature` + `typing.get_type_hints`); pass
  `edge_cases` as a list of Python literal strings and `fuzz_spec` as a
  per-parameter override map.

* **Persistent**: append a row to the Hub dataset and the env will pick it
  up on its next process-start. The bootstrap script
  (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent β€”
  re-running it overwrites the dataset with the latest builtin + curated
  rows.

```bash
# Push the curated 9 + 6 = 15-task seed catalog.
PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
```
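Putting the per-reset fields together, a one-shot `/reset` body might look like the following. The task itself is made up, and the exact per-parameter `fuzz_spec` format is an assumption — check the env source for the real schema.

```python
# Illustrative one-shot /reset payload (field names from the README;
# the target function and the fuzz_spec value shape are invented here).
reset_body = {
    "target_code": "def double(x):\n    return 2 * x\n",
    "target_function_name": "double",
    "edge_cases": ["0", "-1", "10**6"],  # Python literal strings
    "fuzz_spec": {"x": {"kind": "int", "min": -100, "max": 100}},  # assumed shape
    "max_steps": 25,
}
```

The env compiles `target_code` in the same hardened sandbox it uses for agent submissions, so the same import/builtin restrictions apply to trainer-supplied tasks.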

### Backwards compatibility

Existing trainer / eval clients only read `info["execution_reward"]`,
`info["matches"]`, `info["fuzz_count"]` and `resp["reward"]` β€” all preserved
with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
`matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
`floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.

`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
"max_steps": 25}` works exactly as before. The four new optional fields
(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
additive. `/functions` returns the same shape as before (with one *additive*
`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
endpoint so older clients aren't surprised.
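A minimal sketch of the defensive read an older client effectively does — stable fields only, additive fields ignored (the response values here are illustrative):

```python
def read_step_response(resp: dict) -> dict:
    """Pull out only the stable v0.3 fields an existing trainer/eval
    client reads; any additive fields in info are simply ignored."""
    info = resp.get("info", {})
    return {
        "reward": resp["reward"],
        "execution_reward": info.get("execution_reward"),
        "matches": info.get("matches"),
        "fuzz_count": info.get("fuzz_count"),
    }
```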

## OpenEnv conformance

This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv)
v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant
surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy
endpoints listed above so the in-flight trainer keeps working unchanged.

| OpenEnv route            | Path                  | Notes                                                    |
|--------------------------|-----------------------|----------------------------------------------------------|
| `GET  /health`           | `/openenv/health`     | `{"status": "healthy"}`                                  |
| `GET  /metadata`         | `/openenv/metadata`   | `EnvironmentMetadata` (name, description, version, ...)  |
| `GET  /schema`           | `/openenv/schema`     | JSON schemas for `action`, `observation`, `state`        |
| `GET  /state`            | `/openenv/state`      | Episode `State` (episode_id, step_count, ...)            |
| `POST /reset`            | `/openenv/reset`      | Returns `{"observation", "reward", "done"}` envelope     |
| `POST /step`             | `/openenv/step`       | Body: `{"action": {"action_type": "probe"\|"submit", ...}}` |
| `WS   /ws`               | `/openenv/ws`         | Persistent session: `reset` β†’ `step`* β†’ `state` β†’ `close` |

`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses
`openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware
harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations,
LightningAI Studio, ...) can pick it up via standard introspection.

### Talking to it as an OpenEnv client

```python
import asyncio
from openenv import GenericEnvClient, GenericAction

async def main():
    base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
    async with GenericEnvClient(base_url=base) as env:
        result = await env.reset(target_name="fibonacci", max_steps=8)
        result = await env.step(GenericAction(action_type="probe", input_repr="10"))
        print(result.observation["probe_history"][-1])

asyncio.run(main())
```

A runnable end-to-end example lives in [`example_client.py`](example_client.py).

### What is *not* yet conformant

* No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not
  MCP tools, because the underlying probe/submit semantics map cleanly to a
  single `OpenSleuthAction` discriminator. Adding MCP would be additive.
* No Rubric/EvalHarness integration (RFC 004) β€” reward shaping lives in
  `opensleuth_env/env.py` and is intentionally not split into a separate
  rubric for now.

## Hardware

CPU-only β€” `cpu-basic` is plenty. Do **not** assign GPU to this Space.

## Running locally

```bash
pip install -r requirements.txt
uvicorn server:app --port 7860 --reload
# legacy contract:                http://localhost:7860/{health,reset,step,state/{eid}}
# OpenEnv-conformant surface:     http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
```

To run only the OpenEnv conformance tests:

```bash
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
```