---
title: OpenSleuth Env
emoji: πŸ•΅οΈ
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---

# OpenSleuth β€” Environment

FastAPI service that exposes an OpenEnv-style `/reset` + `/step` API for the
**Algorithmic Detective** task. An agent has to figure out an unknown Python
function by probing it, then submit Python source that replicates it.

## Endpoints

| Method | Path          | Body                                   | Notes                                  |
|-------:|---------------|----------------------------------------|----------------------------------------|
| GET    | `/health`     | β€”                                      | Liveness probe (also reports Hub-catalog status). |
| GET    | `/functions`  | optional `?difficulty=easy\|medium\|hard` | Catalogue of the 9 builtin black-boxes (back-compat shape). |
| GET    | `/tasks`      | optional `?source=builtin\|hub\|all`    | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST   | `/reset`      | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied target_code wins over target_name. |
| POST   | `/step`       | `{"episode_id": "...", "action": {...}}` | One agent action.                      |
| GET    | `/state/{eid}`| β€”                                      | Inspect the live state of an episode (debug). |

### Action shapes

```json
{"action_type": "probe",  "input_repr": "5"}             // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```
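Since `input_repr` goes through `ast.literal_eval`, only Python literals (numbers, strings, tuples, lists, dicts, ...) are valid probe inputs; arbitrary expressions are rejected. A minimal sketch of what that implies for callers (the helper name is ours, not the env's):

```python
import ast


def parse_probe_input(input_repr: str):
    """Parse a probe's input_repr the way the README describes:
    ast.literal_eval accepts only Python literals and raises
    ValueError on anything else (calls, names, expressions)."""
    return ast.literal_eval(input_repr)


print(parse_probe_input("5"))         # the "probe 5" action's payload
print(parse_probe_input("(3, 'x')"))  # multi-arg probes can pass tuples
```

This is also why probing with something like `"__import__('os')"` fails at parse time rather than reaching the black-box function.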

### Reward (v0.3 – paper-driven update)

Inspired by Masud et al. 2026 (*Reward Engineering for RL in Software Tasks*,
arXiv:2601.19100) and Ibrahim et al. 2024 (*Comprehensive Overview of Reward
Engineering and Shaping*, arXiv:2408.10215).

* **Probe:** `-1` step cost, plus `+2` per newly-seen output, `+5` per
  newly-seen exception type, **and `+0.5` per newly-explored input bucket**
  (CovRL-Fuzz / SimHash-style coverage bonus).
* **Submit (terminal):**
  `execution_reward βˆ’ complexity_penalty βˆ’ reward_hack_penalty βˆ’ floor_penalty
  (+50 perfect bonus if 100% match)` where:
  * `execution_reward` ∈ `[0, 100]` is computed over **stratified** fuzz
    inputs: spec-defined `edge_cases` are *always* tested in addition to the
    random fuzz batch, and the per-category match counts are returned in
    `info["matches_by_category"]`.
  * `floor_penalty` is a hard `-25` for sub-50% match-rate submissions
    (Vul-R2 style; Wen et al. 2025), preventing agents from learning that
    emitting *any* function pays out.
  * `reward_hack_penalty` fires for static import-of-reference attempts
    (`+25`) and for "constant-output" collapse against a diverse reference
    (`+15`). The sandbox additionally **blocks** `__import__`, `open`,
    `eval`, `exec`, `compile`, etc.
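As a sanity check on the arithmetic, here is a minimal sketch of the terminal submit score. This is our illustration, not the env's code: the complexity and hack penalties are taken as inputs because their exact computation isn't specified above.

```python
def submit_reward(match_rate: float,
                  complexity_penalty: float = 0.0,
                  reward_hack_penalty: float = 0.0) -> float:
    """Sketch of the terminal submit reward described above.

    execution_reward maps the 0..1 match rate into [0, 100]; a hard
    -25 floor penalty applies below a 50% match rate (Vul-R2 style),
    and a +50 bonus applies on a perfect match.
    """
    execution_reward = 100.0 * match_rate
    floor_penalty = 25.0 if match_rate < 0.5 else 0.0
    perfect_bonus = 50.0 if match_rate == 1.0 else 0.0
    return (execution_reward - complexity_penalty
            - reward_hack_penalty - floor_penalty + perfect_bonus)


print(submit_reward(1.0))  # 150.0 — perfect match earns the +50 bonus
print(submit_reward(0.4))  # 15.0  — 40 execution reward minus the -25 floor
```

Note the discontinuity at 50%: a 0.49 match scores 24, a 0.51 match scores 51, which is exactly the gap that stops "emit anything" from paying out.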

### Open-ended tasks (Level 2)

The env resolves a target function from three sources, in priority order:

1. **Caller-supplied** β€” `POST /reset` with `target_code` + `target_function_name`
   (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
   same hardened sandbox the verifier uses for agent submissions; static-import
   of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
   arbitrary unseen task per rollout without any redeploy.

2. **Hub dataset** β€” [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
   Loaded lazily on first `/reset`, cached in-process. Each row has
   `{name, target_function_name, signature, description, difficulty,
   source_code, edge_cases_json, fuzz_spec_json}`.

3. **Builtin registry** β€” the original 9 functions in `black_box.py` are kept
   as the safety-net so the in-flight trainer keeps working unchanged. Builtins
   *win* by name over Hub copies, so `target_name="fibonacci"` always resolves
   to the in-process oracle.
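The resolution order above can be sketched as follows (the function name and catalog shapes are ours, not the env's internals):

```python
def resolve_target(target_code=None, target_name=None,
                   builtins=None, hub_rows=None):
    """Sketch of the three-way target resolution described above:
    caller-supplied code wins outright; otherwise a builtin with the
    requested name beats a Hub copy; otherwise fall back to the Hub."""
    builtins = builtins or {}
    hub_rows = hub_rows or {}
    if target_code is not None:
        return ("caller", target_code)
    if target_name in builtins:
        return ("builtin", builtins[target_name])
    if target_name in hub_rows:
        return ("hub", hub_rows[target_name])
    raise KeyError(f"unknown target: {target_name!r}")
```

So even if the Hub dataset also contains a `fibonacci` row, `target_name="fibonacci"` without `target_code` resolves to the builtin oracle.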

#### Adding new tasks

* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
  `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
  introspects `inspect.signature` + `typing.get_type_hints`); pass
  `edge_cases` as a list of Python literal strings and `fuzz_spec` as a
  per-parameter override map.

* **Persistent**: append a row to the Hub dataset and the env will pick it
  up on its next process-start. The bootstrap script
  (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent β€”
  re-running it overwrites the dataset with the latest builtin + curated
  rows.

```bash
# Push the curated 9 + 6 = 15-task seed catalog.
PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
```
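Putting the per-reset fields together, a one-shot `/reset` body might look like the following. The task itself is made up, and the exact per-parameter `fuzz_spec` format is an assumption — check the env source for the real schema.

```python
# Illustrative one-shot /reset payload (field names from the README;
# the target function and the fuzz_spec value shape are invented here).
reset_body = {
    "target_code": "def double(x):\n    return 2 * x\n",
    "target_function_name": "double",
    "edge_cases": ["0", "-1", "10**6"],  # Python literal strings
    "fuzz_spec": {"x": {"kind": "int", "min": -100, "max": 100}},  # assumed shape
    "max_steps": 25,
}
```

The env compiles `target_code` in the same hardened sandbox it uses for agent submissions, so the same import/builtin restrictions apply to trainer-supplied tasks.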

### Backwards compatibility

Existing trainer / eval clients only read `info["execution_reward"]`,
`info["matches"]`, `info["fuzz_count"]` and `resp["reward"]` β€” all preserved
with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
`matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
`floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.

`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
"max_steps": 25}` works exactly as before. The four new optional fields
(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
additive. `/functions` returns the same shape as before (with one *additive*
`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
endpoint so older clients aren't surprised.
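A minimal sketch of the defensive read an older client effectively does — stable fields only, additive fields ignored (the response values here are illustrative):

```python
def read_step_response(resp: dict) -> dict:
    """Pull out only the stable v0.3 fields an existing trainer/eval
    client reads; any additive fields in info are simply ignored."""
    info = resp.get("info", {})
    return {
        "reward": resp["reward"],
        "execution_reward": info.get("execution_reward"),
        "matches": info.get("matches"),
        "fuzz_count": info.get("fuzz_count"),
    }
```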

## OpenEnv conformance

This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv)
v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant
surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy
endpoints listed above so the in-flight trainer keeps working unchanged.

| OpenEnv route            | Path                  | Notes                                                    |
|--------------------------|-----------------------|----------------------------------------------------------|
| `GET  /health`           | `/openenv/health`     | `{"status": "healthy"}`                                  |
| `GET  /metadata`         | `/openenv/metadata`   | `EnvironmentMetadata` (name, description, version, ...)  |
| `GET  /schema`           | `/openenv/schema`     | JSON schemas for `action`, `observation`, `state`        |
| `GET  /state`            | `/openenv/state`      | Episode `State` (episode_id, step_count, ...)            |
| `POST /reset`            | `/openenv/reset`      | Returns `{"observation", "reward", "done"}` envelope     |
| `POST /step`             | `/openenv/step`       | Body: `{"action": {"action_type": "probe"\|"submit", ...}}` |
| `WS   /ws`               | `/openenv/ws`         | Persistent session: `reset` β†’ `step`* β†’ `state` β†’ `close` |

`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses
`openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware
harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations,
LightningAI Studio, ...) can pick it up via standard introspection.

### Talking to it as an OpenEnv client

```python
import asyncio
from openenv import GenericEnvClient, GenericAction

async def main():
    base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
    async with GenericEnvClient(base_url=base) as env:
        result = await env.reset(target_name="fibonacci", max_steps=8)
        result = await env.step(GenericAction(action_type="probe", input_repr="10"))
        print(result.observation["probe_history"][-1])

asyncio.run(main())
```

A runnable end-to-end example lives in [`example_client.py`](example_client.py).

### What is *not* yet conformant

* No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not
  MCP tools, because the underlying probe/submit semantics map cleanly to a
  single `OpenSleuthAction` discriminator. Adding MCP would be additive.
* No Rubric/EvalHarness integration (RFC 004) β€” reward shaping lives in
  `opensleuth_env/env.py` and is intentionally not split into a separate
  rubric for now.

## Hardware

CPU-only β€” `cpu-basic` is plenty. Do **not** assign GPU to this Space.

## Running locally

```bash
pip install -r requirements.txt
uvicorn server:app --port 7860 --reload
# legacy contract:                http://localhost:7860/{health,reset,step,state/{eid}}
# OpenEnv-conformant surface:     http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
```

To run only the OpenEnv conformance tests:

```bash
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
```