anugrah55 committed on
Commit 77e65fb · verified · 1 Parent(s): 536dda7

Level 2 open-ended env: auto-fuzzer + TaskCatalog + Hub-driven catalog + extended /reset


Adds opensleuth_env/auto_fuzzer.py (type-driven fuzz input generator), opensleuth_env/task_catalog.py (resolves caller-supplied target_code, Hub dataset rows, builtin registry in priority order), opensleuth_env/scripts/bootstrap_tasks_dataset.py (idempotent push of 9 builtin + 6 new tasks to anugrah55/opensleuth-tasks). Extends ResetRequest with target_code/target_function_name/edge_cases/fuzz_spec, adds GET /tasks endpoint, threads unpack_args through env+verifier for multi-arg targets. The 9 builtin functions and the v0.3 /reset shape are kept as the safety net so the in-flight trainer keeps working unchanged. 61 unit tests pass.

README.md CHANGED
@@ -19,9 +19,10 @@ function by probing it, then submit Python source that replicates it.
 
 | Method | Path | Body | Notes |
 |-------:|---------------|----------------------------------------|----------------------------------------|
-| GET | `/health` | — | Liveness probe. |
-| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of available black-boxes (with curriculum metadata). |
-| POST | `/reset` | `{"target_name": "fibonacci", "seed": 0}` | Starts a new episode, returns initial obs + `episode_id`. |
+| GET | `/health` | — | Liveness probe (also reports Hub-catalog status). |
+| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of the 9 builtin black-boxes (back-compat shape). |
+| GET | `/tasks` | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
+| POST | `/reset` | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied `target_code` wins over `target_name`. |
 | POST | `/step` | `{"episode_id": "...", "action": {...}}` | One agent action. |
 | GET | `/state/{eid}`| — | Inspect the live state of an episode (debug). |
@@ -56,6 +57,45 @@ Engineering and Shaping*, arXiv:2408.10215).
 (`+15`). The sandbox additionally **blocks** `__import__`, `open`,
 `eval`, `exec`, `compile`, etc.
 
+### Open-ended tasks (Level 2)
+
+The env resolves a target function from three sources, in priority order:
+
+1. **Caller-supplied** — `POST /reset` with `target_code` + `target_function_name`
+   (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
+   same hardened sandbox the verifier uses for agent submissions; static import
+   of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
+   arbitrary unseen task per rollout without any redeploy.
+
+2. **Hub dataset** — [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
+   Loaded lazily on first `/reset`, cached in-process. Each row has
+   `{name, target_function_name, signature, description, difficulty,
+   source_code, edge_cases_json, fuzz_spec_json}`.
+
+3. **Builtin registry** — the original 9 functions in `black_box.py` are kept
+   as the safety net so the in-flight trainer keeps working unchanged. Builtins
+   *win* by name over Hub copies, so `target_name="fibonacci"` always resolves
+   to the in-process oracle.
+
+#### Adding new tasks
+
+* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
+  `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
+  introspects `inspect.signature` + `typing.get_type_hints`); pass
+  `edge_cases` as a list of Python literal strings and `fuzz_spec` as a
+  per-parameter override map.
+
+* **Persistent**: append a row to the Hub dataset and the env will pick it
+  up on its next process start. The bootstrap script
+  (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent --
+  re-running it overwrites the dataset with the latest builtin + curated
+  rows.
+
+```bash
+# Push the curated 9 + 6 = 15-task seed catalog.
+PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
+```
+
 ### Backwards compatibility
 
 Existing trainer / eval clients only read `info["execution_reward"]`,
@@ -64,6 +104,13 @@ with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
 `matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
 `floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.
 
+`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
+"max_steps": 25}` works exactly as before. The four new optional fields
+(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
+additive. `/functions` returns the same shape as before (with one *additive*
+`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
+endpoint so older clients aren't surprised.
+
 ## Hardware
 
 CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.
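For a client, the Level 2 reset body is plain JSON. Here is a minimal, illustrative sketch of building one — the `clamp` target, its edge cases, and the fuzz bounds are invented for this example, not part of the shipped catalog:

```python
import json

# Hypothetical caller-supplied target: the trainer hands the env an
# unseen multi-arg function for this rollout only.
reset_payload = {
    "target_code": (
        "def clamp(x: int, lo: int, hi: int) -> int:\n"
        "    return max(lo, min(hi, x))\n"
    ),
    "target_function_name": "clamp",
    # Edge cases are Python literal strings; tuples for multi-arg targets.
    "edge_cases": ["(0, 0, 0)", "(-5, -1, 1)"],
    # Per-parameter override map consumed by the auto-fuzzer.
    "fuzz_spec": {"x": {"type": "int", "min": -50, "max": 50}},
    "seed": 0,
    "max_steps": 25,
}

body = json.dumps(reset_payload)  # POST this to /reset
```

Because `target_code` is set, it takes priority over any `target_name` the payload might also carry.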
opensleuth_env/__init__.py CHANGED
@@ -12,6 +12,8 @@ from .models import (
     StepRequest,
 )
 from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
+from .task_catalog import TaskCatalog, TaskResolutionError, HUB_DATASET_ID
+from .auto_fuzzer import auto_fuzz, make_fuzzer
 
 __all__ = [
     "OpenSleuthEnv",
@@ -25,4 +27,9 @@ __all__ = [
     "StepRequest",
     "BLACK_BOX_FUNCTIONS",
     "FunctionSpec",
+    "TaskCatalog",
+    "TaskResolutionError",
+    "HUB_DATASET_ID",
+    "auto_fuzz",
+    "make_fuzzer",
 ]
opensleuth_env/auto_fuzzer.py ADDED
@@ -0,0 +1,383 @@
+"""Generic, type-driven fuzz-input generator for OpenSleuth Level 2.
+
+Given a Python callable annotated with ``typing`` hints, ``auto_fuzz`` produces
+``n`` argument tuples that respect the signature so the verifier can score
+unannotated *arbitrary* targets without requiring a hand-written fuzzer the
+way the 9 builtin BLACK_BOX_FUNCTIONS do.
+
+Each per-type generator mixes a small set of "edge" values (``0``, ``-1``,
+``""``, ``None`` for ``Optional``, ...) with random values, weighted ~30/70.
+This biases the fuzz batch toward the boundaries that actually distinguish
+implementations while still covering the boring middle.
+
+A caller-supplied ``fuzz_spec: dict`` overrides the type-based generation on
+a per-parameter basis, e.g.::
+
+    auto_fuzz(my_fn, n=20, fuzz_spec={"n": {"type": "int", "min": 1, "max": 90}})
+
+Returned shape: ``List[tuple]`` -- one tuple per fuzz input, with one element
+per (positional) parameter of ``fn``. Even for unary ``fn`` we return tuples
+so the catalog wrapper has a single, uniform calling convention.
+"""
+
+from __future__ import annotations
+
+import inspect
+import random
+import string
+import typing
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union, get_args, get_origin
+
+
+# Probability that a per-type generator emits an "edge" value (0, "", None,
+# ...) instead of a random sample. Kept small enough that the boring middle
+# still gets coverage but high enough that the edge cases reliably appear.
+EDGE_PROB = 0.30
+
+
+# Per-type edge pools. These are used by the ``_g_*`` helpers below.
+_INT_EDGES = (0, 1, -1, 2, -2, 10, -10, 100, -100)
+_FLOAT_EDGES = (0.0, 1.0, -1.0, 0.5, -0.5, 1e-9, -1e-9, 100.0)
+_STR_EDGES = ("", "a", "ab", "Hello", " ", "0", "abc def")
+_BYTES_EDGES = (b"", b"a", b"ab", b"\x00", b"abc")
+
+
+# ---------------------------------------------------------------------------
+# Per-type generators (do not assume any param-name dispatch).
+# ---------------------------------------------------------------------------
+
+
+def _maybe_edge(rng: random.Random, edges: tuple, random_fn: Callable[[], Any]) -> Any:
+    if edges and rng.random() < EDGE_PROB:
+        return rng.choice(edges)
+    return random_fn()
+
+
+def _g_int(rng: random.Random, *, lo: int = -100, hi: int = 100) -> int:
+    # Filter the edge pool by [lo, hi] so a caller-supplied fuzz_spec
+    # ``{"type": "int", "min": 1, "max": 5}`` never emits ``-100``.
+    edges = tuple(v for v in _INT_EDGES if lo <= v <= hi) or (lo,)
+    return _maybe_edge(rng, edges, lambda: rng.randint(lo, hi))
+
+
+def _g_float(rng: random.Random, *, lo: float = -100.0, hi: float = 100.0) -> float:
+    edges = tuple(v for v in _FLOAT_EDGES if lo <= v <= hi) or (lo,)
+    return _maybe_edge(rng, edges, lambda: rng.uniform(lo, hi))
+
+
+def _g_bool(rng: random.Random) -> bool:
+    return bool(rng.getrandbits(1))
+
+
+def _g_str(rng: random.Random, *, max_len: int = 12, alphabet: Optional[str] = None) -> str:
+    alpha = alphabet or (string.ascii_letters + string.digits)
+
+    def _rand():
+        return "".join(rng.choices(alpha, k=rng.randint(0, max_len)))
+
+    if alphabet is not None:
+        # When the caller restricts the alphabet, our generic edge pool
+        # ("Hello", " ", ...) would violate it. Build a deterministic
+        # alphabet-respecting edge set instead.
+        custom_edges = ("",)
+        if alphabet:
+            custom_edges = ("", alphabet[0], alphabet[0] * min(max_len, 2))
+        return _maybe_edge(rng, custom_edges, _rand)
+    return _maybe_edge(rng, _STR_EDGES, _rand)
+
+
+def _g_bytes(rng: random.Random, *, max_len: int = 8) -> bytes:
+    def _rand():
+        return bytes(rng.randint(0, 255) for _ in range(rng.randint(0, max_len)))
+
+    return _maybe_edge(rng, _BYTES_EDGES, _rand)
+
+
+def _g_list(rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6) -> list:
+    if rng.random() < EDGE_PROB / 2:
+        return []
+    return [elem_gen() for _ in range(rng.randint(0, max_len))]
+
+
+def _g_tuple_homogeneous(
+    rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6
+) -> tuple:
+    return tuple(_g_list(rng, elem_gen, max_len=max_len))
+
+
+def _g_tuple_heterogeneous(rng: random.Random, elem_gens: List[Callable[[], Any]]) -> tuple:
+    return tuple(g() for g in elem_gens)
+
+
+def _g_set(rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6) -> set:
+    if rng.random() < EDGE_PROB / 2:
+        return set()
+    return {elem_gen() for _ in range(rng.randint(0, max_len))}
+
+
+def _g_dict(
+    rng: random.Random,
+    key_gen: Callable[[], Any],
+    val_gen: Callable[[], Any],
+    *,
+    max_len: int = 5,
+) -> dict:
+    if rng.random() < EDGE_PROB / 2:
+        return {}
+    return {key_gen(): val_gen() for _ in range(rng.randint(0, max_len))}
+
+
+# ---------------------------------------------------------------------------
+# Type -> generator dispatch.
+# ---------------------------------------------------------------------------
+
+
+def _is_optional(tp: Any) -> bool:
+    """``Optional[X]`` is ``Union[X, None]`` under the hood."""
+    if get_origin(tp) is Union:
+        return type(None) in get_args(tp)
+    return False
+
+
+def _strip_optional(tp: Any) -> Any:
+    """Return ``X`` for ``Optional[X]``; for unions with None + multiple, pick
+    the first non-None member (we can't satisfy a union in a single call)."""
+    if get_origin(tp) is Union:
+        non_none = [a for a in get_args(tp) if a is not type(None)]
+        if len(non_none) == 1:
+            return non_none[0]
+        if non_none:
+            return non_none[0]
+    return tp
+
+
+def _make_generator(tp: Any, rng: random.Random) -> Callable[[], Any]:
+    """Return a 0-arg callable that produces one random value of type ``tp``.
+
+    The recursion handles container element types (``list[int]``,
+    ``dict[str, list[int]]``, etc).
+    """
+
+    if tp is None or tp is type(None):
+        return lambda: None
+
+    if _is_optional(tp):
+        inner = _strip_optional(tp)
+        inner_gen = _make_generator(inner, rng)
+
+        def _gen_opt():
+            if rng.random() < EDGE_PROB:
+                return None
+            return inner_gen()
+
+        return _gen_opt
+
+    origin = get_origin(tp)
+
+    if origin is typing.Literal:
+        choices = list(get_args(tp))
+        return lambda: rng.choice(choices)
+
+    if origin is None:
+        if tp is int:
+            return lambda: _g_int(rng)
+        if tp is float:
+            return lambda: _g_float(rng)
+        if tp is bool:
+            return lambda: _g_bool(rng)
+        if tp is str:
+            return lambda: _g_str(rng)
+        if tp is bytes:
+            return lambda: _g_bytes(rng)
+        if tp is list:
+            return lambda: _g_list(rng, lambda: _g_int(rng))
+        if tp is tuple:
+            return lambda: _g_tuple_homogeneous(rng, lambda: _g_int(rng))
+        if tp is set:
+            return lambda: _g_set(rng, lambda: _g_int(rng))
+        if tp is dict:
+            return lambda: _g_dict(rng, lambda: _g_str(rng, max_len=4), lambda: _g_int(rng))
+        if tp is type(None):
+            return lambda: None
+        if tp is typing.Any:
+            return lambda: _g_int(rng)
+        # Unknown bare type -> fall back to int.
+        return lambda: _g_int(rng)
+
+    args = get_args(tp)
+
+    if origin in (list, List):
+        elem_t = args[0] if args else int
+        elem_gen = _make_generator(elem_t, rng)
+        return lambda: _g_list(rng, elem_gen)
+
+    if origin in (set, frozenset):
+        elem_t = args[0] if args else int
+        elem_gen = _make_generator(elem_t, rng)
+        return lambda: _g_set(rng, elem_gen)
+
+    if origin in (tuple, Tuple):
+        if not args:
+            return lambda: _g_tuple_homogeneous(rng, lambda: _g_int(rng))
+        if len(args) == 2 and args[1] is Ellipsis:
+            elem_gen = _make_generator(args[0], rng)
+            return lambda: _g_tuple_homogeneous(rng, elem_gen)
+        elem_gens = [_make_generator(a, rng) for a in args]
+        return lambda: _g_tuple_heterogeneous(rng, elem_gens)
+
+    if origin in (dict, Dict):
+        key_t = args[0] if args else str
+        val_t = args[1] if len(args) > 1 else int
+        key_gen = _make_generator(key_t, rng)
+        val_gen = _make_generator(val_t, rng)
+        return lambda: _g_dict(rng, key_gen, val_gen)
+
+    if origin is Union:
+        # Already handled Optional above. For pure unions, pick first member.
+        return _make_generator(args[0], rng)
+
+    return lambda: _g_int(rng)
+
+
+# ---------------------------------------------------------------------------
+# fuzz_spec overrides
+# ---------------------------------------------------------------------------
+
+
+def _generator_from_spec(entry: Dict[str, Any], rng: random.Random) -> Callable[[], Any]:
+    """Build a generator from a ``fuzz_spec`` entry dict.
+
+    Supported keys (all optional except ``type``):
+    - ``type``: one of ``"int" | "float" | "bool" | "str" | "bytes" |
+      "list" | "tuple" | "set" | "dict" | "literal" | "any"``
+    - ``min``, ``max``: int/float bounds
+    - ``max_len``: container/string length cap
+    - ``alphabet``: str-only character pool
+    - ``elem``: nested ``fuzz_spec`` entry for container elements
+    - ``key``, ``value``: nested entries for dict
+    - ``elems``: list of nested entries for fixed-arity tuple
+    - ``choices``: list of literals to sample from
+    - ``optional``: bool; if True, occasionally yields ``None``
+    """
+    t = entry.get("type", "any")
+
+    def _maybe_optional(gen: Callable[[], Any]) -> Callable[[], Any]:
+        if not entry.get("optional"):
+            return gen
+
+        def _g():
+            if rng.random() < EDGE_PROB:
+                return None
+            return gen()
+
+        return _g
+
+    if t == "int":
+        lo = int(entry.get("min", -100))
+        hi = int(entry.get("max", 100))
+        return _maybe_optional(lambda: _g_int(rng, lo=lo, hi=hi))
+    if t == "float":
+        lo = float(entry.get("min", -100.0))
+        hi = float(entry.get("max", 100.0))
+        return _maybe_optional(lambda: _g_float(rng, lo=lo, hi=hi))
+    if t == "bool":
+        return _maybe_optional(lambda: _g_bool(rng))
+    if t == "str":
+        max_len = int(entry.get("max_len", 12))
+        alphabet = entry.get("alphabet")
+        return _maybe_optional(lambda: _g_str(rng, max_len=max_len, alphabet=alphabet))
+    if t == "bytes":
+        max_len = int(entry.get("max_len", 8))
+        return _maybe_optional(lambda: _g_bytes(rng, max_len=max_len))
+    if t == "literal":
+        choices = list(entry.get("choices", []))
+        if not choices:
+            return _maybe_optional(lambda: None)
+        return _maybe_optional(lambda: rng.choice(choices))
+    if t == "list":
+        elem = entry.get("elem", {"type": "int"})
+        elem_gen = _generator_from_spec(elem, rng)
+        max_len = int(entry.get("max_len", 6))
+        return _maybe_optional(lambda: _g_list(rng, elem_gen, max_len=max_len))
+    if t == "tuple":
+        if "elems" in entry:
+            elem_gens = [_generator_from_spec(e, rng) for e in entry["elems"]]
+            return _maybe_optional(lambda: _g_tuple_heterogeneous(rng, elem_gens))
+        elem = entry.get("elem", {"type": "int"})
+        elem_gen = _generator_from_spec(elem, rng)
+        max_len = int(entry.get("max_len", 6))
+        return _maybe_optional(lambda: _g_tuple_homogeneous(rng, elem_gen, max_len=max_len))
+    if t == "set":
+        elem = entry.get("elem", {"type": "int"})
+        elem_gen = _generator_from_spec(elem, rng)
+        max_len = int(entry.get("max_len", 6))
+        return _maybe_optional(lambda: _g_set(rng, elem_gen, max_len=max_len))
+    if t == "dict":
+        key = entry.get("key", {"type": "str", "max_len": 4})
+        value = entry.get("value", {"type": "int"})
+        key_gen = _generator_from_spec(key, rng)
+        val_gen = _generator_from_spec(value, rng)
+        max_len = int(entry.get("max_len", 5))
+        return _maybe_optional(lambda: _g_dict(rng, key_gen, val_gen, max_len=max_len))
+    return _maybe_optional(lambda: _g_int(rng))
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+
+def auto_fuzz(
+    fn: Callable[..., Any],
+    n: int,
+    rng: Optional[random.Random] = None,
+    *,
+    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+) -> List[tuple]:
+    """Produce ``n`` argument tuples for calling ``fn``.
+
+    Each returned element is an ``args`` tuple, intended to be applied as
+    ``fn(*args)``. ``fuzz_spec`` is keyed by parameter name and overrides
+    the type-based generation per-parameter.
+    """
+    rng = rng or random.Random()
+    fuzz_spec = fuzz_spec or {}
+
+    sig = inspect.signature(fn)
+    try:
+        hints = typing.get_type_hints(fn)
+    except Exception:  # noqa: BLE001 -- bad annotations shouldn't crash fuzzing
+        hints = {}
+
+    param_gens: List[Callable[[], Any]] = []
+    for pname, param in sig.parameters.items():
+        if param.kind in (
+            inspect.Parameter.VAR_POSITIONAL,
+            inspect.Parameter.VAR_KEYWORD,
+            inspect.Parameter.KEYWORD_ONLY,
+        ):
+            # We only fuzz positional / positional-or-keyword params.
+            continue
+        if pname in fuzz_spec:
+            param_gens.append(_generator_from_spec(fuzz_spec[pname], rng))
+            continue
+        annot = hints.get(pname, param.annotation)
+        if annot is inspect.Parameter.empty:
+            param_gens.append(lambda r=rng: _g_int(r))
+        else:
+            param_gens.append(_make_generator(annot, rng))
+
+    return [tuple(g() for g in param_gens) for _ in range(n)]
+
+
+def make_fuzzer(
+    fn: Callable[..., Any],
+    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+) -> Callable[[random.Random, int], List[tuple]]:
+    """Adapt ``auto_fuzz`` to the ``FunctionSpec.fuzzer`` signature
+    (``(rng, n) -> list``)."""
+
+    def _fuzzer(rng: random.Random, n: int) -> List[tuple]:
+        return auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
+
+    return _fuzzer
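The signature-introspection loop at the heart of the module can be illustrated with a self-contained miniature. This is a simplified re-implementation for illustration only (the names `tiny_auto_fuzz` and `repeat` are invented here), not the module's actual code, which also handles containers, `Optional`, and edge-value biasing:

```python
import inspect
import random
import typing


def tiny_auto_fuzz(fn, n, rng):
    """Minimal sketch: one value generator per positional parameter,
    chosen from the parameter's type hint; unannotated params fall
    back to int, mirroring the fallback in the full module."""
    hints = typing.get_type_hints(fn)
    gens = []
    for name in inspect.signature(fn).parameters:
        t = hints.get(name, int)
        if t is str:
            gens.append(lambda: "".join(rng.choices("ab", k=rng.randint(0, 5))))
        else:  # int and unknown types alike
            gens.append(lambda: rng.randint(-100, 100))
    # One args-tuple per fuzz input, even for unary fn.
    return [tuple(g() for g in gens) for _ in range(n)]


def repeat(s: str, k: int) -> str:
    return s * max(k, 0)


batch = tiny_auto_fuzz(repeat, 5, random.Random(0))
```

Each element of `batch` is a `(str, int)` tuple suitable for `repeat(*args)`, which is the uniform calling convention the catalog wrapper relies on.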
opensleuth_env/black_box.py CHANGED
@@ -180,7 +180,7 @@ def _fuzz_prime_int(rng: random.Random, n: int) -> List[int]:
 @dataclass(frozen=True)
 class FunctionSpec:
     name: str
-    fn: Callable[[Any], Any]
+    fn: Callable[..., Any]
     signature: str
     description: str
     fuzzer: Callable[[random.Random, int], list]
@@ -189,6 +189,16 @@ class FunctionSpec:
     # fuzz batch. They are scored as their own category ("edge") so the
     # verifier can report stratified pass-rates back to the trainer.
     edge_cases: List[Any] = field(default_factory=list)
+    # Calling convention. When False (the default, used by all 9 builtins),
+    # ``fn(arg)`` is invoked with a single positional argument -- whatever
+    # the fuzzer produced. When True (used by the auto-fuzzer-generated
+    # specs for multi-parameter target functions), each fuzz input is a
+    # *tuple of args* and is unpacked: ``fn(*args)``.
+    unpack_args: bool = False
+    # Provenance: where this spec came from. Useful for /tasks?source=...
+    # Defaults to "builtin" for backwards compatibility with the original
+    # 9 hand-written specs.
+    source: str = "builtin"
 
 
 BLACK_BOX_FUNCTIONS: Dict[str, FunctionSpec] = {
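The `unpack_args` calling convention can be shown in isolation. A minimal sketch (the `MiniSpec` and `call_target` names are invented stand-ins, not the repo's `FunctionSpec` or its call sites):

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class MiniSpec:
    # Simplified stand-in for FunctionSpec, keeping only the fields
    # needed to illustrate the calling convention.
    fn: Callable[..., Any]
    unpack_args: bool = False


def call_target(spec: MiniSpec, fuzz_input: Any) -> Any:
    # Builtin (unary) specs get the fuzz input as one positional arg;
    # multi-parameter specs get a tuple that is unpacked.
    if spec.unpack_args:
        return spec.fn(*fuzz_input)
    return spec.fn(fuzz_input)


unary = MiniSpec(fn=lambda x: x * 2)
multi = MiniSpec(fn=lambda a, b: a + b, unpack_args=True)
```

With `unary`, `call_target(unary, 3)` invokes `fn(3)`; with `multi`, `call_target(multi, (2, 5))` invokes `fn(2, 5)` — the same branch the env and verifier take based on `spec.unpack_args`.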
opensleuth_env/env.py CHANGED
@@ -38,7 +38,7 @@ from __future__ import annotations
 import ast
 import logging
 import uuid
-from typing import Any, Tuple
+from typing import Any, Optional, Tuple
 
 from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
 from .models import (
@@ -50,6 +50,7 @@ from .models import (
     StepResponse,
     SubmitAction,
 )
+from .task_catalog import TaskCatalog, TaskResolutionError
 from .verifier import generate_fuzz_inputs, get_edge_inputs, verify_submission
 
 log = logging.getLogger("opensleuth.env")
@@ -121,39 +122,78 @@ def _bucket_of(x: Any) -> str:
 class OpenSleuthEnv:
     """Multi-episode environment registry."""
 
-    def __init__(self, fuzz_count: int = 100) -> None:
+    def __init__(
+        self,
+        fuzz_count: int = 100,
+        catalog: Optional["TaskCatalog"] = None,
+    ) -> None:
         self._states: dict[str, State] = {}
         self._configs: dict[str, dict] = {}
+        # Per-episode resolved spec. We cache it here (rather than looking it
+        # up by name on every step from BLACK_BOX_FUNCTIONS) because
+        # caller-supplied / Hub-loaded specs aren't in BLACK_BOX_FUNCTIONS.
+        self._episode_specs: dict[str, FunctionSpec] = {}
         self.fuzz_count = fuzz_count
+        self._catalog = catalog or TaskCatalog()
+
+    @property
+    def catalog(self) -> "TaskCatalog":
+        return self._catalog
 
     # --- Lifecycle ---------------------------------------------------------
 
-    def reset(self, target_name: str, seed: int = 0, max_steps: int = 25) -> Observation:
-        if target_name not in BLACK_BOX_FUNCTIONS:
-            raise ValueError(
-                f"Unknown target function: {target_name!r}. "
-                f"Available: {sorted(BLACK_BOX_FUNCTIONS)}"
+    def reset(
+        self,
+        target_name: Optional[str] = None,
+        seed: int = 0,
+        max_steps: int = 25,
+        *,
+        target_code: Optional[str] = None,
+        target_function_name: Optional[str] = None,
+        edge_cases: Optional[list] = None,
+        fuzz_spec: Optional[dict] = None,
+    ) -> Observation:
+        # Backwards-compat: legacy callers pass ``target_name="fibonacci"``
+        # only. The catalog handles that path identically to before.
+        try:
+            spec = self._catalog.resolve(
+                target_name=target_name,
+                target_code=target_code,
+                target_function_name=target_function_name,
+                edge_cases=edge_cases,
+                fuzz_spec=fuzz_spec,
             )
-        spec = BLACK_BOX_FUNCTIONS[target_name]
+        except TaskResolutionError as e:
+            raise ValueError(str(e)) from e
         episode_id = uuid.uuid4().hex
         self._states[episode_id] = State(
             episode_id=episode_id,
-            target_function_name=target_name,
+            target_function_name=spec.name,
             seed=seed,
         )
         self._configs[episode_id] = {"max_steps": max_steps}
+        self._episode_specs[episode_id] = spec
         return self._build_observation(episode_id, spec, last_error="")
 
+    def _spec_for(self, state: State) -> FunctionSpec:
+        spec = self._episode_specs.get(state.episode_id)
+        if spec is not None:
+            return spec
+        # Legacy fallback: if an episode was created before we started
+        # caching specs (or via a code path that bypassed reset), look up
+        # by name in the builtin registry.
+        return BLACK_BOX_FUNCTIONS[state.target_function_name]
+
     def step(self, episode_id: str, action: Action) -> StepResponse:
         state = self._states.get(episode_id)
         if state is None:
             raise KeyError(f"Unknown episode_id {episode_id!r}. Did you /reset first?")
         if state.done:
-            spec = BLACK_BOX_FUNCTIONS[state.target_function_name]
+            spec = self._spec_for(state)
             obs = self._build_observation(episode_id, spec, last_error="Episode already terminated.")
             return StepResponse(observation=obs, reward=0.0, done=True, info={"reason": "already_done"})
 
-        spec = BLACK_BOX_FUNCTIONS[state.target_function_name]
+        spec = self._spec_for(state)
         state.steps_taken += 1
         max_steps = self._configs[episode_id]["max_steps"]
 
@@ -205,7 +245,15 @@ class OpenSleuthEnv:
         intrinsic = 0.0
         last_error = ""
         try:
-            output = spec.fn(parsed)
+            if spec.unpack_args:
+                if not isinstance(parsed, tuple):
+                    raise TypeError(
+                        f"Multi-parameter target {spec.name!r} expects a tuple "
+                        f"of args, got {type(parsed).__name__}."
+                    )
+                output = spec.fn(*parsed)
+            else:
+                output = spec.fn(parsed)
             output_repr = repr(output)
             state.probe_history.append(
                 ProbeRecord(
@@ -255,6 +303,7 @@ class OpenSleuthEnv:
             fuzz_inputs,
             target_name=spec.name,
             edge_inputs=edge_inputs,
+            unpack_args=spec.unpack_args,
         )
 
         total = (
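The three-source priority order the catalog implements (caller-supplied beats name lookup; builtins beat Hub rows by name) can be sketched as a standalone function. This is an illustrative reduction — the real `TaskCatalog.resolve` also compiles `target_code` in the sandbox and attaches edge cases and fuzz specs:

```python
def resolve(target_name=None, target_code=None, builtins=None, hub=None):
    """Hypothetical sketch of the resolution order only.

    Caller-supplied target_code wins outright; target_name checks the
    builtin registry before Hub rows, so builtins win by name.
    """
    builtins = builtins or {}
    hub = hub or {}
    if target_code is not None:
        return ("caller", target_code)
    if target_name in builtins:
        return ("builtin", builtins[target_name])
    if target_name in hub:
        return ("hub", hub[target_name])
    raise KeyError(f"unresolvable target: {target_name!r}")
```

For example, a `fibonacci` present in both registries resolves to the builtin copy, matching the "builtins win by name over Hub copies" guarantee.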
opensleuth_env/models.py CHANGED
@@ -91,9 +91,43 @@ class State(BaseModel):
 class ResetRequest(BaseModel):
-    target_name: str = "fibonacci"
+    """Reset payload.
+
+    The original (v0.3) shape ``{"target_name": "fibonacci", "seed": 0,
+    "max_steps": 25}`` still works exactly as before -- the four new fields
+    below are all optional and additive so the in-flight trainer doesn't
+    have to change.
+
+    Open-ended (Level 2) targets are specified by passing ``target_code``
+    + ``target_function_name`` (and optionally ``edge_cases`` and
+    ``fuzz_spec``), which is then resolved via the TaskCatalog using the
+    same hardened sandbox the verifier uses for agent submissions.
+    """
+
+    target_name: Optional[str] = None
     seed: int = 0
     max_steps: int = 25
+    # --- Level 2 open-ended fields (additive, default-None) ---
+    target_code: Optional[str] = Field(
+        default=None,
+        description="Python source defining a black-box callable. When set, "
+        "overrides target_name (caller-supplied beats Hub beats builtin).",
+    )
+    target_function_name: Optional[str] = Field(
+        default=None,
+        description="Name of the callable inside target_code to use as the "
+        "oracle. Required when target_code is set.",
+    )
+    edge_cases: Optional[List[str]] = Field(
+        default=None,
+        description="Optional list of must-pass probe inputs as Python "
+        "literal strings (e.g. ['0', '\"\"', '([1,2,3], 2)']).",
+    )
+    fuzz_spec: Optional[dict] = Field(
+        default=None,
+        description="Optional auto-fuzzer override map keyed by parameter "
+        "name, e.g. {'n': {'type': 'int', 'min': 1, 'max': 90}}.",
+    )


 class StepRequest(BaseModel):
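The extended `/reset` shape above can be sketched as a plain JSON body. A minimal illustration (the 3-arg `clamp` target is hypothetical, not one of the builtins; field names follow the `ResetRequest` model in this diff):

```python
import json

# Hypothetical Level 2 /reset payload: caller-supplied target_code
# overrides target_name (caller beats Hub beats builtin).
payload = {
    "seed": 0,
    "max_steps": 25,
    "target_code": (
        "def clamp(x, lo, hi):\n"
        "    if lo > hi:\n"
        "        raise ValueError('lo must be <= hi')\n"
        "    return max(lo, min(x, hi))\n"
    ),
    "target_function_name": "clamp",
    # Must-pass probe inputs as Python literal strings; a multi-arg
    # target takes tuple literals.
    "edge_cases": ["(5, 0, 10)", "(-1, 0, 10)", "(99, 0, 10)"],
    # Auto-fuzzer overrides keyed by parameter name.
    "fuzz_spec": {
        "x": {"type": "int", "min": -100, "max": 100},
        "lo": {"type": "int", "min": -10, "max": 0},
        "hi": {"type": "int", "min": 1, "max": 10},
    },
}

body = json.dumps(payload)  # what would be POSTed to /reset
```

Omitting the four new fields and sending `{"target_name": "fibonacci", "seed": 0}` keeps the v0.3 behaviour.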
opensleuth_env/scripts/__init__.py ADDED
File without changes
opensleuth_env/scripts/bootstrap_tasks_dataset.py ADDED
@@ -0,0 +1,508 @@
+"""Bootstrap / refresh the OpenSleuth Hub task catalog.
+
+Idempotently creates ``anugrah55/opensleuth-tasks`` and pushes:
+
+* The 9 builtin BLACK_BOX_FUNCTIONS as rows (so the dataset is non-empty
+  for testing and so the trainer's curriculum has parity with the
+  in-process oracle), and
+* 6 brand-new tasks (``roman_to_int``, ``levenshtein_distance``,
+  ``flatten_list``, ``merge_sorted``, ``run_length_encode``,
+  ``binary_search``) that aren't in BLACK_BOX_FUNCTIONS, exercising
+  multi-arg and unannotated cases the auto-fuzzer must handle.
+
+Each row is::
+
+    {
+        "name": str,
+        "target_function_name": str,   # which fn inside source_code
+        "signature": str,
+        "description": str,
+        "difficulty": "easy"|"medium"|"hard",
+        "source_code": str,            # standalone Python; NO oracle imports
+        "edge_cases_json": str,        # JSON list of literal-repr strings
+        "fuzz_spec_json": str,         # JSON dict or "null"
+    }
+
+Run::
+
+    cd env && PYTHONPATH=. ../.venv/bin/python -m opensleuth_env.scripts.bootstrap_tasks_dataset
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import sys
+from typing import Any, Dict, List, Optional
+
+from opensleuth_env.black_box import BLACK_BOX_FUNCTIONS
+
+log = logging.getLogger("opensleuth.bootstrap")
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
+
+DATASET_ID = "anugrah55/opensleuth-tasks"
+
+
+# ---------------------------------------------------------------------------
+# Oracle source code for the 9 builtins (self-contained -- no opensleuth_*
+# imports, so the catalog's static reject filter accepts them).
+# ---------------------------------------------------------------------------
+
+
+_BUILTIN_SOURCE: Dict[str, Dict[str, Any]] = {
+    "fibonacci": {
+        "target_function_name": "fibonacci",
+        "source_code": (
+            "def fibonacci(n):\n"
+            "    if not isinstance(n, int) or isinstance(n, bool) or n <= 0 or n > 90:\n"
+            "        raise ValueError('Input must be a positive integer <= 90.')\n"
+            "    a, b = 0, 1\n"
+            "    for _ in range(n - 1):\n"
+            "        a, b = b, a + b\n"
+            "    return b if n > 0 else a\n"
+        ),
+        "edge_cases": ["1", "2", "3", "10", "89", "90"],
+        "fuzz_spec": {"n": {"type": "int", "min": 1, "max": 90}},
+    },
+    "reverse_string": {
+        "target_function_name": "reverse_string",
+        "source_code": (
+            "def reverse_string(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    return s[::-1]\n"
+        ),
+        "edge_cases": ['""', '"a"', '"ab"', '"racecar"', '"Hello, World!"'],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 12}},
+    },
+    "is_palindrome": {
+        "target_function_name": "is_palindrome",
+        "source_code": (
+            "def is_palindrome(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())\n"
+            "    return cleaned == cleaned[::-1]\n"
+        ),
+        "edge_cases": [
+            '""', '"a"', '"ab"', '"abba"',
+            "\"A man, a plan, a canal: Panama\"", '"Hello"',
+        ],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 12}},
+    },
+    "digit_sum": {
+        "target_function_name": "digit_sum",
+        "source_code": (
+            "def digit_sum(n):\n"
+            "    if not isinstance(n, int) or isinstance(n, bool):\n"
+            "        raise TypeError('Input must be int.')\n"
+            "    if n < 0:\n"
+            "        raise ValueError('Input must be non-negative.')\n"
+            "    return sum(int(c) for c in str(n))\n"
+        ),
+        "edge_cases": ["0", "1", "9", "10", "99", "100", "9999"],
+        "fuzz_spec": {"n": {"type": "int", "min": 0, "max": 10000}},
+    },
+    "count_vowels": {
+        "target_function_name": "count_vowels",
+        "source_code": (
+            "def count_vowels(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    return sum(1 for c in s.lower() if c in 'aeiou')\n"
+        ),
+        "edge_cases": ['""', '"bcd"', '"AEIOU"', '"Hello, World!"', '"aaaaa"'],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 16}},
+    },
+    "gcd": {
+        "target_function_name": "gcd",
+        "source_code": (
+            "def gcd(pair):\n"
+            "    if not isinstance(pair, (list, tuple)) or len(pair) != 2:\n"
+            "        raise TypeError('Input must be a 2-element list or tuple.')\n"
+            "    a, b = pair\n"
+            "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in (a, b)):\n"
+            "        raise TypeError('Both elements must be int.')\n"
+            "    if a < 0 or b < 0:\n"
+            "        raise ValueError('Both elements must be non-negative.')\n"
+            "    while b:\n"
+            "        a, b = b, a % b\n"
+            "    return a\n"
+        ),
+        "edge_cases": ["(0, 0)", "(0, 7)", "(12, 18)", "(17, 13)", "(100, 75)"],
+        "fuzz_spec": {
+            "pair": {
+                "type": "tuple",
+                "elems": [{"type": "int", "min": 0, "max": 1000}, {"type": "int", "min": 0, "max": 1000}],
+            }
+        },
+    },
+    "sort_unique": {
+        "target_function_name": "sort_unique",
+        "source_code": (
+            "def sort_unique(xs):\n"
+            "    if not isinstance(xs, list):\n"
+            "        raise TypeError('Input must be a list.')\n"
+            "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in xs):\n"
+            "        raise TypeError('All elements must be int.')\n"
+            "    return sorted(set(xs))\n"
+        ),
+        "edge_cases": ["[]", "[1]", "[1, 1, 1]", "[3, 1, 2]", "[-5, 5, 0, -5, 5]"],
+        "fuzz_spec": {"xs": {"type": "list", "elem": {"type": "int", "min": -50, "max": 50}, "max_len": 8}},
+    },
+    "caesar_cipher": {
+        "target_function_name": "caesar_cipher",
+        "source_code": (
+            "def caesar_cipher(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    out = []\n"
+            "    for ch in s:\n"
+            "        if 'a' <= ch <= 'z':\n"
+            "            out.append(chr((ord(ch) - ord('a') + 3) % 26 + ord('a')))\n"
+            "        else:\n"
+            "            out.append(ch)\n"
+            "    return ''.join(out)\n"
+        ),
+        "edge_cases": ['""', '"abc"', '"xyz"', '"Hello, World!"', '"ABC"', '"hello world"'],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 16}},
+    },
+    "is_prime": {
+        "target_function_name": "is_prime",
+        "source_code": (
+            "def is_prime(n):\n"
+            "    if not isinstance(n, int) or isinstance(n, bool):\n"
+            "        raise TypeError('Input must be int.')\n"
+            "    if n < 2:\n"
+            "        return False\n"
+            "    if n < 4:\n"
+            "        return True\n"
+            "    if n % 2 == 0:\n"
+            "        return False\n"
+            "    i = 3\n"
+            "    while i * i <= n:\n"
+            "        if n % i == 0:\n"
+            "            return False\n"
+            "        i += 2\n"
+            "    return True\n"
+        ),
+        "edge_cases": ["0", "1", "2", "3", "4", "17", "25", "97", "100"],
+        "fuzz_spec": {"n": {"type": "int", "min": 0, "max": 200}},
+    },
+}
+
+
+# ---------------------------------------------------------------------------
+# Six new tasks. These exercise auto-fuzzer features the builtins didn't:
+#   * multi-arg signatures (binary_search, merge_sorted, levenshtein_distance)
+#   * Optional / Literal hint coverage (run_length_encode -> list[tuple[str, int]])
+#   * unannotated containers (flatten_list)
+# ---------------------------------------------------------------------------
+
+
+_NEW_TASK_ROWS: List[Dict[str, Any]] = [
+    {
+        "name": "roman_to_int",
+        "target_function_name": "roman_to_int",
+        "signature": "roman_to_int(s: str) -> int",
+        "description": (
+            "Parse a roman numeral string into its integer value. "
+            "Raises ValueError for non-roman characters. Subtraction "
+            "rules (IV=4, IX=9, XL=40, ...) are honoured. Empty -> 0."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def roman_to_int(s: str) -> int:\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('input must be str')\n"
+            "    table = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}\n"
+            "    total = 0\n"
+            "    prev = 0\n"
+            "    for ch in reversed(s.upper()):\n"
+            "        if ch not in table:\n"
+            "            raise ValueError(f'invalid roman numeral character: {ch!r}')\n"
+            "        v = table[ch]\n"
+            "        if v < prev:\n"
+            "            total -= v\n"
+            "        else:\n"
+            "            total += v\n"
+            "        prev = v\n"
+            "    return total\n"
+        ),
+        "edge_cases": ['""', '"I"', '"IV"', '"IX"', '"LVIII"', '"MCMXCIV"', '"MMXXIV"'],
+        "fuzz_spec": {"s": {"type": "str", "alphabet": "IVXLCDM", "max_len": 8}},
+    },
+    {
+        "name": "levenshtein_distance",
+        "target_function_name": "levenshtein_distance",
+        "signature": "levenshtein_distance(a: str, b: str) -> int",
+        "description": (
+            "Classic edit distance between two strings: minimum number of "
+            "single-character insertions, deletions, or substitutions to "
+            "transform a into b. Both arguments must be str."
+        ),
+        "difficulty": "hard",
+        "source_code": (
+            "def levenshtein_distance(a: str, b: str) -> int:\n"
+            "    if not isinstance(a, str) or not isinstance(b, str):\n"
+            "        raise TypeError('both arguments must be str')\n"
+            "    if a == b:\n"
+            "        return 0\n"
+            "    if not a:\n"
+            "        return len(b)\n"
+            "    if not b:\n"
+            "        return len(a)\n"
+            "    prev = list(range(len(b) + 1))\n"
+            "    for i, ca in enumerate(a, 1):\n"
+            "        cur = [i] + [0] * len(b)\n"
+            "        for j, cb in enumerate(b, 1):\n"
+            "            ins = cur[j-1] + 1\n"
+            "            dele = prev[j] + 1\n"
+            "            sub = prev[j-1] + (ca != cb)\n"
+            "            cur[j] = min(ins, dele, sub)\n"
+            "        prev = cur\n"
+            "    return prev[-1]\n"
+        ),
+        "edge_cases": [
+            '("", "")', '("a", "")', '("", "a")', '("kitten", "sitting")',
+            '("flaw", "lawn")', '("abc", "abc")',
+        ],
+        "fuzz_spec": {
+            "a": {"type": "str", "alphabet": "abc", "max_len": 6},
+            "b": {"type": "str", "alphabet": "abc", "max_len": 6},
+        },
+    },
+    {
+        "name": "flatten_list",
+        "target_function_name": "flatten_list",
+        "signature": "flatten_list(xs: list) -> list",
+        "description": (
+            "Recursively flatten a nested list of arbitrary depth. Tuples "
+            "are also flattened; non-list/tuple atoms (ints, strs, ...) "
+            "pass through unchanged."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def flatten_list(xs):\n"
+            "    if not isinstance(xs, (list, tuple)):\n"
+            "        raise TypeError('input must be list or tuple')\n"
+            "    out = []\n"
+            "    stack = list(xs)\n"
+            "    # iterative DFS to avoid recursion limits on adversarial input\n"
+            "    rev = []\n"
+            "    rev.extend(reversed(stack))\n"
+            "    while rev:\n"
+            "        x = rev.pop()\n"
+            "        if isinstance(x, (list, tuple)):\n"
+            "            for y in reversed(x):\n"
+            "                rev.append(y)\n"
+            "        else:\n"
+            "            out.append(x)\n"
+            "    return out\n"
+        ),
+        "edge_cases": [
+            "[]", "[1]", "[[1, 2], [3, 4]]",
+            "[1, [2, [3, [4, [5]]]]]", "[[], [], 1]",
+        ],
+        "fuzz_spec": {
+            "xs": {
+                "type": "list",
+                "elem": {"type": "int", "min": -10, "max": 10},
+                "max_len": 6,
+            }
+        },
+    },
+    {
+        "name": "merge_sorted",
+        "target_function_name": "merge_sorted",
+        "signature": "merge_sorted(a: list[int], b: list[int]) -> list[int]",
+        "description": (
+            "Merge two pre-sorted lists of ints into a single sorted list. "
+            "Both arguments must be lists; elements must be ints (bools "
+            "rejected). The classic merge step of merge-sort."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def merge_sorted(a, b):\n"
+            "    if not isinstance(a, list) or not isinstance(b, list):\n"
+            "        raise TypeError('both arguments must be list')\n"
+            "    for x in (*a, *b):\n"
+            "        if not isinstance(x, int) or isinstance(x, bool):\n"
+            "            raise TypeError('elements must be int')\n"
+            "    out = []\n"
+            "    i = j = 0\n"
+            "    while i < len(a) and j < len(b):\n"
+            "        if a[i] <= b[j]:\n"
+            "            out.append(a[i]); i += 1\n"
+            "        else:\n"
+            "            out.append(b[j]); j += 1\n"
+            "    out.extend(a[i:])\n"
+            "    out.extend(b[j:])\n"
+            "    return out\n"
+        ),
+        "edge_cases": [
+            "([], [])", "([1, 2, 3], [])", "([], [1, 2, 3])",
+            "([1, 3, 5], [2, 4, 6])", "([1, 1], [1, 1])",
+        ],
+        "fuzz_spec": {
+            "a": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 5},
+            "b": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 5},
+        },
+    },
+    {
+        "name": "run_length_encode",
+        "target_function_name": "run_length_encode",
+        "signature": "run_length_encode(s: str) -> list[tuple[str, int]]",
+        "description": (
+            "Run-length encoding: returns a list of (character, count) "
+            "tuples for each run of identical characters in s. Empty "
+            "input yields an empty list."
+        ),
+        "difficulty": "easy",
+        "source_code": (
+            "def run_length_encode(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('input must be str')\n"
+            "    if not s:\n"
+            "        return []\n"
+            "    out = []\n"
+            "    cur = s[0]\n"
+            "    n = 1\n"
+            "    for ch in s[1:]:\n"
+            "        if ch == cur:\n"
+            "            n += 1\n"
+            "        else:\n"
+            "            out.append((cur, n))\n"
+            "            cur = ch\n"
+            "            n = 1\n"
+            "    out.append((cur, n))\n"
+            "    return out\n"
+        ),
+        "edge_cases": ['""', '"a"', '"aa"', '"abc"', '"aaabbbccc"', '"aaaaaaaaaa"'],
+        "fuzz_spec": {"s": {"type": "str", "alphabet": "ab", "max_len": 12}},
+    },
+    {
+        "name": "binary_search",
+        "target_function_name": "binary_search",
+        "signature": "binary_search(arr: list[int], target: int) -> int",
+        "description": (
+            "Return the index of target in the sorted ascending list arr, "
+            "or -1 if not present. arr must be a list of ints; target "
+            "must be int. The list is assumed sorted."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def binary_search(arr, target):\n"
+            "    if not isinstance(arr, list):\n"
+            "        raise TypeError('arr must be list')\n"
+            "    if not isinstance(target, int) or isinstance(target, bool):\n"
+            "        raise TypeError('target must be int')\n"
+            "    lo, hi = 0, len(arr) - 1\n"
+            "    while lo <= hi:\n"
+            "        mid = (lo + hi) // 2\n"
+            "        v = arr[mid]\n"
+            "        if v == target:\n"
+            "            return mid\n"
+            "        if v < target:\n"
+            "            lo = mid + 1\n"
+            "        else:\n"
+            "            hi = mid - 1\n"
+            "    return -1\n"
+        ),
+        "edge_cases": [
+            "([], 3)", "([1], 1)", "([1], 2)",
+            "([1, 2, 3, 4, 5], 3)", "([1, 2, 3, 4, 5], 0)",
+            "([1, 2, 3, 4, 5], 6)",
+        ],
+        "fuzz_spec": {
+            "arr": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 8},
+            "target": {"type": "int", "min": -20, "max": 20},
+        },
+    },
+]
+
+
+def _builtin_to_row(name: str) -> Dict[str, Any]:
+    spec = BLACK_BOX_FUNCTIONS[name]
+    src_meta = _BUILTIN_SOURCE[name]
+    return {
+        "name": name,
+        "target_function_name": src_meta["target_function_name"],
+        "signature": spec.signature,
+        "description": spec.description,
+        "difficulty": spec.difficulty,
+        "source_code": src_meta["source_code"],
+        "edge_cases_json": json.dumps(src_meta["edge_cases"]),
+        "fuzz_spec_json": json.dumps(src_meta["fuzz_spec"]),
+    }
+
+
+def _new_task_to_row(meta: Dict[str, Any]) -> Dict[str, Any]:
+    return {
+        "name": meta["name"],
+        "target_function_name": meta["target_function_name"],
+        "signature": meta["signature"],
+        "description": meta["description"],
+        "difficulty": meta["difficulty"],
+        "source_code": meta["source_code"],
+        "edge_cases_json": json.dumps(meta["edge_cases"]),
+        "fuzz_spec_json": json.dumps(meta["fuzz_spec"]),
+    }
+
+
+def build_rows() -> List[Dict[str, Any]]:
+    rows: List[Dict[str, Any]] = []
+    for name in BLACK_BOX_FUNCTIONS:
+        rows.append(_builtin_to_row(name))
+    for meta in _NEW_TASK_ROWS:
+        rows.append(_new_task_to_row(meta))
+    return rows
+
+
+def push_to_hub(rows: List[Dict[str, Any]], dataset_id: str, *, private: bool = False) -> str:
+    """Push the row list to ``dataset_id`` (overwriting any prior contents).
+    Returns the hub URL.
+    """
+    from datasets import Dataset
+    from huggingface_hub import HfApi
+
+    api = HfApi()
+    api.create_repo(
+        repo_id=dataset_id,
+        repo_type="dataset",
+        exist_ok=True,
+        private=private,
+    )
+
+    ds = Dataset.from_list(rows)
+    log.info("pushing %d row(s) to %s", len(rows), dataset_id)
+    ds.push_to_hub(dataset_id, split="train", private=private)
+    return f"https://huggingface.co/datasets/{dataset_id}"
+
+
+def main(argv: Optional[List[str]] = None) -> int:
+    p = argparse.ArgumentParser(description="Bootstrap the OpenSleuth Hub task catalog.")
+    p.add_argument("--dataset-id", default=DATASET_ID)
+    p.add_argument("--dry-run", action="store_true", help="Print row count, don't push.")
+    p.add_argument("--private", action="store_true", help="Create as private dataset.")
+    args = p.parse_args(argv)
+
+    rows = build_rows()
+    log.info("built %d row(s) (%d builtin + %d new)",
+             len(rows), len(BLACK_BOX_FUNCTIONS), len(_NEW_TASK_ROWS))
+    for r in rows:
+        log.info("  %-22s difficulty=%-6s edges=%-2d",
+                 r["name"], r["difficulty"], len(json.loads(r["edge_cases_json"])))

+    if args.dry_run:
+        log.info("--dry-run: not pushing")
+        return 0
+
+    url = push_to_hub(rows, args.dataset_id, private=args.private)
+    log.info("dataset live at %s", url)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
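The `*_json` string columns are the only non-obvious part of the row schema: list/dict metadata is serialised on push and `json.loads`-ed back by the consumer. A standalone sketch of that round trip (`task_meta_to_row` is an illustrative copy mirroring the script's `_new_task_to_row`, not an import from it):

```python
import json

def task_meta_to_row(meta):
    """Illustrative copy of the script's row shape: the edge_cases list
    and fuzz_spec dict become JSON-string columns on the Hub dataset."""
    return {
        "name": meta["name"],
        "edge_cases_json": json.dumps(meta["edge_cases"]),
        "fuzz_spec_json": json.dumps(meta["fuzz_spec"]),
    }

row = task_meta_to_row({
    "name": "binary_search",
    "edge_cases": ["([], 3)", "([1], 1)"],
    "fuzz_spec": {"target": {"type": "int", "min": -20, "max": 20}},
})

# The consumer side (task_catalog) recovers the structures with json.loads.
assert json.loads(row["edge_cases_json"]) == ["([], 3)", "([1], 1)"]
```

Keeping the columns as strings sidesteps Arrow schema unification across rows whose fuzz specs have different shapes.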
opensleuth_env/task_catalog.py ADDED
@@ -0,0 +1,469 @@
+"""TaskCatalog: resolve a target function from one of three sources.
+
+OpenSleuth Level 2 makes the env open-ended. Where v0.3 only knew about the
+9 hand-written ``BLACK_BOX_FUNCTIONS``, the catalog accepts targets from:
+
+1. **Caller-supplied** -- per-/reset payload, the most specific source.
+   The caller passes ``target_code`` + ``target_function_name`` (and
+   optionally ``edge_cases`` / ``fuzz_spec``) and we compile the source
+   in the same hardened sandbox the verifier uses for submissions.
+
+2. **Hub dataset** -- ``anugrah55/opensleuth-tasks`` on Hugging Face Hub.
+   Each row carries ``{name, signature, description, difficulty,
+   source_code, edge_cases_json, fuzz_spec_json}``. Loaded lazily on
+   first reset and cached in-process.
+
+3. **Builtin registry** -- the original 9 ``BLACK_BOX_FUNCTIONS``. Kept
+   as the safety-net so the in-flight trainer keeps working unchanged.
+
+Resolution priority: caller-supplied wins, then Hub by name, then builtin.
+This makes "trainer asks for fibonacci" still resolve to the builtin
+fibonacci even when the Hub copy exists, *unless* the caller explicitly
+overrides via ``target_code``.
+
+Sandbox: caller-supplied / Hub source code is executed via the same
+``_make_safe_globals`` whitelist as agent submissions (no ``__import__``,
+``open``, ``eval``, ...). On top we statically reject any source that
+imports ``opensleuth_*`` to prevent oracle-cheesing.
+"""
+
+from __future__ import annotations
+
+import ast
+import inspect
+import json
+import logging
+import threading
+from typing import Any, Callable, Dict, List, Optional
+
+from .auto_fuzzer import auto_fuzz, make_fuzzer
+from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
+from .verifier import _make_safe_globals  # reuse the hardened sandbox
+
+log = logging.getLogger("opensleuth.task_catalog")
+
+HUB_DATASET_ID = "anugrah55/opensleuth-tasks"
+
+
+class TaskResolutionError(ValueError):
+    """Raised when a /reset request can't be turned into a FunctionSpec."""
+
+
+# ---------------------------------------------------------------------------
+# Caller / Hub source-code compilation
+# ---------------------------------------------------------------------------
+
+
+_FORBIDDEN_PREFIXES = ("opensleuth", "opensleuth_env")
+
+
+def _statically_reject_oracle_import(code: str) -> Optional[str]:
+    """Return an error string if the source statically imports the env's own
+    oracle module (which would let the agent / Hub author cheese the
+    verifier). The hardened sandbox already blocks ``__import__``, but we
+    fail fast and surface a clear error.
+    """
+    try:
+        tree = ast.parse(code)
+    except SyntaxError as e:
+        return f"target_code is not valid Python: {e}"
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Import):
+            for alias in node.names:
+                if any(alias.name.startswith(p) for p in _FORBIDDEN_PREFIXES):
+                    return (
+                        f"target_code is not allowed to import {alias.name!r} "
+                        "(oracle import)."
+                    )
+        elif isinstance(node, ast.ImportFrom):
+            mod = node.module or ""
+            if any(mod.startswith(p) for p in _FORBIDDEN_PREFIXES):
+                return (
+                    f"target_code is not allowed to import from {mod!r} "
+                    "(oracle import)."
+                )
+    return None
+
+
+def _compile_target_in_sandbox(code: str, function_name: str) -> Callable[..., Any]:
+    """Compile ``code`` in the same restricted globals the verifier uses for
+    agent submissions, then return the named callable. Raises
+    ``TaskResolutionError`` on any problem so /reset can return a clean 400.
+    """
+    err = _statically_reject_oracle_import(code)
+    if err:
+        raise TaskResolutionError(err)
+    safe_globals = _make_safe_globals()
+    local_scope: Dict[str, Any] = {}
+    try:
+        exec(code, safe_globals, local_scope)
+    except Exception as e:  # noqa: BLE001
+        raise TaskResolutionError(
+            f"target_code raised at definition time: {type(e).__name__}: {e}"
+        ) from e
+    fn = local_scope.get(function_name) or safe_globals.get(function_name)
+    if not callable(fn):
+        raise TaskResolutionError(
+            f"target_code does not define a callable named {function_name!r}."
+        )
+    return fn
+
+
+def _arity_of(fn: Callable[..., Any]) -> int:
+    """Number of positional / positional-or-keyword params on ``fn``."""
+    try:
+        sig = inspect.signature(fn)
+    except (TypeError, ValueError):
+        return 1
+    n = 0
+    for p in sig.parameters.values():
+        if p.kind in (
+            inspect.Parameter.POSITIONAL_ONLY,
+            inspect.Parameter.POSITIONAL_OR_KEYWORD,
+        ):
+            n += 1
+    return max(n, 1)
+
+
+def _signature_string(fn: Callable[..., Any], name: str) -> str:
+    try:
+        sig = inspect.signature(fn)
+        return f"{name}{sig}"
+    except (TypeError, ValueError):
+        return f"{name}(...)"
+
+
+def _description_of(fn: Callable[..., Any]) -> str:
+    return inspect.getdoc(fn) or ""
+
+
+def _parse_edge_cases(edge_cases: Optional[List[Any]]) -> List[Any]:
+    """Edge cases arrive as a list of strings (Python literal reprs) when
+    coming from the API or from the Hub's ``edge_cases_json`` column. Each
+    string is parsed via ``ast.literal_eval``. Already-parsed values
+    (e.g. ints from the bootstrap script) are passed through unchanged.
+    """
+    if not edge_cases:
+        return []
+    parsed: List[Any] = []
+    for raw in edge_cases:
+        if isinstance(raw, str):
+            try:
+                parsed.append(ast.literal_eval(raw))
+            except (ValueError, SyntaxError) as e:
+                raise TaskResolutionError(
+                    f"edge_cases entry {raw!r} is not a Python literal: {e}"
+                ) from e
+        else:
+            parsed.append(raw)
+    return parsed
+
+
+def _flatten_unary_edges(arity: int, edges: List[Any]) -> List[Any]:
+    """For unary fns we accept either ``[5, 10]`` or ``[(5,), (10,)]`` and
+    normalise to flat values; for multi-arg fns we require tuples and pass
+    them through."""
+    if arity == 1:
+        out = []
+        for e in edges:
+            if isinstance(e, tuple) and len(e) == 1:
+                out.append(e[0])
+            else:
+                out.append(e)
+        return out
+    out = []
+    for e in edges:
+        if not isinstance(e, tuple):
+            raise TaskResolutionError(
+                f"edge_cases for a {arity}-arg target must be tuples, "
+                f"got {type(e).__name__}: {e!r}"
+            )
+        out.append(e)
+    return out
+
+
+def _spec_from_callable(
+    name: str,
+    fn: Callable[..., Any],
+    *,
+    description: Optional[str] = None,
+    signature: Optional[str] = None,
+    difficulty: str = "medium",
+    edge_cases: Optional[List[Any]] = None,
+    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+    source: str = "user",
+) -> FunctionSpec:
+    """Build a FunctionSpec from a Python callable + optional metadata.
+
+    Wraps ``auto_fuzz`` for the fuzzer. The arity is auto-detected from
+    ``inspect.signature`` so ``unpack_args`` is set correctly: unary fns
+    behave like the existing builtins (single-arg call), N-arg fns flow
+    through the tuple-unpacking path in env / verifier.
+    """
+    arity = _arity_of(fn)
+    unpack = arity > 1
+
+    parsed_edges = _flatten_unary_edges(arity, _parse_edge_cases(edge_cases))
+
+    if unpack:
+        # Catalog-level adapter: keep the public spec.fn one-arg-style for
+        # the *unary* path so existing call sites work, but for multi-arg
+        # the env/verifier respect ``unpack_args`` and call ``fn(*args)``.
+        # We still store the original here -- env._handle_probe and
+        # verify_submission do the unpacking.
+        ref_fn: Callable[..., Any] = fn
+
+        def _fuzzer(rng, n):
+            return auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
+
+    else:
+        ref_fn = fn
+
+        def _unary_fuzzer(rng, n):
+            tuples = auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
+            return [t[0] if isinstance(t, tuple) and len(t) == 1 else t for t in tuples]
+
+        _fuzzer = _unary_fuzzer
+
+    return FunctionSpec(
+        name=name,
+        fn=ref_fn,
+        signature=signature or _signature_string(fn, name),
+        description=description or _description_of(fn),
+        fuzzer=_fuzzer,
+        difficulty=difficulty,
+        edge_cases=parsed_edges,
+        unpack_args=unpack,
+        source=source,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Hub loader
+# ---------------------------------------------------------------------------
+
+
+class _HubCache:
+    """Lazily loads the Hub dataset into ``{name: FunctionSpec}``. Thread-
+    safe initialisation; subsequent reads are lock-free."""
+
+    def __init__(self, dataset_id: str):
+        self.dataset_id = dataset_id
+        self._lock = threading.Lock()
+        self._loaded: bool = False
+        self._specs: Dict[str, FunctionSpec] = {}
+        self._raw_rows: List[Dict[str, Any]] = []
+        self._load_error: Optional[str] = None
+
+    @property
+    def loaded(self) -> bool:
+        return self._loaded
+
+    @property
+    def load_error(self) -> Optional[str]:
+        return self._load_error
+
+    def _row_to_spec(self, row: Dict[str, Any]) -> Optional[FunctionSpec]:
+        name = row.get("name")
+        code = row.get("source_code")
+        if not name or not code:
+            return None
+        fn_name = row.get("target_function_name") or name
+        try:
+            fn = _compile_target_in_sandbox(code, fn_name)
+        except TaskResolutionError as e:
+            log.warning("hub task %r failed to compile: %s", name, e)
+            return None
+        edge_cases_raw = row.get("edge_cases_json") or "[]"
+        fuzz_spec_raw = row.get("fuzz_spec_json") or "null"
+        try:
+            edge_cases = json.loads(edge_cases_raw) if isinstance(edge_cases_raw, str) else edge_cases_raw
+        except json.JSONDecodeError:
+            edge_cases = []
+        try:
+            fuzz_spec = json.loads(fuzz_spec_raw) if isinstance(fuzz_spec_raw, str) else fuzz_spec_raw
+        except json.JSONDecodeError:
+            fuzz_spec = None
+        try:
+            return _spec_from_callable(
+                name=name,
+                fn=fn,
+                description=row.get("description") or _description_of(fn),
+                signature=row.get("signature") or _signature_string(fn, name),
+                difficulty=row.get("difficulty") or "medium",
+                edge_cases=edge_cases,
+                fuzz_spec=fuzz_spec,
+                source="hub",
+            )
+        except TaskResolutionError as e:
+            log.warning("hub task %r could not be specced: %s", name, e)
+            return None
+
+    def ensure_loaded(self) -> None:
+        if self._loaded:
+            return
+        with self._lock:
+            if self._loaded:
+                return
+            try:
+                from datasets import load_dataset  # type: ignore
+
+                ds = load_dataset(self.dataset_id, split="train")
+                rows = list(ds)
+                specs: Dict[str, FunctionSpec] = {}
+                for row in rows:
+                    spec = self._row_to_spec(row)
+                    if spec is not None:
+                        specs[spec.name] = spec
+                self._specs = specs
+                self._raw_rows = rows
+                log.info(
+                    "loaded %d task(s) from %s (%d row(s) total)",
+                    len(specs),
+                    self.dataset_id,
+                    len(rows),
+                )
+            except Exception as e:  # noqa: BLE001
+                # Hub unreachable / not yet bootstrapped / offline. We swallow
+                # the error so the env keeps working from the builtin
+                # registry alone -- this is what lets the trainer keep
+                # running even if the Hub goes down mid-rollout.
+                self._load_error = f"{type(e).__name__}: {e}"
+                log.warning("hub dataset %s unavailable: %s", self.dataset_id, self._load_error)
+            finally:
+                self._loaded = True
+
+    def specs(self) -> Dict[str, FunctionSpec]:
+        self.ensure_loaded()
+        return self._specs
+
+    def rows(self) -> List[Dict[str, Any]]:
+        self.ensure_loaded()
+        return self._raw_rows
+
+
+# ---------------------------------------------------------------------------
+# TaskCatalog
+# ---------------------------------------------------------------------------
+
+
+class TaskCatalog:
+    """Resolves /reset payloads to FunctionSpecs from caller / Hub / builtin."""
+
+    def __init__(
+        self,
+        hub_dataset_id: str = HUB_DATASET_ID,
+        *,
+        enable_hub: bool = True,
+    ) -> None:
+        self.hub_dataset_id = hub_dataset_id
+        self.enable_hub = enable_hub
+        self._hub = _HubCache(hub_dataset_id) if enable_hub else None
+
+    # --- Resolution --------------------------------------------------------
+
+    def resolve(
+        self,
+        target_name: Optional[str] = None,
+        target_code: Optional[str] = None,
+        target_function_name: Optional[str] = None,
+        edge_cases: Optional[List[Any]] = None,
+        fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+    ) -> FunctionSpec:
+        # 1. Caller-supplied: highest priority.
+        if target_code is not None:
+            if not target_function_name:
+                raise TaskResolutionError(
+                    "target_code requires target_function_name to identify "
+                    "which callable in the source to use."
+                )
+            fn = _compile_target_in_sandbox(target_code, target_function_name)
381
+ return _spec_from_callable(
382
+ name=target_function_name,
383
+ fn=fn,
384
+ edge_cases=edge_cases,
385
+ fuzz_spec=fuzz_spec,
386
+ source="user",
387
+ )
388
+
389
+ # 2. & 3. Hub-by-name / builtin-by-name. Builtin wins for legacy names
390
+ # (so the trainer's "fibonacci" always means the in-process oracle,
391
+ # never a possibly-modified Hub copy).
392
+ if not target_name:
393
+ raise TaskResolutionError(
394
+ "Either target_name or (target_code + target_function_name) must be set."
395
+ )
396
+
397
+ if target_name in BLACK_BOX_FUNCTIONS:
398
+ return BLACK_BOX_FUNCTIONS[target_name]
399
+
400
+ if self._hub is not None:
401
+ hub_specs = self._hub.specs()
402
+ if target_name in hub_specs:
403
+ return hub_specs[target_name]
404
+
405
+ available = self.list_known_names()
406
+ raise TaskResolutionError(
407
+ f"Unknown target function: {target_name!r}. Available: {sorted(available)[:25]}"
408
+ )
409
+
410
+ # --- Listing -----------------------------------------------------------
411
+
412
+ def list_known_names(self) -> List[str]:
413
+ names = set(BLACK_BOX_FUNCTIONS)
414
+ if self._hub is not None:
415
+ try:
416
+ names.update(self._hub.specs())
417
+ except Exception: # noqa: BLE001 -- best effort
418
+ pass
419
+ return sorted(names)
420
+
421
+ def list_builtin(self) -> List[Dict[str, Any]]:
422
+ return [
423
+ {
424
+ "name": s.name,
425
+ "signature": s.signature,
426
+ "description": s.description,
427
+ "difficulty": s.difficulty,
428
+ "edge_case_count": len(s.edge_cases or []),
429
+ "source": "builtin",
430
+ }
431
+ for s in BLACK_BOX_FUNCTIONS.values()
432
+ ]
433
+
434
+ def list_hub(self) -> List[Dict[str, Any]]:
435
+ if self._hub is None:
436
+ return []
437
+ out = []
438
+ for s in self._hub.specs().values():
439
+ # Don't shadow builtins in the Hub list (avoids surprising the
440
+ # caller with a "fibonacci@hub" entry that's never used).
441
+ if s.name in BLACK_BOX_FUNCTIONS:
442
+ continue
443
+ out.append(
444
+ {
445
+ "name": s.name,
446
+ "signature": s.signature,
447
+ "description": s.description,
448
+ "difficulty": s.difficulty,
449
+ "edge_case_count": len(s.edge_cases or []),
450
+ "source": "hub",
451
+ }
452
+ )
453
+ return out
454
+
455
+ def list_all(self) -> List[Dict[str, Any]]:
456
+ return self.list_builtin() + self.list_hub()
457
+
458
+ # --- Diagnostics -------------------------------------------------------
459
+
460
+ def hub_status(self) -> Dict[str, Any]:
461
+ if self._hub is None:
462
+ return {"enabled": False}
463
+ return {
464
+ "enabled": True,
465
+ "dataset_id": self.hub_dataset_id,
466
+ "loaded": self._hub.loaded,
467
+ "task_count": len(self._hub.specs()) if self._hub.loaded else None,
468
+ "error": self._hub.load_error,
469
+ }
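
The three-tier resolution order above (caller-supplied source first, then the builtin registry, then the Hub catalog) can be sketched standalone. This is a simplified mock, not the real module: `FunctionSpec`, the restricted sandbox, and Hub loading are replaced with plain dicts and a bare `exec`, and the function names here are illustrative only.

```python
# Standalone sketch of the TaskCatalog resolution order (simplified mock).
from typing import Any, Callable, Dict, Optional

# Builtins must shadow Hub rows of the same name, so legacy task names stay stable.
BUILTIN: Dict[str, Callable[..., Any]] = {"fibonacci": lambda n: "builtin"}
HUB: Dict[str, Callable[..., Any]] = {"fibonacci": lambda n: "hub", "rot13": lambda s: "hub"}


def resolve(
    target_name: Optional[str] = None,
    target_code: Optional[str] = None,
    target_function_name: Optional[str] = None,
) -> Callable[..., Any]:
    # 1. Caller-supplied source has the highest priority.
    if target_code is not None:
        if not target_function_name:
            raise ValueError("target_code requires target_function_name")
        ns: Dict[str, Any] = {}
        exec(target_code, ns)  # the real env compiles inside a restricted sandbox
        return ns[target_function_name]
    if not target_name:
        raise ValueError("target_name or target_code required")
    # 2. Builtin registry wins for legacy names.
    if target_name in BUILTIN:
        return BUILTIN[target_name]
    # 3. Finally, fall back to the Hub-backed catalog.
    if target_name in HUB:
        return HUB[target_name]
    raise ValueError(f"Unknown target: {target_name!r}")
```

A name present in both sources resolves to the builtin copy, which is what keeps an in-flight trainer immune to Hub edits.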
opensleuth_env/verifier.py CHANGED
@@ -180,23 +180,36 @@ class _CallTimeout(Exception):
     pass
 
 
-def _call_with_timeout(fn: Callable, arg: Any, timeout_s: float):
+def _call_with_timeout(fn: Callable, arg: Any, timeout_s: float, *, unpack: bool = False):
     def _handler(signum, frame):  # noqa: ARG001
         raise _CallTimeout()
 
     old = signal.signal(signal.SIGALRM, _handler)
     signal.setitimer(signal.ITIMER_REAL, timeout_s)
     try:
+        if unpack:
+            if not isinstance(arg, tuple):
+                # Defensive: a multi-param target should always receive a
+                # tuple, but if the agent's probe input_repr happens to
+                # parse to a single value, treat it as a 1-tuple so we get
+                # a clear TypeError rather than a confusing call shape.
+                arg = (arg,)
+            return fn(*arg)
         return fn(arg)
     finally:
         signal.setitimer(signal.ITIMER_REAL, 0)
         signal.signal(signal.SIGALRM, old)
 
 
-def _safe_call(fn: Callable, arg: Any, timeout_s: float):
-    """Returns (kind, value): kind in {'val', 'err', 'timeout'}."""
+def _safe_call(fn: Callable, arg: Any, timeout_s: float, *, unpack: bool = False):
+    """Returns (kind, value): kind in {'val', 'err', 'timeout'}.
+
+    When ``unpack`` is True the input ``arg`` is expected to be an args
+    tuple and ``fn`` is invoked as ``fn(*arg)``. This is how multi-parameter
+    auto-fuzzer-driven targets are scored.
+    """
     try:
-        return ("val", _call_with_timeout(fn, arg, timeout_s))
+        return ("val", _call_with_timeout(fn, arg, timeout_s, unpack=unpack))
     except _CallTimeout:
         return ("timeout", f"timed out after {timeout_s}s")
     except Exception as e:  # noqa: BLE001
@@ -270,13 +283,14 @@ def _looks_like_reference_import(code: str) -> bool:
 
 def verify_submission(
     submitted_code: str,
-    target_function: Callable[[Any], Any],
+    target_function: Callable[..., Any],
     fuzz_inputs: List[Any],
     *,
     target_name: Optional[str] = None,
     define_timeout_s: float = 5.0,
     call_timeout_s: float = 1.0,
     edge_inputs: Optional[List[Any]] = None,
+    unpack_args: bool = False,
 ) -> VerificationResult:
     """Score ``submitted_code`` against ``target_function`` over the supplied
     ``fuzz_inputs`` (random regime) and ``edge_inputs`` (must-pass regime).
@@ -324,8 +338,8 @@ def verify_submission(
 
     def _score(inputs: List[Any], category: str) -> None:
         for inp in inputs:
-            ref = _safe_call(target_function, inp, call_timeout_s)
-            sub = _safe_call(submitted_fn, inp, call_timeout_s)
+            ref = _safe_call(target_function, inp, call_timeout_s, unpack=unpack_args)
+            sub = _safe_call(submitted_fn, inp, call_timeout_s, unpack=unpack_args)
             sub_results.append(sub)
             ref_results.append(ref)
             if _outputs_equivalent(ref, sub):
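
The `unpack` calling convention in the verifier diff can be demonstrated in isolation. This is a minimal sketch, assuming only what the diff shows: timeouts and the real `_call_with_timeout` plumbing are omitted, and `safe_call`/`clamp` here are illustrative stand-ins.

```python
# Minimal sketch of the unpack_args calling convention (no timeout handling).
from typing import Any, Callable, Tuple


def safe_call(fn: Callable[..., Any], arg: Any, *, unpack: bool = False) -> Tuple[str, Any]:
    """Return ('val', result) or ('err', message), mirroring _safe_call's shape."""
    try:
        if unpack:
            if not isinstance(arg, tuple):
                arg = (arg,)  # defensive 1-tuple promotion, as in the diff
            return ("val", fn(*arg))  # multi-parameter target: fn(*args_tuple)
        return ("val", fn(arg))  # legacy single-argument target: fn(arg)
    except Exception as e:  # noqa: BLE001
        return ("err", f"{type(e).__name__}: {e}")


def clamp(x: int, lo: int, hi: int) -> int:
    """Example multi-parameter target; fuzz inputs for it are 3-tuples."""
    return max(lo, min(hi, x))
```

A wrong-arity tuple still produces a clean `('err', 'TypeError: …')` record rather than crashing the scoring loop, which is the behavior the defensive promotion in the diff is after.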
requirements.txt CHANGED
@@ -1,3 +1,8 @@
 fastapi==0.115.6
 uvicorn[standard]==0.32.1
 pydantic==2.10.3
+# Level 2: Hub-driven task catalog. We swallow load failures at runtime so
+# the env still functions if Hub is offline, but the dependency is required
+# for Hub-backed tasks to be discoverable.
+datasets>=3.0.0
+huggingface_hub>=0.25.0
server.py CHANGED
@@ -15,18 +15,23 @@ from opensleuth_env import (
     StepRequest,
     StepResponse,
     SubmitAction,
+    TaskCatalog,
 )
 
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
 log = logging.getLogger("opensleuth.server")
 
-app = FastAPI(title="OpenSleuth Env", version="0.3.0")
+app = FastAPI(title="OpenSleuth Env", version="0.4.0")
 env = OpenSleuthEnv()
 
 
 @app.get("/health")
 def health():
-    return {"status": "ok", "episodes_tracked": len(env._states)}  # noqa: SLF001
+    return {
+        "status": "ok",
+        "episodes_tracked": len(env._states),  # noqa: SLF001
+        "hub": env.catalog.hub_status(),
+    }
 
 
 @app.get("/functions")
@@ -36,6 +41,10 @@ def list_functions(
         description="Optional filter: easy / medium / hard. Used by the trainer for curriculum scheduling.",
     ),
 ):
+    # NOTE -- backwards compatibility: this endpoint deliberately keeps the
+    # exact v0.3 shape (just the 9 builtin functions, with the original
+    # field set), because the in-flight trainer queries it. The new "source"
+    # field is additive. Open-ended / Hub tasks are exposed via /tasks.
     items = []
     for s in BLACK_BOX_FUNCTIONS.values():
         if difficulty is not None and getattr(s, "difficulty", None) != difficulty:
@@ -47,15 +56,65 @@
                 "description": s.description,
                 "difficulty": getattr(s, "difficulty", None),
                 "edge_case_count": len(getattr(s, "edge_cases", []) or []),
+                "source": "builtin",
             }
         )
     return {"functions": items}
 
 
+@app.get("/tasks")
+def list_tasks(
+    source: str = Query(
+        "all",
+        description="Filter by source: 'builtin', 'hub', or 'all' (default).",
+    ),
+    difficulty: Optional[str] = Query(None, description="Optional curriculum filter."),
+):
+    src = source.lower()
+    if src == "builtin":
+        tasks = env.catalog.list_builtin()
+    elif src == "hub":
+        tasks = env.catalog.list_hub()
+    elif src == "all":
+        tasks = env.catalog.list_all()
+    else:
+        raise HTTPException(
+            status_code=400, detail="source must be one of: builtin, hub, all"
+        )
+    if difficulty is not None:
+        tasks = [t for t in tasks if t.get("difficulty") == difficulty]
+    return {
+        "tasks": tasks,
+        "count": len(tasks),
+        "hub": env.catalog.hub_status(),
+    }
+
+
 @app.post("/reset")
 def reset(req: ResetRequest):
+    # Validation: legacy callers pass only target_name; open-ended callers
+    # pass target_code + target_function_name. At least one of those paths
+    # must be populated.
+    if not req.target_name and not req.target_code:
+        raise HTTPException(
+            status_code=400,
+            detail="Either 'target_name' or ('target_code' + 'target_function_name') must be set.",
+        )
+    if req.target_code and not req.target_function_name:
+        raise HTTPException(
+            status_code=400,
+            detail="'target_function_name' is required when 'target_code' is provided.",
+        )
     try:
-        obs = env.reset(target_name=req.target_name, seed=req.seed, max_steps=req.max_steps)
+        obs = env.reset(
+            target_name=req.target_name,
+            seed=req.seed,
+            max_steps=req.max_steps,
+            target_code=req.target_code,
+            target_function_name=req.target_function_name,
+            edge_cases=req.edge_cases,
+            fuzz_spec=req.fuzz_spec,
+        )
     except ValueError as e:
         raise HTTPException(status_code=400, detail=str(e)) from e
     return obs
@@ -77,8 +136,6 @@ def get_state(episode_id: str):
     return state
 
 
-# Convenience: a flat /step that does reset+step in one call is occasionally
-# useful for shell-style debugging.
 @app.post("/probe_once")
 def probe_once(target_name: str, input_repr: str):
     obs = env.reset(target_name=target_name)
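
For reference, an open-ended `/reset` payload under the extended v0.4 shape can be built like this. The field names come from the diff; the base URL, the `double` target, and the exact `fuzz_spec` contents are hypothetical, and the HTTP call itself is only shown in a comment.

```python
# Sketch of an extended /reset payload for the open-ended (target_code) path.
import json

payload = {
    # Open-ended path: supply source + the callable's name instead of target_name.
    "target_code": "def double(x):\n    return 2 * x\n",
    "target_function_name": "double",
    "seed": 0,
    # Optional: must-pass inputs and a type-driven fuzz spec for the auto-fuzzer.
    "edge_cases": [0, -1],
    "fuzz_spec": {"x": {"type": "int", "low": -100, "high": 100}},
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:8000/reset", data=body)  -- not run here
```

Legacy callers are unchanged: a body with just `{"target_name": "fibonacci", "seed": 0}` still takes the v0.3 path.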