anugrah55 committed on
Commit 77e65fb · verified · 1 Parent(s): 536dda7

Level 2 open-ended env: auto-fuzzer + TaskCatalog + Hub-driven catalog + extended /reset


Adds opensleuth_env/auto_fuzzer.py (type-driven fuzz input generator), opensleuth_env/task_catalog.py (resolves caller-supplied target_code, Hub dataset rows, builtin registry in priority order), opensleuth_env/scripts/bootstrap_tasks_dataset.py (idempotent push of 9 builtin + 6 new tasks to anugrah55/opensleuth-tasks). Extends ResetRequest with target_code/target_function_name/edge_cases/fuzz_spec, adds GET /tasks endpoint, threads unpack_args through env+verifier for multi-arg targets. The 9 builtin functions and the v0.3 /reset shape are kept as the safety net so the in-flight trainer keeps working unchanged. 61 unit tests pass.

README.md CHANGED
@@ -19,9 +19,10 @@ function by probing it, then submit Python source that replicates it.
 
 | Method | Path | Body | Notes |
 |-------:|---------------|----------------------------------------|----------------------------------------|
-| GET | `/health` | — | Liveness probe. |
-| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of available black-boxes (with curriculum metadata). |
-| POST | `/reset` | `{"target_name": "fibonacci", "seed": 0}` | Starts a new episode, returns initial obs + `episode_id`. |
+| GET | `/health` | — | Liveness probe (also reports Hub-catalog status). |
+| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of the 9 builtin black-boxes (back-compat shape). |
+| GET | `/tasks` | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
+| POST | `/reset` | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied `target_code` wins over `target_name`. |
 | POST | `/step` | `{"episode_id": "...", "action": {...}}` | One agent action. |
 | GET | `/state/{eid}`| — | Inspect the live state of an episode (debug). |
@@ -56,6 +57,45 @@ Engineering and Shaping*, arXiv:2408.10215).
 (`+15`). The sandbox additionally **blocks** `__import__`, `open`,
 `eval`, `exec`, `compile`, etc.
 
+### Open-ended tasks (Level 2)
+
+The env resolves a target function from three sources, in priority order:
+
+1. **Caller-supplied** — `POST /reset` with `target_code` + `target_function_name`
+   (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
+   same hardened sandbox the verifier uses for agent submissions; static import
+   of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
+   arbitrary unseen task per rollout without any redeploy.
+
+2. **Hub dataset** — [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
+   Loaded lazily on first `/reset`, cached in-process. Each row has
+   `{name, target_function_name, signature, description, difficulty,
+   source_code, edge_cases_json, fuzz_spec_json}`.
+
+3. **Builtin registry** — the original 9 functions in `black_box.py` are kept
+   as the safety net so the in-flight trainer keeps working unchanged. Builtins
+   *win* by name over Hub copies, so `target_name="fibonacci"` always resolves
+   to the in-process oracle.
+
+#### Adding new tasks
+
+* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
+  `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
+  introspects `inspect.signature` + `typing.get_type_hints`); pass
+  `edge_cases` as a list of Python literal strings and `fuzz_spec` as a
+  per-parameter override map.
+
+* **Persistent**: append a row to the Hub dataset and the env will pick it
+  up on its next process start. The bootstrap script
+  (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent --
+  re-running it overwrites the dataset with the latest builtin + curated
+  rows.
+
+```bash
+# Push the curated 9 + 6 = 15-task seed catalog.
+PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
+```
+
 ### Backwards compatibility
 
 Existing trainer / eval clients only read `info["execution_reward"]`,
@@ -64,6 +104,13 @@ with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
 `matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
 `floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.
 
+`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
+"max_steps": 25}` works exactly as before. The four new optional fields
+(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
+additive. `/functions` returns the same shape as before (with one *additive*
+`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
+endpoint so older clients aren't surprised.
+
 ## Hardware
 
 CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.
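For a client, the Level 2 reset body is plain JSON. Here is a minimal, illustrative sketch of building one — the `clamp` target, its edge cases, and the fuzz bounds are invented for this example, not part of the shipped catalog:

```python
import json

# Hypothetical caller-supplied target: the trainer hands the env an
# unseen multi-arg function for this rollout only.
reset_payload = {
    "target_code": (
        "def clamp(x: int, lo: int, hi: int) -> int:\n"
        "    return max(lo, min(hi, x))\n"
    ),
    "target_function_name": "clamp",
    # Edge cases are Python literal strings; tuples for multi-arg targets.
    "edge_cases": ["(0, 0, 0)", "(-5, -1, 1)"],
    # Per-parameter override map consumed by the auto-fuzzer.
    "fuzz_spec": {"x": {"type": "int", "min": -50, "max": 50}},
    "seed": 0,
    "max_steps": 25,
}

body = json.dumps(reset_payload)  # POST this to /reset
```

Because `target_code` is set, it takes priority over any `target_name` the payload might also carry.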
opensleuth_env/__init__.py CHANGED
@@ -12,6 +12,8 @@ from .models import (
     StepRequest,
 )
 from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
+from .task_catalog import TaskCatalog, TaskResolutionError, HUB_DATASET_ID
+from .auto_fuzzer import auto_fuzz, make_fuzzer
 
 __all__ = [
     "OpenSleuthEnv",
@@ -25,4 +27,9 @@ __all__ = [
     "StepRequest",
     "BLACK_BOX_FUNCTIONS",
     "FunctionSpec",
+    "TaskCatalog",
+    "TaskResolutionError",
+    "HUB_DATASET_ID",
+    "auto_fuzz",
+    "make_fuzzer",
 ]
opensleuth_env/auto_fuzzer.py ADDED
@@ -0,0 +1,383 @@
+"""Generic, type-driven fuzz-input generator for OpenSleuth Level 2.
+
+Given a Python callable annotated with ``typing`` hints, ``auto_fuzz`` produces
+``n`` argument tuples that respect the signature so the verifier can score
+unannotated *arbitrary* targets without requiring a hand-written fuzzer the
+way the 9 builtin BLACK_BOX_FUNCTIONS do.
+
+Each per-type generator mixes a small set of "edge" values (``0``, ``-1``,
+``""``, ``None`` for ``Optional``, ...) with random values, weighted ~30/70.
+This biases the fuzz batch toward the boundaries that actually distinguish
+implementations while still covering the boring middle.
+
+A caller-supplied ``fuzz_spec: dict`` overrides the type-based generation on
+a per-parameter basis, e.g.::
+
+    auto_fuzz(my_fn, n=20, fuzz_spec={"n": {"type": "int", "min": 1, "max": 90}})
+
+Returned shape: ``List[tuple]`` -- one tuple per fuzz input, with one element
+per (positional) parameter of ``fn``. Even for unary ``fn`` we return tuples
+so the catalog wrapper has a single, uniform calling convention.
+"""
+
+from __future__ import annotations
+
+import inspect
+import random
+import string
+import typing
+from typing import Any, Callable, Dict, List, Optional, Tuple, Union, get_args, get_origin
+
+
+# Probability that a per-type generator emits an "edge" value (0, "", None,
+# ...) instead of a random sample. Kept small enough that the boring middle
+# still gets coverage but high enough that the edge cases reliably appear.
+EDGE_PROB = 0.30
+
+
+# Per-type edge pools. These are used by the ``_g_*`` helpers below.
+_INT_EDGES = (0, 1, -1, 2, -2, 10, -10, 100, -100)
+_FLOAT_EDGES = (0.0, 1.0, -1.0, 0.5, -0.5, 1e-9, -1e-9, 100.0)
+_STR_EDGES = ("", "a", "ab", "Hello", " ", "0", "abc def")
+_BYTES_EDGES = (b"", b"a", b"ab", b"\x00", b"abc")
+
+
+# ---------------------------------------------------------------------------
+# Per-type generators (do not assume any param-name dispatch).
+# ---------------------------------------------------------------------------
+
+
+def _maybe_edge(rng: random.Random, edges: tuple, random_fn: Callable[[], Any]) -> Any:
+    if edges and rng.random() < EDGE_PROB:
+        return rng.choice(edges)
+    return random_fn()
+
+
+def _g_int(rng: random.Random, *, lo: int = -100, hi: int = 100) -> int:
+    # Filter the edge pool by [lo, hi] so a caller-supplied fuzz_spec
+    # ``{"type": "int", "min": 1, "max": 5}`` never emits ``-100``.
+    edges = tuple(v for v in _INT_EDGES if lo <= v <= hi) or (lo,)
+    return _maybe_edge(rng, edges, lambda: rng.randint(lo, hi))
+
+
+def _g_float(rng: random.Random, *, lo: float = -100.0, hi: float = 100.0) -> float:
+    edges = tuple(v for v in _FLOAT_EDGES if lo <= v <= hi) or (lo,)
+    return _maybe_edge(rng, edges, lambda: rng.uniform(lo, hi))
+
+
+def _g_bool(rng: random.Random) -> bool:
+    return bool(rng.getrandbits(1))
+
+
+def _g_str(rng: random.Random, *, max_len: int = 12, alphabet: Optional[str] = None) -> str:
+    alpha = alphabet or (string.ascii_letters + string.digits)
+
+    def _rand():
+        return "".join(rng.choices(alpha, k=rng.randint(0, max_len)))
+
+    if alphabet is not None:
+        # When the caller restricts the alphabet, our generic edge pool
+        # ("Hello", " ", ...) would violate it. Build a deterministic
+        # alphabet-respecting edge set instead.
+        custom_edges = ("",)
+        if alphabet:
+            custom_edges = ("", alphabet[0], alphabet[0] * min(max_len, 2))
+        return _maybe_edge(rng, custom_edges, _rand)
+    return _maybe_edge(rng, _STR_EDGES, _rand)
+
+
+def _g_bytes(rng: random.Random, *, max_len: int = 8) -> bytes:
+    def _rand():
+        return bytes(rng.randint(0, 255) for _ in range(rng.randint(0, max_len)))
+
+    return _maybe_edge(rng, _BYTES_EDGES, _rand)
+
+
+def _g_list(rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6) -> list:
+    if rng.random() < EDGE_PROB / 2:
+        return []
+    return [elem_gen() for _ in range(rng.randint(0, max_len))]
+
+
+def _g_tuple_homogeneous(
+    rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6
+) -> tuple:
+    return tuple(_g_list(rng, elem_gen, max_len=max_len))
+
+
+def _g_tuple_heterogeneous(rng: random.Random, elem_gens: List[Callable[[], Any]]) -> tuple:
+    return tuple(g() for g in elem_gens)
+
+
+def _g_set(rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6) -> set:
+    if rng.random() < EDGE_PROB / 2:
+        return set()
+    return {elem_gen() for _ in range(rng.randint(0, max_len))}
+
+
+def _g_dict(
+    rng: random.Random,
+    key_gen: Callable[[], Any],
+    val_gen: Callable[[], Any],
+    *,
+    max_len: int = 5,
+) -> dict:
+    if rng.random() < EDGE_PROB / 2:
+        return {}
+    return {key_gen(): val_gen() for _ in range(rng.randint(0, max_len))}
+
+
+# ---------------------------------------------------------------------------
+# Type -> generator dispatch.
+# ---------------------------------------------------------------------------
+
+
+def _is_optional(tp: Any) -> bool:
+    """``Optional[X]`` is ``Union[X, None]`` under the hood."""
+    if get_origin(tp) is Union:
+        return type(None) in get_args(tp)
+    return False
+
+
+def _strip_optional(tp: Any) -> Any:
+    """Return ``X`` for ``Optional[X]``; for unions with None + multiple, pick
+    the first non-None member (we can't satisfy a union in a single call)."""
+    if get_origin(tp) is Union:
+        non_none = [a for a in get_args(tp) if a is not type(None)]
+        if len(non_none) == 1:
+            return non_none[0]
+        if non_none:
+            return non_none[0]
+    return tp
+
+
+def _make_generator(tp: Any, rng: random.Random) -> Callable[[], Any]:
+    """Return a 0-arg callable that produces one random value of type ``tp``.
+
+    The recursion handles container element types (``list[int]``,
+    ``dict[str, list[int]]``, etc).
+    """
+
+    if tp is None or tp is type(None):
+        return lambda: None
+
+    if _is_optional(tp):
+        inner = _strip_optional(tp)
+        inner_gen = _make_generator(inner, rng)
+
+        def _gen_opt():
+            if rng.random() < EDGE_PROB:
+                return None
+            return inner_gen()
+
+        return _gen_opt
+
+    origin = get_origin(tp)
+
+    if origin is typing.Literal:
+        choices = list(get_args(tp))
+        return lambda: rng.choice(choices)
+
+    if origin is None:
+        if tp is int:
+            return lambda: _g_int(rng)
+        if tp is float:
+            return lambda: _g_float(rng)
+        if tp is bool:
+            return lambda: _g_bool(rng)
+        if tp is str:
+            return lambda: _g_str(rng)
+        if tp is bytes:
+            return lambda: _g_bytes(rng)
+        if tp is list:
+            return lambda: _g_list(rng, lambda: _g_int(rng))
+        if tp is tuple:
+            return lambda: _g_tuple_homogeneous(rng, lambda: _g_int(rng))
+        if tp is set:
+            return lambda: _g_set(rng, lambda: _g_int(rng))
+        if tp is dict:
+            return lambda: _g_dict(rng, lambda: _g_str(rng, max_len=4), lambda: _g_int(rng))
+        if tp is type(None):
+            return lambda: None
+        if tp is typing.Any:
+            return lambda: _g_int(rng)
+        # Unknown bare type -> fall back to int.
+        return lambda: _g_int(rng)
+
+    args = get_args(tp)
+
+    if origin in (list, List):
+        elem_t = args[0] if args else int
+        elem_gen = _make_generator(elem_t, rng)
+        return lambda: _g_list(rng, elem_gen)
+
+    if origin in (set, frozenset):
+        elem_t = args[0] if args else int
+        elem_gen = _make_generator(elem_t, rng)
+        return lambda: _g_set(rng, elem_gen)
+
+    if origin in (tuple, Tuple):
+        if not args:
+            return lambda: _g_tuple_homogeneous(rng, lambda: _g_int(rng))
+        if len(args) == 2 and args[1] is Ellipsis:
+            elem_gen = _make_generator(args[0], rng)
+            return lambda: _g_tuple_homogeneous(rng, elem_gen)
+        elem_gens = [_make_generator(a, rng) for a in args]
+        return lambda: _g_tuple_heterogeneous(rng, elem_gens)
+
+    if origin in (dict, Dict):
+        key_t = args[0] if args else str
+        val_t = args[1] if len(args) > 1 else int
+        key_gen = _make_generator(key_t, rng)
+        val_gen = _make_generator(val_t, rng)
+        return lambda: _g_dict(rng, key_gen, val_gen)
+
+    if origin is Union:
+        # Already handled Optional above. For pure unions, pick first member.
+        return _make_generator(args[0], rng)
+
+    return lambda: _g_int(rng)
+
+
+# ---------------------------------------------------------------------------
+# fuzz_spec overrides
+# ---------------------------------------------------------------------------
+
+
+def _generator_from_spec(entry: Dict[str, Any], rng: random.Random) -> Callable[[], Any]:
+    """Build a generator from a ``fuzz_spec`` entry dict.
+
+    Supported keys (all optional except ``type``):
+    - ``type``: one of ``"int" | "float" | "bool" | "str" | "bytes" |
+      "list" | "tuple" | "set" | "dict" | "literal" | "any"``
+    - ``min``, ``max``: int/float bounds
+    - ``max_len``: container/string length cap
+    - ``alphabet``: str-only character pool
+    - ``elem``: nested ``fuzz_spec`` entry for container elements
+    - ``key``, ``value``: nested entries for dict
+    - ``elems``: list of nested entries for fixed-arity tuple
+    - ``choices``: list of literals to sample from
+    - ``optional``: bool; if True, occasionally yields ``None``
+    """
+    t = entry.get("type", "any")
+
+    def _maybe_optional(gen: Callable[[], Any]) -> Callable[[], Any]:
+        if not entry.get("optional"):
+            return gen
+
+        def _g():
+            if rng.random() < EDGE_PROB:
+                return None
+            return gen()
+
+        return _g
+
+    if t == "int":
+        lo = int(entry.get("min", -100))
+        hi = int(entry.get("max", 100))
+        return _maybe_optional(lambda: _g_int(rng, lo=lo, hi=hi))
+    if t == "float":
+        lo = float(entry.get("min", -100.0))
+        hi = float(entry.get("max", 100.0))
+        return _maybe_optional(lambda: _g_float(rng, lo=lo, hi=hi))
+    if t == "bool":
+        return _maybe_optional(lambda: _g_bool(rng))
+    if t == "str":
+        max_len = int(entry.get("max_len", 12))
+        alphabet = entry.get("alphabet")
+        return _maybe_optional(lambda: _g_str(rng, max_len=max_len, alphabet=alphabet))
+    if t == "bytes":
+        max_len = int(entry.get("max_len", 8))
+        return _maybe_optional(lambda: _g_bytes(rng, max_len=max_len))
+    if t == "literal":
+        choices = list(entry.get("choices", []))
+        if not choices:
+            return _maybe_optional(lambda: None)
+        return _maybe_optional(lambda: rng.choice(choices))
+    if t == "list":
+        elem = entry.get("elem", {"type": "int"})
+        elem_gen = _generator_from_spec(elem, rng)
+        max_len = int(entry.get("max_len", 6))
+        return _maybe_optional(lambda: _g_list(rng, elem_gen, max_len=max_len))
+    if t == "tuple":
+        if "elems" in entry:
+            elem_gens = [_generator_from_spec(e, rng) for e in entry["elems"]]
+            return _maybe_optional(lambda: _g_tuple_heterogeneous(rng, elem_gens))
+        elem = entry.get("elem", {"type": "int"})
+        elem_gen = _generator_from_spec(elem, rng)
+        max_len = int(entry.get("max_len", 6))
+        return _maybe_optional(lambda: _g_tuple_homogeneous(rng, elem_gen, max_len=max_len))
+    if t == "set":
+        elem = entry.get("elem", {"type": "int"})
+        elem_gen = _generator_from_spec(elem, rng)
+        max_len = int(entry.get("max_len", 6))
+        return _maybe_optional(lambda: _g_set(rng, elem_gen, max_len=max_len))
+    if t == "dict":
+        key = entry.get("key", {"type": "str", "max_len": 4})
+        value = entry.get("value", {"type": "int"})
+        key_gen = _generator_from_spec(key, rng)
+        val_gen = _generator_from_spec(value, rng)
+        max_len = int(entry.get("max_len", 5))
+        return _maybe_optional(lambda: _g_dict(rng, key_gen, val_gen, max_len=max_len))
+    return _maybe_optional(lambda: _g_int(rng))
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+
+def auto_fuzz(
+    fn: Callable[..., Any],
+    n: int,
+    rng: Optional[random.Random] = None,
+    *,
+    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+) -> List[tuple]:
+    """Produce ``n`` argument tuples for calling ``fn``.
+
+    Each returned element is an ``args`` tuple, intended to be applied as
+    ``fn(*args)``. ``fuzz_spec`` is keyed by parameter name and overrides
+    the type-based generation per-parameter.
+    """
+    rng = rng or random.Random()
+    fuzz_spec = fuzz_spec or {}
+
+    sig = inspect.signature(fn)
+    try:
+        hints = typing.get_type_hints(fn)
+    except Exception:  # noqa: BLE001 -- bad annotations shouldn't crash fuzzing
+        hints = {}
+
+    param_gens: List[Callable[[], Any]] = []
+    for pname, param in sig.parameters.items():
+        if param.kind in (
+            inspect.Parameter.VAR_POSITIONAL,
+            inspect.Parameter.VAR_KEYWORD,
+            inspect.Parameter.KEYWORD_ONLY,
+        ):
+            # We only fuzz positional / positional-or-keyword params.
+            continue
+        if pname in fuzz_spec:
+            param_gens.append(_generator_from_spec(fuzz_spec[pname], rng))
+            continue
+        annot = hints.get(pname, param.annotation)
+        if annot is inspect.Parameter.empty:
+            param_gens.append(lambda r=rng: _g_int(r))
+        else:
+            param_gens.append(_make_generator(annot, rng))
+
+    return [tuple(g() for g in param_gens) for _ in range(n)]
+
+
+def make_fuzzer(
+    fn: Callable[..., Any],
+    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+) -> Callable[[random.Random, int], List[tuple]]:
+    """Adapt ``auto_fuzz`` to the ``FunctionSpec.fuzzer`` signature
+    (``(rng, n) -> list``)."""
+
+    def _fuzzer(rng: random.Random, n: int) -> List[tuple]:
+        return auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
+
+    return _fuzzer
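The signature-introspection loop at the heart of the module can be illustrated with a self-contained miniature. This is a simplified re-implementation for illustration only (the names `tiny_auto_fuzz` and `repeat` are invented here), not the module's actual code, which also handles containers, `Optional`, and edge-value biasing:

```python
import inspect
import random
import typing


def tiny_auto_fuzz(fn, n, rng):
    """Minimal sketch: one value generator per positional parameter,
    chosen from the parameter's type hint; unannotated params fall
    back to int, mirroring the fallback in the full module."""
    hints = typing.get_type_hints(fn)
    gens = []
    for name in inspect.signature(fn).parameters:
        t = hints.get(name, int)
        if t is str:
            gens.append(lambda: "".join(rng.choices("ab", k=rng.randint(0, 5))))
        else:  # int and unknown types alike
            gens.append(lambda: rng.randint(-100, 100))
    # One args-tuple per fuzz input, even for unary fn.
    return [tuple(g() for g in gens) for _ in range(n)]


def repeat(s: str, k: int) -> str:
    return s * max(k, 0)


batch = tiny_auto_fuzz(repeat, 5, random.Random(0))
```

Each element of `batch` is a `(str, int)` tuple suitable for `repeat(*args)`, which is the uniform calling convention the catalog wrapper relies on.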
opensleuth_env/black_box.py CHANGED
@@ -180,7 +180,7 @@ def _fuzz_prime_int(rng: random.Random, n: int) -> List[int]:
 @dataclass(frozen=True)
 class FunctionSpec:
     name: str
-    fn: Callable[[Any], Any]
+    fn: Callable[..., Any]
     signature: str
     description: str
     fuzzer: Callable[[random.Random, int], list]
@@ -189,6 +189,16 @@ class FunctionSpec:
     # fuzz batch. They are scored as their own category ("edge") so the
     # verifier can report stratified pass-rates back to the trainer.
     edge_cases: List[Any] = field(default_factory=list)
+    # Calling convention. When False (the default, used by all 9 builtins),
+    # ``fn(arg)`` is invoked with a single positional argument -- whatever
+    # the fuzzer produced. When True (used by the auto-fuzzer-generated
+    # specs for multi-parameter target functions), each fuzz input is a
+    # *tuple of args* and is unpacked: ``fn(*args)``.
+    unpack_args: bool = False
+    # Provenance: where this spec came from. Useful for /tasks?source=...
+    # Defaults to "builtin" for backwards compatibility with the original
+    # 9 hand-written specs.
+    source: str = "builtin"
 
 
 BLACK_BOX_FUNCTIONS: Dict[str, FunctionSpec] = {
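The `unpack_args` calling convention can be shown in isolation. A minimal sketch (the `MiniSpec` and `call_target` names are invented stand-ins, not the repo's `FunctionSpec` or its call sites):

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class MiniSpec:
    # Simplified stand-in for FunctionSpec, keeping only the fields
    # needed to illustrate the calling convention.
    fn: Callable[..., Any]
    unpack_args: bool = False


def call_target(spec: MiniSpec, fuzz_input: Any) -> Any:
    # Builtin (unary) specs get the fuzz input as one positional arg;
    # multi-parameter specs get a tuple that is unpacked.
    if spec.unpack_args:
        return spec.fn(*fuzz_input)
    return spec.fn(fuzz_input)


unary = MiniSpec(fn=lambda x: x * 2)
multi = MiniSpec(fn=lambda a, b: a + b, unpack_args=True)
```

With `unary`, `call_target(unary, 3)` invokes `fn(3)`; with `multi`, `call_target(multi, (2, 5))` invokes `fn(2, 5)` — the same branch the env and verifier take based on `spec.unpack_args`.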
opensleuth_env/env.py CHANGED
@@ -38,7 +38,7 @@ from __future__ import annotations
 import ast
 import logging
 import uuid
-from typing import Any, Tuple
+from typing import Any, Optional, Tuple
 
 from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
 from .models import (
@@ -50,6 +50,7 @@ from .models import (
     StepResponse,
     SubmitAction,
 )
+from .task_catalog import TaskCatalog, TaskResolutionError
 from .verifier import generate_fuzz_inputs, get_edge_inputs, verify_submission
 
 log = logging.getLogger("opensleuth.env")
@@ -121,39 +122,78 @@ def _bucket_of(x: Any) -> str:
 class OpenSleuthEnv:
     """Multi-episode environment registry."""
 
-    def __init__(self, fuzz_count: int = 100) -> None:
+    def __init__(
+        self,
+        fuzz_count: int = 100,
+        catalog: Optional["TaskCatalog"] = None,
+    ) -> None:
         self._states: dict[str, State] = {}
         self._configs: dict[str, dict] = {}
+        # Per-episode resolved spec. We cache it here (rather than looking it
+        # up by name on every step from BLACK_BOX_FUNCTIONS) because
+        # caller-supplied / Hub-loaded specs aren't in BLACK_BOX_FUNCTIONS.
+        self._episode_specs: dict[str, FunctionSpec] = {}
         self.fuzz_count = fuzz_count
+        self._catalog = catalog or TaskCatalog()
+
+    @property
+    def catalog(self) -> "TaskCatalog":
+        return self._catalog
 
     # --- Lifecycle ---------------------------------------------------------
 
-    def reset(self, target_name: str, seed: int = 0, max_steps: int = 25) -> Observation:
-        if target_name not in BLACK_BOX_FUNCTIONS:
-            raise ValueError(
-                f"Unknown target function: {target_name!r}. "
-                f"Available: {sorted(BLACK_BOX_FUNCTIONS)}"
+    def reset(
+        self,
+        target_name: Optional[str] = None,
+        seed: int = 0,
+        max_steps: int = 25,
+        *,
+        target_code: Optional[str] = None,
+        target_function_name: Optional[str] = None,
+        edge_cases: Optional[list] = None,
+        fuzz_spec: Optional[dict] = None,
+    ) -> Observation:
+        # Backwards-compat: legacy callers pass ``target_name="fibonacci"``
+        # only. The catalog handles that path identically to before.
+        try:
+            spec = self._catalog.resolve(
+                target_name=target_name,
+                target_code=target_code,
+                target_function_name=target_function_name,
+                edge_cases=edge_cases,
+                fuzz_spec=fuzz_spec,
             )
-        spec = BLACK_BOX_FUNCTIONS[target_name]
+        except TaskResolutionError as e:
+            raise ValueError(str(e)) from e
         episode_id = uuid.uuid4().hex
         self._states[episode_id] = State(
             episode_id=episode_id,
-            target_function_name=target_name,
+            target_function_name=spec.name,
             seed=seed,
         )
         self._configs[episode_id] = {"max_steps": max_steps}
+        self._episode_specs[episode_id] = spec
         return self._build_observation(episode_id, spec, last_error="")
 
+    def _spec_for(self, state: State) -> FunctionSpec:
+        spec = self._episode_specs.get(state.episode_id)
+        if spec is not None:
+            return spec
+        # Legacy fallback: if an episode was created before we started
+        # caching specs (or via a code path that bypassed reset), look up
+        # by name in the builtin registry.
+        return BLACK_BOX_FUNCTIONS[state.target_function_name]
+
     def step(self, episode_id: str, action: Action) -> StepResponse:
         state = self._states.get(episode_id)
         if state is None:
             raise KeyError(f"Unknown episode_id {episode_id!r}. Did you /reset first?")
         if state.done:
-            spec = BLACK_BOX_FUNCTIONS[state.target_function_name]
+            spec = self._spec_for(state)
             obs = self._build_observation(episode_id, spec, last_error="Episode already terminated.")
             return StepResponse(observation=obs, reward=0.0, done=True, info={"reason": "already_done"})
 
-        spec = BLACK_BOX_FUNCTIONS[state.target_function_name]
+        spec = self._spec_for(state)
         state.steps_taken += 1
         max_steps = self._configs[episode_id]["max_steps"]
 
@@ -205,7 +245,15 @@ class OpenSleuthEnv:
         intrinsic = 0.0
         last_error = ""
         try:
-            output = spec.fn(parsed)
+            if spec.unpack_args:
+                if not isinstance(parsed, tuple):
+                    raise TypeError(
+                        f"Multi-parameter target {spec.name!r} expects a tuple "
+                        f"of args, got {type(parsed).__name__}."
+                    )
+                output = spec.fn(*parsed)
+            else:
+                output = spec.fn(parsed)
             output_repr = repr(output)
             state.probe_history.append(
                 ProbeRecord(
@@ -255,6 +303,7 @@ class OpenSleuthEnv:
             fuzz_inputs,
             target_name=spec.name,
             edge_inputs=edge_inputs,
+            unpack_args=spec.unpack_args,
         )
 
         total = (
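The three-source priority order the catalog implements (caller-supplied beats name lookup; builtins beat Hub rows by name) can be sketched as a standalone function. This is an illustrative reduction — the real `TaskCatalog.resolve` also compiles `target_code` in the sandbox and attaches edge cases and fuzz specs:

```python
def resolve(target_name=None, target_code=None, builtins=None, hub=None):
    """Hypothetical sketch of the resolution order only.

    Caller-supplied target_code wins outright; target_name checks the
    builtin registry before Hub rows, so builtins win by name.
    """
    builtins = builtins or {}
    hub = hub or {}
    if target_code is not None:
        return ("caller", target_code)
    if target_name in builtins:
        return ("builtin", builtins[target_name])
    if target_name in hub:
        return ("hub", hub[target_name])
    raise KeyError(f"unresolvable target: {target_name!r}")
```

For example, a `fibonacci` present in both registries resolves to the builtin copy, matching the "builtins win by name over Hub copies" guarantee.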
opensleuth_env/models.py CHANGED
@@ -91,9 +91,43 @@ class State(BaseModel):
 class ResetRequest(BaseModel):
-    target_name: str = "fibonacci"
+    """Reset payload.
+
+    The original (v0.3) shape ``{"target_name": "fibonacci", "seed": 0,
+    "max_steps": 25}`` still works exactly as before -- the four new fields
+    below are all optional and additive so the in-flight trainer doesn't
+    have to change.
+
+    Open-ended (Level 2) targets are specified by passing ``target_code``
+    + ``target_function_name`` (and optionally ``edge_cases`` and
+    ``fuzz_spec``), which is then resolved via the TaskCatalog using the
+    same hardened sandbox the verifier uses for agent submissions.
+    """
+
+    target_name: Optional[str] = None
     seed: int = 0
     max_steps: int = 25
+    # --- Level 2 open-ended fields (additive, default-None) ---
+    target_code: Optional[str] = Field(
+        default=None,
+        description="Python source defining a black-box callable. When set, "
+        "overrides target_name (caller-supplied beats Hub beats builtin).",
+    )
+    target_function_name: Optional[str] = Field(
+        default=None,
+        description="Name of the callable inside target_code to use as the "
+        "oracle. Required when target_code is set.",
+    )
+    edge_cases: Optional[List[str]] = Field(
+        default=None,
+        description="Optional list of must-pass probe inputs as Python "
+        "literal strings (e.g. ['0', '\"\"', '([1,2,3], 2)']).",
+    )
+    fuzz_spec: Optional[dict] = Field(
+        default=None,
+        description="Optional auto-fuzzer override map keyed by parameter "
+        "name, e.g. {'n': {'type': 'int', 'min': 1, 'max': 90}}.",
+    )


 class StepRequest(BaseModel):
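The extended `/reset` shape above can be sketched as a plain JSON body. A minimal illustration (the 3-arg `clamp` target is hypothetical, not one of the builtins; field names follow the `ResetRequest` model in this diff):

```python
import json

# Hypothetical Level 2 /reset payload: caller-supplied target_code
# overrides target_name (caller beats Hub beats builtin).
payload = {
    "seed": 0,
    "max_steps": 25,
    "target_code": (
        "def clamp(x, lo, hi):\n"
        "    if lo > hi:\n"
        "        raise ValueError('lo must be <= hi')\n"
        "    return max(lo, min(x, hi))\n"
    ),
    "target_function_name": "clamp",
    # Must-pass probe inputs as Python literal strings; a multi-arg
    # target takes tuple literals.
    "edge_cases": ["(5, 0, 10)", "(-1, 0, 10)", "(99, 0, 10)"],
    # Auto-fuzzer overrides keyed by parameter name.
    "fuzz_spec": {
        "x": {"type": "int", "min": -100, "max": 100},
        "lo": {"type": "int", "min": -10, "max": 0},
        "hi": {"type": "int", "min": 1, "max": 10},
    },
}

body = json.dumps(payload)  # what would be POSTed to /reset
```

Omitting the four new fields and sending `{"target_name": "fibonacci", "seed": 0}` keeps the v0.3 behaviour.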
opensleuth_env/scripts/__init__.py ADDED
File without changes
opensleuth_env/scripts/bootstrap_tasks_dataset.py ADDED
@@ -0,0 +1,508 @@
+"""Bootstrap / refresh the OpenSleuth Hub task catalog.
+
+Idempotently creates ``anugrah55/opensleuth-tasks`` and pushes:
+
+* The 9 builtin BLACK_BOX_FUNCTIONS as rows (so the dataset is non-empty
+  for testing and so the trainer's curriculum has parity with the
+  in-process oracle), and
+* 6 brand-new tasks (``roman_to_int``, ``levenshtein_distance``,
+  ``flatten_list``, ``merge_sorted``, ``run_length_encode``,
+  ``binary_search``) that aren't in BLACK_BOX_FUNCTIONS, exercising
+  multi-arg and unannotated cases the auto-fuzzer must handle.
+
+Each row is::
+
+    {
+        "name": str,
+        "target_function_name": str,   # which fn inside source_code
+        "signature": str,
+        "description": str,
+        "difficulty": "easy"|"medium"|"hard",
+        "source_code": str,            # standalone Python; NO oracle imports
+        "edge_cases_json": str,        # JSON list of literal-repr strings
+        "fuzz_spec_json": str,         # JSON dict or "null"
+    }
+
+Run::
+
+    cd env && PYTHONPATH=. ../.venv/bin/python -m opensleuth_env.scripts.bootstrap_tasks_dataset
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import sys
+from typing import Any, Dict, List, Optional
+
+from opensleuth_env.black_box import BLACK_BOX_FUNCTIONS
+
+log = logging.getLogger("opensleuth.bootstrap")
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
+
+DATASET_ID = "anugrah55/opensleuth-tasks"
+
+
+# ---------------------------------------------------------------------------
+# Oracle source code for the 9 builtins (self-contained -- no opensleuth_*
+# imports, so the catalog's static reject filter accepts them).
+# ---------------------------------------------------------------------------
+
+
+_BUILTIN_SOURCE: Dict[str, Dict[str, Any]] = {
+    "fibonacci": {
+        "target_function_name": "fibonacci",
+        "source_code": (
+            "def fibonacci(n):\n"
+            "    if not isinstance(n, int) or isinstance(n, bool) or n <= 0 or n > 90:\n"
+            "        raise ValueError('Input must be a positive integer <= 90.')\n"
+            "    a, b = 0, 1\n"
+            "    for _ in range(n - 1):\n"
+            "        a, b = b, a + b\n"
+            "    return b if n > 0 else a\n"
+        ),
+        "edge_cases": ["1", "2", "3", "10", "89", "90"],
+        "fuzz_spec": {"n": {"type": "int", "min": 1, "max": 90}},
+    },
+    "reverse_string": {
+        "target_function_name": "reverse_string",
+        "source_code": (
+            "def reverse_string(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    return s[::-1]\n"
+        ),
+        "edge_cases": ['""', '"a"', '"ab"', '"racecar"', '"Hello, World!"'],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 12}},
+    },
+    "is_palindrome": {
+        "target_function_name": "is_palindrome",
+        "source_code": (
+            "def is_palindrome(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())\n"
+            "    return cleaned == cleaned[::-1]\n"
+        ),
+        "edge_cases": [
+            '""', '"a"', '"ab"', '"abba"',
+            "\"A man, a plan, a canal: Panama\"", '"Hello"',
+        ],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 12}},
+    },
+    "digit_sum": {
+        "target_function_name": "digit_sum",
+        "source_code": (
+            "def digit_sum(n):\n"
+            "    if not isinstance(n, int) or isinstance(n, bool):\n"
+            "        raise TypeError('Input must be int.')\n"
+            "    if n < 0:\n"
+            "        raise ValueError('Input must be non-negative.')\n"
+            "    return sum(int(c) for c in str(n))\n"
+        ),
+        "edge_cases": ["0", "1", "9", "10", "99", "100", "9999"],
+        "fuzz_spec": {"n": {"type": "int", "min": 0, "max": 10000}},
+    },
+    "count_vowels": {
+        "target_function_name": "count_vowels",
+        "source_code": (
+            "def count_vowels(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    return sum(1 for c in s.lower() if c in 'aeiou')\n"
+        ),
+        "edge_cases": ['""', '"bcd"', '"AEIOU"', '"Hello, World!"', '"aaaaa"'],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 16}},
+    },
+    "gcd": {
+        "target_function_name": "gcd",
+        "source_code": (
+            "def gcd(pair):\n"
+            "    if not isinstance(pair, (list, tuple)) or len(pair) != 2:\n"
+            "        raise TypeError('Input must be a 2-element list or tuple.')\n"
+            "    a, b = pair\n"
+            "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in (a, b)):\n"
+            "        raise TypeError('Both elements must be int.')\n"
+            "    if a < 0 or b < 0:\n"
+            "        raise ValueError('Both elements must be non-negative.')\n"
+            "    while b:\n"
+            "        a, b = b, a % b\n"
+            "    return a\n"
+        ),
+        "edge_cases": ["(0, 0)", "(0, 7)", "(12, 18)", "(17, 13)", "(100, 75)"],
+        "fuzz_spec": {
+            "pair": {
+                "type": "tuple",
+                "elems": [{"type": "int", "min": 0, "max": 1000}, {"type": "int", "min": 0, "max": 1000}],
+            }
+        },
+    },
+    "sort_unique": {
+        "target_function_name": "sort_unique",
+        "source_code": (
+            "def sort_unique(xs):\n"
+            "    if not isinstance(xs, list):\n"
+            "        raise TypeError('Input must be a list.')\n"
+            "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in xs):\n"
+            "        raise TypeError('All elements must be int.')\n"
+            "    return sorted(set(xs))\n"
+        ),
+        "edge_cases": ["[]", "[1]", "[1, 1, 1]", "[3, 1, 2]", "[-5, 5, 0, -5, 5]"],
+        "fuzz_spec": {"xs": {"type": "list", "elem": {"type": "int", "min": -50, "max": 50}, "max_len": 8}},
+    },
+    "caesar_cipher": {
+        "target_function_name": "caesar_cipher",
+        "source_code": (
+            "def caesar_cipher(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('Input must be a string.')\n"
+            "    out = []\n"
+            "    for ch in s:\n"
+            "        if 'a' <= ch <= 'z':\n"
+            "            out.append(chr((ord(ch) - ord('a') + 3) % 26 + ord('a')))\n"
+            "        else:\n"
+            "            out.append(ch)\n"
+            "    return ''.join(out)\n"
+        ),
+        "edge_cases": ['""', '"abc"', '"xyz"', '"Hello, World!"', '"ABC"', '"hello world"'],
+        "fuzz_spec": {"s": {"type": "str", "max_len": 16}},
+    },
+    "is_prime": {
+        "target_function_name": "is_prime",
+        "source_code": (
+            "def is_prime(n):\n"
+            "    if not isinstance(n, int) or isinstance(n, bool):\n"
+            "        raise TypeError('Input must be int.')\n"
+            "    if n < 2:\n"
+            "        return False\n"
+            "    if n < 4:\n"
+            "        return True\n"
+            "    if n % 2 == 0:\n"
+            "        return False\n"
+            "    i = 3\n"
+            "    while i * i <= n:\n"
+            "        if n % i == 0:\n"
+            "            return False\n"
+            "        i += 2\n"
+            "    return True\n"
+        ),
+        "edge_cases": ["0", "1", "2", "3", "4", "17", "25", "97", "100"],
+        "fuzz_spec": {"n": {"type": "int", "min": 0, "max": 200}},
+    },
+}
+
+
+# ---------------------------------------------------------------------------
+# Six new tasks. These exercise auto-fuzzer features the builtins didn't:
+#   * multi-arg signatures (binary_search, merge_sorted, levenshtein_distance)
+#   * Optional / Literal hint coverage (run_length_encode -> list[tuple[str, int]])
+#   * unannotated containers (flatten_list)
+# ---------------------------------------------------------------------------
+
+
+_NEW_TASK_ROWS: List[Dict[str, Any]] = [
+    {
+        "name": "roman_to_int",
+        "target_function_name": "roman_to_int",
+        "signature": "roman_to_int(s: str) -> int",
+        "description": (
+            "Parse a roman numeral string into its integer value. "
+            "Raises ValueError for non-roman characters. Subtraction "
+            "rules (IV=4, IX=9, XL=40, ...) are honoured. Empty -> 0."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def roman_to_int(s: str) -> int:\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('input must be str')\n"
+            "    table = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}\n"
+            "    total = 0\n"
+            "    prev = 0\n"
+            "    for ch in reversed(s.upper()):\n"
+            "        if ch not in table:\n"
+            "            raise ValueError(f'invalid roman numeral character: {ch!r}')\n"
+            "        v = table[ch]\n"
+            "        if v < prev:\n"
+            "            total -= v\n"
+            "        else:\n"
+            "            total += v\n"
+            "        prev = v\n"
+            "    return total\n"
+        ),
+        "edge_cases": ['""', '"I"', '"IV"', '"IX"', '"LVIII"', '"MCMXCIV"', '"MMXXIV"'],
+        "fuzz_spec": {"s": {"type": "str", "alphabet": "IVXLCDM", "max_len": 8}},
+    },
+    {
+        "name": "levenshtein_distance",
+        "target_function_name": "levenshtein_distance",
+        "signature": "levenshtein_distance(a: str, b: str) -> int",
+        "description": (
+            "Classic edit distance between two strings: minimum number of "
+            "single-character insertions, deletions, or substitutions to "
+            "transform a into b. Both arguments must be str."
+        ),
+        "difficulty": "hard",
+        "source_code": (
+            "def levenshtein_distance(a: str, b: str) -> int:\n"
+            "    if not isinstance(a, str) or not isinstance(b, str):\n"
+            "        raise TypeError('both arguments must be str')\n"
+            "    if a == b:\n"
+            "        return 0\n"
+            "    if not a:\n"
+            "        return len(b)\n"
+            "    if not b:\n"
+            "        return len(a)\n"
+            "    prev = list(range(len(b) + 1))\n"
+            "    for i, ca in enumerate(a, 1):\n"
+            "        cur = [i] + [0] * len(b)\n"
+            "        for j, cb in enumerate(b, 1):\n"
+            "            ins = cur[j-1] + 1\n"
+            "            dele = prev[j] + 1\n"
+            "            sub = prev[j-1] + (ca != cb)\n"
+            "            cur[j] = min(ins, dele, sub)\n"
+            "        prev = cur\n"
+            "    return prev[-1]\n"
+        ),
+        "edge_cases": [
+            '("", "")', '("a", "")', '("", "a")', '("kitten", "sitting")',
+            '("flaw", "lawn")', '("abc", "abc")',
+        ],
+        "fuzz_spec": {
+            "a": {"type": "str", "alphabet": "abc", "max_len": 6},
+            "b": {"type": "str", "alphabet": "abc", "max_len": 6},
+        },
+    },
+    {
+        "name": "flatten_list",
+        "target_function_name": "flatten_list",
+        "signature": "flatten_list(xs: list) -> list",
+        "description": (
+            "Recursively flatten a nested list of arbitrary depth. Tuples "
+            "are also flattened; non-list/tuple atoms (ints, strs, ...) "
+            "pass through unchanged."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def flatten_list(xs):\n"
+            "    if not isinstance(xs, (list, tuple)):\n"
+            "        raise TypeError('input must be list or tuple')\n"
+            "    out = []\n"
+            "    stack = list(xs)\n"
+            "    # iterative DFS to avoid recursion limits on adversarial input\n"
+            "    rev = []\n"
+            "    rev.extend(reversed(stack))\n"
+            "    while rev:\n"
+            "        x = rev.pop()\n"
+            "        if isinstance(x, (list, tuple)):\n"
+            "            for y in reversed(x):\n"
+            "                rev.append(y)\n"
+            "        else:\n"
+            "            out.append(x)\n"
+            "    return out\n"
+        ),
+        "edge_cases": [
+            "[]", "[1]", "[[1, 2], [3, 4]]",
+            "[1, [2, [3, [4, [5]]]]]", "[[], [], 1]",
+        ],
+        "fuzz_spec": {
+            "xs": {
+                "type": "list",
+                "elem": {"type": "int", "min": -10, "max": 10},
+                "max_len": 6,
+            }
+        },
+    },
+    {
+        "name": "merge_sorted",
+        "target_function_name": "merge_sorted",
+        "signature": "merge_sorted(a: list[int], b: list[int]) -> list[int]",
+        "description": (
+            "Merge two pre-sorted lists of ints into a single sorted list. "
+            "Both arguments must be lists; elements must be ints (bools "
+            "rejected). The classic merge step of merge-sort."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def merge_sorted(a, b):\n"
+            "    if not isinstance(a, list) or not isinstance(b, list):\n"
+            "        raise TypeError('both arguments must be list')\n"
+            "    for x in (*a, *b):\n"
+            "        if not isinstance(x, int) or isinstance(x, bool):\n"
+            "            raise TypeError('elements must be int')\n"
+            "    out = []\n"
+            "    i = j = 0\n"
+            "    while i < len(a) and j < len(b):\n"
+            "        if a[i] <= b[j]:\n"
+            "            out.append(a[i]); i += 1\n"
+            "        else:\n"
+            "            out.append(b[j]); j += 1\n"
+            "    out.extend(a[i:])\n"
+            "    out.extend(b[j:])\n"
+            "    return out\n"
+        ),
+        "edge_cases": [
+            "([], [])", "([1, 2, 3], [])", "([], [1, 2, 3])",
+            "([1, 3, 5], [2, 4, 6])", "([1, 1], [1, 1])",
+        ],
+        "fuzz_spec": {
+            "a": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 5},
+            "b": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 5},
+        },
+    },
+    {
+        "name": "run_length_encode",
+        "target_function_name": "run_length_encode",
+        "signature": "run_length_encode(s: str) -> list[tuple[str, int]]",
+        "description": (
+            "Run-length encoding: returns a list of (character, count) "
+            "tuples for each run of identical characters in s. Empty "
+            "input yields an empty list."
+        ),
+        "difficulty": "easy",
+        "source_code": (
+            "def run_length_encode(s):\n"
+            "    if not isinstance(s, str):\n"
+            "        raise TypeError('input must be str')\n"
+            "    if not s:\n"
+            "        return []\n"
+            "    out = []\n"
+            "    cur = s[0]\n"
+            "    n = 1\n"
+            "    for ch in s[1:]:\n"
+            "        if ch == cur:\n"
+            "            n += 1\n"
+            "        else:\n"
+            "            out.append((cur, n))\n"
+            "            cur = ch\n"
+            "            n = 1\n"
+            "    out.append((cur, n))\n"
+            "    return out\n"
+        ),
+        "edge_cases": ['""', '"a"', '"aa"', '"abc"', '"aaabbbccc"', '"aaaaaaaaaa"'],
+        "fuzz_spec": {"s": {"type": "str", "alphabet": "ab", "max_len": 12}},
+    },
+    {
+        "name": "binary_search",
+        "target_function_name": "binary_search",
+        "signature": "binary_search(arr: list[int], target: int) -> int",
+        "description": (
+            "Return the index of target in the sorted ascending list arr, "
+            "or -1 if not present. arr must be a list of ints; target "
+            "must be int. The list is assumed sorted."
+        ),
+        "difficulty": "medium",
+        "source_code": (
+            "def binary_search(arr, target):\n"
+            "    if not isinstance(arr, list):\n"
+            "        raise TypeError('arr must be list')\n"
+            "    if not isinstance(target, int) or isinstance(target, bool):\n"
+            "        raise TypeError('target must be int')\n"
+            "    lo, hi = 0, len(arr) - 1\n"
+            "    while lo <= hi:\n"
+            "        mid = (lo + hi) // 2\n"
+            "        v = arr[mid]\n"
+            "        if v == target:\n"
+            "            return mid\n"
+            "        if v < target:\n"
+            "            lo = mid + 1\n"
+            "        else:\n"
+            "            hi = mid - 1\n"
+            "    return -1\n"
+        ),
+        "edge_cases": [
+            "([], 3)", "([1], 1)", "([1], 2)",
+            "([1, 2, 3, 4, 5], 3)", "([1, 2, 3, 4, 5], 0)",
+            "([1, 2, 3, 4, 5], 6)",
+        ],
+        "fuzz_spec": {
+            "arr": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 8},
+            "target": {"type": "int", "min": -20, "max": 20},
+        },
+    },
+]
+
+
+def _builtin_to_row(name: str) -> Dict[str, Any]:
+    spec = BLACK_BOX_FUNCTIONS[name]
+    src_meta = _BUILTIN_SOURCE[name]
+    return {
+        "name": name,
+        "target_function_name": src_meta["target_function_name"],
+        "signature": spec.signature,
+        "description": spec.description,
+        "difficulty": spec.difficulty,
+        "source_code": src_meta["source_code"],
+        "edge_cases_json": json.dumps(src_meta["edge_cases"]),
+        "fuzz_spec_json": json.dumps(src_meta["fuzz_spec"]),
+    }
+
+
+def _new_task_to_row(meta: Dict[str, Any]) -> Dict[str, Any]:
+    return {
+        "name": meta["name"],
+        "target_function_name": meta["target_function_name"],
+        "signature": meta["signature"],
+        "description": meta["description"],
+        "difficulty": meta["difficulty"],
+        "source_code": meta["source_code"],
+        "edge_cases_json": json.dumps(meta["edge_cases"]),
+        "fuzz_spec_json": json.dumps(meta["fuzz_spec"]),
+    }
+
+
+def build_rows() -> List[Dict[str, Any]]:
+    rows: List[Dict[str, Any]] = []
+    for name in BLACK_BOX_FUNCTIONS:
+        rows.append(_builtin_to_row(name))
+    for meta in _NEW_TASK_ROWS:
+        rows.append(_new_task_to_row(meta))
+    return rows
+
+
+def push_to_hub(rows: List[Dict[str, Any]], dataset_id: str, *, private: bool = False) -> str:
+    """Push the row list to ``dataset_id`` (overwriting any prior contents).
+    Returns the hub URL.
+    """
+    from datasets import Dataset
+    from huggingface_hub import HfApi
+
+    api = HfApi()
+    api.create_repo(
+        repo_id=dataset_id,
+        repo_type="dataset",
+        exist_ok=True,
+        private=private,
+    )
+
+    ds = Dataset.from_list(rows)
+    log.info("pushing %d row(s) to %s", len(rows), dataset_id)
+    ds.push_to_hub(dataset_id, split="train", private=private)
+    return f"https://huggingface.co/datasets/{dataset_id}"
+
+
+def main(argv: Optional[List[str]] = None) -> int:
+    p = argparse.ArgumentParser(description="Bootstrap the OpenSleuth Hub task catalog.")
+    p.add_argument("--dataset-id", default=DATASET_ID)
+    p.add_argument("--dry-run", action="store_true", help="Print row count, don't push.")
+    p.add_argument("--private", action="store_true", help="Create as private dataset.")
+    args = p.parse_args(argv)
+
+    rows = build_rows()
+    log.info("built %d row(s) (%d builtin + %d new)",
+             len(rows), len(BLACK_BOX_FUNCTIONS), len(_NEW_TASK_ROWS))
+    for r in rows:
+        log.info("  %-22s difficulty=%-6s edges=%-2d",
+                 r["name"], r["difficulty"], len(json.loads(r["edge_cases_json"])))

+    if args.dry_run:
+        log.info("--dry-run: not pushing")
+        return 0
+
+    url = push_to_hub(rows, args.dataset_id, private=args.private)
+    log.info("dataset live at %s", url)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
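The `*_json` string columns are the only non-obvious part of the row schema: list/dict metadata is serialised on push and `json.loads`-ed back by the consumer. A standalone sketch of that round trip (`task_meta_to_row` is an illustrative copy mirroring the script's `_new_task_to_row`, not an import from it):

```python
import json

def task_meta_to_row(meta):
    """Illustrative copy of the script's row shape: the edge_cases list
    and fuzz_spec dict become JSON-string columns on the Hub dataset."""
    return {
        "name": meta["name"],
        "edge_cases_json": json.dumps(meta["edge_cases"]),
        "fuzz_spec_json": json.dumps(meta["fuzz_spec"]),
    }

row = task_meta_to_row({
    "name": "binary_search",
    "edge_cases": ["([], 3)", "([1], 1)"],
    "fuzz_spec": {"target": {"type": "int", "min": -20, "max": 20}},
})

# The consumer side (task_catalog) recovers the structures with json.loads.
assert json.loads(row["edge_cases_json"]) == ["([], 3)", "([1], 1)"]
```

Keeping the columns as strings sidesteps Arrow schema unification across rows whose fuzz specs have different shapes.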
opensleuth_env/task_catalog.py ADDED
@@ -0,0 +1,469 @@
+"""TaskCatalog: resolve a target function from one of three sources.
+
+OpenSleuth Level 2 makes the env open-ended. Where v0.3 only knew about the
+9 hand-written ``BLACK_BOX_FUNCTIONS``, the catalog accepts targets from:
+
+1. **Caller-supplied** -- per-/reset payload, the most specific source.
+   The caller passes ``target_code`` + ``target_function_name`` (and
+   optionally ``edge_cases`` / ``fuzz_spec``) and we compile the source
+   in the same hardened sandbox the verifier uses for submissions.
+
+2. **Hub dataset** -- ``anugrah55/opensleuth-tasks`` on Hugging Face Hub.
+   Each row carries ``{name, signature, description, difficulty,
+   source_code, edge_cases_json, fuzz_spec_json}``. Loaded lazily on
+   first reset and cached in-process.
+
+3. **Builtin registry** -- the original 9 ``BLACK_BOX_FUNCTIONS``. Kept
+   as the safety-net so the in-flight trainer keeps working unchanged.
+
+Resolution priority: caller-supplied wins, then Hub by name, then builtin.
+This makes "trainer asks for fibonacci" still resolve to the builtin
+fibonacci even when the Hub copy exists, *unless* the caller explicitly
+overrides via ``target_code``.
+
+Sandbox: caller-supplied / Hub source code is executed via the same
+``_make_safe_globals`` whitelist as agent submissions (no ``__import__``,
+``open``, ``eval``, ...). On top we statically reject any source that
+imports ``opensleuth_*`` to prevent oracle-cheesing.
+"""
+
+from __future__ import annotations
+
+import ast
+import inspect
+import json
+import logging
+import threading
+from typing import Any, Callable, Dict, List, Optional
+
+from .auto_fuzzer import auto_fuzz, make_fuzzer
+from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
+from .verifier import _make_safe_globals  # reuse the hardened sandbox
+
+log = logging.getLogger("opensleuth.task_catalog")
+
+HUB_DATASET_ID = "anugrah55/opensleuth-tasks"
+
+
+class TaskResolutionError(ValueError):
+    """Raised when a /reset request can't be turned into a FunctionSpec."""
+
+
+# ---------------------------------------------------------------------------
+# Caller / Hub source-code compilation
+# ---------------------------------------------------------------------------
+
+
+_FORBIDDEN_PREFIXES = ("opensleuth", "opensleuth_env")
+
+
+def _statically_reject_oracle_import(code: str) -> Optional[str]:
+    """Return an error string if the source statically imports the env's own
+    oracle module (which would let the agent / Hub author cheese the
+    verifier). The hardened sandbox already blocks ``__import__``, but we
+    fail fast and surface a clear error.
+    """
+    try:
+        tree = ast.parse(code)
+    except SyntaxError as e:
+        return f"target_code is not valid Python: {e}"
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Import):
+            for alias in node.names:
+                if any(alias.name.startswith(p) for p in _FORBIDDEN_PREFIXES):
+                    return (
+                        f"target_code is not allowed to import {alias.name!r} "
+                        "(oracle import)."
+                    )
+        elif isinstance(node, ast.ImportFrom):
+            mod = node.module or ""
+            if any(mod.startswith(p) for p in _FORBIDDEN_PREFIXES):
+                return (
+                    f"target_code is not allowed to import from {mod!r} "
+                    "(oracle import)."
+                )
+    return None
+
+
+def _compile_target_in_sandbox(code: str, function_name: str) -> Callable[..., Any]:
+    """Compile ``code`` in the same restricted globals the verifier uses for
+    agent submissions, then return the named callable. Raises
+    ``TaskResolutionError`` on any problem so /reset can return a clean 400.
+    """
+    err = _statically_reject_oracle_import(code)
+    if err:
+        raise TaskResolutionError(err)
+    safe_globals = _make_safe_globals()
+    local_scope: Dict[str, Any] = {}
+    try:
+        exec(code, safe_globals, local_scope)
+    except Exception as e:  # noqa: BLE001
+        raise TaskResolutionError(
+            f"target_code raised at definition time: {type(e).__name__}: {e}"
+        ) from e
+    fn = local_scope.get(function_name) or safe_globals.get(function_name)
+    if not callable(fn):
+        raise TaskResolutionError(
+            f"target_code does not define a callable named {function_name!r}."
+        )
+    return fn
+
+
+def _arity_of(fn: Callable[..., Any]) -> int:
+    """Number of positional / positional-or-keyword params on ``fn``."""
+    try:
+        sig = inspect.signature(fn)
+    except (TypeError, ValueError):
+        return 1
+    n = 0
+    for p in sig.parameters.values():
+        if p.kind in (
+            inspect.Parameter.POSITIONAL_ONLY,
+            inspect.Parameter.POSITIONAL_OR_KEYWORD,
+        ):
+            n += 1
+    return max(n, 1)
+
+
+def _signature_string(fn: Callable[..., Any], name: str) -> str:
+    try:
+        sig = inspect.signature(fn)
+        return f"{name}{sig}"
+    except (TypeError, ValueError):
+        return f"{name}(...)"
+
+
+def _description_of(fn: Callable[..., Any]) -> str:
+    return inspect.getdoc(fn) or ""
+
+
+def _parse_edge_cases(edge_cases: Optional[List[Any]]) -> List[Any]:
+    """Edge cases arrive as a list of strings (Python literal reprs) when
+    coming from the API or from the Hub's ``edge_cases_json`` column. Each
+    string is parsed via ``ast.literal_eval``. Already-parsed values
+    (e.g. ints from the bootstrap script) are passed through unchanged.
+    """
+    if not edge_cases:
+        return []
+    parsed: List[Any] = []
+    for raw in edge_cases:
+        if isinstance(raw, str):
+            try:
+                parsed.append(ast.literal_eval(raw))
+            except (ValueError, SyntaxError) as e:
+                raise TaskResolutionError(
+                    f"edge_cases entry {raw!r} is not a Python literal: {e}"
+                ) from e
+        else:
+            parsed.append(raw)
+    return parsed
+
+
+def _flatten_unary_edges(arity: int, edges: List[Any]) -> List[Any]:
+    """For unary fns we accept either ``[5, 10]`` or ``[(5,), (10,)]`` and
+    normalise to flat values; for multi-arg fns we require tuples and pass
+    them through."""
+    if arity == 1:
+        out = []
+        for e in edges:
+            if isinstance(e, tuple) and len(e) == 1:
+                out.append(e[0])
+            else:
+                out.append(e)
+        return out
+    out = []
+    for e in edges:
+        if not isinstance(e, tuple):
+            raise TaskResolutionError(
+                f"edge_cases for a {arity}-arg target must be tuples, "
+                f"got {type(e).__name__}: {e!r}"
+            )
+        out.append(e)
+    return out
+
+
+def _spec_from_callable(
+    name: str,
+    fn: Callable[..., Any],
+    *,
+    description: Optional[str] = None,
+    signature: Optional[str] = None,
+    difficulty: str = "medium",
+    edge_cases: Optional[List[Any]] = None,
+    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+    source: str = "user",
+) -> FunctionSpec:
+    """Build a FunctionSpec from a Python callable + optional metadata.
+
+    Wraps ``auto_fuzz`` for the fuzzer. The arity is auto-detected from
+    ``inspect.signature`` so ``unpack_args`` is set correctly: unary fns
+    behave like the existing builtins (single-arg call), N-arg fns flow
+    through the tuple-unpacking path in env / verifier.
+    """
+    arity = _arity_of(fn)
+    unpack = arity > 1
+
+    parsed_edges = _flatten_unary_edges(arity, _parse_edge_cases(edge_cases))
+
+    if unpack:
+        # Catalog-level adapter: keep the public spec.fn one-arg-style for
+        # the *unary* path so existing call sites work, but for multi-arg
+        # the env/verifier respect ``unpack_args`` and call ``fn(*args)``.
+        # We still store the original here -- env._handle_probe and
+        # verify_submission do the unpacking.
+        ref_fn: Callable[..., Any] = fn
+
+        def _fuzzer(rng, n):
+            return auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
+
+    else:
+        ref_fn = fn
+
+        def _unary_fuzzer(rng, n):
+            tuples = auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
+            return [t[0] if isinstance(t, tuple) and len(t) == 1 else t for t in tuples]
+
+        _fuzzer = _unary_fuzzer
+
+    return FunctionSpec(
+        name=name,
+        fn=ref_fn,
+        signature=signature or _signature_string(fn, name),
+        description=description or _description_of(fn),
+        fuzzer=_fuzzer,
+        difficulty=difficulty,
+        edge_cases=parsed_edges,
+        unpack_args=unpack,
+        source=source,
+    )
+
+
+# ---------------------------------------------------------------------------
+# Hub loader
+# ---------------------------------------------------------------------------
+
+
+class _HubCache:
+    """Lazily loads the Hub dataset into ``{name: FunctionSpec}``. Thread-
+    safe initialisation; subsequent reads are lock-free."""
+
+    def __init__(self, dataset_id: str):
+        self.dataset_id = dataset_id
+        self._lock = threading.Lock()
+        self._loaded: bool = False
+        self._specs: Dict[str, FunctionSpec] = {}
+        self._raw_rows: List[Dict[str, Any]] = []
+        self._load_error: Optional[str] = None
+
+    @property
+    def loaded(self) -> bool:
+        return self._loaded
+
+    @property
+    def load_error(self) -> Optional[str]:
+        return self._load_error
+
+    def _row_to_spec(self, row: Dict[str, Any]) -> Optional[FunctionSpec]:
+        name = row.get("name")
+        code = row.get("source_code")
+        if not name or not code:
+            return None
+        fn_name = row.get("target_function_name") or name
+        try:
+            fn = _compile_target_in_sandbox(code, fn_name)
+        except TaskResolutionError as e:
+            log.warning("hub task %r failed to compile: %s", name, e)
+            return None
+        edge_cases_raw = row.get("edge_cases_json") or "[]"
+        fuzz_spec_raw = row.get("fuzz_spec_json") or "null"
+        try:
+            edge_cases = json.loads(edge_cases_raw) if isinstance(edge_cases_raw, str) else edge_cases_raw
+        except json.JSONDecodeError:
+            edge_cases = []
+        try:
+            fuzz_spec = json.loads(fuzz_spec_raw) if isinstance(fuzz_spec_raw, str) else fuzz_spec_raw
+        except json.JSONDecodeError:
+            fuzz_spec = None
+        try:
+            return _spec_from_callable(
+                name=name,
+                fn=fn,
+                description=row.get("description") or _description_of(fn),
+                signature=row.get("signature") or _signature_string(fn, name),
+                difficulty=row.get("difficulty") or "medium",
+                edge_cases=edge_cases,
+                fuzz_spec=fuzz_spec,
+                source="hub",
+            )
+        except TaskResolutionError as e:
+            log.warning("hub task %r could not be specced: %s", name, e)
+            return None
+
+    def ensure_loaded(self) -> None:
+        if self._loaded:
+            return
+        with self._lock:
+            if self._loaded:
+                return
+            try:
+                from datasets import load_dataset  # type: ignore
+
+                ds = load_dataset(self.dataset_id, split="train")
+                rows = list(ds)
+                specs: Dict[str, FunctionSpec] = {}
+                for row in rows:
+                    spec = self._row_to_spec(row)
+                    if spec is not None:
+                        specs[spec.name] = spec
+                self._specs = specs
+                self._raw_rows = rows
+                log.info(
+                    "loaded %d task(s) from %s (%d row(s) total)",
+                    len(specs),
+                    self.dataset_id,
+                    len(rows),
+                )
+            except Exception as e:  # noqa: BLE001
+                # Hub unreachable / not yet bootstrapped / offline. We swallow
+                # the error so the env keeps working from the builtin
+                # registry alone -- this is what lets the trainer keep
+                # running even if the Hub goes down mid-rollout.
+                self._load_error = f"{type(e).__name__}: {e}"
+                log.warning("hub dataset %s unavailable: %s", self.dataset_id, self._load_error)
+            finally:
+                self._loaded = True
+
+    def specs(self) -> Dict[str, FunctionSpec]:
+        self.ensure_loaded()
+        return self._specs
+
+    def rows(self) -> List[Dict[str, Any]]:
+        self.ensure_loaded()
+        return self._raw_rows
+
+
+# ---------------------------------------------------------------------------
+# TaskCatalog
+# ---------------------------------------------------------------------------
+
+
+class TaskCatalog:
+    """Resolves /reset payloads to FunctionSpecs from caller / Hub / builtin."""
+
+    def __init__(
+        self,
+        hub_dataset_id: str = HUB_DATASET_ID,
+        *,
+        enable_hub: bool = True,
+    ) -> None:
+        self.hub_dataset_id = hub_dataset_id
+        self.enable_hub = enable_hub
+        self._hub = _HubCache(hub_dataset_id) if enable_hub else None
+
+    # --- Resolution --------------------------------------------------------
+
+    def resolve(
+        self,
+        target_name: Optional[str] = None,
+        target_code: Optional[str] = None,
+        target_function_name: Optional[str] = None,
+        edge_cases: Optional[List[Any]] = None,
+        fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
+    ) -> FunctionSpec:
+        # 1. Caller-supplied: highest priority.
+        if target_code is not None:
+            if not target_function_name:
+                raise TaskResolutionError(
+                    "target_code requires target_function_name to identify "
+                    "which callable in the source to use."
+                )
+            fn = _compile_target_in_sandbox(target_code, target_function_name)
381
+ return _spec_from_callable(
382
+ name=target_function_name,
383
+ fn=fn,
384
+ edge_cases=edge_cases,
385
+ fuzz_spec=fuzz_spec,
386
+ source="user",
387
+ )
388
+
389
+ # 2. & 3. Hub-by-name / builtin-by-name. Builtin wins for legacy names
390
+ # (so the trainer's "fibonacci" always means the in-process oracle,
391
+ # never a possibly-modified Hub copy).
392
+ if not target_name:
393
+ raise TaskResolutionError(
394
+ "Either target_name or (target_code + target_function_name) must be set."
395
+ )
396
+
397
+ if target_name in BLACK_BOX_FUNCTIONS:
398
+ return BLACK_BOX_FUNCTIONS[target_name]
399
+
400
+ if self._hub is not None:
401
+ hub_specs = self._hub.specs()
402
+ if target_name in hub_specs:
403
+ return hub_specs[target_name]
404
+
405
+ available = self.list_known_names()
406
+ raise TaskResolutionError(
407
+ f"Unknown target function: {target_name!r}. Available: {sorted(available)[:25]}"
408
+ )
409
+
410
+ # --- Listing -----------------------------------------------------------
411
+
412
+ def list_known_names(self) -> List[str]:
413
+ names = set(BLACK_BOX_FUNCTIONS)
414
+ if self._hub is not None:
415
+ try:
416
+ names.update(self._hub.specs())
417
+ except Exception: # noqa: BLE001 -- best effort
418
+ pass
419
+ return sorted(names)
420
+
421
+ def list_builtin(self) -> List[Dict[str, Any]]:
422
+ return [
423
+ {
424
+ "name": s.name,
425
+ "signature": s.signature,
426
+ "description": s.description,
427
+ "difficulty": s.difficulty,
428
+ "edge_case_count": len(s.edge_cases or []),
429
+ "source": "builtin",
430
+ }
431
+ for s in BLACK_BOX_FUNCTIONS.values()
432
+ ]
433
+
434
+ def list_hub(self) -> List[Dict[str, Any]]:
435
+ if self._hub is None:
436
+ return []
437
+ out = []
438
+ for s in self._hub.specs().values():
439
+ # Don't shadow builtins in the Hub list (avoids surprising the
440
+ # caller with a "fibonacci@hub" entry that's never used).
441
+ if s.name in BLACK_BOX_FUNCTIONS:
442
+ continue
443
+ out.append(
444
+ {
445
+ "name": s.name,
446
+ "signature": s.signature,
447
+ "description": s.description,
448
+ "difficulty": s.difficulty,
449
+ "edge_case_count": len(s.edge_cases or []),
450
+ "source": "hub",
451
+ }
452
+ )
453
+ return out
454
+
455
+ def list_all(self) -> List[Dict[str, Any]]:
456
+ return self.list_builtin() + self.list_hub()
457
+
458
+ # --- Diagnostics -------------------------------------------------------
459
+
460
+ def hub_status(self) -> Dict[str, Any]:
461
+ if self._hub is None:
462
+ return {"enabled": False}
463
+ return {
464
+ "enabled": True,
465
+ "dataset_id": self.hub_dataset_id,
466
+ "loaded": self._hub.loaded,
467
+ "task_count": len(self._hub.specs()) if self._hub.loaded else None,
468
+ "error": self._hub.load_error,
469
+ }
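
The three-tier resolution order above (caller-supplied source first, then the builtin registry, then the Hub catalog) can be sketched standalone. This is a simplified mock, not the real module: `FunctionSpec`, the restricted sandbox, and Hub loading are replaced with plain dicts and a bare `exec`, and the function names here are illustrative only.

```python
# Standalone sketch of the TaskCatalog resolution order (simplified mock).
from typing import Any, Callable, Dict, Optional

# Builtins must shadow Hub rows of the same name, so legacy task names stay stable.
BUILTIN: Dict[str, Callable[..., Any]] = {"fibonacci": lambda n: "builtin"}
HUB: Dict[str, Callable[..., Any]] = {"fibonacci": lambda n: "hub", "rot13": lambda s: "hub"}


def resolve(
    target_name: Optional[str] = None,
    target_code: Optional[str] = None,
    target_function_name: Optional[str] = None,
) -> Callable[..., Any]:
    # 1. Caller-supplied source has the highest priority.
    if target_code is not None:
        if not target_function_name:
            raise ValueError("target_code requires target_function_name")
        ns: Dict[str, Any] = {}
        exec(target_code, ns)  # the real env compiles inside a restricted sandbox
        return ns[target_function_name]
    if not target_name:
        raise ValueError("target_name or target_code required")
    # 2. Builtin registry wins for legacy names.
    if target_name in BUILTIN:
        return BUILTIN[target_name]
    # 3. Finally, fall back to the Hub-backed catalog.
    if target_name in HUB:
        return HUB[target_name]
    raise ValueError(f"Unknown target: {target_name!r}")
```

A name present in both sources resolves to the builtin copy, which is what keeps an in-flight trainer immune to Hub edits.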
opensleuth_env/verifier.py CHANGED
@@ -180,23 +180,36 @@ class _CallTimeout(Exception):
     pass
 
 
-def _call_with_timeout(fn: Callable, arg: Any, timeout_s: float):
+def _call_with_timeout(fn: Callable, arg: Any, timeout_s: float, *, unpack: bool = False):
     def _handler(signum, frame):  # noqa: ARG001
         raise _CallTimeout()
 
     old = signal.signal(signal.SIGALRM, _handler)
     signal.setitimer(signal.ITIMER_REAL, timeout_s)
     try:
+        if unpack:
+            if not isinstance(arg, tuple):
+                # Defensive: a multi-param target should always receive a
+                # tuple, but if the agent's probe input_repr happens to
+                # parse to a single value, treat it as a 1-tuple so we get
+                # a clear TypeError rather than a confusing call shape.
+                arg = (arg,)
+            return fn(*arg)
         return fn(arg)
     finally:
         signal.setitimer(signal.ITIMER_REAL, 0)
         signal.signal(signal.SIGALRM, old)
 
 
-def _safe_call(fn: Callable, arg: Any, timeout_s: float):
-    """Returns (kind, value): kind in {'val', 'err', 'timeout'}."""
+def _safe_call(fn: Callable, arg: Any, timeout_s: float, *, unpack: bool = False):
+    """Returns (kind, value): kind in {'val', 'err', 'timeout'}.
+
+    When ``unpack`` is True the input ``arg`` is expected to be an args
+    tuple and ``fn`` is invoked as ``fn(*arg)``. This is how multi-parameter
+    auto-fuzzer-driven targets are scored.
+    """
     try:
-        return ("val", _call_with_timeout(fn, arg, timeout_s))
+        return ("val", _call_with_timeout(fn, arg, timeout_s, unpack=unpack))
     except _CallTimeout:
         return ("timeout", f"timed out after {timeout_s}s")
     except Exception as e:  # noqa: BLE001
@@ -270,13 +283,14 @@ def _looks_like_reference_import(code: str) -> bool:
 
 def verify_submission(
     submitted_code: str,
-    target_function: Callable[[Any], Any],
+    target_function: Callable[..., Any],
     fuzz_inputs: List[Any],
     *,
     target_name: Optional[str] = None,
     define_timeout_s: float = 5.0,
     call_timeout_s: float = 1.0,
     edge_inputs: Optional[List[Any]] = None,
+    unpack_args: bool = False,
 ) -> VerificationResult:
     """Score ``submitted_code`` against ``target_function`` over the supplied
     ``fuzz_inputs`` (random regime) and ``edge_inputs`` (must-pass regime).
@@ -324,8 +338,8 @@ def verify_submission(
 
     def _score(inputs: List[Any], category: str) -> None:
         for inp in inputs:
-            ref = _safe_call(target_function, inp, call_timeout_s)
-            sub = _safe_call(submitted_fn, inp, call_timeout_s)
+            ref = _safe_call(target_function, inp, call_timeout_s, unpack=unpack_args)
+            sub = _safe_call(submitted_fn, inp, call_timeout_s, unpack=unpack_args)
             sub_results.append(sub)
             ref_results.append(ref)
             if _outputs_equivalent(ref, sub):
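
The `unpack` calling convention in the verifier diff can be demonstrated in isolation. This is a minimal sketch, assuming only what the diff shows: timeouts and the real `_call_with_timeout` plumbing are omitted, and `safe_call`/`clamp` here are illustrative stand-ins.

```python
# Minimal sketch of the unpack_args calling convention (no timeout handling).
from typing import Any, Callable, Tuple


def safe_call(fn: Callable[..., Any], arg: Any, *, unpack: bool = False) -> Tuple[str, Any]:
    """Return ('val', result) or ('err', message), mirroring _safe_call's shape."""
    try:
        if unpack:
            if not isinstance(arg, tuple):
                arg = (arg,)  # defensive 1-tuple promotion, as in the diff
            return ("val", fn(*arg))  # multi-parameter target: fn(*args_tuple)
        return ("val", fn(arg))  # legacy single-argument target: fn(arg)
    except Exception as e:  # noqa: BLE001
        return ("err", f"{type(e).__name__}: {e}")


def clamp(x: int, lo: int, hi: int) -> int:
    """Example multi-parameter target; fuzz inputs for it are 3-tuples."""
    return max(lo, min(hi, x))
```

A wrong-arity tuple still produces a clean `('err', 'TypeError: …')` record rather than crashing the scoring loop, which is the behavior the defensive promotion in the diff is after.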
requirements.txt CHANGED
@@ -1,3 +1,8 @@
 fastapi==0.115.6
 uvicorn[standard]==0.32.1
 pydantic==2.10.3
+# Level 2: Hub-driven task catalog. We swallow load failures at runtime so
+# the env still functions if Hub is offline, but the dependency is required
+# for Hub-backed tasks to be discoverable.
+datasets>=3.0.0
+huggingface_hub>=0.25.0
server.py CHANGED
@@ -15,18 +15,23 @@ from opensleuth_env import (
     StepRequest,
     StepResponse,
     SubmitAction,
+    TaskCatalog,
 )
 
 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
 log = logging.getLogger("opensleuth.server")
 
-app = FastAPI(title="OpenSleuth Env", version="0.3.0")
+app = FastAPI(title="OpenSleuth Env", version="0.4.0")
 env = OpenSleuthEnv()
 
 
 @app.get("/health")
 def health():
-    return {"status": "ok", "episodes_tracked": len(env._states)}  # noqa: SLF001
+    return {
+        "status": "ok",
+        "episodes_tracked": len(env._states),  # noqa: SLF001
+        "hub": env.catalog.hub_status(),
+    }
 
 
 @app.get("/functions")
@@ -36,6 +41,10 @@ def list_functions(
         description="Optional filter: easy / medium / hard. Used by the trainer for curriculum scheduling.",
     ),
 ):
+    # NOTE -- backwards compatibility: this endpoint deliberately keeps the
+    # exact v0.3 shape (just the 9 builtin functions, with the original
+    # field set), because the in-flight trainer queries it. The new "source"
+    # field is additive. Open-ended / Hub tasks are exposed via /tasks.
     items = []
     for s in BLACK_BOX_FUNCTIONS.values():
         if difficulty is not None and getattr(s, "difficulty", None) != difficulty:
@@ -47,15 +56,65 @@
                 "description": s.description,
                 "difficulty": getattr(s, "difficulty", None),
                 "edge_case_count": len(getattr(s, "edge_cases", []) or []),
+                "source": "builtin",
             }
         )
     return {"functions": items}
 
 
+@app.get("/tasks")
+def list_tasks(
+    source: str = Query(
+        "all",
+        description="Filter by source: 'builtin', 'hub', or 'all' (default).",
+    ),
+    difficulty: Optional[str] = Query(None, description="Optional curriculum filter."),
+):
+    src = source.lower()
+    if src == "builtin":
+        tasks = env.catalog.list_builtin()
+    elif src == "hub":
+        tasks = env.catalog.list_hub()
+    elif src == "all":
+        tasks = env.catalog.list_all()
+    else:
+        raise HTTPException(
+            status_code=400, detail="source must be one of: builtin, hub, all"
+        )
+    if difficulty is not None:
+        tasks = [t for t in tasks if t.get("difficulty") == difficulty]
+    return {
+        "tasks": tasks,
+        "count": len(tasks),
+        "hub": env.catalog.hub_status(),
+    }
+
+
 @app.post("/reset")
 def reset(req: ResetRequest):
+    # Validation: legacy callers pass only target_name; open-ended callers
+    # pass target_code + target_function_name. At least one of those paths
+    # must be populated.
+    if not req.target_name and not req.target_code:
+        raise HTTPException(
+            status_code=400,
+            detail="Either 'target_name' or ('target_code' + 'target_function_name') must be set.",
+        )
+    if req.target_code and not req.target_function_name:
+        raise HTTPException(
+            status_code=400,
+            detail="'target_function_name' is required when 'target_code' is provided.",
+        )
     try:
-        obs = env.reset(target_name=req.target_name, seed=req.seed, max_steps=req.max_steps)
+        obs = env.reset(
+            target_name=req.target_name,
+            seed=req.seed,
+            max_steps=req.max_steps,
+            target_code=req.target_code,
+            target_function_name=req.target_function_name,
+            edge_cases=req.edge_cases,
+            fuzz_spec=req.fuzz_spec,
+        )
     except ValueError as e:
         raise HTTPException(status_code=400, detail=str(e)) from e
     return obs
@@ -77,8 +136,6 @@ def get_state(episode_id: str):
     return state
 
 
-# Convenience: a flat /step that does reset+step in one call is occasionally
-# useful for shell-style debugging.
 @app.post("/probe_once")
 def probe_once(target_name: str, input_repr: str):
     obs = env.reset(target_name=target_name)
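
For reference, an open-ended `/reset` payload under the extended v0.4 shape can be built like this. The field names come from the diff; the base URL, the `double` target, and the exact `fuzz_spec` contents are hypothetical, and the HTTP call itself is only shown in a comment.

```python
# Sketch of an extended /reset payload for the open-ended (target_code) path.
import json

payload = {
    # Open-ended path: supply source + the callable's name instead of target_name.
    "target_code": "def double(x):\n    return 2 * x\n",
    "target_function_name": "double",
    "seed": 0,
    # Optional: must-pass inputs and a type-driven fuzz spec for the auto-fuzzer.
    "edge_cases": [0, -1],
    "fuzz_spec": {"x": {"type": "int", "low": -100, "high": 100}},
}
body = json.dumps(payload)
# e.g. requests.post("http://localhost:8000/reset", data=body)  -- not run here
```

Legacy callers are unchanged: a body with just `{"target_name": "fibonacci", "seed": 0}` still takes the v0.3 path.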