Level 2 open-ended env: auto-fuzzer + TaskCatalog + Hub-driven catalog + extended /reset
Adds `opensleuth_env/auto_fuzzer.py` (a type-driven fuzz-input generator), `opensleuth_env/task_catalog.py` (resolves caller-supplied `target_code`, Hub dataset rows, and the builtin registry in priority order), and `opensleuth_env/scripts/bootstrap_tasks_dataset.py` (an idempotent push of the 9 builtin + 6 new tasks to `anugrah55/opensleuth-tasks`). Extends `ResetRequest` with `target_code`/`target_function_name`/`edge_cases`/`fuzz_spec`, adds a `GET /tasks` endpoint, and threads `unpack_args` through the env and verifier for multi-arg targets. The 9 builtin functions and the v0.3 `/reset` shape are kept as the safety net so the in-flight trainer keeps working unchanged. 61 unit tests pass.
- README.md +50 -3
- opensleuth_env/__init__.py +7 -0
- opensleuth_env/auto_fuzzer.py +383 -0
- opensleuth_env/black_box.py +11 -1
- opensleuth_env/env.py +61 -12
- opensleuth_env/models.py +35 -1
- opensleuth_env/scripts/__init__.py +0 -0
- opensleuth_env/scripts/bootstrap_tasks_dataset.py +508 -0
- opensleuth_env/task_catalog.py +469 -0
- opensleuth_env/verifier.py +21 -7
- requirements.txt +5 -0
- server.py +62 -5
README.md
CHANGED
````diff
@@ -19,9 +19,10 @@ function by probing it, then submit Python source that replicates it.
 
 | Method | Path | Body | Notes |
 |-------:|---------------|----------------------------------------|----------------------------------------|
-| GET | `/health` | — | Liveness probe. |
-| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of … |
-| POST | `/reset` | … | … |
+| GET | `/health` | — | Liveness probe (also reports Hub-catalog status). |
+| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of the 9 builtin black-boxes (back-compat shape). |
+| GET | `/tasks` | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
+| POST | `/reset` | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied `target_code` wins over `target_name`. |
 | POST | `/step` | `{"episode_id": "...", "action": {...}}` | One agent action. |
 | GET | `/state/{eid}`| — | Inspect the live state of an episode (debug). |
 
@@ -56,6 +57,45 @@ Engineering and Shaping*, arXiv:2408.10215).
 (`+15`). The sandbox additionally **blocks** `__import__`, `open`,
 `eval`, `exec`, `compile`, etc.
 
+### Open-ended tasks (Level 2)
+
+The env resolves a target function from three sources, in priority order:
+
+1. **Caller-supplied** — `POST /reset` with `target_code` + `target_function_name`
+   (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
+   same hardened sandbox the verifier uses for agent submissions; static import
+   of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
+   arbitrary unseen task per rollout without any redeploy.
+
+2. **Hub dataset** — [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
+   Loaded lazily on first `/reset`, cached in-process. Each row has
+   `{name, target_function_name, signature, description, difficulty,
+   source_code, edge_cases_json, fuzz_spec_json}`.
+
+3. **Builtin registry** — the original 9 functions in `black_box.py` are kept
+   as the safety net so the in-flight trainer keeps working unchanged. Builtins
+   *win* by name over Hub copies, so `target_name="fibonacci"` always resolves
+   to the in-process oracle.
+
+#### Adding new tasks
+
+* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
+  `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
+  introspects `inspect.signature` + `typing.get_type_hints`); pass
+  `edge_cases` as a list of Python literal strings and `fuzz_spec` as a
+  per-parameter override map.
+
+* **Persistent**: append a row to the Hub dataset and the env will pick it
+  up on its next process start. The bootstrap script
+  (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent —
+  re-running it overwrites the dataset with the latest builtin + curated
+  rows.
+
+```bash
+# Push the curated 9 + 6 = 15-task seed catalog.
+PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
+```
+
 ### Backwards compatibility
 
 Existing trainer / eval clients only read `info["execution_reward"]`,
@@ -64,6 +104,13 @@ with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
 `matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
 `floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.
 
+`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
+"max_steps": 25}` works exactly as before. The four new optional fields
+(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
+additive. `/functions` returns the same shape as before (with one *additive*
+`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
+endpoint so older clients aren't surprised.
+
 ## Hardware
 
 CPU-only — `cpu-basic` is plenty. Do **not** assign GPU to this Space.
````
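As a concrete illustration of the caller-supplied path, here is a sketch of building a `/reset` payload for a hypothetical `clamp` task. The function, the edge-case strings, and the URL in the comment are all made up for illustration; the field names follow the extended `ResetRequest` documented above, and rendering each multi-arg edge case as a tuple literal string is an assumption.

```python
import json

# Hypothetical Level 2 task: the trainer supplies the oracle source itself,
# so no redeploy is needed.
payload = {
    "target_code": (
        "def clamp(x: int, lo: int, hi: int) -> int:\n"
        "    return max(lo, min(hi, x))\n"
    ),
    "target_function_name": "clamp",
    # Edge cases as Python literal strings (here: one arg-tuple per call).
    "edge_cases": ["(0, 0, 0)", "(5, 1, 3)", "(-7, -3, 3)"],
    # Per-parameter override map for the auto-fuzzer.
    "fuzz_spec": {
        "lo": {"type": "int", "min": -10, "max": 0},
        "hi": {"type": "int", "min": 0, "max": 10},
    },
    "seed": 0,
    "max_steps": 25,
}

body = json.dumps(payload)
# POST it with any HTTP client, e.g.:
#   curl -X POST localhost:7860/reset -H 'Content-Type: application/json' -d "$body"
print(sorted(payload.keys()))
```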
opensleuth_env/__init__.py
CHANGED
```diff
@@ -12,6 +12,8 @@ from .models import (
     StepRequest,
 )
 from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
+from .task_catalog import TaskCatalog, TaskResolutionError, HUB_DATASET_ID
+from .auto_fuzzer import auto_fuzz, make_fuzzer
 
 __all__ = [
     "OpenSleuthEnv",
@@ -25,4 +27,9 @@ __all__ = [
     "StepRequest",
    "BLACK_BOX_FUNCTIONS",
     "FunctionSpec",
+    "TaskCatalog",
+    "TaskResolutionError",
+    "HUB_DATASET_ID",
+    "auto_fuzz",
+    "make_fuzzer",
 ]
```
opensleuth_env/auto_fuzzer.py
ADDED
```python
"""Generic, type-driven fuzz-input generator for OpenSleuth Level 2.

Given a Python callable annotated with ``typing`` hints, ``auto_fuzz`` produces
``n`` argument tuples that respect the signature so the verifier can score
unannotated *arbitrary* targets without requiring a hand-written fuzzer the
way the 9 builtin BLACK_BOX_FUNCTIONS do.

Each per-type generator mixes a small set of "edge" values (``0``, ``-1``,
``""``, ``None`` for ``Optional``, ...) with random values, weighted ~30/70.
This biases the fuzz batch toward the boundaries that actually distinguish
implementations while still covering the boring middle.

A caller-supplied ``fuzz_spec: dict`` overrides the type-based generation on
a per-parameter basis, e.g.::

    auto_fuzz(my_fn, n=20, fuzz_spec={"n": {"type": "int", "min": 1, "max": 90}})

Returned shape: ``List[tuple]`` -- one tuple per fuzz input, with one element
per (positional) parameter of ``fn``. Even for unary ``fn`` we return tuples
so the catalog wrapper has a single, uniform calling convention.
"""

from __future__ import annotations

import inspect
import random
import string
import typing
from typing import Any, Callable, Dict, List, Optional, Tuple, Union, get_args, get_origin


# Probability that a per-type generator emits an "edge" value (0, "", None,
# ...) instead of a random sample. Kept small enough that the boring middle
# still gets coverage but high enough that the edge cases reliably appear.
EDGE_PROB = 0.30


# Per-type edge pools. These are used by the ``_g_*`` helpers below.
_INT_EDGES = (0, 1, -1, 2, -2, 10, -10, 100, -100)
_FLOAT_EDGES = (0.0, 1.0, -1.0, 0.5, -0.5, 1e-9, -1e-9, 100.0)
_STR_EDGES = ("", "a", "ab", "Hello", " ", "0", "abc def")
_BYTES_EDGES = (b"", b"a", b"ab", b"\x00", b"abc")


# ---------------------------------------------------------------------------
# Per-type generators (do not assume any param-name dispatch).
# ---------------------------------------------------------------------------


def _maybe_edge(rng: random.Random, edges: tuple, random_fn: Callable[[], Any]) -> Any:
    if edges and rng.random() < EDGE_PROB:
        return rng.choice(edges)
    return random_fn()


def _g_int(rng: random.Random, *, lo: int = -100, hi: int = 100) -> int:
    # Filter the edge pool by [lo, hi] so a caller-supplied fuzz_spec
    # ``{"type": "int", "min": 1, "max": 5}`` never emits ``-100``.
    edges = tuple(v for v in _INT_EDGES if lo <= v <= hi) or (lo,)
    return _maybe_edge(rng, edges, lambda: rng.randint(lo, hi))


def _g_float(rng: random.Random, *, lo: float = -100.0, hi: float = 100.0) -> float:
    edges = tuple(v for v in _FLOAT_EDGES if lo <= v <= hi) or (lo,)
    return _maybe_edge(rng, edges, lambda: rng.uniform(lo, hi))


def _g_bool(rng: random.Random) -> bool:
    return bool(rng.getrandbits(1))


def _g_str(rng: random.Random, *, max_len: int = 12, alphabet: Optional[str] = None) -> str:
    alpha = alphabet or (string.ascii_letters + string.digits)

    def _rand():
        return "".join(rng.choices(alpha, k=rng.randint(0, max_len)))

    if alphabet is not None:
        # When the caller restricts the alphabet, our generic edge pool
        # ("Hello", " ", ...) would violate it. Build a deterministic
        # alphabet-respecting edge set instead.
        custom_edges = ("",)
        if alphabet:
            custom_edges = ("", alphabet[0], alphabet[0] * min(max_len, 2))
        return _maybe_edge(rng, custom_edges, _rand)
    return _maybe_edge(rng, _STR_EDGES, _rand)


def _g_bytes(rng: random.Random, *, max_len: int = 8) -> bytes:
    def _rand():
        return bytes(rng.randint(0, 255) for _ in range(rng.randint(0, max_len)))

    return _maybe_edge(rng, _BYTES_EDGES, _rand)


def _g_list(rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6) -> list:
    if rng.random() < EDGE_PROB / 2:
        return []
    return [elem_gen() for _ in range(rng.randint(0, max_len))]


def _g_tuple_homogeneous(
    rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6
) -> tuple:
    return tuple(_g_list(rng, elem_gen, max_len=max_len))


def _g_tuple_heterogeneous(rng: random.Random, elem_gens: List[Callable[[], Any]]) -> tuple:
    return tuple(g() for g in elem_gens)


def _g_set(rng: random.Random, elem_gen: Callable[[], Any], *, max_len: int = 6) -> set:
    if rng.random() < EDGE_PROB / 2:
        return set()
    return {elem_gen() for _ in range(rng.randint(0, max_len))}


def _g_dict(
    rng: random.Random,
    key_gen: Callable[[], Any],
    val_gen: Callable[[], Any],
    *,
    max_len: int = 5,
) -> dict:
    if rng.random() < EDGE_PROB / 2:
        return {}
    return {key_gen(): val_gen() for _ in range(rng.randint(0, max_len))}


# ---------------------------------------------------------------------------
# Type -> generator dispatch.
# ---------------------------------------------------------------------------


def _is_optional(tp: Any) -> bool:
    """``Optional[X]`` is ``Union[X, None]`` under the hood."""
    if get_origin(tp) is Union:
        return type(None) in get_args(tp)
    return False


def _strip_optional(tp: Any) -> Any:
    """Return ``X`` for ``Optional[X]``; for unions with None + multiple, pick
    the first non-None member (we can't satisfy a union in a single call)."""
    if get_origin(tp) is Union:
        non_none = [a for a in get_args(tp) if a is not type(None)]
        if non_none:
            return non_none[0]
    return tp


def _make_generator(tp: Any, rng: random.Random) -> Callable[[], Any]:
    """Return a 0-arg callable that produces one random value of type ``tp``.

    The recursion handles container element types (``list[int]``,
    ``dict[str, list[int]]``, etc).
    """

    if tp is None or tp is type(None):
        return lambda: None

    if _is_optional(tp):
        inner_gen = _make_generator(_strip_optional(tp), rng)

        def _gen_opt():
            if rng.random() < EDGE_PROB:
                return None
            return inner_gen()

        return _gen_opt

    origin = get_origin(tp)

    if origin is typing.Literal:
        choices = list(get_args(tp))
        return lambda: rng.choice(choices)

    if origin is None:
        if tp is int:
            return lambda: _g_int(rng)
        if tp is float:
            return lambda: _g_float(rng)
        if tp is bool:
            return lambda: _g_bool(rng)
        if tp is str:
            return lambda: _g_str(rng)
        if tp is bytes:
            return lambda: _g_bytes(rng)
        if tp is list:
            return lambda: _g_list(rng, lambda: _g_int(rng))
        if tp is tuple:
            return lambda: _g_tuple_homogeneous(rng, lambda: _g_int(rng))
        if tp is set:
            return lambda: _g_set(rng, lambda: _g_int(rng))
        if tp is dict:
            return lambda: _g_dict(rng, lambda: _g_str(rng, max_len=4), lambda: _g_int(rng))
        if tp is typing.Any:
            return lambda: _g_int(rng)
        # Unknown bare type -> fall back to int.
        return lambda: _g_int(rng)

    args = get_args(tp)

    if origin in (list, List):
        elem_t = args[0] if args else int
        elem_gen = _make_generator(elem_t, rng)
        return lambda: _g_list(rng, elem_gen)

    if origin in (set, frozenset):
        elem_t = args[0] if args else int
        elem_gen = _make_generator(elem_t, rng)
        return lambda: _g_set(rng, elem_gen)

    if origin in (tuple, Tuple):
        if not args:
            return lambda: _g_tuple_homogeneous(rng, lambda: _g_int(rng))
        if len(args) == 2 and args[1] is Ellipsis:
            elem_gen = _make_generator(args[0], rng)
            return lambda: _g_tuple_homogeneous(rng, elem_gen)
        elem_gens = [_make_generator(a, rng) for a in args]
        return lambda: _g_tuple_heterogeneous(rng, elem_gens)

    if origin in (dict, Dict):
        key_t = args[0] if args else str
        val_t = args[1] if len(args) > 1 else int
        key_gen = _make_generator(key_t, rng)
        val_gen = _make_generator(val_t, rng)
        return lambda: _g_dict(rng, key_gen, val_gen)

    if origin is Union:
        # Already handled Optional above. For pure unions, pick first member.
        return _make_generator(args[0], rng)

    return lambda: _g_int(rng)


# ---------------------------------------------------------------------------
# fuzz_spec overrides
# ---------------------------------------------------------------------------


def _generator_from_spec(entry: Dict[str, Any], rng: random.Random) -> Callable[[], Any]:
    """Build a generator from a ``fuzz_spec`` entry dict.

    Supported keys (all optional except ``type``):
    - ``type``: one of ``"int" | "float" | "bool" | "str" | "bytes" |
      "list" | "tuple" | "set" | "dict" | "literal" | "any"``
    - ``min``, ``max``: int/float bounds
    - ``max_len``: container/string length cap
    - ``alphabet``: str-only character pool
    - ``elem``: nested ``fuzz_spec`` entry for container elements
    - ``key``, ``value``: nested entries for dict
    - ``elems``: list of nested entries for fixed-arity tuple
    - ``choices``: list of literals to sample from
    - ``optional``: bool; if True, occasionally yields ``None``
    """
    t = entry.get("type", "any")

    def _maybe_optional(gen: Callable[[], Any]) -> Callable[[], Any]:
        if not entry.get("optional"):
            return gen

        def _g():
            if rng.random() < EDGE_PROB:
                return None
            return gen()

        return _g

    if t == "int":
        lo = int(entry.get("min", -100))
        hi = int(entry.get("max", 100))
        return _maybe_optional(lambda: _g_int(rng, lo=lo, hi=hi))
    if t == "float":
        lo = float(entry.get("min", -100.0))
        hi = float(entry.get("max", 100.0))
        return _maybe_optional(lambda: _g_float(rng, lo=lo, hi=hi))
    if t == "bool":
        return _maybe_optional(lambda: _g_bool(rng))
    if t == "str":
        max_len = int(entry.get("max_len", 12))
        alphabet = entry.get("alphabet")
        return _maybe_optional(lambda: _g_str(rng, max_len=max_len, alphabet=alphabet))
    if t == "bytes":
        max_len = int(entry.get("max_len", 8))
        return _maybe_optional(lambda: _g_bytes(rng, max_len=max_len))
    if t == "literal":
        choices = list(entry.get("choices", []))
        if not choices:
            return _maybe_optional(lambda: None)
        return _maybe_optional(lambda: rng.choice(choices))
    if t == "list":
        elem = entry.get("elem", {"type": "int"})
        elem_gen = _generator_from_spec(elem, rng)
        max_len = int(entry.get("max_len", 6))
        return _maybe_optional(lambda: _g_list(rng, elem_gen, max_len=max_len))
    if t == "tuple":
        if "elems" in entry:
            elem_gens = [_generator_from_spec(e, rng) for e in entry["elems"]]
            return _maybe_optional(lambda: _g_tuple_heterogeneous(rng, elem_gens))
        elem = entry.get("elem", {"type": "int"})
        elem_gen = _generator_from_spec(elem, rng)
        max_len = int(entry.get("max_len", 6))
        return _maybe_optional(lambda: _g_tuple_homogeneous(rng, elem_gen, max_len=max_len))
    if t == "set":
        elem = entry.get("elem", {"type": "int"})
        elem_gen = _generator_from_spec(elem, rng)
        max_len = int(entry.get("max_len", 6))
        return _maybe_optional(lambda: _g_set(rng, elem_gen, max_len=max_len))
    if t == "dict":
        key = entry.get("key", {"type": "str", "max_len": 4})
        value = entry.get("value", {"type": "int"})
        key_gen = _generator_from_spec(key, rng)
        val_gen = _generator_from_spec(value, rng)
        max_len = int(entry.get("max_len", 5))
        return _maybe_optional(lambda: _g_dict(rng, key_gen, val_gen, max_len=max_len))
    return _maybe_optional(lambda: _g_int(rng))


# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------


def auto_fuzz(
    fn: Callable[..., Any],
    n: int,
    rng: Optional[random.Random] = None,
    *,
    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
) -> List[tuple]:
    """Produce ``n`` argument tuples for calling ``fn``.

    Each returned element is an ``args`` tuple, intended to be applied as
    ``fn(*args)``. ``fuzz_spec`` is keyed by parameter name and overrides
    the type-based generation per-parameter.
    """
    rng = rng or random.Random()
    fuzz_spec = fuzz_spec or {}

    sig = inspect.signature(fn)
    try:
        hints = typing.get_type_hints(fn)
    except Exception:  # noqa: BLE001 -- bad annotations shouldn't crash fuzzing
        hints = {}

    param_gens: List[Callable[[], Any]] = []
    for pname, param in sig.parameters.items():
        if param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        ):
            # We only fuzz positional / positional-or-keyword params.
            continue
        if pname in fuzz_spec:
            param_gens.append(_generator_from_spec(fuzz_spec[pname], rng))
            continue
        annot = hints.get(pname, param.annotation)
        if annot is inspect.Parameter.empty:
            param_gens.append(lambda r=rng: _g_int(r))
        else:
            param_gens.append(_make_generator(annot, rng))

    return [tuple(g() for g in param_gens) for _ in range(n)]


def make_fuzzer(
    fn: Callable[..., Any],
    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Callable[[random.Random, int], List[tuple]]:
    """Adapt ``auto_fuzz`` to the ``FunctionSpec.fuzzer`` signature
    (``(rng, n) -> list``)."""

    def _fuzzer(rng: random.Random, n: int) -> List[tuple]:
        return auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)

    return _fuzzer
```
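The introspection step the module's docstring leans on can be seen in isolation. This standalone sketch (the `scale` target is made up for illustration) shows exactly what `inspect.signature` + `typing.get_type_hints` hand to the per-type generators before any values are drawn:

```python
import inspect
from typing import List, Optional, get_type_hints

# A sample multi-arg, type-annotated target of the kind auto_fuzz handles.
def scale(xs: List[int], factor: float, tag: Optional[str] = None) -> List[float]:
    return [x * factor for x in xs]

# The same introspection auto_fuzz performs: resolve hints, then keep only
# positional / positional-or-keyword parameters.
sig = inspect.signature(scale)
hints = get_type_hints(scale)

params = [
    (name, hints.get(name))
    for name, p in sig.parameters.items()
    if p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                  inspect.Parameter.POSITIONAL_OR_KEYWORD)
]
print(params)  # each (name, type) pair picks one per-type generator
```

Note that `Optional[str]` survives intact, which is what lets the generator occasionally emit `None` for `tag`.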
opensleuth_env/black_box.py
CHANGED
```diff
@@ -180,7 +180,7 @@ def _fuzz_prime_int(rng: random.Random, n: int) -> List[int]:
 @dataclass(frozen=True)
 class FunctionSpec:
     name: str
-    fn: Callable[…
+    fn: Callable[..., Any]
     signature: str
     description: str
     fuzzer: Callable[[random.Random, int], list]
@@ -189,6 +189,16 @@ class FunctionSpec:
     # fuzz batch. They are scored as their own category ("edge") so the
     # verifier can report stratified pass-rates back to the trainer.
     edge_cases: List[Any] = field(default_factory=list)
+    # Calling convention. When False (the default, used by all 9 builtins),
+    # ``fn(arg)`` is invoked with a single positional argument -- whatever
+    # the fuzzer produced. When True (used by the auto-fuzzer-generated
+    # specs for multi-parameter target functions), each fuzz input is a
+    # *tuple of args* and is unpacked: ``fn(*args)``.
+    unpack_args: bool = False
+    # Provenance: where this spec came from. Useful for /tasks?source=...
+    # Defaults to "builtin" for backwards compatibility with the original
+    # 9 hand-written specs.
+    source: str = "builtin"
 
 
 BLACK_BOX_FUNCTIONS: Dict[str, FunctionSpec] = {
```
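The `unpack_args` convention boils down to a single dispatch branch at the call site. A minimal sketch, where `apply_spec` is a hypothetical stand-in for the env/verifier call path (not the actual helper name):

```python
def apply_spec(fn, fuzz_input, unpack_args: bool):
    if unpack_args:
        # Multi-parameter target: the fuzz input is a tuple of args.
        return fn(*fuzz_input)
    # Builtin convention: one positional argument, whatever the fuzzer made.
    return fn(fuzz_input)

square = lambda x: x * x    # unary, builtin-style target
add = lambda a, b: a + b    # multi-arg, Level 2 target

print(apply_spec(square, 3, unpack_args=False))   # -> 9
print(apply_spec(add, (2, 5), unpack_args=True))  # -> 7
```

Keeping `False` as the default is what lets the 9 existing specs stay untouched.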
opensleuth_env/env.py
CHANGED
|
@@ -38,7 +38,7 @@ from __future__ import annotations
|
|
| 38 |
import ast
|
| 39 |
import logging
|
| 40 |
import uuid
|
| 41 |
-
from typing import Any, Tuple
|
| 42 |
|
| 43 |
from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
|
| 44 |
from .models import (
|
|
@@ -50,6 +50,7 @@ from .models import (
|
|
| 50 |
StepResponse,
|
| 51 |
SubmitAction,
|
| 52 |
)
|
|
|
|
| 53 |
from .verifier import generate_fuzz_inputs, get_edge_inputs, verify_submission
|
| 54 |
|
| 55 |
log = logging.getLogger("opensleuth.env")
|
|
@@ -121,39 +122,78 @@ def _bucket_of(x: Any) -> str:
|
|
| 121 |
class OpenSleuthEnv:
|
| 122 |
"""Multi-episode environment registry."""
|
| 123 |
|
| 124 |
-
def __init__(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 125 |
self._states: dict[str, State] = {}
|
| 126 |
self._configs: dict[str, dict] = {}
|
|
|
|
|
|
|
|
|
|
|
|
|
| 127 |
self.fuzz_count = fuzz_count
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 128 |
|
| 129 |
# --- Lifecycle ---------------------------------------------------------
|
| 130 |
|
| 131 |
-
def reset(
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
)
|
| 137 |
-
|
|
|
|
| 138 |
episode_id = uuid.uuid4().hex
|
| 139 |
self._states[episode_id] = State(
|
| 140 |
episode_id=episode_id,
|
| 141 |
-
target_function_name=
|
| 142 |
seed=seed,
|
| 143 |
)
|
| 144 |
self._configs[episode_id] = {"max_steps": max_steps}
|
|
|
|
| 145 |
return self._build_observation(episode_id, spec, last_error="")
|
| 146 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 147 |
def step(self, episode_id: str, action: Action) -> StepResponse:
|
| 148 |
state = self._states.get(episode_id)
|
| 149 |
if state is None:
|
| 150 |
raise KeyError(f"Unknown episode_id {episode_id!r}. Did you /reset first?")
|
| 151 |
if state.done:
|
| 152 |
-
spec =
|
| 153 |
obs = self._build_observation(episode_id, spec, last_error="Episode already terminated.")
|
| 154 |
return StepResponse(observation=obs, reward=0.0, done=True, info={"reason": "already_done"})
|
| 155 |
|
| 156 |
-
spec =
|
| 157 |
state.steps_taken += 1
|
| 158 |
max_steps = self._configs[episode_id]["max_steps"]
|
| 159 |
|
|
@@ -205,7 +245,15 @@ class OpenSleuthEnv:
|
|
| 205 |
intrinsic = 0.0
|
| 206 |
last_error = ""
|
| 207 |
try:
|
| 208 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
output_repr = repr(output)
|
| 210 |
state.probe_history.append(
|
| 211 |
ProbeRecord(
|
|
@@ -255,6 +303,7 @@ class OpenSleuthEnv:
|
|
| 255 |
fuzz_inputs,
|
| 256 |
target_name=spec.name,
|
| 257 |
edge_inputs=edge_inputs,
|
|
|
|
| 258 |
)
|
| 259 |
|
| 260 |
total = (
|
|
|
|
import ast
import logging
import uuid
+from typing import Any, Optional, Tuple

from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
from .models import (

    StepResponse,
    SubmitAction,
)
+from .task_catalog import TaskCatalog, TaskResolutionError
from .verifier import generate_fuzz_inputs, get_edge_inputs, verify_submission

log = logging.getLogger("opensleuth.env")

class OpenSleuthEnv:
    """Multi-episode environment registry."""

+    def __init__(
+        self,
+        fuzz_count: int = 100,
+        catalog: Optional["TaskCatalog"] = None,
+    ) -> None:
        self._states: dict[str, State] = {}
        self._configs: dict[str, dict] = {}
+        # Per-episode resolved spec. We cache it here (rather than looking it
+        # up by name on every step from BLACK_BOX_FUNCTIONS) because
+        # caller-supplied / Hub-loaded specs aren't in BLACK_BOX_FUNCTIONS.
+        self._episode_specs: dict[str, FunctionSpec] = {}
        self.fuzz_count = fuzz_count
+        self._catalog = catalog or TaskCatalog()
+
+    @property
+    def catalog(self) -> "TaskCatalog":
+        return self._catalog

    # --- Lifecycle ---------------------------------------------------------

+    def reset(
+        self,
+        target_name: Optional[str] = None,
+        seed: int = 0,
+        max_steps: int = 25,
+        *,
+        target_code: Optional[str] = None,
+        target_function_name: Optional[str] = None,
+        edge_cases: Optional[list] = None,
+        fuzz_spec: Optional[dict] = None,
+    ) -> Observation:
+        # Backwards-compat: legacy callers pass ``target_name="fibonacci"``
+        # only. The catalog handles that path identically to before.
+        try:
+            spec = self._catalog.resolve(
+                target_name=target_name,
+                target_code=target_code,
+                target_function_name=target_function_name,
+                edge_cases=edge_cases,
+                fuzz_spec=fuzz_spec,
            )
+        except TaskResolutionError as e:
+            raise ValueError(str(e)) from e
        episode_id = uuid.uuid4().hex
        self._states[episode_id] = State(
            episode_id=episode_id,
+            target_function_name=spec.name,
            seed=seed,
        )
        self._configs[episode_id] = {"max_steps": max_steps}
+        self._episode_specs[episode_id] = spec
        return self._build_observation(episode_id, spec, last_error="")

+    def _spec_for(self, state: State) -> FunctionSpec:
+        spec = self._episode_specs.get(state.episode_id)
+        if spec is not None:
+            return spec
+        # Legacy fallback: if an episode was created before we started
+        # caching specs (or via a code path that bypassed reset), look up
+        # by name in the builtin registry.
+        return BLACK_BOX_FUNCTIONS[state.target_function_name]
+
    def step(self, episode_id: str, action: Action) -> StepResponse:
        state = self._states.get(episode_id)
        if state is None:
            raise KeyError(f"Unknown episode_id {episode_id!r}. Did you /reset first?")
        if state.done:
+            spec = self._spec_for(state)
            obs = self._build_observation(episode_id, spec, last_error="Episode already terminated.")
            return StepResponse(observation=obs, reward=0.0, done=True, info={"reason": "already_done"})

+        spec = self._spec_for(state)
        state.steps_taken += 1
        max_steps = self._configs[episode_id]["max_steps"]

@@ -205,7 +245,15 @@ class OpenSleuthEnv:
        intrinsic = 0.0
        last_error = ""
        try:
+            if spec.unpack_args:
+                if not isinstance(parsed, tuple):
+                    raise TypeError(
+                        f"Multi-parameter target {spec.name!r} expects a tuple "
+                        f"of args, got {type(parsed).__name__}."
+                    )
+                output = spec.fn(*parsed)
+            else:
+                output = spec.fn(parsed)
            output_repr = repr(output)
            state.probe_history.append(
                ProbeRecord(

@@ -255,6 +303,7 @@ class OpenSleuthEnv:
            fuzz_inputs,
            target_name=spec.name,
            edge_inputs=edge_inputs,
+            unpack_args=spec.unpack_args,
        )

        total = (
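The probe dispatch added above (branch on `spec.unpack_args`) can be exercised standalone. The sketch below uses a hypothetical `MiniSpec` stand-in for `FunctionSpec` (only the fields the dispatch touches), not the real class:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical stand-in for FunctionSpec: just the callable plus the
# unpack_args flag the env threads through to the verifier.
@dataclass
class MiniSpec:
    name: str
    fn: Callable
    unpack_args: bool

def call_target(spec: MiniSpec, parsed: Any):
    # Mirrors the env's dispatch: multi-parameter targets receive the
    # parsed literal splatted as *args, single-parameter targets get it
    # as one value.
    if spec.unpack_args:
        if not isinstance(parsed, tuple):
            raise TypeError(
                f"Multi-parameter target {spec.name!r} expects a tuple "
                f"of args, got {type(parsed).__name__}."
            )
        return spec.fn(*parsed)
    return spec.fn(parsed)

single = MiniSpec("double", lambda x: 2 * x, unpack_args=False)
multi = MiniSpec("add", lambda a, b: a + b, unpack_args=True)

print(call_target(single, 21))     # 42
print(call_target(multi, (1, 2)))  # 3
```

A list probe against a multi-arg target raises `TypeError` rather than silently splatting, which is why the env's error message names the received type.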
opensleuth_env/models.py
CHANGED
|
@@ -91,9 +91,43 @@ class State(BaseModel):


class ResetRequest(BaseModel):
+    """Reset payload.
+
+    The original (v0.3) shape ``{"target_name": "fibonacci", "seed": 0,
+    "max_steps": 25}`` still works exactly as before -- the four new fields
+    below are all optional and additive so the in-flight trainer doesn't
+    have to change.
+
+    Open-ended (Level 2) targets are specified by passing ``target_code``
+    + ``target_function_name`` (and optionally ``edge_cases`` and
+    ``fuzz_spec``), which is then resolved via the TaskCatalog using the
+    same hardened sandbox the verifier uses for agent submissions.
+    """
+
+    target_name: Optional[str] = None
    seed: int = 0
    max_steps: int = 25
+    # --- Level 2 open-ended fields (additive, default-None) ---
+    target_code: Optional[str] = Field(
+        default=None,
+        description="Python source defining a black-box callable. When set, "
+        "overrides target_name (caller-supplied beats Hub beats builtin).",
+    )
+    target_function_name: Optional[str] = Field(
+        default=None,
+        description="Name of the callable inside target_code to use as the "
+        "oracle. Required when target_code is set.",
+    )
+    edge_cases: Optional[List[str]] = Field(
+        default=None,
+        description="Optional list of must-pass probe inputs as Python "
+        "literal strings (e.g. ['0', '\"\"', '([1,2,3], 2)']).",
+    )
+    fuzz_spec: Optional[dict] = Field(
+        default=None,
+        description="Optional auto-fuzzer override map keyed by parameter "
+        "name, e.g. {'n': {'type': 'int', 'min': 1, 'max': 90}}.",
+    )


class StepRequest(BaseModel):
|
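A Level 2 `/reset` payload built against the extended `ResetRequest` shape might look like the sketch below. The `clamp` oracle is a made-up illustration, not one of the shipped tasks, and the `fuzz_spec` field names follow the grammar used by the dataset rows (`type` / `min` / `max` / `elems`):

```python
import json

# Hypothetical Level 2 reset payload: target_code + target_function_name
# define the oracle; edge_cases and fuzz_spec are optional extras.
payload = {
    "seed": 0,
    "max_steps": 25,
    "target_code": (
        "def clamp(triple):\n"
        "    lo, x, hi = triple\n"
        "    return max(lo, min(x, hi))\n"
    ),
    "target_function_name": "clamp",
    "edge_cases": ["(0, 5, 10)", "(0, -3, 10)", "(0, 99, 10)"],
    "fuzz_spec": {
        "triple": {
            "type": "tuple",
            "elems": [{"type": "int", "min": -100, "max": 100}] * 3,
        }
    },
}

body = json.dumps(payload)  # e.g. requests.post(f"{base}/reset", json=payload)
print(body[:72])
```

Because `target_code` is set, the catalog's caller-supplied path wins over any Hub or builtin task named `clamp`.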
opensleuth_env/scripts/__init__.py
ADDED
|
File without changes
|
opensleuth_env/scripts/bootstrap_tasks_dataset.py
ADDED
|
@@ -0,0 +1,508 @@
"""Bootstrap / refresh the OpenSleuth Hub task catalog.

Idempotently creates ``anugrah55/opensleuth-tasks`` and pushes:

* The 9 builtin BLACK_BOX_FUNCTIONS as rows (so the dataset is non-empty
  for testing and so the trainer's curriculum has parity with the
  in-process oracle), and
* 6 brand-new tasks (``roman_to_int``, ``levenshtein_distance``,
  ``flatten_list``, ``merge_sorted``, ``run_length_encode``,
  ``binary_search``) that aren't in BLACK_BOX_FUNCTIONS, exercising
  multi-arg and unannotated cases the auto-fuzzer must handle.

Each row is::

    {
        "name": str,
        "target_function_name": str,   # which fn inside source_code
        "signature": str,
        "description": str,
        "difficulty": "easy"|"medium"|"hard",
        "source_code": str,            # standalone Python; NO oracle imports
        "edge_cases_json": str,        # JSON list of literal-repr strings
        "fuzz_spec_json": str,         # JSON dict or "null"
    }

Run::

    cd env && PYTHONPATH=. ../.venv/bin/python -m opensleuth_env.scripts.bootstrap_tasks_dataset
"""

from __future__ import annotations

import argparse
import json
import logging
import sys
from typing import Any, Dict, List, Optional

from opensleuth_env.black_box import BLACK_BOX_FUNCTIONS

log = logging.getLogger("opensleuth.bootstrap")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")

DATASET_ID = "anugrah55/opensleuth-tasks"


# ---------------------------------------------------------------------------
# Oracle source code for the 9 builtins (self-contained -- no opensleuth_*
# imports, so the catalog's static reject filter accepts them).
# ---------------------------------------------------------------------------


_BUILTIN_SOURCE: Dict[str, Dict[str, Any]] = {
    "fibonacci": {
        "target_function_name": "fibonacci",
        "source_code": (
            "def fibonacci(n):\n"
            "    if not isinstance(n, int) or isinstance(n, bool) or n <= 0 or n > 90:\n"
            "        raise ValueError('Input must be a positive integer <= 90.')\n"
            "    a, b = 0, 1\n"
            "    for _ in range(n - 1):\n"
            "        a, b = b, a + b\n"
            "    return b if n > 0 else a\n"
        ),
        "edge_cases": ["1", "2", "3", "10", "89", "90"],
        "fuzz_spec": {"n": {"type": "int", "min": 1, "max": 90}},
    },
    "reverse_string": {
        "target_function_name": "reverse_string",
        "source_code": (
            "def reverse_string(s):\n"
            "    if not isinstance(s, str):\n"
            "        raise TypeError('Input must be a string.')\n"
            "    return s[::-1]\n"
        ),
        "edge_cases": ['""', '"a"', '"ab"', '"racecar"', '"Hello, World!"'],
        "fuzz_spec": {"s": {"type": "str", "max_len": 12}},
    },
    "is_palindrome": {
        "target_function_name": "is_palindrome",
        "source_code": (
            "def is_palindrome(s):\n"
            "    if not isinstance(s, str):\n"
            "        raise TypeError('Input must be a string.')\n"
            "    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())\n"
            "    return cleaned == cleaned[::-1]\n"
        ),
        "edge_cases": [
            '""', '"a"', '"ab"', '"abba"',
            "\"A man, a plan, a canal: Panama\"", '"Hello"',
        ],
        "fuzz_spec": {"s": {"type": "str", "max_len": 12}},
    },
    "digit_sum": {
        "target_function_name": "digit_sum",
        "source_code": (
            "def digit_sum(n):\n"
            "    if not isinstance(n, int) or isinstance(n, bool):\n"
            "        raise TypeError('Input must be int.')\n"
            "    if n < 0:\n"
            "        raise ValueError('Input must be non-negative.')\n"
            "    return sum(int(c) for c in str(n))\n"
        ),
        "edge_cases": ["0", "1", "9", "10", "99", "100", "9999"],
        "fuzz_spec": {"n": {"type": "int", "min": 0, "max": 10000}},
    },
    "count_vowels": {
        "target_function_name": "count_vowels",
        "source_code": (
            "def count_vowels(s):\n"
            "    if not isinstance(s, str):\n"
            "        raise TypeError('Input must be a string.')\n"
            "    return sum(1 for c in s.lower() if c in 'aeiou')\n"
        ),
        "edge_cases": ['""', '"bcd"', '"AEIOU"', '"Hello, World!"', '"aaaaa"'],
        "fuzz_spec": {"s": {"type": "str", "max_len": 16}},
    },
    "gcd": {
        "target_function_name": "gcd",
        "source_code": (
            "def gcd(pair):\n"
            "    if not isinstance(pair, (list, tuple)) or len(pair) != 2:\n"
            "        raise TypeError('Input must be a 2-element list or tuple.')\n"
            "    a, b = pair\n"
            "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in (a, b)):\n"
            "        raise TypeError('Both elements must be int.')\n"
            "    if a < 0 or b < 0:\n"
            "        raise ValueError('Both elements must be non-negative.')\n"
            "    while b:\n"
            "        a, b = b, a % b\n"
            "    return a\n"
        ),
        "edge_cases": ["(0, 0)", "(0, 7)", "(12, 18)", "(17, 13)", "(100, 75)"],
        "fuzz_spec": {
            "pair": {
                "type": "tuple",
                "elems": [{"type": "int", "min": 0, "max": 1000}, {"type": "int", "min": 0, "max": 1000}],
            }
        },
    },
    "sort_unique": {
        "target_function_name": "sort_unique",
        "source_code": (
            "def sort_unique(xs):\n"
            "    if not isinstance(xs, list):\n"
            "        raise TypeError('Input must be a list.')\n"
            "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in xs):\n"
            "        raise TypeError('All elements must be int.')\n"
            "    return sorted(set(xs))\n"
        ),
        "edge_cases": ["[]", "[1]", "[1, 1, 1]", "[3, 1, 2]", "[-5, 5, 0, -5, 5]"],
        "fuzz_spec": {"xs": {"type": "list", "elem": {"type": "int", "min": -50, "max": 50}, "max_len": 8}},
    },
    "caesar_cipher": {
        "target_function_name": "caesar_cipher",
        "source_code": (
            "def caesar_cipher(s):\n"
            "    if not isinstance(s, str):\n"
            "        raise TypeError('Input must be a string.')\n"
            "    out = []\n"
            "    for ch in s:\n"
            "        if 'a' <= ch <= 'z':\n"
            "            out.append(chr((ord(ch) - ord('a') + 3) % 26 + ord('a')))\n"
            "        else:\n"
            "            out.append(ch)\n"
            "    return ''.join(out)\n"
        ),
        "edge_cases": ['""', '"abc"', '"xyz"', '"Hello, World!"', '"ABC"', '"hello world"'],
        "fuzz_spec": {"s": {"type": "str", "max_len": 16}},
    },
    "is_prime": {
        "target_function_name": "is_prime",
        "source_code": (
            "def is_prime(n):\n"
            "    if not isinstance(n, int) or isinstance(n, bool):\n"
            "        raise TypeError('Input must be int.')\n"
            "    if n < 2:\n"
            "        return False\n"
            "    if n < 4:\n"
            "        return True\n"
            "    if n % 2 == 0:\n"
            "        return False\n"
            "    i = 3\n"
            "    while i * i <= n:\n"
            "        if n % i == 0:\n"
            "            return False\n"
            "        i += 2\n"
            "    return True\n"
        ),
        "edge_cases": ["0", "1", "2", "3", "4", "17", "25", "97", "100"],
        "fuzz_spec": {"n": {"type": "int", "min": 0, "max": 200}},
    },
}


# ---------------------------------------------------------------------------
# Six new tasks. These exercise auto-fuzzer features the builtins didn't:
#   * multi-arg signatures (binary_search, merge_sorted, levenshtein_distance)
#   * Optional / Literal hint coverage (run_length_encode -> list[tuple[str, int]])
#   * unannotated containers (flatten_list)
# ---------------------------------------------------------------------------


_NEW_TASK_ROWS: List[Dict[str, Any]] = [
    {
        "name": "roman_to_int",
        "target_function_name": "roman_to_int",
        "signature": "roman_to_int(s: str) -> int",
        "description": (
            "Parse a roman numeral string into its integer value. "
            "Raises ValueError for non-roman characters. Subtraction "
            "rules (IV=4, IX=9, XL=40, ...) are honoured. Empty -> 0."
        ),
        "difficulty": "medium",
        "source_code": (
            "def roman_to_int(s: str) -> int:\n"
            "    if not isinstance(s, str):\n"
            "        raise TypeError('input must be str')\n"
            "    table = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}\n"
            "    total = 0\n"
            "    prev = 0\n"
            "    for ch in reversed(s.upper()):\n"
            "        if ch not in table:\n"
            "            raise ValueError(f'invalid roman numeral character: {ch!r}')\n"
            "        v = table[ch]\n"
            "        if v < prev:\n"
            "            total -= v\n"
            "        else:\n"
            "            total += v\n"
            "        prev = v\n"
            "    return total\n"
        ),
        "edge_cases": ['""', '"I"', '"IV"', '"IX"', '"LVIII"', '"MCMXCIV"', '"MMXXIV"'],
        "fuzz_spec": {"s": {"type": "str", "alphabet": "IVXLCDM", "max_len": 8}},
    },
    {
        "name": "levenshtein_distance",
        "target_function_name": "levenshtein_distance",
        "signature": "levenshtein_distance(a: str, b: str) -> int",
        "description": (
            "Classic edit distance between two strings: minimum number of "
            "single-character insertions, deletions, or substitutions to "
            "transform a into b. Both arguments must be str."
        ),
        "difficulty": "hard",
        "source_code": (
            "def levenshtein_distance(a: str, b: str) -> int:\n"
            "    if not isinstance(a, str) or not isinstance(b, str):\n"
            "        raise TypeError('both arguments must be str')\n"
            "    if a == b:\n"
            "        return 0\n"
            "    if not a:\n"
            "        return len(b)\n"
            "    if not b:\n"
            "        return len(a)\n"
            "    prev = list(range(len(b) + 1))\n"
            "    for i, ca in enumerate(a, 1):\n"
            "        cur = [i] + [0] * len(b)\n"
            "        for j, cb in enumerate(b, 1):\n"
            "            ins = cur[j-1] + 1\n"
            "            dele = prev[j] + 1\n"
            "            sub = prev[j-1] + (ca != cb)\n"
            "            cur[j] = min(ins, dele, sub)\n"
            "        prev = cur\n"
            "    return prev[-1]\n"
        ),
        "edge_cases": [
            '("", "")', '("a", "")', '("", "a")', '("kitten", "sitting")',
            '("flaw", "lawn")', '("abc", "abc")',
        ],
        "fuzz_spec": {
            "a": {"type": "str", "alphabet": "abc", "max_len": 6},
            "b": {"type": "str", "alphabet": "abc", "max_len": 6},
        },
    },
    {
        "name": "flatten_list",
        "target_function_name": "flatten_list",
        "signature": "flatten_list(xs: list) -> list",
        "description": (
            "Recursively flatten a nested list of arbitrary depth. Tuples "
            "are also flattened; non-list/tuple atoms (ints, strs, ...) "
            "pass through unchanged."
        ),
        "difficulty": "medium",
        "source_code": (
            "def flatten_list(xs):\n"
            "    if not isinstance(xs, (list, tuple)):\n"
            "        raise TypeError('input must be list or tuple')\n"
            "    out = []\n"
            "    stack = list(xs)\n"
            "    # iterative DFS to avoid recursion limits on adversarial input\n"
            "    rev = []\n"
            "    rev.extend(reversed(stack))\n"
            "    while rev:\n"
            "        x = rev.pop()\n"
            "        if isinstance(x, (list, tuple)):\n"
            "            for y in reversed(x):\n"
            "                rev.append(y)\n"
            "        else:\n"
            "            out.append(x)\n"
            "    return out\n"
        ),
        "edge_cases": [
            "[]", "[1]", "[[1, 2], [3, 4]]",
            "[1, [2, [3, [4, [5]]]]]", "[[], [], 1]",
        ],
        "fuzz_spec": {
            "xs": {
                "type": "list",
                "elem": {"type": "int", "min": -10, "max": 10},
                "max_len": 6,
            }
        },
    },
    {
        "name": "merge_sorted",
        "target_function_name": "merge_sorted",
        "signature": "merge_sorted(a: list[int], b: list[int]) -> list[int]",
        "description": (
            "Merge two pre-sorted lists of ints into a single sorted list. "
            "Both arguments must be lists; elements must be ints (bools "
            "rejected). The classic merge step of merge-sort."
        ),
        "difficulty": "medium",
        "source_code": (
            "def merge_sorted(a, b):\n"
            "    if not isinstance(a, list) or not isinstance(b, list):\n"
            "        raise TypeError('both arguments must be list')\n"
            "    for x in (*a, *b):\n"
            "        if not isinstance(x, int) or isinstance(x, bool):\n"
            "            raise TypeError('elements must be int')\n"
            "    out = []\n"
            "    i = j = 0\n"
            "    while i < len(a) and j < len(b):\n"
            "        if a[i] <= b[j]:\n"
            "            out.append(a[i]); i += 1\n"
            "        else:\n"
            "            out.append(b[j]); j += 1\n"
            "    out.extend(a[i:])\n"
            "    out.extend(b[j:])\n"
            "    return out\n"
        ),
        "edge_cases": [
            "([], [])", "([1, 2, 3], [])", "([], [1, 2, 3])",
            "([1, 3, 5], [2, 4, 6])", "([1, 1], [1, 1])",
        ],
        "fuzz_spec": {
            "a": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 5},
            "b": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 5},
        },
    },
    {
        "name": "run_length_encode",
        "target_function_name": "run_length_encode",
        "signature": "run_length_encode(s: str) -> list[tuple[str, int]]",
        "description": (
            "Run-length encoding: returns a list of (character, count) "
            "tuples for each run of identical characters in s. Empty "
            "input yields an empty list."
        ),
        "difficulty": "easy",
        "source_code": (
            "def run_length_encode(s):\n"
            "    if not isinstance(s, str):\n"
            "        raise TypeError('input must be str')\n"
            "    if not s:\n"
            "        return []\n"
            "    out = []\n"
            "    cur = s[0]\n"
            "    n = 1\n"
            "    for ch in s[1:]:\n"
            "        if ch == cur:\n"
            "            n += 1\n"
            "        else:\n"
            "            out.append((cur, n))\n"
            "            cur = ch\n"
            "            n = 1\n"
            "    out.append((cur, n))\n"
            "    return out\n"
        ),
        "edge_cases": ['""', '"a"', '"aa"', '"abc"', '"aaabbbccc"', '"aaaaaaaaaa"'],
        "fuzz_spec": {"s": {"type": "str", "alphabet": "ab", "max_len": 12}},
    },
    {
        "name": "binary_search",
        "target_function_name": "binary_search",
        "signature": "binary_search(arr: list[int], target: int) -> int",
        "description": (
            "Return the index of target in the sorted ascending list arr, "
            "or -1 if not present. arr must be a list of ints; target "
            "must be int. The list is assumed sorted."
        ),
        "difficulty": "medium",
        "source_code": (
            "def binary_search(arr, target):\n"
            "    if not isinstance(arr, list):\n"
            "        raise TypeError('arr must be list')\n"
            "    if not isinstance(target, int) or isinstance(target, bool):\n"
            "        raise TypeError('target must be int')\n"
            "    lo, hi = 0, len(arr) - 1\n"
            "    while lo <= hi:\n"
            "        mid = (lo + hi) // 2\n"
            "        v = arr[mid]\n"
            "        if v == target:\n"
            "            return mid\n"
            "        if v < target:\n"
            "            lo = mid + 1\n"
            "        else:\n"
            "            hi = mid - 1\n"
            "    return -1\n"
        ),
        "edge_cases": [
            "([], 3)", "([1], 1)", "([1], 2)",
            "([1, 2, 3, 4, 5], 3)", "([1, 2, 3, 4, 5], 0)",
            "([1, 2, 3, 4, 5], 6)",
        ],
        "fuzz_spec": {
            "arr": {"type": "list", "elem": {"type": "int", "min": -20, "max": 20}, "max_len": 8},
            "target": {"type": "int", "min": -20, "max": 20},
        },
    },
]


def _builtin_to_row(name: str) -> Dict[str, Any]:
    spec = BLACK_BOX_FUNCTIONS[name]
    src_meta = _BUILTIN_SOURCE[name]
    return {
        "name": name,
        "target_function_name": src_meta["target_function_name"],
        "signature": spec.signature,
        "description": spec.description,
        "difficulty": spec.difficulty,
        "source_code": src_meta["source_code"],
        "edge_cases_json": json.dumps(src_meta["edge_cases"]),
        "fuzz_spec_json": json.dumps(src_meta["fuzz_spec"]),
    }


def _new_task_to_row(meta: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "name": meta["name"],
        "target_function_name": meta["target_function_name"],
        "signature": meta["signature"],
        "description": meta["description"],
        "difficulty": meta["difficulty"],
        "source_code": meta["source_code"],
        "edge_cases_json": json.dumps(meta["edge_cases"]),
        "fuzz_spec_json": json.dumps(meta["fuzz_spec"]),
    }


def build_rows() -> List[Dict[str, Any]]:
    rows: List[Dict[str, Any]] = []
    for name in BLACK_BOX_FUNCTIONS:
        rows.append(_builtin_to_row(name))
    for meta in _NEW_TASK_ROWS:
        rows.append(_new_task_to_row(meta))
    return rows


def push_to_hub(rows: List[Dict[str, Any]], dataset_id: str, *, private: bool = False) -> str:
    """Push the row list to ``dataset_id`` (overwriting any prior contents).
    Returns the hub URL.
    """
    from datasets import Dataset
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(
        repo_id=dataset_id,
        repo_type="dataset",
        exist_ok=True,
        private=private,
    )

    ds = Dataset.from_list(rows)
    log.info("pushing %d row(s) to %s", len(rows), dataset_id)
    ds.push_to_hub(dataset_id, split="train", private=private)
    return f"https://huggingface.co/datasets/{dataset_id}"


def main(argv: Optional[List[str]] = None) -> int:
    p = argparse.ArgumentParser(description="Bootstrap the OpenSleuth Hub task catalog.")
    p.add_argument("--dataset-id", default=DATASET_ID)
    p.add_argument("--dry-run", action="store_true", help="Print row count, don't push.")
    p.add_argument("--private", action="store_true", help="Create as private dataset.")
    args = p.parse_args(argv)

    rows = build_rows()
    log.info("built %d row(s) (%d builtin + %d new)",
             len(rows), len(BLACK_BOX_FUNCTIONS), len(_NEW_TASK_ROWS))
    for r in rows:
        log.info("  %-22s difficulty=%-6s edges=%-2d",
                 r["name"], r["difficulty"], len(json.loads(r["edge_cases_json"])))

    if args.dry_run:
        log.info("--dry-run: not pushing")
        return 0

    url = push_to_hub(rows, args.dataset_id, private=args.private)
    log.info("dataset live at %s", url)
    return 0


if __name__ == "__main__":
    sys.exit(main())
|
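The `fuzz_spec` entries in the rows above follow a small type-driven grammar (`int` / `str` / `list` / `tuple` with bounds). A minimal interpreter sketch for that grammar is shown below; this is an illustration of how such a spec can drive input generation, not the real generator in `opensleuth_env/auto_fuzzer.py`:

```python
import random
import string

def gen_value(spec: dict, rng: random.Random):
    """Generate one value from a fuzz_spec entry (the subset of the
    grammar used by the dataset rows: int / str / list / tuple)."""
    t = spec["type"]
    if t == "int":
        # inclusive bounds, with loose defaults when min/max are omitted
        return rng.randint(spec.get("min", -100), spec.get("max", 100))
    if t == "str":
        alphabet = spec.get("alphabet", string.ascii_lowercase)
        n = rng.randint(0, spec.get("max_len", 8))
        return "".join(rng.choice(alphabet) for _ in range(n))
    if t == "list":
        n = rng.randint(0, spec.get("max_len", 6))
        return [gen_value(spec["elem"], rng) for _ in range(n)]
    if t == "tuple":
        # fixed arity: one sub-spec per element
        return tuple(gen_value(e, rng) for e in spec["elems"])
    raise ValueError(f"unknown fuzz type: {t!r}")

rng = random.Random(0)
pair = gen_value(
    {"type": "tuple",
     "elems": [{"type": "int", "min": 0, "max": 1000}] * 2},
    rng,
)
print(pair)  # a deterministic 2-tuple of ints in [0, 1000]
```

Seeding the `random.Random` instance makes the generated probe set reproducible per episode, which matters when fuzz outcomes feed reward computation.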
opensleuth_env/task_catalog.py
ADDED
|
@@ -0,0 +1,469 @@
"""TaskCatalog: resolve a target function from one of three sources.

OpenSleuth Level 2 makes the env open-ended. Where v0.3 only knew about the
9 hand-written ``BLACK_BOX_FUNCTIONS``, the catalog accepts targets from:

1. **Caller-supplied** -- per-/reset payload, the most specific source.
   The caller passes ``target_code`` + ``target_function_name`` (and
   optionally ``edge_cases`` / ``fuzz_spec``) and we compile the source
   in the same hardened sandbox the verifier uses for submissions.

2. **Hub dataset** -- ``anugrah55/opensleuth-tasks`` on Hugging Face Hub.
   Each row carries ``{name, signature, description, difficulty,
   source_code, edge_cases_json, fuzz_spec_json}``. Loaded lazily on
   first reset and cached in-process.

3. **Builtin registry** -- the original 9 ``BLACK_BOX_FUNCTIONS``. Kept
   as the safety net so the in-flight trainer keeps working unchanged.

Resolution priority: caller-supplied wins, then builtin by name, then Hub.
This makes "trainer asks for fibonacci" still resolve to the builtin
fibonacci even when a Hub copy exists, *unless* the caller explicitly
overrides via ``target_code``.

Sandbox: caller-supplied / Hub source code is executed via the same
``_make_safe_globals`` whitelist as agent submissions (no ``__import__``,
``open``, ``eval``, ...). On top we statically reject any source that
imports ``opensleuth_*`` to prevent oracle-cheesing.
"""

from __future__ import annotations

import ast
import inspect
import json
import logging
import threading
from typing import Any, Callable, Dict, List, Optional

from .auto_fuzzer import auto_fuzz, make_fuzzer
from .black_box import BLACK_BOX_FUNCTIONS, FunctionSpec
from .verifier import _make_safe_globals  # reuse the hardened sandbox

log = logging.getLogger("opensleuth.task_catalog")

HUB_DATASET_ID = "anugrah55/opensleuth-tasks"


class TaskResolutionError(ValueError):
    """Raised when a /reset request can't be turned into a FunctionSpec."""

# ---------------------------------------------------------------------------
# Caller / Hub source-code compilation
# ---------------------------------------------------------------------------


_FORBIDDEN_PREFIXES = ("opensleuth", "opensleuth_env")


def _statically_reject_oracle_import(code: str) -> Optional[str]:
    """Return an error string if the source statically imports the env's own
    oracle module (which would let the agent / Hub author cheese the
    verifier). The hardened sandbox already blocks ``__import__``, but we
    fail fast and surface a clear error.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return f"target_code is not valid Python: {e}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if any(alias.name.startswith(p) for p in _FORBIDDEN_PREFIXES):
                    return (
                        f"target_code is not allowed to import {alias.name!r} "
                        "(oracle import)."
                    )
        elif isinstance(node, ast.ImportFrom):
            mod = node.module or ""
            if any(mod.startswith(p) for p in _FORBIDDEN_PREFIXES):
                return (
                    f"target_code is not allowed to import from {mod!r} "
                    "(oracle import)."
                )
    return None


def _compile_target_in_sandbox(code: str, function_name: str) -> Callable[..., Any]:
    """Compile ``code`` in the same restricted globals the verifier uses for
    agent submissions, then return the named callable. Raises
    ``TaskResolutionError`` on any problem so /reset can return a clean 400.
    """
    err = _statically_reject_oracle_import(code)
    if err:
        raise TaskResolutionError(err)
    safe_globals = _make_safe_globals()
    local_scope: Dict[str, Any] = {}
    try:
        exec(code, safe_globals, local_scope)
    except Exception as e:  # noqa: BLE001
        raise TaskResolutionError(
            f"target_code raised at definition time: {type(e).__name__}: {e}"
        ) from e
    fn = local_scope.get(function_name) or safe_globals.get(function_name)
    if not callable(fn):
        raise TaskResolutionError(
            f"target_code does not define a callable named {function_name!r}."
        )
    return fn


def _arity_of(fn: Callable[..., Any]) -> int:
    """Number of positional / positional-or-keyword params on ``fn``."""
    try:
        sig = inspect.signature(fn)
    except (TypeError, ValueError):
        return 1
    n = 0
    for p in sig.parameters.values():
        if p.kind in (
            inspect.Parameter.POSITIONAL_ONLY,
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
        ):
            n += 1
    return max(n, 1)


def _signature_string(fn: Callable[..., Any], name: str) -> str:
    try:
        sig = inspect.signature(fn)
        return f"{name}{sig}"
    except (TypeError, ValueError):
        return f"{name}(...)"


def _description_of(fn: Callable[..., Any]) -> str:
    return inspect.getdoc(fn) or ""


def _parse_edge_cases(edge_cases: Optional[List[Any]]) -> List[Any]:
    """Edge cases arrive as a list of strings (Python literal reprs) when
    coming from the API or from the Hub's ``edge_cases_json`` column. Each
    string is parsed via ``ast.literal_eval``. Already-parsed values
    (e.g. ints from the bootstrap script) are passed through unchanged.
    """
    if not edge_cases:
        return []
    parsed: List[Any] = []
    for raw in edge_cases:
        if isinstance(raw, str):
            try:
                parsed.append(ast.literal_eval(raw))
            except (ValueError, SyntaxError) as e:
                raise TaskResolutionError(
                    f"edge_cases entry {raw!r} is not a Python literal: {e}"
                ) from e
        else:
            parsed.append(raw)
    return parsed


def _flatten_unary_edges(arity: int, edges: List[Any]) -> List[Any]:
    """For unary fns we accept either ``[5, 10]`` or ``[(5,), (10,)]`` and
    normalise to flat values; for multi-arg fns we require tuples and pass
    them through."""
    if arity == 1:
        out = []
        for e in edges:
            if isinstance(e, tuple) and len(e) == 1:
                out.append(e[0])
            else:
                out.append(e)
        return out
    out = []
    for e in edges:
        if not isinstance(e, tuple):
            raise TaskResolutionError(
                f"edge_cases for a {arity}-arg target must be tuples, "
                f"got {type(e).__name__}: {e!r}"
            )
        out.append(e)
    return out


def _spec_from_callable(
    name: str,
    fn: Callable[..., Any],
    *,
    description: Optional[str] = None,
    signature: Optional[str] = None,
    difficulty: str = "medium",
    edge_cases: Optional[List[Any]] = None,
    fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
    source: str = "user",
) -> FunctionSpec:
    """Build a FunctionSpec from a Python callable + optional metadata.

    Wraps ``auto_fuzz`` for the fuzzer. The arity is auto-detected from
    ``inspect.signature`` so ``unpack_args`` is set correctly: unary fns
    behave like the existing builtins (single-arg call), N-arg fns flow
    through the tuple-unpacking path in env / verifier.
    """
    arity = _arity_of(fn)
    unpack = arity > 1

    parsed_edges = _flatten_unary_edges(arity, _parse_edge_cases(edge_cases))

    if unpack:
        # Catalog-level adapter: keep the public spec.fn one-arg-style for
        # the *unary* path so existing call sites work, but for multi-arg
        # the env/verifier respect ``unpack_args`` and call ``fn(*args)``.
        # We still store the original here -- env._handle_probe and
        # verify_submission do the unpacking.
        ref_fn: Callable[..., Any] = fn

        def _fuzzer(rng, n):
            return auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)

    else:
        ref_fn = fn

        def _unary_fuzzer(rng, n):
            tuples = auto_fuzz(fn, n, rng, fuzz_spec=fuzz_spec)
            return [t[0] if isinstance(t, tuple) and len(t) == 1 else t for t in tuples]

        _fuzzer = _unary_fuzzer

    return FunctionSpec(
        name=name,
        fn=ref_fn,
        signature=signature or _signature_string(fn, name),
        description=description or _description_of(fn),
        fuzzer=_fuzzer,
        difficulty=difficulty,
        edge_cases=parsed_edges,
        unpack_args=unpack,
        source=source,
    )


# ---------------------------------------------------------------------------
# Hub loader
# ---------------------------------------------------------------------------


class _HubCache:
    """Lazily loads the Hub dataset into ``{name: FunctionSpec}``. Thread-
    safe initialisation; subsequent reads are lock-free."""

    def __init__(self, dataset_id: str):
        self.dataset_id = dataset_id
        self._lock = threading.Lock()
        self._loaded: bool = False
        self._specs: Dict[str, FunctionSpec] = {}
        self._raw_rows: List[Dict[str, Any]] = []
        self._load_error: Optional[str] = None

    @property
    def loaded(self) -> bool:
        return self._loaded

    @property
    def load_error(self) -> Optional[str]:
        return self._load_error

    def _row_to_spec(self, row: Dict[str, Any]) -> Optional[FunctionSpec]:
        name = row.get("name")
        code = row.get("source_code")
        if not name or not code:
            return None
        fn_name = row.get("target_function_name") or name
        try:
            fn = _compile_target_in_sandbox(code, fn_name)
        except TaskResolutionError as e:
            log.warning("hub task %r failed to compile: %s", name, e)
            return None
        edge_cases_raw = row.get("edge_cases_json") or "[]"
        fuzz_spec_raw = row.get("fuzz_spec_json") or "null"
        try:
            edge_cases = json.loads(edge_cases_raw) if isinstance(edge_cases_raw, str) else edge_cases_raw
        except json.JSONDecodeError:
            edge_cases = []
        try:
            fuzz_spec = json.loads(fuzz_spec_raw) if isinstance(fuzz_spec_raw, str) else fuzz_spec_raw
        except json.JSONDecodeError:
            fuzz_spec = None
        try:
            return _spec_from_callable(
                name=name,
                fn=fn,
                description=row.get("description") or _description_of(fn),
                signature=row.get("signature") or _signature_string(fn, name),
                difficulty=row.get("difficulty") or "medium",
                edge_cases=edge_cases,
                fuzz_spec=fuzz_spec,
                source="hub",
            )
        except TaskResolutionError as e:
            log.warning("hub task %r could not be specced: %s", name, e)
            return None

    def ensure_loaded(self) -> None:
        if self._loaded:
            return
        with self._lock:
            if self._loaded:
                return
            try:
                from datasets import load_dataset  # type: ignore

                ds = load_dataset(self.dataset_id, split="train")
                rows = list(ds)
                specs: Dict[str, FunctionSpec] = {}
                for row in rows:
                    spec = self._row_to_spec(row)
                    if spec is not None:
                        specs[spec.name] = spec
                self._specs = specs
                self._raw_rows = rows
                log.info(
                    "loaded %d task(s) from %s (%d row(s) total)",
                    len(specs),
                    self.dataset_id,
                    len(rows),
                )
            except Exception as e:  # noqa: BLE001
                # Hub unreachable / not yet bootstrapped / offline. We swallow
                # the error so the env keeps working from the builtin
                # registry alone -- this is what lets the trainer keep
                # running even if the Hub goes down mid-rollout.
                self._load_error = f"{type(e).__name__}: {e}"
                log.warning("hub dataset %s unavailable: %s", self.dataset_id, self._load_error)
            finally:
                self._loaded = True

    def specs(self) -> Dict[str, FunctionSpec]:
        self.ensure_loaded()
        return self._specs

    def rows(self) -> List[Dict[str, Any]]:
        self.ensure_loaded()
        return self._raw_rows


# ---------------------------------------------------------------------------
# TaskCatalog
# ---------------------------------------------------------------------------


class TaskCatalog:
    """Resolves /reset payloads to FunctionSpecs from caller / Hub / builtin."""

    def __init__(
        self,
        hub_dataset_id: str = HUB_DATASET_ID,
        *,
        enable_hub: bool = True,
    ) -> None:
        self.hub_dataset_id = hub_dataset_id
        self.enable_hub = enable_hub
        self._hub = _HubCache(hub_dataset_id) if enable_hub else None

    # --- Resolution --------------------------------------------------------

    def resolve(
        self,
        target_name: Optional[str] = None,
        target_code: Optional[str] = None,
        target_function_name: Optional[str] = None,
        edge_cases: Optional[List[Any]] = None,
        fuzz_spec: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> FunctionSpec:
        # 1. Caller-supplied: highest priority.
        if target_code is not None:
            if not target_function_name:
                raise TaskResolutionError(
                    "target_code requires target_function_name to identify "
                    "which callable in the source to use."
                )
            fn = _compile_target_in_sandbox(target_code, target_function_name)
            return _spec_from_callable(
                name=target_function_name,
                fn=fn,
                edge_cases=edge_cases,
                fuzz_spec=fuzz_spec,
                source="user",
            )

        # 2. Builtin-by-name, then 3. Hub-by-name. Builtin wins for legacy
        # names (so the trainer's "fibonacci" always means the in-process
        # oracle, never a possibly-modified Hub copy).
        if not target_name:
            raise TaskResolutionError(
                "Either target_name or (target_code + target_function_name) must be set."
            )

        if target_name in BLACK_BOX_FUNCTIONS:
            return BLACK_BOX_FUNCTIONS[target_name]

        if self._hub is not None:
            hub_specs = self._hub.specs()
            if target_name in hub_specs:
                return hub_specs[target_name]

        available = self.list_known_names()
        raise TaskResolutionError(
            f"Unknown target function: {target_name!r}. Available: {sorted(available)[:25]}"
        )

    # --- Listing -----------------------------------------------------------

    def list_known_names(self) -> List[str]:
        names = set(BLACK_BOX_FUNCTIONS)
        if self._hub is not None:
            try:
                names.update(self._hub.specs())
            except Exception:  # noqa: BLE001 -- best effort
                pass
        return sorted(names)

    def list_builtin(self) -> List[Dict[str, Any]]:
        return [
            {
                "name": s.name,
                "signature": s.signature,
                "description": s.description,
                "difficulty": s.difficulty,
                "edge_case_count": len(s.edge_cases or []),
                "source": "builtin",
            }
            for s in BLACK_BOX_FUNCTIONS.values()
        ]

    def list_hub(self) -> List[Dict[str, Any]]:
        if self._hub is None:
            return []
        out = []
        for s in self._hub.specs().values():
            # Don't shadow builtins in the Hub list (avoids surprising the
            # caller with a "fibonacci@hub" entry that's never used).
            if s.name in BLACK_BOX_FUNCTIONS:
                continue
            out.append(
                {
                    "name": s.name,
                    "signature": s.signature,
                    "description": s.description,
                    "difficulty": s.difficulty,
                    "edge_case_count": len(s.edge_cases or []),
                    "source": "hub",
                }
            )
        return out

    def list_all(self) -> List[Dict[str, Any]]:
        return self.list_builtin() + self.list_hub()

    # --- Diagnostics -------------------------------------------------------

    def hub_status(self) -> Dict[str, Any]:
        if self._hub is None:
            return {"enabled": False}
        return {
            "enabled": True,
            "dataset_id": self.hub_dataset_id,
            "loaded": self._hub.loaded,
            "task_count": len(self._hub.specs()) if self._hub.loaded else None,
            "error": self._hub.load_error,
        }
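The oracle-import guard the module docstring describes reduces to a small AST walk; a self-contained sketch (helper name and shortened prefix list are illustrative, mirroring `_statically_reject_oracle_import`):

```python
import ast

FORBIDDEN = ("opensleuth",)  # illustrative prefix list

def rejects_oracle_import(code: str) -> bool:
    """Return True if ``code`` statically imports a forbidden module."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import opensleuth_env.black_box` and friends
            if any(a.name.startswith(p) for a in node.names for p in FORBIDDEN):
                return True
        elif isinstance(node, ast.ImportFrom):
            # `from opensleuth_env import verifier`; node.module is None
            # for relative imports like `from . import x`, hence the `or ""`
            if any((node.module or "").startswith(p) for p in FORBIDDEN):
                return True
    return False
```

The static pass only catches literal import statements; dynamic tricks are what the sandbox's `__import__` block handles.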
|
opensleuth_env/verifier.py
CHANGED
@@ -180,23 +180,36 @@ class _CallTimeout(Exception):
     pass


-def _call_with_timeout(fn: Callable, arg: Any, timeout_s: float):
+def _call_with_timeout(fn: Callable, arg: Any, timeout_s: float, *, unpack: bool = False):
     def _handler(signum, frame):  # noqa: ARG001
         raise _CallTimeout()

     old = signal.signal(signal.SIGALRM, _handler)
     signal.setitimer(signal.ITIMER_REAL, timeout_s)
     try:
+        if unpack:
+            if not isinstance(arg, tuple):
+                # Defensive: a multi-param target should always receive a
+                # tuple, but if the agent's probe input_repr happens to
+                # parse to a single value, treat it as a 1-tuple so we get
+                # a clear TypeError rather than a confusing call shape.
+                arg = (arg,)
+            return fn(*arg)
         return fn(arg)
     finally:
         signal.setitimer(signal.ITIMER_REAL, 0)
         signal.signal(signal.SIGALRM, old)


-def _safe_call(fn: Callable, arg: Any, timeout_s: float):
-    """Returns (kind, value): kind in {'val', 'err', 'timeout'}.
+def _safe_call(fn: Callable, arg: Any, timeout_s: float, *, unpack: bool = False):
+    """Returns (kind, value): kind in {'val', 'err', 'timeout'}.
+
+    When ``unpack`` is True the input ``arg`` is expected to be an args
+    tuple and ``fn`` is invoked as ``fn(*arg)``. This is how multi-parameter
+    auto-fuzzer-driven targets are scored.
+    """
     try:
-        return ("val", _call_with_timeout(fn, arg, timeout_s))
+        return ("val", _call_with_timeout(fn, arg, timeout_s, unpack=unpack))
     except _CallTimeout:
         return ("timeout", f"timed out after {timeout_s}s")
     except Exception as e:  # noqa: BLE001
@@ -270,13 +283,14 @@ def _looks_like_reference_import(code: str) -> bool:

 def verify_submission(
     submitted_code: str,
-    target_function: Callable[
+    target_function: Callable[..., Any],
     fuzz_inputs: List[Any],
     *,
     target_name: Optional[str] = None,
     define_timeout_s: float = 5.0,
     call_timeout_s: float = 1.0,
     edge_inputs: Optional[List[Any]] = None,
+    unpack_args: bool = False,
 ) -> VerificationResult:
     """Score ``submitted_code`` against ``target_function`` over the supplied
     ``fuzz_inputs`` (random regime) and ``edge_inputs`` (must-pass regime).
@@ -324,8 +338,8 @@ def verify_submission(

     def _score(inputs: List[Any], category: str) -> None:
         for inp in inputs:
-            ref = _safe_call(target_function, inp, call_timeout_s)
-            sub = _safe_call(submitted_fn, inp, call_timeout_s)
+            ref = _safe_call(target_function, inp, call_timeout_s, unpack=unpack_args)
+            sub = _safe_call(submitted_fn, inp, call_timeout_s, unpack=unpack_args)
             sub_results.append(sub)
             ref_results.append(ref)
             if _outputs_equivalent(ref, sub):
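The `unpack_args` convention threaded through env and verifier can be modeled in isolation; a minimal, timeout-free sketch (names hypothetical, arity detection mirroring the catalog's `_arity_of`):

```python
import inspect

def arity_of(fn) -> int:
    """Count positional / positional-or-keyword params, floored at 1."""
    params = inspect.signature(fn).parameters.values()
    n = sum(p.kind in (inspect.Parameter.POSITIONAL_ONLY,
                       inspect.Parameter.POSITIONAL_OR_KEYWORD)
            for p in params)
    return max(n, 1)

def call_target(fn, arg):
    if arity_of(fn) > 1:
        # Multi-arg path: arg is an args tuple, called fn(*arg).
        if not isinstance(arg, tuple):
            arg = (arg,)  # defensive 1-tuple wrap, as in the diff above
        return fn(*arg)
    # Legacy unary path: the v0.3 single-argument call shape.
    return fn(arg)

add = lambda a, b: a + b
neg = lambda x: -x
```

Both the reference oracle and the submitted function are called through the same shape, so a submission only has to match the target's parameter count, not guess a packing convention.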
requirements.txt
CHANGED
@@ -1,3 +1,8 @@
 fastapi==0.115.6
 uvicorn[standard]==0.32.1
 pydantic==2.10.3
+# Level 2: Hub-driven task catalog. We swallow load failures at runtime so
+# the env still functions if Hub is offline, but the dependency is required
+# for Hub-backed tasks to be discoverable.
+datasets>=3.0.0
+huggingface_hub>=0.25.0
server.py
CHANGED
@@ -15,18 +15,23 @@ from opensleuth_env import (
     StepRequest,
     StepResponse,
     SubmitAction,
+    TaskCatalog,
 )

 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
 log = logging.getLogger("opensleuth.server")

-app = FastAPI(title="OpenSleuth Env", version="0.
+app = FastAPI(title="OpenSleuth Env", version="0.4.0")
 env = OpenSleuthEnv()


 @app.get("/health")
 def health():
-    return {
+    return {
+        "status": "ok",
+        "episodes_tracked": len(env._states),  # noqa: SLF001
+        "hub": env.catalog.hub_status(),
+    }


 @app.get("/functions")
@@ -36,6 +41,10 @@ def list_functions(
         description="Optional filter: easy / medium / hard. Used by the trainer for curriculum scheduling.",
     ),
 ):
+    # NOTE -- backwards compatibility: this endpoint deliberately keeps the
+    # exact v0.3 shape (just the 9 builtin functions, with the original
+    # field set), because the in-flight trainer queries it. The new "source"
+    # field is additive. Open-ended / Hub tasks are exposed via /tasks.
     items = []
     for s in BLACK_BOX_FUNCTIONS.values():
         if difficulty is not None and getattr(s, "difficulty", None) != difficulty:
@@ -47,15 +56,65 @@ def list_functions(
                 "description": s.description,
                 "difficulty": getattr(s, "difficulty", None),
                 "edge_case_count": len(getattr(s, "edge_cases", []) or []),
+                "source": "builtin",
             }
         )
     return {"functions": items}


+@app.get("/tasks")
+def list_tasks(
+    source: str = Query(
+        "all",
+        description="Filter by source: 'builtin', 'hub', or 'all' (default).",
+    ),
+    difficulty: Optional[str] = Query(None, description="Optional curriculum filter."),
+):
+    src = source.lower()
+    if src == "builtin":
+        tasks = env.catalog.list_builtin()
+    elif src == "hub":
+        tasks = env.catalog.list_hub()
+    elif src == "all":
+        tasks = env.catalog.list_all()
+    else:
+        raise HTTPException(
+            status_code=400, detail="source must be one of: builtin, hub, all"
+        )
+    if difficulty is not None:
+        tasks = [t for t in tasks if t.get("difficulty") == difficulty]
+    return {
+        "tasks": tasks,
+        "count": len(tasks),
+        "hub": env.catalog.hub_status(),
+    }
+
+
 @app.post("/reset")
 def reset(req: ResetRequest):
+    # Validation: legacy callers pass only target_name; open-ended callers
+    # pass target_code + target_function_name. At least one of those paths
+    # must be populated.
+    if not req.target_name and not req.target_code:
+        raise HTTPException(
+            status_code=400,
+            detail="Either 'target_name' or ('target_code' + 'target_function_name') must be set.",
+        )
+    if req.target_code and not req.target_function_name:
+        raise HTTPException(
+            status_code=400,
+            detail="'target_function_name' is required when 'target_code' is provided.",
+        )
     try:
-        obs = env.reset(
+        obs = env.reset(
+            target_name=req.target_name,
+            seed=req.seed,
+            max_steps=req.max_steps,
+            target_code=req.target_code,
+            target_function_name=req.target_function_name,
+            edge_cases=req.edge_cases,
+            fuzz_spec=req.fuzz_spec,
+        )
     except ValueError as e:
         raise HTTPException(status_code=400, detail=str(e)) from e
     return obs
@@ -77,8 +136,6 @@ def get_state(episode_id: str):
     return state


-# Convenience: a flat /step that does reset+step in one call is occasionally
-# useful for shell-style debugging.
 @app.post("/probe_once")
 def probe_once(target_name: str, input_repr: str):
     obs = env.reset(target_name=target_name)
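A client-side sketch of the extended /reset payload and its validation precedence (the `validate_reset` helper is hypothetical; it simply mirrors the server-side checks in the diff above so a trainer can fail fast before making an HTTP call):

```python
from typing import Optional

# Illustrative open-ended /reset payload: caller-supplied target_code plus
# the function name to extract, with edge cases as Python-literal reprs.
payload = {
    "target_code": "def mystery(a, b):\n    return a * b + 1\n",
    "target_function_name": "mystery",
    "edge_cases": ["(0, 0)", "(1, -1)"],
    "seed": 7,
}

def validate_reset(req: dict) -> Optional[str]:
    """Return an error message, or None if the payload would pass /reset."""
    if not req.get("target_name") and not req.get("target_code"):
        return ("Either 'target_name' or ('target_code' + "
                "'target_function_name') must be set.")
    if req.get("target_code") and not req.get("target_function_name"):
        return "'target_function_name' is required when 'target_code' is provided."
    return None
```

Legacy callers that send only `{"target_name": "fibonacci"}` pass the same checks unchanged, which is the v0.3 compatibility guarantee.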