anugrah55 committed on
Commit
dcc4ca0
·
verified ·
1 Parent(s): 0d49efb

Initial demo: live agent rollouts against OpenSleuth env

README.md CHANGED
@@ -1,12 +1,78 @@
1
  ---
2
- title: Opensleuth Demo
3
- emoji: 📉
4
- colorFrom: gray
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: 6.13.0
8
  app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: OpenSleuth — Live Agent Demo
3
+ emoji: "\U0001F575"
4
+ colorFrom: indigo
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: 4.44.0
8
  app_file: app.py
9
  pinned: false
10
+ license: apache-2.0
11
+ suggested_hardware: cpu-basic
12
+ suggested_storage: small
13
+ short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
14
  ---
15
 
16
+ # OpenSleuth live agent demo
17
+
18
+ Pick a hidden black-box Python function from the OpenSleuth catalog (15
19
+ tasks: easy → hard, mix of builtin and Hub-pushed). Pick an agent backend
20
+ (`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, `trained Qwen 3B
21
+ (LoRA)`). Watch the agent:
22
+
23
+ 1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier
24
+ uses), one at a time, with each `(input → output)` pair streamed live.
25
+ 2. **Submit** a Python replica of the hidden function.
26
+ 3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs +
27
+ the spec's must-pass edge cases, with stratified pass-rates and a
28
+ reward breakdown (execution / edge / complexity / hack penalties /
29
+ perfect bonus).
30
+
31
+ The submitted code is shown syntax-highlighted, and an optional accordion
32
+ runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on
33
+ the selected task.
34
+
35
+ ## Backends
36
+
37
+ | Backend | Source | Notes |
38
+ | --- | --- | --- |
39
+ | `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
40
+ | `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
41
+ | `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
42
+ | `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; if the adapter repo has no weights yet, the demo logs a notice and falls back to the base 3B model. |
43
+
44
+ ## Architecture
45
+
46
+ ```
47
+ [demo Space] ──HTTP──> [env Space]
48
+ │ /tasks, /tasks/{name}/sample_inputs,
49
+ │ /reset, /step (probe + submit)
50
+
51
+ └─ HF model load (lazy, cached): base + optional LoRA on CPU
52
+ ```
53
+
54
+ - The env Space is `anugrah55/opensleuth-env-gemini-cli`.
55
+ - The task catalog is `anugrah55/opensleuth-tasks`.
56
+
57
+ ## CPU-basic notes
58
+
59
+ The demo runs on CPU-basic. First generation per backend cold-loads the
60
+ model (~30–90s for 0.5B). To keep latency bounded:
61
+
62
+ - `MAX_NEW_TOKENS=192`
63
+ - Models are cached in-process across runs (a plain dict keyed by base model and adapter, with no eviction).
64
+ - If the 3B adapter repo has no weights pushed yet, the main panel logs a
65
+ notice and generates with the base 3B model instead of the adapter.
66
+
67
+ Configure with env vars:
68
+
69
+ | Env var | Default |
70
+ | --- | --- |
71
+ | `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
72
+ | `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
73
+ | `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
74
+ | `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
75
+ | `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
76
+ | `MAX_NEW_TOKENS` | `192` |
77
+ | `N_PROBES` | `6` |
78
+ | `HF_TOKEN` | (optional, set as Space secret for gated models) |
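The probe, submit, verify loop described in this README can be tried offline with a small stand-in for the env. The sketch below is only an illustration under stated assumptions: `StubEnv`, its `probe` and `submit` methods, and `hidden_reverse` are hypothetical names, nothing here talks to the real env Space, and the real verifier's reward components (complexity, hack penalties, perfect bonus) are not reproduced. It only counts behavioural matches, including matching exception types.

```python
from typing import Callable, Dict, List, Tuple


class StubEnv:
    """In-memory stand-in for the env Space (hypothetical; no HTTP involved)."""

    def __init__(self, hidden: Callable, fuzz_inputs: List[object]) -> None:
        self.hidden = hidden
        self.fuzz_inputs = fuzz_inputs

    def probe(self, value: object) -> Tuple[str, bool]:
        """Return (output_repr, is_error) for one input, like a probe step."""
        try:
            return repr(self.hidden(value)), False
        except Exception as exc:
            return type(exc).__name__, True

    @staticmethod
    def _outcome(fn: Callable, value: object) -> Tuple[str, object]:
        """Normalise a call into ('ok', result) or ('raises', exception type name)."""
        try:
            return ("ok", fn(value))
        except Exception as exc:
            return ("raises", type(exc).__name__)

    def submit(self, code: str, name: str) -> Dict[str, int]:
        """Define the submitted function and count matches against the hidden one."""
        namespace: Dict[str, object] = {}
        exec(code, namespace)  # acceptable in a local sketch; never exec untrusted code
        candidate = namespace[name]
        matches = sum(
            self._outcome(self.hidden, v) == self._outcome(candidate, v)
            for v in self.fuzz_inputs
        )
        return {"matches": matches, "fuzz_count": len(self.fuzz_inputs)}


def hidden_reverse(s):
    # Hypothetical hidden function, modelled on the reverse_string task.
    if not isinstance(s, str):
        raise TypeError("Input must be a string.")
    return s[::-1]


env = StubEnv(hidden_reverse, ["abc", "", "racecar", 42, None])

# 1. Probe: collect (input_repr, output_repr, is_error) triples.
history = [(repr(x), *env.probe(x)) for x in ["ab", 7]]

# 2. Submit a replica; 3. the stub "verifier" compares behaviour input by input.
submission = (
    "def reverse_string(s):\n"
    "    if not isinstance(s, str):\n"
    "        raise TypeError('Input must be a string.')\n"
    "    return s[::-1]\n"
)
result = env.submit(submission, "reverse_string")
print(history)
print(result)
```

Comparing outcomes as `("ok", value)` or `("raises", type name)` pairs mirrors the requirement that a replica match both return values and exception types.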
__pycache__/app.cpython-313.pyc ADDED
Binary file (33.6 kB).
__pycache__/oracle.cpython-313.pyc ADDED
Binary file (6.88 kB).
app.py ADDED
@@ -0,0 +1,700 @@
1
+ """OpenSleuth live demo Space.
2
+
3
+ A clickable Gradio app that lets a viewer watch the OpenSleuth agent solve
4
+ one of the 15 catalog tasks live: pick a black-box function, pick an agent
5
+ backend, watch the agent probe the env, submit a Python replica, and see
6
+ the verifier reward streamed back in real time.
7
+
8
+ Backends:
9
+ * "oracle" — submit the canonical reference implementation
10
+ * "base Qwen 0.5B" — Qwen/Qwen2.5-0.5B-Instruct, no fine-tuning
11
+ * "trained Qwen 0.5B" — base + GRPO LoRA from anugrah55/opensleuth-qwen2.5-0.5b-grpo
12
+ * "trained Qwen 3B" — base + GRPO LoRA from anugrah55/opensleuth-qwen2.5-3b-grpo-v2
13
+ (gracefully degraded if adapter repo is empty)
14
+
15
+ Network: calls the live env Space at https://anugrah55-opensleuth-env-gemini-cli.hf.space
16
+ for /tasks, /reset, /step (probe + submit), /tasks/{name}/sample_inputs.
17
+
18
+ CPU-basic friendly: model loads are lazy, generations are capped at 192
19
+ new tokens, and we fall back gracefully if a model/adapter is unavailable.
20
+ """
21
+
22
+ from __future__ import annotations
23
+
24
+ import logging
25
+ import os
26
+ import re
27
+ import threading
28
+ import time
29
+ import traceback
30
+ from dataclasses import dataclass
31
+ from typing import Any, Dict, Generator, List, Optional, Tuple
32
+
33
+ import gradio as gr
34
+ import requests
35
+ from huggingface_hub import HfApi
36
+
37
+ from oracle import ORACLE_SOLUTIONS, get_oracle_code
38
+
39
+ # ---------------------------------------------------------------------------
40
+ # Config
41
+ # ---------------------------------------------------------------------------
42
+
43
+ ENV_URL = os.environ.get(
44
+ "OPENSLEUTH_ENV_URL",
45
+ "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
46
+ ).rstrip("/")
47
+
48
+ BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
49
+ ADAPTER_05B_ID = os.environ.get(
50
+ "ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo"
51
+ )
52
+ ADAPTER_3B_ID = os.environ.get(
53
+ "ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2"
54
+ )
55
+ BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
56
+
57
+ MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
58
+ N_PROBES = int(os.environ.get("N_PROBES", "6"))
59
+ HF_TOKEN = os.environ.get("HF_TOKEN")
60
+
61
+ GITHUB_URL = "https://github.com/"
62
+ HUB_DATASET_URL = "https://huggingface.co/datasets/anugrah55/opensleuth-tasks"
63
+ ENV_SPACE_URL = "https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli"
64
+
65
+ SYSTEM_PROMPT = (
66
+ "You are an algorithmic detective. You are given the public signature of a hidden "
67
+ "Python function plus several (input, output) examples observed by probing it. "
68
+ "Your job is to write a Python function that *exactly* reproduces the hidden "
69
+ "function's behavior on all valid inputs. Match its return values AND its "
70
+ "exception types on invalid inputs. Keep your implementation as simple and clean "
71
+ "as possible (it is penalised for being needlessly branchy). Return ONLY the "
72
+ "function definition wrapped in a single ```python ... ``` code block."
73
+ )
74
+
75
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
76
+ log = logging.getLogger("opensleuth.demo")
77
+
78
+
79
+ # ---------------------------------------------------------------------------
80
+ # Env client (thin)
81
+ # ---------------------------------------------------------------------------
82
+
83
+
84
+ class EnvClient:
85
+ def __init__(self, base_url: str, timeout: float = 30.0) -> None:
86
+ self.base_url = base_url.rstrip("/")
87
+ self.timeout = timeout
88
+
89
+ def _get(self, path: str, **params) -> Dict[str, Any]:
90
+ r = requests.get(f"{self.base_url}{path}", params=params or None, timeout=self.timeout)
91
+ r.raise_for_status()
92
+ return r.json()
93
+
94
+ def _post(self, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
95
+ r = requests.post(f"{self.base_url}{path}", json=payload, timeout=self.timeout)
96
+ r.raise_for_status()
97
+ return r.json()
98
+
99
+ def list_tasks(self) -> List[Dict[str, Any]]:
100
+ return self._get("/tasks")["tasks"]
101
+
102
+ def sample_inputs(self, name: str, n: int = 6, seed: int = 0) -> List[str]:
103
+ return list(self._get(f"/tasks/{name}/sample_inputs", n=n, seed=seed)["inputs"])
104
+
105
+ def reset(self, target_name: str, seed: int = 0, max_steps: int = 25) -> Dict[str, Any]:
106
+ return self._post(
107
+ "/reset",
108
+ {"target_name": target_name, "seed": seed, "max_steps": max_steps},
109
+ )
110
+
111
+ def probe(self, episode_id: str, input_repr: str) -> Dict[str, Any]:
112
+ return self._post(
113
+ "/step",
114
+ {
115
+ "episode_id": episode_id,
116
+ "action": {"action_type": "probe", "input_repr": input_repr},
117
+ },
118
+ )
119
+
120
+ def submit(self, episode_id: str, code: str) -> Dict[str, Any]:
121
+ return self._post(
122
+ "/step",
123
+ {
124
+ "episode_id": episode_id,
125
+ "action": {"action_type": "submit", "code": code},
126
+ },
127
+ )
128
+
129
+
130
+ CLIENT = EnvClient(ENV_URL)
131
+
132
+
133
+ def fetch_tasks() -> List[Dict[str, Any]]:
134
+ """Pull the live task catalog. Falls back to a hardcoded list if env is
135
+ unreachable so the dropdown always has something to show."""
136
+ try:
137
+ return CLIENT.list_tasks()
138
+ except Exception as e: # noqa: BLE001
139
+ log.warning("could not fetch /tasks from env (%s); using static fallback", e)
140
+ return [{"name": n, "signature": "", "description": "", "difficulty": "?",
141
+ "edge_case_count": 0, "source": "fallback"}
142
+ for n in sorted(ORACLE_SOLUTIONS)]
143
+
144
+
145
+ # ---------------------------------------------------------------------------
146
+ # Prompt + code extraction (lifted from training/opensleuth_train/prompt.py)
147
+ # ---------------------------------------------------------------------------
148
+
149
+
150
+ _CODE_RE = re.compile(r"```(?:python)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)
151
+
152
+
153
+ def build_prompt(target_name: str, signature: str, probes: List[Tuple[str, str, bool]]) -> str:
154
+ lines = [
155
+ f"## Hidden function: {target_name}",
156
+ "",
157
+ "### Public signature & docstring",
158
+ signature.strip() or "(no signature provided)",
159
+ "",
160
+ "### Observed probes",
161
+ ]
162
+ if not probes:
163
+ lines.append("(none)")
164
+ else:
165
+ for inp, out, is_err in probes:
166
+ tag = "raises" if is_err else "returns"
167
+ lines.append(f"- input={inp} -> {tag} {out}")
168
+ lines += [
169
+ "",
170
+ "### Task",
171
+ f"Write a Python function named `{target_name}` that reproduces the hidden "
172
+ "function's behaviour. Return ONLY the function definition in a single "
173
+ "```python ... ``` code block. Do not add explanations.",
174
+ ]
175
+ return "\n".join(lines)
176
+
177
+
178
+ def extract_code(completion: str) -> str:
179
+ m = _CODE_RE.search(completion)
180
+ if m:
181
+ return m.group(1).strip()
182
+ return completion.strip()
183
+
184
+
185
+ # ---------------------------------------------------------------------------
186
+ # Backend registry
187
+ # ---------------------------------------------------------------------------
188
+
189
+
190
+ @dataclass
191
+ class BackendInfo:
192
+ key: str
193
+ label: str
194
+ kind: str # "oracle" | "hf" (transformers + peft)
195
+ base_model: Optional[str] = None
196
+ adapter: Optional[str] = None
197
+
198
+
199
+ BACKENDS: Dict[str, BackendInfo] = {
200
+ "oracle": BackendInfo(
201
+ key="oracle",
202
+ label="oracle (reference impl)",
203
+ kind="oracle",
204
+ ),
205
+ "base-0.5b": BackendInfo(
206
+ key="base-0.5b",
207
+ label="base Qwen 0.5B (no fine-tune)",
208
+ kind="hf",
209
+ base_model=BASE_MODEL_ID,
210
+ adapter=None,
211
+ ),
212
+ "trained-0.5b": BackendInfo(
213
+ key="trained-0.5b",
214
+ label="trained Qwen 0.5B (GRPO LoRA)",
215
+ kind="hf",
216
+ base_model=BASE_MODEL_ID,
217
+ adapter=ADAPTER_05B_ID,
218
+ ),
219
+ "trained-3b": BackendInfo(
220
+ key="trained-3b",
221
+ label="trained Qwen 3B (GRPO LoRA)",
222
+ kind="hf",
223
+ base_model=BASE_MODEL_3B_ID,
224
+ adapter=ADAPTER_3B_ID,
225
+ ),
226
+ }
227
+
228
+ BACKEND_CHOICES = [(b.label, b.key) for b in BACKENDS.values()]
229
+
230
+
231
+ def _adapter_has_weights(repo_id: str) -> bool:
232
+ """Hub probe: True iff the adapter repo actually contains adapter
233
+ weights. We treat repos with only `.gitattributes` (still training,
234
+ pre-push) as 'not yet trained'."""
235
+ try:
236
+ api = HfApi(token=HF_TOKEN)
237
+ files = api.list_repo_files(repo_id)
238
+ except Exception as e: # noqa: BLE001
239
+ log.warning("adapter availability probe failed for %s: %s", repo_id, e)
240
+ return False
241
+ return any(f.endswith("adapter_model.safetensors") or f.endswith("adapter_model.bin") for f in files)
242
+
243
+
244
+ # ---------------------------------------------------------------------------
245
+ # Lazy HF model cache
246
+ # ---------------------------------------------------------------------------
247
+
248
+
249
+ _MODEL_LOCK = threading.Lock()
250
+ _LOADED: Dict[str, Tuple[Any, Any]] = {} # cache_key -> (tokenizer, model)
251
+
252
+
253
+ def _model_cache_key(base: str, adapter: Optional[str]) -> str:
254
+ return f"{base}::{adapter or '_base_'}"
255
+
256
+
257
+ def _load_hf(base: str, adapter: Optional[str]) -> Tuple[Any, Any]:
258
+ """Load the base model (plus the optional LoRA adapter) on CPU. Cached across calls."""
259
+ key = _model_cache_key(base, adapter)
260
+ with _MODEL_LOCK:
261
+ if key in _LOADED:
262
+ return _LOADED[key]
263
+ log.info("loading HF model base=%s adapter=%s", base, adapter)
264
+ import torch # noqa: WPS433
265
+ from transformers import AutoModelForCausalLM, AutoTokenizer # noqa: WPS433
266
+
267
+ tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True, token=HF_TOKEN)
268
+ model = AutoModelForCausalLM.from_pretrained(
269
+ base,
270
+ torch_dtype=torch.float32,
271
+ device_map={"": "cpu"},
272
+ trust_remote_code=True,
273
+ low_cpu_mem_usage=True,
274
+ token=HF_TOKEN,
275
+ )
276
+ if adapter:
277
+ from peft import PeftModel # noqa: WPS433
278
+
279
+ model = PeftModel.from_pretrained(model, adapter, token=HF_TOKEN)
280
+ model.eval()
281
+ _LOADED[key] = (tok, model)
282
+ log.info("loaded %s (%d models cached)", key, len(_LOADED))
283
+ return tok, model
284
+
285
+
286
+ def _generate_hf(base: str, adapter: Optional[str], prompt: str) -> str:
287
+ tok, model = _load_hf(base, adapter)
288
+ import torch # noqa: WPS433
289
+
290
+ messages = [
291
+ {"role": "system", "content": SYSTEM_PROMPT},
292
+ {"role": "user", "content": prompt},
293
+ ]
294
+ text = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
295
+ inputs = tok(text, return_tensors="pt")
296
+ with torch.no_grad():
297
+ out = model.generate(
298
+ **inputs,
299
+ max_new_tokens=MAX_NEW_TOKENS,
300
+ do_sample=False,
301
+ temperature=1.0,
302
+ pad_token_id=tok.eos_token_id,
303
+ )
304
+ return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
305
+
306
+
307
+ # ---------------------------------------------------------------------------
308
+ # Reward table formatter
309
+ # ---------------------------------------------------------------------------
310
+
311
+
312
+ def _empty_reward_table() -> List[List[Any]]:
313
+ return [
314
+ ["execution_reward", "—"],
315
+ ["edge_pass_rate", "—"],
316
+ ["complexity_penalty", "—"],
317
+ ["reward_hack_penalty", "—"],
318
+ ["floor_penalty", "—"],
319
+ ["perfect_bonus", "—"],
320
+ ["TOTAL reward", "—"],
321
+ ]
322
+
323
+
324
+ def _reward_table_from_info(info: Dict[str, Any], total: float) -> List[List[Any]]:
325
+ def _fmt(x):
326
+ if x is None:
327
+ return "—"
328
+ if isinstance(x, float):
329
+ return f"{x:+.2f}"
330
+ return str(x)
331
+
332
+ edge = info.get("edge_pass_rate")
333
+ edge_str = f"{edge:.0%}" if isinstance(edge, (int, float)) else "—"
334
+ return [
335
+ ["execution_reward", _fmt(info.get("execution_reward"))],
336
+ ["edge_pass_rate", edge_str],
337
+ ["complexity_penalty", _fmt(-(info.get("complexity_penalty") or 0.0))],
338
+ ["reward_hack_penalty", _fmt(-(info.get("reward_hack_penalty") or 0.0))],
339
+ ["floor_penalty", _fmt(-(info.get("floor_penalty") or 0.0))],
340
+ ["perfect_bonus", _fmt(info.get("perfect_bonus"))],
341
+ ["TOTAL reward", _fmt(total)],
342
+ ]
343
+
344
+
345
+ # ---------------------------------------------------------------------------
346
+ # Streaming runner
347
+ # ---------------------------------------------------------------------------
348
+
349
+
350
+ def _format_log(lines: List[str]) -> str:
351
+ return "\n".join(lines)
352
+
353
+
354
+ def run_agent(
355
+ task_name: str,
356
+ backend_key: str,
357
+ seed: int = 0,
358
+ ) -> Generator[Tuple[str, str, List[List[Any]], str], None, None]:
359
+ """Run one agent rollout end-to-end and stream UI updates.
360
+
361
+ Yields tuples of (log_text, code_markdown, reward_table, status).
362
+ """
363
+ backend = BACKENDS.get(backend_key)
364
+ if backend is None:
365
+ yield ("Unknown backend.", "", _empty_reward_table(), "error")
366
+ return
367
+ if not task_name:
368
+ yield ("Pick a task first.", "", _empty_reward_table(), "error")
369
+ return
370
+
371
+ log_lines: List[str] = []
372
+ code_md = ""
373
+ table = _empty_reward_table()
374
+
375
+ def push(line: str = "", *, status: str = "running") -> Tuple[str, str, List[List[Any]], str]:
376
+ if line:
377
+ log_lines.append(line)
378
+ return _format_log(log_lines), code_md, table, status
379
+
380
+ yield push(f"task={task_name} backend={backend.label} seed={seed}")
381
+ yield push(f"env={ENV_URL}")
382
+
383
+ # 1. Reset env
384
+ try:
385
+ ep = CLIENT.reset(task_name, seed=seed, max_steps=N_PROBES + 5)
386
+ except Exception as e: # noqa: BLE001
387
+ yield push(f"[error] /reset failed: {e}", status="error")
388
+ return
389
+ eid = ep["episode_id"]
390
+ sig = ep.get("target_function_signature", "")
391
+ yield push(f"\n=== reset ===\nepisode_id={eid}")
392
+ yield push(f"signature: {sig.splitlines()[0] if sig else '(none)'}")
393
+
394
+ # 2. Sample probe inputs from env's own auto-fuzzer
395
+ try:
396
+ inputs = CLIENT.sample_inputs(task_name, n=N_PROBES, seed=seed)
397
+ except Exception as e: # noqa: BLE001
398
+ yield push(f"[warn] sample_inputs failed: {e}; falling back to ['1']*N", status="running")
399
+ inputs = ["1"] * N_PROBES
400
+
401
+ # 3. Probe loop
402
+ yield push(f"\n=== probing ({len(inputs)} inputs) ===")
403
+ history: List[Tuple[str, str, bool]] = []
404
+ for i, inp in enumerate(inputs, 1):
405
+ try:
406
+ resp = CLIENT.probe(eid, inp)
407
+ except Exception as e: # noqa: BLE001
408
+ yield push(f" probe {i}/{len(inputs)} input={inp} [error] {e}")
409
+ continue
410
+ last = resp["observation"]["probe_history"][-1]
411
+ out = last["output_repr"]
412
+ is_err = bool(last["is_error"])
413
+ history.append((last["input_repr"], out, is_err))
414
+ tag = "raises" if is_err else "->"
415
+ yield push(f" probe {i}/{len(inputs)} input={inp} {tag} {out}")
416
+ time.sleep(0.05) # tiny delay so the UI feels live, not spammed
417
+
418
+ # 4. Build prompt + generate code
419
+ prompt = build_prompt(task_name, sig, history)
420
+ yield push(f"\n=== generating code ({backend.label}) ===")
421
+
422
+ if backend.kind == "oracle":
423
+ completion = "```python\n" + get_oracle_code(task_name) + "```"
424
+ code = extract_code(completion)
425
+ yield push("oracle: pulled canonical reference implementation.")
426
+ elif backend.kind == "hf":
427
+ if backend.adapter and not _adapter_has_weights(backend.adapter):
428
+ yield push(
429
+ f"[info] adapter {backend.adapter!r} has no weights yet "
430
+ f"(repo only contains .gitattributes); falling back to base model output.",
431
+ )
432
+ backend = BackendInfo(
433
+ key=backend.key, label=f"{backend.label} → base fallback",
434
+ kind="hf", base_model=backend.base_model, adapter=None,
435
+ )
436
+ try:
437
+ yield push(
438
+ f"loading {backend.base_model} on CPU"
439
+ + (f" + LoRA {backend.adapter}" if backend.adapter else "")
440
+ + " ... (cold-start may take 30-90s the first time)"
441
+ )
442
+ t0 = time.time()
443
+ completion = _generate_hf(backend.base_model, backend.adapter, prompt)
444
+ yield push(f"generated in {time.time() - t0:.1f}s ({MAX_NEW_TOKENS} max new tokens)")
445
+ except Exception as e: # noqa: BLE001
446
+ tb = traceback.format_exc(limit=2)
447
+ yield push(f"[error] generation failed: {type(e).__name__}: {e}\n{tb}", status="error")
448
+ return
449
+ code = extract_code(completion)
450
+ else:
451
+ yield push(f"[error] unknown backend kind: {backend.kind}", status="error")
452
+ return
453
+
454
+ if not code.strip():
455
+ yield push("[warn] model emitted empty completion; submitting empty stub.")
456
+ code = f"def {task_name}(*args, **kwargs):\n pass\n"
457
+
458
+ code_md = f"```python\n{code}\n```"
459
+ yield push("\n=== submitting code to /step ===")
460
+
461
+ # 5. Submit + verifier breakdown
462
+ try:
463
+ sub_resp = CLIENT.submit(eid, code)
464
+ except Exception as e: # noqa: BLE001
465
+ yield push(f"[error] /step (submit) failed: {e}", status="error")
466
+ return
467
+ info = sub_resp.get("info", {}) or {}
468
+ total = float(sub_resp.get("reward", 0.0))
469
+ table = _reward_table_from_info(info, total)
470
+
471
+ yield push(f"verifier: matches {info.get('matches', 0)}/{info.get('fuzz_count', 0)}")
472
+ if info.get("define_error"):
473
+ yield push(f" define_error: {info['define_error']}")
474
+ by_cat = info.get("matches_by_category") or {}
475
+ counts = info.get("counts_by_category") or {}
476
+ for cat in ("edge", "random"):
477
+ m = by_cat.get(cat)
478
+ c = counts.get(cat)
479
+ if m is not None and c is not None:
480
+ yield push(f" {cat:>6}: {m}/{c}")
481
+ yield push(
482
+ f"\nreward breakdown:"
483
+ f" exec={info.get('execution_reward', 0):.2f}"
484
+ f" -complexity={info.get('complexity_penalty', 0):.2f}"
485
+ f" -hack={info.get('reward_hack_penalty', 0):.2f}"
486
+ f" -floor={info.get('floor_penalty', 0):.2f}"
487
+ f" +perfect={info.get('perfect_bonus', 0):.2f}"
488
+ )
489
+ final_status = "done"
490
+ if info.get("execution_reward", 0) >= 99.999:
491
+ yield push(f"\n*** TOTAL REWARD = {total:+.2f} (PERFECT) ***", status=final_status)
492
+ else:
493
+ yield push(f"\nTOTAL REWARD = {total:+.2f}", status=final_status)
494
+
495
+
496
+ # ---------------------------------------------------------------------------
497
+ # UI helpers
498
+ # ---------------------------------------------------------------------------
499
+
500
+
501
+ def _task_label(t: Dict[str, Any]) -> str:
502
+ diff = t.get("difficulty") or "?"
503
+ src = t.get("source", "?")
504
+ sig = t.get("signature") or t["name"]
505
+ return f"[{diff}/{src}] {sig}"
506
+
507
+
508
+ def build_task_choices() -> List[Tuple[str, str]]:
509
+ tasks = fetch_tasks()
510
+ tasks_sorted = sorted(
511
+ tasks,
512
+ key=lambda t: (
513
+ {"easy": 0, "medium": 1, "hard": 2}.get(t.get("difficulty") or "", 9),
514
+ t["name"],
515
+ ),
516
+ )
517
+ return [(_task_label(t), t["name"]) for t in tasks_sorted]
518
+
519
+
520
+ # ---------------------------------------------------------------------------
521
+ # Comparison: oracle vs trained adapter on a single task
522
+ # ---------------------------------------------------------------------------
523
+
524
+
525
+ def quick_compare(task_name: str, seed: int = 0) -> str:
526
+ """Side-by-side: oracle reward vs trained-0.5b reward on the same task.
527
+
528
+ Used by the 'baseline-vs-trained' panel. Runs *non-streaming* and just
529
+ returns a Markdown summary (we already have streaming for the main
530
+ panel). Falls back gracefully if either backend fails.
531
+ """
532
+ out_lines = [f"### Reward comparison on `{task_name}` (seed={seed})", ""]
533
+ rows: List[Tuple[str, str]] = []
534
+ for key in ("oracle", "trained-0.5b"):
535
+ backend = BACKENDS[key]
536
+ try:
537
+ ep = CLIENT.reset(task_name, seed=seed, max_steps=N_PROBES + 5)
538
+ except Exception as e: # noqa: BLE001
539
+ rows.append((backend.label, f"reset failed: {e}"))
540
+ continue
541
+ if backend.kind == "oracle":
542
+ code = get_oracle_code(task_name)
543
+ else:
544
+ if backend.adapter and not _adapter_has_weights(backend.adapter):
545
+ rows.append((backend.label, "adapter not yet trained"))
546
+ continue
547
+ try:
548
+ inputs = CLIENT.sample_inputs(task_name, n=N_PROBES, seed=seed)
549
+ history = []
550
+ for inp in inputs:
551
+ try:
552
+ r = CLIENT.probe(ep["episode_id"], inp)
553
+ last = r["observation"]["probe_history"][-1]
554
+ history.append((last["input_repr"], last["output_repr"], bool(last["is_error"])))
555
+ except Exception: # noqa: BLE001
556
+ pass
557
+ prompt = build_prompt(task_name, ep.get("target_function_signature", ""), history)
558
+ completion = _generate_hf(backend.base_model, backend.adapter, prompt)
559
+ code = extract_code(completion) or f"def {task_name}(*a, **k): pass"
560
+ except Exception as e: # noqa: BLE001
561
+ rows.append((backend.label, f"generation failed: {e}"))
562
+ continue
563
+ try:
564
+ sub = CLIENT.submit(ep["episode_id"], code)
565
+ total = float(sub.get("reward", 0.0))
566
+ info = sub.get("info", {}) or {}
567
+ rows.append(
568
+ (
569
+ backend.label,
570
+ f"reward={total:+.2f} exec={info.get('execution_reward', 0):.0f}/100"
571
+ f" matches={info.get('matches', 0)}/{info.get('fuzz_count', 0)}",
572
+ )
573
+ )
574
+ except Exception as e: # noqa: BLE001
575
+ rows.append((backend.label, f"submit failed: {e}"))
576
+ out_lines.append("| backend | result |")
577
+ out_lines.append("| --- | --- |")
578
+ for label, r in rows:
579
+ out_lines.append(f"| {label} | {r} |")
580
+ return "\n".join(out_lines)
581
+
582
+
583
+ # ---------------------------------------------------------------------------
584
+ # UI
585
+ # ---------------------------------------------------------------------------
586
+
587
+ INTRO_MARKDOWN = """
588
+ # OpenSleuth — live agent demo
589
+
590
+ **The Algorithmic Detective:** an LLM agent reverse-engineers an unknown
591
+ black-box Python function by *probing* it with inputs and then *submitting*
592
+ a Python replica. The env scores the submission by domain-aware fuzzing
593
+ against the hidden reference, with edge-case stratification, a complexity
594
+ penalty, and anti-reward-hacking signals.
595
+
596
+ Pick a task, pick an agent, hit **Run agent**.
597
+ """.strip()
598
+
599
+ FOOTER_MARKDOWN = f"""
600
+ ---
601
+
602
+ **Links** ·
603
+ [env Space]({ENV_SPACE_URL}) ·
604
+ [task dataset]({HUB_DATASET_URL}) ·
605
+ [GitHub]({GITHUB_URL})
606
+
607
+ **Backends:** `oracle` is the known-correct reference impl (always +100).
608
+ `base Qwen 0.5B` is `Qwen/Qwen2.5-0.5B-Instruct` with no fine-tuning.
609
+ `trained Qwen 0.5B` is the GRPO LoRA at `{ADAPTER_05B_ID}`.
610
+ `trained Qwen 3B` is the GRPO LoRA at `{ADAPTER_3B_ID}` (falls back to the
611
+ base 3B model, with a log notice, if the adapter repo has no weights).
612
+
613
+ Models run on CPU-basic, so first generation per backend includes a cold-load
614
+ (~30–90s for 0.5B). Generations are capped at {MAX_NEW_TOKENS} new tokens.
615
+ """.strip()
616
+
617
+
618
+ def build_ui() -> gr.Blocks:
619
+ with gr.Blocks(title="OpenSleuth — live agent demo", theme=gr.themes.Soft()) as demo:
620
+ gr.Markdown(INTRO_MARKDOWN)
621
+
622
+ # populated lazily so the Space can boot even if the env is mid-deploy
623
+ task_choices = gr.State(value=[])
624
+
625
+ with gr.Row():
626
+ task_dd = gr.Dropdown(
627
+ label="Task (15 black-box functions, easy → hard)",
628
+ choices=[],
629
+ value=None,
630
+ interactive=True,
631
+ )
632
+ backend_dd = gr.Dropdown(
633
+ label="Agent backend",
634
+ choices=BACKEND_CHOICES,
635
+ value="oracle",
636
+ interactive=True,
637
+ )
638
+ seed_in = gr.Number(label="Seed", value=0, precision=0, scale=0, minimum=0)
639
+ run_btn = gr.Button("Run agent", variant="primary", scale=0)
640
+
641
+ with gr.Row():
642
+ log_box = gr.Textbox(
643
+ label="Live agent log",
644
+ value="(idle — pick a task and a backend, then hit Run agent)",
645
+ lines=22,
646
+ max_lines=40,
647
+ interactive=False,
648
+ show_copy_button=True,
649
+ )
650
+
651
+ with gr.Row():
652
+ with gr.Column(scale=2):
653
+ code_md = gr.Markdown(label="Submitted code", value="")
654
+ with gr.Column(scale=1):
655
+ reward_tbl = gr.Dataframe(
656
+ headers=["component", "value"],
657
+ value=_empty_reward_table(),
658
+ label="Reward breakdown",
659
+ interactive=False,
660
+ wrap=True,
661
+ )
662
+
663
+ with gr.Accordion("oracle vs trained-0.5b head-to-head", open=False):
664
+ with gr.Row():
665
+ cmp_btn = gr.Button("Run quick comparison", variant="secondary")
666
+ cmp_md = gr.Markdown(value="(no comparison run yet)")
667
+
668
+ gr.Markdown(FOOTER_MARKDOWN)
669
+
670
+ # ---- wiring ------------------------------------------------------
671
+ def _refresh_tasks():
672
+ choices = build_task_choices()
673
+ default = choices[0][1] if choices else None
674
+ return gr.Dropdown(choices=choices, value=default), choices
675
+
676
+ demo.load(_refresh_tasks, outputs=[task_dd, task_choices])
677
+
678
+ run_btn.click(
679
+ fn=run_agent,
680
+ inputs=[task_dd, backend_dd, seed_in],
681
+ outputs=[log_box, code_md, reward_tbl, gr.State()],
682
+ show_progress="minimal",
683
+ )
684
+
685
+ cmp_btn.click(
686
+ fn=quick_compare,
687
+ inputs=[task_dd, seed_in],
688
+ outputs=[cmp_md],
689
+ show_progress="minimal",
690
+ )
691
+
692
+ return demo
693
+
694
+
695
+ if __name__ == "__main__":
696
+ ui = build_ui()
697
+ ui.queue(default_concurrency_limit=2).launch(
698
+ server_name="0.0.0.0",
699
+ server_port=int(os.environ.get("PORT", "7860")),
700
+ )
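One consequence of the token cap in `app.py` is worth illustrating. `extract_code` only matches a fence that is both opened and closed, so a completion cut off at `MAX_NEW_TOKENS` before its closing fence is submitted verbatim, opening fence included. The sketch below shows one possible, more tolerant variant. `extract_code_tolerant` and `OPEN_RE` are hypothetical names and not part of this commit; the closed-fence pattern is intended to match `_CODE_RE`.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Same idea as _CODE_RE in app.py: first closed fenced block, optional "python" tag.
CLOSED_RE = re.compile(FENCE + r"(?:python)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
# Hypothetical addition: an opening fence with no closing fence (truncated output).
OPEN_RE = re.compile(FENCE + r"(?:python)?\s*(.*)", re.DOTALL | re.IGNORECASE)


def extract_code_tolerant(completion: str) -> str:
    """Prefer a closed fenced block, then an unclosed one, then the raw text."""
    for pattern in (CLOSED_RE, OPEN_RE):
        match = pattern.search(completion)
        if match:
            return match.group(1).strip()
    return completion.strip()


closed = "Here you go:\n" + FENCE + "python\ndef f(x):\n    return x\n" + FENCE
truncated = FENCE + "python\ndef f(x):\n    return x +"

print(extract_code_tolerant(closed))
print(extract_code_tolerant(truncated))
```

Stripping the dangling fence does not repair a function that was cut off mid-expression, as the second example shows. It only stops the fence line itself from being the first syntax error the verifier sees.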
oracle.py ADDED
@@ -0,0 +1,231 @@
+ """Per-task reference implementations.
+
+ These are the *known-correct* solutions for each of the 15 tasks the OpenSleuth
+ env exposes. They mirror the rows pushed to ``anugrah55/opensleuth-tasks`` by
+ ``env/opensleuth_env/scripts/bootstrap_tasks_dataset.py`` (which itself mirrors
+ the in-process oracle in ``env/opensleuth_env/black_box.py``).
+
+ The "oracle" demo backend just looks up the task name here and submits the
+ canonical source. It exists so the viewer can immediately see what a perfect
+ score looks like end-to-end (signature → probes → submit → +100 reward).
+ """
+
+ from __future__ import annotations
+
+ from typing import Dict
+
+
+ ORACLE_SOLUTIONS: Dict[str, str] = {
+     # ---- 9 builtins -------------------------------------------------------
+     "fibonacci": (
+         "def fibonacci(n):\n"
+         "    if not isinstance(n, int) or isinstance(n, bool) or n <= 0 or n > 90:\n"
+         "        raise ValueError('Input must be a positive integer <= 90.')\n"
+         "    a, b = 0, 1\n"
+         "    for _ in range(n - 1):\n"
+         "        a, b = b, a + b\n"
+         "    return b if n > 0 else a\n"
+     ),
+     "reverse_string": (
+         "def reverse_string(s):\n"
+         "    if not isinstance(s, str):\n"
+         "        raise TypeError('Input must be a string.')\n"
+         "    return s[::-1]\n"
+     ),
+     "is_palindrome": (
+         "def is_palindrome(s):\n"
+         "    if not isinstance(s, str):\n"
+         "        raise TypeError('Input must be a string.')\n"
+         "    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())\n"
+         "    return cleaned == cleaned[::-1]\n"
+     ),
+     "digit_sum": (
+         "def digit_sum(n):\n"
+         "    if not isinstance(n, int) or isinstance(n, bool):\n"
+         "        raise TypeError('Input must be int.')\n"
+         "    if n < 0:\n"
+         "        raise ValueError('Input must be non-negative.')\n"
+         "    return sum(int(c) for c in str(n))\n"
+     ),
+     "count_vowels": (
+         "def count_vowels(s):\n"
+         "    if not isinstance(s, str):\n"
+         "        raise TypeError('Input must be a string.')\n"
+         "    return sum(1 for c in s.lower() if c in 'aeiou')\n"
+     ),
+     "gcd": (
+         "def gcd(pair):\n"
+         "    if not isinstance(pair, (list, tuple)) or len(pair) != 2:\n"
+         "        raise TypeError('Input must be a 2-element list or tuple.')\n"
+         "    a, b = pair\n"
+         "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in (a, b)):\n"
+         "        raise TypeError('Both elements must be int.')\n"
+         "    if a < 0 or b < 0:\n"
+         "        raise ValueError('Both elements must be non-negative.')\n"
+         "    while b:\n"
+         "        a, b = b, a % b\n"
+         "    return a\n"
+     ),
+     "sort_unique": (
+         "def sort_unique(xs):\n"
+         "    if not isinstance(xs, list):\n"
+         "        raise TypeError('Input must be a list.')\n"
+         "    if not all(isinstance(x, int) and not isinstance(x, bool) for x in xs):\n"
+         "        raise TypeError('All elements must be int.')\n"
+         "    return sorted(set(xs))\n"
+     ),
+     "caesar_cipher": (
+         "def caesar_cipher(s):\n"
+         "    if not isinstance(s, str):\n"
+         "        raise TypeError('Input must be a string.')\n"
+         "    out = []\n"
+         "    for ch in s:\n"
+         "        if 'a' <= ch <= 'z':\n"
+         "            out.append(chr((ord(ch) - ord('a') + 3) % 26 + ord('a')))\n"
+         "        else:\n"
+         "            out.append(ch)\n"
+         "    return ''.join(out)\n"
+     ),
+     "is_prime": (
+         "def is_prime(n):\n"
+         "    if not isinstance(n, int) or isinstance(n, bool):\n"
+         "        raise TypeError('Input must be int.')\n"
+         "    if n < 2:\n"
+         "        return False\n"
+         "    if n < 4:\n"
+         "        return True\n"
+         "    if n % 2 == 0:\n"
+         "        return False\n"
+         "    i = 3\n"
+         "    while i * i <= n:\n"
+         "        if n % i == 0:\n"
+         "            return False\n"
+         "        i += 2\n"
+         "    return True\n"
+     ),
+     # ---- 6 hub-pushed tasks -----------------------------------------------
+     "roman_to_int": (
+         "def roman_to_int(s):\n"
+         "    if not isinstance(s, str):\n"
+         "        raise TypeError('input must be str')\n"
+         "    table = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}\n"
+         "    total = 0\n"
+         "    prev = 0\n"
+         "    for ch in reversed(s.upper()):\n"
+         "        if ch not in table:\n"
+         "            raise ValueError(f'invalid roman numeral character: {ch!r}')\n"
+         "        v = table[ch]\n"
+         "        if v < prev:\n"
+         "            total -= v\n"
+         "        else:\n"
+         "            total += v\n"
+         "        prev = v\n"
+         "    return total\n"
+     ),
+     "levenshtein_distance": (
+         "def levenshtein_distance(a, b):\n"
+         "    if not isinstance(a, str) or not isinstance(b, str):\n"
+         "        raise TypeError('both arguments must be str')\n"
+         "    if a == b:\n"
+         "        return 0\n"
+         "    if not a:\n"
+         "        return len(b)\n"
+         "    if not b:\n"
+         "        return len(a)\n"
+         "    prev = list(range(len(b) + 1))\n"
+         "    for i, ca in enumerate(a, 1):\n"
+         "        cur = [i] + [0] * len(b)\n"
+         "        for j, cb in enumerate(b, 1):\n"
+         "            ins = cur[j-1] + 1\n"
+         "            dele = prev[j] + 1\n"
+         "            sub = prev[j-1] + (ca != cb)\n"
+         "            cur[j] = min(ins, dele, sub)\n"
+         "        prev = cur\n"
+         "    return prev[-1]\n"
+     ),
+     "flatten_list": (
+         "def flatten_list(xs):\n"
+         "    if not isinstance(xs, (list, tuple)):\n"
+         "        raise TypeError('input must be list or tuple')\n"
+         "    out = []\n"
+         "    rev = []\n"
+         "    rev.extend(reversed(list(xs)))\n"
+         "    while rev:\n"
+         "        x = rev.pop()\n"
+         "        if isinstance(x, (list, tuple)):\n"
+         "            for y in reversed(x):\n"
+         "                rev.append(y)\n"
+         "        else:\n"
+         "            out.append(x)\n"
+         "    return out\n"
+     ),
+     "merge_sorted": (
+         "def merge_sorted(a, b):\n"
+         "    if not isinstance(a, list) or not isinstance(b, list):\n"
+         "        raise TypeError('both arguments must be list')\n"
+         "    for x in (*a, *b):\n"
+         "        if not isinstance(x, int) or isinstance(x, bool):\n"
+         "            raise TypeError('elements must be int')\n"
+         "    out = []\n"
+         "    i = j = 0\n"
+         "    while i < len(a) and j < len(b):\n"
+         "        if a[i] <= b[j]:\n"
+         "            out.append(a[i]); i += 1\n"
+         "        else:\n"
+         "            out.append(b[j]); j += 1\n"
+         "    out.extend(a[i:])\n"
+         "    out.extend(b[j:])\n"
+         "    return out\n"
+     ),
+     "run_length_encode": (
+         "def run_length_encode(s):\n"
+         "    if not isinstance(s, str):\n"
+         "        raise TypeError('input must be str')\n"
+         "    if not s:\n"
+         "        return []\n"
+         "    out = []\n"
+         "    cur = s[0]\n"
+         "    n = 1\n"
+         "    for ch in s[1:]:\n"
+         "        if ch == cur:\n"
+         "            n += 1\n"
+         "        else:\n"
+         "            out.append((cur, n))\n"
+         "            cur = ch\n"
+         "            n = 1\n"
+         "    out.append((cur, n))\n"
+         "    return out\n"
+     ),
+     "binary_search": (
+         "def binary_search(arr, target):\n"
+         "    if not isinstance(arr, list):\n"
+         "        raise TypeError('arr must be list')\n"
+         "    if not isinstance(target, int) or isinstance(target, bool):\n"
+         "        raise TypeError('target must be int')\n"
+         "    lo, hi = 0, len(arr) - 1\n"
+         "    while lo <= hi:\n"
+         "        mid = (lo + hi) // 2\n"
+         "        v = arr[mid]\n"
+         "        if v == target:\n"
+         "            return mid\n"
+         "        if v < target:\n"
+         "            lo = mid + 1\n"
+         "        else:\n"
+         "            hi = mid - 1\n"
+         "    return -1\n"
+     ),
+ }
+
+
+ def get_oracle_code(task_name: str) -> str:
+     """Return the canonical source for ``task_name``, or a stub raising
+     NotImplementedError if the task isn't in the oracle catalog."""
+     code = ORACLE_SOLUTIONS.get(task_name)
+     if code is not None:
+         return code
+     return (
+         f"def {task_name}(*args, **kwargs):\n"
+         f"    raise NotImplementedError(\n"
+         f"        \"No oracle reference for {task_name!r}. Try the model backends.\"\n"
+         f"    )\n"
+     )
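
A minimal sketch of how a caller might consume the source strings that `get_oracle_code` returns: execute the source in a scratch namespace, pull out the function named after the task, and spot-check it. The one-entry `SOLUTIONS` dict, `get_code`, and `load_solution` are illustrative stand-ins rather than part of this commit, and the real env scores submissions with its own fuzzer rather than with `exec` in the caller's process.

```python
from typing import Callable, Dict

# Illustrative stand-in for ORACLE_SOLUTIONS: one entry in the same shape.
SOLUTIONS: Dict[str, str] = {
    "reverse_string": (
        "def reverse_string(s):\n"
        "    if not isinstance(s, str):\n"
        "        raise TypeError('Input must be a string.')\n"
        "    return s[::-1]\n"
    ),
}


def get_code(task_name: str) -> str:
    """Lookup-or-stub, in the same shape as get_oracle_code."""
    code = SOLUTIONS.get(task_name)
    if code is not None:
        return code
    # The message is double-quoted because repr(task_name) adds single quotes.
    return (
        f"def {task_name}(*args, **kwargs):\n"
        f'    raise NotImplementedError("No oracle reference for {task_name!r}.")\n'
    )


def load_solution(task_name: str) -> Callable:
    """Hypothetical helper: exec the source and return the function it defines."""
    namespace: dict = {}
    exec(get_code(task_name), namespace)  # only for trusted, in-repo source
    return namespace[task_name]


reverse_string = load_solution("reverse_string")
print(reverse_string("sleuth"))

missing = load_solution("no_such_task")
try:
    missing(1)
except NotImplementedError as exc:
    print("stub raised:", exc)
```

Loading an unknown task still yields a callable, so the failure surfaces only when the stub is invoked, which matches the lookup-or-stub contract described in the docstring.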
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ gradio>=4.44.0,<5
+ requests>=2.31
+ huggingface_hub>=0.24
+ transformers>=4.45
+ peft>=0.13
+ accelerate>=0.34
+ --extra-index-url https://download.pytorch.org/whl/cpu
+ torch==2.4.1