Initial demo: live agent rollouts against OpenSleuth env
- README.md +71 -5
- __pycache__/app.cpython-313.pyc +0 -0
- __pycache__/oracle.cpython-313.pyc +0 -0
- app.py +700 -0
- oracle.py +231 -0
- requirements.txt +8 -0
README.md
CHANGED
|
@@ -1,12 +1,78 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
-
sdk_version:
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
| 1 |
---
|
| 2 |
+
title: OpenSleuth — Live Agent Demo
|
| 3 |
+
emoji: "\U0001F575"
|
| 4 |
+
colorFrom: indigo
|
| 5 |
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
+
sdk_version: 4.44.0
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: apache-2.0
|
| 11 |
+
suggested_hardware: cpu-basic
|
| 12 |
+
suggested_storage: small
|
| 13 |
+
short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
|
| 14 |
---
|
| 15 |
|
| 16 |
+
# OpenSleuth — live agent demo
|
| 17 |
+
|
| 18 |
+
Pick a hidden black-box Python function from the OpenSleuth catalog (15
|
| 19 |
+
tasks: easy → hard, mix of builtin and Hub-pushed). Pick an agent backend
|
| 20 |
+
(`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, `trained Qwen 3B
|
| 21 |
+
(LoRA)`). Watch the agent:
|
| 22 |
+
|
| 23 |
+
1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier
|
| 24 |
+
uses), one at a time, with each `(input → output)` pair streamed live.
|
| 25 |
+
2. **Submit** a Python replica of the hidden function.
|
| 26 |
+
3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs +
|
| 27 |
+
the spec's must-pass edge cases, with stratified pass-rates and a
|
| 28 |
+
reward breakdown (execution / edge / complexity / hack penalties /
|
| 29 |
+
perfect bonus).
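
The probe, submit, and verify steps above can be mimicked locally. The sketch below is a toy stand-in, not the env's verifier: it calls a Python function directly instead of going through the env Space, compares return values and exception types on a handful of inputs, and leaves out every reward term. The names `call_repr`, `probe`, and `match_rate` are hypothetical.

```python
import ast
from typing import Any, Callable, List, Tuple

# (input_repr, output_repr, is_error), the same triple shape app.py keeps per probe.
Probe = Tuple[str, str, bool]


def call_repr(fn: Callable[[Any], Any], input_repr: str) -> Tuple[str, bool]:
    """Call fn on a literal input; return (repr of result or exception type name, is_error)."""
    try:
        return repr(fn(ast.literal_eval(input_repr))), False
    except Exception as exc:  # the exception type is part of the observed behaviour
        return type(exc).__name__, True


def probe(fn: Callable[[Any], Any], input_reprs: List[str]) -> List[Probe]:
    """Step 1: observe the hidden function on each input, one at a time."""
    return [(inp, *call_repr(fn, inp)) for inp in input_reprs]


def match_rate(hidden: Callable[[Any], Any], replica: Callable[[Any], Any],
               input_reprs: List[str]) -> float:
    """Step 3 (toy version): fraction of inputs where the replica behaves like the hidden fn."""
    if not input_reprs:
        raise ValueError("need at least one input")
    hits = sum(call_repr(hidden, i) == call_repr(replica, i) for i in input_reprs)
    return hits / len(input_reprs)


# Hypothetical hidden function and a candidate replica (step 2 would be the agent's output).
hidden = lambda x: x * 2
replica = lambda x: x + x
inputs = ["1", "'ab'", "None"]

for inp, out, is_err in probe(hidden, inputs):
    print(f"input={inp} {'raises' if is_err else '->'} {out}")
print("match rate:", match_rate(hidden, replica, inputs))
```

In the real demo the hidden function lives behind the env Space, so probing and verification happen over HTTP rather than in-process.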
|
| 30 |
+
|
| 31 |
+
The submitted code is shown syntax-highlighted, and an optional accordion
|
| 32 |
+
runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on
|
| 33 |
+
the selected task.
|
| 34 |
+
|
| 35 |
+
## Backends
|
| 36 |
+
|
| 37 |
+
| Backend | Source | Notes |
|
| 38 |
+
| --- | --- | --- |
|
| 39 |
+
| `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
|
| 40 |
+
| `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
|
| 41 |
+
| `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
|
| 42 |
+
| `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to "adapter not yet trained" if the repo has no weights yet. |
|
| 43 |
+
|
| 44 |
+
## Architecture
|
| 45 |
+
|
| 46 |
+
```
|
| 47 |
+
[demo Space] ──HTTP──> [env Space]
|
| 48 |
+
│ /tasks, /tasks/{name}/sample_inputs,
|
| 49 |
+
│ /reset, /step (probe + submit)
|
| 50 |
+
│
|
| 51 |
+
└─ HF model load (lazy, cached): base + optional LoRA on CPU
|
| 52 |
+
```
|
| 53 |
+
|
| 54 |
+
- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
|
| 55 |
+
- The task catalog is `anugrah55/opensleuth-tasks`.
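
The `/reset` and `/step` calls in the diagram carry small JSON bodies. A minimal sketch of those bodies follows, using the field names that `EnvClient` in `app.py` sends. The helper function names and the `example_task` / `ep-123` values are made up for illustration, and no request is actually sent here.

```python
import json
from typing import Any, Dict


def reset_payload(target_name: str, seed: int = 0, max_steps: int = 25) -> Dict[str, Any]:
    """Body for POST /reset: start an episode against one hidden function."""
    return {"target_name": target_name, "seed": seed, "max_steps": max_steps}


def probe_payload(episode_id: str, input_repr: str) -> Dict[str, Any]:
    """Body for POST /step when probing: one input, passed as a Python repr string."""
    return {
        "episode_id": episode_id,
        "action": {"action_type": "probe", "input_repr": input_repr},
    }


def submit_payload(episode_id: str, code: str) -> Dict[str, Any]:
    """Body for POST /step when submitting the replica's source code."""
    return {
        "episode_id": episode_id,
        "action": {"action_type": "submit", "code": code},
    }


# "example_task" and "ep-123" are placeholders, not real catalog or episode names.
print(json.dumps(reset_payload("example_task"), indent=2))
print(json.dumps(probe_payload("ep-123", "[3, 1, 2]"), indent=2))
print(json.dumps(submit_payload("ep-123", "def example_task(x):\n    return x\n"), indent=2))
```

Probe and submit share the `/step` endpoint and differ only in `action_type`, which is why the diagram lists `/step (probe + submit)` once.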
|
| 56 |
+
|
| 57 |
+
## CPU-basic notes
|
| 58 |
+
|
| 59 |
+
The demo runs on CPU-basic. First generation per backend cold-loads the
|
| 60 |
+
model (~30–90s for 0.5B). To keep latency bounded:
|
| 61 |
+
|
| 62 |
+
- `MAX_NEW_TOKENS=192`
|
| 63 |
+
- Models are cached in-process across runs (a plain dict keyed by base model and adapter; nothing is evicted).
|
| 64 |
+
- The 3B backend will only attempt a real load if the adapter repo has
|
| 65 |
+
weights pushed; otherwise the run falls back to the base model and says so in the log.
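
The lazy, cached loading described above can be sketched as a lock-guarded dictionary, following the pattern of `_load_hf` in `app.py`. The `ModelCache` class and the `loader` callable are assumptions for illustration; in the real app the loader would be the `transformers` / `peft` loading code, which is deliberately left out so the sketch runs without downloading anything.

```python
import threading
from typing import Any, Callable, Dict, Optional


class ModelCache:
    """Load each (base, adapter) pair at most once; later calls reuse the loaded object."""

    def __init__(self, loader: Callable[[str, Optional[str]], Any]) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._loaded: Dict[str, Any] = {}

    @staticmethod
    def key(base: str, adapter: Optional[str]) -> str:
        # Same key shape as app.py: the base model alone is cached under "_base_".
        return f"{base}::{adapter or '_base_'}"

    def get(self, base: str, adapter: Optional[str] = None) -> Any:
        k = self.key(base, adapter)
        with self._lock:  # serialise loads so two requests never load the same model twice
            if k not in self._loaded:
                self._loaded[k] = self._loader(base, adapter)
            return self._loaded[k]

    def __len__(self) -> int:
        return len(self._loaded)


# Stand-in loader that records its calls instead of loading real weights.
calls = []

def fake_loader(base: str, adapter: Optional[str]) -> str:
    calls.append((base, adapter))
    return f"model({base}, {adapter})"


cache = ModelCache(fake_loader)
cache.get("Qwen/Qwen2.5-0.5B-Instruct")  # cold load
cache.get("Qwen/Qwen2.5-0.5B-Instruct")  # served from the cache
print(len(calls), "load(s) for 2 requests")
```

Holding the lock during the load keeps the logic simple at the cost of blocking other backends while one model loads, which seems acceptable for a single-user demo on CPU.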
|
| 66 |
+
|
| 67 |
+
Configure with env vars:
|
| 68 |
+
|
| 69 |
+
| Env var | Default |
|
| 70 |
+
| --- | --- |
|
| 71 |
+
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
|
| 72 |
+
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
|
| 73 |
+
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
|
| 74 |
+
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
|
| 75 |
+
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
|
| 76 |
+
| `MAX_NEW_TOKENS` | `192` |
|
| 77 |
+
| `N_PROBES` | `6` |
|
| 78 |
+
| `HF_TOKEN` | (optional, set as Space secret for gated models) |
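
A minimal sketch of how these variables could be read, using the defaults from the table. The `DemoConfig` dataclass and `load_config` function are hypothetical; `app.py` itself reads each variable into a module-level constant.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DemoConfig:
    env_url: str
    base_model_id: str
    max_new_tokens: int
    n_probes: int
    hf_token: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> DemoConfig:
    """Read settings from a mapping (the process environment by default)."""
    return DemoConfig(
        # Strip a trailing slash so paths like "/tasks" can be appended directly.
        env_url=env.get(
            "OPENSLEUTH_ENV_URL",
            "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
        ).rstrip("/"),
        base_model_id=env.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct"),
        max_new_tokens=int(env.get("MAX_NEW_TOKENS", "192")),
        n_probes=int(env.get("N_PROBES", "6")),
        hf_token=env.get("HF_TOKEN"),  # None when the secret is not set
    )


# Override two settings, e.g. to point at a locally running env with fewer probes.
cfg = load_config({"OPENSLEUTH_ENV_URL": "http://localhost:8000/", "N_PROBES": "3"})
print(cfg.env_url, cfg.n_probes, cfg.max_new_tokens)
```

Passing the mapping explicitly keeps the function easy to test without touching the real process environment.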
|
__pycache__/app.cpython-313.pyc
ADDED
|
Binary file (33.6 kB)
|
|
|
__pycache__/oracle.cpython-313.pyc
ADDED
|
Binary file (6.88 kB)
|
|
|
app.py
ADDED
|
@@ -0,0 +1,700 @@
|
| 1 |
+
"""OpenSleuth live demo Space.
|
| 2 |
+
|
| 3 |
+
A clickable Gradio app that lets a viewer watch the OpenSleuth agent solve
|
| 4 |
+
one of the 15 catalog tasks live: pick a black-box function, pick an agent
|
| 5 |
+
backend, watch the agent probe the env, submit a Python replica, and see
|
| 6 |
+
the verifier reward streamed back in real time.
|
| 7 |
+
|
| 8 |
+
Backends:
|
| 9 |
+
* "oracle" — submit the canonical reference implementation
|
| 10 |
+
* "base Qwen 0.5B" — Qwen/Qwen2.5-0.5B-Instruct, no fine-tuning
|
| 11 |
+
* "trained Qwen 0.5B" — base + GRPO LoRA from anugrah55/opensleuth-qwen2.5-0.5b-grpo
|
| 12 |
+
* "trained Qwen 3B" — base + GRPO LoRA from anugrah55/opensleuth-qwen2.5-3b-grpo-v2
|
| 13 |
+
(gracefully degraded if adapter repo is empty)
|
| 14 |
+
|
| 15 |
+
Network: calls the live env Space at https://anugrah55-opensleuth-env-gemini-cli.hf.space
|
| 16 |
+
for /tasks, /reset, /step (probe + submit), /tasks/{name}/sample_inputs.
|
| 17 |
+
|
| 18 |
+
CPU-basic friendly: model loads are lazy, generations are capped at 192
|
| 19 |
+
new tokens, and we fall back gracefully if a model/adapter is unavailable.
|
| 20 |
+
"""
|
| 21 |
+
|
| 22 |
+
from __future__ import annotations
|
| 23 |
+
|
| 24 |
+
import logging
|
| 25 |
+
import os
|
| 26 |
+
import re
|
| 27 |
+
import threading
|
| 28 |
+
import time
|
| 29 |
+
import traceback
|
| 30 |
+
from dataclasses import dataclass
|
| 31 |
+
from typing import Any, Dict, Generator, List, Optional, Tuple
|
| 32 |
+
|
| 33 |
+
import gradio as gr
|
| 34 |
+
import requests
|
| 35 |
+
from huggingface_hub import HfApi
|
| 36 |
+
|
| 37 |
+
from oracle import ORACLE_SOLUTIONS, get_oracle_code
|
| 38 |
+
|
| 39 |
+
# ---------------------------------------------------------------------------
|
| 40 |
+
# Config
|
| 41 |
+
# ---------------------------------------------------------------------------
|
| 42 |
+
|
| 43 |
+
ENV_URL = os.environ.get(
|
| 44 |
+
"OPENSLEUTH_ENV_URL",
|
| 45 |
+
"https://anugrah55-opensleuth-env-gemini-cli.hf.space",
|
| 46 |
+
).rstrip("/")
|
| 47 |
+
|
| 48 |
+
BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
|
| 49 |
+
ADAPTER_05B_ID = os.environ.get(
|
| 50 |
+
"ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo"
|
| 51 |
+
)
|
| 52 |
+
ADAPTER_3B_ID = os.environ.get(
|
| 53 |
+
"ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2"
|
| 54 |
+
)
|
| 55 |
+
BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
|
| 56 |
+
|
| 57 |
+
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
|
| 58 |
+
N_PROBES = int(os.environ.get("N_PROBES", "6"))
|
| 59 |
+
HF_TOKEN = os.environ.get("HF_TOKEN")
|
| 60 |
+
|
| 61 |
+
GITHUB_URL = "https://github.com/"
|
| 62 |
+
HUB_DATASET_URL = "https://huggingface.co/datasets/anugrah55/opensleuth-tasks"
|
| 63 |
+
ENV_SPACE_URL = "https://huggingface.co/spaces/anugrah55/opensleuth-env-gemini-cli"
|
| 64 |
+
|
| 65 |
+
SYSTEM_PROMPT = (
|
| 66 |
+
"You are an algorithmic detective. You are given the public signature of a hidden "
|
| 67 |
+
"Python function plus several (input, output) examples observed by probing it. "
|
| 68 |
+
"Your job is to write a Python function that *exactly* reproduces the hidden "
|
| 69 |
+
"function's behavior on all valid inputs. Match its return values AND its "
|
| 70 |
+
"exception types on invalid inputs. Keep your implementation as simple and clean "
|
| 71 |
+
"as possible (it is penalised for being needlessly branchy). Return ONLY the "
|
| 72 |
+
"function definition wrapped in a single ```python ... ``` code block."
|
| 73 |
+
)
|
| 74 |
+
|
| 75 |
+
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
|
| 76 |
+
log = logging.getLogger("opensleuth.demo")
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
# ---------------------------------------------------------------------------
|
| 80 |
+
# Env client (thin)
|
| 81 |
+
# ---------------------------------------------------------------------------
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
class EnvClient:
|
| 85 |
+
def __init__(self, base_url: str, timeout: float = 30.0) -> None:
|
| 86 |
+
self.base_url = base_url.rstrip("/")
|
| 87 |
+
self.timeout = timeout
|
| 88 |
+
|
| 89 |
+
def _get(self, path: str, **params) -> Dict[str, Any]:
|
| 90 |
+
r = requests.get(f"{self.base_url}{path}", params=params or None, timeout=self.timeout)
|
| 91 |
+
r.raise_for_status()
|
| 92 |
+
return r.json()
|
| 93 |
+
|
| 94 |
+
def _post(self, path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
|
| 95 |
+
r = requests.post(f"{self.base_url}{path}", json=payload, timeout=self.timeout)
|
| 96 |
+
r.raise_for_status()
|
| 97 |
+
return r.json()
|
| 98 |
+
|
| 99 |
+
def list_tasks(self) -> List[Dict[str, Any]]:
|
| 100 |
+
return self._get("/tasks")["tasks"]
|
| 101 |
+
|
| 102 |
+
def sample_inputs(self, name: str, n: int = 6, seed: int = 0) -> List[str]:
|
| 103 |
+
return list(self._get(f"/tasks/{name}/sample_inputs", n=n, seed=seed)["inputs"])
|
| 104 |
+
|
| 105 |
+
def reset(self, target_name: str, seed: int = 0, max_steps: int = 25) -> Dict[str, Any]:
|
| 106 |
+
return self._post(
|
| 107 |
+
"/reset",
|
| 108 |
+
{"target_name": target_name, "seed": seed, "max_steps": max_steps},
|
| 109 |
+
)
|
| 110 |
+
|
| 111 |
+
def probe(self, episode_id: str, input_repr: str) -> Dict[str, Any]:
|
| 112 |
+
return self._post(
|
| 113 |
+
"/step",
|
| 114 |
+
{
|
| 115 |
+
"episode_id": episode_id,
|
| 116 |
+
"action": {"action_type": "probe", "input_repr": input_repr},
|
| 117 |
+
},
|
| 118 |
+
)
|
| 119 |
+
|
| 120 |
+
def submit(self, episode_id: str, code: str) -> Dict[str, Any]:
|
| 121 |
+
return self._post(
|
| 122 |
+
"/step",
|
| 123 |
+
{
|
| 124 |
+
"episode_id": episode_id,
|
| 125 |
+
"action": {"action_type": "submit", "code": code},
|
| 126 |
+
},
|
| 127 |
+
)
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
CLIENT = EnvClient(ENV_URL)
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def fetch_tasks() -> List[Dict[str, Any]]:
|
| 134 |
+
"""Pull the live task catalog. Falls back to a hardcoded list if env is
|
| 135 |
+
unreachable so the dropdown always has something to show."""
|
| 136 |
+
try:
|
| 137 |
+
return CLIENT.list_tasks()
|
| 138 |
+
except Exception as e: # noqa: BLE001
|
| 139 |
+
log.warning("could not fetch /tasks from env (%s); using static fallback", e)
|
| 140 |
+
return [{"name": n, "signature": "", "description": "", "difficulty": "?",
|
| 141 |
+
"edge_case_count": 0, "source": "fallback"}
|
| 142 |
+
for n in sorted(ORACLE_SOLUTIONS)]
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
# ---------------------------------------------------------------------------
|
| 146 |
+
# Prompt + code extraction (lifted from training/opensleuth_train/prompt.py)
|
| 147 |
+
# ---------------------------------------------------------------------------
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
_CODE_RE = re.compile(r"```(?:python)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)
|
| 151 |
+
|
| 152 |
+
|
| 153 |
+
def build_prompt(target_name: str, signature: str, probes: List[Tuple[str, str, bool]]) -> str:
|
| 154 |
+
lines = [
|
| 155 |
+
f"## Hidden function: {target_name}",
|
| 156 |
+
"",
|
| 157 |
+
"### Public signature & docstring",
|
| 158 |
+
signature.strip() or "(no signature provided)",
|
| 159 |
+
"",
|
| 160 |
+
"### Observed probes",
|
| 161 |
+
]
|
| 162 |
+
if not probes:
|
| 163 |
+
lines.append("(none)")
|
| 164 |
+
else:
|
| 165 |
+
for inp, out, is_err in probes:
|
| 166 |
+
tag = "raises" if is_err else "returns"
|
| 167 |
+
lines.append(f"- input={inp} -> {tag} {out}")
|
| 168 |
+
lines += [
|
| 169 |
+
"",
|
| 170 |
+
"### Task",
|
| 171 |
+
f"Write a Python function named `{target_name}` that reproduces the hidden "
|
| 172 |
+
"function's behaviour. Return ONLY the function definition in a single "
|
| 173 |
+
"```python ... ``` code block. Do not add explanations.",
|
| 174 |
+
]
|
| 175 |
+
return "\n".join(lines)
|
| 176 |
+
|
| 177 |
+
|
| 178 |
+
def extract_code(completion: str) -> str:
|
| 179 |
+
m = _CODE_RE.search(completion)
|
| 180 |
+
if m:
|
| 181 |
+
return m.group(1).strip()
|
| 182 |
+
return completion.strip()
|
| 183 |
+
|
| 184 |
+
|
| 185 |
+
# ---------------------------------------------------------------------------
|
| 186 |
+
# Backend registry
|
| 187 |
+
# ---------------------------------------------------------------------------
|
| 188 |
+
|
| 189 |
+
|
| 190 |
+
@dataclass
|
| 191 |
+
class BackendInfo:
|
| 192 |
+
key: str
|
| 193 |
+
label: str
|
| 194 |
+
kind: str # "oracle" | "hf" (transformers + peft)
|
| 195 |
+
base_model: Optional[str] = None
|
| 196 |
+
adapter: Optional[str] = None
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
BACKENDS: Dict[str, BackendInfo] = {
|
| 200 |
+
"oracle": BackendInfo(
|
| 201 |
+
key="oracle",
|
| 202 |
+
label="oracle (reference impl)",
|
| 203 |
+
kind="oracle",
|
| 204 |
+
),
|
| 205 |
+
"base-0.5b": BackendInfo(
|
| 206 |
+
key="base-0.5b",
|
| 207 |
+
label="base Qwen 0.5B (no fine-tune)",
|
| 208 |
+
kind="hf",
|
| 209 |
+
base_model=BASE_MODEL_ID,
|
| 210 |
+
adapter=None,
|
| 211 |
+
),
|
| 212 |
+
"trained-0.5b": BackendInfo(
|
| 213 |
+
key="trained-0.5b",
|
| 214 |
+
label="trained Qwen 0.5B (GRPO LoRA)",
|
| 215 |
+
kind="hf",
|
| 216 |
+
base_model=BASE_MODEL_ID,
|
| 217 |
+
adapter=ADAPTER_05B_ID,
|
| 218 |
+
),
|
| 219 |
+
"trained-3b": BackendInfo(
|
| 220 |
+
key="trained-3b",
|
| 221 |
+
label="trained Qwen 3B (GRPO LoRA)",
|
| 222 |
+
kind="hf",
|
| 223 |
+
base_model=BASE_MODEL_3B_ID,
|
| 224 |
+
adapter=ADAPTER_3B_ID,
|
| 225 |
+
),
|
| 226 |
+
}
|
| 227 |
+
|
| 228 |
+
BACKEND_CHOICES = [(b.label, b.key) for b in BACKENDS.values()]
|
| 229 |
+
|
| 230 |
+
|
| 231 |
+
def _adapter_has_weights(repo_id: str) -> bool:
|
| 232 |
+
"""Hub probe: True iff the adapter repo actually contains adapter
|
| 233 |
+
weights. We treat repos with only `.gitattributes` (still training,
|
| 234 |
+
pre-push) as 'not yet trained'."""
|
| 235 |
+
try:
|
| 236 |
+
api = HfApi(token=HF_TOKEN)
|
| 237 |
+
files = api.list_repo_files(repo_id)
|
| 238 |
+
except Exception as e: # noqa: BLE001
|
| 239 |
+
log.warning("adapter availability probe failed for %s: %s", repo_id, e)
|
| 240 |
+
return False
|
| 241 |
+
return any(f.endswith("adapter_model.safetensors") or f.endswith("adapter_model.bin") for f in files)
|
| 242 |
+
|
| 243 |
+
|
| 244 |
+
# ---------------------------------------------------------------------------
|
| 245 |
+
# Lazy HF model cache
|
| 246 |
+
# ---------------------------------------------------------------------------
|
| 247 |
+
|
| 248 |
+
|
| 249 |
+
_MODEL_LOCK = threading.Lock()
|
| 250 |
+
_LOADED: Dict[str, Tuple[Any, Any]] = {} # cache_key -> (tokenizer, model)
|
| 251 |
+
|
| 252 |
+
|
| 253 |
+
def _model_cache_key(base: str, adapter: Optional[str]) -> str:
|
| 254 |
+
return f"{base}::{adapter or '_base_'}"
|
| 255 |
+
|
| 256 |
+
|
| 257 |
+
def _load_hf(base: str, adapter: Optional[str]) -> Tuple[Any, Any]:
|
| 258 |
+
"""Load the (base, optional LoRA) on CPU. Cached across calls."""
|
| 259 |
+
key = _model_cache_key(base, adapter)
|
| 260 |
+
with _MODEL_LOCK:
|
| 261 |
+
if key in _LOADED:
|
| 262 |
+
return _LOADED[key]
|
| 263 |
+
log.info("loading HF model base=%s adapter=%s", base, adapter)
|
| 264 |
+
import torch # noqa: WPS433
|
| 265 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer # noqa: WPS433
|
| 266 |
+
|
| 267 |
+
tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True, token=HF_TOKEN)
|
| 268 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 269 |
+
base,
|
| 270 |
+
torch_dtype=torch.float32,
|
| 271 |
+
device_map={"": "cpu"},
|
| 272 |
+
trust_remote_code=True,
|
| 273 |
+
low_cpu_mem_usage=True,
|
| 274 |
+
token=HF_TOKEN,
|
| 275 |
+
)
|
| 276 |
+
if adapter:
|
| 277 |
+
from peft import PeftModel # noqa: WPS433
|
| 278 |
+
|
| 279 |
+
model = PeftModel.from_pretrained(model, adapter, token=HF_TOKEN)
|
| 280 |
+
model.eval()
|
| 281 |
+
_LOADED[key] = (tok, model)
|
| 282 |
+
log.info("loaded %s (%d models cached)", key, len(_LOADED))
|
| 283 |
+
return tok, model
|
| 284 |
+
|
| 285 |
+
|
| 286 |
+
def _generate_hf(base: str, adapter: Optional[str], prompt: str) -> str:
|
| 287 |
+
tok, model = _load_hf(base, adapter)
|
| 288 |
+
import torch # noqa: WPS433
|
| 289 |
+
|
| 290 |
+
messages = [
|
| 291 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 292 |
+
{"role": "user", "content": prompt},
|
| 293 |
+
]
|
| 294 |
+
text = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
|
| 295 |
+
inputs = tok(text, return_tensors="pt")
|
| 296 |
+
with torch.no_grad():
|
| 297 |
+
out = model.generate(
|
| 298 |
+
**inputs,
|
| 299 |
+
max_new_tokens=MAX_NEW_TOKENS,
|
| 300 |
+
do_sample=False,
|
| 301 |
+
temperature=1.0,
|
| 302 |
+
pad_token_id=tok.eos_token_id,
|
| 303 |
+
)
|
| 304 |
+
return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
|
| 305 |
+
|
| 306 |
+
|
| 307 |
+
# ---------------------------------------------------------------------------
|
| 308 |
+
# Reward table formatter
|
| 309 |
+
# ---------------------------------------------------------------------------
|
| 310 |
+
|
| 311 |
+
|
| 312 |
+
def _empty_reward_table() -> List[List[Any]]:
|
| 313 |
+
return [
|
| 314 |
+
["execution_reward", "—"],
|
| 315 |
+
["edge_pass_rate", "—"],
|
| 316 |
+
["complexity_penalty", "—"],
|
| 317 |
+
["reward_hack_penalty", "—"],
|
| 318 |
+
["floor_penalty", "—"],
|
| 319 |
+
["perfect_bonus", "—"],
|
| 320 |
+
["TOTAL reward", "—"],
|
| 321 |
+
]
|
| 322 |
+
|
| 323 |
+
|
| 324 |
+
def _reward_table_from_info(info: Dict[str, Any], total: float) -> List[List[Any]]:
|
| 325 |
+
def _fmt(x):
|
| 326 |
+
if x is None:
|
| 327 |
+
return "—"
|
| 328 |
+
if isinstance(x, float):
|
| 329 |
+
return f"{x:+.2f}"
|
| 330 |
+
return str(x)
|
| 331 |
+
|
| 332 |
+
edge = info.get("edge_pass_rate")
|
| 333 |
+
edge_str = f"{edge:.0%}" if isinstance(edge, (int, float)) else "—"
|
| 334 |
+
return [
|
| 335 |
+
["execution_reward", _fmt(info.get("execution_reward"))],
|
| 336 |
+
["edge_pass_rate", edge_str],
|
| 337 |
+
["complexity_penalty", _fmt(-(info.get("complexity_penalty") or 0.0))],
|
| 338 |
+
["reward_hack_penalty", _fmt(-(info.get("reward_hack_penalty") or 0.0))],
|
| 339 |
+
["floor_penalty", _fmt(-(info.get("floor_penalty") or 0.0))],
|
| 340 |
+
["perfect_bonus", _fmt(info.get("perfect_bonus"))],
|
| 341 |
+
["TOTAL reward", _fmt(total)],
|
| 342 |
+
]
|
| 343 |
+
|
| 344 |
+
|
| 345 |
+
# ---------------------------------------------------------------------------
|
| 346 |
+
# Streaming runner
|
| 347 |
+
# ---------------------------------------------------------------------------
|
| 348 |
+
|
| 349 |
+
|
| 350 |
+
def _format_log(lines: List[str]) -> str:
|
| 351 |
+
return "\n".join(lines)
|
| 352 |
+
|
| 353 |
+
|
| 354 |
+
def run_agent(
|
| 355 |
+
task_name: str,
|
| 356 |
+
backend_key: str,
|
| 357 |
+
seed: int = 0,
|
| 358 |
+
) -> Generator[Tuple[str, str, List[List[Any]], str], None, None]:
|
| 359 |
+
"""Run one agent rollout end-to-end and stream UI updates.
|
| 360 |
+
|
| 361 |
+
Yields tuples of (log_text, code_markdown, reward_table, status).
|
| 362 |
+
"""
|
| 363 |
+
backend = BACKENDS.get(backend_key)
|
| 364 |
+
if backend is None:
|
| 365 |
+
yield ("Unknown backend.", "", _empty_reward_table(), "error")
|
| 366 |
+
return
|
| 367 |
+
if not task_name:
|
| 368 |
+
yield ("Pick a task first.", "", _empty_reward_table(), "error")
|
| 369 |
+
return
|
| 370 |
+
|
| 371 |
+
log_lines: List[str] = []
|
| 372 |
+
code_md = ""
|
| 373 |
+
table = _empty_reward_table()
|
| 374 |
+
|
| 375 |
+
def push(line: str = "", *, status: str = "running") -> Tuple[str, str, List[List[Any]], str]:
|
| 376 |
+
if line:
|
| 377 |
+
log_lines.append(line)
|
| 378 |
+
return _format_log(log_lines), code_md, table, status
|
| 379 |
+
|
| 380 |
+
yield push(f"task={task_name} backend={backend.label} seed={seed}")
|
| 381 |
+
yield push(f"env={ENV_URL}")
|
| 382 |
+
|
| 383 |
+
# 1. Reset env
|
| 384 |
+
try:
|
| 385 |
+
ep = CLIENT.reset(task_name, seed=seed, max_steps=N_PROBES + 5)
|
| 386 |
+
except Exception as e: # noqa: BLE001
|
| 387 |
+
yield push(f"[error] /reset failed: {e}", status="error")
|
| 388 |
+
return
|
| 389 |
+
eid = ep["episode_id"]
|
| 390 |
+
sig = ep.get("target_function_signature", "")
|
| 391 |
+
yield push(f"\n=== reset ===\nepisode_id={eid}")
|
| 392 |
+
yield push(f"signature: {sig.splitlines()[0] if sig else '(none)'}")
|
| 393 |
+
|
| 394 |
+
# 2. Sample probe inputs from env's own auto-fuzzer
|
| 395 |
+
try:
|
| 396 |
+
inputs = CLIENT.sample_inputs(task_name, n=N_PROBES, seed=seed)
|
| 397 |
+
except Exception as e: # noqa: BLE001
|
| 398 |
+
yield push(f"[warn] sample_inputs failed: {e}; falling back to ['1']*N", status="running")
|
| 399 |
+
inputs = ["1"] * N_PROBES
|
| 400 |
+
|
| 401 |
+
# 3. Probe loop
|
| 402 |
+
yield push(f"\n=== probing ({len(inputs)} inputs) ===")
|
| 403 |
+
history: List[Tuple[str, str, bool]] = []
|
| 404 |
+
for i, inp in enumerate(inputs, 1):
|
| 405 |
+
try:
|
| 406 |
+
resp = CLIENT.probe(eid, inp)
|
| 407 |
+
except Exception as e: # noqa: BLE001
|
| 408 |
+
yield push(f" probe {i}/{len(inputs)} input={inp} [error] {e}")
|
| 409 |
+
continue
|
| 410 |
+
last = resp["observation"]["probe_history"][-1]
|
| 411 |
+
out = last["output_repr"]
|
| 412 |
+
is_err = bool(last["is_error"])
|
| 413 |
+
history.append((last["input_repr"], out, is_err))
|
| 414 |
+
tag = "raises" if is_err else "->"
|
| 415 |
+
yield push(f" probe {i}/{len(inputs)} input={inp} {tag} {out}")
|
| 416 |
+
time.sleep(0.05) # tiny delay so the UI feels live, not spammed
|
| 417 |
+
|
| 418 |
+
# 4. Build prompt + generate code
|
| 419 |
+
prompt = build_prompt(task_name, sig, history)
|
| 420 |
+
yield push(f"\n=== generating code ({backend.label}) ===")
|
| 421 |
+
|
| 422 |
+
if backend.kind == "oracle":
|
| 423 |
+
completion = "```python\n" + get_oracle_code(task_name) + "```"
|
| 424 |
+
code = extract_code(completion)
|
| 425 |
+
yield push("oracle: pulled canonical reference implementation.")
|
| 426 |
+
elif backend.kind == "hf":
|
| 427 |
+
if backend.adapter and not _adapter_has_weights(backend.adapter):
|
| 428 |
+
yield push(
|
| 429 |
+
f"[info] adapter {backend.adapter!r} has no weights yet "
|
| 430 |
+
f"(repo only contains .gitattributes); falling back to base model output.",
|
| 431 |
+
)
|
| 432 |
+
backend = BackendInfo(
|
| 433 |
+
key=backend.key, label=f"{backend.label} → base fallback",
|
| 434 |
+
kind="hf", base_model=backend.base_model, adapter=None,
|
| 435 |
+
)
|
| 436 |
+
try:
|
| 437 |
+
yield push(
|
| 438 |
+
f"loading {backend.base_model} on CPU"
|
| 439 |
+
+ (f" + LoRA {backend.adapter}" if backend.adapter else "")
|
| 440 |
+
+ " ... (cold-start may take 30-90s the first time)"
|
| 441 |
+
)
|
| 442 |
+
t0 = time.time()
|
| 443 |
+
completion = _generate_hf(backend.base_model, backend.adapter, prompt)
|
| 444 |
+
yield push(f"generated in {time.time() - t0:.1f}s ({MAX_NEW_TOKENS} max new tokens)")
|
| 445 |
+
except Exception as e: # noqa: BLE001
|
| 446 |
+
tb = traceback.format_exc(limit=2)
|
| 447 |
+
yield push(f"[error] generation failed: {type(e).__name__}: {e}\n{tb}", status="error")
|
| 448 |
+
return
|
| 449 |
+
code = extract_code(completion)
|
| 450 |
+
else:
|
| 451 |
+
yield push(f"[error] unknown backend kind: {backend.kind}", status="error")
|
| 452 |
+
return
|
| 453 |
+
|
| 454 |
+
if not code.strip():
|
| 455 |
+
yield push("[warn] model emitted empty completion; submitting empty stub.")
|
| 456 |
+
code = f"def {task_name}(*args, **kwargs):\n pass\n"
|
| 457 |
+
|
| 458 |
+
code_md = f"```python\n{code}\n```"
|
| 459 |
+
yield push("\n=== submitting code to /step ===")
|
| 460 |
+
|
| 461 |
+
# 5. Submit + verifier breakdown
|
| 462 |
+
try:
|
| 463 |
+
sub_resp = CLIENT.submit(eid, code)
|
| 464 |
+
except Exception as e: # noqa: BLE001
|
| 465 |
+
yield push(f"[error] /step (submit) failed: {e}", status="error")
|
| 466 |
+
return
|
| 467 |
+
info = sub_resp.get("info", {}) or {}
|
| 468 |
+
total = float(sub_resp.get("reward", 0.0))
|
| 469 |
+
table = _reward_table_from_info(info, total)
|
| 470 |
+
|
| 471 |
+
yield push(f"verifier: matches {info.get('matches', 0)}/{info.get('fuzz_count', 0)}")
|
| 472 |
+
if info.get("define_error"):
|
| 473 |
+
yield push(f" define_error: {info['define_error']}")
|
| 474 |
+
by_cat = info.get("matches_by_category") or {}
|
| 475 |
+
counts = info.get("counts_by_category") or {}
|
| 476 |
+
for cat in ("edge", "random"):
|
| 477 |
+
m = by_cat.get(cat)
|
| 478 |
+
c = counts.get(cat)
|
| 479 |
+
if m is not None and c is not None:
|
| 480 |
+
yield push(f" {cat:>6}: {m}/{c}")
|
| 481 |
+
yield push(
|
| 482 |
+
f"\nreward breakdown:"
|
| 483 |
+
f" exec={info.get('execution_reward', 0):.2f}"
|
| 484 |
+
f" -complexity={info.get('complexity_penalty', 0):.2f}"
|
| 485 |
+
f" -hack={info.get('reward_hack_penalty', 0):.2f}"
|
| 486 |
+
f" -floor={info.get('floor_penalty', 0):.2f}"
|
| 487 |
+
f" +perfect={info.get('perfect_bonus', 0):.2f}"
|
| 488 |
+
)
|
| 489 |
+
final_status = "done"
|
| 490 |
+
if info.get("execution_reward", 0) >= 99.999:
|
| 491 |
+
yield push(f"\n*** TOTAL REWARD = {total:+.2f} (PERFECT) ***", status=final_status)
|
| 492 |
+
else:
|
| 493 |
+
yield push(f"\nTOTAL REWARD = {total:+.2f}", status=final_status)
|
| 494 |
+
|
| 495 |
+
|
| 496 |
+
# ---------------------------------------------------------------------------
|
| 497 |
+
# UI helpers
|
| 498 |
+
# ---------------------------------------------------------------------------
|
| 499 |
+
|
| 500 |
+
|
| 501 |
+
def _task_label(t: Dict[str, Any]) -> str:
|
| 502 |
+
diff = t.get("difficulty") or "?"
|
| 503 |
+
src = t.get("source", "?")
|
| 504 |
+
sig = t.get("signature") or t["name"]
|
| 505 |
+
return f"[{diff}/{src}] {sig}"
|
| 506 |
+
|
| 507 |
+
|
| 508 |
+
def build_task_choices() -> List[Tuple[str, str]]:
|
| 509 |
+
tasks = fetch_tasks()
|
| 510 |
+
tasks_sorted = sorted(
|
| 511 |
+
tasks,
|
| 512 |
+
key=lambda t: (
|
| 513 |
+
{"easy": 0, "medium": 1, "hard": 2}.get(t.get("difficulty") or "", 9),
|
| 514 |
+
t["name"],
|
| 515 |
+
),
|
| 516 |
+
)
|
| 517 |
+
return [(_task_label(t), t["name"]) for t in tasks_sorted]
|
| 518 |
+
|
| 519 |
+
|
| 520 |
+
# ---------------------------------------------------------------------------
|
| 521 |
+
# Comparison: oracle vs trained adapter on a single task
|
| 522 |
+
# ---------------------------------------------------------------------------
|
| 523 |
+
|
| 524 |
+
|
| 525 |
+
def quick_compare(task_name: str, seed: int = 0) -> str:
|
| 526 |
+
"""Side-by-side: oracle reward vs trained-0.5b reward on the same task.
|
| 527 |
+
|
| 528 |
+
Used by the 'baseline-vs-trained' panel. Runs *non-streaming* and just
|
| 529 |
+
returns a Markdown summary (we already have streaming for the main
|
| 530 |
+
panel). Falls back gracefully if either backend fails.
|
| 531 |
+
"""
|
| 532 |
+
out_lines = [f"### Reward comparison on `{task_name}` (seed={seed})", ""]
|
| 533 |
+
rows: List[Tuple[str, str]] = []
|
| 534 |
+
for key in ("oracle", "trained-0.5b"):
|
| 535 |
+
backend = BACKENDS[key]
|
| 536 |
+
try:
|
| 537 |
+
ep = CLIENT.reset(task_name, seed=seed, max_steps=2)
|
| 538 |
+
except Exception as e: # noqa: BLE001
|
| 539 |
+
rows.append((backend.label, f"reset failed: {e}"))
|
| 540 |
+
continue
|
| 541 |
+
if backend.kind == "oracle":
|
| 542 |
+
code = get_oracle_code(task_name)
|
| 543 |
+
else:
|
| 544 |
+
if backend.adapter and not _adapter_has_weights(backend.adapter):
|
| 545 |
+
rows.append((backend.label, "adapter not yet trained"))
|
| 546 |
+
continue
|
| 547 |
+
try:
|
| 548 |
+
inputs = CLIENT.sample_inputs(task_name, n=N_PROBES, seed=seed)
|
| 549 |
+
history = []
|
| 550 |
+
for inp in inputs:
|
| 551 |
+
try:
|
| 552 |
+
r = CLIENT.probe(ep["episode_id"], inp)
|
| 553 |
+
last = r["observation"]["probe_history"][-1]
|
| 554 |
+
history.append((last["input_repr"], last["output_repr"], bool(last["is_error"])))
|
| 555 |
+
except Exception:  # noqa: BLE001 - skip a failed probe; the prompt uses whatever history was collected
|
| 556 |
+
pass
|
| 557 |
+
prompt = build_prompt(task_name, ep.get("target_function_signature", ""), history)
|
| 558 |
+
completion = _generate_hf(backend.base_model, backend.adapter, prompt)
|
| 559 |
+
code = extract_code(completion) or f"def {task_name}(*a, **k): pass"
|
| 560 |
+
except Exception as e: # noqa: BLE001
|
| 561 |
+
rows.append((backend.label, f"generation failed: {e}"))
|
| 562 |
+
continue
|
| 563 |
+
try:
|
| 564 |
+
sub = CLIENT.submit(ep["episode_id"], code)
|
| 565 |
+
total = float(sub.get("reward", 0.0))
|
| 566 |
+
info = sub.get("info", {}) or {}
|
| 567 |
+
rows.append(
|
| 568 |
+
(
|
| 569 |
+
backend.label,
|
| 570 |
+
f"reward={total:+.2f} exec={info.get('execution_reward', 0):.0f}/100"
|
| 571 |
+
f" matches={info.get('matches', 0)}/{info.get('fuzz_count', 0)}",
|
| 572 |
+
)
|
| 573 |
+
)
|
| 574 |
+
except Exception as e: # noqa: BLE001
|
| 575 |
+
rows.append((backend.label, f"submit failed: {e}"))
|
| 576 |
+
out_lines.append("| backend | result |")
|
| 577 |
+
out_lines.append("| --- | --- |")
|
| 578 |
+
for label, r in rows:
|
| 579 |
+
out_lines.append(f"| {label} | {r} |")
|
| 580 |
+
return "\n".join(out_lines)
|
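The result formatting at the end of `quick_compare` can be exercised without the env or a model. The following is a minimal sketch, assuming a submit response shaped like the one read above (`reward`, plus an `info` dict with `execution_reward`, `matches`, `fuzz_count`). The helper names `summarize_submission` and `render_table`, the sample payloads, and the backend labels are hypothetical, and the real env may return more fields.

```python
from typing import Any, Dict, List, Tuple


def summarize_submission(sub: Dict[str, Any]) -> str:
    """Format one submit response the way quick_compare does."""
    total = float(sub.get("reward", 0.0))
    info = sub.get("info", {}) or {}  # tolerate a missing or null "info"
    return (
        f"reward={total:+.2f} exec={info.get('execution_reward', 0):.0f}/100"
        f" matches={info.get('matches', 0)}/{info.get('fuzz_count', 0)}"
    )


def render_table(task_name: str, seed: int, rows: List[Tuple[str, str]]) -> str:
    """Build the Markdown summary: a heading, a blank line, then a two-column table."""
    lines = [f"### Reward comparison on `{task_name}` (seed={seed})", ""]
    lines.append("| backend | result |")
    lines.append("| --- | --- |")
    lines.extend(f"| {label} | {result} |" for label, result in rows)
    return "\n".join(lines)


# Hypothetical payloads, for illustration only.
perfect = {
    "reward": 100.0,
    "info": {"execution_reward": 100.0, "matches": 100, "fuzz_count": 100},
}
empty = {"info": None}

rows = [
    ("oracle", summarize_submission(perfect)),
    ("trained Qwen 0.5B (LoRA)", summarize_submission(empty)),
]
table = render_table("gcd", 0, rows)
```

Because every field is read with a default, a response with no `info` still yields a well-formed row instead of raising.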
| 581 |
+
|
| 582 |
+
|
| 583 |
+
# ---------------------------------------------------------------------------
|
| 584 |
+
# UI
|
| 585 |
+
# ---------------------------------------------------------------------------
|
| 586 |
+
|
| 587 |
+
INTRO_MARKDOWN = """
|
| 588 |
+
# OpenSleuth — live agent demo
|
| 589 |
+
|
| 590 |
+
**The Algorithmic Detective:** an LLM agent reverse-engineers an unknown
|
| 591 |
+
black-box Python function by *probing* it with inputs and then *submitting*
|
| 592 |
+
a Python replica. The env scores the submission by domain-aware fuzzing
|
| 593 |
+
against the hidden reference, with edge-case stratification, a complexity
|
| 594 |
+
penalty, and anti-reward-hacking signals.
|
| 595 |
+
|
| 596 |
+
Pick a task, pick an agent, hit **Run agent**.
|
| 597 |
+
""".strip()
|
| 598 |
+
|
| 599 |
+
FOOTER_MARKDOWN = f"""
|
| 600 |
+
---
|
| 601 |
+
|
| 602 |
+
**Links** ·
|
| 603 |
+
[env Space]({ENV_SPACE_URL}) ·
|
| 604 |
+
[task dataset]({HUB_DATASET_URL}) ·
|
| 605 |
+
[GitHub]({GITHUB_URL})
|
| 606 |
+
|
| 607 |
+
**Backends:** `oracle` is the known-correct reference impl (+100 on every task that has a reference in `oracle.py`).
|
| 608 |
+
`base Qwen 0.5B` is `Qwen/Qwen2.5-0.5B-Instruct` with no fine-tuning.
|
| 609 |
+
`trained Qwen 0.5B` is the GRPO LoRA at `{ADAPTER_05B_ID}`.
|
| 610 |
+
`trained Qwen 3B` is the GRPO LoRA at `{ADAPTER_3B_ID}` (gracefully
|
| 611 |
+
falls back to "adapter not yet trained" if the repo has no weights).
|
| 612 |
+
|
| 613 |
+
Models run on CPU-basic, so first generation per backend includes a cold-load
|
| 614 |
+
(~30–90s for 0.5B). Generations are capped at {MAX_NEW_TOKENS} new tokens.
|
| 615 |
+
""".strip()
|
| 616 |
+
|
| 617 |
+
|
| 618 |
+
def build_ui() -> gr.Blocks:
|
| 619 |
+
with gr.Blocks(title="OpenSleuth — live agent demo", theme=gr.themes.Soft()) as demo:
|
| 620 |
+
gr.Markdown(INTRO_MARKDOWN)
|
| 621 |
+
|
| 622 |
+
# populated lazily so the Space can boot even if the env is mid-deploy
|
| 623 |
+
task_choices = gr.State(value=[])
|
| 624 |
+
|
| 625 |
+
with gr.Row():
|
| 626 |
+
task_dd = gr.Dropdown(
|
| 627 |
+
label="Task (15 black-box functions, easy → hard)",
|
| 628 |
+
choices=[],
|
| 629 |
+
value=None,
|
| 630 |
+
interactive=True,
|
| 631 |
+
)
|
| 632 |
+
backend_dd = gr.Dropdown(
|
| 633 |
+
label="Agent backend",
|
| 634 |
+
choices=BACKEND_CHOICES,
|
| 635 |
+
value="oracle",
|
| 636 |
+
interactive=True,
|
| 637 |
+
)
|
| 638 |
+
seed_in = gr.Number(label="Seed", value=0, precision=0, scale=0, minimum=0)
|
| 639 |
+
run_btn = gr.Button("Run agent", variant="primary", scale=0)
|
| 640 |
+
|
| 641 |
+
with gr.Row():
|
| 642 |
+
log_box = gr.Textbox(
|
| 643 |
+
label="Live agent log",
|
| 644 |
+
value="(idle — pick a task and a backend, then hit Run agent)",
|
| 645 |
+
lines=22,
|
| 646 |
+
max_lines=40,
|
| 647 |
+
interactive=False,
|
| 648 |
+
show_copy_button=True,
|
| 649 |
+
)
|
| 650 |
+
|
| 651 |
+
with gr.Row():
|
| 652 |
+
with gr.Column(scale=2):
|
| 653 |
+
code_md = gr.Markdown(label="Submitted code", value="")
|
| 654 |
+
with gr.Column(scale=1):
|
| 655 |
+
reward_tbl = gr.Dataframe(
|
| 656 |
+
headers=["component", "value"],
|
| 657 |
+
value=_empty_reward_table(),
|
| 658 |
+
label="Reward breakdown",
|
| 659 |
+
interactive=False,
|
| 660 |
+
wrap=True,
|
| 661 |
+
)
|
| 662 |
+
|
| 663 |
+
with gr.Accordion("oracle vs trained-0.5b head-to-head", open=False):
|
| 664 |
+
with gr.Row():
|
| 665 |
+
cmp_btn = gr.Button("Run quick comparison", variant="secondary")
|
| 666 |
+
cmp_md = gr.Markdown(value="(no comparison run yet)")
|
| 667 |
+
|
| 668 |
+
gr.Markdown(FOOTER_MARKDOWN)
|
| 669 |
+
|
| 670 |
+
# ---- wiring ------------------------------------------------------
|
| 671 |
+
def _refresh_tasks():
|
| 672 |
+
choices = build_task_choices()
|
| 673 |
+
default = choices[0][1] if choices else None
|
| 674 |
+
return gr.Dropdown(choices=choices, value=default), choices
|
| 675 |
+
|
| 676 |
+
demo.load(_refresh_tasks, outputs=[task_dd, task_choices])
|
| 677 |
+
|
| 678 |
+
run_btn.click(
|
| 679 |
+
fn=run_agent,
|
| 680 |
+
inputs=[task_dd, backend_dd, seed_in],
|
| 681 |
+
outputs=[log_box, code_md, reward_tbl, gr.State()],
|
| 682 |
+
show_progress="minimal",
|
| 683 |
+
)
|
| 684 |
+
|
| 685 |
+
cmp_btn.click(
|
| 686 |
+
fn=quick_compare,
|
| 687 |
+
inputs=[task_dd, seed_in],
|
| 688 |
+
outputs=[cmp_md],
|
| 689 |
+
show_progress="minimal",
|
| 690 |
+
)
|
| 691 |
+
|
| 692 |
+
return demo
|
| 693 |
+
|
| 694 |
+
|
| 695 |
+
if __name__ == "__main__":
|
| 696 |
+
ui = build_ui()
|
| 697 |
+
ui.queue(default_concurrency_limit=2).launch(
|
| 698 |
+
server_name="0.0.0.0",
|
| 699 |
+
server_port=int(os.environ.get("PORT", "7860")),
|
| 700 |
+
)
|
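The dropdown ordering produced by `build_task_choices` in `app.py` can be checked in isolation. This is a minimal sketch, assuming task dicts with the keys that `_task_label` reads (`name`, `difficulty`, `source`, `signature`). The sample rows are made up for illustration, and a hard-coded list stands in for `fetch_tasks`.

```python
from typing import Any, Dict, List, Tuple

# Same ranking as in build_task_choices: unknown difficulties sort last (rank 9).
DIFFICULTY_RANK = {"easy": 0, "medium": 1, "hard": 2}


def task_label(t: Dict[str, Any]) -> str:
    """Mirror of _task_label: '[difficulty/source] signature-or-name'."""
    diff = t.get("difficulty") or "?"
    src = t.get("source", "?")
    sig = t.get("signature") or t["name"]
    return f"[{diff}/{src}] {sig}"


def sort_task_choices(tasks: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Order by difficulty rank, then by name; return (label, value) pairs."""
    ordered = sorted(
        tasks,
        key=lambda t: (DIFFICULTY_RANK.get(t.get("difficulty") or "", 9), t["name"]),
    )
    return [(task_label(t), t["name"]) for t in ordered]


# Hypothetical sample rows, not taken from the real catalog.
SAMPLE_TASKS = [
    {"name": "gcd", "difficulty": "medium", "source": "builtin", "signature": "gcd(pair)"},
    {"name": "mystery", "difficulty": None},
    {"name": "fibonacci", "difficulty": "easy", "source": "builtin"},
]

choices = sort_task_choices(SAMPLE_TASKS)
```

A task with a missing or null `difficulty` still gets a label (`[?/?] mystery`) and lands after the ranked tasks, so a partially filled catalog row does not break the dropdown.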
oracle.py
ADDED
|
@@ -0,0 +1,231 @@
|
| 1 |
+
"""Per-task reference implementations.
|
| 2 |
+
|
| 3 |
+
These are the *known-correct* solutions for each of the 15 tasks the OpenSleuth
|
| 4 |
+
env exposes. They mirror the rows pushed to ``anugrah55/opensleuth-tasks`` by
|
| 5 |
+
``env/opensleuth_env/scripts/bootstrap_tasks_dataset.py`` (which itself mirrors
|
| 6 |
+
the in-process oracle in ``env/opensleuth_env/black_box.py``).
|
| 7 |
+
|
| 8 |
+
The "oracle" demo backend just looks up the task name here and submits the
|
| 9 |
+
canonical source. It exists so the viewer can immediately see what a perfect
|
| 10 |
+
score looks like end-to-end (signature → probes → submit → +100 reward).
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
from typing import Dict
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
ORACLE_SOLUTIONS: Dict[str, str] = {
|
| 19 |
+
# ---- 9 builtins -------------------------------------------------------
|
| 20 |
+
"fibonacci": (
|
| 21 |
+
"def fibonacci(n):\n"
|
| 22 |
+
" if not isinstance(n, int) or isinstance(n, bool) or n <= 0 or n > 90:\n"
|
| 23 |
+
" raise ValueError('Input must be a positive integer <= 90.')\n"
|
| 24 |
+
" a, b = 0, 1\n"
|
| 25 |
+
" for _ in range(n - 1):\n"
|
| 26 |
+
" a, b = b, a + b\n"
|
| 27 |
+
" return b if n > 0 else a\n"
|
| 28 |
+
),
|
| 29 |
+
"reverse_string": (
|
| 30 |
+
"def reverse_string(s):\n"
|
| 31 |
+
" if not isinstance(s, str):\n"
|
| 32 |
+
" raise TypeError('Input must be a string.')\n"
|
| 33 |
+
" return s[::-1]\n"
|
| 34 |
+
),
|
| 35 |
+
"is_palindrome": (
|
| 36 |
+
"def is_palindrome(s):\n"
|
| 37 |
+
" if not isinstance(s, str):\n"
|
| 38 |
+
" raise TypeError('Input must be a string.')\n"
|
| 39 |
+
" cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())\n"
|
| 40 |
+
" return cleaned == cleaned[::-1]\n"
|
| 41 |
+
),
|
| 42 |
+
"digit_sum": (
|
| 43 |
+
"def digit_sum(n):\n"
|
| 44 |
+
" if not isinstance(n, int) or isinstance(n, bool):\n"
|
| 45 |
+
" raise TypeError('Input must be int.')\n"
|
| 46 |
+
" if n < 0:\n"
|
| 47 |
+
" raise ValueError('Input must be non-negative.')\n"
|
| 48 |
+
" return sum(int(c) for c in str(n))\n"
|
| 49 |
+
),
|
| 50 |
+
"count_vowels": (
|
| 51 |
+
"def count_vowels(s):\n"
|
| 52 |
+
" if not isinstance(s, str):\n"
|
| 53 |
+
" raise TypeError('Input must be a string.')\n"
|
| 54 |
+
" return sum(1 for c in s.lower() if c in 'aeiou')\n"
|
| 55 |
+
),
|
| 56 |
+
"gcd": (
|
| 57 |
+
"def gcd(pair):\n"
|
| 58 |
+
" if not isinstance(pair, (list, tuple)) or len(pair) != 2:\n"
|
| 59 |
+
" raise TypeError('Input must be a 2-element list or tuple.')\n"
|
| 60 |
+
" a, b = pair\n"
|
| 61 |
+
" if not all(isinstance(x, int) and not isinstance(x, bool) for x in (a, b)):\n"
|
| 62 |
+
" raise TypeError('Both elements must be int.')\n"
|
| 63 |
+
" if a < 0 or b < 0:\n"
|
| 64 |
+
" raise ValueError('Both elements must be non-negative.')\n"
|
| 65 |
+
" while b:\n"
|
| 66 |
+
" a, b = b, a % b\n"
|
| 67 |
+
" return a\n"
|
| 68 |
+
),
|
| 69 |
+
"sort_unique": (
|
| 70 |
+
"def sort_unique(xs):\n"
|
| 71 |
+
" if not isinstance(xs, list):\n"
|
| 72 |
+
" raise TypeError('Input must be a list.')\n"
|
| 73 |
+
" if not all(isinstance(x, int) and not isinstance(x, bool) for x in xs):\n"
|
| 74 |
+
" raise TypeError('All elements must be int.')\n"
|
| 75 |
+
" return sorted(set(xs))\n"
|
| 76 |
+
),
|
| 77 |
+
"caesar_cipher": (
|
| 78 |
+
"def caesar_cipher(s):\n"
|
| 79 |
+
" if not isinstance(s, str):\n"
|
| 80 |
+
" raise TypeError('Input must be a string.')\n"
|
| 81 |
+
" out = []\n"
|
| 82 |
+
" for ch in s:\n"
|
| 83 |
+
" if 'a' <= ch <= 'z':\n"
|
| 84 |
+
" out.append(chr((ord(ch) - ord('a') + 3) % 26 + ord('a')))\n"
|
| 85 |
+
" else:\n"
|
| 86 |
+
" out.append(ch)\n"
|
| 87 |
+
" return ''.join(out)\n"
|
| 88 |
+
),
|
| 89 |
+
"is_prime": (
|
| 90 |
+
"def is_prime(n):\n"
|
| 91 |
+
" if not isinstance(n, int) or isinstance(n, bool):\n"
|
| 92 |
+
" raise TypeError('Input must be int.')\n"
|
| 93 |
+
" if n < 2:\n"
|
| 94 |
+
" return False\n"
|
| 95 |
+
" if n < 4:\n"
|
| 96 |
+
" return True\n"
|
| 97 |
+
" if n % 2 == 0:\n"
|
| 98 |
+
" return False\n"
|
| 99 |
+
" i = 3\n"
|
| 100 |
+
" while i * i <= n:\n"
|
| 101 |
+
" if n % i == 0:\n"
|
| 102 |
+
" return False\n"
|
| 103 |
+
" i += 2\n"
|
| 104 |
+
" return True\n"
|
| 105 |
+
),
|
| 106 |
+
# ---- 6 hub-pushed tasks -----------------------------------------------
|
| 107 |
+
"roman_to_int": (
|
| 108 |
+
"def roman_to_int(s):\n"
|
| 109 |
+
" if not isinstance(s, str):\n"
|
| 110 |
+
" raise TypeError('input must be str')\n"
|
| 111 |
+
" table = {'I':1,'V':5,'X':10,'L':50,'C':100,'D':500,'M':1000}\n"
|
| 112 |
+
" total = 0\n"
|
| 113 |
+
" prev = 0\n"
|
| 114 |
+
" for ch in reversed(s.upper()):\n"
|
| 115 |
+
" if ch not in table:\n"
|
| 116 |
+
" raise ValueError(f'invalid roman numeral character: {ch!r}')\n"
|
| 117 |
+
" v = table[ch]\n"
|
| 118 |
+
" if v < prev:\n"
|
| 119 |
+
" total -= v\n"
|
| 120 |
+
" else:\n"
|
| 121 |
+
" total += v\n"
|
| 122 |
+
" prev = v\n"
|
| 123 |
+
" return total\n"
|
| 124 |
+
),
|
| 125 |
+
"levenshtein_distance": (
|
| 126 |
+
"def levenshtein_distance(a, b):\n"
|
| 127 |
+
" if not isinstance(a, str) or not isinstance(b, str):\n"
|
| 128 |
+
" raise TypeError('both arguments must be str')\n"
|
| 129 |
+
" if a == b:\n"
|
| 130 |
+
" return 0\n"
|
| 131 |
+
" if not a:\n"
|
| 132 |
+
" return len(b)\n"
|
| 133 |
+
" if not b:\n"
|
| 134 |
+
" return len(a)\n"
|
| 135 |
+
" prev = list(range(len(b) + 1))\n"
|
| 136 |
+
" for i, ca in enumerate(a, 1):\n"
|
| 137 |
+
" cur = [i] + [0] * len(b)\n"
|
| 138 |
+
" for j, cb in enumerate(b, 1):\n"
|
| 139 |
+
" ins = cur[j-1] + 1\n"
|
| 140 |
+
" dele = prev[j] + 1\n"
|
| 141 |
+
" sub = prev[j-1] + (ca != cb)\n"
|
| 142 |
+
" cur[j] = min(ins, dele, sub)\n"
|
| 143 |
+
" prev = cur\n"
|
| 144 |
+
" return prev[-1]\n"
|
| 145 |
+
),
|
| 146 |
+
"flatten_list": (
|
| 147 |
+
"def flatten_list(xs):\n"
|
| 148 |
+
" if not isinstance(xs, (list, tuple)):\n"
|
| 149 |
+
" raise TypeError('input must be list or tuple')\n"
|
| 150 |
+
" out = []\n"
|
| 151 |
+
" rev = []\n"
|
| 152 |
+
" rev.extend(reversed(list(xs)))\n"
|
| 153 |
+
" while rev:\n"
|
| 154 |
+
" x = rev.pop()\n"
|
| 155 |
+
" if isinstance(x, (list, tuple)):\n"
|
| 156 |
+
" for y in reversed(x):\n"
|
| 157 |
+
" rev.append(y)\n"
|
| 158 |
+
" else:\n"
|
| 159 |
+
" out.append(x)\n"
|
| 160 |
+
" return out\n"
|
| 161 |
+
),
|
| 162 |
+
"merge_sorted": (
|
| 163 |
+
"def merge_sorted(a, b):\n"
|
| 164 |
+
" if not isinstance(a, list) or not isinstance(b, list):\n"
|
| 165 |
+
" raise TypeError('both arguments must be list')\n"
|
| 166 |
+
" for x in (*a, *b):\n"
|
| 167 |
+
" if not isinstance(x, int) or isinstance(x, bool):\n"
|
| 168 |
+
" raise TypeError('elements must be int')\n"
|
| 169 |
+
" out = []\n"
|
| 170 |
+
" i = j = 0\n"
|
| 171 |
+
" while i < len(a) and j < len(b):\n"
|
| 172 |
+
" if a[i] <= b[j]:\n"
|
| 173 |
+
" out.append(a[i]); i += 1\n"
|
| 174 |
+
" else:\n"
|
| 175 |
+
" out.append(b[j]); j += 1\n"
|
| 176 |
+
" out.extend(a[i:])\n"
|
| 177 |
+
" out.extend(b[j:])\n"
|
| 178 |
+
" return out\n"
|
| 179 |
+
),
|
| 180 |
+
"run_length_encode": (
|
| 181 |
+
"def run_length_encode(s):\n"
|
| 182 |
+
" if not isinstance(s, str):\n"
|
| 183 |
+
" raise TypeError('input must be str')\n"
|
| 184 |
+
" if not s:\n"
|
| 185 |
+
" return []\n"
|
| 186 |
+
" out = []\n"
|
| 187 |
+
" cur = s[0]\n"
|
| 188 |
+
" n = 1\n"
|
| 189 |
+
" for ch in s[1:]:\n"
|
| 190 |
+
" if ch == cur:\n"
|
| 191 |
+
" n += 1\n"
|
| 192 |
+
" else:\n"
|
| 193 |
+
" out.append((cur, n))\n"
|
| 194 |
+
" cur = ch\n"
|
| 195 |
+
" n = 1\n"
|
| 196 |
+
" out.append((cur, n))\n"
|
| 197 |
+
" return out\n"
|
| 198 |
+
),
|
| 199 |
+
"binary_search": (
|
| 200 |
+
"def binary_search(arr, target):\n"
|
| 201 |
+
" if not isinstance(arr, list):\n"
|
| 202 |
+
" raise TypeError('arr must be list')\n"
|
| 203 |
+
" if not isinstance(target, int) or isinstance(target, bool):\n"
|
| 204 |
+
" raise TypeError('target must be int')\n"
|
| 205 |
+
" lo, hi = 0, len(arr) - 1\n"
|
| 206 |
+
" while lo <= hi:\n"
|
| 207 |
+
" mid = (lo + hi) // 2\n"
|
| 208 |
+
" v = arr[mid]\n"
|
| 209 |
+
" if v == target:\n"
|
| 210 |
+
" return mid\n"
|
| 211 |
+
" if v < target:\n"
|
| 212 |
+
" lo = mid + 1\n"
|
| 213 |
+
" else:\n"
|
| 214 |
+
" hi = mid - 1\n"
|
| 215 |
+
" return -1\n"
|
| 216 |
+
),
|
| 217 |
+
}
|
| 218 |
+
|
| 219 |
+
|
| 220 |
+
def get_oracle_code(task_name: str) -> str:
|
| 221 |
+
"""Return the canonical source for ``task_name``, or a stub raising
|
| 222 |
+
NotImplementedError if the task isn't in the oracle catalog."""
|
| 223 |
+
code = ORACLE_SOLUTIONS.get(task_name)
|
| 224 |
+
if code is not None:
|
| 225 |
+
return code
|
| 226 |
+
return (
|
| 227 |
+
f"def {task_name}(*args, **kwargs):\n"
|
| 228 |
+
f" raise NotImplementedError(\n"
|
| 229 |
+
f"        \"No oracle reference for {task_name!r}. Try the model backends.\"\n"
|
| 230 |
+
f" )\n"
|
| 231 |
+
)
|
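Since each entry in `ORACLE_SOLUTIONS` is plain source text, it can be sanity-checked locally before it is submitted. The sketch below is one way to do that, assuming the source is trusted. `load_function` is a hypothetical helper that is not part of this commit, and this is not how the env's verifier scores a submission (that code is not shown here).

```python
from typing import Any, Callable, Dict


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute a source string in a fresh namespace and return the named function."""
    namespace: Dict[str, Any] = {}
    exec(source, namespace)  # only safe for trusted, locally authored source
    return namespace[name]


# Copied from the ORACLE_SOLUTIONS entry for "caesar_cipher".
CAESAR_SOURCE = (
    "def caesar_cipher(s):\n"
    "    if not isinstance(s, str):\n"
    "        raise TypeError('Input must be a string.')\n"
    "    out = []\n"
    "    for ch in s:\n"
    "        if 'a' <= ch <= 'z':\n"
    "            out.append(chr((ord(ch) - ord('a') + 3) % 26 + ord('a')))\n"
    "        else:\n"
    "            out.append(ch)\n"
    "    return ''.join(out)\n"
)

caesar = load_function(CAESAR_SOURCE, "caesar_cipher")

# Only lowercase ASCII letters shift by 3; everything else passes through.
shifted = caesar("xyz Abc")

# Non-string input is rejected with TypeError, matching the reference.
try:
    caesar(123)
    rejected = False
except TypeError:
    rejected = True
```

Running each reference through a check like this catches syntax errors in the string literals early, before an episode is spent on a submission that cannot compile.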
requirements.txt
ADDED
|
@@ -0,0 +1,8 @@
|
| 1 |
+
gradio>=4.44.0,<5
|
| 2 |
+
requests>=2.31
|
| 3 |
+
huggingface_hub>=0.24
|
| 4 |
+
transformers>=4.45
|
| 5 |
+
peft>=0.13
|
| 6 |
+
accelerate>=0.34
|
| 7 |
+
--extra-index-url https://download.pytorch.org/whl/cpu
|
| 8 |
+
torch==2.4.1
|