---
title: OpenSleuth — Live Agent Demo
emoji: "\U0001F575"
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
---

# OpenSleuth — live agent demo

Pick a hidden black-box Python function from the OpenSleuth catalog (15 tasks: easy → hard, a mix of builtin and Hub-pushed). Pick an agent backend (`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, `trained Qwen 3B (LoRA)`). Watch the agent:

1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier uses), one at a time, with each `(input → output)` pair streamed live.
2. **Submit** a Python replica of the hidden function.
3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs plus the spec's must-pass edge cases, with stratified pass rates and a reward breakdown (execution / edge / complexity / hack penalties / perfect bonus).

The submitted code is shown syntax-highlighted, and an optional accordion runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on the selected task.

## Backends

| Backend | Source | Notes |
| --- | --- | --- |
| `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to an "adapter not yet trained" message if the repo has no weights yet. |

## Architecture

```
[demo Space] ──HTTP──> [env Space]
     │                   /tasks, /tasks/{name}/sample_inputs,
     │                   /reset, /step (probe + submit)
     │
     └─ HF model load (lazy, cached): base + optional LoRA on CPU
```

- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
- The task catalog is `anugrah55/opensleuth-tasks`.
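The HTTP side of the architecture above can be sketched as a tiny stdlib-only client. This is a sketch, not the demo's actual code: the endpoint paths (`/tasks`, `/step`, …) come from the diagram, but the payload keys (`task`, `action`, `input`) are assumptions rather than the env Space's documented schema.

```python
import json
import os
import urllib.request

# Base URL of the env Space; same default as the env-var table below.
ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)


def endpoint(path: str) -> str:
    """Join the env Space base URL with one of the API paths from the diagram."""
    return ENV_URL.rstrip("/") + path


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the env Space and decode the JSON response."""
    req = urllib.request.Request(
        endpoint(path),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical probe step — the payload shape is an assumption:
# result = post_json("/step", {"task": "...", "action": "probe", "input": 3})
```

A probe/submit loop would call `post_json("/reset", ...)` once, then `post_json("/step", ...)` for each of the `N_PROBES` probes and once more for the final submission.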
## CPU-basic notes

The demo runs on CPU-basic. The first generation per backend cold-loads the model (~30–90 s for 0.5B). To keep latency bounded:

- `MAX_NEW_TOKENS=192`
- Models are cached across runs (in-process LRU).
- The 3B backend only attempts a real load if the adapter repo has weights pushed; otherwise it short-circuits to a clear UI message.

Configure with env vars:

| Env var | Default |
| --- | --- |
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
| `MAX_NEW_TOKENS` | `192` |
| `N_PROBES` | `6` |
| `HF_TOKEN` | (optional, set as a Space secret for gated models) |
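The lazy, in-process-LRU model load described above can be sketched as follows. This is a minimal sketch of the pattern, not the demo's real loader: the function name and the `transformers`/`peft` calls are assumptions about the implementation, and the real loader also handles the 3B "adapter not yet trained" short-circuit.

```python
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=4)  # in-process cache: repeat runs reuse already-loaded models
def load_backend(base_model_id: str, adapter_id: Optional[str] = None):
    """Lazily load a base model on CPU, with an optional LoRA adapter on top."""
    # Imports are deferred so the Space starts serving before any model loads;
    # the first call per backend pays the cold-load cost (~30-90 s for 0.5B).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    model = AutoModelForCausalLM.from_pretrained(base_model_id)  # CPU by default
    if adapter_id is not None:
        from peft import PeftModel  # apply the GRPO LoRA adapter to the base

        model = PeftModel.from_pretrained(model, adapter_id)
    return tokenizer, model
```

Because the cache key is the `(base_model_id, adapter_id)` pair, the base 0.5B and its LoRA variant are cached as separate entries.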