---
title: OpenSleuth — Live Agent Demo
emoji: 🕵️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
---
# OpenSleuth — live agent demo

Pick a hidden black-box Python function from the OpenSleuth catalog (15
tasks: easy → hard, a mix of builtin and Hub-pushed). Pick an agent backend
(`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, `trained Qwen 3B
(LoRA)`). Watch the agent:

1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier
   uses), one at a time, with each `(input → output)` pair streamed live.
2. **Submit** a Python replica of the hidden function.
3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs plus
   the spec's must-pass edge cases, with stratified pass rates and a
   reward breakdown (execution / edge / complexity / hack penalties /
   perfect bonus).

The submitted code is shown syntax-highlighted, and an optional accordion
runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on
the selected task.
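The reward breakdown above can be sketched as a single scoring function. The weights, penalty handling, and the +100 scale below are illustrative assumptions, not the env's actual formula:

```python
def combine_reward(exec_rate, edge_rate, complexity_penalty=0.0, hack_penalty=0.0):
    """Hypothetical reward combiner mirroring the breakdown shown in the UI.

    exec_rate / edge_rate are pass fractions in [0, 1]; the 60/30/10 split
    is an illustrative assumption, not the env's real constants.
    """
    reward = 60.0 * exec_rate + 30.0 * edge_rate
    reward -= complexity_penalty + hack_penalty
    if exec_rate == 1.0 and edge_rate == 1.0:
        reward += 10.0  # "perfect" bonus brings a clean solve to +100
    return reward


print(combine_reward(1.0, 1.0))  # → 100.0, matching the oracle's score
```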
## Backends

| Backend | Source | Notes |
| --- | --- | --- |
| `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to an "adapter not yet trained" message if the repo has no weights yet. |
| ## Architecture | |
| ``` | |
| [demo Space] ──HTTP──> [env Space] | |
| │ /tasks, /tasks/{name}/sample_inputs, | |
| │ /reset, /step (probe + submit) | |
| │ | |
| └─ HF model load (lazy, cached): base + optional LoRA on CPU | |
| ``` | |
| - The env Space is `anugrah55/opensleuth-env-gemini-cli`. | |
| - The task catalog is `anugrah55/opensleuth-tasks`. | |
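A minimal client for the env Space routes in the diagram could look like the sketch below. Only the routes (`/reset`, `/step`) come from the diagram; the JSON payload shapes (`task`, `action`) are assumptions:

```python
import json
import urllib.request

ENV_URL = "https://anugrah55-opensleuth-env-gemini-cli.hf.space"


def endpoint(path: str) -> str:
    """Join the env Space base URL with an API route."""
    return ENV_URL.rstrip("/") + path


def post(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        endpoint(path),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def reset(task_name: str) -> dict:
    # Payload shape is a guess; the real env may name the field differently.
    return post("/reset", {"task": task_name})


def step(action: dict) -> dict:
    # One probe or the final submit, depending on the action contents.
    return post("/step", {"action": action})
```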
| ## CPU-basic notes | |
| The demo runs on CPU-basic. First generation per backend cold-loads the | |
| model (~30–90s for 0.5B). To keep latency bounded: | |
- Generation is capped at `MAX_NEW_TOKENS=192`.
| - Models are cached across runs (in-process LRU). | |
| - The 3B backend will only attempt a real load if the adapter repo has | |
| weights pushed; otherwise it short-circuits to a clear UI message. | |
| Configure with env vars: | |
| | Env var | Default | | |
| | --- | --- | | |
| | `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` | | |
| | `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` | | |
| | `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` | | |
| | `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | | |
| | `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | | |
| | `MAX_NEW_TOKENS` | `192` | | |
| | `N_PROBES` | `6` | | |
| | `HF_TOKEN` | (optional, set as Space secret for gated models) | | |
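A straightforward way for `app.py` to read the table above (the variable names and defaults match the table; the parsing code itself is a sketch):

```python
import os

# Defaults mirror the table above; any can be overridden as a Space variable.
ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)
BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
ADAPTER_05B_ID = os.environ.get("ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo")
ADAPTER_3B_ID = os.environ.get("ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2")
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))
HF_TOKEN = os.environ.get("HF_TOKEN")  # optional; only needed for gated models
```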