---
title: OpenSleuth Live Agent Demo
emoji: 🕵️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
---
# OpenSleuth — live agent demo
Pick a hidden black-box Python function from the OpenSleuth catalog (15
tasks, easy → hard, a mix of built-in and Hub-pushed) and an agent backend
(`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, or `trained Qwen
3B (LoRA)`). Then watch the agent:
1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier
uses), one at a time, with each `(input → output)` pair streamed live.
2. **Submit** a Python replica of the hidden function.
3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs +
the spec's must-pass edge cases, with stratified pass-rates and a
reward breakdown (execution / edge / complexity / hack penalties /
perfect bonus).
The submitted code is shown syntax-highlighted, and an optional accordion
runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on
the selected task.
## Backends
| Backend | Source | Notes |
| --- | --- | --- |
| `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to "adapter not yet trained" if the repo has no weights yet. |
## Architecture
```
[demo Space] ──HTTP──> [env Space]
│ /tasks, /tasks/{name}/sample_inputs,
│ /reset, /step (probe + submit)
└─ HF model load (lazy, cached): base + optional LoRA on CPU
```
- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
- The task catalog is `anugrah55/opensleuth-tasks`.
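The demo-Space → env-Space calls above can be sketched as a small HTTP client. The endpoint paths come from the diagram; the JSON payload and response shapes are assumptions for illustration, not the env Space's documented schema.

```python
# Minimal client sketch for the demo-Space -> env-Space calls.
# Endpoint paths are from the diagram above; payload shapes are assumed.
import json
import urllib.request

ENV_URL = "https://anugrah55-opensleuth-env-gemini-cli.hf.space"

def env_url(path: str) -> str:
    """Join an endpoint path onto the env Space base URL."""
    return ENV_URL.rstrip("/") + path

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        env_url(path),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def rollout(task: str, probes: list, code: str) -> dict:
    """One episode: reset, probe the hidden fn, then submit a replica."""
    post_json("/reset", {"task": task})                # start an episode (payload assumed)
    for x in probes:
        post_json("/step", {"action": "probe", "input": x})   # one (input -> output) pair
    return post_json("/step", {"action": "submit", "code": code})  # fuzzed + scored
```

The demo app streams each probe's `(input → output)` pair to the UI as it returns, rather than batching them.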
## CPU-basic notes
The demo runs on CPU-basic. First generation per backend cold-loads the
model (~30–90s for 0.5B). To keep latency bounded:
- `MAX_NEW_TOKENS=192`
- Models are cached across runs (in-process LRU).
- The 3B backend will only attempt a real load if the adapter repo has
weights pushed; otherwise it short-circuits to a clear UI message.
Configure via environment variables:
| Env var | Default |
| --- | --- |
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
| `MAX_NEW_TOKENS` | `192` |
| `N_PROBES` | `6` |
| `HF_TOKEN` | (optional, set as Space secret for gated models) |
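The table above maps to plain `os.environ` lookups with the listed defaults; the variable names are from the table, while the exact `app.py` code may differ.

```python
# Sketch of resolving the demo's configuration from environment variables,
# using the defaults listed in the table above.
import os

ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)
BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
ADAPTER_05B_ID = os.environ.get("ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo")
ADAPTER_3B_ID = os.environ.get("ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2")
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))
HF_TOKEN = os.environ.get("HF_TOKEN")  # optional; only needed for gated repos
```

On a Space, set these in the Settings → Variables and secrets panel; `HF_TOKEN` should be a secret, not a plain variable.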