---
title: OpenSleuth — Live Agent Demo
emoji: "\U0001F575"
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
---

# OpenSleuth — live agent demo

Pick a hidden black-box Python function from the OpenSleuth catalog (15 tasks: easy → hard, a mix of builtin and Hub-pushed). Pick an agent backend (`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, `trained Qwen 3B (LoRA)`). Watch the agent:

1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier uses), one at a time, with each `(input → output)` pair streamed live.
2. **Submit** a Python replica of the hidden function.
3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs plus the spec's must-pass edge cases, with stratified pass rates and a reward breakdown (execution / edge / complexity / hack penalties / perfect bonus).

The submitted code is shown syntax-highlighted, and an optional accordion runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on the selected task.

## Backends

| Backend | Source | Notes |
| --- | --- | --- |
| `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to an "adapter not yet trained" message if the repo has no weights yet. |

## Architecture

```
[demo Space] ──HTTP──> [env Space]
     │                   /tasks, /tasks/{name}/sample_inputs,
     │                   /reset, /step (probe + submit)
     │
     └─ HF model load (lazy, cached): base + optional LoRA on CPU
```

- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
- The task catalog is `anugrah55/opensleuth-tasks`.
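The HTTP side of the architecture above can be sketched as a tiny stdlib-only client. This is a sketch, not the demo's actual code: the endpoint paths (`/tasks`, `/step`, …) come from the diagram, but the payload keys (`task`, `action`, `input`) are assumptions rather than the env Space's documented schema.

```python
import json
import os
import urllib.request

# Base URL of the env Space; same default as the env-var table below.
ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)


def endpoint(path: str) -> str:
    """Join the env Space base URL with one of the API paths from the diagram."""
    return ENV_URL.rstrip("/") + path


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the env Space and decode the JSON response."""
    req = urllib.request.Request(
        endpoint(path),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical probe step — the payload shape is an assumption:
# result = post_json("/step", {"task": "...", "action": "probe", "input": 3})
```

A probe/submit loop would call `post_json("/reset", ...)` once, then `post_json("/step", ...)` for each of the `N_PROBES` probes and once more for the final submission.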
## CPU-basic notes

The demo runs on CPU-basic. The first generation per backend cold-loads the model (~30–90 s for 0.5B). To keep latency bounded:

- `MAX_NEW_TOKENS=192`
- Models are cached across runs (in-process LRU).
- The 3B backend only attempts a real load if the adapter repo has weights pushed; otherwise it short-circuits to a clear UI message.

Configure with env vars:

| Env var | Default |
| --- | --- |
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
| `MAX_NEW_TOKENS` | `192` |
| `N_PROBES` | `6` |
| `HF_TOKEN` | (optional, set as a Space secret for gated models) |
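The lazy, in-process-LRU model load described above can be sketched as follows. This is a minimal sketch of the pattern, not the demo's real loader: the function name and the `transformers`/`peft` calls are assumptions about the implementation, and the real loader also handles the 3B "adapter not yet trained" short-circuit.

```python
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=4)  # in-process cache: repeat runs reuse already-loaded models
def load_backend(base_model_id: str, adapter_id: Optional[str] = None):
    """Lazily load a base model on CPU, with an optional LoRA adapter on top."""
    # Imports are deferred so the Space starts serving before any model loads;
    # the first call per backend pays the cold-load cost (~30-90 s for 0.5B).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    model = AutoModelForCausalLM.from_pretrained(base_model_id)  # CPU by default
    if adapter_id is not None:
        from peft import PeftModel  # apply the GRPO LoRA adapter to the base

        model = PeftModel.from_pretrained(model, adapter_id)
    return tokenizer, model
```

Because the cache key is the `(base_model_id, adapter_id)` pair, the base 0.5B and its LoRA variant are cached as separate entries.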