---
title: OpenSleuth Live Agent Demo
emoji: 🕵
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: Watch an LLM reverse-engineer a hidden Python fn live
---

# OpenSleuth — live agent demo

Pick a hidden black-box Python function from the OpenSleuth catalog (15 tasks: easy → hard, mix of builtin and Hub-pushed). Pick an agent backend (oracle, base Qwen 0.5B, trained Qwen 0.5B (LoRA), trained Qwen 3B (LoRA)). Watch the agent:

  1. Probe the env (6 inputs drawn from the same auto-fuzzer the verifier uses), one at a time, with each (input → output) pair streamed live.
  2. Submit a Python replica of the hidden function.
  3. Get verified by the env's domain-aware fuzzer: 100 random inputs + the spec's must-pass edge cases, with stratified pass-rates and a reward breakdown (execution / edge / complexity / hack penalties / perfect bonus).
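The reward breakdown can be sketched as a pure function. The weights and field names below are illustrative assumptions, not the env's actual scoring; only the component list (execution / edge / complexity / hack penalties / perfect bonus) comes from the spec:

```python
def score_submission(fuzz_passed, fuzz_total, edge_passed, edge_total,
                     complexity_penalty=0.0, hack_penalty=0.0):
    """Illustrative reward breakdown (weights are assumptions):
    execution + edge components, minus penalties, plus a perfect bonus."""
    execution = 50.0 * fuzz_passed / fuzz_total   # 100 random fuzz inputs
    edge = 30.0 * edge_passed / edge_total        # spec's must-pass edge cases
    reward = execution + edge - complexity_penalty - hack_penalty
    if fuzz_passed == fuzz_total and edge_passed == edge_total:
        reward += 20.0  # perfect bonus — a flawless replica lands at +100
    return reward
```

Under these made-up weights a fully correct submission scores 100, which lines up with the oracle backend always landing at +100.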

The submitted code is shown syntax-highlighted, and an optional accordion runs a quick oracle vs trained-0.5b head-to-head reward comparison on the selected task.

## Backends

| Backend | Source | Notes |
|---|---|---|
| oracle | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| base Qwen 0.5B | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| trained Qwen 0.5B (LoRA) | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| trained Qwen 3B (LoRA) | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to "adapter not yet trained" if the repo has no weights yet. |

## Architecture

```
[demo Space] ──HTTP──> [env Space]
   │                    /tasks, /tasks/{name}/sample_inputs,
   │                    /reset, /step (probe + submit)
   │
   └─ HF model load (lazy, cached): base + optional LoRA on CPU
```

- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
- The task catalog is `anugrah55/opensleuth-tasks`.
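A minimal client for those endpoints might look like the sketch below. The request/response payload shapes (`action`, `input`, `code` keys) are assumptions for illustration; the env Space's actual schema may differ:

```python
import json
import urllib.request


class EnvClient:
    """Tiny HTTP wrapper around the OpenSleuth env Space endpoints."""

    def __init__(self, base_url="https://anugrah55-opensleuth-env-gemini-cli.hf.space"):
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}/{path.lstrip('/')}"

    def _post(self, path, payload):
        req = urllib.request.Request(
            self._url(path),
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def reset(self, task):
        # start an episode on one of the catalog tasks
        return self._post("/reset", {"task": task})

    def probe(self, value):
        # a /step call that queries the hidden function with one input
        return self._post("/step", {"action": "probe", "input": value})

    def submit(self, code):
        # a /step call that submits the candidate replica for verification
        return self._post("/step", {"action": "submit", "code": code})
```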

## CPU-basic notes

The demo runs on CPU-basic. First generation per backend cold-loads the model (~30–90s for 0.5B). To keep latency bounded:

- `MAX_NEW_TOKENS=192`
- Models are cached across runs (in-process LRU).
- The 3B backend will only attempt a real load if the adapter repo has weights pushed; otherwise it short-circuits to a clear UI message.
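The in-process cache can be as simple as `functools.lru_cache` keyed on the backend name. The loader body below is a placeholder; the real one would pull the base model (and optional LoRA adapter) from the Hub via `transformers`/`peft`:

```python
from functools import lru_cache


@lru_cache(maxsize=4)
def load_backend(backend_name: str):
    """Load a backend once and cache it; repeat calls return the same object.

    Placeholder body — the real version would call something like
    AutoModelForCausalLM.from_pretrained(...) and optionally apply a LoRA
    adapter, short-circuiting to a UI message if the adapter repo is empty.
    """
    print(f"cold-loading {backend_name} ...")  # printed on the first call only
    return {"name": backend_name, "model": object()}


m1 = load_backend("trained-0.5b")
m2 = load_backend("trained-0.5b")  # cache hit: no second cold load
assert m1 is m2
```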

Configure with env vars:

| Env var | Default |
|---|---|
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
| `MAX_NEW_TOKENS` | `192` |
| `N_PROBES` | `6` |
| `HF_TOKEN` | (optional; set as a Space secret for gated models) |
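In app code these typically resolve through `os.environ.get` with the defaults above — a sketch of how `app.py` might read them (the variable names match the table; the exact parsing is an assumption):

```python
import os

# Endpoint and model identifiers, overridable via Space variables/secrets
ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)
BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
ADAPTER_05B_ID = os.environ.get("ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo")
ADAPTER_3B_ID = os.environ.get("ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2")

# Numeric knobs arrive as strings and need casting
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))

HF_TOKEN = os.environ.get("HF_TOKEN")  # optional Space secret for gated models
```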