---
title: OpenSleuth Live Agent Demo
emoji: 🕵
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: Watch an LLM reverse-engineer a hidden Python fn live
---

# OpenSleuth — live agent demo

Pick a hidden black-box Python function from the OpenSleuth catalog (15 tasks: easy → hard, mix of builtin and Hub-pushed). Pick an agent backend (oracle, base Qwen 0.5B, trained Qwen 0.5B (LoRA), trained Qwen 3B (LoRA)). Watch the agent:

  1. Probe the env (6 inputs drawn from the same auto-fuzzer the verifier uses), one at a time, with each (input → output) pair streamed live.
  2. Submit a Python replica of the hidden function.
  3. Get verified by the env's domain-aware fuzzer: 100 random inputs + the spec's must-pass edge cases, with stratified pass-rates and a reward breakdown (execution / edge / complexity / hack penalties / perfect bonus).
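The reward breakdown can be sketched as a pure function. The weights and field names below are illustrative assumptions, not the env's actual scoring; only the component list (execution / edge / complexity / hack penalties / perfect bonus) comes from the spec:

```python
def score_submission(fuzz_passed, fuzz_total, edge_passed, edge_total,
                     complexity_penalty=0.0, hack_penalty=0.0):
    """Illustrative reward breakdown (weights are assumptions):
    execution + edge components, minus penalties, plus a perfect bonus."""
    execution = 50.0 * fuzz_passed / fuzz_total   # 100 random fuzz inputs
    edge = 30.0 * edge_passed / edge_total        # spec's must-pass edge cases
    reward = execution + edge - complexity_penalty - hack_penalty
    if fuzz_passed == fuzz_total and edge_passed == edge_total:
        reward += 20.0  # perfect bonus — a flawless replica lands at +100
    return reward
```

Under these made-up weights a fully correct submission scores 100, which lines up with the oracle backend always landing at +100.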

The submitted code is shown syntax-highlighted, and an optional accordion runs a quick oracle vs trained-0.5b head-to-head reward comparison on the selected task.

## Backends

| Backend | Source | Notes |
|---|---|---|
| oracle | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| base Qwen 0.5B | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| trained Qwen 0.5B (LoRA) | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| trained Qwen 3B (LoRA) | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to "adapter not yet trained" if the repo has no weights yet. |

## Architecture

```
[demo Space] ──HTTP──> [env Space]
   │                    /tasks, /tasks/{name}/sample_inputs,
   │                    /reset, /step (probe + submit)
   │
   └─ HF model load (lazy, cached): base + optional LoRA on CPU
```

- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
- The task catalog is `anugrah55/opensleuth-tasks`.
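A minimal client for those endpoints might look like the sketch below. The request/response payload shapes (`action`, `input`, `code` keys) are assumptions for illustration; the env Space's actual schema may differ:

```python
import json
import urllib.request


class EnvClient:
    """Tiny HTTP wrapper around the OpenSleuth env Space endpoints."""

    def __init__(self, base_url="https://anugrah55-opensleuth-env-gemini-cli.hf.space"):
        self.base_url = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base_url}/{path.lstrip('/')}"

    def _post(self, path, payload):
        req = urllib.request.Request(
            self._url(path),
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def reset(self, task):
        # start an episode on one of the catalog tasks
        return self._post("/reset", {"task": task})

    def probe(self, value):
        # a /step call that queries the hidden function with one input
        return self._post("/step", {"action": "probe", "input": value})

    def submit(self, code):
        # a /step call that submits the candidate replica for verification
        return self._post("/step", {"action": "submit", "code": code})
```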

## CPU-basic notes

The demo runs on CPU-basic. First generation per backend cold-loads the model (~30–90s for 0.5B). To keep latency bounded:

- `MAX_NEW_TOKENS=192`
- Models are cached across runs (in-process LRU).
- The 3B backend will only attempt a real load if the adapter repo has weights pushed; otherwise it short-circuits to a clear UI message.
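The in-process cache can be as simple as `functools.lru_cache` keyed on the backend name. The loader body below is a placeholder; the real one would pull the base model (and optional LoRA adapter) from the Hub via `transformers`/`peft`:

```python
from functools import lru_cache


@lru_cache(maxsize=4)
def load_backend(backend_name: str):
    """Load a backend once and cache it; repeat calls return the same object.

    Placeholder body — the real version would call something like
    AutoModelForCausalLM.from_pretrained(...) and optionally apply a LoRA
    adapter, short-circuiting to a UI message if the adapter repo is empty.
    """
    print(f"cold-loading {backend_name} ...")  # printed on the first call only
    return {"name": backend_name, "model": object()}


m1 = load_backend("trained-0.5b")
m2 = load_backend("trained-0.5b")  # cache hit: no second cold load
assert m1 is m2
```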

Configure with env vars:

| Env var | Default |
|---|---|
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
| `MAX_NEW_TOKENS` | `192` |
| `N_PROBES` | `6` |
| `HF_TOKEN` | (optional; set as a Space secret for gated models) |
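In app code these typically resolve through `os.environ.get` with the defaults above — a sketch of how `app.py` might read them (the variable names match the table; the exact parsing is an assumption):

```python
import os

# Endpoint and model identifiers, overridable via Space variables/secrets
ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)
BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
ADAPTER_05B_ID = os.environ.get("ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo")
ADAPTER_3B_ID = os.environ.get("ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2")

# Numeric knobs arrive as strings and need casting
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))

HF_TOKEN = os.environ.get("HF_TOKEN")  # optional Space secret for gated models
```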