---
title: OpenSleuth Live Agent Demo
emoji: 🕵️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: apache-2.0
suggested_hardware: cpu-basic
suggested_storage: small
short_description: "Watch an LLM reverse-engineer a hidden Python fn live"
---
# OpenSleuth — live agent demo
Pick a hidden black-box Python function from the OpenSleuth catalog (15
tasks, easy → hard, a mix of built-in and Hub-pushed) and an agent backend
(`oracle`, `base Qwen 0.5B`, `trained Qwen 0.5B (LoRA)`, or `trained Qwen
3B (LoRA)`). Then watch the agent:
1. **Probe** the env (6 inputs drawn from the same auto-fuzzer the verifier
uses), one at a time, with each `(input → output)` pair streamed live.
2. **Submit** a Python replica of the hidden function.
3. **Get verified** by the env's domain-aware fuzzer: 100 random inputs +
the spec's must-pass edge cases, with stratified pass-rates and a
reward breakdown (execution / edge / complexity / hack penalties /
perfect bonus).
The submitted code is shown syntax-highlighted, and an optional accordion
runs a quick `oracle` vs `trained-0.5b` head-to-head reward comparison on
the selected task.
## Backends
| Backend | Source | Notes |
| --- | --- | --- |
| `oracle` | `oracle.py` reference impl | Always +100; sanity-checks the env. |
| `base Qwen 0.5B` | `Qwen/Qwen2.5-0.5B-Instruct` | No fine-tuning. |
| `trained Qwen 0.5B (LoRA)` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` | GRPO LoRA on top of base 0.5B. |
| `trained Qwen 3B (LoRA)` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` | 3B GRPO run; falls back to "adapter not yet trained" if the repo has no weights yet. |
## Architecture
```
[demo Space] ──HTTP──> [env Space]
│ /tasks, /tasks/{name}/sample_inputs,
│ /reset, /step (probe + submit)
└─ HF model load (lazy, cached): base + optional LoRA on CPU
```
- The env Space is `anugrah55/opensleuth-env-gemini-cli`.
- The task catalog is `anugrah55/opensleuth-tasks`.
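The demo-Space → env-Space calls above can be sketched as a small HTTP client. The endpoint paths come from the diagram; the JSON payload and response shapes are assumptions for illustration, not the env Space's documented schema.

```python
# Minimal client sketch for the demo-Space -> env-Space calls.
# Endpoint paths are from the diagram above; payload shapes are assumed.
import json
import urllib.request

ENV_URL = "https://anugrah55-opensleuth-env-gemini-cli.hf.space"

def env_url(path: str) -> str:
    """Join an endpoint path onto the env Space base URL."""
    return ENV_URL.rstrip("/") + path

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        env_url(path),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def rollout(task: str, probes: list, code: str) -> dict:
    """One episode: reset, probe the hidden fn, then submit a replica."""
    post_json("/reset", {"task": task})                # start an episode (payload assumed)
    for x in probes:
        post_json("/step", {"action": "probe", "input": x})   # one (input -> output) pair
    return post_json("/step", {"action": "submit", "code": code})  # fuzzed + scored
```

The demo app streams each probe's `(input → output)` pair to the UI as it returns, rather than batching them.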
## CPU-basic notes
The demo runs on CPU-basic. First generation per backend cold-loads the
model (~30–90s for 0.5B). To keep latency bounded:
- `MAX_NEW_TOKENS=192`
- Models are cached across runs (in-process LRU).
- The 3B backend will only attempt a real load if the adapter repo has
weights pushed; otherwise it short-circuits to a clear UI message.
Configure via environment variables:
| Env var | Default |
| --- | --- |
| `OPENSLEUTH_ENV_URL` | `https://anugrah55-opensleuth-env-gemini-cli.hf.space` |
| `BASE_MODEL_ID` | `Qwen/Qwen2.5-0.5B-Instruct` |
| `BASE_MODEL_3B_ID` | `Qwen/Qwen2.5-3B-Instruct` |
| `ADAPTER_05B_ID` | `anugrah55/opensleuth-qwen2.5-0.5b-grpo` |
| `ADAPTER_3B_ID` | `anugrah55/opensleuth-qwen2.5-3b-grpo-v2` |
| `MAX_NEW_TOKENS` | `192` |
| `N_PROBES` | `6` |
| `HF_TOKEN` | (optional, set as Space secret for gated models) |
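The table above maps to plain `os.environ` lookups with the listed defaults; the variable names are from the table, while the exact `app.py` code may differ.

```python
# Sketch of resolving the demo's configuration from environment variables,
# using the defaults listed in the table above.
import os

ENV_URL = os.environ.get(
    "OPENSLEUTH_ENV_URL",
    "https://anugrah55-opensleuth-env-gemini-cli.hf.space",
)
BASE_MODEL_ID = os.environ.get("BASE_MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")
BASE_MODEL_3B_ID = os.environ.get("BASE_MODEL_3B_ID", "Qwen/Qwen2.5-3B-Instruct")
ADAPTER_05B_ID = os.environ.get("ADAPTER_05B_ID", "anugrah55/opensleuth-qwen2.5-0.5b-grpo")
ADAPTER_3B_ID = os.environ.get("ADAPTER_3B_ID", "anugrah55/opensleuth-qwen2.5-3b-grpo-v2")
MAX_NEW_TOKENS = int(os.environ.get("MAX_NEW_TOKENS", "192"))
N_PROBES = int(os.environ.get("N_PROBES", "6"))
HF_TOKEN = os.environ.get("HF_TOKEN")  # optional; only needed for gated repos
```

On a Space, set these in the Settings → Variables and secrets panel; `HF_TOKEN` should be a secret, not a plain variable.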