---
title: OpenSleuth Env
emoji: πŸ•΅οΈ
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---
# OpenSleuth β€” Environment
FastAPI service that exposes an OpenEnv-style `/reset` + `/step` API for the
**Algorithmic Detective** task. An agent must reverse-engineer an unknown Python
function by probing it, then submit Python source that replicates it.
## Endpoints
| Method | Path | Body | Notes |
|-------:|---------------|----------------------------------------|----------------------------------------|
| GET | `/health` | β€” | Liveness probe (also reports Hub-catalog status). |
| GET    | `/functions`  | optional `?difficulty=easy\|medium\|hard` | Catalog of the 9 builtin black boxes (back-compat shape). |
| GET | `/tasks` | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST   | `/reset`      | `{"target_name": "fibonacci", "seed": 0}` *or* `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. Caller-supplied `target_code` wins over `target_name`. |
| POST | `/step` | `{"episode_id": "...", "action": {...}}` | One agent action. |
| GET | `/state/{eid}`| β€” | Inspect the live state of an episode (debug). |
### Action shapes
```json
{"action_type": "probe", "input_repr": "5"} // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```
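The docs above say `input_repr` goes through `ast.literal_eval` before the black box is called. The sketch below mirrors that parse step; the helper name and the tuple-fans-out-to-multiple-arguments convention are our assumptions, not the env's actual code.

```python
import ast

def parse_probe_input(input_repr: str):
    """Hypothetical helper mirroring the documented parse step:
    the env feeds input_repr through ast.literal_eval before the call."""
    value = ast.literal_eval(input_repr)
    # Assumption (not from the docs): a tuple literal fans out to a
    # multi-argument call; anything else is one positional argument.
    return value if isinstance(value, tuple) else (value,)

print(parse_probe_input("5"))          # (5,)
print(parse_probe_input("(3, 'ab')"))  # (3, 'ab')
```

Because `ast.literal_eval` only accepts literals, probes cannot smuggle in arbitrary expressions.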
### Reward (v0.3 – paper-driven update)
Inspired by Masud et al. 2026 (*Reward Engineering for RL in Software Tasks*,
arXiv:2601.19100) and Ibrahim et al. 2024 (*Comprehensive Overview of Reward
Engineering and Shaping*, arXiv:2408.10215).
* **Probe:** `-1` step cost, plus `+2` per newly-seen output, `+5` per
newly-seen exception type, **and `+0.5` per newly-explored input bucket**
(CovRL-Fuzz / SimHash-style coverage bonus).
* **Submit (terminal):**
`execution_reward βˆ’ complexity_penalty βˆ’ reward_hack_penalty βˆ’ floor_penalty
(+50 perfect bonus if 100% match)` where:
* `execution_reward` ∈ `[0, 100]` is computed over **stratified** fuzz
inputs: spec-defined `edge_cases` are *always* tested in addition to the
random fuzz batch, and the per-category match counts are returned in
`info["matches_by_category"]`.
* `floor_penalty` is a hard `-25` for sub-50% match-rate submissions
(Vul-R2 style; Wen et al. 2025), preventing agents from learning that
emitting *any* function pays out.
* `reward_hack_penalty` fires for static import-of-reference attempts
(`+25`) and for "constant-output" collapse against a diverse reference
(`+15`). The sandbox additionally **blocks** `__import__`, `open`,
`eval`, `exec`, `compile`, etc.
### Open-ended tasks (Level 2)
The env resolves a target function from three sources, in priority order:
1. **Caller-supplied** β€” `POST /reset` with `target_code` + `target_function_name`
(and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the
same hardened sandbox the verifier uses for agent submissions; static-import
of `opensleuth_*` is rejected up front. This lets a trainer hand the env an
arbitrary unseen task per rollout without any redeploy.
2. **Hub dataset** β€” [`anugrah55/opensleuth-tasks`](https://huggingface.co/datasets/anugrah55/opensleuth-tasks).
Loaded lazily on first `/reset`, cached in-process. Each row has
`{name, target_function_name, signature, description, difficulty,
source_code, edge_cases_json, fuzz_spec_json}`.
3. **Builtin registry** β€” the original 9 functions in `black_box.py` are kept
as the safety-net so the in-flight trainer keeps working unchanged. Builtins
*win* by name over Hub copies, so `target_name="fibonacci"` always resolves
to the in-process oracle.
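A caller-supplied reset (source 1) might look like the body below. Field names follow the docs; the target function itself and the exact `fuzz_spec` shape are illustrative assumptions.

```python
import json

# Illustrative one-shot /reset body for a caller-supplied target.
reset_body = {
    "target_code": "def double(x):\n    return 2 * x\n",
    "target_function_name": "double",
    "edge_cases": ["0", "-1", "1000000"],            # Python literal strings
    "fuzz_spec": {"x": {"min": -100, "max": 100}},   # hypothetical per-parameter override
    "seed": 0,
    "max_steps": 25,
}
print(json.dumps(reset_body, indent=2))
```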
#### Adding new tasks
* **Per-reset (one-shot)**: pass `target_code` + `target_function_name` to
`/reset`. Multi-arg signatures are supported via the auto-fuzzer (which
introspects `inspect.signature` + `typing.get_type_hints`); pass
`edge_cases` as a list of Python literal strings and `fuzz_spec` as a
per-parameter override map.
* **Persistent**: append a row to the Hub dataset and the env will pick it
up on its next process-start. The bootstrap script
(`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent β€”
re-running it overwrites the dataset with the latest builtin + curated
rows.
```bash
# Push the curated 9 + 6 = 15-task seed catalog.
PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
```
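The signature/type-hint introspection the auto-fuzzer relies on can be sketched as follows. This is a deliberate simplification (ints only, uniform draws), not the env's actual fuzzer:

```python
import inspect
import random
import typing

def auto_fuzz_inputs(fn, n=5, seed=0):
    """Sketch of the documented introspection step: read the signature and
    type hints, then draw one value per parameter per fuzz input."""
    rng = random.Random(seed)
    hints = typing.get_type_hints(fn)
    params = list(inspect.signature(fn).parameters)
    def draw(name):
        hint = hints.get(name, int)
        assert hint is int, f"sketch only handles int parameters, got {hint}"
        return rng.randint(-100, 100)
    return [tuple(draw(p) for p in params) for _ in range(n)]

def clamp(x: int, lo: int, hi: int) -> int:
    return max(lo, min(hi, x))

# Multi-arg signatures are handled because we fuzz each parameter.
print(auto_fuzz_inputs(clamp, n=2))
```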
### Backwards compatibility
Existing trainer / eval clients only read `info["execution_reward"]`,
`info["matches"]`, `info["fuzz_count"]` and `resp["reward"]` β€” all preserved
with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`,
`matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`,
`floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.
`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0,
"max_steps": 25}` works exactly as before. The four new optional fields
(`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are
additive. `/functions` returns the same shape as before (with one *additive*
`source` field). Open-ended/Hub tasks are exposed via the new `/tasks`
endpoint so older clients aren't surprised.
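The forward-compatible client pattern described above amounts to reading only the stable fields and ignoring anything additive. The response dict here is fabricated for illustration:

```python
# Fabricated /step response; only the stable fields are read.
resp = {
    "reward": 87.0,
    "done": True,
    "info": {
        "execution_reward": 92.0,
        "matches": 46,
        "fuzz_count": 50,
        # additive fields an older client never looks at:
        "matches_by_category": {"edge": 6, "fuzz": 40},
        "floor_penalty": 0,
    },
}
stable = {k: resp["info"][k] for k in ("execution_reward", "matches", "fuzz_count")}
print(stable["matches"] / stable["fuzz_count"], resp["reward"])  # 0.92 87.0
```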
## OpenEnv conformance
This Space targets the [meta-pytorch / OpenEnv](https://github.com/meta-pytorch/OpenEnv)
v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant
surface is mounted at **`/openenv/*`** alongside (not on top of) the legacy
endpoints listed above so the in-flight trainer keeps working unchanged.
| OpenEnv route | Path | Notes |
|--------------------------|-----------------------|----------------------------------------------------------|
| `GET /health` | `/openenv/health` | `{"status": "healthy"}` |
| `GET /metadata` | `/openenv/metadata` | `EnvironmentMetadata` (name, description, version, ...) |
| `GET /schema` | `/openenv/schema` | JSON schemas for `action`, `observation`, `state` |
| `GET /state` | `/openenv/state` | Episode `State` (episode_id, step_count, ...) |
| `POST /reset` | `/openenv/reset` | Returns `{"observation", "reward", "done"}` envelope |
| `POST /step`             | `/openenv/step`       | Body: `{"action": {"action_type": "probe"\|"submit", ...}}` |
| `WS /ws` | `/openenv/ws` | Persistent session: `reset` β†’ `step`* β†’ `state` β†’ `close` |
`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses
`openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware
harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations,
LightningAI Studio, ...) can pick it up via standard introspection.
### Talking to it as an OpenEnv client
```python
import asyncio

from openenv import GenericEnvClient, GenericAction


async def main():
    base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
    async with GenericEnvClient(base_url=base) as env:
        result = await env.reset(target_name="fibonacci", max_steps=8)
        result = await env.step(GenericAction(action_type="probe", input_repr="10"))
        print(result.observation["probe_history"][-1])


asyncio.run(main())
```
A runnable end-to-end example lives in [`example_client.py`](example_client.py).
### What is *not* yet conformant
* No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not
MCP tools, because the underlying probe/submit semantics map cleanly to a
single `OpenSleuthAction` discriminator. Adding MCP would be additive.
* No Rubric/EvalHarness integration (RFC 004) β€” reward shaping lives in
`opensleuth_env/env.py` and is intentionally not split into a separate
rubric for now.
## Hardware
CPU-only β€” `cpu-basic` is plenty. Do **not** assign GPU to this Space.
## Running locally
```bash
pip install -r requirements.txt
uvicorn server:app --port 7860 --reload
# legacy contract: http://localhost:7860/{health,reset,step,state/{eid}}
# OpenEnv-conformant surface: http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
```
To run only the OpenEnv conformance tests:
```bash
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
```