---
title: OpenSleuth Env
emoji: 🕵️
colorFrom: indigo
colorTo: pink
sdk: docker
app_port: 7860
pinned: false
suggested_hardware: cpu-basic
---

# OpenSleuth — Environment

A FastAPI service exposing an OpenEnv-style `/reset` + `/step` API for the Algorithmic Detective task: an agent must infer the behavior of an unknown Python function by probing it, then submit Python source that replicates it.

## Endpoints

| Method | Path | Body | Notes |
| --- | --- | --- | --- |
| GET | `/health` | — | Liveness probe (also reports Hub-catalog status). |
| GET | `/functions` | optional `?difficulty=easy\|medium\|hard` | Catalogue of the 9 builtin black-boxes (back-compat shape). |
| GET | `/tasks` | optional `?source=builtin\|hub\|all` | Open-ended catalog (Level 2): builtins + Hub-loaded rows. |
| POST | `/reset` | `{"target_name": "fibonacci", "seed": 0}` or `{"target_code": "...", "target_function_name": "..."}` | Starts an episode. A caller-supplied `target_code` wins over `target_name`. |
| POST | `/step` | `{"episode_id": "...", "action": {...}}` | One agent action. |
| GET | `/state/{eid}` | — | Inspect the live state of an episode (debug). |

## Action shapes

```
{"action_type": "probe",  "input_repr": "5"}             // input_repr is parsed via ast.literal_eval
{"action_type": "submit", "code": "def fibonacci(n):..."}
```
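As a sketch of how `input_repr` round-trips through `ast.literal_eval` (the env's actual parsing lives server-side; `parse_probe_input` here is a hypothetical stand-in):

```python
import ast

def parse_probe_input(input_repr: str) -> tuple:
    """Parse a probe's input_repr with ast.literal_eval, as the README
    describes. A tuple literal like "(3, 4)" can stand for a multi-arg
    call; anything else is treated as one positional argument.
    Hypothetical helper, not the env's real code."""
    value = ast.literal_eval(input_repr)
    return value if isinstance(value, tuple) else (value,)

print(parse_probe_input("5"))       # (5,)
print(parse_probe_input("(3, 4)"))  # (3, 4)
print(parse_probe_input("[1, 2]"))  # ([1, 2],)
```

Because `literal_eval` only accepts Python literals, a probe string like `"os.system('...')"` raises `ValueError` instead of executing anything.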

## Reward (v0.3 – paper-driven update)

Inspired by Masud et al. 2026 (Reward Engineering for RL in Software Tasks, arXiv:2601.19100) and Ibrahim et al. 2024 (Comprehensive Overview of Reward Engineering and Shaping, arXiv:2408.10215).

- **Probe:** −1 step cost, plus +2 per newly-seen output, +5 per newly-seen exception type, and +0.5 per newly-explored input bucket (CovRL-Fuzz / SimHash-style coverage bonus).
- **Submit (terminal):** `execution_reward − complexity_penalty − reward_hack_penalty − floor_penalty` (+50 perfect bonus at a 100% match), where:
  - `execution_reward ∈ [0, 100]` is computed over stratified fuzz inputs: spec-defined `edge_cases` are always tested in addition to the random fuzz batch, and the per-category match counts are returned in `info["matches_by_category"]`.
  - `floor_penalty` is a hard −25 for sub-50% match-rate submissions (Vul-R2 style; Wen et al. 2025), preventing agents from learning that emitting any function pays out.
  - `reward_hack_penalty` fires for static import-of-reference attempts (+25) and for "constant-output" collapse against a diverse reference (+15). The sandbox additionally blocks `__import__`, `open`, `eval`, `exec`, `compile`, etc.
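The terminal-submit arithmetic described above can be sketched as follows (constant names and thresholds are read off this section, not taken from the env's source):

```python
def submit_reward(match_rate: float,
                  complexity_penalty: float = 0.0,
                  reward_hack_penalty: float = 0.0) -> float:
    """Sketch of the submit-time reward: execution_reward scaled to
    [0, 100], minus penalties, a hard -25 floor penalty below a 50%
    match rate, and a +50 bonus at a perfect match. Mirrors the prose
    above; the real shaping lives in opensleuth_env/env.py."""
    execution_reward = 100.0 * match_rate
    floor_penalty = 25.0 if match_rate < 0.5 else 0.0
    perfect_bonus = 50.0 if match_rate == 1.0 else 0.0
    return (execution_reward - complexity_penalty
            - reward_hack_penalty - floor_penalty + perfect_bonus)

print(submit_reward(1.0))   # 150.0: perfect match, +50 bonus
print(submit_reward(0.4))   # 15.0: 40 execution reward minus the -25 floor
```

Note how the floor makes a 40% submission barely better than doing nothing, which is exactly the "any function pays out" failure mode the section is guarding against.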

## Open-ended tasks (Level 2)

The env resolves a target function from three sources, in priority order:

1. **Caller-supplied** — `POST /reset` with `target_code` + `target_function_name` (and optionally `edge_cases` + `fuzz_spec`). The source is compiled in the same hardened sandbox the verifier uses for agent submissions; static import of `opensleuth_*` is rejected up front. This lets a trainer hand the env an arbitrary unseen task per rollout without any redeploy.

2. **Hub dataset** — `anugrah55/opensleuth-tasks`. Loaded lazily on the first `/reset`, cached in-process. Each row has `{name, target_function_name, signature, description, difficulty, source_code, edge_cases_json, fuzz_spec_json}`.

3. **Builtin registry** — the original 9 functions in `black_box.py` are kept as a safety net so the in-flight trainer keeps working unchanged. Builtins win by name over Hub copies, so `target_name="fibonacci"` always resolves to the in-process oracle.
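The up-front static-import rejection can be sketched with the standard-library `ast` module (`rejects_static_import` is a hypothetical stand-in; the env's real sandbox does more, e.g. blocking `__import__` at runtime):

```python
import ast

def rejects_static_import(source: str, banned_prefix: str = "opensleuth_") -> bool:
    """Return True if the source statically imports a banned module.

    Walks the AST looking at `import X` / `from X import ...` nodes.
    Dynamic tricks (__import__, importlib) are a runtime-sandbox concern
    and are not caught here. Hypothetical sketch, not the env's code.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.startswith(banned_prefix) for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.startswith(banned_prefix):
                return True
    return False

assert rejects_static_import("import opensleuth_env.black_box as bb")
assert not rejects_static_import("def fibonacci(n):\n    return n")
```

A static AST pass like this is cheap enough to run on every `/reset`, which is why the caller-supplied path can afford to vet arbitrary code before compiling it.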

## Adding new tasks

- **Per-reset (one-shot):** pass `target_code` + `target_function_name` to `/reset`. Multi-arg signatures are supported via the auto-fuzzer (which introspects `inspect.signature` + `typing.get_type_hints`); pass `edge_cases` as a list of Python literal strings and `fuzz_spec` as a per-parameter override map.

- **Persistent:** append a row to the Hub dataset; the env will pick it up on its next process start. The bootstrap script (`opensleuth_env/scripts/bootstrap_tasks_dataset.py`) is idempotent — re-running it overwrites the dataset with the latest builtin + curated rows.

```sh
# Push the curated 9 + 6 = 15-task seed catalog.
PYTHONPATH=. python -m opensleuth_env.scripts.bootstrap_tasks_dataset
```
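A one-shot payload for the per-reset path might look like the following. The top-level field names come from the `/reset` contract in this README; the `clamp` task and the exact shape of the `fuzz_spec` entries are illustrative assumptions, and the dict is only built and serialized locally, no request is sent:

```python
import json

# Hypothetical single-rollout task; field names follow the /reset contract above.
payload = {
    "target_code": (
        "def clamp(x, lo, hi):\n"
        "    return max(lo, min(hi, x))\n"
    ),
    "target_function_name": "clamp",
    # Python literal strings, per "Adding new tasks" above.
    "edge_cases": ["(5, 0, 10)", "(-3, 0, 10)", "(99, 0, 10)"],
    # Per-parameter override map for the auto-fuzzer (entry shape is a guess).
    "fuzz_spec": {"x": {"type": "int"}, "lo": {"type": "int"}, "hi": {"type": "int"}},
    "max_steps": 25,
}

body = json.dumps(payload)  # ready to POST to /reset
```

Because `target_code` wins over `target_name`, a trainer can generate one such payload per rollout and never touch the builtin or Hub catalogs.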

## Backwards compatibility

Existing trainer / eval clients only read `info["execution_reward"]`, `info["matches"]`, `info["fuzz_count"]` and `resp["reward"]` — all preserved with the same meaning. New fields (`difficulty`, `coverage_buckets_seen`, `matches_by_category`, `edge_pass_rate`, `reward_hack_penalty`, `floor_penalty`, `perfect_bonus`) are additive and ignored by older clients.
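A client can stay oblivious to the additive fields by reading them with defaults; a minimal sketch (the `info` dicts below mimic the fields listed above, and `summarize_step` is a hypothetical helper):

```python
def summarize_step(info: dict) -> dict:
    """Read the stable v0.3 fields directly and the additive ones via
    .get() defaults, so one client works against old and new servers."""
    return {
        "execution_reward": info["execution_reward"],        # stable contract
        "matches": info["matches"],                          # stable contract
        "fuzz_count": info["fuzz_count"],                    # stable contract
        "edge_pass_rate": info.get("edge_pass_rate"),        # additive field
        "floor_penalty": info.get("floor_penalty", 0.0),     # additive field
    }

# An older server that only emits the v0.3 fields:
old_server_info = {"execution_reward": 80.0, "matches": 40, "fuzz_count": 50}
print(summarize_step(old_server_info)["floor_penalty"])  # 0.0 — default kicks in
```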

`/reset` retains its v0.3 shape: `{"target_name": "fibonacci", "seed": 0, "max_steps": 25}` works exactly as before. The four new optional fields (`target_code`, `target_function_name`, `edge_cases`, `fuzz_spec`) are additive. `/functions` returns the same shape as before (with one additive `source` field). Open-ended/Hub tasks are exposed via the new `/tasks` endpoint so older clients aren't surprised.

## OpenEnv conformance

This Space targets the meta-pytorch / OpenEnv v0.2.3 spec (`pip install openenv-core==0.2.3`). The OpenEnv-conformant surface is mounted at `/openenv/*` alongside (not on top of) the legacy endpoints listed above, so the in-flight trainer keeps working unchanged.

| OpenEnv route | Path | Notes |
| --- | --- | --- |
| `GET /health` | `/openenv/health` | `{"status": "healthy"}` |
| `GET /metadata` | `/openenv/metadata` | `EnvironmentMetadata` (name, description, version, ...) |
| `GET /schema` | `/openenv/schema` | JSON schemas for action, observation, state |
| `GET /state` | `/openenv/state` | Episode `State` (episode_id, step_count, ...) |
| `POST /reset` | `/openenv/reset` | Returns `{"observation", "reward", "done"}` envelope |
| `POST /step` | `/openenv/step` | Body: `{"action": {"action_type": "probe", ...}}` envelope |
| `WS /ws` | `/openenv/ws` | Persistent session: reset → step* → state → close |

`OpenSleuthEnvironment` (in `opensleuth_env/openenv_adapter.py`) subclasses `openenv.core.env_server.interfaces.Environment`, so any OpenEnv-aware harness (`openenv` CLI, `GenericEnvClient`, TRL/torchforge integrations, LightningAI Studio, ...) can pick it up via standard introspection.

## Talking to it as an OpenEnv client

```python
import asyncio
from openenv import GenericEnvClient, GenericAction

async def main():
    base = "https://anugrah55-opensleuth-env-gemini-cli.hf.space/openenv"
    async with GenericEnvClient(base_url=base) as env:
        result = await env.reset(target_name="fibonacci", max_steps=8)
        result = await env.step(GenericAction(action_type="probe", input_repr="10"))
        print(result.observation["probe_history"][-1])

asyncio.run(main())
```

A runnable end-to-end example lives in `example_client.py`.

## What is not yet conformant

- No MCP tool surface (RFC 003). Our actions are typed Pydantic models, not MCP tools, because the underlying probe/submit semantics map cleanly to a single `OpenSleuthAction` discriminator. Adding MCP would be additive.
- No Rubric/EvalHarness integration (RFC 004) — reward shaping lives in `opensleuth_env/env.py` and is intentionally not split into a separate rubric for now.

## Hardware

CPU-only — `cpu-basic` is plenty. Do not assign a GPU to this Space.

## Running locally

```sh
pip install -r requirements.txt
uvicorn server:app --port 7860 --reload
# legacy contract:                http://localhost:7860/{health,reset,step,state/{eid}}
# OpenEnv-conformant surface:     http://localhost:7860/openenv/{health,reset,step,state,schema,metadata,ws}
```

To run only the OpenEnv conformance tests:

```sh
PYTHONPATH=. python -m pytest tests/test_openenv_conformance.py -v
```