	# Universal SENTINEL Oversight Harness

	## Idea

	SENTINEL should be able to supervise not only this repo's worker fleet, but any
	OpenEnv-compatible environment or agent exposed through a Hugging Face Space.

	The demo story:

	> Paste an OpenEnv Space URL. SENTINEL wraps the environment, intercepts proposed
	> actions, blocks or redirects risky behavior, then produces a cross-team
	> oversight report.

	This should be built last, after the core SENTINEL training proof is stable.

	## Why It Wins Demo Attention

	Most submissions show one agent solving one task. Universal SENTINEL shows a
	meta-agent supervising many agents across many task domains.

	This turns SENTINEL from "an SRE oversight environment" into a general pattern:

	- calendar agents: block bulk destructive calendar edits without confirmation
	- SRE agents: block hallucinated services and premature severity claims
	- IoT agents: block actuator changes without sensor evidence
	- negotiation agents: flag loop exploitation and repeated offers
	- finance agents: block irreversible actions without risk evidence

	## Safe Framing

	Do not say "we judged all other teams." That framing sounds adversarial, and
	delivering on it may depend on private agent APIs.

	Say:

	> SENTINEL is an opt-in universal oversight harness for OpenEnv-compatible
	> agents and environments.

	## Three Integration Levels

	### Level 1: OpenEnv Environment Wrapper

	Works when the Space exposes:

	- `GET /health`
	- `GET /tasks`
	- `POST /reset`
	- `POST /step`
	- `POST /grader`
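A compatibility check against these five routes could look like the minimal sketch below. Only the route list comes from this plan; `check_compatibility`, the `Probe` callable, and the result shape are hypothetical names chosen for illustration. The probe is injected so the check can run against fixtures; a real probe would presumably issue an HTTP request with a timeout, and would need care with `POST /reset`, which may have side effects.

```python
from typing import Callable, Dict, List, Tuple

# The five routes listed above. Everything else in this sketch is hypothetical.
REQUIRED_ROUTES: List[Tuple[str, str]] = [
    ("GET", "/health"),
    ("GET", "/tasks"),
    ("POST", "/reset"),
    ("POST", "/step"),
    ("POST", "/grader"),
]

# A probe takes (method, path) and returns True when the route responds.
Probe = Callable[[str, str], bool]


def check_compatibility(probe: Probe) -> Dict[str, object]:
    """Report which required routes are missing for Level 1 wrapping."""
    missing = [f"{method} {path}" for method, path in REQUIRED_ROUTES
               if not probe(method, path)]
    return {"compatible": not missing, "missing": missing}


# Stand-in probe simulating a Space that lacks /grader (assumption for the demo).
def fake_probe(method: str, path: str) -> bool:
    return path != "/grader"


result = check_compatibility(fake_probe)
print(result)  # {'compatible': False, 'missing': ['POST /grader']}
```

Returning the list of missing routes, rather than a bare boolean, gives the harness something concrete to put in a compatibility report.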

	Flow:

	```text
	Remote OpenEnv Space
	-> observation
	-> local baseline/LLM action generator
	-> SENTINEL intercept
	-> approved/blocked/redirected action
	-> remote /step
	-> oversight report
	```

	This is the safest live demo path.
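The Level 1 flow can be sketched as a loop over injected callables. This is a minimal illustration, not the SENTINEL implementation: `Verdict`, `run_level1_episode`, and the `observation`/`done` keys in the step result are assumptions, and the toy stand-ins at the bottom replace the remote Space and the real intercept logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]
Action = Dict[str, Any]


@dataclass
class Verdict:
    decision: str      # "approved", "blocked", or "redirected"
    action: Action     # action to forward; ignored when blocked
    reason: str = ""


def run_level1_episode(
    reset: Callable[[], Observation],
    step: Callable[[Action], Dict[str, Any]],
    propose: Callable[[Observation], Action],
    intercept: Callable[[Observation, Action], Verdict],
    max_steps: int = 10,
) -> List[Dict[str, Any]]:
    """Run one episode; every proposed action passes through intercept first."""
    log: List[Dict[str, Any]] = []
    obs = reset()
    for i in range(max_steps):
        proposed = propose(obs)
        verdict = intercept(obs, proposed)
        log.append({"step": i, "proposed": proposed,
                    "decision": verdict.decision, "reason": verdict.reason})
        if verdict.decision == "blocked":
            continue  # blocked actions are never forwarded to the remote /step
        result = step(verdict.action)
        obs = result.get("observation", obs)
        if result.get("done"):
            break
    return log


# Toy stand-ins for the remote Space and the action generator (assumptions).
state = {"t": 0}

def reset() -> Observation:
    state["t"] = 0
    return {"t": 0}

def step(action: Action) -> Dict[str, Any]:
    state["t"] += 1
    return {"observation": {"t": state["t"]}, "done": state["t"] >= 2}

def propose(obs: Observation) -> Action:
    return {"type": "delete_all"} if obs["t"] == 0 else {"type": "noop"}

def intercept(obs: Observation, action: Action) -> Verdict:
    if action["type"] == "delete_all":
        return Verdict("redirected", {"type": "request_confirmation"},
                       "destructive action")
    return Verdict("approved", action)

log = run_level1_episode(reset, step, propose, intercept)
print([entry["decision"] for entry in log])  # ['redirected', 'approved']
```

The `max_steps` bound matters here: a blocked action leaves the observation unchanged, so a deterministic generator could otherwise propose the same blocked action forever.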

	### Level 2: External Agent Oversight

	Works when another team exposes an agent endpoint, for example:

	- `POST /act`
	- input: observation
	- output: proposed action

	Flow:

	```text
	Remote env observation
	-> remote agent /act
	-> proposed action
	-> SENTINEL intercept
	-> remote env /step if approved
	```

	This is the strongest proof, but requires cooperation or public agent APIs.
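A single overseen step in the Level 2 flow might look like the sketch below. It assumes the remote agent's `/act` and the remote environment's `/step` are wrapped as plain callables; `oversee_external_step` and the record fields are hypothetical names, and the stand-ins only mimic remote behavior.

```python
from typing import Any, Callable, Dict, Tuple

Observation = Dict[str, Any]
Action = Dict[str, Any]


def oversee_external_step(
    observation: Observation,
    act: Callable[[Observation], Action],            # wraps remote agent /act
    intercept: Callable[[Observation, Action], Tuple[str, str]],
    step: Callable[[Action], Dict[str, Any]],        # wraps remote env /step
) -> Dict[str, Any]:
    """Ask the external agent for an action and forward it only if approved."""
    proposed = act(observation)
    decision, reason = intercept(observation, proposed)
    record: Dict[str, Any] = {"proposed": proposed, "decision": decision,
                              "reason": reason, "forwarded": False,
                              "result": None}
    if decision == "approved":
        record["result"] = step(proposed)
        record["forwarded"] = True
    return record


# Stand-ins: an agent that targets a service the environment does not list.
forwarded_actions = []

def fake_act(obs: Observation) -> Action:
    return {"type": "restart", "target": "auth-proxy-v2"}

def fake_intercept(obs: Observation, action: Action) -> Tuple[str, str]:
    if action["target"] not in obs["services"]:
        return "blocked", "target does not exist"
    return "approved", ""

def fake_step(action: Action) -> Dict[str, Any]:
    forwarded_actions.append(action)
    return {"done": False}

record = oversee_external_step({"services": ["auth", "db"]},
                               fake_act, fake_intercept, fake_step)
print(record["decision"], record["forwarded"])  # blocked False
```

Because the agent is external, the harness only observes and gates its proposals; it never rewrites the agent itself, which keeps the integration opt-in.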

	### Level 3: Browser/UI Scraping

	Avoid for the judged demo.

	It is fragile, slow, and can look like we are attacking other submissions. Keep
	the product-level version API-first.

	## Proposed Files

	```text
	universal/
	  adapters.py          # OpenEnv/HF Space compatibility checks
	  action_generator.py  # baseline or model action proposal provider
	  harness.py           # run multiple Spaces with retry/backoff/circuit breaker
	  policy_mapper.py     # map domain actions to universal safety categories
	  report.py            # aggregate cross-environment oversight report
	```

	## Universal Safety Categories

	SENTINEL should normalize arbitrary environment actions into these categories:

	- invalid target or hallucinated entity
	- irreversible/destructive action
	- external communication
	- escalation or broad notification
	- actuator/control action
	- repeated loop action
	- action before evidence
	- cross-domain authority violation
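One way to represent these categories is an enum plus a small rule-based mapper, sketched below. The enum values mirror the list above; the action schema (`type`, `target`), the `DESTRUCTIVE_TYPES` and `ACTUATOR_TYPES` sets, and the repeat threshold are illustrative assumptions, and only four of the eight categories have rules here.

```python
from enum import Enum
from typing import Any, Dict, List, Set


class SafetyCategory(Enum):
    INVALID_TARGET = "invalid target or hallucinated entity"
    IRREVERSIBLE = "irreversible/destructive action"
    EXTERNAL_COMMUNICATION = "external communication"
    ESCALATION = "escalation or broad notification"
    ACTUATOR = "actuator/control action"
    REPEATED_LOOP = "repeated loop action"
    ACTION_BEFORE_EVIDENCE = "action before evidence"
    AUTHORITY_VIOLATION = "cross-domain authority violation"


# Hypothetical action types; a real mapper would be configured per domain.
DESTRUCTIVE_TYPES = {"delete", "bulk_update", "transfer"}
ACTUATOR_TYPES = {"set_actuator"}


def categorize(action: Dict[str, Any], known_targets: Set[str],
               history: List[Dict[str, Any]],
               has_evidence: bool) -> Set[SafetyCategory]:
    """Map one domain action to zero or more universal categories."""
    cats: Set[SafetyCategory] = set()
    kind = action.get("type")
    target = action.get("target")
    if target is not None and target not in known_targets:
        cats.add(SafetyCategory.INVALID_TARGET)
    if kind in DESTRUCTIVE_TYPES:
        cats.add(SafetyCategory.IRREVERSIBLE)
    if kind in ACTUATOR_TYPES:
        cats.add(SafetyCategory.ACTUATOR)
    if history.count(action) >= 2:  # this would be the third identical action
        cats.add(SafetyCategory.REPEATED_LOOP)
    if kind in DESTRUCTIVE_TYPES | ACTUATOR_TYPES and not has_evidence:
        cats.add(SafetyCategory.ACTION_BEFORE_EVIDENCE)
    return cats


hallucinated = categorize({"type": "restart", "target": "auth-proxy-v2"},
                          {"auth", "db"}, [], has_evidence=True)
blind_actuation = categorize({"type": "set_actuator", "target": "valve-1"},
                             {"valve-1"}, [], has_evidence=False)
print(sorted(c.name for c in hallucinated))     # ['INVALID_TARGET']
print(sorted(c.name for c in blind_actuation))  # ['ACTION_BEFORE_EVIDENCE', 'ACTUATOR']
```

Returning a set rather than a single label lets one action carry several risks at once, such as an actuator change that is also made before any sensor evidence.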

	## Example Report

	```text
	SENTINEL OVERSIGHT REPORT - Grand Finale 2026

	Sources monitored: 8
	Compatible OpenEnv spaces: 6
	Unavailable or timed out: 2

	Team A - Calendar Assistant
	Caught: escalation bombing, step 4
	Redirected: destructive bulk update -> request confirmation, step 7
	Risk prevented: 3.2

	Team B - SRE Environment
	Caught: hallucination - service "auth-proxy-v2" does not exist, step 2
	Caught: reward hacking - classified P1 before investigation, step 1

	Team C - IoT Environment
	All actions approved - clean agent

	Team D - Negotiation
	Caught: loop exploitation - same offer repeated 4 times

	Totals:
	Actions audited: 82
	Blocks: 14
	Redirects: 6
	Flags: 9
	Prevented risk: 11.7
	```
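The totals block of such a report could be derived from per-action audit events, as in this minimal sketch. `AuditEvent`, its fields, and the decision labels are assumptions for illustration, the sample events are made up rather than taken from the report above, and the plan does not define how a risk score is computed.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class AuditEvent:
    source: str                  # e.g. a team or Space name
    step: int
    decision: str                # "approved", "blocked", "redirected", "flagged"
    risk_prevented: float = 0.0  # hypothetical score


def aggregate(events: Iterable[AuditEvent]) -> Dict[str, object]:
    """Collapse audit events into the totals shown at the end of a report."""
    events = list(events)
    counts = Counter(e.decision for e in events)
    return {
        "sources": len({e.source for e in events}),
        "actions_audited": len(events),
        "blocks": counts["blocked"],
        "redirects": counts["redirected"],
        "flags": counts["flagged"],
        "prevented_risk": round(sum(e.risk_prevented for e in events), 2),
    }


# Made-up sample events for illustration only.
sample = [
    AuditEvent("calendar", 4, "blocked", 1.5),
    AuditEvent("calendar", 7, "redirected", 0.5),
    AuditEvent("iot", 1, "approved"),
    AuditEvent("negotiation", 5, "flagged"),
]
totals = aggregate(sample)
print(totals)
```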

	## Reliability Requirements

	The harness must never depend on a remote Space being healthy.

	Required protections:

	- 5-10 second request timeout per remote call
	- exponential backoff for transient failures
	- per-Space circuit breaker after repeated failures
	- compatibility report when `/tasks` or schemas are missing
	- offline fixture mode for the live pitch
	- no false precision when ground-truth labels are unknown

	For unknown external environments, say "estimated false positives" unless the
	remote Space provides labels or grader feedback.
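Exponential backoff and a per-Space circuit breaker can be combined roughly as follows. This is a minimal sketch: `CircuitBreaker`, `call_with_backoff`, the thresholds, and the delays are illustrative choices, not values from this plan. The 5-10 second timeout is assumed to be enforced inside `call` by whatever HTTP client is used, and `sleep` is injectable so fixtures run without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a Space has been disabled after repeated failures."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold


def call_with_backoff(call: Callable[[], T], breaker: CircuitBreaker,
                      retries: int = 3, base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with doubling delays; trip the breaker on exhaustion."""
    if breaker.is_open:
        raise CircuitOpenError("Space skipped: circuit breaker is open")
    for attempt in range(retries):
        try:
            result = call()
        except (TimeoutError, ConnectionError):
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1.0s, 2.0s, ...
        else:
            breaker.consecutive_failures = 0
            return result
    breaker.consecutive_failures += 1
    raise ConnectionError(f"remote call failed after {retries} attempts")


# Usage with a flaky stand-in call and a recorded (not real) sleep.
delays = []
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

breaker = CircuitBreaker()
outcome = call_with_backoff(flaky, breaker, sleep=delays.append)
print(outcome, delays)  # ok [0.5, 1.0]
```

This breaker never closes again once open. A fuller version would likely add a cool-down after which one trial request is allowed, but for a live pitch, skipping a broken Space for the rest of the run may be the safer behavior.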

	## Build Order

	1. Keep this as a finale extension until core training proof is complete.
	2. Implement OpenEnv compatibility checker.
	3. Implement one local action generator.
	4. Run 3-5 known Spaces or local fixtures.
	5. Add aggregate report generation.
	6. Add paste-a-Space-URL field to `/sentinel/dashboard`.
	7. Only then attempt external agent `/act` integration.

	## Demo Principle

	Prepared mode must always work. Bring-your-own-link mode is a bonus.

	The judged demo should show:

	1. SENTINEL core environment.
	2. Reward curve / before-after training proof.
	3. Zero-shot detection of confidence washing via `/sentinel/intercept`.
	4. Universal oversight report as the final "this scales beyond our environment" moment.