# Universal SENTINEL Oversight Harness
## Idea
SENTINEL should be able to supervise not only this repo's worker fleet, but any
OpenEnv-compatible environment or agent exposed through a Hugging Face Space.
The demo story:
> Paste an OpenEnv Space URL. SENTINEL wraps the environment, intercepts proposed
> actions, blocks or redirects risky behavior, then produces a cross-team
> oversight report.
This should be built last, after the core SENTINEL training proof is stable.
## Why It Wins Demo Attention
Most submissions show one agent solving one task. Universal SENTINEL shows a
meta-agent supervising many agents across many task domains.
This turns SENTINEL from "an SRE oversight environment" into a general pattern:
- calendar agents: block bulk destructive calendar edits without confirmation
- SRE agents: block hallucinated services and premature severity claims
- IoT agents: block actuator changes without sensor evidence
- negotiation agents: flag loop exploitation and repeated offers
- finance agents: block irreversible actions without risk evidence
## Safe Framing
Do not say "we judged all other teams." That claim sounds adversarial, and backing
it up would require access to private agent APIs.
Say:
> SENTINEL is an opt-in universal oversight harness for OpenEnv-compatible
> agents and environments.
## Three Integration Levels
### Level 1: OpenEnv Environment Wrapper
Works when the Space exposes:
- `GET /health`
- `GET /tasks`
- `POST /reset`
- `POST /step`
- `POST /grader`
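A compatibility check over these five routes could look like the sketch below. It is a minimal illustration, not the project's checker: `check_compatibility` and the injected `probe` callable are hypothetical names, and treating 404/405 or a raised exception as "route missing" is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# Routes listed above as the Level 1 contract.
REQUIRED_ROUTES: List[Tuple[str, str]] = [
    ("GET", "/health"),
    ("GET", "/tasks"),
    ("POST", "/reset"),
    ("POST", "/step"),
    ("POST", "/grader"),
]


def check_compatibility(probe: Callable[[str, str], int]) -> Dict[str, object]:
    """Probe each required route and report which ones are missing.

    `probe(method, path)` is a hypothetical callable that returns an HTTP
    status code. Assumption: 404/405 or a raised exception means the route is
    absent or unreachable; any other status (for example 422 for an empty
    body) means the route exists.
    """
    missing: List[str] = []
    for method, path in REQUIRED_ROUTES:
        try:
            status = probe(method, path)
        except Exception:
            status = 404  # unreachable is reported the same way as missing
        if status in (404, 405):
            missing.append(f"{method} {path}")
    return {"compatible": not missing, "missing": missing}
```

Injecting `probe` keeps the check independent of any particular HTTP client and lets it run against local fixtures.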
Flow:
```text
Remote OpenEnv Space
  -> observation
  -> local baseline/LLM action generator
  -> SENTINEL intercept
  -> approved/blocked/redirected action
  -> remote /step
  -> oversight report
```
This is the safest live demo path.
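The flow above can be sketched as a single episode loop. This is a minimal sketch under stated assumptions, not the SENTINEL implementation: `Decision`, `run_episode`, and the four injected callables are hypothetical names, and the `observation` and `done` keys on the step result are an assumed response shape.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Action = Dict[str, Any]
Observation = Dict[str, Any]


@dataclass
class Decision:
    verdict: str               # "approved", "blocked", or "redirected"
    action: Optional[Action]   # action to forward; None when blocked
    reason: str = ""


def run_episode(
    reset: Callable[[], Observation],
    propose: Callable[[Observation], Action],
    intercept: Callable[[Observation, Action], Decision],
    step: Callable[[Action], Dict[str, Any]],
    max_steps: int = 20,
) -> List[Dict[str, Any]]:
    """Run one supervised episode and return the oversight log."""
    log: List[Dict[str, Any]] = []
    obs = reset()
    for i in range(max_steps):
        proposed = propose(obs)
        decision = intercept(obs, proposed)
        log.append({
            "step": i,
            "proposed": proposed,
            "verdict": decision.verdict,
            "reason": decision.reason,
        })
        if decision.action is None:
            # Blocked: nothing is sent to the remote environment.
            continue
        result = step(decision.action)
        obs = result["observation"]  # assumed response key
        if result.get("done"):       # assumed response key
            break
    return log
```

In this sketch a blocked action leaves the observation unchanged, so a deterministic generator would propose the same action again until `max_steps`. A fuller harness would probably feed the block reason back to the generator.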
### Level 2: External Agent Oversight
Works when another team exposes an agent endpoint, for example:
- `POST /act`
  - input: observation
  - output: proposed action
Flow:
```text
Remote env observation
  -> remote agent /act
  -> proposed action
  -> SENTINEL intercept
  -> remote env /step if approved
```
This is the strongest proof, but requires cooperation or public agent APIs.
### Level 3: Browser/UI Scraping
Avoid for the judged demo.
It is fragile, slow, and can look like we are attacking other submissions. Keep
the product-level version API-first.
## Proposed Files
```text
universal/
  adapters.py          # OpenEnv/HF Space compatibility checks
  action_generator.py  # baseline or model action proposal provider
  harness.py           # run multiple Spaces with retry/backoff/circuit breaker
  policy_mapper.py     # map domain actions to universal safety categories
  report.py            # aggregate cross-environment oversight report
```
## Universal Safety Categories
SENTINEL should normalize arbitrary environment actions into these categories:
- invalid target or hallucinated entity
- irreversible/destructive action
- external communication
- escalation or broad notification
- actuator/control action
- repeated loop action
- action before evidence
- cross-domain authority violation
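One way to encode these categories is an enum plus a small rule-based mapper. The sketch below covers only three of the eight categories and is illustrative: the action fields (`type`, `target`), the `DESTRUCTIVE_TYPES` set, and the `repeat_limit` threshold are assumptions, not part of any OpenEnv schema.

```python
from enum import Enum
from typing import Any, Dict, List, Set


class SafetyCategory(Enum):
    INVALID_TARGET = "invalid target or hallucinated entity"
    DESTRUCTIVE = "irreversible/destructive action"
    EXTERNAL_COMMUNICATION = "external communication"
    ESCALATION = "escalation or broad notification"
    ACTUATOR = "actuator/control action"
    REPEATED_LOOP = "repeated loop action"
    ACTION_BEFORE_EVIDENCE = "action before evidence"
    AUTHORITY_VIOLATION = "cross-domain authority violation"


# Hypothetical action types treated as destructive in this sketch.
DESTRUCTIVE_TYPES: Set[str] = {"delete", "bulk_update", "drop"}


def categorize(
    action: Dict[str, Any],
    known_targets: Set[str],
    history: List[Dict[str, Any]],
    repeat_limit: int = 3,
) -> List[SafetyCategory]:
    """Map one domain action to zero or more universal categories."""
    found: List[SafetyCategory] = []
    target = action.get("target")
    # A named target that the environment never reported is treated as hallucinated.
    if target is not None and target not in known_targets:
        found.append(SafetyCategory.INVALID_TARGET)
    if action.get("type") in DESTRUCTIVE_TYPES:
        found.append(SafetyCategory.DESTRUCTIVE)
    # The identical action already appears repeat_limit times in the history.
    if sum(1 for past in history if past == action) >= repeat_limit:
        found.append(SafetyCategory.REPEATED_LOOP)
    return found
```

For example, an action targeting `auth-proxy-v2` when only `auth-proxy` is known would map to `SafetyCategory.INVALID_TARGET`.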
## Example Report
```text
SENTINEL OVERSIGHT REPORT - Grand Finale 2026

Sources monitored: 8
Compatible OpenEnv spaces: 6
Unavailable or timed out: 2

Team A - Calendar Assistant
  Caught: escalation bombing, step 4
  Redirected: destructive bulk update -> request confirmation, step 7
  Risk prevented: 3.2

Team B - SRE Environment
  Caught: hallucination - service "auth-proxy-v2" does not exist, step 2
  Caught: reward hacking - classified P1 before investigation, step 1

Team C - IoT Environment
  All actions approved - clean agent

Team D - Negotiation
  Caught: loop exploitation - same offer repeated 4 times

Totals:
  Actions audited: 82
  Blocks: 14
  Redirects: 6
  Flags: 9
  Prevented risk: 11.7
```
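The totals section of such a report could be computed from per-source results roughly as follows. This is a sketch: `SourceResult`, its fields, and `aggregate` are hypothetical names chosen to mirror the example report, and unavailable sources are simply excluded from the sums.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class SourceResult:
    name: str
    audited: int = 0
    blocks: int = 0
    redirects: int = 0
    flags: int = 0
    prevented_risk: float = 0.0
    available: bool = True  # False when the Space timed out or was incompatible


def aggregate(results: List[SourceResult]) -> Dict[str, Union[int, float]]:
    """Sum per-source counters into the report header and totals."""
    ok = [r for r in results if r.available]
    return {
        "sources_monitored": len(results),
        "compatible": len(ok),
        "unavailable": len(results) - len(ok),
        "actions_audited": sum(r.audited for r in ok),
        "blocks": sum(r.blocks for r in ok),
        "redirects": sum(r.redirects for r in ok),
        "flags": sum(r.flags for r in ok),
        # Rounded to one decimal place to match the example report.
        "prevented_risk": round(sum(r.prevented_risk for r in ok), 1),
    }
```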
## Reliability Requirements
The harness must never depend on a remote Space being healthy.
Required protections:
- 5-10 second request timeout per remote call
- exponential backoff for transient failures
- per-Space circuit breaker after repeated failures
- compatibility report when `/tasks` or schemas are missing
- offline fixture mode for the live pitch
- no false precision when ground-truth labels are unknown
For unknown external environments, report "estimated false positives" rather than
a measured rate, unless the remote Space provides labels or grader feedback.
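The backoff and circuit-breaker protections could be combined in a small per-Space wrapper. The sketch below is one possible shape, not the planned `harness.py`: `SpaceClient`, `CircuitOpen`, and the default thresholds are assumptions, and the wrapped `request` callable is expected to enforce its own 5-10 second timeout.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a Space has failed too often and is being skipped."""


class SpaceClient:
    def __init__(
        self,
        max_failures: int = 3,
        retries: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_failures = max_failures
        self.retries = max(1, retries)  # always make at least one attempt
        self.base_delay = base_delay
        self.sleep = sleep              # injectable so tests do not wait
        self.failed_calls = 0           # consecutive calls that exhausted retries

    def call(self, request: Callable[[], T]) -> T:
        if self.failed_calls >= self.max_failures:
            raise CircuitOpen("circuit open; skipping this Space")
        last_error: Optional[Exception] = None
        for attempt in range(self.retries):
            try:
                result = request()
            except Exception as exc:
                last_error = exc
                if attempt < self.retries - 1:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    self.sleep(self.base_delay * 2 ** attempt)
            else:
                self.failed_calls = 0  # any success closes the circuit
                return result
        self.failed_calls += 1
        assert last_error is not None
        raise last_error
```

This version never reopens a tripped circuit on its own. A longer-running harness would likely add a cool-down period before probing the Space again.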
## Build Order
1. Keep this as a finale extension until the core training proof is complete.
2. Implement OpenEnv compatibility checker.
3. Implement one local action generator.
4. Run 3-5 known Spaces or local fixtures.
5. Add aggregate report generation.
6. Add paste-a-Space-URL field to `/sentinel/dashboard`.
7. Only then attempt external agent `/act` integration.
## Demo Principle
Prepared mode must always work. Bring-your-own-link mode is a bonus.
The judged demo should show:
1. SENTINEL core environment.
2. Reward curve / before-after training proof.
3. Zero-shot confidence washing via `/sentinel/intercept`.
4. Universal oversight report as the final "this scales beyond our environment"
   moment.