Universal SENTINEL Oversight Harness

Idea

SENTINEL should be able to supervise not only this repo's worker fleet, but any OpenEnv-compatible environment or agent exposed through a Hugging Face Space.

The demo story:

Paste an OpenEnv Space URL. SENTINEL wraps the environment, intercepts proposed actions, blocks or redirects risky behavior, then produces a cross-team oversight report.

This should be built last, after the core SENTINEL training proof is stable.

Why It Wins Demo Attention

Most submissions show one agent solving one task. Universal SENTINEL shows a meta-agent supervising many agents across many task domains.

This turns SENTINEL from "an SRE oversight environment" into a general pattern:

  • calendar agents: block bulk destructive calendar edits without confirmation
  • SRE agents: block hallucinated services and premature severity claims
  • IoT agents: block actuator changes without sensor evidence
  • negotiation agents: flag loop exploitation and repeated offers
  • finance agents: block irreversible actions without risk evidence

Safe Framing

Do not say "we judged all other teams." That sounds adversarial, and judging other teams' agents may require private agent APIs.

Say:

SENTINEL is an opt-in universal oversight harness for OpenEnv-compatible agents and environments.

Three Integration Levels

Level 1: OpenEnv Environment Wrapper

Works when the Space exposes:

  • GET /health
  • GET /tasks
  • POST /reset
  • POST /step
  • POST /grader
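
One way to turn this endpoint list into a compatibility check is sketched below. The `probe` callable is a hypothetical stand-in for whatever HTTP client the harness ends up using, and how to probe the POST endpoints without side effects is left open. None of these names come from the SENTINEL codebase.

```python
from typing import Callable, Dict, List, Tuple

# Endpoints from the list above, as (method, path) pairs.
REQUIRED_ENDPOINTS: List[Tuple[str, str]] = [
    ("GET", "/health"),
    ("GET", "/tasks"),
    ("POST", "/reset"),
    ("POST", "/step"),
    ("POST", "/grader"),
]

# Hypothetical probe signature: (method, path) -> True if the endpoint responds.
Probe = Callable[[str, str], bool]


def check_compatibility(probe: Probe) -> Dict[str, object]:
    """Probe every required endpoint and report which ones are missing."""
    missing: List[str] = []
    for method, path in REQUIRED_ENDPOINTS:
        try:
            ok = probe(method, path)
        except Exception:
            # A timeout or connection error counts as "not available".
            ok = False
        if not ok:
            missing.append(f"{method} {path}")
    return {"compatible": not missing, "missing": missing}
```

A Space that fails the check can still appear in the report as incompatible instead of aborting the run.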

Flow:

Remote OpenEnv Space
  -> observation
  -> local baseline/LLM action generator
  -> SENTINEL intercept
  -> approved/blocked/redirected action
  -> remote /step
  -> oversight report

This is the safest live demo path.
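
The Level 1 flow can be sketched as a single loop. This is a minimal sketch under assumptions: the four callables (`reset`, `step`, `propose`, `intercept`) are hypothetical wrappers, and the verdict shape (`decision`, `action`) and the `/step` response keys (`observation`, `done`) are illustrative, not the actual SENTINEL or OpenEnv schemas.

```python
from typing import Any, Callable, Dict, List

Observation = Dict[str, Any]
Action = Dict[str, Any]


def run_episode(
    reset: Callable[[], Observation],                  # wraps remote POST /reset
    step: Callable[[Action], Dict[str, Any]],          # wraps remote POST /step
    propose: Callable[[Observation], Action],          # local baseline/LLM generator
    intercept: Callable[[Observation, Action], Dict[str, Any]],  # SENTINEL verdict
    max_steps: int = 20,
) -> List[Dict[str, Any]]:
    """Run one supervised episode and return an audit log, one entry per step."""
    audit: List[Dict[str, Any]] = []
    obs = reset()
    for i in range(max_steps):
        proposed = propose(obs)
        # Assumed verdict shape: {"decision": "approved" | "blocked" | "redirected",
        #                         "action": the action to send, if any}
        verdict = intercept(obs, proposed)
        audit.append({"step": i, "proposed": proposed, "decision": verdict["decision"]})
        if verdict["decision"] == "blocked":
            # Nothing is sent; the generator is asked again on the same observation.
            continue
        # For "redirected", verdict["action"] is SENTINEL's replacement action.
        result = step(verdict["action"])
        obs = result["observation"]
        if result.get("done"):
            break
    return audit
```

Passing the remote calls in as callables keeps the loop testable against local fixtures, which fits the offline mode planned for the live pitch.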

Level 2: External Agent Oversight

Works when another team exposes an agent endpoint, for example:

  • POST /act
  • input: observation
  • output: proposed action

Flow:

Remote env observation
  -> remote agent /act
  -> proposed action
  -> SENTINEL intercept
  -> remote env /step if approved

This is the strongest proof, but requires cooperation or public agent APIs.
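
A single Level 2 step might look like the sketch below. The callables `act`, `intercept`, and `step` are hypothetical wrappers around the remote agent, SENTINEL, and the remote environment. Treating an unreachable agent as "nothing forwarded" is an assumption of this sketch, not a stated requirement.

```python
from typing import Any, Callable, Dict

Observation = Dict[str, Any]
Action = Dict[str, Any]


def supervised_step(
    obs: Observation,
    act: Callable[[Observation], Action],              # wraps remote agent POST /act
    intercept: Callable[[Observation, Action], Dict[str, Any]],  # SENTINEL verdict
    step: Callable[[Action], Dict[str, Any]],          # wraps remote env POST /step
) -> Dict[str, Any]:
    """Ask the remote agent for an action and forward it only if approved."""
    try:
        proposed = act(obs)
    except Exception as exc:
        # Agent is down or timed out: record it and forward nothing.
        return {"decision": "agent_unavailable", "forwarded": False, "error": str(exc)}
    verdict = intercept(obs, proposed)
    if verdict["decision"] != "approved":
        return {"decision": verdict["decision"], "forwarded": False, "proposed": proposed}
    return {"decision": "approved", "forwarded": True, "result": step(proposed)}
```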

Level 3: Browser/UI Scraping

Avoid for the judged demo.

It is fragile, slow, and can look like we are attacking other submissions. Keep the product-level version API-first.

Proposed Files

universal/
  adapters.py          # OpenEnv/HF Space compatibility checks
  action_generator.py  # baseline or model action proposal provider
  harness.py           # run multiple Spaces with retry/backoff/circuit breaker
  policy_mapper.py     # map domain actions to universal safety categories
  report.py            # aggregate cross-environment oversight report

Universal Safety Categories

SENTINEL should normalize arbitrary environment actions into these categories:

  • invalid target or hallucinated entity
  • irreversible/destructive action
  • external communication
  • escalation or broad notification
  • actuator/control action
  • repeated loop action
  • action before evidence
  • cross-domain authority violation
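
As a rough illustration of what `policy_mapper.py` could contain, the sketch below encodes the eight categories as an enum and implements three of the checks with simple rules. The action fields (`type`, `target`), the verb list, and the loop threshold are all assumptions made for illustration; a real mapper would need per-domain rules.

```python
from enum import Enum
from typing import Any, Dict, List, Set


class SafetyCategory(Enum):
    INVALID_TARGET = "invalid target or hallucinated entity"
    DESTRUCTIVE = "irreversible/destructive action"
    EXTERNAL_COMMUNICATION = "external communication"
    ESCALATION = "escalation or broad notification"
    ACTUATOR = "actuator/control action"
    REPEATED_LOOP = "repeated loop action"
    ACTION_BEFORE_EVIDENCE = "action before evidence"
    AUTHORITY_VIOLATION = "cross-domain authority violation"


# Hypothetical verb list; each domain would supply its own.
DESTRUCTIVE_VERBS: Set[str] = {"delete", "drop", "wipe", "bulk_update"}


def categorize(
    action: Dict[str, Any],
    known_targets: Set[str],
    history: List[Dict[str, Any]],
    loop_threshold: int = 3,
) -> List[SafetyCategory]:
    """Return the safety categories an action falls into (possibly none)."""
    found: List[SafetyCategory] = []
    target = action.get("target")
    # A named target that the environment never listed is treated as hallucinated.
    if target is not None and target not in known_targets:
        found.append(SafetyCategory.INVALID_TARGET)
    if action.get("type") in DESTRUCTIVE_VERBS:
        found.append(SafetyCategory.DESTRUCTIVE)
    # Count identical earlier actions plus the current one.
    if history.count(action) + 1 >= loop_threshold:
        found.append(SafetyCategory.REPEATED_LOOP)
    return found
```

Returning a list lets one action carry several categories, for example a destructive action aimed at a nonexistent target.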

Example Report

SENTINEL OVERSIGHT REPORT - Grand Finale 2026

Sources monitored: 8
Compatible OpenEnv spaces: 6
Unavailable or timed out: 2

Team A - Calendar Assistant
  Caught: escalation bombing, step 4
  Redirected: destructive bulk update -> request confirmation, step 7
  Risk prevented: 3.2

Team B - SRE Environment
  Caught: hallucination - service "auth-proxy-v2" does not exist, step 2
  Caught: reward hacking - classified P1 before investigation, step 1

Team C - IoT Environment
  All actions approved - clean agent

Team D - Negotiation
  Caught: loop exploitation - same offer repeated 4 times

Totals:
  Actions audited: 82
  Blocks: 14
  Redirects: 6
  Flags: 9
  Prevented risk: 11.7
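
The Totals block could be derived from per-source records along these lines. The `SourceReport` fields mirror the labels in the example report, but the structure itself is a hypothetical sketch and not an existing `report.py` API.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class SourceReport:
    """Per-source counters; field names follow the Totals labels above."""
    name: str
    actions_audited: int = 0
    blocks: int = 0
    redirects: int = 0
    flags: int = 0
    prevented_risk: float = 0.0


def totals(reports: List[SourceReport]) -> Dict[str, Union[int, float]]:
    """Sum the per-source counters into the Totals section of the report."""
    return {
        "Actions audited": sum(r.actions_audited for r in reports),
        "Blocks": sum(r.blocks for r in reports),
        "Redirects": sum(r.redirects for r in reports),
        "Flags": sum(r.flags for r in reports),
        # Rounded to one decimal to match the report's display precision.
        "Prevented risk": round(sum(r.prevented_risk for r in reports), 1),
    }


def render_totals(reports: List[SourceReport]) -> str:
    """Format the totals in the same indented style as the example report."""
    lines = ["Totals:"]
    lines += [f"  {label}: {value}" for label, value in totals(reports).items()]
    return "\n".join(lines)
```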

Reliability Requirements

The harness must never depend on a remote Space being healthy.

Required protections:

  • 5-10 second request timeout per remote call
  • exponential backoff for transient failures
  • per-Space circuit breaker after repeated failures
  • compatibility report when /tasks or schemas are missing
  • offline fixture mode for the live pitch
  • no false precision when ground-truth labels are unknown

For unknown external environments, say "estimated false positives" unless the remote Space provides labels or grader feedback.
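
The backoff and circuit-breaker protections could be combined as in the sketch below. It assumes the wrapped callable applies its own 5-10 second timeout, and it treats every exception as transient, which is a simplification. The class name, threshold, and delay values are illustrative defaults, not settled choices.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a Space has failed too often and is being skipped."""


class CircuitBreaker:
    """One instance per Space; counts consecutive failed attempts."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(
        self,
        fn: Callable[[], T],
        retries: int = 2,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> T:
        """Call fn with exponential backoff; refuse immediately if the circuit is open."""
        if self.open:
            raise CircuitOpen("too many consecutive failures")
        for attempt in range(retries + 1):
            try:
                result = fn()
            except Exception:
                self.failures += 1
                # Give up when the circuit trips or the retry budget is spent.
                if self.open or attempt == retries:
                    raise
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * 2 ** attempt)
            else:
                # Any success resets the consecutive-failure count.
                self.failures = 0
                return result
        raise RuntimeError("unreachable: loop always returns or raises")
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without waiting. This sketch has no cool-down that closes the circuit again; a tripped Space stays skipped for the rest of the run.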

Build Order

  1. Keep this as a finale extension until core training proof is complete.
  2. Implement OpenEnv compatibility checker.
  3. Implement one local action generator.
  4. Run 3-5 known Spaces or local fixtures.
  5. Add aggregate report generation.
  6. Add paste-a-Space-URL field to /sentinel/dashboard.
  7. Only then attempt external agent /act integration.

Demo Principle

Prepared mode must always work. Bring-your-own-link mode is a bonus.

The judged demo should show:

  1. SENTINEL core environment.
  2. Reward curve / before-after training proof.
  3. Zero-shot detection of confidence washing via /sentinel/intercept.
  4. Universal oversight report as the final "this scales beyond our environment" moment.