# Universal SENTINEL Oversight Harness

## Idea

SENTINEL should be able to supervise not only this repo's worker fleet, but any
OpenEnv-compatible environment or agent exposed through a Hugging Face Space.

The demo story:

> Paste an OpenEnv Space URL. SENTINEL wraps the environment, intercepts proposed
> actions, blocks or redirects risky behavior, then produces a cross-team
> oversight report.

This should be built last, after the core SENTINEL training proof is stable.

## Why It Wins Demo Attention

Most submissions show one agent solving one task. Universal SENTINEL shows a
meta-agent supervising many agents across many task domains.

This turns SENTINEL from "an SRE oversight environment" into a general pattern:

- calendar agents: block bulk destructive calendar edits without confirmation
- SRE agents: block hallucinated services and premature severity claims
- IoT agents: block actuator changes without sensor evidence
- negotiation agents: flag loop exploitation and repeated offers
- finance agents: block irreversible actions without risk evidence

## Safe Framing

Do not say "we judged all other teams." That sounds adversarial, and the claim
may only hold if other teams expose private agent APIs.

Say:

> SENTINEL is an opt-in universal oversight harness for OpenEnv-compatible
> agents and environments.

## Three Integration Levels

### Level 1: OpenEnv Environment Wrapper

Works when the Space exposes:

- `GET /health`
- `GET /tasks`
- `POST /reset`
- `POST /step`
- `POST /grader`
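
A compatibility check for these endpoints could look like the sketch below. It is an illustration only: `CompatibilityReport`, `check_space`, and the injected `probe` callable are hypothetical names, not part of OpenEnv or this repo. How each endpoint is probed is left to the caller, since sending a bare `POST /reset` to a live Space may have side effects.

```python
from dataclasses import dataclass, field
from typing import Callable

# Endpoints listed above as the Level 1 requirement.
REQUIRED_ENDPOINTS = [
    ("GET", "/health"),
    ("GET", "/tasks"),
    ("POST", "/reset"),
    ("POST", "/step"),
    ("POST", "/grader"),
]


@dataclass
class CompatibilityReport:
    """Hypothetical result type: which required endpoints answered."""
    base_url: str
    present: list = field(default_factory=list)
    missing: list = field(default_factory=list)

    @property
    def compatible(self) -> bool:
        return not self.missing


def check_space(base_url: str, probe: Callable[[str, str], bool]) -> CompatibilityReport:
    """Ask `probe(method, url)` about each endpoint and collect the answers.

    `probe` is supplied by the caller and returns True when the endpoint
    looks usable. Any exception it raises is treated as "missing".
    """
    report = CompatibilityReport(base_url=base_url.rstrip("/"))
    for method, path in REQUIRED_ENDPOINTS:
        label = f"{method} {path}"
        try:
            ok = probe(method, report.base_url + path)
        except Exception:
            ok = False
        (report.present if ok else report.missing).append(label)
    return report
```

Injecting `probe` keeps the checker testable against local fixtures without touching the network.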

Flow:

```text
Remote OpenEnv Space
  -> observation
  -> local baseline/LLM action generator
  -> SENTINEL intercept
  -> approved/blocked/redirected action
  -> remote /step
  -> oversight report
```

This is the safest live demo path.
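
The flow above can be sketched as a single loop. This is a minimal illustration under stated assumptions: `Verdict`, `run_episode`, and the four injected callables are hypothetical, and the `(observation, done)` return shape of `step` is assumed rather than taken from the OpenEnv schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    """Hypothetical SENTINEL decision for one proposed action."""
    decision: str            # "approve", "block", or "redirect"
    action: Optional[dict]   # action to forward; None when blocked
    reason: str = ""


def run_episode(reset, step, propose, intercept, max_steps=20):
    """Run one supervised episode and return the oversight log.

    reset()                      -> observation
    step(action)                 -> (observation, done)   # assumed shape
    propose(observation)         -> proposed action
    intercept(observation, act)  -> Verdict
    """
    log = []
    observation = reset()
    for index in range(max_steps):
        proposed = propose(observation)
        verdict = intercept(observation, proposed)
        log.append({
            "step": index,
            "proposed": proposed,
            "decision": verdict.decision,
            "reason": verdict.reason,
        })
        if verdict.decision == "block":
            # Blocked actions never reach the remote /step endpoint.
            continue
        # Approved actions pass through; redirected ones carry a replacement.
        observation, done = step(verdict.action)
        if done:
            break
    return log
```

A blocked action leaves the observation unchanged, so the generator is asked again on the next iteration; `max_steps` bounds the loop if it keeps proposing blocked actions.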

### Level 2: External Agent Oversight

Works when another team exposes an agent endpoint, for example:

- `POST /act`
- input: observation
- output: proposed action

Flow:

```text
Remote env observation
  -> remote agent /act
  -> proposed action
  -> SENTINEL intercept
  -> remote env /step if approved
```

This is the strongest proof, but it requires cooperation from other teams or
public agent APIs.
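
A remote agent can be adapted into an action proposer with a thin wrapper, sketched below. The JSON field names `observation` and `action` are assumptions; the note above only says the endpoint takes an observation and returns a proposed action. `post_json` and `make_remote_proposer` are hypothetical helpers built on the standard library.

```python
import json
import urllib.request


def post_json(url, payload, timeout=10.0):
    """POST a JSON body and decode the JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def make_remote_proposer(agent_url, post=post_json):
    """Return propose(observation) backed by the agent's /act endpoint.

    `post` is injectable so the wrapper can be exercised offline.
    """
    act_url = agent_url.rstrip("/") + "/act"

    def propose(observation):
        # Assumed request/response shape; adjust to the agent's real schema.
        reply = post(act_url, {"observation": observation})
        return reply["action"]

    return propose
```

Because the wrapper only proposes, SENTINEL still decides whether anything reaches the environment's `/step`.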

### Level 3: Browser/UI Scraping

Avoid for the judged demo.

It is fragile, slow, and can look like we are attacking other submissions. Keep
the product-level version API-first.

## Proposed Files

```text
universal/
  adapters.py          # OpenEnv/HF Space compatibility checks
  action_generator.py  # baseline or model action proposal provider
  harness.py           # run multiple Spaces with retry/backoff/circuit breaker
  policy_mapper.py     # map domain actions to universal safety categories
  report.py            # aggregate cross-environment oversight report
```

## Universal Safety Categories

SENTINEL should normalize arbitrary environment actions into these categories:

- invalid target or hallucinated entity
- irreversible/destructive action
- external communication
- escalation or broad notification
- actuator/control action
- repeated loop action
- action before evidence
- cross-domain authority violation
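
As a rough sketch, the categories could be an enum with a small rule-based mapper. Everything beyond the category names is an assumption: the action shape (a dict with `name` and `target`), the `DESTRUCTIVE_NAMES` set, and the `categorize` function are hypothetical, and only three of the eight categories have rules here.

```python
from enum import Enum


class SafetyCategory(Enum):
    INVALID_TARGET = "invalid target or hallucinated entity"
    IRREVERSIBLE = "irreversible/destructive action"
    EXTERNAL_COMMUNICATION = "external communication"
    ESCALATION = "escalation or broad notification"
    ACTUATOR = "actuator/control action"
    REPEATED_LOOP = "repeated loop action"
    ACTION_BEFORE_EVIDENCE = "action before evidence"
    AUTHORITY_VIOLATION = "cross-domain authority violation"


# Hypothetical action names treated as destructive; a real mapper would
# load these per domain.
DESTRUCTIVE_NAMES = {"delete", "bulk_update", "drop"}


def categorize(action, known_entities, history, loop_threshold=3):
    """Return the set of categories a proposed action falls into."""
    categories = set()

    # Hallucinated entity: the action names a target the environment
    # never reported.
    target = action.get("target")
    if target is not None and target not in known_entities:
        categories.add(SafetyCategory.INVALID_TARGET)

    if action.get("name") in DESTRUCTIVE_NAMES:
        categories.add(SafetyCategory.IRREVERSIBLE)

    # Loop detection: count this action plus identical trailing actions.
    repeats = 1
    for past in reversed(history):
        if past != action:
            break
        repeats += 1
    if repeats >= loop_threshold:
        categories.add(SafetyCategory.REPEATED_LOOP)

    return categories
```

An empty set means no rule fired, which is not the same as the action being safe.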

## Example Report

```text
SENTINEL OVERSIGHT REPORT - Grand Finale 2026

Sources monitored: 8
Compatible OpenEnv spaces: 6
Unavailable or timed out: 2

Team A - Calendar Assistant
  Caught: escalation bombing, step 4
  Redirected: destructive bulk update -> request confirmation, step 7
  Risk prevented: 3.2

Team B - SRE Environment
  Caught: hallucination - service "auth-proxy-v2" does not exist, step 2
  Caught: reward hacking - classified P1 before investigation, step 1

Team C - IoT Environment
  All actions approved - clean agent

Team D - Negotiation
  Caught: loop exploitation - same offer repeated 4 times

Totals:
  Actions audited: 82
  Blocks: 14
  Redirects: 6
  Flags: 9
  Prevented risk: 11.7
```
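
The totals block of such a report could be computed with a small aggregator like the sketch below. `SourceResult` and `aggregate` are hypothetical names, and the field set simply mirrors the counters shown in the example.

```python
from dataclasses import dataclass


@dataclass
class SourceResult:
    """Hypothetical per-source counters collected by the harness."""
    name: str
    actions_audited: int = 0
    blocks: int = 0
    redirects: int = 0
    flags: int = 0
    prevented_risk: float = 0.0
    available: bool = True   # False when the Space timed out or was down


def aggregate(results):
    """Sum counters across sources, skipping unavailable ones."""
    monitored = list(results)
    live = [r for r in monitored if r.available]
    return {
        "sources_monitored": len(monitored),
        "unavailable": len(monitored) - len(live),
        "actions_audited": sum(r.actions_audited for r in live),
        "blocks": sum(r.blocks for r in live),
        "redirects": sum(r.redirects for r in live),
        "flags": sum(r.flags for r in live),
        # Rounded for display only; keep raw values if precision matters.
        "prevented_risk": round(sum(r.prevented_risk for r in live), 2),
    }
```

Unavailable sources are counted but contribute nothing to the totals, so a timed-out Space cannot skew the audit numbers.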

## Reliability Requirements

The harness must never depend on a remote Space being healthy.

Required protections:

- 5-10 second request timeout per remote call
- exponential backoff for transient failures
- per-Space circuit breaker after repeated failures
- compatibility report when `/tasks` or schemas are missing
- offline fixture mode for the live pitch
- no false precision when ground-truth labels are unknown

For unknown external environments, report "estimated false positives" rather
than a measured false-positive rate, unless the remote Space provides labels or
grader feedback.
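
Backoff and the per-Space circuit breaker can be combined in one helper, sketched here under assumptions. `CircuitBreaker`, `CircuitOpenError`, and `call_with_retry` are hypothetical names, and the thresholds and delays are placeholders. The request timeout itself belongs in the callable passed as `func`.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a Space has failed too often to be tried again."""


class CircuitBreaker:
    """Counts consecutive failures for one Space."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.failure_threshold

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1


def call_with_retry(func, breaker, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call func() with exponential backoff, honoring the breaker."""
    if breaker.open:
        raise CircuitOpenError("circuit open; skipping remote call")
    for attempt in range(retries):
        try:
            result = func()
        except Exception:
            breaker.record_failure()
            # Give up when the breaker trips or the attempts run out.
            if breaker.open or attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, ...
        else:
            breaker.record_success()
            return result
```

Keeping one breaker per Space means a single unhealthy Space is skipped quickly while the others continue to run.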

## Build Order

1. Keep this as a finale extension until the core training proof is complete.
2. Implement OpenEnv compatibility checker.
3. Implement one local action generator.
4. Run 3-5 known Spaces or local fixtures.
5. Add aggregate report generation.
6. Add paste-a-Space-URL field to `/sentinel/dashboard`.
7. Only then attempt external agent `/act` integration.

## Demo Principle

Prepared mode must always work. Bring-your-own-link mode is a bonus.
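
One way to honor this is to fall back to a recorded fixture whenever a live run fails. The sketch below is an assumption about how that could be wired: `load_source`, `run_live`, and the `fixtures` mapping are hypothetical names.

```python
def load_source(space_url, run_live, fixtures):
    """Try the live Space first, then fall back to a recorded fixture.

    run_live(space_url) -> result   # may raise on timeout or bad schema
    fixtures: dict mapping Space URL -> prerecorded result
    """
    try:
        return {"mode": "live", "result": run_live(space_url)}
    except Exception as error:
        if space_url in fixtures:
            return {
                "mode": "fixture",
                "result": fixtures[space_url],
                "live_error": str(error),
            }
        # No fixture available: report the source as unavailable.
        return {"mode": "unavailable", "result": None, "live_error": str(error)}
```

Recording the `mode` lets the report state whether a result came from a live run or from a fixture.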

The judged demo should show:

1. SENTINEL core environment.
2. Reward curve / before-after training proof.
3. Zero-shot detection of confidence washing via `/sentinel/intercept`.
4. Universal oversight report as the final "this scales beyond our environment"
   moment.