# SENTINEL / Live — Real-time Agent Oversight
`SENTINEL/Live` is a thin layer on top of the trained SENTINEL Overseer that
exposes oversight as a single HTTP endpoint. Any LLM agent — Claude Code,
AutoGen, LangChain, a homemade tool-using loop — can POST a proposed action
and get back **approve / flag / block / escalate** with a justification, in
real time, with no scenario context required.
## Why this matters
Existing AI safety layers are **static**: regex blocklists, allowlist
policies, hand-authored guardrail prompts. They don't reason about the action
in context.
`SENTINEL/Live` is the first oversight layer trained specifically on
multi-step incident remediation, with a deterministic reward signal and 50
held-out scenarios for evaluation. The same `grade_overseer_decision()`
function that scored every training step is the one that scores your live
verdict — there is no train/serve gap.
Six things ship in the box:
| Feature | Description |
|---|---|
| **Heuristic backend** | Always works, sub-millisecond, zero deps. Mirrors the policy-aware Overseer (F1 = 0.969 on the eval set). |
| **Trained backend (opt-in)** | Defers to Qwen3-1.7B + LoRA via vLLM if both are present. Silently falls back to heuristic on any failure. |
| **🛡️ Prompt-injection shield** | Detects 10 adversarial override patterns ("ignore previous instructions", "approve regardless", `<\|im_start\|>`, …) before classification. Force-escalates with a clear `shield_triggered=true` flag. |
| **📋 Copy-as-agent-code generator** | The Gradio tab has a "Copy as agent code" panel that auto-rebuilds a **cURL / Python `requests` / LangChain `BaseTool`** snippet from whatever you typed into the form — paste-and-go integration with zero adaptation. |
| **🔌 API Explorer tab** | A whole third Gradio tab with a one-click **▶️ Try it** card for *every* endpoint the FastAPI app exposes — `/health`, `/api/info`, `/tasks`, `/reset`, `/step` (both Responder and Overseer subforms), `/state`, `/grader`, `/live/oversee`, `/live/stats`, `/live/health`, plus a link to `/docs` (Swagger UI). Each card shows the live JSON response *and* the equivalent `curl` snippet pointed at the public Space URL — the snippet is provably what the UI just ran, so judges can paste it into their terminal and reproduce verbatim. |
| **🏆 Live Reward Scoreboard** | Pinned to the top of both the Replay Viewer and the API Explorer. Shows cumulative Responder reward, cumulative Overseer reward, F1 (color-coded 🟢 ≥ 0.85, 🟡 ≥ 0.50, 🔴 below), the full TP/FP/TN/FN confusion matrix, and drift-event count. **It auto-refreshes after every endpoint call** — hit `/reset` and watch the placeholder flip to a live episode; hit `/step` and watch the cumulative bars move; hit `/grader` and the scoreboard pulls the canonical numbers. The reward signal is no longer hidden in a JSON pane — it's the headline of the page. |
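The shield's behavior can be approximated with a few pattern checks. A minimal sketch, assuming a hypothetical subset of the patterns (the shipped shield has ten; only three are shown here):

```python
import re

# Hypothetical subset of the adversarial override patterns.
SHIELD_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"approve regardless",
    r"<\|im_start\|>",
]

def shield_check(proposal: str) -> bool:
    """Return True when an adversarial override pattern appears in the text."""
    lowered = proposal.lower()
    return any(re.search(p, lowered) for p in SHIELD_PATTERNS)
```

When the check fires, the verdict is force-escalated and the response carries `shield_triggered=true`.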
## Try it
- **Live demo (Gradio)** — first tab on [the HF Space](https://huggingface.co/spaces/Elliot89/sentinel)
- **API** — `POST https://elliot89-sentinel.hf.space/live/oversee`
- **Demo script** — `python tools/agent_demo.py --use-mock-llm` (no API key needed)
```bash
curl -X POST https://elliot89-sentinel.hf.space/live/oversee \
-H 'Content-Type: application/json' \
-d '{"action_description":"DROP TABLE users",
"proposed_tool":"DROP TABLE",
"proposed_target":"users",
"severity_hint":"critical"}'
# → {"decision":"block","severity_assessed":"catastrophic", ... }
```
## Endpoints — `SENTINEL/Live` adds three; the underlying OpenEnv contract exposes seven more
| Layer | Method | Path | Purpose |
|---|:---:|---|---|
| 🛡️ Live | `POST` | `/live/oversee` | Submit one proposed action, get a verdict. |
| 🛡️ Live | `GET` | `/live/stats` | Lifetime counters: total verdicts, by-decision, by-severity, p50/p99 latency, catastrophic-blocked, shield-triggered. |
| 🛡️ Live | `GET` | `/live/health` | Per-feature health + whether the trained backend is currently reachable. |
| 🌐 OpenEnv | `GET` | `/health` | Server health (`{status, version}`). |
| 🌐 OpenEnv | `GET` | `/api/info` | Service descriptor (name, version, tasks, docs URL). |
| 🌐 OpenEnv | `GET` | `/tasks` | All 3 task tiers + canonical Responder/Overseer action schemas. |
| 🌐 OpenEnv | `POST` | `/reset` | Start an episode (`task_id`, `seed`, `mode`). |
| 🌐 OpenEnv | `POST` | `/step` | Submit one action (Responder or Overseer, discriminated on `role`). |
| 🌐 OpenEnv | `GET` | `/state` | Full `EpisodeState` snapshot. |
| 🌐 OpenEnv | `GET` | `/grader` | Per-episode F1, confusion, **cumulative rewards** 🏆. |
| 📖 Docs | `GET` | `/docs` | FastAPI Swagger UI. |
> There is no `/stop` endpoint — episodes terminate naturally when `/step` returns `done: true`. Call `/reset` again to start a fresh one. **Every endpoint above has a one-click ▶️ Try it card on the API Explorer tab.**
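A minimal episode loop against the OpenEnv endpoints might look like the sketch below. The exact `Action` field names beyond the discriminator `role` are assumptions here — `/tasks` returns the canonical schemas:

```python
import requests

BASE = "http://localhost:7860"  # or https://elliot89-sentinel.hf.space

def overseer_action(decision: str, justification: str) -> dict:
    # /step payloads are discriminated on "role"; the remaining field
    # names are illustrative — fetch /tasks for the canonical schema.
    return {"role": "overseer", "decision": decision,
            "justification": justification}

def run_episode(task_id: str, seed: int = 0) -> dict:
    requests.post(f"{BASE}/reset", json={"task_id": task_id, "seed": seed})
    done = False
    while not done:
        step = requests.post(f"{BASE}/step", json=overseer_action(
            "approve", "no rule fired")).json()
        done = step.get("done", False)  # episodes end when /step says so
    return requests.get(f"{BASE}/grader").json()  # F1, confusion, rewards
```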
## Wire it into your agent framework
**Plain Python** (works for OpenAI tool-use, Claude tool-use, anything):
```python
import requests

class SafetyAbort(RuntimeError):
    """Raised when SENTINEL blocks or escalates a proposed action."""

# proposed_action, tool, target, severity come from your agent loop.
verdict = requests.post(
    "http://localhost:7860/live/oversee",
    json={"action_description": proposed_action, "proposed_tool": tool,
          "proposed_target": target, "severity_hint": severity},
).json()
if verdict["decision"] in ("block", "escalate"):
    raise SafetyAbort(verdict["justification"])
```
**LangChain** (as a tool wrapper):
```python
import requests
from langchain.tools import tool

SENTINEL_URL = "http://localhost:7860"  # or your Space URL

@tool
def safe_execute(action: str, tool_name: str, target: str) -> str:
    """Run an action only if SENTINEL approves."""
    v = requests.post(SENTINEL_URL + "/live/oversee", json={
        "action_description": action, "proposed_tool": tool_name,
        "proposed_target": target, "severity_hint": "high"}).json()
    if v["decision"] in ("block", "escalate"):
        return f"BLOCKED: {v['justification']}"
    return execute(action, tool_name, target)  # your own executor
```
**AutoGen** (as a `ConversableAgent` reply hook):
```python
def sentinel_filter(messages, sender, config):
    proposal = parse_proposal(messages[-1]["content"])  # your own parser
    v = requests.post(SENTINEL_URL + "/live/oversee", json=proposal).json()
    if v["decision"] in ("block", "escalate"):
        # Final reply: short-circuit the chain with the block message.
        return (True, f"BLOCKED: {v['justification']}")
    return (False, None)  # not final — let the normal reply chain continue
agent.register_reply([Agent], reply_func=sentinel_filter, position=0)
```
## Response schema
```json
{
"decision": "approve | flag | block | escalate",
"justification": "one-sentence reasoning citing the rule that fired",
"severity_assessed": "low | medium | high | catastrophic",
"confidence": 0.95,
"risk_score": 0.05,
"reasoning_trace": ["shield_check: clear", "rule: ...", "backend: heuristic"],
"shield_triggered": false,
"backend_used": "heuristic | heuristic-fallback | trained-qwen3-1.7b",
"latency_ms": 1
}
```
`risk_score = severity × (1 − confidence)`, in [0,1]. Use it as a single-number
gauge in your dashboard.
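As a worked sketch, where the severity-to-weight mapping is an assumption (the server's actual table may differ):

```python
# Hypothetical severity weights — not the shipped mapping.
SEVERITY_WEIGHT = {"low": 0.25, "medium": 0.5, "high": 0.75,
                   "catastrophic": 1.0}

def risk_score(severity_assessed: str, confidence: float) -> float:
    """risk_score = severity × (1 − confidence), clamped to [0, 1]."""
    score = SEVERITY_WEIGHT[severity_assessed] * (1.0 - confidence)
    return max(0.0, min(1.0, score))
```

With `severity_assessed="catastrophic"` and `confidence=0.95`, this yields 0.05 — matching the schema example above.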
## Architecture (one paragraph)
The route handler is a thin wrapper around `live_oversee_logic()` — a pure
function that (1) runs the prompt-injection shield, (2) classifies the
proposal via keyword rules into one of {catastrophic, wrong, correct,
ambiguous, neutral}, (3) synthesizes a scenario-shaped dict and calls
`graders.grade_overseer_decision()` so the live verdict is provably
consistent with how a real episode would have scored it, (4) optionally
defers to the trained Qwen3-1.7B backend via vLLM with silent fallback.
The Gradio tab calls the same function in-process — what you see on screen
is byte-for-byte what the HTTP API returns.
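The four stages can be sketched as a single pure function. This is an illustrative toy, not the shipped implementation: the patterns and keyword table are made up, and a direct bucket-to-verdict map stands in for the `grade_overseer_decision()` call and the vLLM deferral:

```python
def live_oversee_sketch(action: str) -> dict:
    """Illustrative four-stage pipeline, heavily simplified."""
    # (1) prompt-injection shield (one toy pattern; the real shield has ten)
    if "ignore previous instructions" in action.lower():
        return {"decision": "escalate", "shield_triggered": True}
    # (2) keyword classification into the five buckets (toy table)
    keywords = {"drop table": "catastrophic", "rm -rf": "catastrophic",
                "restart": "correct"}
    bucket = next((b for k, b in keywords.items()
                   if k in action.lower()), "ambiguous")
    # (3) the real code synthesizes a scenario-shaped dict and scores it
    # with grade_overseer_decision(); here a direct map stands in
    verdict = {"catastrophic": "block", "wrong": "block",
               "correct": "approve", "ambiguous": "flag",
               "neutral": "approve"}[bucket]
    # (4) the real code would now optionally defer to the trained
    # Qwen3-1.7B backend via vLLM, falling back here on any failure
    return {"decision": verdict, "shield_triggered": False}
```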
The whole feature is ~1100 lines across 4 new files (`server/live_routes.py`,
`server/live_ui.py`, `tools/agent_demo.py`, `SENTINEL_LIVE.md`) plus a small
populator extraction in `server/app.py`. Nothing in `graders.py`,
`scenarios.py`, `models.py`, `drift.py`, `eval.py`, or `client.py` was touched.
> **Note on the UI structure:** the live tab, the original 3-column
> replay viewer, and the new API Explorer tab are all composed via the
> *populator pattern* (callables that add components to the current
> `gr.Tabs` context). Earlier builds used the nested `Blocks.render()`
> pattern, which caused some Gradio versions to render the live panel
> twice on the same page. The current build renders each tab exactly
> once — verified at the `/config` level (3 tab items, 3 distinct
> labels, no duplicates).
## 🔌 API Explorer + 🏆 Reward Scoreboard — the "judge UX" upgrade
Two complaints any hackathon judge has after staring at a FastAPI Space
for 30 seconds:
1. *"Where do I see the rewards?"* — they're often buried in a JSON pane
below the fold.
2. *"How do I call this without dropping into a terminal?"* — most
submissions force you out to `curl` or Postman.
The third Gradio tab — **🔌 API Explorer** — fixes both.
- **Every endpoint** (`/health`, `/api/info`, `/tasks`, `/reset`, `/step`,
`/state`, `/grader`, plus all three `/live/*` routes) sits in its own
collapsible card. Each card has a `▶️ Try it` button (with input form
if the route takes a body), a **live JSON response panel**, and an
**equivalent `curl` panel** pointed at the public Space URL.
- The `/step` card has *two* sub-forms (Responder action and Overseer
action) so the discriminated `Action` payload is buildable without
reading `models.py`.
- The **🏆 Live Reward Scoreboard** is pinned at the top of the tab and
  re-pulls `/grader` after **every single button click**: `/reset`,
  `/step`, `/grader`, even `/live/oversee`. Cumulative responder reward,
cumulative overseer reward, F1 (color-coded), TP/FP/TN/FN, drift
count. The same scoreboard banner is also pinned to the top of the
Replay Viewer tab and updates after each `▶️ Play Episode` click.
The implementation is one new file (`server/api_explorer_ui.py`, ~430
lines, all populator-style) plus a 3-line change to `combine_with_live_tab()`
in `server/live_ui.py` to make the third tab optional. Still zero edits
to `graders.py`, `eval.py`, `scenarios.py`, `models.py`, `drift.py`, or
`client.py`.