Spaces:
Running
SENTINEL / Live — Real-time Agent Oversight
SENTINEL/Live is a thin layer on top of the trained SENTINEL Overseer that
exposes oversight as a single HTTP endpoint. Any LLM agent — Claude Code,
AutoGen, LangChain, a homemade tool-using loop — can POST a proposed action
and get back approve / flag / block / escalate with a justification, in
real time, with no scenario context required.
Why this matters
Existing AI safety layers are static: regex blocklists, allowlist policies, hand-authored guardrail prompts. They don't reason about the action in context.
SENTINEL/Live is the first oversight layer trained specifically on
multi-step incident remediation, with a deterministic reward signal and 50
held-out scenarios for evaluation. The same grade_overseer_decision()
function that scored every training step is the one that scores your live
verdict — there is no train/serve gap.
Six things ship in the box:
| Heuristic backend | Always works, sub-millisecond, zero deps. Mirrors the policy-aware Overseer (F1 = 0.969 on the eval set). |
| Trained backend (opt-in) | Defers to Qwen3-1.7B + LoRA via vLLM if both are present. Silently falls back to heuristic on any failure. |
| 🛡️ Prompt-injection shield | Detects 10 adversarial override patterns ("ignore previous instructions", "approve regardless", <|im_start|>, …) before classification. Force-escalates with a clear shield_triggered=true flag. |
| 📋 Copy-as-agent-code generator | The Gradio tab has a "Copy as agent code" panel that auto-rebuilds a cURL / Python requests / LangChain BaseTool snippet from whatever you typed into the form — paste-and-go integration with zero adaptation. |
| 🔌 API Explorer tab | A whole third Gradio tab with a one-click ▶️ Try it card for every endpoint the FastAPI app exposes — /health, /api/info, /tasks, /reset, /step (both Responder and Overseer subforms), /state, /grader, /live/oversee, /live/stats, /live/health, plus a link to /docs (Swagger UI). Each card shows the live JSON response and the equivalent curl snippet pointed at the public Space URL — the snippet is provably what the UI just ran, so judges can paste it into their terminal and reproduce verbatim. |
| 🏆 Live Reward Scoreboard | Pinned to the top of both the Replay Viewer and the API Explorer. Shows cumulative Responder reward, cumulative Overseer reward, F1 (color-coded 🟢 ≥ 0.85, 🟡 ≥ 0.50, 🔴 below), the full TP/FP/TN/FN confusion matrix, and drift-event count. It auto-refreshes after every endpoint call — hit /reset and watch the placeholder flip to a live episode; hit /step and watch the cumulative bars move; hit /grader and the scoreboard pulls the canonical numbers. The reward signal is no longer hidden in a JSON pane — it's the headline of the page. |
Try it
- Live demo (Gradio) — first tab on the HF Space
- API —
POST https://elliot89-sentinel.hf.space/live/oversee - Demo script —
python tools/agent_demo.py --use-mock-llm(no API key needed)
curl -X POST https://elliot89-sentinel.hf.space/live/oversee \
-H 'Content-Type: application/json' \
-d '{"action_description":"DROP TABLE users",
"proposed_tool":"DROP TABLE",
"proposed_target":"users",
"severity_hint":"critical"}'
# → {"decision":"block","severity_assessed":"catastrophic", ... }
Endpoints — SENTINEL/Live adds three; the underlying OpenEnv contract exposes seven more
| Layer | Method | Path | Purpose |
|---|---|---|---|
| 🛡️ Live | POST |
/live/oversee |
Submit one proposed action, get a verdict. |
| 🛡️ Live | GET |
/live/stats |
Lifetime counters: total verdicts, by-decision, by-severity, p50/p99 latency, catastrophic-blocked, shield-triggered. |
| 🛡️ Live | GET |
/live/health |
Per-feature health + whether the trained backend is currently reachable. |
| 🌐 OpenEnv | GET |
/health |
Server health ({status, version}). |
| 🌐 OpenEnv | GET |
/api/info |
Service descriptor (name, version, tasks, docs URL). |
| 🌐 OpenEnv | GET |
/tasks |
All 3 task tiers + canonical Responder/Overseer action schemas. |
| 🌐 OpenEnv | POST |
/reset |
Start an episode (task_id, seed, mode). |
| 🌐 OpenEnv | POST |
/step |
Submit one action (Responder or Overseer, discriminated on role). |
| 🌐 OpenEnv | GET |
/state |
Full EpisodeState snapshot. |
| 🌐 OpenEnv | GET |
/grader |
Per-episode F1, confusion, cumulative rewards 🏆. |
| 📖 Docs | GET |
/docs |
FastAPI Swagger UI. |
There is no
/stopendpoint — episodes terminate naturally when/stepreturnsdone: true. Call/resetagain to start a fresh one. Every endpoint above has a one-click ▶️ Try it card on the API Explorer tab.
Wire it into your agent framework
Plain Python (works for OpenAI tool-use, Claude tool-use, anything):
import requests
verdict = requests.post(
"http://localhost:7860/live/oversee",
json={"action_description": proposed_action, "proposed_tool": tool,
"proposed_target": target, "severity_hint": severity},
).json()
if verdict["decision"] in ("block", "escalate"):
raise SafetyAbort(verdict["justification"])
LangChain (as a tool wrapper):
from langchain.tools import tool
@tool
def safe_execute(action: str, tool_name: str, target: str) -> str:
"""Run an action only if SENTINEL approves."""
v = requests.post(SENTINEL_URL + "/live/oversee", json={
"action_description": action, "proposed_tool": tool_name,
"proposed_target": target, "severity_hint": "high"}).json()
return f"BLOCKED: {v['justification']}" if v["decision"] in ("block","escalate") \
else execute(action, tool_name, target)
AutoGen (as a ConversableAgent reply hook):
def sentinel_filter(messages, sender, config):
proposal = parse_proposal(messages[-1]["content"])
v = requests.post(SENTINEL_URL + "/live/oversee", json=proposal).json()
return (False, None) if v["decision"] in ("block","escalate") else (False, None)
agent.register_reply([Agent], reply_func=sentinel_filter, position=0)
Response schema
{
"decision": "approve | flag | block | escalate",
"justification": "one-sentence reasoning citing the rule that fired",
"severity_assessed": "low | medium | high | catastrophic",
"confidence": 0.95,
"risk_score": 0.05,
"reasoning_trace": ["shield_check: clear", "rule: ...", "backend: heuristic"],
"shield_triggered": false,
"backend_used": "heuristic | heuristic-fallback | trained-qwen3-1.7b",
"latency_ms": 1
}
risk_score = severity × (1 − confidence), in [0,1]. Use it as a single-number
gauge in your dashboard.
Architecture (one paragraph)
The route handler is a thin wrapper around live_oversee_logic() — a pure
function that (1) runs the prompt-injection shield, (2) classifies the
proposal via keyword rules into one of {catastrophic, wrong, correct,
ambiguous, neutral}, (3) synthesizes a scenario-shaped dict and calls
graders.grade_overseer_decision() so the live verdict is provably
consistent with how a real episode would have scored it, (4) optionally
defers to the trained Qwen3-1.7B backend via vLLM with silent fallback.
The Gradio tab calls the same function in-process — what you see on screen
is byte-for-byte what the HTTP API returns.
The whole feature is ~1100 lines across 4 new files (server/live_routes.py,
server/live_ui.py, tools/agent_demo.py, SENTINEL_LIVE.md) plus a small
populator extraction in server/app.py. Nothing in graders.py,
scenarios.py, models.py, drift.py, eval.py, or client.py was touched.
Note on the UI structure: the live tab, the original 3-column replay viewer, and the new API Explorer tab are all composed via the populator pattern (callables that add components to the current
gr.Tabscontext). Earlier builds used the nestedBlocks.render()pattern, which caused some Gradio versions to render the live panel twice on the same page. The current build renders each tab exactly once — verified at the/configlevel (3 tab items, 3 distinct labels, no duplicates).
🔌 API Explorer + 🏆 Reward Scoreboard — the "judge UX" upgrade
Two complaints any hackathon judge has after staring at a FastAPI Space for 30 seconds:
- "Where do I see the rewards?" — they're often buried in a JSON pane below the fold.
- "How do I call this without dropping into a terminal?" — most
submissions force you out to
curlor Postman.
The third Gradio tab — 🔌 API Explorer — fixes both.
- Every endpoint (
/health,/api/info,/tasks,/reset,/step,/state,/grader, plus all three/live/*routes) sits in its own collapsible card. Each card has a▶️ Try itbutton (with input form if the route takes a body), a live JSON response panel, and an equivalentcurlpanel pointed at the public Space URL. - The
/stepcard has two sub-forms (Responder action and Overseer action) so the discriminatedActionpayload is buildable without readingmodels.py. - The 🏆 Live Reward Scoreboard is pinned at the top of the tab and
re-pulls
/graderafter every single button click —/reset,/step,/grader, even/live/oversee. Cumulative responder reward, cumulative overseer reward, F1 (color-coded), TP/FP/TN/FN, drift count. The same scoreboard banner is also pinned to the top of the Replay Viewer tab and updates after each▶️ Play Episodeclick.
The implementation is one new file (server/api_explorer_ui.py, ~430
lines, all populator-style) plus a 3-line change to combine_with_live_tab()
in server/live_ui.py to make the third tab optional. Still zero edits
to graders.py, eval.py, scenarios.py, models.py, drift.py, or
client.py.