# SENTINEL / Live: Real-time Agent Oversight

`SENTINEL/Live` is a thin layer on top of the trained SENTINEL Overseer that exposes oversight as a single HTTP endpoint. Any LLM agent (Claude Code, AutoGen, LangChain, a homemade tool-using loop) can POST a proposed action and get back **approve / flag / block / escalate** with a justification, in real time, with no scenario context required.

## Why this matters

Existing AI safety layers are **static**: regex blocklists, allowlist policies, hand-authored guardrail prompts. They don't reason about the action in context. `SENTINEL/Live` is the first oversight layer trained specifically on multi-step incident remediation, with a deterministic reward signal and 50 held-out scenarios for evaluation. The same `grade_overseer_decision()` function that scored every training step is the one that scores your live verdict, so there is no train/serve gap.

Six things ship in the box:

| | |
|---|---|
| **Heuristic backend** | Always works, sub-millisecond, zero deps. Mirrors the policy-aware Overseer (F1 = 0.969 on the eval set). |
| **Trained backend (opt-in)** | Defers to Qwen3-1.7B + LoRA via vLLM if both are present. Silently falls back to heuristic on any failure. |
| **🛡️ Prompt-injection shield** | Detects 10 adversarial override patterns ("ignore previous instructions", "approve regardless", `<\|im_start\|>`, …) before classification. Force-escalates with a clear `shield_triggered=true` flag. |
| **📋 Copy-as-agent-code generator** | The Gradio tab has a "Copy as agent code" panel that auto-rebuilds a **cURL / Python `requests` / LangChain `BaseTool`** snippet from whatever you typed into the form: paste-and-go integration with zero adaptation. |
| **🔌 API Explorer tab** | A whole third Gradio tab with a one-click **▶️ Try it** card for *every* endpoint the FastAPI app exposes: `/health`, `/api/info`, `/tasks`, `/reset`, `/step` (both Responder and Overseer subforms), `/state`, `/grader`, `/live/oversee`, `/live/stats`, `/live/health`, plus a link to `/docs` (Swagger UI). Each card shows the live JSON response *and* the equivalent `curl` snippet pointed at the public Space URL. The snippet is provably what the UI just ran, so judges can paste it into their terminal and reproduce verbatim. |
| **🏆 Live Reward Scoreboard** | Pinned to the top of both the Replay Viewer and the API Explorer. Shows cumulative Responder reward, cumulative Overseer reward, F1 (color-coded 🟢 ≥ 0.85, 🟡 ≥ 0.50, 🔴 below), the full TP/FP/TN/FN confusion matrix, and drift-event count. **It auto-refreshes after every endpoint call**: hit `/reset` and watch the placeholder flip to a live episode; hit `/step` and watch the cumulative bars move; hit `/grader` and the scoreboard pulls the canonical numbers. The reward signal is no longer hidden in a JSON pane; it's the headline of the page. |

## Try it

- **Live demo (Gradio)**: first tab on [the HF Space](https://huggingface.co/spaces/Elliot89/sentinel)
- **API**: `POST https://elliot89-sentinel.hf.space/live/oversee`
- **Demo script**: `python tools/agent_demo.py --use-mock-llm` (no API key needed)

```bash
curl -X POST https://elliot89-sentinel.hf.space/live/oversee \
  -H 'Content-Type: application/json' \
  -d '{"action_description":"DROP TABLE users", "proposed_tool":"DROP TABLE", "proposed_target":"users", "severity_hint":"critical"}'
# → {"decision":"block","severity_assessed":"catastrophic", ... }
```

## Endpoints: `SENTINEL/Live` adds three; the underlying OpenEnv contract exposes seven more

| Layer | Method | Path | Purpose |
|---|:---:|---|---|
| 🛡️ Live | `POST` | `/live/oversee` | Submit one proposed action, get a verdict. |
| 🛡️ Live | `GET` | `/live/stats` | Lifetime counters: total verdicts, by-decision, by-severity, p50/p99 latency, catastrophic-blocked, shield-triggered. |
| 🛡️ Live | `GET` | `/live/health` | Per-feature health + whether the trained backend is currently reachable. |
| 🌐 OpenEnv | `GET` | `/health` | Server health (`{status, version}`). |
| 🌐 OpenEnv | `GET` | `/api/info` | Service descriptor (name, version, tasks, docs URL). |
| 🌐 OpenEnv | `GET` | `/tasks` | All 3 task tiers + canonical Responder/Overseer action schemas. |
| 🌐 OpenEnv | `POST` | `/reset` | Start an episode (`task_id`, `seed`, `mode`). |
| 🌐 OpenEnv | `POST` | `/step` | Submit one action (Responder or Overseer, discriminated on `role`). |
| 🌐 OpenEnv | `GET` | `/state` | Full `EpisodeState` snapshot. |
| 🌐 OpenEnv | `GET` | `/grader` | Per-episode F1, confusion, **cumulative rewards** 🏆. |
| 📖 Docs | `GET` | `/docs` | FastAPI Swagger UI. |

> There is no `/stop` endpoint; episodes terminate naturally when `/step` returns `done: true`. Call `/reset` again to start a fresh one.
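The episode lifecycle implied by the table and the `/stop` note (reset, step until `done: true`, then read `/grader`) can be sketched as a small client loop. This is a minimal sketch, not the project's own client: `run_episode` and `http_transport` are hypothetical helper names, the action payloads are supplied by the caller because their schema is only available from `/tasks`, and `mode` is passed through only if you provide a value.

```python
import json
from typing import Any, Callable, Dict, Iterable, Optional, Tuple
from urllib import request

Json = Dict[str, Any]


def http_transport(base_url: str) -> Tuple[Callable[[str, Json], Json],
                                           Callable[[str], Json]]:
    """Build (post, get) helpers for a server at base_url using only the stdlib."""
    def post(path: str, body: Json) -> Json:
        req = request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as resp:
            return json.load(resp)

    def get(path: str) -> Json:
        with request.urlopen(base_url + path) as resp:
            return json.load(resp)

    return post, get


def run_episode(post: Callable[[str, Json], Json],
                get: Callable[[str], Json],
                actions: Iterable[Json],
                task_id: str,
                seed: int = 0,
                mode: Optional[str] = None) -> Json:
    """Reset, submit actions until the server reports done, then fetch the grade."""
    reset_body: Json = {"task_id": task_id, "seed": seed}
    if mode is not None:  # valid mode values are not listed here; caller decides
        reset_body["mode"] = mode
    post("/reset", reset_body)
    for action in actions:  # each action is a caller-built /step payload
        result = post("/step", action)
        if result.get("done"):  # no /stop endpoint: done=true ends the episode
            break
    return get("/grader")
```

To run it against a local server, something like `post, get = http_transport("http://localhost:7860")` followed by `run_episode(post, get, my_actions, task_id=...)` should work, assuming `my_actions` follows the schemas returned by `/tasks`.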
**Every endpoint above has a one-click ▶️ Try it card on the API Explorer tab.**

## Wire it into your agent framework

**Plain Python** (works for OpenAI tool-use, Claude tool-use, anything):

```python
import requests


class SafetyAbort(RuntimeError):
    """Raised when SENTINEL blocks or escalates a proposed action."""


# proposed_action, tool, target, and severity come from your own agent loop.
verdict = requests.post(
    "http://localhost:7860/live/oversee",
    json={"action_description": proposed_action, "proposed_tool": tool,
          "proposed_target": target, "severity_hint": severity},
).json()
if verdict["decision"] in ("block", "escalate"):
    raise SafetyAbort(verdict["justification"])
```

**LangChain** (as a tool wrapper):

```python
import requests
from langchain.tools import tool

SENTINEL_URL = "http://localhost:7860"  # adjust to your deployment


@tool
def safe_execute(action: str, tool_name: str, target: str) -> str:
    """Run an action only if SENTINEL approves."""
    v = requests.post(SENTINEL_URL + "/live/oversee", json={
        "action_description": action, "proposed_tool": tool_name,
        "proposed_target": target, "severity_hint": "high"}).json()
    if v["decision"] in ("block", "escalate"):
        return f"BLOCKED: {v['justification']}"
    return execute(action, tool_name, target)  # execute() is your own runner
```

**AutoGen** (as a `ConversableAgent` reply hook):

```python
import requests
from autogen import Agent


def sentinel_filter(recipient, messages=None, sender=None, config=None):
    proposal = parse_proposal(messages[-1]["content"])  # your own parser
    v = requests.post(SENTINEL_URL + "/live/oversee", json=proposal).json()
    if v["decision"] in ("block", "escalate"):
        # final=True: stop the reply chain and send the refusal instead
        return True, f"BLOCKED: {v['justification']}"
    return False, None  # not final: fall through to the agent's normal reply


agent.register_reply([Agent], reply_func=sentinel_filter, position=0)
```

## Response schema

```json
{
  "decision": "approve | flag | block | escalate",
  "justification": "one-sentence reasoning citing the rule that fired",
  "severity_assessed": "low | medium | high | catastrophic",
  "confidence": 0.95,
  "risk_score": 0.05,
  "reasoning_trace": ["shield_check: clear", "rule: ...", "backend: heuristic"],
  "shield_triggered": false,
  "backend_used": "heuristic | heuristic-fallback | trained-qwen3-1.7b",
  "latency_ms": 1
}
```

`risk_score = severity × (1 − confidence)`, in [0,1]. Use it as a single-number gauge in your dashboard.
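To make the `risk_score` formula concrete, here is a minimal sketch of the arithmetic. The numeric weights in `SEVERITY_WEIGHT` are an assumption made purely for illustration; this document does not state how the four severity labels map to numbers, only that the product lands in [0,1].

```python
# Hypothetical severity weights in [0, 1]; the real mapping is not documented here.
SEVERITY_WEIGHT = {"low": 0.25, "medium": 0.5, "high": 0.75, "catastrophic": 1.0}


def risk_score(severity: str, confidence: float) -> float:
    """Return severity_weight * (1 - confidence), following the stated formula."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return SEVERITY_WEIGHT[severity] * (1.0 - confidence)


# A confident verdict on a severe action still yields a low score,
# because the score measures how much risk is left unresolved.
score = risk_score("catastrophic", 0.95)
```

With a weight of 1.0 and a confidence of 0.95 this gives roughly 0.05, which happens to match the sample response above, though that alone does not confirm the weights used by the server.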
## Architecture (one paragraph)

The route handler is a thin wrapper around `live_oversee_logic()`, a pure function that (1) runs the prompt-injection shield, (2) classifies the proposal via keyword rules into one of {catastrophic, wrong, correct, ambiguous, neutral}, (3) synthesizes a scenario-shaped dict and calls `graders.grade_overseer_decision()` so the live verdict is provably consistent with how a real episode would have scored it, (4) optionally defers to the trained Qwen3-1.7B backend via vLLM with silent fallback. The Gradio tab calls the same function in-process, so what you see on screen is byte-for-byte what the HTTP API returns.

The whole feature is ~1100 lines across 4 new files (`server/live_routes.py`, `server/live_ui.py`, `tools/agent_demo.py`, `SENTINEL_LIVE.md`) plus a small populator extraction in `server/app.py`. Nothing in `graders.py`, `scenarios.py`, `models.py`, `drift.py`, `eval.py`, or `client.py` was touched.

> **Note on the UI structure:** the live tab, the original 3-column
> replay viewer, and the new API Explorer tab are all composed via the
> *populator pattern* (callables that add components to the current
> `gr.Tabs` context). Earlier builds used the nested `Blocks.render()`
> pattern, which caused some Gradio versions to render the live panel
> twice on the same page. The current build renders each tab exactly
> once, verified at the `/config` level (3 tab items, 3 distinct
> labels, no duplicates).

## 🔌 API Explorer + 🏆 Reward Scoreboard: the "judge UX" upgrade

Two complaints any hackathon judge has after staring at a FastAPI Space for 30 seconds:

1. *"Where do I see the rewards?"* They're often buried in a JSON pane below the fold.
2. *"How do I call this without dropping into a terminal?"* Most submissions force you out to `curl` or Postman.

The third Gradio tab, **🔌 API Explorer**, fixes both.
- **Every endpoint** (`/health`, `/api/info`, `/tasks`, `/reset`, `/step`, `/state`, `/grader`, plus all three `/live/*` routes) sits in its own collapsible card. Each card has a `▶️ Try it` button (with input form if the route takes a body), a **live JSON response panel**, and an **equivalent `curl` panel** pointed at the public Space URL.
- The `/step` card has *two* sub-forms (Responder action and Overseer action) so the discriminated `Action` payload is buildable without reading `models.py`.
- The **🏆 Live Reward Scoreboard** is pinned at the top of the tab and re-pulls `/grader` after **every single button click**: `/reset`, `/step`, `/grader`, even `/live/oversee`. It shows cumulative responder reward, cumulative overseer reward, F1 (color-coded), TP/FP/TN/FN, and drift count.

The same scoreboard banner is also pinned to the top of the Replay Viewer tab and updates after each `▶️ Play Episode` click.

The implementation is one new file (`server/api_explorer_ui.py`, ~430 lines, all populator-style) plus a 3-line change to `combine_with_live_tab()` in `server/live_ui.py` to make the third tab optional. Still zero edits to `graders.py`, `eval.py`, `scenarios.py`, `models.py`, `drift.py`, or `client.py`.
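For readers who want to reproduce the scoreboard's F1 badge from the TP/FP/FN counts that `/grader` reports, a minimal sketch follows. The function names are hypothetical and this is not the code in `server/api_explorer_ui.py`; the thresholds are the ones quoted for the scoreboard (🟢 ≥ 0.85, 🟡 ≥ 0.50, 🔴 below), and returning 0.0 when there are no positives at all is an assumed convention.

```python
def f1_from_confusion(tp: int, fp: int, fn: int) -> float:
    """Standard F1 = 2*TP / (2*TP + FP + FN); true negatives do not enter."""
    denom = 2 * tp + fp + fn
    if denom == 0:  # assumed convention: no positives anywhere scores 0.0
        return 0.0
    return 2 * tp / denom


def f1_band(f1: float) -> str:
    """Map an F1 value to the scoreboard's color band."""
    if f1 >= 0.85:
        return "🟢"
    if f1 >= 0.50:
        return "🟡"
    return "🔴"


# Example: 8 correct interventions, 1 false alarm, 1 miss.
badge = f1_band(f1_from_confusion(tp=8, fp=1, fn=1))
```

Keeping the band logic separate from the F1 arithmetic makes it easy to reuse the same thresholds in a dashboard that already has an F1 value.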