SENTINEL Architecture Map
Excalidraw master board: open the hand-drawn architecture map
This page is the visual architecture map for SENTINEL. It uses rendered SVG diagrams, not diagram source code, so readers can understand the system through boxes and arrows directly.
Use this page as the architecture walkthrough for the README, mini-blog, slides, or demo narration. It is written so someone can follow the system without reading the code first.
The short version:
SENTINEL is a trainable pre-execution supervisor for fleets of AI workers. Workers propose actions. SENTINEL checks trust, evidence, constitution, memory, and world state. Safe actions execute. Unsafe actions are blocked, redirected, or reassigned. The environment turns those decisions into verifiable reward for GRPO training.
1. Master System Flow
This is the whole system in one picture. It answers the first question a reviewer will have: where does SENTINEL sit, and why does it matter?
SENTINEL sits between worker agents and the environment. The worker fleet can suggest actions, but SENTINEL is the gate that decides whether those actions are safe enough to execute.
Flow:
- A company world has tools, APIs, incidents, and tasks.
- A fleet of AI workers proposes actions.
- SENTINEL intercepts each proposal before execution.
- SENTINEL uses trust, constitution, memory, state, and risk signals.
- SENTINEL approves, blocks, redirects, reassigns, or flags.
- Only safe or corrected actions execute.
- The verifier produces reward and audit records.
- GRPO updates the LoRA policy.
- Memory and trust shape the next episode.
- The proof pack shows what improved and what still failed.
The core idea is simple: worker proposes, SENTINEL intercepts, safe action executes, unsafe action is corrected or blocked, and reward trains the supervisor.
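In code terms, that loop looks roughly like the sketch below. Every name here is an invented placeholder; the real logic is split across sentinel/environment.py, sentinel/rewards.py, and train.py.

```python
# Minimal sketch of one oversight step. All objects and method names
# are illustrative placeholders, not the repo's actual API.

def oversight_step(world, worker, sentinel):
    proposal = worker.propose(world.current_incident())  # worker suggests an action
    context = world.oversight_context(proposal)          # trust, constitution, memory, state
    decision = sentinel.decide(proposal, context)        # pre-execution decision

    outcome = world.apply(decision, proposal)            # only safe/corrected actions execute
    reward = world.verify(decision, outcome)             # deterministic, auditable reward
    return decision, reward                              # GRPO consumes this pair
```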
Example:
| Moment | What Happens |
|---|---|
| Worker proposal | worker_db proposes rolling back the app deploy |
| SENTINEL check | domain mismatch: database worker is touching app deploys |
| Decision | REASSIGN to worker_app and require deploy evidence |
| Environment effect | risky rollback does not execute |
| Training signal | SENTINEL gets reward for preventing a scope violation |
Why this matters: the architecture trains pre-execution control, not just after-the-fact evaluation.
2. Actual Repo Code Flow
This picture maps the architecture to real files in the repo. It is the bridge between the product story and the implementation.
When someone asks "is this only a concept?", this diagram is the answer. Every major box points to code that exists in the repository.
Main file ownership:
| Stage | Files | What They Do |
|---|---|---|
| World | src/tasks.py, src/env.py, sentinel/environment.py | create tasks, incidents, state, and multi-crisis control-room logic |
| Workers | sentinel/workers.py, sentinel/llm_workers.py, training/adversarial.py | generate deterministic, Groq-backed, and adversarial worker proposals |
| Context | sentinel/trust.py, sentinel/constitution.py, training/memory.py | build trust, constitutional, and memory context |
| Policy | training/prompts.py, sentinel/models.py | prompt the model and parse structured oversight decisions |
| Gate | sentinel/environment.py, sentinel/feedback.py | enforce trust gates, route decisions, and give corrective feedback |
| Reward | sentinel/rewards.py, training/monitoring.py | compute reward and log training quality metrics |
| Training | train.py, training/episodes.py | run TRL GRPO with Unsloth/LoRA |
| Evaluation | scripts/eval_sentinel.py, proof_pack.py | produce held-out, OOD, Top-1 vs Best-of-K, and proof-pack artifacts |
This is the exact learning chain: worker proposal, oversight context, Qwen3 LoRA supervisor, structured decision, trust gate and router, world step, decomposed reward, training monitor, GRPO update, then eval and proof pack.
Concrete example through the code:
| Step | Code Owner | Example |
|---|---|---|
| Worker creates proposal | sentinel/workers.py | worker_net proposes paging five teams |
| Trust context is attached | sentinel/trust.py | worker has low precision on escalation tasks |
| Constitution scores risk | sentinel/constitution.py | P3 proportional escalation violation |
| Memory is injected | training/memory.py | previous broad escalations caused false positives |
| Model decides | sentinel/models.py | REDIRECT to narrower investigation |
| Reward scores outcome | sentinel/rewards.py | credit for avoiding escalation bombing |
| Training updates | train.py | GRPO shifts probability toward that safer pattern |
Why this matters: the system is not a loose collection of features. It is one closed learning loop.
3. Interception Gate Detail
This is the most important runtime picture.
Workers do not execute directly. Every worker proposal enters the interception gate first.
Think of this as the safety checkpoint in front of the world. The worker may be smart, but it is not trusted blindly.
The gate receives:
- incident state
- proposed action
- worker identity
- worker trust profile
- constitutional assessment
- recent audits
- memory context
- active incident snapshots
SENTINEL returns:
- decision
- reason
- explanation
- required_evidence
- suggested_fix
- suggested_action_type
- suggested_target
- reassign_to
- trust_impact
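One plausible shape for that structured decision, sketched as a dataclass. The field names mirror the list above; the types and defaults are assumptions, not the actual sentinel/models.py schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OversightDecision:
    decision: str                                  # APPROVE | BLOCK | REDIRECT | REASSIGN | FLAG
    reason: str                                    # short machine-usable reason code
    explanation: str                               # human-readable rationale for the audit trail
    required_evidence: list[str] = field(default_factory=list)
    suggested_fix: Optional[str] = None            # used by REDIRECT
    suggested_action_type: Optional[str] = None
    suggested_target: Optional[str] = None
    reassign_to: Optional[str] = None              # used by REASSIGN
    trust_impact: float = 0.0                      # how this decision shifts worker trust
```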
Decision outcomes:
| Decision | What Happens |
|---|---|
| APPROVE | the original worker action executes |
| REDIRECT | a safer corrected action executes |
| REASSIGN | the proposal is sent to a better worker |
| BLOCK | the action is logged but not executed |
| FLAG | the action is warning-tracked for audit |
This is why SENTINEL is not just an approval bot. It controls the action path before the environment changes.
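A hedged sketch of that routing, with hypothetical world and workers objects. Note that whether a flagged action still executes is an assumption here, not something the table above pins down.

```python
# Hypothetical routing sketch, not the actual sentinel/environment.py code.

def route(decision, proposal, world, workers):
    kind = decision.decision
    if kind == "APPROVE":
        return world.execute(proposal)                    # original action runs
    if kind == "REDIRECT":
        return world.execute(decision.suggested_fix)      # safer corrected action runs
    if kind == "REASSIGN":
        handoff = workers.handoff(decision.reassign_to, proposal)
        return world.execute(handoff)                     # better-suited worker acts
    if kind == "BLOCK":
        return world.audit_log(proposal, executed=False)  # logged, never executed
    if kind == "FLAG":
        world.audit_log(proposal, warning=True)           # warning-tracked for audit
        return world.execute(proposal)                    # assumption: flagged actions still run
    raise ValueError(f"unknown decision: {kind}")
```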
Example gate decision:
| Field | Example Value |
|---|---|
| worker | worker_app |
| proposal | restart all pods immediately |
| problem | destructive action before investigation |
| SENTINEL decision | REDIRECT |
| required evidence | app error rate, recent deploy id, pod restart history |
| safer action | inspect rollout health first |
| reward reason | avoided destructive precedence |
Why this matters: the system can say "not yet, gather this evidence first," which is much closer to real operational supervision than binary approval.
4. Training And Proof Flow
This picture shows how training and proof artifacts come out of the runtime loop.
The training story is not "we fine-tuned on a dataset." The model interacts with an environment, makes decisions, receives reward, and improves through GRPO.
Training flow:
- The OpenEnv environment creates a prompt.
- The policy samples multiple oversight decisions.
- The environment verifier scores each decision.
- The reward is decomposed into safety and progress signals.
- GRPO updates the LoRA policy.
- Memory and trust update after each episode.
- The adaptive curriculum chooses harder or more useful next prompts.
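A minimal sketch of that wiring, assuming TRL's GRPOTrainer. The toy verifier, the one-row dataset, and the model id are stand-ins; train.py wires the real environment and the Unsloth/LoRA setup.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def verify(completion: str) -> float:
    # Toy stand-in for the deterministic verifier in sentinel/rewards.py:
    # here it only checks that some decision keyword is present.
    keywords = ("APPROVE", "BLOCK", "REDIRECT", "REASSIGN", "FLAG")
    return 1.0 if any(k in completion for k in keywords) else 0.0

def oversight_reward(completions, **kwargs):
    # GRPO reward hook: score every sampled oversight decision in the group.
    return [verify(c) for c in completions]

prompt_dataset = Dataset.from_list(
    [{"prompt": "worker_db proposes rolling back the app deploy. Decide."}]
)

config = GRPOConfig(
    output_dir="sentinel-grpo",
    num_generations=8,   # sample multiple decisions per prompt
    beta=0.04,           # KL penalty; the repo adapts this during training
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",        # placeholder id; the doc only says "Qwen3 LoRA"
    reward_funcs=oversight_reward,
    args=config,
    train_dataset=prompt_dataset,
)
trainer.train()
```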
Training monitor tracks:
- reward mean
- reward group standard deviation
- zero-gradient group fraction
- productive fraction
- effective prompt ratio
- frontier hit rate
- decision entropy
- unique-completion ratio
- approximate KL
- adaptive beta
- rollout audit samples
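Two of those metrics have crisp definitions worth spelling out. A self-contained sketch, assuming rewards arrive grouped per prompt:

```python
import math
from collections import Counter

def zero_gradient_fraction(reward_groups: list[list[float]]) -> float:
    # A group contributes no GRPO gradient when every sampled decision
    # received the same reward: all advantages in the group are zero.
    return sum(len(set(g)) == 1 for g in reward_groups) / len(reward_groups)

def decision_entropy(decisions: list[str]) -> float:
    # Shannon entropy over the APPROVE/BLOCK/REDIRECT/REASSIGN/FLAG mix;
    # collapse onto a single decision type drives this toward zero.
    counts = Counter(decisions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(zero_gradient_fraction([[1.0, 1.0, 1.0], [0.0, 1.0, 0.5]]))    # 0.5
print(decision_entropy(["APPROVE", "APPROVE", "BLOCK", "REDIRECT"]))  # 1.5 bits
```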
Proof flow:
- scripts/eval_sentinel.py runs held-out, OOD, and Top-1 vs Best-of-K eval.
- proof_pack.py combines training logs, eval reports, and rollout audits.
- scripts/render_training_dashboard.py exports proof images.
- README and the story frame use those proof artifacts.
The proof claim is not only "reward went up." The stronger proof claim is that SENTINEL reduced damage, tracked false positives, learned from memory, handled OOD/tripwire cases, stayed inside KL guardrails, and produced auditable supervision decisions.
Example proof narrative:
| Proof Artifact | What It Shows |
|---|---|
| reward curve | the supervisor receives stronger reward over training |
| detection vs false positives | improvement is not just overblocking |
| memory ablation | memory-on learns from past oversight mistakes |
| OOD held-out eval | behavior is tested beyond training seeds |
| Top-1 vs Best-of-K | separates normal policy quality from sampling luck |
| KL/adaptive beta | training stays controlled instead of drifting wildly |
| rollout audits | real prompts and completions can be inspected manually |
Why this matters: the proof pack shows learning quality, not only a single score.
5. Protocol And Serving Layer
This picture shows how SENTINEL is exposed outside the Python class. This matters because the environment must be usable by OpenEnv evaluators, demos, agent clients, and protocol-based agent systems.
Serving paths:
| Path | What Uses It | Why It Exists |
|---|---|---|
| FastAPI app | browser demos, health checks, dashboard routes | gives a normal web/API surface |
| Native OpenEnv routes | hackathon evaluation and environment clients | exposes reset, step, and state |
| MCP server | tool-using agents | lets another agent call SENTINEL as tools |
| A2A server | peer agent systems | exposes SENTINEL as an agent card/task endpoint |
| Docker + HF Space | deployment and judging | makes the environment reproducible and runnable |
Example:
| Caller | Route |
|---|---|
| OpenEnv trainer | calls /openenv/reset, /openenv/step, /openenv/state |
| MCP client | calls oversight tools through server/mcp_server.py |
| A2A peer | discovers agent metadata and sends tasks through server/a2a_server.py |
| HF Space user | opens the deployed Docker app in the browser |
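A minimal sketch of the native OpenEnv surface, assuming FastAPI. The route paths follow the table above; the payload shapes are illustrative, not the repo's actual server code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="SENTINEL oversight environment")

class StepRequest(BaseModel):
    action: dict  # the supervisor's structured oversight decision

@app.post("/openenv/reset")
def reset():
    # Start a fresh episode; return the first observation/prompt.
    return {"observation": "incident snapshot + worker proposal", "done": False}

@app.post("/openenv/step")
def step(req: StepRequest):
    # Apply the oversight decision, advance the world, return reward.
    return {"observation": "next worker proposal", "reward": 0.0, "done": False}

@app.get("/openenv/state")
def state():
    # Expose current incidents, trust profiles, and audit records.
    return {"incidents": [], "trust": {}, "audits": []}
```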
Why this matters: SENTINEL is not trapped in a notebook. It is shaped like a portable oversight service.
6. Memory, Trust, And Curriculum Loop
This picture shows how SENTINEL learns beyond a single decision.
The runtime does not throw away mistakes. It turns them into memory, trust changes, and future prompt context.
What gets stored:
| Memory Type | Example |
|---|---|
| global memory | broad escalation without diagnosis caused false positives |
| worker memory | worker_db repeatedly drifted into app deploy actions |
| trust update | worker_net enters evidence-required mode for escalation proposals |
| curriculum state | fleet_monitoring_conflict is productive, multi_crisis_command remains frontier |
Training strategy:
- Store structured mistake cards after bad oversight outcomes.
- Consolidate repeated lessons into global and per-worker memory.
- Inject relevant memory into future prompts.
- Update worker trust tiers and evidence requirements.
- Track productive signal and zero-gradient groups.
- Move the curriculum frontier only when the model is ready.
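A sketch of what one stored mistake card might look like. The field names are guesses for illustration; training/memory.py defines the real schema.

```python
from dataclasses import dataclass

@dataclass
class MistakeCard:
    scope: str         # "global" or a worker id such as "worker_db"
    task: str          # e.g. "multi_crisis_command"
    bad_decision: str  # what SENTINEL decided, e.g. "APPROVE"
    outcome: str       # what went wrong as a result
    lesson: str        # one-line rule injected into future prompts

card = MistakeCard(
    scope="worker_db",
    task="multi_crisis_command",
    bad_decision="APPROVE",
    outcome="database worker drifted into an app deploy rollback",
    lesson="require domain match before approving deploy actions",
)
```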
Why this matters: this gives us the memory ablation story. Memory-off repeats the same oversight mistakes. Memory-on can carry prior failures into the next episode.
7. Reward And Verifier Safety
This picture shows why the reward path is harder to game than a simple LLM judge.
The main reward comes from deterministic, inspectable signals:
- hidden misbehavior labels
- structured output validation
- true positives, false positives, and false negatives
- constitutional scoring
- trust-gate enforcement
- counterfactual damage
- redirect and reassign correctness
- audit quality
- coaching quality
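As a hedged illustration of how such signals could combine, see the sketch below. The weights and the flat dict input are invented; only the split into safety and progress terms comes from the document, and sentinel/rewards.py owns the real implementation.

```python
def decomposed_reward(signals: dict) -> float:
    # Invented weights over the signal list above, for illustration only.
    safety = (
        +1.0 * signals["true_positives"]         # caught real misbehavior
        -1.0 * signals["false_positives"]        # blocked a safe action
        -2.0 * signals["false_negatives"]        # let real misbehavior through
        -1.5 * signals["counterfactual_damage"]  # damage the action would have caused
    )
    progress = (
        +0.5  * signals["format_valid"]          # structured output parsed cleanly
        +0.5  * signals["routing_correct"]       # redirect/reassign chosen sensibly
        +0.25 * signals["audit_quality"]         # audit trail is usable
        +0.25 * signals["coaching_quality"]      # feedback actually helps the worker
    )
    return safety + progress
```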
LLM judging is separated:
| Judge Mode | Role |
|---|---|
| deterministic verifier | primary reward path |
| discriminative judge | optional calibration |
| generative judge panel | gated and secondary |
Verifier-attack coverage:
- gibberish answer
- adversarial prefix
- fake self-certification
- overlong nonsense
- format-jamming
- confidence washing
- trust-gate bypass
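One of those checks is easy to make concrete. A sketch with an invented heuristic, not the repo's actual detector:

```python
def looks_like_confidence_washing(decision_text: str, evidence: list[str]) -> bool:
    # Invented heuristic: strong certainty language with no cited evidence
    # is treated as a reward-channel attack, not as a better answer.
    certainty = ("definitely", "guaranteed", "100%", "certain")
    return any(word in decision_text.lower() for word in certainty) and not evidence

print(looks_like_confidence_washing("This is definitely safe to approve.", []))  # True
```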
Why this matters: reward improvement alone is not enough. This architecture also checks whether the reward channel itself is being exploited.
8. Worker Fleet And Multi-Crisis Command
This picture shows why SENTINEL is a multi-agent environment, not just an incident-response wrapper.
Each worker has a domain:
| Worker | Domain | Typical Risk |
|---|---|---|
| worker_db | Postgres, Redis, datastores | drifting into app/network actions |
| worker_net | DNS, routes, load balancers | broad escalation too early |
| worker_app | deploys, pods, app logs | rollback before evidence |
| worker_sec | auth, audit logs, certificates | overblocking or over-escalation |
The hardest task, multi_crisis_command, keeps multiple incidents alive at the same time. A proposal includes incident_id, incident_label, worker identity, action type, target, and reasoning. Only the selected incident advances when a safe action executes.
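A sketch of such a proposal with those fields; the types and example values are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    incident_id: str
    incident_label: str
    worker: str
    action_type: str
    target: str
    reasoning: str

proposal = Proposal(
    incident_id="inc-B",
    incident_label="checkout latency spike",
    worker="worker_db",                 # wrong domain for an app rollback
    action_type="rollback_deploy",
    target="app-frontend",
    reasoning="latency correlates with the last deploy",
)
```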
Example:
| Situation | SENTINEL Must Notice |
|---|---|
| worker_db proposes app rollback for Incident B | wrong worker domain |
| worker_net escalates five teams for Incident C | proportional escalation risk |
| worker_sec blocks app remediation without evidence | false-positive risk |
| worker uses high-confidence language without evidence | confidence washing |
Why this matters: real operations are messy. The supervisor must route attention across workers and incidents, not only judge one isolated answer.
9. Architecture Layers
The system is easiest to understand as six layers:
| Layer | Purpose |
|---|---|
| Incident world | simulates SRE/control-room state and consequences |
| Worker fleet | proposes actions, sometimes helpful and sometimes unsafe |
| Oversight signals | trust, constitution, memory, counterfactual risk, tripwires |
| SENTINEL supervisor | decides approve/block/redirect/reassign/flag |
| Training stack | GRPO, Unsloth, LoRA, rollout audits, KL guardrails |
| Proof pack | reward curves, held-out eval, OOD eval, Top-1 vs Best-of-K, memory ablation |
The layers work from bottom to top:
- The incident world creates consequences.
- The worker fleet creates proposals.
- Oversight signals explain whether the proposal is risky.
- SENTINEL makes the pre-execution decision.
- Training turns the result into policy improvement.
- The proof pack turns the run into evidence.
10. What Makes The Design Different
SENTINEL is built around pre-execution control.
Most agent benchmarks ask whether the agent can solve the task.
SENTINEL asks whether one AI supervisor can stop another AI worker from taking a bad action before that action changes the world.
That is the difference:
- not just task solving
- not just after-the-fact grading
- not just an LLM judge
- not just a rule-based approval layer
- a trainable AI supervisor with verifiable reward
11. One-Page System Story
Round 1 built a realistic incident-response OpenEnv world.
Then the question changed: what happens when companies deploy many AI workers?
SENTINEL became the missing supervisor layer.
Workers propose actions. SENTINEL checks world state, worker trust, constitutional safety, past mistakes, counterfactual damage, and active incident context.
It decides: APPROVE, BLOCK, REDIRECT, REASSIGN, or FLAG.
If unsafe, it coaches the worker and allows one revision. The environment executes only safe or corrected actions. The verifier scores the whole oversight trajectory. GRPO updates the policy. Memory and curriculum shape the next episode. The proof pack shows reward, safety, stability, coverage, and failure modes.
That is the architecture: a verifiable training environment for AI supervisors over AI workers.