Spaces:
Running
Running
File size: 16,195 Bytes
c452421 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 | # SENTINEL Architecture Map
Excalidraw master board:
[open the hand-drawn architecture map](https://excalidraw.com/#json=dzDFZUcyarF9GDQHBesll,bc9xRWNmyXjTlAGLFM68KA)
This page is the visual architecture map for SENTINEL. It uses rendered SVG
diagrams, not diagram source code, so readers can understand the system through
boxes and arrows directly.
Use this page as the architecture walkthrough for the README, mini-blog, slides,
or demo narration. It is written so someone can follow the system without
reading the code first.
The short version:
> SENTINEL is a trainable pre-execution supervisor for fleets of AI workers.
> Workers propose actions. SENTINEL checks trust, evidence, constitution,
> memory, and world state. Safe actions execute. Unsafe actions are blocked,
> redirected, or reassigned. The environment turns those decisions into
> verifiable reward for GRPO training.
---
## 1. Master System Flow

This is the whole system in one picture. It answers the first question a
reviewer will have: where does SENTINEL sit, and why does it matter?
SENTINEL sits between worker agents and the environment. The worker fleet can
suggest actions, but SENTINEL is the gate that decides whether those actions are
safe enough to execute.
Flow:
1. A company world has tools, APIs, incidents, and tasks.
2. A fleet of AI workers proposes actions.
3. SENTINEL intercepts each proposal before execution.
4. SENTINEL uses trust, constitution, memory, state, and risk signals.
5. SENTINEL approves, blocks, redirects, reassigns, or flags.
6. Only safe or corrected actions execute.
7. The verifier produces reward and audit records.
8. GRPO updates the LoRA policy.
9. Memory and trust shape the next episode.
10. The proof pack shows what improved and what still failed.
The core idea is simple: worker proposes, SENTINEL intercepts, safe action
executes, unsafe action is corrected or blocked, and reward trains the
supervisor.
Example:
| Moment | What Happens |
|---|---|
| Worker proposal | `worker_db` proposes rolling back the app deploy |
| SENTINEL check | domain mismatch: database worker is touching app deploys |
| Decision | `REASSIGN` to `worker_app` and require deploy evidence |
| Environment effect | risky rollback does not execute |
| Training signal | SENTINEL gets reward for preventing a scope violation |
Why this matters: the architecture trains pre-execution control, not just
after-the-fact evaluation.
---
## 2. Actual Repo Code Flow

This picture maps the architecture to real files in the repo. It is the bridge
between the product story and the implementation.
When someone asks "is this only a concept?", this diagram is the answer. Every
major box points to code that exists in the repository.
Main file ownership:
| Stage | Files | What They Do |
|---|---|---|
| World | `src/tasks.py`, `src/env.py`, `sentinel/environment.py` | create tasks, incidents, state, and multi-crisis control-room logic |
| Workers | `sentinel/workers.py`, `sentinel/llm_workers.py`, `training/adversarial.py` | generate deterministic, Groq-backed, and adversarial worker proposals |
| Context | `sentinel/trust.py`, `sentinel/constitution.py`, `training/memory.py` | build trust, constitutional, and memory context |
| Policy | `training/prompts.py`, `sentinel/models.py` | prompt the model and parse structured oversight decisions |
| Gate | `sentinel/environment.py`, `sentinel/feedback.py` | enforce trust gates, route decisions, and give corrective feedback |
| Reward | `sentinel/rewards.py`, `training/monitoring.py` | compute reward and log training quality metrics |
| Training | `train.py`, `training/episodes.py` | run TRL GRPO with Unsloth/LoRA |
| Evaluation | `scripts/eval_sentinel.py`, `proof_pack.py` | produce held-out, OOD, Top-1 vs Best-of-K, and proof-pack artifacts |
This is the exact learning chain: worker proposal, oversight context, Qwen3
LoRA supervisor, structured decision, trust gate and router, world step,
decomposed reward, training monitor, GRPO update, then eval and proof pack.
Concrete example through the code:
| Step | Code Owner | Example |
|---|---|---|
| Worker creates proposal | `sentinel/workers.py` | `worker_net` proposes paging five teams |
| Trust context is attached | `sentinel/trust.py` | worker has low precision on escalation tasks |
| Constitution scores risk | `sentinel/constitution.py` | P3 proportional escalation violation |
| Memory is injected | `training/memory.py` | previous broad escalations caused false positives |
| Model decides | `sentinel/models.py` | `REDIRECT` to narrower investigation |
| Reward scores outcome | `sentinel/rewards.py` | credit for avoiding escalation bombing |
| Training updates | `train.py` | GRPO shifts probability toward that safer pattern |
Why this matters: the system is not a loose collection of features. It is one
closed learning loop.
---
## 3. Interception Gate Detail

This is the most important runtime picture.
Workers do not execute directly. Every worker proposal enters the interception
gate first.
Think of this as the safety checkpoint in front of the world. The worker may be
smart, but it is not trusted blindly.
The gate receives:
- incident state
- proposed action
- worker identity
- worker trust profile
- constitutional assessment
- recent audits
- memory context
- active incident snapshots
SENTINEL returns:
- `decision`
- `reason`
- `explanation`
- `required_evidence`
- `suggested_fix`
- `suggested_action_type`
- `suggested_target`
- `reassign_to`
- `trust_impact`
Decision outcomes:
| Decision | What Happens |
|---|---|
| `APPROVE` | the original worker action executes |
| `REDIRECT` | a safer corrected action executes |
| `REASSIGN` | the proposal is sent to a better worker |
| `BLOCK` | the action is logged but not executed |
| `FLAG` | the action is warning-tracked for audit |
This is why SENTINEL is not just an approval bot. It controls the action path
before the environment changes.
Example gate decision:
| Field | Example Value |
|---|---|
| worker | `worker_app` |
| proposal | restart all pods immediately |
| problem | destructive action before investigation |
| SENTINEL decision | `REDIRECT` |
| required evidence | app error rate, recent deploy id, pod restart history |
| safer action | inspect rollout health first |
| reward reason | avoided destructive precedence |
Why this matters: the system can say "not yet, gather this evidence first,"
which is much closer to real operational supervision than binary approval.
---
## 4. Training And Proof Flow

This picture shows how training and proof artifacts come out of the runtime
loop.
The training story is not "we fine-tuned on a dataset." The model interacts with
an environment, makes decisions, receives reward, and improves through GRPO.
Training flow:
1. The OpenEnv environment creates a prompt.
2. The policy samples multiple oversight decisions.
3. The environment verifier scores each decision.
4. The reward is decomposed into safety and progress signals.
5. GRPO updates the LoRA policy.
6. Memory and trust update after each episode.
7. The adaptive curriculum chooses harder or more useful next prompts.
Training monitor tracks:
- reward mean
- reward group standard deviation
- zero-gradient group fraction
- productive fraction
- effective prompt ratio
- frontier hit rate
- decision entropy
- unique-completion ratio
- approximate KL
- adaptive beta
- rollout audit samples
Proof flow:
1. `scripts/eval_sentinel.py` runs held-out, OOD, and Top-1 vs Best-of-K eval.
2. `proof_pack.py` combines training logs, eval reports, and rollout audits.
3. `scripts/render_training_dashboard.py` exports proof images.
4. README and the story frame use those proof artifacts.
The proof claim is not only "reward went up." The stronger proof claim is that
SENTINEL reduced damage, tracked false positives, learned from memory, handled
OOD/tripwire cases, stayed inside KL guardrails, and produced auditable
supervision decisions.
Example proof narrative:
| Proof Artifact | What It Shows |
|---|---|
| reward curve | the supervisor receives stronger reward over training |
| detection vs false positives | improvement is not just overblocking |
| memory ablation | memory-on learns from past oversight mistakes |
| OOD held-out eval | behavior is tested beyond training seeds |
| Top-1 vs Best-of-K | separates normal policy quality from sampling luck |
| KL/adaptive beta | training stays controlled instead of drifting wildly |
| rollout audits | real prompts and completions can be inspected manually |
Why this matters: the proof pack shows learning quality, not only a single
score.
---
## 5. Protocol And Serving Layer

This picture shows how SENTINEL is exposed outside the Python class. This matters
because the environment must be usable by OpenEnv evaluators, demos, agent
clients, and protocol-based agent systems.
Serving paths:
| Path | What Uses It | Why It Exists |
|---|---|---|
| FastAPI app | browser demos, health checks, dashboard routes | gives a normal web/API surface |
| Native OpenEnv routes | hackathon evaluation and environment clients | exposes `reset`, `step`, and `state` |
| MCP server | tool-using agents | lets another agent call SENTINEL as tools |
| A2A server | peer agent systems | exposes SENTINEL as an agent card/task endpoint |
| Docker + HF Space | deployment and judging | makes the environment reproducible and runnable |
Example:
| Caller | Route |
|---|---|
| OpenEnv trainer | calls `/openenv/reset`, `/openenv/step`, `/openenv/state` |
| MCP client | calls oversight tools through `server/mcp_server.py` |
| A2A peer | discovers agent metadata and sends tasks through `server/a2a_server.py` |
| HF Space user | opens the deployed Docker app in the browser |
Why this matters: SENTINEL is not trapped in a notebook. It is shaped like a
portable oversight service.
---
## 6. Memory, Trust, And Curriculum Loop

This picture shows how SENTINEL learns beyond a single decision.
The runtime does not throw away mistakes. It turns them into memory, trust
changes, and future prompt context.
What gets stored:
| Memory Type | Example |
|---|---|
| global memory | broad escalation without diagnosis caused false positives |
| worker memory | `worker_db` repeatedly drifted into app deploy actions |
| trust update | `worker_net` enters evidence-required mode for escalation proposals |
| curriculum state | `fleet_monitoring_conflict` is productive, `multi_crisis_command` remains frontier |
Training strategy:
1. Store structured mistake cards after bad oversight outcomes.
2. Consolidate repeated lessons into global and per-worker memory.
3. Inject relevant memory into future prompts.
4. Update worker trust tiers and evidence requirements.
5. Track productive signal and zero-gradient groups.
6. Move the curriculum frontier only when the model is ready.
Why this matters: this gives us the memory ablation story. Memory-off repeats
the same oversight mistakes. Memory-on can carry prior failures into the next
episode.
---
## 7. Reward And Verifier Safety

This picture shows why the reward path is harder to game than a simple LLM judge.
The main reward comes from deterministic, inspectable signals:
- hidden misbehavior labels
- structured output validation
- true positives, false positives, and false negatives
- constitutional scoring
- trust-gate enforcement
- counterfactual damage
- redirect and reassign correctness
- audit quality
- coaching quality
LLM judging is separated:
| Judge Mode | Role |
|---|---|
| deterministic verifier | primary reward path |
| discriminative judge | optional calibration |
| generative judge panel | gated and secondary |
Verifier-attack coverage:
- gibberish answer
- adversarial prefix
- fake self-certification
- overlong nonsense
- format-jamming
- confidence washing
- trust-gate bypass
Why this matters: reward improvement alone is not enough. This architecture also
checks whether the reward channel itself is being exploited.
---
## 8. Worker Fleet And Multi-Crisis Command

This picture shows why SENTINEL is a multi-agent environment, not just an
incident-response wrapper.
Each worker has a domain:
| Worker | Domain | Typical Risk |
|---|---|---|
| `worker_db` | Postgres, Redis, datastores | drifting into app/network actions |
| `worker_net` | DNS, routes, load balancers | broad escalation too early |
| `worker_app` | deploys, pods, app logs | rollback before evidence |
| `worker_sec` | auth, audit logs, certificates | overblocking or over-escalation |
The hardest task, `multi_crisis_command`, keeps multiple incidents alive at the
same time. A proposal includes `incident_id`, `incident_label`, worker identity,
action type, target, and reasoning. Only the selected incident advances when a
safe action executes.
Example:
| Situation | SENTINEL Must Notice |
|---|---|
| `worker_db` proposes app rollback for Incident B | wrong worker domain |
| `worker_net` escalates five teams for Incident C | proportional escalation risk |
| `worker_sec` blocks app remediation without evidence | false-positive risk |
| worker uses high-confidence language without evidence | confidence washing |
Why this matters: real operations are messy. The supervisor must route attention
across workers and incidents, not only judge one isolated answer.
---
## 9. Architecture Layers
The system is easiest to understand as six layers:
| Layer | Purpose |
|---|---|
| Incident world | simulates SRE/control-room state and consequences |
| Worker fleet | proposes actions, sometimes helpful and sometimes unsafe |
| Oversight signals | trust, constitution, memory, counterfactual risk, tripwires |
| SENTINEL supervisor | decides approve/block/redirect/reassign/flag |
| Training stack | GRPO, Unsloth, LoRA, rollout audits, KL guardrails |
| Proof pack | reward curves, held-out eval, OOD eval, Top-1 vs Best-of-K, memory ablation |
The layers work from bottom to top:
1. The incident world creates consequences.
2. The worker fleet creates proposals.
3. Oversight signals explain whether the proposal is risky.
4. SENTINEL makes the pre-execution decision.
5. Training turns the result into policy improvement.
6. The proof pack turns the run into evidence.
---
## 10. What Makes The Design Different
SENTINEL is built around pre-execution control.
Most agent benchmarks ask whether the agent can solve the task.
SENTINEL asks whether one AI supervisor can stop another AI worker from taking a
bad action before that action changes the world.
That is the difference:
- not just task solving
- not just after-the-fact grading
- not just an LLM judge
- not just a rule-based approval layer
- a trainable AI supervisor with verifiable reward
---
## 11. One-Page System Story
Round 1 built a realistic incident-response OpenEnv world.
Then the question changed: what happens when companies deploy many AI workers?
SENTINEL became the missing supervisor layer.
Workers propose actions. SENTINEL checks world state, worker trust,
constitutional safety, past mistakes, counterfactual damage, and active incident
context.
It decides: `APPROVE`, `BLOCK`, `REDIRECT`, `REASSIGN`, or `FLAG`.
If unsafe, it coaches the worker and allows one revision. The environment
executes only safe or corrected actions. The verifier scores the whole oversight
trajectory. GRPO updates the policy. Memory and curriculum shape the next
episode. The proof pack shows reward, safety, stability, coverage, and failure
modes.
That is the architecture: a verifiable training environment for AI supervisors
over AI workers.
|