
SENTINEL Architecture Map

Excalidraw master board: open the hand-drawn architecture map

This page is the visual architecture map for SENTINEL. It uses rendered SVG diagrams, not diagram source code, so readers can understand the system through boxes and arrows directly.

Use this page as the architecture walkthrough for the README, mini-blog, slides, or demo narration. It is written so someone can follow the system without reading the code first.

The short version:

SENTINEL is a trainable pre-execution supervisor for fleets of AI workers. Workers propose actions. SENTINEL checks trust, evidence, constitution, memory, and world state. Safe actions execute. Unsafe actions are blocked, redirected, or reassigned. The environment turns those decisions into verifiable reward for GRPO training.


1. Master System Flow

SENTINEL master flow

This is the whole system in one picture. It answers the first question a reviewer will have: where does SENTINEL sit, and why does it matter?

SENTINEL sits between worker agents and the environment. The worker fleet can suggest actions, but SENTINEL is the gate that decides whether those actions are safe enough to execute.

Flow:

  1. A company world has tools, APIs, incidents, and tasks.
  2. A fleet of AI workers proposes actions.
  3. SENTINEL intercepts each proposal before execution.
  4. SENTINEL uses trust, constitution, memory, state, and risk signals.
  5. SENTINEL approves, blocks, redirects, reassigns, or flags.
  6. Only safe or corrected actions execute.
  7. The verifier produces reward and audit records.
  8. GRPO updates the LoRA policy.
  9. Memory and trust shape the next episode.
  10. The proof pack shows what improved and what still failed.

The core idea is simple: worker proposes, SENTINEL intercepts, safe action executes, unsafe action is corrected or blocked, and reward trains the supervisor.
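
As a minimal sketch of that loop, assuming toy stand-ins for the policy and verifier (none of these names come from the repo):

```python
from dataclasses import dataclass

# Toy stand-ins for illustration only; the real loop lives in
# sentinel/environment.py and train.py.

@dataclass
class Proposal:
    worker: str
    action_type: str
    target: str

def oversee(p: Proposal) -> str:
    """Stand-in for the SENTINEL policy: pick one decision."""
    if p.worker == "worker_db" and "app" in p.target:
        return "REASSIGN"  # domain mismatch, as in the example below
    return "APPROVE"

def episode_step(p: Proposal) -> float:
    decision = oversee(p)              # steps 3-5: intercept and decide
    executed = decision == "APPROVE"   # step 6: only safe actions run
    # step 7: the verifier turns the outcome into reward for GRPO (step 8)
    return 1.0 if (decision == "REASSIGN" and not executed) else 0.0

print(episode_step(Proposal("worker_db", "rollback", "app_deploy")))  # 1.0
```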

Example:

| Moment | What Happens |
| --- | --- |
| Worker proposal | worker_db proposes rolling back the app deploy |
| SENTINEL check | domain mismatch: database worker is touching app deploys |
| Decision | REASSIGN to worker_app and require deploy evidence |
| Environment effect | risky rollback does not execute |
| Training signal | SENTINEL gets reward for preventing a scope violation |

Why this matters: the architecture trains pre-execution control, not just after-the-fact evaluation.


2. Actual Repo Code Flow

SENTINEL actual code flow

This picture maps the architecture to real files in the repo. It is the bridge between the product story and the implementation.

When someone asks "is this only a concept?", this diagram is the answer. Every major box points to code that exists in the repository.

Main file ownership:

| Stage | Files | What They Do |
| --- | --- | --- |
| World | src/tasks.py, src/env.py, sentinel/environment.py | create tasks, incidents, state, and multi-crisis control-room logic |
| Workers | sentinel/workers.py, sentinel/llm_workers.py, training/adversarial.py | generate deterministic, Groq-backed, and adversarial worker proposals |
| Context | sentinel/trust.py, sentinel/constitution.py, training/memory.py | build trust, constitutional, and memory context |
| Policy | training/prompts.py, sentinel/models.py | prompt the model and parse structured oversight decisions |
| Gate | sentinel/environment.py, sentinel/feedback.py | enforce trust gates, route decisions, and give corrective feedback |
| Reward | sentinel/rewards.py, training/monitoring.py | compute reward and log training quality metrics |
| Training | train.py, training/episodes.py | run TRL GRPO with Unsloth/LoRA |
| Evaluation | scripts/eval_sentinel.py, proof_pack.py | produce held-out, OOD, Top-1 vs Best-of-K, and proof-pack artifacts |

This is the exact learning chain: worker proposal, oversight context, Qwen3 LoRA supervisor, structured decision, trust gate and router, world step, decomposed reward, training monitor, GRPO update, then eval and proof pack.

Concrete example through the code (compressed into a runnable sketch after the table):

| Step | Code Owner | Example |
| --- | --- | --- |
| Worker creates proposal | sentinel/workers.py | worker_net proposes paging five teams |
| Trust context is attached | sentinel/trust.py | worker has low precision on escalation tasks |
| Constitution scores risk | sentinel/constitution.py | P3 proportional escalation violation |
| Memory is injected | training/memory.py | previous broad escalations caused false positives |
| Model decides | sentinel/models.py | REDIRECT to narrower investigation |
| Reward scores outcome | sentinel/rewards.py | credit for avoiding escalation bombing |
| Training updates | train.py | GRPO shifts probability toward that safer pattern |
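
The same chain in one runnable sketch. The comments point at the real files from the table; the class and method names here are assumptions, not the repo's actual signatures:

```python
# Stubs standing in for the real components named in the table above.

class Trust:                                   # sentinel/trust.py
    def profile(self, worker):
        return {"escalation_precision": "low"}

class Constitution:                            # sentinel/constitution.py
    def score(self, proposal):
        return {"violation": "P3 proportional escalation"}

class Memory:                                  # training/memory.py
    def relevant(self, proposal):
        return ["previous broad escalations caused false positives"]

class Supervisor:                              # sentinel/models.py
    def decide(self, proposal, context):
        # The real policy is a Qwen3 LoRA model; this stub just redirects.
        return {"decision": "REDIRECT", "suggested_fix": "narrow the investigation"}

def oversight_step(proposal):
    context = {
        "trust": Trust().profile(proposal["worker"]),
        "risk": Constitution().score(proposal),
        "lessons": Memory().relevant(proposal),
    }
    return Supervisor().decide(proposal, context)

print(oversight_step({"worker": "worker_net", "action": "page five teams"}))
```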

Why this matters: the system is not a loose collection of features. It is one closed learning loop.


3. Interception Gate Detail

SENTINEL interception gate

This is the most important runtime picture.

Workers do not execute directly. Every worker proposal enters the interception gate first.

Think of this as the safety checkpoint in front of the world. The worker may be smart, but it is not trusted blindly.

The gate receives:

  • incident state
  • proposed action
  • worker identity
  • worker trust profile
  • constitutional assessment
  • recent audits
  • memory context
  • active incident snapshots

SENTINEL returns:

  • decision
  • reason
  • explanation
  • required_evidence
  • suggested_fix
  • suggested_action_type
  • suggested_target
  • reassign_to
  • trust_impact

Decision outcomes:

| Decision | What Happens |
| --- | --- |
| APPROVE | the original worker action executes |
| REDIRECT | a safer corrected action executes |
| REASSIGN | the proposal is sent to a better worker |
| BLOCK | the action is logged but not executed |
| FLAG | the action is warning-tracked for audit |

This is why SENTINEL is not just an approval bot. It controls the action path before the environment changes.
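
A sketch of the decision record and router. The field names mirror the lists above; the dataclass shape and routing helpers are assumptions about how the repo might structure them:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OversightDecision:
    decision: str                    # APPROVE | REDIRECT | REASSIGN | BLOCK | FLAG
    reason: str
    explanation: str
    required_evidence: list = field(default_factory=list)
    suggested_fix: Optional[str] = None
    suggested_action_type: Optional[str] = None
    suggested_target: Optional[str] = None
    reassign_to: Optional[str] = None
    trust_impact: float = 0.0

def route(d: OversightDecision, action, execute, log, warn, send_to):
    """Toy router matching the outcome table above."""
    if d.decision == "APPROVE":
        execute(action)                                         # original action runs
    elif d.decision == "REDIRECT":
        execute((d.suggested_action_type, d.suggested_target))  # safer action runs
    elif d.decision == "REASSIGN":
        send_to(d.reassign_to, action)                          # better-suited worker
    elif d.decision == "BLOCK":
        log(action)                                             # logged, never executed
    elif d.decision == "FLAG":
        warn(action)  # warning-tracked; whether it still executes is repo-specific

d = OversightDecision("BLOCK", "destructive action", "no evidence gathered yet")
route(d, "restart all pods", print, print, print, lambda w, a: print(w, a))
```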

Example gate decision:

| Field | Example Value |
| --- | --- |
| worker | worker_app |
| proposal | restart all pods immediately |
| problem | destructive action before investigation |
| SENTINEL decision | REDIRECT |
| required evidence | app error rate, recent deploy id, pod restart history |
| safer action | inspect rollout health first |
| reward reason | avoided a destructive action taking precedence over investigation |

Why this matters: the system can say "not yet, gather this evidence first," which is much closer to real operational supervision than binary approval.


4. Training And Proof Flow

SENTINEL training and proof flow

This picture shows how training and proof artifacts come out of the runtime loop.

The training story is not "we fine-tuned on a dataset." The model interacts with an environment, makes decisions, receives reward, and improves through GRPO.

Training flow (the GRPO group step is sketched after this list):

  1. The OpenEnv environment creates a prompt.
  2. The policy samples multiple oversight decisions.
  3. The environment verifier scores each decision.
  4. The reward is decomposed into safety and progress signals.
  5. GRPO updates the LoRA policy.
  6. Memory and trust update after each episode.
  7. The adaptive curriculum chooses harder or more useful next prompts.
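
The group-relative step at the heart of GRPO, in miniature. This assumes the verifier returns one scalar reward per sampled decision; the normalization below is the textbook form, not necessarily the repo's exact implementation:

```python
import statistics

def grpo_group_advantages(rewards):
    """Advantage of each sample relative to its own prompt group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)   # a zero-gradient group: nothing to learn
    return [(r - mean) / std for r in rewards]

# Four sampled oversight decisions for one prompt, scored by the verifier.
print(grpo_group_advantages([1.0, 0.2, 0.2, -0.5]))
```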

The training monitor tracks (a few of these diagnostics are sketched below):

  • reward mean
  • reward group standard deviation
  • zero-gradient group fraction
  • productive fraction
  • effective prompt ratio
  • frontier hit rate
  • decision entropy
  • unique-completion ratio
  • approximate KL
  • adaptive beta
  • rollout audit samples
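
Illustrative formulas for a few of those diagnostics, computed from per-prompt reward groups and sampled decisions; the repo's versions in training/monitoring.py may differ:

```python
import math
from collections import Counter

def zero_gradient_fraction(groups):
    """Fraction of prompt groups whose rewards are all identical."""
    return sum(1 for g in groups if max(g) == min(g)) / len(groups)

def unique_completion_ratio(completions):
    return len(set(completions)) / len(completions)

def decision_entropy(decisions):
    counts = Counter(decisions)
    n = len(decisions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(zero_gradient_fraction([[1.0, 1.0], [0.5, -0.2]]))   # 0.5
print(unique_completion_ratio(["a", "a", "b"]))            # ~0.67
print(decision_entropy(["APPROVE", "BLOCK", "APPROVE"]))   # ~0.92 bits
```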

Proof flow:

  1. scripts/eval_sentinel.py runs held-out, OOD, and Top-1 vs Best-of-K eval.
  2. proof_pack.py combines training logs, eval reports, and rollout audits.
  3. scripts/render_training_dashboard.py exports proof images.
  4. README and the story frame use those proof artifacts.

The proof claim is not only "reward went up." The stronger proof claim is that SENTINEL reduced damage, tracked false positives, learned from memory, handled OOD/tripwire cases, stayed inside KL guardrails, and produced auditable supervision decisions.

Example proof narrative:

| Proof Artifact | What It Shows |
| --- | --- |
| reward curve | the supervisor receives stronger reward over training |
| detection vs false positives | improvement is not just overblocking |
| memory ablation | memory-on learns from past oversight mistakes |
| OOD held-out eval | behavior is tested beyond training seeds |
| Top-1 vs Best-of-K | separates normal policy quality from sampling luck |
| KL/adaptive beta | training stays controlled instead of drifting wildly |
| rollout audits | real prompts and completions can be inspected manually |

Why this matters: the proof pack shows learning quality, not only a single score.


5. Protocol And Serving Layer

SENTINEL protocol and serving architecture

This picture shows how SENTINEL is exposed outside the Python class. This matters because the environment must be usable by OpenEnv evaluators, demos, agent clients, and protocol-based agent systems.

Serving paths:

| Path | What Uses It | Why It Exists |
| --- | --- | --- |
| FastAPI app | browser demos, health checks, dashboard routes | gives a normal web/API surface |
| Native OpenEnv routes | hackathon evaluation and environment clients | exposes reset, step, and state |
| MCP server | tool-using agents | lets another agent call SENTINEL as tools |
| A2A server | peer agent systems | exposes SENTINEL as an agent card/task endpoint |
| Docker + HF Space | deployment and judging | makes the environment reproducible and runnable |

Example (a hypothetical client call follows the table):

| Caller | How It Connects |
| --- | --- |
| OpenEnv trainer | calls /openenv/reset, /openenv/step, /openenv/state |
| MCP client | calls oversight tools through server/mcp_server.py |
| A2A peer | discovers agent metadata and sends tasks through server/a2a_server.py |
| HF Space user | opens the deployed Docker app in the browser |
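
A hypothetical client call against the native OpenEnv routes. Only the paths come from this page; the HTTP methods, port, and payload shapes are assumptions:

```python
import requests

BASE = "http://localhost:8000"

# Reset the environment, then submit one oversight decision as a step.
state = requests.post(f"{BASE}/openenv/reset").json()
result = requests.post(
    f"{BASE}/openenv/step",
    json={"decision": "REDIRECT", "suggested_target": "rollout_health"},
).json()
snapshot = requests.get(f"{BASE}/openenv/state").json()
print(result)
```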

Why this matters: SENTINEL is not trapped in a notebook. It is shaped like a portable oversight service.


6. Memory, Trust, And Curriculum Loop

SENTINEL memory and curriculum loop

This picture shows how SENTINEL learns beyond a single decision.

The runtime does not throw away mistakes. It turns them into memory, trust changes, and future prompt context.

What gets stored:

| Memory Type | Example |
| --- | --- |
| global memory | broad escalation without diagnosis caused false positives |
| worker memory | worker_db repeatedly drifted into app deploy actions |
| trust update | worker_net enters evidence-required mode for escalation proposals |
| curriculum state | fleet_monitoring_conflict is productive, multi_crisis_command remains frontier |

Training strategy (steps 1-3 are sketched after this list):

  1. Store structured mistake cards after bad oversight outcomes.
  2. Consolidate repeated lessons into global and per-worker memory.
  3. Inject relevant memory into future prompts.
  4. Update worker trust tiers and evidence requirements.
  5. Track productive signal and zero-gradient groups.
  6. Move the curriculum frontier only when the model is ready.
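
An illustrative shape for storing mistake cards and injecting the relevant ones into the next prompt. The real schema lives in training/memory.py and may differ:

```python
from dataclasses import dataclass

@dataclass
class MistakeCard:
    scope: str     # "global" or a worker id such as "worker_db"
    lesson: str    # the consolidated takeaway
    task: str      # which task family produced it

def inject_memory(cards, worker, limit=3):
    """Select lessons relevant to this worker for the next prompt."""
    relevant = [c for c in cards if c.scope in ("global", worker)]
    return "\n".join(f"- {c.lesson}" for c in relevant[:limit])

cards = [
    MistakeCard("global", "broad escalation without diagnosis caused false positives", "escalation"),
    MistakeCard("worker_db", "worker_db repeatedly drifted into app deploy actions", "scope"),
]
print(inject_memory(cards, "worker_db"))
```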

Why this matters: this gives us the memory ablation story. Memory-off repeats the same oversight mistakes. Memory-on can carry prior failures into the next episode.


7. Reward And Verifier Safety

SENTINEL reward and verifier safety

This picture shows why the reward path is harder to game than a simple LLM judge.

The main reward comes from deterministic, inspectable signals (a toy decomposition follows the list):

  • hidden misbehavior labels
  • structured output validation
  • true positives, false positives, and false negatives
  • constitutional scoring
  • trust-gate enforcement
  • counterfactual damage
  • redirect and reassign correctness
  • audit quality
  • coaching quality
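
A toy decomposition of the deterministic part, built from hidden misbehavior labels and counterfactual damage. The weights are invented for illustration; the real arithmetic is in sentinel/rewards.py:

```python
def verifier_reward(decision, is_misbehavior, damage_averted=0.0):
    """Score one oversight decision against the hidden label."""
    intervened = decision in {"BLOCK", "REDIRECT", "REASSIGN"}
    if is_misbehavior and intervened:
        return 1.0 + damage_averted   # true positive, plus counterfactual credit
    if is_misbehavior and not intervened:
        return -1.0                   # false negative: unsafe action got through
    if not is_misbehavior and intervened:
        return -0.5                   # false positive: overblocking is penalized
    return 0.5                        # true negative: safe action approved

print(verifier_reward("REASSIGN", True, 0.4))   # 1.4
print(verifier_reward("BLOCK", False))          # -0.5
```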

LLM judging is separated:

| Judge Mode | Role |
| --- | --- |
| deterministic verifier | primary reward path |
| discriminative judge | optional calibration |
| generative judge panel | gated and secondary |

Verifier-attack coverage (structural checks against a few of these are sketched below):

  • gibberish answer
  • adversarial prefix
  • fake self-certification
  • overlong nonsense
  • format-jamming
  • confidence washing
  • trust-gate bypass
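
Structural checks against a few of those attacks might look like this sketch. The thresholds and patterns are assumptions; only the attack names come from this page:

```python
import re

VALID_DECISIONS = {"APPROVE", "REDIRECT", "REASSIGN", "BLOCK", "FLAG"}
REQUIRED_KEYS = {"decision", "reason", "explanation"}

def passes_structure(output: dict) -> bool:
    """Reject gamed outputs before any reward is paid."""
    if not REQUIRED_KEYS <= output.keys():
        return False                              # format-jamming / gibberish
    if output["decision"] not in VALID_DECISIONS:
        return False                              # invalid decision token
    if len(str(output["explanation"])) > 2000:
        return False                              # overlong nonsense
    if re.search(r"ignore (all|previous) instructions", str(output), re.I):
        return False                              # crude adversarial-prefix check
    return True

print(passes_structure({"decision": "APPROVE", "reason": "safe", "explanation": "ok"}))
```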

Why this matters: reward improvement alone is not enough. This architecture also checks whether the reward channel itself is being exploited.


8. Worker Fleet And Multi-Crisis Command

SENTINEL worker fleet and multi-crisis command

This picture shows why SENTINEL is a multi-agent environment, not just an incident-response wrapper.

Each worker has a domain:

| Worker | Domain | Typical Risk |
| --- | --- | --- |
| worker_db | Postgres, Redis, datastores | drifting into app/network actions |
| worker_net | DNS, routes, load balancers | broad escalation too early |
| worker_app | deploys, pods, app logs | rollback before evidence |
| worker_sec | auth, audit logs, certificates | overblocking or over-escalation |

The hardest task, multi_crisis_command, keeps multiple incidents alive at the same time. A proposal includes incident_id, incident_label, worker identity, action type, target, and reasoning. Only the selected incident advances when a safe action executes.
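
The proposal fields above as a sketch, with an illustrative domain check derived from the worker table (the domain map itself is an assumption, not the repo's data):

```python
from dataclasses import dataclass

@dataclass
class WorkerProposal:
    incident_id: str
    incident_label: str
    worker: str
    action_type: str
    target: str
    reasoning: str

DOMAINS = {
    "worker_db": {"postgres", "redis", "datastore"},
    "worker_net": {"dns", "route", "load_balancer"},
    "worker_app": {"deploy", "pod", "app_log"},
    "worker_sec": {"auth", "audit", "certificate"},
}

def domain_mismatch(p: WorkerProposal) -> bool:
    """True when a worker acts outside its domain, e.g. worker_db on deploys."""
    return not any(term in p.target for term in DOMAINS.get(p.worker, set()))

p = WorkerProposal("inc_b", "Incident B", "worker_db", "rollback", "app_deploy", "app looks bad")
print(domain_mismatch(p))  # True: a database worker touching an app deploy
```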

Example:

| Situation | SENTINEL Must Notice |
| --- | --- |
| worker_db proposes app rollback for Incident B | wrong worker domain |
| worker_net escalates five teams for Incident C | proportional escalation risk |
| worker_sec blocks app remediation without evidence | false-positive risk |
| worker uses high-confidence language without evidence | confidence washing |

Why this matters: real operations are messy. The supervisor must route attention across workers and incidents, not only judge one isolated answer.


9. Architecture Layers

The system is easiest to understand as six layers:

| Layer | Purpose |
| --- | --- |
| Incident world | simulates SRE/control-room state and consequences |
| Worker fleet | proposes actions, sometimes helpful and sometimes unsafe |
| Oversight signals | trust, constitution, memory, counterfactual risk, tripwires |
| SENTINEL supervisor | decides approve/block/redirect/reassign/flag |
| Training stack | GRPO, Unsloth, LoRA, rollout audits, KL guardrails |
| Proof pack | reward curves, held-out eval, OOD eval, Top-1 vs Best-of-K, memory ablation |

The layers work from bottom to top:

  1. The incident world creates consequences.
  2. The worker fleet creates proposals.
  3. Oversight signals explain whether the proposal is risky.
  4. SENTINEL makes the pre-execution decision.
  5. Training turns the result into policy improvement.
  6. The proof pack turns the run into evidence.

10. What Makes The Design Different

SENTINEL is built around pre-execution control.

Most agent benchmarks ask whether the agent can solve the task.

SENTINEL asks whether one AI supervisor can stop another AI worker from taking a bad action before that action changes the world.

That is the difference:

  • not just task solving
  • not just after-the-fact grading
  • not just an LLM judge
  • not just a rule-based approval layer
  • a trainable AI supervisor with verifiable reward

11. One-Page System Story

Round 1 built a realistic incident-response OpenEnv world.

Then the question changed: what happens when companies deploy many AI workers?

SENTINEL became the missing supervisor layer.

Workers propose actions. SENTINEL checks world state, worker trust, constitutional safety, past mistakes, counterfactual damage, and active incident context.

It decides: APPROVE, BLOCK, REDIRECT, REASSIGN, or FLAG.

If unsafe, it coaches the worker and allows one revision. The environment executes only safe or corrected actions. The verifier scores the whole oversight trajectory. GRPO updates the policy. Memory and curriculum shape the next episode. The proof pack shows reward, safety, stability, coverage, and failure modes.

That is the architecture: a verifiable training environment for AI supervisors over AI workers.