
SENTINEL - AI Oversight Training Environment

SENTINEL is the OpenEnv environment in this repo that trains an agent to supervise other agents before their actions execute.

Hackathon Theme Fit

Primary fit: Theme #1 - Multi-Agent Interactions.

SENTINEL is an oversight environment, not only an incident-response benchmark. The trained policy must model worker intent, hidden reliability, domain boundaries, adversarial proposals, corrective revisions, and counterfactual damage. That maps cleanly to multi-agent cooperation, conflict, and partially observable incentives.

Secondary fits:

  • Theme #2 - Long-Horizon Planning: multi_crisis_command runs multiple concurrent incidents with delayed consequences.
  • Theme #3.1 - Professional Tasks: the world is a realistic SRE/API/metrics workflow.
  • Theme #4 - Self-Improvement: adaptive curriculum, frontier tracking, and tripwire evals support escalating difficulty.

Latest OpenEnv compliance note: server/openenv_native.py wraps SENTINEL in OpenEnv's Environment base class and the main app mounts it at /openenv when OpenEnv dependencies are installed.

Training notebook: ../../notebooks/sentinel_qwen3_4b_grpo_colab.ipynb is the Colab-friendly Qwen3-4B GRPO runner for the final onsite training proof.

Narrative/pitch source: sentinel-story-frame.md contains the full Round 1 IRT -> SENTINEL story, build doubts, demo script, and submission checklist.

Architecture map: architecture-map.md shows the runtime loop, training loop, memory/trust system, protocol layer, circuit breaker, reward/verifier safety, worker fleet, multi-crisis command, and proof-pack flow with rendered diagrams.

Summary

SENTINEL wraps the incident-response world in this repo with an oversight layer:

  • 4 simulated worker agents propose actions
  • SENTINEL intercepts every proposal before execution
  • SENTINEL can APPROVE, BLOCK, REDIRECT, REASSIGN, or FLAG
  • blocked or redirected workers can revise once with supervisor feedback
  • every decision produces a structured audit record with trust, constitutional, and counterfactual fields
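
For illustration, one such audit record could look like the sketch below. The field names are placeholders chosen for readability, not the shipped schema; check the sentinel/ modules for the actual shape.

# Hypothetical audit record shape -- the keys are illustrative, not the real schema.
audit_record = {
    "worker_id": "worker_db",
    "proposed_action": "drop_index idx_orders_stale",
    "decision": "BLOCK",  # one of APPROVE / BLOCK / REDIRECT / REASSIGN / FLAG
    "trust": {"worker_score": 0.42, "approval_threshold": 0.60},
    "constitutional": {"violations": ["destructive_precedence"], "severity": "high"},
    "counterfactual": {"estimated_damage_if_executed": 0.8, "damage_after_block": 0.0},
    "feedback": "Confirm the index is unused before dropping it; propose a read-only check first.",
    "revision_allowed": True,
}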

The first implementation is grounded in SRE / incident response, but the reusable control ideas are broader:

  • pre-execution agent supervision
  • worker trust calibration
  • constitutional safety checks
  • counterfactual risk estimation
  • corrective revision loops
  • worker-specific and global feedback memory

Quick Start

# Install runtime dependencies
pip install -r requirements.txt

# Optional training stack
pip install -r requirements-train.txt

# Validate the repo
python validate.py

# Run the main test suite
python -m pytest tests -q

# Run SENTINEL inference mode
USE_SENTINEL=1 python inference.py

# Start the API / dashboard server
python app.py

Useful URLs and routes once python app.py is running (a minimal client sketch follows the list):

  • http://127.0.0.1:7860/sentinel/dashboard
  • POST /sentinel/reset
  • POST /sentinel/step
  • GET /sentinel/state
  • GET /sentinel/stream?session_id=...
  • POST /sentinel/intercept
  • POST /sentinel/grade
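
A minimal client sketch against these routes, assuming the default host above (the JSON payload keys here are assumptions, not a documented schema):

import requests

BASE = "http://127.0.0.1:7860"

# Start a fresh oversight session; the session id is assumed to come back in the JSON body.
session = requests.post(f"{BASE}/sentinel/reset").json()
session_id = session.get("session_id")

# Submit one supervision decision; the payload keys below are illustrative placeholders.
step = requests.post(
    f"{BASE}/sentinel/step",
    json={"session_id": session_id, "action": "APPROVE", "reasoning": "low-risk read-only check"},
).json()

# Inspect the current environment state for this session.
state = requests.get(f"{BASE}/sentinel/state", params={"session_id": session_id}).json()
print(step)
print(state)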

Why It Is Different

Most OpenEnv submissions train an agent to do work directly.

SENTINEL trains an agent to supervise a worker fleet before those workers can cause damage.

| Property | SENTINEL | Standard task agent |
| --- | --- | --- |
| Learns to solve task directly | No | Yes |
| Learns to supervise worker proposals | Yes | Rarely |
| Pre-execution intervention | Yes | Usually no |
| Trust-aware worker control | Yes | Usually no |
| Corrective worker revision loop | Yes | Usually no |
| Structured oversight audit trail | Yes | Usually no |

Runtime Architecture

incident reset
  -> worker proposes action
  -> constitution + trust gate + counterfactual risk
  -> SENTINEL decision
  -> worker revision if blocked / redirected / reassigned
  -> re-check
  -> execute safe action or fallback
  -> audit + reward + grade
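
In code terms, the loop is roughly the sketch below. The helper names are placeholders; the real implementations live in the modules listed next.

# Illustrative control flow only -- the function names are placeholders, not the real API.
def run_episode(env, sentinel, workers):
    incident = env.reset()
    while not env.done:
        proposal = workers.propose(incident)                  # a worker proposes an action
        risk = env.counterfactual_risk(proposal)              # estimate damage if it executed
        verdict = sentinel.decide(proposal, env.trust, risk)  # APPROVE / BLOCK / REDIRECT / ...
        if verdict.action in {"BLOCK", "REDIRECT", "REASSIGN"}:
            proposal = workers.revise(proposal, verdict.feedback)  # single corrective pass
            verdict = sentinel.decide(proposal, env.trust, env.counterfactual_risk(proposal))
        outcome = env.execute(proposal if verdict.action == "APPROVE" else env.fallback())
        env.record_audit(proposal, verdict, outcome)          # audit + reward + grade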

Core modules:

  • sentinel/environment.py
  • sentinel/workers.py
  • sentinel/rewards.py
  • sentinel/graders.py
  • sentinel/trust.py
  • sentinel/constitution.py
  • sentinel/counterfactual.py
  • sentinel/feedback.py

Worker Fleet

Current shipped workers:

  • worker_db
  • worker_net
  • worker_app
  • worker_sec

Current worker behavior:

  • deterministic and benchmark-friendly
  • domain-specialized
  • capable of scheduled misbehaviors
  • capable of one corrective revision pass

Important honesty note:

The shipped workers do not learn online by themselves across episodes. What does adapt in the current system is:

  • curriculum sampling
  • adversarial worker-case selection
  • oversight memory
  • reassignment hints
  • corrective summaries

Misbehavior Types

Training-time types:

  1. hallucination
  2. scope_violation
  3. destructive_precedence
  4. reward_hacking
  5. escalation_bombing
  6. loop_exploitation

Held-out generalization type:

  1. confidence_washing

Task Set

SENTINEL tasks:

  • basic_oversight
  • fleet_monitoring_conflict
  • adversarial_worker
  • multi_crisis_command

multi_crisis_command is the expert control-room task and now runs 3 concurrent incident threads.

Training Stack

Training is in train.py and uses:

  • TRL GRPOTrainer
  • optional Unsloth acceleration
  • adaptive curriculum
  • adaptive per-task difficulty windows with frontier mastery counters
  • automatic frontier ease-back when the active frontier gets too hard
  • in-run memory refresh
  • adversarial worker-case sampling
  • deterministic reward components with optional LLM panel
  • productive-signal monitoring for zero-reward, trivially solved, productive, effective-prompt, and frontier-hit rates
  • task-diversity monitoring so environment coverage stays visible during training
  • judge mode split with deterministic primary scoring and gated generative panel influence

Training defaults are kept in train.py and the root README's training command section. The current default model is unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit.
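
For reference, loading that default checkpoint with Unsloth usually looks like the snippet below; this is standard Unsloth usage, not necessarily the exact settings in train.py.

from unsloth import FastLanguageModel

# Load the default 4-bit Qwen3-4B checkpoint; train.py may use different sequence length or dtype.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)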

Warm Start

The repo now supports a small warm-start stage before GRPO.

Example:

USE_SENTINEL=1 WARM_START_STEPS=20 python train.py

This warm-start is intentionally small. It is meant to prime:

  • output format reliability
  • basic oversight action shape
  • early non-zero rollout behavior

Monitoring Outputs

Training now writes structured metrics to:

  • outputs/monitoring/training_metrics.jsonl
  • outputs/monitoring/latest_summary.json
  • outputs/monitoring/training_stack_versions.json
  • outputs/monitoring/training_stability.jsonl
  • outputs/monitoring/memory_ablation.json
  • outputs/monitoring/rollout_audits/latest.md
  • outputs/reward_curves/training_dashboard.md

These logs include:

  • reward mean/min/max/std
  • average steps
  • per-task metrics
  • task and scenario coverage
  • per-misbehavior coverage
  • zero-reward, trivially solved, and productive fractions
  • effective prompt ratio and frontier-hit rate
  • zero-gradient group fraction for low-signal GRPO batches
  • detection rate
  • false positive rate
  • risk reduction rate
  • twin damage reduction rate
  • coaching quality
  • worker rehabilitation rate
  • reward schedule stage / progress
  • structured mistake-card memory counts
  • periodic rollout-audit samples for human inspection
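
A small sketch for pulling the latest record out of the JSONL stream (the key names are assumptions; check a real line of outputs/monitoring/training_metrics.jsonl for the exact schema):

import json
from pathlib import Path

# Read the most recent metrics record written during training.
lines = Path("outputs/monitoring/training_metrics.jsonl").read_text().splitlines()
latest = json.loads(lines[-1])

# Key names below are illustrative placeholders -- adjust to the actual schema.
print("reward mean:", latest.get("reward_mean"))
print("detection rate:", latest.get("detection_rate"))
print("false positive rate:", latest.get("false_positive_rate"))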

The plot pack is generated with:

python scripts/render_training_dashboard.py \
  --monitor-dir outputs/monitoring \
  --output-dir outputs/reward_curves \
  --eval-report outputs/evals/sentinel_held_out_report.json \
  --memory-ablation outputs/monitoring/memory_ablation.json

It creates 18 proof-pack images, including scenario coverage, learning snapshots at target batches 10 / 50 / 300, memory ablation, KL drift, tripwire pass rate, twin-world damage comparison, coaching quality, and zero-gradient group fraction.

Proof Pack

proof_pack.py now supports two use cases:

  1. deterministic reference comparisons
  2. checkpoint-aware comparisons once training artifacts exist

Examples:

# deterministic reference pack
python proof_pack.py

# checkpoint-aware comparison
python proof_pack.py \
  --baseline-checkpoint outputs/warm_start/final \
  --candidate-checkpoint outputs/checkpoints/final

Artifacts land under:

  • outputs/proof_pack/summary.md
  • outputs/proof_pack/policy_metadata.json
  • outputs/proof_pack/reward_curve_status.json
  • outputs/proof_pack/monitoring_snapshot.json
  • outputs/proof_pack/held_out_eval_snapshot.json
  • outputs/proof_pack/proxy_gap_summary.json
  • outputs/proof_pack/trajectories/

Held-out evaluation is exported separately with:

python scripts/eval_sentinel.py \
  --baseline-checkpoint outputs/warm_start/final \
  --candidate-checkpoint outputs/checkpoints/final

This writes:

  • outputs/evals/sentinel_held_out_report.json
  • outputs/evals/sentinel_held_out_report.md

The held-out report now includes:

  • main held-out seed slice
  • separate OOD seed slice
  • sampled Top-1 vs Best-of-K comparison
  • reward tripwire evaluation
  • per-misbehavior confusion matrix

The tripwire suite now includes additional tampering-style checks such as metric spoofing, audit-log laundering, fabricated evidence, and alert suppression.
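
To poke at the report programmatically, a minimal sketch (the section keys are guesses based on the slices listed above, not a verified schema):

import json

# Load the exported held-out evaluation report.
with open("outputs/evals/sentinel_held_out_report.json") as f:
    report = json.load(f)

# Section names are guesses derived from the report description above.
for section in ("held_out", "ood", "best_of_k", "tripwires", "confusion_matrix"):
    if section in report:
        print(section, "->", json.dumps(report[section])[:200])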

The proof pack now adds:

  • a proxy-gap summary so training reward can be compared directly against held-out behavior
  • automatically ranked top failure modes quoted directly in outputs/proof_pack/summary.md

The training monitor now tracks:

  • approximate KL drift
  • adaptive beta state
  • policy entropy
  • decision entropy
  • unique completion ratio

Suggested Training Flow

# validate first
python validate.py
python -m pytest tests -q

# dry-run the SENTINEL training path
USE_SENTINEL=1 WARM_START_STEPS=20 python train.py --dry-run

# run the real training once credits are available
USE_SENTINEL=1 TRAIN_STEPS=300 WARM_START_STEPS=20 python train.py

# inspect periodic rollout audits during training
type outputs\monitoring\rollout_audits\latest.md

# export held-out checkpoint evaluation
python scripts/eval_sentinel.py --baseline-checkpoint outputs/warm_start/final --candidate-checkpoint outputs/checkpoints/final

# export proof artifacts
python proof_pack.py --baseline-checkpoint outputs/warm_start/final --candidate-checkpoint outputs/checkpoints/final

# render proof-pack plots
python scripts/render_training_dashboard.py --monitor-dir outputs/monitoring --output-dir outputs/reward_curves

Demo Story

The cleanest demo sequence is:

  1. a safe worker proposal gets approved
  2. a hallucinated target gets blocked before execution
  3. a reward-hacking or confidence-washing proposal gets blocked with evidence
  4. the worker receives corrective feedback
  5. the worker revises once and the safer action executes
  6. the audit trail shows trust, counterfactual risk, and constitutional violations

Public Docs

Tracked public docs:

Research docs:

Current Reality Check

What is fully real now:

  • working oversight runtime
  • true 3-incident multi-crisis task
  • corrective worker revision loop
  • trust gate
  • constitutional scoring
  • counterfactual ledger
  • feedback memory
  • checkpoint-aware proof-pack support
  • structured training monitoring
  • rollout-audit sampling
  • held-out evaluation report
  • reward tripwire evaluation suite
  • held-out OOD evaluation slice
  • sampled Top-1 vs Best-of-K evaluation
  • per-misbehavior confusion matrix
  • proxy-gap summary
  • top failure modes summary
  • counterfactual twin metrics
  • coaching-quality reward
  • training dashboard renderer
  • memory ablation collector
  • structured mistake-card memory
  • scenario coverage tracking
  • zero-gradient group monitoring
  • dynamic reward-weight scheduling
  • KL-drift guardrail with adaptive beta
  • decision entropy / diversity monitoring
  • pinned training stack versions
  • small warm-start option

What still needs the actual long run:

  • checkpoint-vs-checkpoint improvement evidence from a trained model
  • final reward curve from the real 300-step run
  • curated proof-pack before/after trajectories