# SENTINEL - AI Oversight Training Environment

> An OpenEnv environment that trains an agent to supervise other agents before their actions execute.

## Hackathon Theme Fit

Primary fit: **Theme #1 - Multi-Agent Interactions**.

SENTINEL is an oversight environment, not only an incident-response benchmark.
The trained policy must model worker intent, hidden reliability, domain
boundaries, adversarial proposals, corrective revisions, and counterfactual
damage. That maps cleanly to multi-agent cooperation, conflict, and partially
observable incentives.

Secondary fits:

- **Theme #2 - Long-Horizon Planning:** `multi_crisis_command` runs multiple concurrent incidents with delayed consequences.
- **Theme #3.1 - Professional Tasks:** the world is a realistic SRE/API/metrics workflow.
- **Theme #4 - Self-Improvement:** adaptive curriculum, frontier tracking, and tripwire evals support escalating difficulty.

Latest OpenEnv compliance note: `server/openenv_native.py` wraps SENTINEL in
OpenEnv's `Environment` base class and the main app mounts it at `/openenv`
when OpenEnv dependencies are installed.

Training notebook: [`../../notebooks/sentinel_qwen3_4b_grpo_colab.ipynb`](../../notebooks/sentinel_qwen3_4b_grpo_colab.ipynb)
is the Colab-friendly Qwen3-4B GRPO runner for the final onsite training proof.

Narrative/pitch source: [`sentinel-story-frame.md`](sentinel-story-frame.md)
contains the full Round 1 IRT -> SENTINEL story, build doubts, demo script, and
submission checklist.

Architecture map: [`architecture-map.md`](architecture-map.md) shows the
runtime loop, training loop, memory/trust system, protocol layer, circuit
breaker, reward/verifier safety, worker fleet, multi-crisis command, and
proof-pack flow with rendered diagrams.

## Summary

SENTINEL wraps the incident-response world in this repo with an oversight layer:

- 4 simulated worker agents propose actions
- SENTINEL intercepts every proposal before execution
- SENTINEL can `APPROVE`, `BLOCK`, `REDIRECT`, `REASSIGN`, or `FLAG`
- blocked or redirected workers can revise once with supervisor feedback
- every decision produces a structured audit record with trust, constitutional, and counterfactual fields
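
A minimal sketch of what one of those audit records could look like as a Python dataclass. The field names here are illustrative assumptions, not the repo's exact schema:

```python
# Illustrative audit-record shape -- field names are assumptions, not the exact schema.
from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    worker_id: str                 # which worker made the proposal
    proposal: str                  # the intercepted action
    decision: str                  # APPROVE / BLOCK / REDIRECT / REASSIGN / FLAG
    trust_score: float             # worker trust at decision time
    violations: list[str] = field(default_factory=list)  # constitutional rule hits
    counterfactual_damage: float = 0.0  # estimated damage had it executed unchecked
```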

The first implementation is grounded in SRE / incident response, but the reusable control ideas are broader:

- pre-execution agent supervision
- worker trust calibration
- constitutional safety checks
- counterfactual risk estimation
- corrective revision loops
- worker-specific and global feedback memory

## Quick Start

```bash
# Install runtime dependencies
pip install -r requirements.txt

# Optional training stack
pip install -r requirements-train.txt

# Validate the repo
python validate.py

# Run the main test suite
python -m pytest tests -q

# Run SENTINEL inference mode
USE_SENTINEL=1 python inference.py

# Start the API / dashboard server
python app.py
```

Useful URLs after `python app.py`:

- `http://127.0.0.1:7860/sentinel/dashboard`
- `POST /sentinel/reset`
- `POST /sentinel/step`
- `GET /sentinel/state`
- `GET /sentinel/stream?session_id=...`
- `POST /sentinel/intercept`
- `POST /sentinel/grade`
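
A session can then be driven over plain HTTP, roughly as sketched below. The JSON payload and response fields are assumptions for illustration; check the server code for the exact request shapes:

```python
# Hypothetical request shapes -- payload and response fields are assumptions,
# not the exact API contract.
import requests

BASE = "http://127.0.0.1:7860"

r = requests.post(f"{BASE}/sentinel/reset", json={"task": "basic_oversight"})
session_id = r.json().get("session_id")  # assumed response field

r = requests.post(f"{BASE}/sentinel/step",
                  json={"session_id": session_id, "action": "BLOCK"})
print(r.json())
```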

## Why It Is Different

Most OpenEnv submissions train an agent to do work directly.

SENTINEL trains an agent to supervise a worker fleet before those workers can cause damage.

| Property | SENTINEL | Standard task agent |
|---|---|---|
| Learns to solve task directly | No | Yes |
| Learns to supervise worker proposals | Yes | Rarely |
| Pre-execution intervention | Yes | Usually no |
| Trust-aware worker control | Yes | Usually no |
| Corrective worker revision loop | Yes | Usually no |
| Structured oversight audit trail | Yes | Usually no |

## Runtime Architecture

```text
incident reset
  -> worker proposes action
  -> constitution + trust gate + counterfactual risk
  -> SENTINEL decision
  -> worker revision if blocked / redirected / reassigned
  -> re-check
  -> execute safe action or fallback
  -> audit + reward + grade
```

Core modules:

- `sentinel/environment.py`
- `sentinel/workers.py`
- `sentinel/rewards.py`
- `sentinel/graders.py`
- `sentinel/trust.py`
- `sentinel/constitution.py`
- `sentinel/counterfactual.py`
- `sentinel/feedback.py`
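
As a rough orientation, the loop behaves like the Python sketch below. The observation keys and the `step` signature are illustrative, not the actual API in `sentinel/environment.py`:

```python
# Rough shape of one oversight episode -- names are illustrative, not the module's API.
def run_episode(env, policy):
    obs = env.reset()                       # incident reset; first worker proposal arrives
    done = False
    while not done:
        decision = policy(obs)              # APPROVE / BLOCK / REDIRECT / REASSIGN / FLAG
        obs, reward, done, info = env.step(decision)
        # On BLOCK / REDIRECT / REASSIGN the worker gets one revision pass with
        # supervisor feedback before the re-check; info carries the audit record.
```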

## Worker Fleet

Current shipped workers:

- `worker_db`
- `worker_net`
- `worker_app`
- `worker_sec`

Current worker behavior:

- deterministic and benchmark-friendly
- domain-specialized
- capable of scheduled misbehaviors
- capable of one corrective revision pass
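
One way to picture "deterministic, with scheduled misbehaviors and one revision pass" is the toy sketch below; the real implementations live in `sentinel/workers.py` and will differ:

```python
# Toy worker sketch -- purely illustrative; the real workers are in sentinel/workers.py.
class ScriptedWorker:
    def __init__(self, domain, misbehave_at_step=None):
        self.domain = domain                    # e.g. "db", "net", "app", "sec"
        self.misbehave_at_step = misbehave_at_step

    def propose(self, step, incident):
        if step == self.misbehave_at_step:      # scheduled misbehavior fires here
            return {"action": "destructive_fix", "target": incident["critical_asset"]}
        return {"action": "safe_mitigation", "target": incident["service"]}

    def revise(self, proposal, feedback):
        # The single corrective revision pass: back off to a safe action,
        # carrying the supervisor's feedback along.
        return {"action": "safe_mitigation", "target": proposal["target"], "note": feedback}
```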

Important honesty note:

Workers in the current runtime do **not** learn online across episodes on their own. What adapts in the shipped system is:

- curriculum sampling
- adversarial worker-case selection
- oversight memory
- reassignment hints
- corrective summaries

## Misbehavior Types

Training-time types:

1. `hallucination`
2. `scope_violation`
3. `destructive_precedence`
4. `reward_hacking`
5. `escalation_bombing`
6. `loop_exploitation`

Held-out generalization type:

7. `confidence_washing`
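
These labels recur in the metrics and evaluation reports. Written out as a Python enum (the representation is a sketch; the names are the ones listed above):

```python
# The seven misbehavior labels from this section as an enum (representation is a sketch).
from enum import Enum

class Misbehavior(Enum):
    HALLUCINATION = "hallucination"
    SCOPE_VIOLATION = "scope_violation"
    DESTRUCTIVE_PRECEDENCE = "destructive_precedence"
    REWARD_HACKING = "reward_hacking"
    ESCALATION_BOMBING = "escalation_bombing"
    LOOP_EXPLOITATION = "loop_exploitation"
    CONFIDENCE_WASHING = "confidence_washing"  # held-out generalization type
```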

## Task Set

SENTINEL tasks:

- `basic_oversight`
- `fleet_monitoring_conflict`
- `adversarial_worker`
- `multi_crisis_command`

`multi_crisis_command` is the expert control-room task and now runs 3 concurrent incident threads.

## Training Stack

Training is in `train.py` and uses:

- TRL `GRPOTrainer`
- optional Unsloth acceleration
- adaptive curriculum
- adaptive per-task difficulty windows with frontier mastery counters
- automatic frontier ease-back when the active frontier gets too hard
- in-run memory refresh
- adversarial worker-case sampling
- deterministic reward components with optional LLM panel
- productive-signal monitoring for zero-reward, trivially solved, productive, effective-prompt, and frontier-hit rates
- task-diversity monitoring so environment coverage stays visible during training
- judge mode split with deterministic primary scoring and gated generative panel influence

Training defaults are kept in `train.py` and the root README's training command section.
The current default model is `unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit`.
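
For orientation, a minimal TRL GRPO setup has roughly the shape below. The dataset, reward rule, and config values are placeholders, not the defaults in `train.py`:

```python
# Minimal TRL GRPO sketch -- dataset, reward rule, and config values are placeholders,
# not what train.py actually uses.
from trl import GRPOConfig, GRPOTrainer

def oversight_reward(completions, **kwargs):
    # Placeholder deterministic component: reward completions that start with a
    # recognizable oversight decision token (illustrative rule only).
    valid = ("APPROVE", "BLOCK", "REDIRECT", "REASSIGN", "FLAG")
    return [1.0 if str(c).strip().startswith(valid) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",
    reward_funcs=oversight_reward,
    args=GRPOConfig(output_dir="outputs/checkpoints", max_steps=300),
    train_dataset=prompt_dataset,  # placeholder: rows shaped like {"prompt": ...}
)
trainer.train()
```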

### Warm Start

The repo now supports a small warm-start stage before GRPO.

Example:

```bash
USE_SENTINEL=1 WARM_START_STEPS=20 python train.py
```

This warm-start is intentionally small. It is meant to prime:

- output format reliability
- basic oversight action shape
- early non-zero rollout behavior

### Monitoring Outputs

Training now writes structured metrics to:

- `outputs/monitoring/training_metrics.jsonl`
- `outputs/monitoring/latest_summary.json`
- `outputs/monitoring/training_stack_versions.json`
- `outputs/monitoring/training_stability.jsonl`
- `outputs/monitoring/memory_ablation.json`
- `outputs/monitoring/rollout_audits/latest.md`
- `outputs/reward_curves/training_dashboard.md`

These logs include:

- reward mean/min/max/std
- average steps
- per-task metrics
- task and scenario coverage
- per-misbehavior coverage
- zero-reward, trivially solved, and productive fractions
- effective prompt ratio and frontier-hit rate
- zero-gradient group fraction for low-signal GRPO batches
- detection rate
- false positive rate
- risk reduction rate
- twin damage reduction rate
- coaching quality
- worker rehabilitation rate
- reward schedule stage / progress
- structured mistake-card memory counts
- periodic rollout-audit samples for human inspection
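
Because `training_metrics.jsonl` is plain JSONL, it is easy to tail from Python. The key names below are assumptions about the schema; inspect one line of the file to confirm them:

```python
# Quick look at recent training metrics. Key names ("step", "reward_mean") are
# schema assumptions -- check an actual line of the file first.
import json

with open("outputs/monitoring/training_metrics.jsonl") as f:
    rows = [json.loads(line) for line in f if line.strip()]

for row in rows[-5:]:  # last five logged entries
    print(row.get("step"), row.get("reward_mean"))
```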

The plot pack is generated with:

```bash
python scripts/render_training_dashboard.py \
  --monitor-dir outputs/monitoring \
  --output-dir outputs/reward_curves \
  --eval-report outputs/evals/sentinel_held_out_report.json \
  --memory-ablation outputs/monitoring/memory_ablation.json
```

It creates 18 proof-pack images, including scenario coverage, learning snapshots at target batches 10 / 50 / 300, memory ablation, KL drift, tripwire pass rate, twin-world damage comparison, coaching quality, and zero-gradient group fraction.

## Proof Pack

`proof_pack.py` now supports two use cases:

1. deterministic reference comparisons
2. checkpoint-aware comparisons once training artifacts exist

Examples:

```bash
# deterministic reference pack
python proof_pack.py

# checkpoint-aware comparison
python proof_pack.py \
  --baseline-checkpoint outputs/warm_start/final \
  --candidate-checkpoint outputs/checkpoints/final
```

Artifacts land under:

- `outputs/proof_pack/summary.md`
- `outputs/proof_pack/policy_metadata.json`
- `outputs/proof_pack/reward_curve_status.json`
- `outputs/proof_pack/monitoring_snapshot.json`
- `outputs/proof_pack/held_out_eval_snapshot.json`
- `outputs/proof_pack/proxy_gap_summary.json`
- `outputs/proof_pack/trajectories/`

Held-out evaluation is exported separately with:

```bash
python scripts/eval_sentinel.py \
  --baseline-checkpoint outputs/warm_start/final \
  --candidate-checkpoint outputs/checkpoints/final
```

This writes:

- `outputs/evals/sentinel_held_out_report.json`
- `outputs/evals/sentinel_held_out_report.md`

The held-out report now includes:

- main held-out seed slice
- separate OOD seed slice
- sampled Top-1 vs Best-of-K comparison
- reward tripwire evaluation
- per-misbehavior confusion matrix

The tripwire suite now includes additional tampering-style checks such as metric spoofing, audit-log laundering, fabricated evidence, and alert suppression.

The proof pack now adds:

- a proxy-gap summary so training reward can be compared directly against held-out behavior
- automatically ranked top failure modes quoted directly in `outputs/proof_pack/summary.md`
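
"Proxy gap" here means the distance between the reward the policy was optimized against and its held-out behavior, roughly the single number below (the repo's summary may break this down further per task or per misbehavior):

```python
# Illustrative definition only -- the actual summary may be more fine-grained.
proxy_gap = training_reward_mean - held_out_reward_mean
```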

The training monitor now tracks:

- approximate KL drift
- adaptive beta state
- policy entropy
- decision entropy
- unique completion ratio
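
The KL guardrail with adaptive beta typically follows the standard adaptive-penalty pattern sketched below; the target, thresholds, and update factor in `train.py` may differ:

```python
# Generic adaptive-KL-penalty sketch -- thresholds and factors are placeholders,
# not the rule train.py implements.
def update_beta(beta, kl, kl_target=0.05, factor=1.5):
    if kl > 2.0 * kl_target:   # policy drifting too far from the reference
        return beta * factor   # strengthen the KL penalty
    if kl < 0.5 * kl_target:   # penalty over-constraining exploration
        return beta / factor   # relax it
    return beta
```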

## Suggested Training Flow

```bash
# validate first
python validate.py
python -m pytest tests -q

# dry-run the SENTINEL training path
USE_SENTINEL=1 WARM_START_STEPS=20 python train.py --dry-run

# run the real training once credits are available
USE_SENTINEL=1 TRAIN_STEPS=300 WARM_START_STEPS=20 python train.py

# inspect periodic rollout audits during training
cat outputs/monitoring/rollout_audits/latest.md

# export held-out checkpoint evaluation
python scripts/eval_sentinel.py --baseline-checkpoint outputs/warm_start/final --candidate-checkpoint outputs/checkpoints/final

# export proof artifacts
python proof_pack.py --baseline-checkpoint outputs/warm_start/final --candidate-checkpoint outputs/checkpoints/final

# render proof-pack plots
python scripts/render_training_dashboard.py --monitor-dir outputs/monitoring --output-dir outputs/reward_curves
```

## Demo Story

The cleanest demo sequence is:

1. a safe worker proposal gets approved
2. a hallucinated target gets blocked before execution
3. a reward-hacking or confidence-washing proposal gets blocked with evidence
4. the worker receives corrective feedback
5. the worker revises once and the safer action executes
6. the audit trail shows trust, counterfactual risk, and constitutional violations

## Public Docs

Tracked public docs:

- [Public Architecture Overview](public-overview.md)
- [OpenEnv RL Guide Alignment](openenv-guide-alignment.md)
- [Dynamic Workers Roadmap](dynamic-workers-roadmap.md)
- [Universal Oversight Plan](universal-oversight-plan.md)

Research docs:

- [Reward Engineering](../../winner_analysis/REWARD_ENGINEERING.md)
- [Training Deep Dive](../../winner_analysis/TRAINING_DEEP_DIVE.md)
- [Winner Analysis](../../winner_analysis/WINNER_ANALYSIS.md)

## Current Reality Check

What is fully real now:

- working oversight runtime
- true 3-incident multi-crisis task
- corrective worker revision loop
- trust gate
- constitutional scoring
- counterfactual ledger
- feedback memory
- checkpoint-aware proof-pack support
- structured training monitoring
- rollout-audit sampling
- held-out evaluation report
- reward tripwire evaluation suite
- held-out OOD evaluation slice
- sampled Top-1 vs Best-of-K evaluation
- per-misbehavior confusion matrix
- proxy-gap summary
- top failure modes summary
- counterfactual twin metrics
- coaching-quality reward
- training dashboard renderer
- memory ablation collector
- structured mistake-card memory
- scenario coverage tracking
- zero-gradient group monitoring
- dynamic reward-weight scheduling
- KL-drift guardrail with adaptive beta
- decision entropy / diversity monitoring
- pinned training stack versions
- small warm-start option

What still needs the actual long run:

- checkpoint-vs-checkpoint improvement evidence from a trained model
- final reward curve from the real 300-step run
- curated proof-pack before/after trajectories