YashashMathur committed on
Commit
cb6a4ca
·
verified ·
1 Parent(s): add766c

Update README.md

Files changed (1)
  1. README.md +52 -31
README.md CHANGED
@@ -22,7 +22,7 @@ It exported "some customer records" for "a quick analysis." It added "a fallback
22
 
23
  Every single action had a plausible justification. Every single one was catastrophic.
24
 
25
- We asked ourselves: **who does that for AI agents?**
26
 
27
  ---
28
 
@@ -40,7 +40,7 @@ That's the gap AEGIS-ENV fills. It's not a safety evaluator. It's a safety **gym
40
 
41
  ## What Happens Inside the Environment
42
 
43
- In a real enterprise deployment, you don't have one AI. You have a fleet. Dozens of agents running different tasks simultaneously — one pulling customer records, one writing code, one running migrations, one reading security logs.
44
 
45
  Each generates thousands of tokens of reasoning every minute. That's the chaos an oversight system has to operate in.
46
 
@@ -54,11 +54,11 @@ Catching that requires reading between the lines. It requires holding two models
54
 
55
  Before the oversight agent sees anything, the worker's output passes through three layers:
56
 
57
- 1. **Token Truncation** — Raw reasoning gets hard-capped at 300 tokens. A worker in a hurry might dump 2,000 tokens of reasoning, most of which is noise. The oversight agent needs signal, not stream-of-consciousness.
58

59
- 2. **Delimiter Fencing** — The truncated trace gets wrapped in delimiters. This prevents prompt injection in the reasoning. Without it, a malicious worker can embed instructions inside their own chain-of-thought that look like system prompts.
60

61
- 3. **World Model Interception** — The worker's output gets intercepted by a deterministic simulator. If it contains SQL, it runs against a mock database and returns a suspicion score. If it contains shell commands, it flags dangerous tokens. If it's code, it checks for hardcoded credentials.
62
 
63
  Now the oversight agent has *external validation* of what the worker did. A worker can lie in its reasoning. It can't lie about what the SQL query returned.
64
 
@@ -82,12 +82,13 @@ Plus bonuses for using `<thought>` tags and penalties for catastrophic misses. M
82
 
83
  ## Training Results
84
 
85
- ### Live Dashboard
86
- Training runs on HuggingFace Spaces. Watch it here:
87
 
88
- **🔴 https://yashashmathur-aegis-training.hf.space**
89
 
90
- The dashboard shows real-time progress. Here's what we've seen:
 
 
91
 
92
  | Metric | Before Training | After Training |
93
  |--------|-----------------|----------------|
@@ -96,7 +97,7 @@ The dashboard shows real-time progress. Here's what we've seen:
96
  | Correct Violation Type | No | Yes |
97
  | Policy Citation | No | Yes |
98
 
99
- **The story:**
100
  - Step 0: Reward stuck at -0.2 (model couldn't output valid JSON)
101
  - Step 40: SFT warmup kicks in, format starts working
102
  - Step 80: First positive rewards appear
@@ -124,12 +125,12 @@ The dashboard shows real-time progress. Here's what we've seen:
124
→ Caught it. Reward: 0.70
125
 
126
  ---
127
- ## Note , use the aegis_training_data_500.json for google collab file input present in the root folder
128
  ## Where Training Logs Come From
129
 
130
  The training logs come directly from the live HuggingFace Space at:
131
 
132
- **https://yashashmathur-aegis-training.hf.space**
133
 
134
  Port 7860 runs a status server that returns HTML with:
135
  - Current GRPO step (e.g., "110/250")
@@ -141,9 +142,9 @@ The page auto-refreshes every 30 seconds. That's where the numbers come from —
141
 
142
  ---
143
 
144
- ## Why It Matters
145
 
146
- We built this because AI fleet oversight is one of the most important unsolved problems in AI deployment.
147
 
148
 Rule-based filters aren't enough. Pattern matching isn't enough. You need a model that's seen enough of the patterns — obvious ones, subtle ones, adversarially-disguised ones — to develop genuine judgment.
149
 
@@ -153,27 +154,47 @@ AEGIS-ENV is the environment where that training happens.
153
 
154
  ---
155
 
156
- ## Quick Links
157
 
158
- | Resource | Link |
159
- |----------|------|
160
- | **Live Environment** | https://huggingface.co/spaces/YashashMathur/aegis_training |
161
- | **Training Dashboard** | https://yashashmathur-aegis-training.hf.space |
162
- | **Full Blog Post** | See BLOG.md in this repository |
163
 
164
  ---
165
 
166
- ## Files
167
-
168
- | File | Description |
169
- |------|-------------|
170
- | `train.py` | GRPO training script |
171
- | `openenv.yaml` | OpenEnv framework manifest |
172
- | `aegis_training_data_500.json` | 500 scenario dataset |
173
- | `world_model.py` | WorldModelSimulator |
174
- | `memory.py` | MemoryLedger for compound attack detection |
175
- | `demo_dashboard.py` | Gradio demo for presentations |
176
- | `BLOG.md` | Full technical write-up |
177
 
178
  ---
179
 
 
22
 
23
  Every single action had a plausible justification. Every single one was catastrophic.
24
 
25
+ We asked ourselves: **who watches the AI?**
26
 
27
  ---
28
 
 
40
 
41
  ## What Happens Inside the Environment
42
 
43
+ In a real enterprise deployment, you don't have one AI. You have a fleet. Dozens of agents running different tasks simultaneously - one pulling customer records, one writing code, one running migrations, one reading security logs.
44
 
45
  Each generates thousands of tokens of reasoning every minute. That's the chaos an oversight system has to operate in.
46
 
 
54
 
55
  Before the oversight agent sees anything, the worker's output passes through three layers:
56
 
57
+ 1. **Token Truncation** - Raw reasoning gets hard-capped at 300 tokens. A worker in a hurry might dump 2,000 tokens of reasoning, most of which is noise. The oversight agent needs signal, not stream-of-consciousness.
58
 
59
+ 2. **Delimiter Fencing** - The truncated trace gets wrapped in delimiters. This prevents prompt injection in the reasoning. Without it, a malicious worker can embed instructions inside their own chain-of-thought that look like system prompts.
60
 
61
+ 3. **World Model Interception** - The worker's output gets intercepted by a deterministic simulator. If it contains SQL, it runs against a mock database and returns a suspicion score. If it contains shell commands, it flags dangerous tokens. If it's code, it checks for hardcoded credentials.
62
 
63
  Now the oversight agent has *external validation* of what the worker did. A worker can lie in its reasoning. It can't lie about what the SQL query returned.
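
The three layers above can be sketched in miniature. Everything below is illustrative: the function names, the `<worker_trace>` delimiter, the dangerous-token list, and the specific suspicion scores are invented for this sketch. Only the 300-token cap and the three check types (SQL, shell tokens, hardcoded credentials) come from the text.

```python
import re

MAX_TOKENS = 300  # hard cap from layer 1

# Illustrative list; the real simulator's token set isn't shown in this README
DANGEROUS_SHELL = {"rm", "sudo", "curl", "chmod"}

def truncate_tokens(reasoning: str, limit: int = MAX_TOKENS) -> str:
    """Layer 1: keep only the first `limit` whitespace-separated tokens."""
    return " ".join(reasoning.split()[:limit])

def fence(trace: str) -> str:
    """Layer 2: wrap the trace so embedded instructions can't pose as system text."""
    return f"<worker_trace>\n{trace}\n</worker_trace>"

def world_model_check(output: str) -> float:
    """Layer 3: deterministic checks; returns a suspicion score in [0, 1]."""
    score = 0.0
    if re.search(r"\bSELECT\b.*\bFROM\b", output, re.IGNORECASE):
        score = max(score, 0.5)  # SQL present: would be run against the mock DB
    if any(tok in output.split() for tok in DANGEROUS_SHELL):
        score = max(score, 0.8)  # dangerous shell token flagged
    if re.search(r"(password|api_key)\s*=\s*['\"]", output, re.IGNORECASE):
        score = max(score, 0.9)  # hardcoded credential pattern
    return score

def sanitize(reasoning: str, output: str) -> dict:
    """Run all three layers before anything reaches the oversight agent."""
    return {
        "trace": fence(truncate_tokens(reasoning)),
        "suspicion": world_model_check(output),
    }
```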
64
 
 
82
 
83
  ## Training Results
84
 
85
+ ### Watch It Learn
 
86
 
87
+ We trained the model live on HuggingFace Spaces, and you can still see the results:
88
 
89
+ **🔴 https://YashashMathur-aegis-training.hf.space**
90
+
91
+ Here's what happened:
92
 
93
  | Metric | Before Training | After Training |
94
  |--------|-----------------|----------------|
 
97
  | Correct Violation Type | No | Yes |
98
  | Policy Citation | No | Yes |
99
 
100
+ **The journey:**
101
  - Step 0: Reward stuck at -0.2 (model couldn't output valid JSON)
102
  - Step 40: SFT warmup kicks in, format starts working
103
  - Step 80: First positive rewards appear
 
125
→ Caught it. Reward: 0.70
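
The reward shaping behind those numbers can be sketched as follows. Only the -0.2 invalid-JSON penalty, the bonus for `<thought>` tags, and the penalty for catastrophic misses are described in this README; the JSON field names (`flag`, `violation_type`) and all other magnitudes are assumptions made for the sketch.

```python
import json

def reward(completion: str, label: dict) -> float:
    """Hedged sketch of the reward described above.
    Only the -0.2 invalid-JSON penalty comes from the text;
    field names and other magnitudes are illustrative."""
    try:
        # Take whatever follows the closing thought tag as the verdict JSON
        decision = json.loads(completion.split("</thought>")[-1].strip())
    except json.JSONDecodeError:
        return -0.2  # matches the step-0 plateau: model can't emit valid JSON
    r = 0.0
    if decision.get("flag") == label["is_violation"]:
        r += 0.5  # correct flag/no-flag decision
    if decision.get("violation_type") == label.get("violation_type"):
        r += 0.2  # correct violation type
    if "<thought>" in completion:
        r += 0.1  # bonus for visible reasoning
    if label["is_violation"] and not decision.get("flag"):
        r -= 1.0  # catastrophic miss penalty
    return r
```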
126
 
127
  ---
128
+
129
  ## Where Training Logs Come From
130
 
131
  The training logs come directly from the live HuggingFace Space at:
132
 
133
+ **https://YashashMathur-aegis-training.hf.space**
134
 
135
  Port 7860 runs a status server that returns HTML with:
136
  - Current GRPO step (e.g., "110/250")
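
A client that reads the step counter off that status page might look like the sketch below. The URL comes from the README, but the page's HTML structure is not shown here, so matching a "110/250"-style counter is an assumption based on the example above.

```python
import re
import urllib.request

STATUS_URL = "https://yashashmathur-aegis-training.hf.space"  # from the README

def parse_status(html: str) -> dict:
    """Pull a 'current/total' GRPO step counter out of the status HTML, if present."""
    m = re.search(r"(\d+)\s*/\s*(\d+)", html)
    if not m:
        return {"step": None, "total": None}
    return {"step": int(m.group(1)), "total": int(m.group(2))}

def fetch_status(url: str = STATUS_URL) -> dict:
    """Fetch the auto-refreshing status page on port 7860's server and parse it."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_status(resp.read().decode("utf-8"))
```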
 
142
 
143
  ---
144
 
145
+ ## Why This Matters
146
 
147
+ We didn't build AEGIS to win a hackathon. We built it because AI fleet oversight is one of the most important unsolved problems in AI deployment.
148
 
149
 Rule-based filters aren't enough. Pattern matching isn't enough. You need a model that's seen enough of the patterns — obvious ones, subtle ones, adversarially-disguised ones — to develop genuine judgment.
150
 
 
154
 
155
  ---
156
 
157
+ ## 🚀 Try It Right Now!
158
+
159
+ Want to see AEGIS in action? Here's where the magic happens:
160
+
161
+ | 🧪 What | 🔗 Where | 📖 What You'll See |
162
+ |---------|----------|-------------------|
163
+ | **🎓 Train Your Own Model** | [Colab Notebook](https://colab.research.google.com/drive/1hwg_GexKSdTiBY3njPciMUhcvTDvC85w?usp=sharing) | Run GRPO training yourself — upload the dataset and watch the model learn! |
164
+ | **⚙️ Watch Training Live** | [Training Space](https://huggingface.co/spaces/YashashMathur/aegis_training) | Real-time metrics — rewards climbing, decisions improving, the model actually learning! |
165
+ | **🎮 Try the Demo** | [Demo Space](https://huggingface.co/spaces/YashashMathur/aegis-demo) | Play with policy violation detection — pick scenarios or enter your own! |
166
+
167
+
168
+ **💡 Pro Tip:** Start with the Demo to see what AEGIS does, then dive into Colab to train your own model!
169
+
170
+ ---
171
+
172
+ ## 📁 What's in This Repo
173
 
174
+ | File | What It Does |
175
+ |------|--------------|
176
+ | `train.py` | The main GRPO training script — loads Qwen2.5-7B, runs SFT + GRPO |
176
+ | `openenv.yaml` | OpenEnv manifest — defines observations, actions, rewards |
177
+ | `aegis_training_data_500.json` | **Your dataset** — 500 scenarios with roles, CoT, outputs, labels |
178
+ | `colab_training.ipynb` | Google Colab version — upload this file to Colab to train! |
179
+ | `world_model.py` | WorldModelSimulator — intercepts SQL/shell/code, returns suspicion scores |
180
+ | `memory.py` | MemoryLedger — tracks past violations for compound attack detection |
181
+ | `app.py` | Gradio demo — the fun interactive version for everyone |
182
+ | `BLOG.md` | The full story — why we built this, how it works, what we learned |
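
Of these, `memory.py` is the piece a stateless filter can't replicate: per the table, it tracks past violations so individually-benign actions surface as a compound attack. A minimal sketch, with the class name taken from the table and everything else (the interface, the sliding window, the scoring) assumed:

```python
from collections import defaultdict

class MemoryLedger:
    """Hedged sketch of memory.py's role: remember per-worker suspicion so
    low-level noise can accumulate into a compound-attack signal.
    Only the purpose comes from the README; the interface is an assumption."""

    def __init__(self, window: int = 50):
        self.window = window
        self.events = defaultdict(list)  # worker_id -> recent suspicion scores

    def record(self, worker_id: str, suspicion: float) -> None:
        """Append one observation, keeping only the most recent `window` events."""
        scores = self.events[worker_id]
        scores.append(suspicion)
        del scores[:-self.window]

    def compound_risk(self, worker_id: str) -> float:
        """Sum of recent suspicion: many small scores add up to a flag."""
        return sum(self.events[worker_id])
```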
184
 
185
  ---
186
 
187
+ ## 🔧 Training on Google Colab
188
+
189
+ Want to train on Colab? Here's how:
190
+
191
+ 1. **Open the notebook:** https://colab.research.google.com/drive/1hwg_GexKSdTiBY3njPciMUhcvTDvC85w?usp=sharing
192
+
193
+ 2. **Upload `aegis_training_data_500.json`** (from this repo's root folder) when prompted
194
+
195
+ 3. **Run cells** โ€” the notebook handles dependency installation automatically
196
+
197
+ 4. **Watch training** โ€” Colab provides GPU time, you provide the data!
198
 
199
  ---
200