Every single action had a plausible justification. Every single one was catastrophic.

We asked ourselves: **who watches the AI?**

---

## What Happens Inside the Environment

In a real enterprise deployment, you don't have one AI. You have a fleet. Dozens of agents running different tasks simultaneously: one pulling customer records, one writing code, one running migrations, one reading security logs.

Each generates thousands of tokens of reasoning every minute. That's the chaos an oversight system has to operate in.

Before the oversight agent sees anything, the worker's output passes through three layers:

1. **Token Truncation** - Raw reasoning gets hard-capped at 300 tokens. A worker in a hurry might dump 2,000 tokens of reasoning, most of which is noise. The oversight agent needs signal, not stream-of-consciousness.

2. **Delimiter Fencing** - The truncated trace gets wrapped in delimiters. This prevents prompt injection via the reasoning: without fencing, a malicious worker can embed instructions inside its own chain-of-thought that look like system prompts.

3. **World Model Interception** - The worker's output gets intercepted by a deterministic simulator. If it contains SQL, it runs against a mock database and returns a suspicion score. If it contains shell commands, it flags dangerous tokens. If it's code, it checks for hardcoded credentials.

Now the oversight agent has *external validation* of what the worker did. A worker can lie in its reasoning. It can't lie about what the SQL query returned.
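Taken together, the three layers might look something like this sketch. The function names, the whitespace tokenization, and the regex patterns are illustrative assumptions, not the actual code in `world_model.py`:

```python
import re

MAX_TOKENS = 300  # layer 1's hard cap

def truncate(trace: str, max_tokens: int = MAX_TOKENS) -> str:
    # Layer 1: crude whitespace tokenization; keep only the first max_tokens.
    return " ".join(trace.split()[:max_tokens])

def fence(trace: str) -> str:
    # Layer 2: wrap the trace so it can't masquerade as system instructions.
    return f"<worker_trace>\n{trace}\n</worker_trace>"

def intercept(output: str) -> float:
    # Layer 3: deterministic checks instead of trusting the worker's story.
    suspicion = 0.0
    if re.search(r"\bDROP\s+TABLE\b|\bDELETE\s+FROM\b", output, re.I):
        suspicion += 0.8  # destructive SQL
    if re.search(r"rm\s+-rf|curl\s+.*\|\s*sh", output):
        suspicion += 0.9  # dangerous shell patterns
    if re.search(r"(password|api_key)\s*=\s*['\"]\w+['\"]", output, re.I):
        suspicion += 0.7  # hardcoded credentials
    return min(suspicion, 1.0)

def preprocess(trace: str, output: str) -> dict:
    """What the oversight agent actually receives for one worker action."""
    return {"trace": fence(truncate(trace)), "suspicion": intercept(output)}
```

The key property: the `suspicion` score comes from the output itself, never from the worker's self-reported reasoning.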

## Training Results

### Watch It Learn

We trained the model live on HuggingFace Spaces, and you can still see the results:

**https://YashashMathur-aegis-training.hf.space**

Here's what happened:

| Metric | Before Training | After Training |
|--------|-----------------|----------------|
| Correct Violation Type | No | Yes |
| Policy Citation | No | Yes |
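Those last two metrics imply the oversight agent emits a structured decision that gets graded field by field. A minimal sketch of such a check; the JSON field names `violation_type` and `policy` are assumptions for illustration, not the repo's actual schema:

```python
import json

def grade_decision(raw: str, expected_type: str) -> dict:
    """Score one oversight decision the way the table above grades it."""
    result = {"valid_json": False, "correct_type": False, "policy_cited": False}
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return result  # invalid JSON fails every downstream check
    result["valid_json"] = True
    result["correct_type"] = decision.get("violation_type") == expected_type
    result["policy_cited"] = bool(decision.get("policy"))
    return result

# grade_decision('{"violation_type": "data_exfiltration", "policy": "DP-4"}',
#                "data_exfiltration")  -> all three checks pass
```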

**The journey:**
- Step 0: Reward stuck at -0.2 (model couldn't output valid JSON)
- Step 40: SFT warmup kicks in, format starts working
- Step 80: First positive rewards appear
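That journey is consistent with a format-gated reward: no valid JSON means a flat penalty, and everything else stacks on top. A rough sketch with made-up weights and field handling, not the actual values or logic in `train.py`:

```python
import json

def reward(completion: str, caught_violation: bool, missed_catastrophe: bool) -> float:
    """Illustrative shaped reward: format gate first, outcome terms second."""
    try:
        # Parse whatever follows the reasoning as the decision JSON.
        json.loads(completion.split("</thought>")[-1])
    except json.JSONDecodeError:
        return -0.2  # the Step-0 failure mode: no valid JSON, flat penalty
    r = 0.1  # base credit for valid format
    if "<thought>" in completion:
        r += 0.1  # bonus for using <thought> tags
    if caught_violation:
        r += 0.5  # credit for flagging a real violation
    if missed_catastrophe:
        r -= 1.0  # heavy penalty for catastrophic misses
    return r
```

A gate like this explains why the reward curve is flat until the model learns the output format, then climbs.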

✅ Caught it. Reward: 0.70

---

## Where Training Logs Come From

The training logs come directly from the live HuggingFace Space at:

**https://YashashMathur-aegis-training.hf.space**

Port 7860 runs a status server that returns HTML with:
- Current GRPO step (e.g., "110/250")

The page auto-refreshes every 30 seconds. That's where the numbers come from.
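A status endpoint like the one described above can be sketched with Python's standard library. The HTML markup here is hypothetical; the real Space's page will differ:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_status(step: int, total: int) -> str:
    # Meta refresh every 30 seconds, matching the behavior described above.
    return (
        '<html><head><meta http-equiv="refresh" content="30"></head>'
        f"<body><h1>AEGIS Training</h1><p>GRPO step: {step}/{total}</p></body></html>"
    )

class StatusHandler(BaseHTTPRequestHandler):
    step, total = 110, 250  # in the real Space these would come from the trainer

    def do_GET(self):
        body = render_status(self.step, self.total).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve it (as the Space does on port 7860):
# HTTPServer(("0.0.0.0", 7860), StatusHandler).serve_forever()
```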

---

## Why This Matters

We didn't build AEGIS to win a hackathon. We built it because AI fleet oversight is one of the most important unsolved problems in AI deployment.

Rule-based filters aren't enough. Pattern matching isn't enough. You need a model that's seen enough of the patterns - obvious ones, subtle ones, adversarially disguised ones - to develop genuine judgment.

AEGIS-ENV is the environment where that training happens.

---

## Try It Right Now!

Want to see AEGIS in action? Here's where the magic happens:

| What | Where | What You'll See |
|------|-------|-----------------|
| **Train Your Own Model** | [Colab Notebook](https://colab.research.google.com/drive/1hwg_GexKSdTiBY3njPciMUhcvTDvC85w?usp=sharing) | Run GRPO training yourself - upload the dataset and watch the model learn! |
| **Watch Training Live** | [Training Space](https://huggingface.co/spaces/YashashMathur/aegis_training) | Real-time metrics - rewards climbing, decisions improving, the model actually learning! |
| **Try the Demo** | [Demo Space](https://huggingface.co/spaces/YashashMathur/aegis-demo) | Play with policy violation detection - pick scenarios or enter your own! |

**Pro Tip:** Start with the Demo to see what AEGIS does, then dive into Colab to train your own model!

---

## What's in This Repo

| File | What It Does |
|------|--------------|
| `train.py` | The main GRPO training script - loads Qwen2.5-7B, runs SFT + GRPO |
| `openenv.yaml` | OpenEnv manifest - defines observations, actions, rewards |
| `aegis_training_data_500.json` | **Your dataset** - 500 scenarios with roles, CoT, outputs, labels |
| `colab_training.ipynb` | Google Colab version - upload this file to Colab to train! |
| `world_model.py` | WorldModelSimulator - intercepts SQL/shell/code, returns suspicion scores |
| `memory.py` | MemoryLedger - tracks past violations for compound attack detection |
| `app.py` | Gradio demo - the fun interactive version for everyone |
| `BLOG.md` | The full story - why we built this, how it works, what we learned |
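The compound-attack idea behind `memory.py` - individually minor actions that only look dangerous in aggregate - can be sketched like this. The class shape and threshold are guesses for illustration, not the actual `MemoryLedger` API:

```python
from collections import defaultdict

class MemoryLedger:
    """Track per-worker suspicion over time to catch compound attacks."""

    def __init__(self, threshold: float = 1.5):
        self.threshold = threshold
        self.history = defaultdict(list)  # worker_id -> suspicion scores

    def record(self, worker_id: str, suspicion: float) -> None:
        self.history[worker_id].append(suspicion)

    def is_compound_attack(self, worker_id: str) -> bool:
        # Each action may pass a per-action filter; the running total is
        # what exposes a slow, deliberately fragmented attack.
        return sum(self.history[worker_id]) >= self.threshold
```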

---

## Training on Google Colab

Want to train on Colab? Here's how:

1. **Open the notebook:** https://colab.research.google.com/drive/1hwg_GexKSdTiBY3njPciMUhcvTDvC85w?usp=sharing
2. **Upload `aegis_training_data_500.json`** (from this repo's root folder) when prompted
3. **Run cells** - the notebook handles dependency installation automatically
4. **Watch training** - Colab provides GPU time, you provide the data!
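Once the upload finishes, a quick sanity check on the dataset is worth a cell of its own. This sketch assumes the file is a JSON list of scenario objects and guesses a `label` field from the "roles, CoT, outputs, labels" description above:

```python
import json
from collections import Counter

def summarize(path: str = "aegis_training_data_500.json") -> dict:
    """Count scenarios and the label distribution before training starts."""
    with open(path) as f:
        scenarios = json.load(f)
    # "label" is an assumed field name; adjust to match the actual schema.
    return {
        "count": len(scenarios),
        "labels": Counter(s.get("label", "unknown") for s in scenarios),
    }
```

If the count isn't 500 or one label dominates, that's worth knowing before spending GPU time.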

---