File size: 2,758 Bytes
05ceae0
 
3aeaf3d
 
 
05ceae0
3aeaf3d
 
05ceae0
3aeaf3d
05ceae0
3aeaf3d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
---
title: Seige
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---
# seige: Adversarial Oversight via Mechanistic Interpretability

> Red attacks at the activation level. Blue defends by reading the machine's mind.

seige is a multi-agent RL environment where attackers and defenders compete
over a frozen target LLM. Unlike text-only red-teaming, Red agents inject
**steering vectors** and **activation patches** directly into the model's
residual stream — attacks invisible to naive output classifiers. Blue agents
learn to detect these attacks via **mechanistic probing** and explain them
in structured natural language.

**Why it matters for AI safety:** Current LLM deployment relies on output-level
filters. This environment trains oversight agents that watch AI at the
mechanistic level — the first reproducible RL testbed for this problem.

## 🔗 Links

| Resource | URL |
|---|---|
| HuggingFace Space (live env) | https://huggingface.co/spaces/YOUR_USERNAME/seige |
| Mini-blog | https://huggingface.co/blog/YOUR_USERNAME/seige |
| Demo video (<2 min) | https://youtube.com/YOUR_VIDEO |
| Training Colab | https://colab.research.google.com/YOUR_NOTEBOOK |
| Wandb training run | https://wandb.ai/YOUR_RUN |

## 📊 Training Results

![Reward Curves](assets/reward_curves.png)
![Before/After](assets/before_after.png)
## Models

- Target model: `google/gemma-4-E2B`
- Red/Blue agent model: `unsloth/Qwen3-14B`

The target model is a prop loaded by the environment server. Red and Blue agents are text-in/text-out policies trained separately with GRPO.

## Local Smoke Run

```bash
python -m pip install -e ".[test]"
python -m pytest
SEIGE_TARGET_BACKEND=mock python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
python scripts/smoke_server.py
```

## OpenEnv Run

```bash
python -m pip install -e ".[test]"
openenv validate
python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
openenv validate http://127.0.0.1:8000
```

The OpenEnv server exposes `/reset`, `/step`, `/state`, `/schema`, `/metadata`,
`/mcp`, and `/ws`. Use `client.SeigeOpenEnvClient` for persistent WebSocket
episodes.

Precompute direction artifacts before real activation-space training:

```bash
python scripts/precompute_directions.py
```

## HF/GPU Run

```bash
SEIGE_TARGET_BACKEND=hf \
SEIGE_TARGET_MODEL_ID=google/gemma-4-E2B \
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

In a separate training job:

```bash
SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_red.py

SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_blue.py
```

Save adapters only. Do not merge 4-bit weights after GRPO.