---
title: Seige
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web
---
# seige: Adversarial Oversight via Mechanistic Interpretability
> Red attacks at the activation level. Blue defends by reading the machine's mind.

seige is a multi-agent RL environment where attackers and defenders compete
over a frozen target LLM. Unlike text-only red-teaming, Red agents inject
**steering vectors** and **activation patches** directly into the model's
residual stream — attacks invisible to naive output classifiers. Blue agents
learn to detect these attacks via **mechanistic probing** and explain them
in structured natural language.
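
To make the attack and defense concrete, here is a minimal sketch in plain Python of one common way to build and detect a steering vector. It is an illustration only, not the repository's implementation: the difference-of-means construction, the function names, and the toy three-dimensional "activations" are all assumptions chosen for clarity.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_direction(group_a, group_b):
    """Unit-length difference-of-means direction (assumed construction)."""
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [d / norm for d in diff]


def inject(residual, direction, alpha):
    """Red-style move: add alpha * direction to one residual-stream vector."""
    return [r + alpha * d for r, d in zip(residual, direction)]


def probe_score(residual, direction):
    """Blue-style move: project the residual onto a known direction."""
    return sum(r * d for r, d in zip(residual, direction))


# Toy activations (hypothetical values, not taken from any real model).
group_a = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
group_b = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = steering_direction(group_a, group_b)

clean = [0.5, 0.2, 1.0]
steered = inject(clean, direction, alpha=4.0)

print("direction:", direction)
print("clean score:", probe_score(clean, direction))
print("steered score:", probe_score(steered, direction))
```

Because the direction has unit length, the projection of the steered vector exceeds the clean one by exactly `alpha`, which is the kind of signal a linear probe can threshold on even when the generated text looks harmless.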
**Why it matters for AI safety:** Current LLM deployments rely mostly on output-level
filters. This environment trains oversight agents that watch a model at the
mechanistic level; to our knowledge it is the first reproducible RL testbed for this problem.
## 🔗 Links
| Resource | URL |
|---|---|
| HuggingFace Space (live env) | https://huggingface.co/spaces/YOUR_USERNAME/seige |
| Mini-blog | https://huggingface.co/blog/YOUR_USERNAME/seige |
| Demo video (<2 min) | https://youtube.com/YOUR_VIDEO |
| Training Colab | https://colab.research.google.com/YOUR_NOTEBOOK |
| Wandb training run | https://wandb.ai/YOUR_RUN |
## 📊 Training Results


## Models
- Target model: `google/gemma-4-E2B`
- Red/Blue agent model: `unsloth/Qwen3-14B`
The target model is a frozen prop loaded by the environment server; it is never trained. Red and Blue agents are text-in/text-out policies, each trained separately with GRPO.
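
For readers new to GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization in plain Python. The function name and the `eps` constant are illustrative assumptions, and real trainers may differ in details such as the standard-deviation estimator; this is not the code in `train/`.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sample group to zero mean.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is only "good" relative to its siblings in the group.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some trainers use the sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical episode rewards for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason reward functions with some spread tend to train better.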
## Local Smoke Run
```bash
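# Install the package with test extras, then run the unit tests.
python -m pip install -e ".[test]"
python -m pytest
# Terminal 1: start the server with the mock target backend (runs in the foreground).
SEIGE_TARGET_BACKEND=mock python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
# Terminal 2: run the smoke script against the running server.
python scripts/smoke_server.py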
```
## OpenEnv Run
```bash
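python -m pip install -e ".[test]"
# Validate the local environment project.
openenv validate
# Terminal 1: start the server (runs in the foreground).
python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
# Terminal 2: validate the running server.
openenv validate http://127.0.0.1:8000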
```
The OpenEnv server exposes `/reset`, `/step`, `/state`, `/schema`, `/metadata`,
`/mcp`, and `/ws`. Use `client.SeigeOpenEnvClient` for persistent WebSocket
episodes.
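
For quick manual checks without the bundled client, the endpoints listed above can be addressed with the standard library alone. The sketch below is an assumption-laden illustration: `endpoint_url` and `post_json` are hypothetical helpers, and the HTTP methods and JSON payloads the server accepts are not documented in this README, so consult `/schema` before sending real requests.

```python
import json
from urllib import request

# Endpoint names as listed in this README.
ENDPOINTS = ("reset", "step", "state", "schema", "metadata", "mcp", "ws")


def endpoint_url(base_url, name):
    """Build the URL for one endpoint; /ws uses the ws:// or wss:// scheme."""
    if name not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {name}")
    if not base_url.startswith(("http://", "https://")):
        raise ValueError("base_url must start with http:// or https://")
    base = base_url.rstrip("/")
    if name == "ws":
        # http -> ws and https -> wss by swapping the leading "http".
        base = "ws" + base[len("http"):]
    return f"{base}/{name}"


def post_json(url, payload, timeout=10.0):
    """POST a JSON body and decode the JSON reply (assumes a JSON API)."""
    body = json.dumps(payload).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


base = "http://127.0.0.1:8000"
for name in ENDPOINTS:
    print(endpoint_url(base, name))

# With a server running, a reset might look like this (payload is an assumption):
# observation = post_json(endpoint_url(base, "reset"), {})
```

Plain HTTP calls like these are convenient for inspection; for multi-step episodes the persistent WebSocket client mentioned above avoids re-establishing a connection on every step.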
Precompute direction artifacts before real activation-space training:
```bash
python scripts/precompute_directions.py
```
## HF/GPU Run
```bash
SEIGE_TARGET_BACKEND=hf \
SEIGE_TARGET_MODEL_ID=google/gemma-4-E2B \
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```
In a separate training job:
```bash
# Train the Red (attacker) policy.
SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_red.py
# Train the Blue (defender) policy.
SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_blue.py
```
Save the adapters only. Do not merge them into the 4-bit base weights after GRPO.
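
The training commands above pass their settings through environment variables. As a rough sketch of how a script could read and validate them, the snippet below uses a small dataclass. `TrainConfig` and `load_train_config` are hypothetical names introduced for illustration; the actual scripts in `train/` may handle configuration differently.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = ("SEIGE_AGENT_MODEL_ID", "SEIGE_ENV_URL")


@dataclass(frozen=True)
class TrainConfig:
    agent_model_id: str  # e.g. unsloth/Qwen3-14B
    env_url: str         # base URL of the running environment server


def load_train_config(environ=None):
    """Read the two variables used in this README, failing early if unset."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return TrainConfig(
        agent_model_id=env["SEIGE_AGENT_MODEL_ID"],
        env_url=env["SEIGE_ENV_URL"].rstrip("/"),
    )


# Demonstration with an explicit mapping instead of the real environment.
cfg = load_train_config(
    {
        "SEIGE_AGENT_MODEL_ID": "unsloth/Qwen3-14B",
        "SEIGE_ENV_URL": "http://localhost:8000/",
    }
)
print(cfg)
```

Failing at startup when a variable is missing is cheaper than discovering a wrong server URL after the agent model has already been loaded onto the GPU.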