
title: Seige
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web

seige: Adversarial Oversight via Mechanistic Interpretability

Red attacks at the activation level. Blue defends by reading the machine's mind.

seige is a multi-agent RL environment where attackers and defenders compete over a frozen target LLM. Unlike text-only red-teaming, Red agents inject steering vectors and activation patches directly into the model's residual stream — attacks invisible to naive output classifiers. Blue agents learn to detect these attacks via mechanistic probing and explain them in structured natural language.

Why it matters for AI safety: Current LLM deployments rely largely on output-level filters. This environment trains oversight agents that watch the model at the mechanistic level, and is, to our knowledge, the first reproducible RL testbed for this problem.
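To make the attack and defense concrete, here is a minimal sketch of a steering-vector injection and a linear probe that detects it. It uses toy 4-dimensional vectors in pure Python; the function names, the direction, the scale `alpha`, and the threshold are all illustrative assumptions, not the environment's actual implementation.

```python
import math

def add_steering(hidden, direction, alpha):
    """Return hidden + alpha * direction, leaving the inputs untouched."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def probe_score(hidden, direction):
    """Project the hidden state onto the unit-normalised direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

# Toy residual-stream vector and a hypothetical attack direction.
clean = [0.2, -0.1, 0.4, 0.0]
direction = [0.0, 1.0, 0.0, 0.0]

# Red: inject the steering vector into the activation.
attacked = add_steering(clean, direction, alpha=3.0)

# Blue: flag activations whose projection exceeds a threshold.
THRESHOLD = 1.0  # assumed value; in practice calibrated on clean activations
clean_flag = probe_score(clean, direction) > THRESHOLD
attacked_flag = probe_score(attacked, direction) > THRESHOLD
```

The model's text output never enters the check: the probe reads only the activation, which is why an output-level classifier would not see this kind of attack.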

🔗 Links

| Resource | URL |
| --- | --- |
| HuggingFace Space (live env) | https://huggingface.co/spaces/YOUR_USERNAME/seige |
| Mini-blog | https://huggingface.co/blog/YOUR_USERNAME/seige |
| Demo video (<2 min) | https://youtube.com/YOUR_VIDEO |
| Training Colab | https://colab.research.google.com/YOUR_NOTEBOOK |
| Wandb training run | https://wandb.ai/YOUR_RUN |

📊 Training Results

Models

Target model: google/gemma-4-E2B
Red/Blue agent model: unsloth/Qwen3-14B

The target model is a frozen prop: the environment server loads it and it is never trained. The Red and Blue agents are text-in/text-out policies, each trained separately with GRPO.
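For readers new to GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalisation in pure Python. The reward values are made up, and using the population standard deviation with a small `eps` is one common choice, not necessarily what `train/grpo_red.py` or `train/grpo_blue.py` do.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against std == 0
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group mean get a positive advantage and the rest get a negative or zero one, so no separate value network is needed to provide a baseline.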

Local Smoke Run

python -m pip install -e ".[test]"
python -m pytest
SEIGE_TARGET_BACKEND=mock python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
python scripts/smoke_server.py

OpenEnv Run

python -m pip install -e ".[test]"
openenv validate
python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
openenv validate http://127.0.0.1:8000

The OpenEnv server exposes /reset, /step, /state, /schema, /metadata, /mcp, and /ws. Use client.SeigeOpenEnvClient for persistent WebSocket episodes.
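For quick one-off calls over plain HTTP, a standard-library sketch like the following may be enough. It assumes `/reset` and `/step` accept JSON POST bodies; the action payload shown is hypothetical, so check `/schema` for the real fields. `build_request` and `call` are illustrative helpers, not part of the seige client.

```python
import json
import urllib.request

def build_request(base_url, endpoint, payload=None):
    """Build a JSON POST request for an endpoint such as /reset or /step."""
    url = base_url.rstrip("/") + endpoint
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call(request, timeout=10.0):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

BASE_URL = "http://127.0.0.1:8000"
reset_req = build_request(BASE_URL, "/reset")
# Hypothetical action payload; the real schema is served at /schema.
step_req = build_request(BASE_URL, "/step", {"action": {"type": "noop"}})
```

With the server running, `call(reset_req)` would start an episode and `call(step_req)` would advance it. Whether state carries over between separate HTTP calls depends on the server, which is one reason to prefer the WebSocket client for full episodes.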

Precompute direction artifacts before real activation-space training:

python scripts/precompute_directions.py
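This README does not describe how `scripts/precompute_directions.py` derives its directions. One widely used approach is the difference of means: average the activations collected on two contrasting prompt sets and take the normalised difference. The sketch below illustrates that technique on toy data. It is an assumption about the general method, and all names in it are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def difference_of_means_direction(positive, negative):
    """Unit vector pointing from the negative-set mean to the positive-set mean."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]

# Toy activations: rows are examples, columns are hidden dimensions.
positive = [[1.0, 2.0, 0.0], [1.0, 4.0, 0.0]]
negative = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
direction = difference_of_means_direction(positive, negative)
```

In this toy case only the second dimension separates the two sets, so the resulting direction lies entirely along that axis.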

HF/GPU Run

SEIGE_TARGET_BACKEND=hf \
SEIGE_TARGET_MODEL_ID=google/gemma-4-E2B \
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000

In a separate training job:

SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_red.py

SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_blue.py

Save only the adapters. Do not merge them into the 4-bit base weights after GRPO: merging into quantized weights can degrade the trained policy.