metadata
title: Seige
emoji: 🛡️
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 8000
base_path: /web

seige: Adversarial Oversight via Mechanistic Interpretability

Red attacks at the activation level. Blue defends by reading the machine's mind.

seige is a multi-agent RL environment where attackers and defenders compete over a frozen target LLM. Unlike text-only red-teaming, Red agents inject steering vectors and activation patches directly into the model's residual stream, attacks that are invisible to naive output classifiers. Blue agents learn to detect these attacks via mechanistic probing and explain them in structured natural language.
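
To make the attack and defense described above concrete, here is a minimal sketch on a toy residual vector. It is not the environment's implementation: the function names, the 4-dimensional vectors, and the `alpha` scale are all hypothetical, and a real attack would act on the target model's hidden states at a chosen layer.

```python
import math

def add_steering_vector(resid, direction, alpha):
    """Red side (sketch): add alpha times the unit-norm direction to one residual vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [r + alpha * d / norm for r, d in zip(resid, direction)]

def projection_score(resid, direction):
    """Blue side (sketch): scalar projection of the residual onto the unit direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(r * d for r, d in zip(resid, direction)) / norm

# Toy residual vector and a hypothetical attack direction (norm 5).
clean = [0.5, -1.0, 0.25, 2.0]
direction = [0.0, 3.0, 0.0, 4.0]
steered = add_steering_vector(clean, direction, alpha=2.0)

# The projection along the direction moves by alpha, which a linear probe
# can threshold on even when the model's text output looks unchanged.
shift = projection_score(steered, direction) - projection_score(clean, direction)
print(round(shift, 6))
```

A probe this simple only works when the defender already knows the direction; the sketch is meant to show why activation-level signals exist at all, not how Blue agents find them.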

Why it matters for AI safety: Current LLM deployments rely on output-level filters. This environment trains oversight agents that watch the model at the mechanistic level, making it the first reproducible RL testbed for this problem.

🔗 Links

📊 Training Results

[Figures: Reward Curves; Before/After]

Models

  • Target model: google/gemma-4-E2B
  • Red/Blue agent model: unsloth/Qwen3-14B

The target model is a frozen prop loaded by the environment server; it is never trained. Red and Blue agents are separate text-in/text-out policies, each trained with GRPO.
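
For readers unfamiliar with GRPO, its core step is scoring each sampled completion relative to the other completions for the same prompt. The sketch below shows only that normalization; the reward values are made up, and the use of the population standard deviation and the `eps` constant are assumptions rather than settings taken from this repository's training scripts.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical environment rewards for four completions of one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group mean get a positive advantage and are reinforced; those below it are pushed down, with no separate value network required.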

Local Smoke Run

python -m pip install -e ".[test]"
python -m pytest
SEIGE_TARGET_BACKEND=mock python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
python scripts/smoke_server.py

OpenEnv Run

python -m pip install -e ".[test]"
openenv validate
python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
openenv validate http://127.0.0.1:8000

The OpenEnv server exposes /reset, /step, /state, /schema, /metadata, /mcp, and /ws. Use client.SeigeOpenEnvClient for persistent WebSocket episodes.
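
For one-off calls without the WebSocket client, the HTTP endpoints can be reached with the standard library. This is a sketch under assumptions: it assumes /reset and /step accept JSON via POST and /state answers GET, and the empty reset payload is a placeholder. Check /schema on a running server for the actual request shapes.

```python
import json
import urllib.request

def build_request(base_url, endpoint, payload=None):
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = base_url.rstrip("/") + endpoint
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

def call(request, timeout=10.0):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

base = "http://127.0.0.1:8000"
reset_req = build_request(base, "/reset", payload={})  # placeholder payload
state_req = build_request(base, "/state")

# With the server from the commands above running:
#   observation = call(reset_req)
#   state = call(state_req)
```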

Precompute direction artifacts before real activation-space training:

python scripts/precompute_directions.py
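
This README does not describe how the directions are computed. One common approach in the interpretability literature is a difference of means between activations from two contrasting prompt sets, sketched below on toy data. The function names and the harmful/benign labels are hypothetical, and scripts/precompute_directions.py may use a different method.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(positive, negative):
    """Direction pointing from the mean negative activation to the mean positive one."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

# Toy 3-dimensional activations for two hypothetical prompt sets.
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
benign = [[0.0, 2.0, 0.0], [0.0, 2.0, 2.0]]

direction = difference_of_means(harmful, benign)
print(direction)
```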

HF/GPU Run

SEIGE_TARGET_BACKEND=hf \
SEIGE_TARGET_MODEL_ID=google/gemma-4-E2B \
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000

In a separate training job:

SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_red.py

SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
SEIGE_ENV_URL=http://localhost:8000 \
python train/grpo_blue.py
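
The two variables above are the only configuration the commands pass to the training scripts. A script could read them roughly as follows; this is an illustrative sketch, not the code in train/, and the `TrainConfig` class and the fallback defaults are assumptions.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    agent_model_id: str
    env_url: str

def load_train_config(environ=None):
    """Read the SEIGE_* variables, falling back to assumed defaults."""
    env = os.environ if environ is None else environ
    return TrainConfig(
        agent_model_id=env.get("SEIGE_AGENT_MODEL_ID", "unsloth/Qwen3-14B"),
        # Strip a trailing slash so endpoint paths can be appended safely.
        env_url=env.get("SEIGE_ENV_URL", "http://localhost:8000").rstrip("/"),
    )

# Passing a dict instead of os.environ keeps the example deterministic.
cfg = load_train_config({"SEIGE_ENV_URL": "http://localhost:8000/"})
print(cfg)
```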

Save the adapters only. Do not merge them into the 4-bit base weights after GRPO.