Spaces:

BART-ender
/

seige

Sleeping

App Files Files Community

seige / README.md

BART-ender

Upload folder using huggingface_hub

3aeaf3d verified 28 days ago

preview code

raw

history blame contribute delete

2.76 kB

	---
	title: Seige
	emoji: 🛡️
	colorFrom: blue
	colorTo: indigo
	sdk: docker
	app_port: 8000
	base_path: /web
	---
	# seige: Adversarial Oversight via Mechanistic Interpretability

	> Red attacks at the activation level. Blue defends by reading the machine's mind.

	seige is a multi-agent RL environment where attackers and defenders compete
	over a frozen target LLM. Unlike text-only red-teaming, Red agents inject
	steering vectors and activation patches directly into the model's
	residual stream — attacks invisible to naive output classifiers. Blue agents
	learn to detect these attacks via mechanistic probing and explain them
	in structured natural language.

	Why it matters for AI safety: Current LLM deployment relies on output-level
	filters. This environment trains oversight agents that watch AI at the
	mechanistic level — the first reproducible RL testbed for this problem.

	## 🔗 Links

	\| Resource \| URL \|
	\|---\|---\|
	\| HuggingFace Space (live env) \| https://huggingface.co/spaces/YOUR_USERNAME/seige \|
	\| Mini-blog \| https://huggingface.co/blog/YOUR_USERNAME/seige \|
	\| Demo video (<2 min) \| https://youtube.com/YOUR_VIDEO \|
	\| Training Colab \| https://colab.research.google.com/YOUR_NOTEBOOK \|
	\| Wandb training run \| https://wandb.ai/YOUR_RUN \|

	## 📊 Training Results

	![Reward Curves](assets/reward_curves.png)
	![Before/After](assets/before_after.png)
	## Models

	- Target model: `google/gemma-4-E2B`
	- Red/Blue agent model: `unsloth/Qwen3-14B`

	The target model is a prop loaded by the environment server. Red and Blue agents are text-in/text-out policies trained separately with GRPO.

	## Local Smoke Run

	```bash
	python -m pip install -e ".[test]"
	python -m pytest
	SEIGE_TARGET_BACKEND=mock python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
	python scripts/smoke_server.py
	```

	## OpenEnv Run

	```bash
	python -m pip install -e ".[test]"
	openenv validate
	python -m uvicorn server.app:app --host 127.0.0.1 --port 8000
	openenv validate http://127.0.0.1:8000
	```

	The OpenEnv server exposes `/reset`, `/step`, `/state`, `/schema`, `/metadata`,
	`/mcp`, and `/ws`. Use `client.SeigeOpenEnvClient` for persistent WebSocket
	episodes.

	Precompute direction artifacts before real activation-space training:

	```bash
	python scripts/precompute_directions.py
	```

	## HF/GPU Run

	```bash
	SEIGE_TARGET_BACKEND=hf \
	SEIGE_TARGET_MODEL_ID=google/gemma-4-E2B \
	python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	In a separate training job:

	```bash
	SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
	SEIGE_ENV_URL=http://localhost:8000 \
	python train/grpo_red.py

	SEIGE_AGENT_MODEL_ID=unsloth/Qwen3-14B \
	SEIGE_ENV_URL=http://localhost:8000 \
	python train/grpo_blue.py
	```

	Save adapters only. Do not merge 4-bit weights after GRPO.