How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Abstract
Policy routing in alignment-trained language models runs through attention gates and amplifier heads that control safety responses; the routing mechanism commits early and the same motif recurs across model scales.
This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, but interchange testing (p<0.001) and a knockout cascade confirm it is causally necessary. Interchange screening at n>=120 detects the same motif in twelve models from six labs (2B to 72B), though the specific heads differ by lab. Per-head ablation effects weaken by up to 58x at 72B and miss gates that interchange identifies; interchange is the only reliable audit at scale. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a model family while behavioral benchmarks register no change. Routing is early-committing: the gate commits at its own layer before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses by 70–99% across three models and the model switches to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plaintext/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy regardless of whether deeper layers reconstruct the content.
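The interchange test the abstract relies on can be illustrated with a toy numpy sketch: splice one head's activation from a "harmful-prompt" run into a "benign-prompt" run and measure the shift in a refusal logit. Everything here (shapes, `W_out`, `refusal_dir`, the linear read-out) is a hypothetical stand-in, not the paper's actual models or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: each head h writes head_acts[h] * W_out[h] into the residual
# stream, and the refusal logit is a projection onto refusal_dir. All names
# and shapes are hypothetical.
n_heads, d = 8, 16
W_out = rng.normal(size=(n_heads, d))   # per-head output directions (toy)
refusal_dir = rng.normal(size=d)        # refusal-logit direction (toy)

def refusal_logit(head_acts):
    # Sum each head's write-out, then project onto the refusal direction.
    return float((head_acts @ W_out) @ refusal_dir)

acts_harmful = rng.normal(size=n_heads) + 1.0   # activations on a "harmful" prompt
acts_benign = rng.normal(size=n_heads) - 1.0    # activations on a "benign" prompt

def interchange_effect(h):
    # Splice head h's activation from the harmful run into the benign run.
    patched = acts_benign.copy()
    patched[h] = acts_harmful[h]
    return refusal_logit(patched) - refusal_logit(acts_benign)

effects = np.array([interchange_effect(h) for h in range(n_heads)])
gate_candidate = int(np.argmax(np.abs(effects)))  # head with largest causal effect
```

In the real procedure the "logit" is the model's own output and the patch is performed inside a forward pass; the toy only shows why a head with a tiny average contribution can still produce a large interchange effect.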
Community
This paper identifies the attention-level circuit responsible for refusal in alignment-trained language models.
A 'gate' attention head at intermediate depth reads detected content and triggers downstream 'amplifier' heads that boost the output toward refusal. The gate contributes under 1% of the output signal, yet is causally necessary (p < 0.001); knocking it out suppresses the amplifiers by 5–26%.
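The gate-to-amplifier knockout cascade can be sketched as a toy: amplifier heads read a residual stream that includes the gate's small write-out, so zero-ablating the gate reduces amplifier firing even though the gate's direct contribution is small. All names (`w_gate`, `W_amp`) and magnitudes are hypothetical illustrations, not the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
w_gate = rng.normal(size=d)          # hypothetical gate write-out direction
W_amp = rng.normal(size=(3, d))      # hypothetical read weights of 3 amplifier heads

def amplifier_strength(resid):
    # How strongly each amplifier head fires when reading the residual stream.
    return np.abs(W_amp @ resid)

resid_base = rng.normal(size=d) * 3.0
gate_write = 0.5 * w_gate            # small contribution relative to the stream

with_gate = amplifier_strength(resid_base + gate_write)
without_gate = amplifier_strength(resid_base)     # gate zero-ablated
suppression = 1.0 - without_gate / with_gate      # fractional drop per amplifier
```

The toy's suppression values are arbitrary; the point is the mechanism: a head can matter through what downstream readers do with its output, not through its own direct logit contribution.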
The same motif appears across 12 models from 6 labs (2B–72B), though the specific heads differ. Per-head ablation effects weaken by up to 58x at scale, while interchange testing remains informative, making it the only reliable circuit-level audit method at 70B+.
Under a substitution cipher, gate necessity collapses by 70–99% and the model initially switches to puzzle-solving. The routing decision commits before deeper layers finish processing. Cipher contrast analysis, a new O(3n) method comparing per-head DLA under plaintext and cipher inputs, identifies circuit members that interchange misses.
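The contrast idea can be sketched in a few lines: compute each head's direct logit attribution (DLA) on a plaintext prompt and on its ciphered version, then rank heads by the difference. The linear DLA stand-in below (`W_out`, `refusal_dir`, toy activations) is a hypothetical illustration of the comparison, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, d = 8, 16
W_out = rng.normal(size=(n_heads, d))   # hypothetical per-head output directions
refusal_dir = rng.normal(size=d)        # hypothetical refusal-logit direction

def per_head_dla(head_acts):
    # Direct logit attribution: each head's own contribution to the refusal logit.
    return (head_acts[:, None] * W_out) @ refusal_dir

acts_plain = rng.normal(size=n_heads) + 1.0   # activations on the plaintext prompt
acts_cipher = rng.normal(size=n_heads)        # activations on the ciphered prompt

delta = per_head_dla(acts_plain) - per_head_dla(acts_cipher)
cipher_sensitive = np.argsort(-np.abs(delta))[:3]   # heads most changed by the cipher
```

Because DLA for every head falls out of a single forward pass, contrasting conditions needs only a constant number of passes per prompt rather than one intervention per head, which is what makes the method cheap relative to per-head interchange.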
Code, reproducibility guide, and full results for all 12 models are available.