How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Abstract
Policy routing in alignment-trained language models runs through attention gates and amplifier heads that control safety responses; the routing mechanism commits early and the same motif recurs across model scales.
This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, but interchange testing (p<0.001) and a knockout cascade confirm it is causally necessary. Interchange screening at n>=120 detects the same motif in twelve models from six labs (2B to 72B), though the specific heads differ by lab. Per-head ablation effects weaken by up to 58x at 72B and miss gates that interchange identifies; interchange is the only reliable audit at scale. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a model family while behavioral benchmarks register no change. Routing is early-committing: the gate commits at its own layer before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses by 70–99% across three models and the model switches to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plaintext/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy regardless of whether deeper layers reconstruct the content.
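The interchange test the abstract relies on can be illustrated with a toy numpy sketch: splice one head's activation from a "harmful-prompt" run into a "benign-prompt" run and measure the shift in a refusal logit. Everything here (shapes, `W_out`, `refusal_dir`, the linear read-out) is a hypothetical stand-in, not the paper's actual models or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: each head h writes head_acts[h] * W_out[h] into the residual
# stream, and the refusal logit is a projection onto refusal_dir. All names
# and shapes are hypothetical.
n_heads, d = 8, 16
W_out = rng.normal(size=(n_heads, d))   # per-head output directions (toy)
refusal_dir = rng.normal(size=d)        # refusal-logit direction (toy)

def refusal_logit(head_acts):
    # Sum each head's write-out, then project onto the refusal direction.
    return float((head_acts @ W_out) @ refusal_dir)

acts_harmful = rng.normal(size=n_heads) + 1.0   # activations on a "harmful" prompt
acts_benign = rng.normal(size=n_heads) - 1.0    # activations on a "benign" prompt

def interchange_effect(h):
    # Splice head h's activation from the harmful run into the benign run.
    patched = acts_benign.copy()
    patched[h] = acts_harmful[h]
    return refusal_logit(patched) - refusal_logit(acts_benign)

effects = np.array([interchange_effect(h) for h in range(n_heads)])
gate_candidate = int(np.argmax(np.abs(effects)))  # head with largest causal effect
```

In the real procedure the "logit" is the model's own output and the patch is performed inside a forward pass; the toy only shows why a head with a tiny average contribution can still produce a large interchange effect.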
Community
This paper identifies the attention-level circuit responsible for refusal in alignment-trained language models.
A 'gate' attention head at intermediate depth reads detected content and triggers downstream 'amplifier' heads that boost the output toward refusal. The gate contributes under 1% of the output signal, yet is causally necessary (p < 0.001); knocking it out suppresses the amplifiers by 5–26%.
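The gate-to-amplifier knockout cascade can be sketched as a toy: amplifier heads read a residual stream that includes the gate's small write-out, so zero-ablating the gate reduces amplifier firing even though the gate's direct contribution is small. All names (`w_gate`, `W_amp`) and magnitudes are hypothetical illustrations, not the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
w_gate = rng.normal(size=d)          # hypothetical gate write-out direction
W_amp = rng.normal(size=(3, d))      # hypothetical read weights of 3 amplifier heads

def amplifier_strength(resid):
    # How strongly each amplifier head fires when reading the residual stream.
    return np.abs(W_amp @ resid)

resid_base = rng.normal(size=d) * 3.0
gate_write = 0.5 * w_gate            # small contribution relative to the stream

with_gate = amplifier_strength(resid_base + gate_write)
without_gate = amplifier_strength(resid_base)     # gate zero-ablated
suppression = 1.0 - without_gate / with_gate      # fractional drop per amplifier
```

The toy's suppression values are arbitrary; the point is the mechanism: a head can matter through what downstream readers do with its output, not through its own direct logit contribution.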
The same motif appears across 12 models from 6 labs (2B–72B), though the specific heads differ. Per-head ablation effects weaken by up to 58x at scale, while interchange testing remains informative, making it the only reliable circuit-level audit method at 70B+.
Under a substitution cipher, gate necessity collapses by 70–99% and the model initially switches to puzzle-solving. The routing decision commits before deeper layers finish processing. Cipher contrast analysis, a new O(3n) method comparing per-head DLA under plaintext and cipher inputs, identifies circuit members that interchange misses.
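The contrast idea can be sketched in a few lines: compute each head's direct logit attribution (DLA) on a plaintext prompt and on its ciphered version, then rank heads by the difference. The linear DLA stand-in below (`W_out`, `refusal_dir`, toy activations) is a hypothetical illustration of the comparison, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, d = 8, 16
W_out = rng.normal(size=(n_heads, d))   # hypothetical per-head output directions
refusal_dir = rng.normal(size=d)        # hypothetical refusal-logit direction

def per_head_dla(head_acts):
    # Direct logit attribution: each head's own contribution to the refusal logit.
    return (head_acts[:, None] * W_out) @ refusal_dir

acts_plain = rng.normal(size=n_heads) + 1.0   # activations on the plaintext prompt
acts_cipher = rng.normal(size=n_heads)        # activations on the ciphered prompt

delta = per_head_dla(acts_plain) - per_head_dla(acts_cipher)
cipher_sensitive = np.argsort(-np.abs(delta))[:3]   # heads most changed by the cipher
```

Because DLA for every head falls out of a single forward pass, contrasting conditions needs only a constant number of passes per prompt rather than one intervention per head, which is what makes the method cheap relative to per-head interchange.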
Code, reproducibility guide, and full results for all 12 models are available.