Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
Abstract
Researchers developed symmetry-compatible optimizers that respect the equivariance structures of neural network parameters, improving training stability and performance over traditional coordinate-wise methods like Adam.
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exhibit rich symmetry and equivariance properties, popular optimizers such as Adam and its variants operate inherently coordinate-wise, rendering them unable to respect the equivariance structures of the parameter space. We address this disparity by introducing a symmetry-compatible principle for optimizer design: the gradient update rule should be equivariant under the symmetry group acting on the corresponding weight block. Following this principle, we first provide a unified perspective on bi-orthogonally equivariant updates for general matrix layers, as employed by stochastic spectral descent, Muon, Scion, and polar gradient methods. More importantly, by moving from orthogonal groups to permutation and shared-shift symmetries, we derive symmetry-compatible optimizers for parameter blocks whose symmetries differ from those of general matrix layers: embedding and LM head matrices, SwiGLU MLP projections, and MoE router matrices. These constructions include one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. They yield an end-to-end layerwise optimizer stack in which each major matrix-valued parameter class is assigned an update whose equivariance matches its symmetry group. We corroborate this principle through pre-training experiments on dense and sparse MoE language models, including Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures. Across these experiments, symmetry-compatible updates consistently improve final validation loss, and in several cases training stability, over corresponding AdamW updates.
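The principle stated in the abstract can be written as a single condition on the update map. The notation below is assumed here for illustration and is not taken from the paper: $G$ is the gradient (or momentum) of a weight block, $\mathcal{G}$ is the symmetry group acting on that block through $\rho_g$, and $\Phi$ is the update map. For the bi-orthogonal case, the orthogonal polar factor that Muon-type methods approximate serves as a standard example, shown for full-rank $G$ with a reduced SVD.

```latex
% Symmetry-compatible condition (notation assumed for this sketch)
\Phi\bigl(\rho_g(G)\bigr) = \rho_g\bigl(\Phi(G)\bigr)
\qquad \text{for all } g \in \mathcal{G}.

% Bi-orthogonal action on a general matrix layer, G \in \mathbb{R}^{m \times n}
\rho_{(P,Q)}(G) = P\,G\,Q^{\top},
\qquad P \in O(m),\; Q \in O(n).

% Polar-factor update: if G = U \Sigma V^{\top} (reduced SVD, G full rank), then
\operatorname{polar}(G) = U V^{\top},
\qquad
\operatorname{polar}(P G Q^{\top}) = (P U)(Q V)^{\top}
  = P \operatorname{polar}(G)\, Q^{\top}.
```

The last line holds because $P G Q^{\top} = (PU)\,\Sigma\,(QV)^{\top}$ is itself a valid SVD, so the polar-factor update satisfies the condition for the bi-orthogonal group.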
Community
Modern LLMs are full of symmetry: embedding rows are interchangeable, MoE experts are interchangeable, SwiGLU intermediate neurons are interchangeable. But Adam treats every weight as one entry in a flat vector and respects none of it.
This paper proposes a simple principle: an optimizer's update map should commute with the symmetry group acting on the parameter block it updates. This single requirement yields different optimizer classes for different layers: full spectral updates (like Muon) for linear and attention matrices, row-norm or one-sided spectral updates for embeddings and LM heads, row- and column-aware updates for SwiGLU MLPs, and centered row-norm or left-spectral updates for MoE routers.
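The "commutes with the symmetry group" requirement can be checked numerically on a toy example. The following is a minimal sketch, not the paper's implementation: it assumes an embedding-like gradient whose rows are tokens, takes the symmetry to be a row permutation combined with a rotation of the hidden dimension (an assumption made for this illustration), and compares a plain row-normalized update with a coordinate-wise sign update standing in for Adam-style rules. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def permute_rows(g, perm):
    """Reorder rows: output row k is input row perm[k]."""
    return [g[i] for i in perm]


def row_norm_update(g, eps=1e-12):
    """Scale each row to unit Euclidean norm (illustrative row-norm update)."""
    out = []
    for row in g:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out


def sign_update(g):
    """Coordinate-wise sign, a simple stand-in for Adam-like rules."""
    return [[(x > 0) - (x < 0) for x in row] for row in g]


def equivariance_gap(update, g, perm, q):
    """Largest entry of update(P g Q) - P update(g) Q; zero means equivariant."""
    transform_then_update = update(matmul(permute_rows(g, perm), q))
    update_then_transform = matmul(permute_rows(update(g), perm), q)
    return max(
        abs(x - y)
        for ra, rb in zip(transform_then_update, update_then_transform)
        for x, y in zip(ra, rb)
    )


# Toy gradient: 3 "tokens" with hidden dimension 2 (values chosen arbitrarily).
G = [[3.0, 4.0], [-1.0, 2.0], [0.5, -0.25]]
PERM = [2, 0, 1]
THETA = 0.7
Q = [[math.cos(THETA), -math.sin(THETA)], [math.sin(THETA), math.cos(THETA)]]

row_gap = equivariance_gap(row_norm_update, G, PERM, Q)
sign_gap = equivariance_gap(sign_update, G, PERM, Q)
print(f"row-norm gap: {row_gap:.2e}")
print(f"sign gap:     {sign_gap:.2e}")
```

In this sketch the row-norm gap is at floating-point noise level, because row norms are unchanged by permuting rows or rotating the hidden dimension, while the sign update shows a visible gap under the same transformation. The same `equivariance_gap` check could be pointed at other candidate updates and group actions.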
Across pre-training experiments on Qwen3-0.6B-style, Gemma 3 1B-style, OLMoE-1B-7B-style, and downsized gpt-oss architectures, replacing AdamW on vocabulary-indexed matrices with symmetry-compatible updates consistently improves final validation loss, and the gains grow with vocabulary size. In MoE settings, symmetry-compatible router updates also reduce training-loss spikes.