NN Arch Components
A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
Paper
• 2601.22966
• Published
STEM: Scaling Transformers with Embedding Modules
Paper
• 2601.10639
• Published • 2
Paper
• 2601.00417
• Published • 34
mHC: Manifold-Constrained Hyper-Connections
Paper
• 2512.24880
• Published • 322
VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
Paper
• 2512.14531
• Published • 15
Stronger Normalization-Free Transformers
Paper
• 2512.10938
• Published • 22
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Paper
• 2505.06708
• Published • 11
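The central finding in this paper is that an input-dependent, elementwise sigmoid gate applied to each head's scaled-dot-product-attention output adds non-linearity and sparsity and removes attention sinks. Below is a minimal PyTorch sketch of that output-gating idea; the class and parameter names are our own illustration, not the paper's released code, and real models embed this inside a full attention stack with rotary embeddings, KV caching, etc.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Causal multi-head attention with an elementwise sigmoid gate on the
    per-head SDPA output, computed from the same input hidden states.
    Sketch of the output-gating variant described in the paper above."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.gate = nn.Linear(dim, dim, bias=False)  # per-head, per-channel gate logits
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, heads, T, head_dim) for SDPA.
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, D)
        # Input-dependent sigmoid gate, applied elementwise to the attention output
        # before the output projection.
        gated = attn * torch.sigmoid(self.gate(x))
        return self.out(gated)
```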
Transformers without Normalization
Paper
• 2503.10622
• Published • 172
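This paper replaces LayerNorm/RMSNorm with Dynamic Tanh, DyT(x) = γ · tanh(αx) + β, where α is a learnable scalar and γ, β are learnable per-channel vectors. A minimal sketch of that module, assuming a default α initialization of 0.5 (the paper tunes the init per model family):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in, normalization-free replacement for LayerNorm/RMSNorm.
    y = gamma * tanh(alpha * x) + beta, elementwise over the channel dimension."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))              # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Usage is the same as a norm layer: swap `nn.LayerNorm(dim)` for `DyT(dim)` wherever a Transformer block normalizes its input.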
Forgetting Transformer: Softmax Attention with a Forget Gate
Paper
• 2503.02130
• Published • 32
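The Forgetting Transformer (FoX) adds a data-dependent forget gate f_t ∈ (0, 1) to softmax attention by adding the cumulative log-forget term Σ_{l=j+1..i} log f_l to the attention logit between query i and key j, so older keys are smoothly down-weighted. Below is a naive O(T²) single-head reference sketch, assuming the per-timestep gate logits are produced by a small linear projection elsewhere; the paper describes a hardware-efficient, FlashAttention-compatible implementation rather than this materialized form.

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, forget_logits):
    """Causal softmax attention with a forget gate (single head, reference only).

    q, k, v: (T, d) tensors; forget_logits: (T,) pre-sigmoid gate inputs.
    logit(i, j) = q_i . k_j / sqrt(d) + sum_{l=j+1..i} log sigmoid(forget_logits[l]).
    """
    T, d = q.shape
    log_f = F.logsigmoid(forget_logits)        # log f_t, each <= 0
    c = torch.cumsum(log_f, dim=0)             # c_t = sum_{l<=t} log f_l
    decay = c.unsqueeze(1) - c.unsqueeze(0)    # decay[i, j] = c_i - c_j
    scores = q @ k.transpose(0, 1) / d ** 0.5 + decay
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

With all gates saturated at 1 (log f = 0) the decay term vanishes and this reduces to standard causal softmax attention.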
Paper
• 2409.19606
• Published • 26
Paper
• 2511.11238
• Published • 39