# Muon Optimizer Variants × Looped Transformers
This repository contains the first experimental implementations of Muon optimizer variants on looped transformer architectures. None of the combinations tested here has previously appeared in the literature.
## Background
The intersection of Muon optimizer variants and looped/recursive transformers represents a genuinely empty research space:
- No Muon variant has ever been tested on any looped/recursive transformer architecture
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
## Experiments Conducted
### 1. Newton-Muon + LoopFormer
- Paper: Newton-Muon (arXiv:2604.01472)
- Architecture: LoopFormer (arXiv:2602.11451)
- Code: Custom implementation based on paper Algorithm 1
- Status: ✅ Working, needs LR tuning for the looped setting
### 2. NorMuon + LoopFormer
- Paper: NorMuon (arXiv:2510.05491)
- Architecture: LoopFormer
- Code: zichongli5/NorMuon
- Status: ✅ Working
### 3. AdaMuon + LoopFormer
- Paper: AdaMuon (arXiv:2507.11005)
- Architecture: LoopFormer
- Code: Chongjie-Si/AdaMuon
- Status: ✅ Working
### 4. Mano + LoopFormer
- Paper: Mano (arXiv:2601.23000)
- Architecture: LoopFormer
- Code: xie-lab-ml/Mano
- Status: ✅ Working, fastest variant
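All four experiments share the same wiring: the Muon-variant optimizer updates the 2D weight matrices of the looped block, while AdamW handles embeddings, the output head, norms, and biases. The sketch below illustrates that split under assumed names: the `LoopFormer` and `Muon` imports are placeholders for whichever model class and optimizer implementation is in use, the `"embed"`/`"lm_head"` name filters are assumptions about the model's parameter names, and the hyperparameters are illustrative rather than the values used in the runs below.

```python
import torch
import torch.nn.functional as F

# Placeholder imports: substitute the actual looped-transformer model class and
# the Muon variant under test (Newton-Muon, NorMuon, AdaMuon, or Mano).
from loopformer import LoopFormer      # assumed module/class name
from muon_variants import Muon         # assumed module/class name

model = LoopFormer(vocab_size=50257, d_model=256, n_layers=1, n_loops=2)

# Muon-family optimizers act on hidden 2D weight matrices; embeddings, the LM
# head, norms, and biases stay on AdamW, following the original Muon recipe.
muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
        muon_params.append(p)
    else:
        adamw_params.append(p)

opt_muon  = Muon(muon_params, lr=0.02, momentum=0.95)      # illustrative values
opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.01)

for step in range(50):                     # matches the 50-step runs below
    x = torch.randint(0, 50257, (8, 128))  # toy random batch; use real data in practice
    logits = model(x)                      # the shared block runs n_loops times, so its
                                           # weight gradients sum over loop iterations
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           x[:, 1:].reshape(-1))
    loss.backward()
    opt_muon.step(); opt_adamw.step()
    opt_muon.zero_grad(set_to_none=True); opt_adamw.zero_grad(set_to_none=True)
```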
## Results Summary
### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)
| Optimizer | Final Loss | Δ Loss vs AdamW | Time/Step | Time vs AdamW |
|---|---|---|---|---|
| AdamW | 10.8124 | baseline | 0.219s | 1.0x |
| Mano | 10.8393 | +0.0269 | 0.304s | 1.4x |
| NorMuon | 10.8647 | +0.0524 | 1.226s | 5.6x |
| AdaMuon | 10.9948 | +0.1825 | 1.312s | 6.0x |
| Newton-Muon | 11.1217 | +0.3094 | 1.128s | 5.2x |
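To make the "1 layer × 2 loops" configuration concrete, here is a generic weight-tied looped-transformer sketch: a single transformer block whose parameters are reused on every loop iteration. This is not the LoopFormer implementation (it omits causal masking, positional information, and any loop-specific routing, and its parameter count will not match the 6.8M figure); it only illustrates the weight reuse the optimizers above have to handle.

```python
import torch
import torch.nn as nn

class TinyLoopedLM(nn.Module):
    """Generic weight-tied looped transformer: one block applied n_loops times.
    Illustration only; not the LoopFormer architecture from the paper."""

    def __init__(self, vocab_size=50257, d_model=256, n_heads=4, n_loops=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.n_loops = n_loops
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        h = self.embed(idx)
        # The same block (and therefore the same weight matrices) is applied
        # n_loops times; effective depth grows without adding parameters.
        for _ in range(self.n_loops):
            h = self.block(h)
        return self.lm_head(h)

model = TinyLoopedLM()
logits = model(torch.randint(0, 50257, (2, 16)))   # (batch, seq) -> (2, 16, vocab)
```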
## Key Findings
- All variants train successfully on looped transformers, confirming that these combinations are implementable
- Mano is the fastest variant (1.4x AdamW's time per step) because its manifold normalization is cheaper than Newton-Schulz iterations (see the sketch below)
- Newton-Muon needs tuning: the right-preconditioner refresh interval likely needs adjustment for looped gradients
- Limited step count: Muon typically shows its advantages over longer training runs (Jordan et al. 2024)
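The cost comparison in the Mano bullet refers to the orthogonalization step at the core of Muon-style updates. The function below is the standard quintic Newton-Schulz iteration from the original Muon optimizer (coefficients from Keller Jordan's public implementation, which typically runs it in bfloat16); how each variant modifies or replaces this step differs per paper, so treat it as a reference point for the per-step matmul cost rather than the exact code used in these runs.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize a 2D gradient/momentum matrix with the
    quintic Newton-Schulz iteration (coefficients from the original Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.clone()
    transposed = G.size(0) > G.size(1)
    if transposed:                      # work on the wide orientation so that
        X = X.T                         # X @ X.T is the smaller Gram matrix
    X = X / (X.norm() + eps)            # Frobenius norm >= spectral norm, so ||X||_2 <= 1
    for _ in range(steps):              # each step costs a few matmuls; this is
        A = X @ X.T                     # the overhead Mano's normalization avoids
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

# Example: orthogonalize a random "gradient" for a 256 x 1024 weight matrix.
O = newton_schulz_orthogonalize(torch.randn(256, 1024))
```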
## Code Structure

    experiments/
    ├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
    ├── train_loopformer_normuon.py       # NorMuon + LoopFormer
    ├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
    ├── train_loopformer_mano.py          # Mano + LoopFormer
    ├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
    └── results.json                      # All experimental results
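For inspecting the logged runs, a minimal sketch is below. The schema of `results.json` is an assumption (the key names are made up for illustration); adjust them to whatever the training scripts actually write.

```python
import json

with open("experiments/results.json") as f:
    results = json.load(f)

# Assumed schema: a list of runs with "optimizer", "final_loss", "time_per_step" keys.
for run in results:
    print(f'{run["optimizer"]:>12s}  '
          f'loss={run["final_loss"]:.4f}  '
          f'time/step={run["time_per_step"]:.3f}s')
```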
## Future Work
- Scale to larger models (124M+) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for looped setting
- Test on Mixture-of-Recursions with routing
- Compare with Hyperloop Transformers architecture
- Add µP analysis for transfer scaling