# Muon Optimizer Variants × Looped Transformers
This repository contains the first experimental implementations of Muon optimizer variants on looped transformer architectures. None of the combinations tested here has previously appeared in the literature.
## Background
The intersection of Muon optimizer variants and looped/recursive transformers represents a genuinely empty research space:
- No Muon variant has ever been tested on any looped/recursive transformer architecture
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
## Experiments Conducted
### 1. Newton-Muon + LoopFormer
- Paper: Newton-Muon (arXiv:2604.01472)
- Architecture: LoopFormer (arXiv:2602.11451)
- Code: Custom implementation based on paper Algorithm 1
- Status: ✅ Working, needs LR tuning for the looped setting
### 2. NorMuon + LoopFormer
- Paper: NorMuon (arXiv:2510.05491)
- Architecture: LoopFormer
- Code: zichongli5/NorMuon
- Status: ✅ Working
### 3. AdaMuon + LoopFormer
- Paper: AdaMuon (arXiv:2507.11005)
- Architecture: LoopFormer
- Code: Chongjie-Si/AdaMuon
- Status: ✅ Working
### 4. Mano + LoopFormer
- Paper: Mano (arXiv:2601.23000)
- Architecture: LoopFormer
- Code: xie-lab-ml/Mano
- Status: ✅ Working, fastest variant
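All four experiments share the same wiring: the Muon-variant optimizer updates the 2D weight matrices of the looped block, while AdamW handles embeddings, the output head, norms, and biases. The sketch below illustrates that split under assumed names: the `LoopFormer` and `Muon` imports are placeholders for whichever model class and optimizer implementation is in use, the `"embed"`/`"lm_head"` name filters are assumptions about the model's parameter names, and the hyperparameters are illustrative rather than the values used in the runs below.

```python
import torch
import torch.nn.functional as F

# Placeholder imports: substitute the actual looped-transformer model class and
# the Muon variant under test (Newton-Muon, NorMuon, AdaMuon, or Mano).
from loopformer import LoopFormer      # assumed module/class name
from muon_variants import Muon         # assumed module/class name

model = LoopFormer(vocab_size=50257, d_model=256, n_layers=1, n_loops=2)

# Muon-family optimizers act on hidden 2D weight matrices; embeddings, the LM
# head, norms, and biases stay on AdamW, following the original Muon recipe.
muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
        muon_params.append(p)
    else:
        adamw_params.append(p)

opt_muon  = Muon(muon_params, lr=0.02, momentum=0.95)      # illustrative values
opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.01)

for step in range(50):                     # matches the 50-step runs below
    x = torch.randint(0, 50257, (8, 128))  # toy random batch; use real data in practice
    logits = model(x)                      # the shared block runs n_loops times, so its
                                           # weight gradients sum over loop iterations
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           x[:, 1:].reshape(-1))
    loss.backward()
    opt_muon.step(); opt_adamw.step()
    opt_muon.zero_grad(set_to_none=True); opt_adamw.zero_grad(set_to_none=True)
```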
## Results Summary
### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)
| Optimizer | Final Loss | Δ Loss vs AdamW | Time/Step | Time vs AdamW |
|---|---|---|---|---|
| AdamW | 10.8124 | baseline | 0.219s | 1.0x |
| Mano | 10.8393 | +0.0269 | 0.304s | 1.4x |
| NorMuon | 10.8647 | +0.0524 | 1.226s | 5.6x |
| AdaMuon | 10.9948 | +0.1825 | 1.312s | 6.0x |
| Newton-Muon | 11.1217 | +0.3094 | 1.128s | 5.2x |
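To make the "1 layer × 2 loops" configuration concrete, here is a generic weight-tied looped-transformer sketch: a single transformer block whose parameters are reused on every loop iteration. This is not the LoopFormer implementation (it omits causal masking, positional information, and any loop-specific routing, and its parameter count will not match the 6.8M figure); it only illustrates the weight reuse the optimizers above have to handle.

```python
import torch
import torch.nn as nn

class TinyLoopedLM(nn.Module):
    """Generic weight-tied looped transformer: one block applied n_loops times.
    Illustration only; not the LoopFormer architecture from the paper."""

    def __init__(self, vocab_size=50257, d_model=256, n_heads=4, n_loops=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.n_loops = n_loops
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        h = self.embed(idx)
        # The same block (and therefore the same weight matrices) is applied
        # n_loops times; effective depth grows without adding parameters.
        for _ in range(self.n_loops):
            h = self.block(h)
        return self.lm_head(h)

model = TinyLoopedLM()
logits = model(torch.randint(0, 50257, (2, 16)))   # (batch, seq) -> (2, 16, vocab)
```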
## Key Findings
- All variants train successfully on looped transformers, confirming that these combinations are implementable
- Mano is the fastest variant (1.4x AdamW's time per step) because its manifold normalization is cheaper than Newton-Schulz iterations (see the sketch below)
- Newton-Muon needs tuning: the right-preconditioner refresh interval likely needs adjustment for looped gradients
- Limited step count: Muon typically shows its advantages over longer training runs (Jordan et al. 2024)
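The cost comparison in the Mano bullet refers to the orthogonalization step at the core of Muon-style updates. The function below is the standard quintic Newton-Schulz iteration from the original Muon optimizer (coefficients from Keller Jordan's public implementation, which typically runs it in bfloat16); how each variant modifies or replaces this step differs per paper, so treat it as a reference point for the per-step matmul cost rather than the exact code used in these runs.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize a 2D gradient/momentum matrix with the
    quintic Newton-Schulz iteration (coefficients from the original Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.clone()
    transposed = G.size(0) > G.size(1)
    if transposed:                      # work on the wide orientation so that
        X = X.T                         # X @ X.T is the smaller Gram matrix
    X = X / (X.norm() + eps)            # Frobenius norm >= spectral norm, so ||X||_2 <= 1
    for _ in range(steps):              # each step costs a few matmuls; this is
        A = X @ X.T                     # the overhead Mano's normalization avoids
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

# Example: orthogonalize a random "gradient" for a 256 x 1024 weight matrix.
O = newton_schulz_orthogonalize(torch.randn(256, 1024))
```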
## Code Structure

    experiments/
    ├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
    ├── train_loopformer_normuon.py       # NorMuon + LoopFormer
    ├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
    ├── train_loopformer_mano.py          # Mano + LoopFormer
    ├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
    └── results.json                      # All experimental results
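For inspecting the logged runs, a minimal sketch is below. The schema of `results.json` is an assumption (the key names are made up for illustration); adjust them to whatever the training scripts actually write.

```python
import json

with open("experiments/results.json") as f:
    results = json.load(f)

# Assumed schema: a list of runs with "optimizer", "final_loss", "time_per_step" keys.
for run in results:
    print(f'{run["optimizer"]:>12s}  '
          f'loss={run["final_loss"]:.4f}  '
          f'time/step={run["time_per_step"]:.3f}s')
```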
## Future Work
- Scale to larger models (124M+) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for looped setting
- Test on Mixture-of-Recursions with routing
- Compare with Hyperloop Transformers architecture
- Add µP analysis for transfer scaling