# Muon Optimizer Variants × Looped Transformers

This repository contains the **first experimental implementations** of Muon optimizer variants on looped transformer architectures. To our knowledge, none of the combinations tested here had previously appeared in the literature.

## Background

The intersection of Muon optimizer variants and looped/recursive transformers is a genuinely empty research space:

- **No Muon variant** has been tested on **any** looped/recursive transformer architecture
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam; the weight-tied forward pass they share is sketched below
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
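
What "looped" means here: a single transformer block whose parameters are reused across several forward iterations. Below is a minimal PyTorch sketch of that pattern; the module names and dimensions are hypothetical, and this is not the LoopFormer reference implementation.

```python
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Minimal weight-tied transformer: one block applied n_loops times.

    Illustrative sketch only, not the LoopFormer reference code.
    """
    def __init__(self, vocab_size=50304, d_model=256, n_heads=4, n_loops=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One shared block: its weights receive gradient contributions
        # from every loop iteration.
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
        self.n_loops = n_loops

    def forward(self, tokens):             # tokens: (batch, seq_len) token ids
        x = self.embed(tokens)
        for _ in range(self.n_loops):      # same parameters on every pass
            x = self.block(x)              # (causal masking omitted for brevity)
        return self.head(x)                # logits: (batch, seq_len, vocab_size)
```

The property that matters for optimization is that the shared block's gradient accumulates contributions from every loop iteration, so the update statistics differ from those of a standard depth-stacked transformer.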

## Experiments Conducted

The four experiments differ only in the optimizer; a sketch of the parameter-to-optimizer split such setups typically use appears at the end of this section.

### 1. Newton-Muon + LoopFormer

- **Paper**: Newton-Muon (arXiv:2604.01472)
- **Architecture**: LoopFormer (arXiv:2602.11451)
- **Code**: Custom implementation of Algorithm 1 from the paper
- **Status**: ✅ Working, needs LR tuning for the looped setting

### 2. NorMuon + LoopFormer

- **Paper**: NorMuon (arXiv:2510.05491)
- **Architecture**: LoopFormer
- **Code**: [zichongli5/NorMuon](https://github.com/zichongli5/NorMuon)
- **Status**: ✅ Working

### 3. AdaMuon + LoopFormer

- **Paper**: AdaMuon (arXiv:2507.11005)
- **Architecture**: LoopFormer
- **Code**: [Chongjie-Si/AdaMuon](https://github.com/Chongjie-Si/AdaMuon)
- **Status**: ✅ Working

### 4. Mano + LoopFormer

- **Paper**: Mano (arXiv:2601.23000)
- **Architecture**: LoopFormer
- **Code**: [xie-lab-ml/Mano](https://github.com/xie-lab-ml/Mano-Restriking-Manifold-Optimization-for-LLM-Training)
- **Status**: ✅ Working, fastest variant
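
Muon-family optimizers act on 2D weight matrices, so the usual convention from the base KellerJordan/Muon repo is to route hidden matrices to the matrix optimizer and everything else (embeddings, output head, norms, biases) to AdamW. A minimal sketch of that split, assuming a hypothetical `MuonVariant` class with a Muon-style constructor as a stand-in for any of the four variants:

```python
import torch

# Hypothetical import: each experiment script would swap in the variant
# under test (Newton-Muon, NorMuon, AdaMuon, or Mano).
from muon_variant import MuonVariant  # stand-in name, not a real package

def build_optimizers(model, muon_lr=0.02, adamw_lr=3e-4):
    """Split parameters: 2D hidden matrices -> Muon variant, rest -> AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            muon_params.append(p)   # attention / MLP weight matrices
        else:
            adamw_params.append(p)  # embeddings, output head, norms, biases
    return (MuonVariant(muon_params, lr=muon_lr),
            torch.optim.AdamW(adamw_params, lr=adamw_lr))

# Both optimizers step on every iteration:
#   loss.backward(); opt_muon.step(); opt_adamw.step()
```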

## Results Summary

### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)

| Optimizer | Final Loss | Δ Loss vs AdamW | Time/Step | Time vs AdamW |
|-----------|------------|-----------------|-----------|---------------|
| AdamW | 10.8124 | baseline | 0.219s | 1.0x |
| **Mano** | 10.8393 | +0.0269 | 0.304s | 1.4x |
| NorMuon | 10.8647 | +0.0524 | 1.226s | 5.6x |
| AdaMuon | 10.9948 | +0.1825 | 1.312s | 6.0x |
| Newton-Muon | 11.1217 | +0.3094 | 1.128s | 5.2x |

### Key Findings

1. **All variants train successfully** on looped transformers, confirming that these combinations are implementable
2. **Mano is the fastest variant** (1.4x AdamW time per step), since its manifold normalization is cheaper than the Newton-Schulz iterations sketched below
3. **Newton-Muon needs tuning**: the right-preconditioner refresh interval likely needs adjustment for gradients accumulated across loop iterations
4. **Training runs were short** (50 steps); Muon typically shows its advantage over longer training (Jordan et al., 2024)
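
For context on finding 2: base Muon orthogonalizes each momentum update with a quintic Newton-Schulz iteration whose every step is a handful of matrix multiplies. The sketch below uses the iteration and coefficients from the base KellerJordan/Muon repo; whether each variant keeps this exact routine is an assumption, so treat it as illustrative.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration
    used by base Muon (KellerJordan/Muon). The repeated matmuls below are the
    per-step overhead that a cheaper normalization (e.g. Mano's) avoids.
    """
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from base Muon
    X = G.bfloat16()
    X = X / (X.norm() + eps)            # scale so the iteration converges
    transposed = G.size(0) > G.size(1)
    if transposed:                      # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```

At this model scale the iteration is not amortized by other compute, which is consistent with the 5-6x per-step times in the table above.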

## Code Structure

```
experiments/
├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
├── train_loopformer_normuon.py       # NorMuon + LoopFormer
├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
├── train_loopformer_mano.py          # Mano + LoopFormer
├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
└── results.json                      # All experimental results
```

## References

- Newton-Muon: [arXiv:2604.01472](https://arxiv.org/abs/2604.01472)
- NorMuon: [arXiv:2510.05491](https://arxiv.org/abs/2510.05491)
- AdaMuon: [arXiv:2507.11005](https://arxiv.org/abs/2507.11005)
- Mano: [arXiv:2601.23000](https://arxiv.org/abs/2601.23000)
- LoopFormer: [arXiv:2602.11451](https://arxiv.org/abs/2602.11451)
- Hyperloop: [arXiv:2604.21254](https://arxiv.org/abs/2604.21254)
- Mixture-of-Recursions: [arXiv:2507.10524](https://arxiv.org/abs/2507.10524)
- Base Muon: [KellerJordan/Muon](https://github.com/KellerJordan/Muon)

## Future Work

- Scale to larger models (124M+ params) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for the looped setting
- Test on Mixture-of-Recursions with routing enabled
- Compare against the Hyperloop architecture
- Add µP analysis for hyperparameter transfer across scales