# Muon Optimizer Variants × Looped Transformers

This repository contains the **first experimental implementations** of Muon optimizer variants on looped transformer architectures. All combinations tested here were previously unexplored in the literature.

## Background

The intersection of Muon optimizer variants and looped/recursive transformers represents a genuinely empty research space:
- **No Muon variant** has ever been tested on **any** looped/recursive transformer architecture
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
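
For readers unfamiliar with the looped setting: the defining trick is to reuse one set of block weights across several forward passes, so depth comes from iteration rather than from stacked layers. A minimal weight-tied sketch of the idea (illustrative only; the block here is a stand-in, not the LoopFormer architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_loops = 16, 4

# One shared block's parameters: reused on every loop iteration,
# so parameter count is independent of the loop count.
W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def block(x, W):
    # Stand-in for a transformer block: residual stream plus a nonlinearity.
    return x + np.tanh(x @ W)

x = rng.standard_normal((8, d_model))  # a batch of 8 token vectors
for _ in range(num_loops):             # depth via iteration, not extra layers
    x = block(x, W)
```

Because the same `W` receives gradient contributions from every loop iteration, its gradients accumulate across iterations, which is the property that makes optimizer behavior in this setting worth studying.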

## Experiments Conducted

### 1. Newton-Muon + LoopFormer
- **Paper**: Newton-Muon (arXiv:2604.01472)
- **Architecture**: LoopFormer (arXiv:2602.11451)
- **Code**: Custom implementation based on the paper's Algorithm 1
- **Status**: ✅ Working, needs LR tuning for the looped setting

### 2. NorMuon + LoopFormer
- **Paper**: NorMuon (arXiv:2510.05491)
- **Architecture**: LoopFormer
- **Code**: [zichongli5/NorMuon](https://github.com/zichongli5/NorMuon)
- **Status**: ✅ Working

### 3. AdaMuon + LoopFormer
- **Paper**: AdaMuon (arXiv:2507.11005)
- **Architecture**: LoopFormer
- **Code**: [Chongjie-Si/AdaMuon](https://github.com/Chongjie-Si/AdaMuon)
- **Status**: ✅ Working

### 4. Mano + LoopFormer
- **Paper**: Mano (arXiv:2601.23000)
- **Architecture**: LoopFormer
- **Code**: [xie-lab-ml/Mano](https://github.com/xie-lab-ml/Mano-Restriking-Manifold-Optimization-for-LLM-Training)
- **Status**: ✅ Working, fastest variant

## Results Summary

### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)

| Optimizer   | Final Loss | Loss vs AdamW | Time/Step | Time vs AdamW |
|-------------|------------|---------------|-----------|---------------|
| AdamW       | 10.8124    | baseline      | 0.219s    | 1.0x          |
| **Mano**    | 10.8393    | +0.0269       | 0.304s    | 1.4x          |
| NorMuon     | 10.8647    | +0.0524       | 1.226s    | 5.6x          |
| AdaMuon     | 10.9948    | +0.1825      | 1.312s    | 6.0x          |
| Newton-Muon | 11.1217    | +0.3094       | 1.128s    | 5.2x          |
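
The comparison columns are simple derived values: the loss column is the variant's final loss minus the AdamW baseline, and the time factor is the ratio of per-step wall-clock times. For example, for the Mano row:

```python
# Baseline and variant rows taken directly from the table above.
adamw_loss, adamw_time = 10.8124, 0.219
mano_loss, mano_time = 10.8393, 0.304

loss_delta = mano_loss - adamw_loss   # "Loss vs AdamW" column
slowdown = mano_time / adamw_time     # "Time vs AdamW" column

print(f"+{loss_delta:.4f}, {slowdown:.1f}x")  # +0.0269, 1.4x
```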

### Key Findings

1. **All variants train successfully** on looped transformers, confirming implementability
2. **Mano is fastest** (1.4x AdamW step time) because its manifold normalization is cheaper than Newton-Schulz iterations
3. **Newton-Muon needs tuning**: the right-preconditioner refresh interval likely needs adjustment for looped gradients
4. **Training was short**: Muon typically shows its advantage over longer runs (Jordan et al., 2024)
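
The Newton-Schulz iterations mentioned in finding 2 are the core of base Muon: they approximately orthogonalize the momentum matrix before the update is applied. A minimal cubic-iteration sketch on a square matrix (base Muon uses a tuned quintic polynomial and handles rectangular matrices; this is illustrative only):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximately map G to the nearest orthogonal matrix.

    Cubic Newton-Schulz iteration: X <- 1.5*X - 0.5 * X @ X.T @ X,
    which drives every singular value of X toward 1 provided they
    start in (0, sqrt(3)).
    """
    X = G / np.linalg.norm(G)  # Frobenius norm: singular values now <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
Q = newton_schulz_orthogonalize(G)
# After enough steps, Q.T @ Q is close to the identity.
```

Each iteration costs a few matrix multiplies, which is why optimizers that replace it with a cheaper normalization (as Mano's manifold step does) win on per-step time.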

## Code Structure

```
experiments/
├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
├── train_loopformer_normuon.py       # NorMuon + LoopFormer
├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
├── train_loopformer_mano.py          # Mano + LoopFormer
├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
└── results.json                      # All experimental results
```
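
A sketch of consuming `results.json`, assuming (purely for illustration; the actual file's schema may differ) that it maps optimizer names to their final loss and per-step time:

```python
import json
import os
import tempfile

# Hypothetical schema for results.json -- not the repository's actual layout.
results = {
    "adamw": {"final_loss": 10.8124, "time_per_step": 0.219},
    "mano":  {"final_loss": 10.8393, "time_per_step": 0.304},
}

path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as f:
    json.dump(results, f, indent=2)

# Reload and pick the optimizer with the lowest final loss.
with open(path) as f:
    loaded = json.load(f)
best = min(loaded, key=lambda k: loaded[k]["final_loss"])
print(best)  # adamw
```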

## References

- Newton-Muon: [arXiv:2604.01472](https://arxiv.org/abs/2604.01472)
- NorMuon: [arXiv:2510.05491](https://arxiv.org/abs/2510.05491)
- AdaMuon: [arXiv:2507.11005](https://arxiv.org/abs/2507.11005)
- Mano: [arXiv:2601.23000](https://arxiv.org/abs/2601.23000)
- LoopFormer: [arXiv:2602.11451](https://arxiv.org/abs/2602.11451)
- Hyperloop: [arXiv:2604.21254](https://arxiv.org/abs/2604.21254)
- Mixture-of-Recursions: [arXiv:2507.10524](https://arxiv.org/abs/2507.10524)
- Base Muon: [KellerJordan/Muon](https://github.com/KellerJordan/Muon)

## Future Work

- Scale to larger models (124M+ parameters) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for the looped setting
- Test on Mixture-of-Recursions with routing enabled
- Compare against the Hyperloop Transformers architecture
- Add µP analysis for hyperparameter transfer across scales