# Muon Optimizer Variants × Looped Transformers

This repository contains the **first experimental implementations** of Muon optimizer variants on looped transformer architectures. All combinations tested here were previously unexplored in the literature.

## Background

The intersection of Muon optimizer variants and looped/recursive transformers represents a genuinely empty research space:
- **No Muon variant** has ever been tested on **any** looped/recursive transformer architecture
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
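
For readers unfamiliar with the looped setting: the defining trick is to reuse one set of block weights across several forward passes, so depth comes from iteration rather than from stacked layers. A minimal weight-tied sketch of the idea (illustrative only; the block here is a stand-in, not the LoopFormer architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_loops = 16, 4

# One shared block's parameters: reused on every loop iteration,
# so parameter count is independent of the loop count.
W = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def block(x, W):
    # Stand-in for a transformer block: residual stream plus a nonlinearity.
    return x + np.tanh(x @ W)

x = rng.standard_normal((8, d_model))  # a batch of 8 token vectors
for _ in range(num_loops):             # depth via iteration, not extra layers
    x = block(x, W)
```

Because the same `W` receives gradient contributions from every loop iteration, its gradients accumulate across iterations, which is the property that makes optimizer behavior in this setting worth studying.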

## Experiments Conducted

### 1. Newton-Muon + LoopFormer
- **Paper**: Newton-Muon (arXiv:2604.01472)
- **Architecture**: LoopFormer (arXiv:2602.11451)
- **Code**: Custom implementation based on the paper's Algorithm 1
- **Status**: ✅ Working, needs LR tuning for the looped setting

### 2. NorMuon + LoopFormer
- **Paper**: NorMuon (arXiv:2510.05491)
- **Architecture**: LoopFormer
- **Code**: [zichongli5/NorMuon](https://github.com/zichongli5/NorMuon)
- **Status**: ✅ Working

### 3. AdaMuon + LoopFormer
- **Paper**: AdaMuon (arXiv:2507.11005)
- **Architecture**: LoopFormer
- **Code**: [Chongjie-Si/AdaMuon](https://github.com/Chongjie-Si/AdaMuon)
- **Status**: ✅ Working

### 4. Mano + LoopFormer
- **Paper**: Mano (arXiv:2601.23000)
- **Architecture**: LoopFormer
- **Code**: [xie-lab-ml/Mano](https://github.com/xie-lab-ml/Mano-Restriking-Manifold-Optimization-for-LLM-Training)
- **Status**: ✅ Working, fastest variant

## Results Summary

### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)

| Optimizer   | Final Loss | Loss vs AdamW | Time/Step | Time vs AdamW |
|-------------|------------|---------------|-----------|---------------|
| AdamW       | 10.8124    | baseline      | 0.219s    | 1.0x          |
| **Mano**    | 10.8393    | +0.0269       | 0.304s    | 1.4x          |
| NorMuon     | 10.8647    | +0.0524       | 1.226s    | 5.6x          |
| AdaMuon     | 10.9948    | +0.1825      | 1.312s    | 6.0x          |
| Newton-Muon | 11.1217    | +0.3094       | 1.128s    | 5.2x          |
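
The comparison columns are simple derived values: the loss column is the variant's final loss minus the AdamW baseline, and the time factor is the ratio of per-step wall-clock times. For example, for the Mano row:

```python
# Baseline and variant rows taken directly from the table above.
adamw_loss, adamw_time = 10.8124, 0.219
mano_loss, mano_time = 10.8393, 0.304

loss_delta = mano_loss - adamw_loss   # "Loss vs AdamW" column
slowdown = mano_time / adamw_time     # "Time vs AdamW" column

print(f"+{loss_delta:.4f}, {slowdown:.1f}x")  # +0.0269, 1.4x
```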

### Key Findings

1. **All variants train successfully** on looped transformers, confirming implementability
2. **Mano is fastest** (1.4x AdamW step time) because its manifold normalization is cheaper than Newton-Schulz iterations
3. **Newton-Muon needs tuning**: the right-preconditioner refresh interval likely needs adjustment for looped gradients
4. **Training was short**: Muon typically shows its advantage over longer runs (Jordan et al., 2024)
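
The Newton-Schulz iterations mentioned in finding 2 are the core of base Muon: they approximately orthogonalize the momentum matrix before the update is applied. A minimal cubic-iteration sketch on a square matrix (base Muon uses a tuned quintic polynomial and handles rectangular matrices; this is illustrative only):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=30):
    """Approximately map G to the nearest orthogonal matrix.

    Cubic Newton-Schulz iteration: X <- 1.5*X - 0.5 * X @ X.T @ X,
    which drives every singular value of X toward 1 provided they
    start in (0, sqrt(3)).
    """
    X = G / np.linalg.norm(G)  # Frobenius norm: singular values now <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
Q = newton_schulz_orthogonalize(G)
# After enough steps, Q.T @ Q is close to the identity.
```

Each iteration costs a few matrix multiplies, which is why optimizers that replace it with a cheaper normalization (as Mano's manifold step does) win on per-step time.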

## Code Structure

```
experiments/
├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
├── train_loopformer_normuon.py       # NorMuon + LoopFormer
├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
├── train_loopformer_mano.py          # Mano + LoopFormer
├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
└── results.json                      # All experimental results
```
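
A sketch of consuming `results.json`, assuming (purely for illustration; the actual file's schema may differ) that it maps optimizer names to their final loss and per-step time:

```python
import json
import os
import tempfile

# Hypothetical schema for results.json -- not the repository's actual layout.
results = {
    "adamw": {"final_loss": 10.8124, "time_per_step": 0.219},
    "mano":  {"final_loss": 10.8393, "time_per_step": 0.304},
}

path = os.path.join(tempfile.mkdtemp(), "results.json")
with open(path, "w") as f:
    json.dump(results, f, indent=2)

# Reload and pick the optimizer with the lowest final loss.
with open(path) as f:
    loaded = json.load(f)
best = min(loaded, key=lambda k: loaded[k]["final_loss"])
print(best)  # adamw
```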

## References

- Newton-Muon: [arXiv:2604.01472](https://arxiv.org/abs/2604.01472)
- NorMuon: [arXiv:2510.05491](https://arxiv.org/abs/2510.05491)
- AdaMuon: [arXiv:2507.11005](https://arxiv.org/abs/2507.11005)
- Mano: [arXiv:2601.23000](https://arxiv.org/abs/2601.23000)
- LoopFormer: [arXiv:2602.11451](https://arxiv.org/abs/2602.11451)
- Hyperloop: [arXiv:2604.21254](https://arxiv.org/abs/2604.21254)
- Mixture-of-Recursions: [arXiv:2507.10524](https://arxiv.org/abs/2507.10524)
- Base Muon: [KellerJordan/Muon](https://github.com/KellerJordan/Muon)

## Future Work

- Scale to larger models (124M+ parameters) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for the looped setting
- Test on Mixture-of-Recursions with routing enabled
- Compare against the Hyperloop Transformers architecture
- Add µP analysis for hyperparameter transfer across scales