galimova committed · verified · Commit badfd28 · 1 Parent(s): a7d870c

Upload README.md

Files changed (1):
  1. README.md (+86 -0)
README.md ADDED
 
# Muon Optimizer Variants × Looped Transformers

This repository contains the **first experimental implementations** of Muon optimizer variants on looped transformer architectures. To the best of our knowledge, all of the combinations tested here were previously unexplored in the literature.

## Background

The intersection of Muon optimizer variants and looped/recursive transformers represents a genuinely empty research space:
- **No Muon variant** has ever been tested on **any** looped/recursive transformer architecture (the looped forward pass is sketched after this list)
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test only on standard dense transformers

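
As context for what "looped" means here, the sketch below shows the pattern these architectures share: one weight-tied transformer block applied repeatedly to the hidden state, so effective depth comes from the loop count rather than from extra parameters. This is a minimal, generic PyTorch illustration, not the LoopFormer reference implementation; the class and argument names (`LoopedTransformerLM`, `n_loops`) are invented for the example, and attention masking is omitted.

```python
import torch
import torch.nn as nn

class LoopedTransformerLM(nn.Module):
    """Minimal weight-tied ("looped") transformer LM sketch (illustrative only)."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_loops: int = 2):
        super().__init__()
        self.n_loops = n_loops
        self.embed = nn.Embedding(vocab_size, d_model)
        # A single shared block: effective depth = n_loops, parameter count = 1 layer.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                 # (batch, seq, d_model)
        for _ in range(self.n_loops):          # same weights reused on every iteration
            x = self.block(x)
        return self.lm_head(x)                 # logits over the vocabulary
```

Because every loop iteration reuses the same weight matrices, their gradients accumulate across iterations, which is one reason optimizer behavior in this setting is not obviously the same as in the standard dense case.
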
## Experiments Conducted

### 1. Newton-Muon + LoopFormer
- **Paper**: Newton-Muon (arXiv:2604.01472)
- **Architecture**: LoopFormer (arXiv:2602.11451)
- **Code**: Custom implementation based on Algorithm 1 of the paper
- **Status**: ✅ Working, needs LR tuning for the looped setting

### 2. NorMuon + LoopFormer
- **Paper**: NorMuon (arXiv:2510.05491)
- **Architecture**: LoopFormer
- **Code**: [zichongli5/NorMuon](https://github.com/zichongli5/NorMuon)
- **Status**: ✅ Working

### 3. AdaMuon + LoopFormer
- **Paper**: AdaMuon (arXiv:2507.11005)
- **Architecture**: LoopFormer
- **Code**: [Chongjie-Si/AdaMuon](https://github.com/Chongjie-Si/AdaMuon)
- **Status**: ✅ Working

### 4. Mano + LoopFormer
- **Paper**: Mano (arXiv:2601.23000)
- **Architecture**: LoopFormer
- **Code**: [xie-lab-ml/Mano](https://github.com/xie-lab-ml/Mano-Restriking-Manifold-Optimization-for-LLM-Training)
- **Status**: ✅ Working, fastest variant

## Results Summary

### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)

| Optimizer | Final Loss | Δ Loss vs AdamW | Time/Step | Time vs AdamW |
|-----------|-----------|-----------------|-----------|---------------|
| AdamW | 10.8124 | baseline | 0.219s | 1.0x |
| **Mano** | 10.8393 | +0.0269 | 0.304s | 1.4x |
| NorMuon | 10.8647 | +0.0524 | 1.226s | 5.6x |
| AdaMuon | 10.9948 | +0.1825 | 1.312s | 6.0x |
| Newton-Muon | 11.1217 | +0.3094 | 1.128s | 5.2x |

### Key Findings

1. **All variants train successfully** on looped transformers, confirming that these combinations are implementable
2. **Mano is the fastest variant** (1.4x AdamW's time per step) because its manifold normalization is cheaper than the Newton-Schulz iterations used by the other variants (sketched below)
3. **Newton-Muon needs tuning**: the right-preconditioner refresh interval likely needs adjustment for looped gradients
4. **Runs were short** (50 steps): Muon typically shows its advantages after longer training (Jordan et al., 2024)

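
On finding 2: the base Muon optimizer (KellerJordan/Muon, listed in the references) orthogonalizes each 2-D momentum matrix with a few Newton-Schulz iterations before applying the update, and most of the variants above keep some form of this step. The sketch below follows the iteration published in that repository (five quintic steps with fixed coefficients); it is included to show where the per-step overhead in the table comes from, not as the exact code used in these experiments.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix, as in the base Muon optimizer.

    Each step is a handful of matrix-matrix products, which is why Muon-style
    updates cost noticeably more per step than AdamW's elementwise update.
    """
    assert g.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic iteration coefficients from Muon
    x = g.bfloat16()
    transposed = g.size(0) > g.size(1)
    if transposed:                          # work in the orientation with the smaller Gram matrix
        x = x.T
    x = x / (x.norm() + 1e-7)               # scale so the spectral norm is at most ~1
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)
```

Mano's cheaper manifold normalization avoids most of this matrix-multiplication cost, which matches the 1.4x vs 5-6x per-step overhead in the table above.
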
## Code Structure

```
experiments/
├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
├── train_loopformer_normuon.py       # NorMuon + LoopFormer
├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
├── train_loopformer_mano.py          # Mano + LoopFormer
├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
└── results.json                      # All experimental results
```

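The training scripts themselves are not reproduced in this README. As a rough picture of how such a script can be wired up, Muon-style optimizers are conventionally applied only to the 2-D hidden weight matrices, with AdamW handling embeddings, the LM head, norms, and biases (this split follows the base Muon repository's convention). The sketch below shows that split for the looped model sketched earlier; it is an assumption about how a `train_loopformer_*.py` script could look, not its actual contents, and `MuonVariant` is a placeholder for whichever optimizer class is under test.

```python
import torch

def build_optimizers(model, MuonVariant, muon_lr=0.02, adamw_lr=3e-4):
    """Split parameters the way Muon-style optimizers are usually applied.

    `MuonVariant` is a placeholder for Newton-Muon / NorMuon / AdaMuon / Mano;
    each is assumed to take a parameter list and a learning rate like a
    standard torch.optim optimizer.
    """
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        # Muon-style updates are defined for 2-D weight matrices inside the block;
        # embeddings, the LM head, norms, and biases stay on AdamW.
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            muon_params.append(p)
        else:
            adamw_params.append(p)
    return [MuonVariant(muon_params, lr=muon_lr),
            torch.optim.AdamW(adamw_params, lr=adamw_lr)]

# One training step then looks like:
#   for opt in optimizers: opt.zero_grad()
#   loss.backward()
#   for opt in optimizers: opt.step()
```
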
## References

- Newton-Muon: [arXiv:2604.01472](https://arxiv.org/abs/2604.01472)
- NorMuon: [arXiv:2510.05491](https://arxiv.org/abs/2510.05491)
- AdaMuon: [arXiv:2507.11005](https://arxiv.org/abs/2507.11005)
- Mano: [arXiv:2601.23000](https://arxiv.org/abs/2601.23000)
- LoopFormer: [arXiv:2602.11451](https://arxiv.org/abs/2602.11451)
- Hyperloop: [arXiv:2604.21254](https://arxiv.org/abs/2604.21254)
- Mixture-of-Recursions: [arXiv:2507.10524](https://arxiv.org/abs/2507.10524)
- Base Muon: [KellerJordan/Muon](https://github.com/KellerJordan/Muon)

## Future Work

- Scale to larger models (124M+ params) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for the looped setting
- Test on Mixture-of-Recursions with routing
- Compare with the Hyperloop Transformers architecture
- Add µP analysis for transfer scaling