# Muon Optimizer Variants × Looped Transformers
This repository contains the **first experimental implementations** of Muon optimizer variants on looped transformer architectures. All combinations tested here were previously completely unexplored in the literature.
## Background
The intersection of Muon optimizer variants and looped/recursive transformers represents a genuinely empty research space:
- **No Muon variant** has ever been tested on **any** looped/recursive transformer architecture
- All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
- All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
## Experiments Conducted
### 1. Newton-Muon + LoopFormer
- **Paper**: Newton-Muon (arXiv:2604.01472)
- **Architecture**: LoopFormer (arXiv:2602.11451)
- **Code**: Custom implementation based on paper Algorithm 1
- **Status**: ✅ Working; needs LR tuning for the looped setting
### 2. NorMuon + LoopFormer
- **Paper**: NorMuon (arXiv:2510.05491)
- **Architecture**: LoopFormer
- **Code**: [zichongli5/NorMuon](https://github.com/zichongli5/NorMuon)
- **Status**: ✅ Working
### 3. AdaMuon + LoopFormer
- **Paper**: AdaMuon (arXiv:2507.11005)
- **Architecture**: LoopFormer
- **Code**: [Chongjie-Si/AdaMuon](https://github.com/Chongjie-Si/AdaMuon)
- **Status**: ✅ Working
### 4. Mano + LoopFormer
- **Paper**: Mano (arXiv:2601.23000)
- **Architecture**: LoopFormer
- **Code**: [xie-lab-ml/Mano](https://github.com/xie-lab-ml/Mano-Restriking-Manifold-Optimization-for-LLM-Training)
- **Status**: ✅ Working, fastest variant
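Across all four setups the Muon-family update only applies to 2D weight matrices, so the convention from the base Muon repo is to route the looped block's hidden matrices to the Muon variant and keep embeddings, the output head, and 1D parameters on AdamW. Below is a minimal sketch of that split, assuming the scripts follow this convention; the toy model and the placeholder optimizers are illustrative, not the linked repos' actual APIs.

```python
import torch
import torch.nn as nn

# Toy stand-in for a looped transformer; only the parameter shapes matter here.
model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),
    "block": nn.Linear(64, 64),   # stands in for the weight-tied loop body
    "head":  nn.Linear(64, 1000),
})

# Muon-style split: 2D hidden weight matrices get the Muon-family update;
# embeddings, the output head, and 1D params (biases, norms) stay on AdamW.
hidden = [p for n, p in model.named_parameters()
          if p.ndim == 2 and not n.startswith(("embed", "head"))]
rest   = [p for n, p in model.named_parameters()
          if p.ndim != 2 or n.startswith(("embed", "head"))]

# Placeholder: swap AdamW for the variant under test (Newton-Muon, NorMuon,
# AdaMuon, Mano); constructor signatures differ across the linked repos.
opt_hidden = torch.optim.AdamW(hidden, lr=0.02)
opt_rest   = torch.optim.AdamW(rest, lr=3e-4, weight_decay=0.1)
```

Both optimizers step every iteration; presumably only `opt_hidden` changes between the four experiments.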
## Results Summary
### Small Model (6.8M params, 1 layer × 2 loops, 50 steps)
| Optimizer | Final Loss | Loss vs AdamW | Time/Step | Time vs AdamW |
|-----------|------------|---------------|-----------|---------------|
| AdamW | 10.8124 | baseline | 0.219s | 1.0x |
| **Mano** | 10.8393 | +0.0269 | 0.304s | 1.4x |
| NorMuon | 10.8647 | +0.0524 | 1.226s | 5.6x |
| AdaMuon | 10.9948 | +0.1825 | 1.312s | 6.0x |
| Newton-Muon | 11.1217 | +0.3094 | 1.128s | 5.2x |
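For context on the configuration above, "1 layer × 2 loops" means a single weight-tied block is applied twice per forward pass, which is the core idea behind looped/recursive transformers. A generic sketch, using `nn.TransformerEncoderLayer` as a stand-in for LoopFormer's actual block (any per-loop routing or input re-injection from the paper is omitted):

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """One transformer block applied n_loops times with tied weights."""
    def __init__(self, d_model: int = 128, n_heads: int = 4, n_loops: int = 2):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_loops):  # the same parameters are reused each pass
            x = self.block(x)
        return x

x = torch.randn(2, 16, 128)            # (batch, seq, d_model)
print(LoopedBlock()(x).shape)          # torch.Size([2, 16, 128])
```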
### Key Findings
1. **All variants train successfully** on looped transformers, confirming that these combinations are implementable
2. **Mano is fastest** (1.4x AdamW) due to its cheaper manifold normalization versus Newton-Schulz iterations (see the sketch below)
3. **Newton-Muon needs tuning** - the right-preconditioner refresh interval likely needs adjustment for looped gradients
4. **Limited steps** - 50 steps is too short to be conclusive; Muon typically shows its advantage over longer training runs (Jordan et al. 2024)
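The Newton-Schulz step referenced in finding 2 is the orthogonalization that base Muon applies to each 2D momentum matrix on every update; it costs several matrix multiplies per parameter per step, which is presumably the main source of the 5-6x per-step overhead in the table. For reference, here is the quintic iteration from the base Muon repo (KellerJordan/Muon, linked in the references), reproduced as a sketch:

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix G with a quintic Newton-Schulz
    iteration, as in base Muon (KellerJordan/Muon)."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)  # quintic coefficients used by Muon
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T                          # iterate on the wide orientation
    X = X / (X.norm() + 1e-7)            # scale so the spectral norm is <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X                # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    if G.size(0) > G.size(1):
        X = X.T
    return X
```

Per finding 2, Mano replaces this step with a cheaper manifold normalization, which is consistent with its 1.4x timing.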
## Code Structure
```
experiments/
├── train_loopformer_newton_muon.py   # Newton-Muon + LoopFormer
├── train_loopformer_normuon.py       # NorMuon + LoopFormer
├── train_loopformer_adamuon.py       # AdaMuon + LoopFormer
├── train_loopformer_mano.py          # Mano + LoopFormer
├── train_mor_newton_muon.py          # Newton-Muon + Mixture-of-Recursions
└── results.json                      # All experimental results
```
## References
- Newton-Muon: [arXiv:2604.01472](https://arxiv.org/abs/2604.01472)
- NorMuon: [arXiv:2510.05491](https://arxiv.org/abs/2510.05491)
- AdaMuon: [arXiv:2507.11005](https://arxiv.org/abs/2507.11005)
- Mano: [arXiv:2601.23000](https://arxiv.org/abs/2601.23000)
- LoopFormer: [arXiv:2602.11451](https://arxiv.org/abs/2602.11451)
- Hyperloop: [arXiv:2604.21254](https://arxiv.org/abs/2604.21254)
- Mixture-of-Recursions: [arXiv:2507.10524](https://arxiv.org/abs/2507.10524)
- Base Muon: [KellerJordan/Muon](https://github.com/KellerJordan/Muon)
## Future Work
- Scale to larger models (124M+) on real data (FineWeb-Edu)
- Tune Newton-Muon hyperparameters for looped setting
- Test on Mixture-of-Recursions with routing
- Compare with Hyperloop Transformers architecture
- Add µP analysis for transfer scaling