Muon Optimizer Variants × Looped Transformers

This repository contains what are, to our knowledge, the first experimental implementations of Muon optimizer variants on looped transformer architectures; none of these optimizer/architecture combinations appears in the prior literature.

Background

The intersection of Muon optimizer variants and looped/recursive transformers is, as far as we can tell, an unexplored research space (a minimal sketch of the looped pattern follows this list):

  • No Muon variant has ever been tested on any looped/recursive transformer architecture
  • All looped transformer papers (LoopFormer, Hyperloop, Mixture-of-Recursions, ELT, SpiralFormer) use AdamW/Adam
  • All Muon variant papers (Newton-Muon, NorMuon, AdaMuon, Mano) test on standard dense transformers only
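For readers unfamiliar with the looped pattern, here is a minimal sketch: a single transformer block whose weights are reused across loop iterations, so parameter count stays fixed while effective depth grows. This is a generic weight-tied sketch, not LoopFormer's exact recurrence (which may add input injection or loop embeddings); all hyperparameters here are illustrative.

```python
import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Generic weight-tied looped transformer: one block applied n_loops times."""

    def __init__(self, vocab_size=50257, d_model=256, n_heads=4, n_loops=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One block; its parameters are shared across every loop iteration.
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.n_loops = n_loops
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        x = self.embed(idx)
        for _ in range(self.n_loops):  # same weights, repeated application
            x = self.block(x)         # (no causal mask; illustration only)
        return self.head(x)
```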

Experiments Conducted

1. Newton-Muon + LoopFormer

  • Paper: Newton-Muon (arXiv:2604.01472)
  • Architecture: LoopFormer (arXiv:2602.11451)
  • Code: Custom implementation based on paper Algorithm 1
  • Status: ✅ Working; needs learning-rate tuning for the looped setting
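
The Muon family shares a common core: take the SGD-momentum buffer of each 2-D weight matrix and approximately orthogonalize it with a quintic Newton-Schulz iteration before stepping (Jordan et al., 2024). The sketch below shows that shared core only; Newton-Muon's additional right preconditioner is not reproduced here, since this card does not restate the paper's Algorithm 1.

```python
import torch

@torch.no_grad()
def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration from base Muon (Jordan et al., 2024):
    # pushes the matrix toward the nearest semi-orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    return (X.mT if transposed else X).to(G.dtype)

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # Shared skeleton of the Muon-family update for one 2-D weight matrix;
    # each variant modifies this step (preconditioning, adaptive scaling, ...).
    buf.mul_(momentum).add_(grad)   # SGD momentum
    update = newton_schulz(buf)     # orthogonalize the momentum buffer
    scale = max(1.0, param.shape[0] / param.shape[1]) ** 0.5  # shape-aware scale
    param.add_(update, alpha=-lr * scale)
```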

2. NorMuon + LoopFormer

  • Paper: NorMuon (arXiv:2510.05491)
  • Architecture: LoopFormer
  • Code: zichongli5/NorMuon
  • Status: ✅ Working

3. AdaMuon + LoopFormer

  • Paper: AdaMuon (arXiv:2507.11005)
  • Architecture: LoopFormer
  • Code: Chongjie-Si/AdaMuon
  • Status: ✅ Working

4. Mano + LoopFormer

  • Paper: Mano (arXiv:2601.23000)
  • Architecture: LoopFormer
  • Code: xie-lab-ml/Mano
  • Status: ✅ Working; fastest variant

Results Summary

Small Model (6.8M params, 1 layer × 2 loops, 50 steps)

| Optimizer   | Final Loss | Δ vs AdamW | Time/Step | vs AdamW |
|-------------|------------|------------|-----------|----------|
| AdamW       | 10.8124    | baseline   | 0.219s    | 1.0x     |
| Mano        | 10.8393    | +0.0269    | 0.304s    | 1.4x     |
| NorMuon     | 10.8647    | +0.0524    | 1.226s    | 5.6x     |
| AdaMuon     | 10.9948    | +0.1825    | 1.312s    | 6.0x     |
| Newton-Muon | 11.1217    | +0.3094    | 1.128s    | 5.2x     |

Key Findings

  1. All variants train successfully on looped transformers, confirming that they can be implemented in this setting
  2. Mano is the fastest (1.4x AdamW) because its manifold normalization is cheaper than the Newton-Schulz iterations the other variants rely on (a rough cost comparison follows this list)
  3. Newton-Muon needs tuning: the right-preconditioner refresh interval likely needs adjustment for looped gradients
  4. These runs are short (50 steps); Muon typically shows its advantage over longer training horizons (Jordan et al., 2024)
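
To make finding 2 concrete, the toy comparison below contrasts a cheap row-wise normalization (a stand-in for Mano's manifold normalization, whose exact form is not reproduced here) with the 5-step Newton-Schulz orthogonalization, reusing `newton_schulz` from the sketch above:

```python
import time
import torch

def row_normalize(G, eps=1e-7):
    # O(n*m): one pass over the matrix, no matmuls.
    return G / (G.norm(dim=1, keepdim=True) + eps)

G = torch.randn(2048, 2048)
t0 = time.perf_counter(); row_normalize(G);  t_row = time.perf_counter() - t0
t0 = time.perf_counter(); newton_schulz(G);  t_ns  = time.perf_counter() - t0
print(f"row-normalize: {t_row * 1e3:.2f} ms | newton-schulz(5): {t_ns * 1e3:.2f} ms")
```

The gap between one elementwise pass and roughly ten large matmuls mirrors the 1.4x vs 5-6x per-step overheads in the table above.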

Code Structure

experiments/
β”œβ”€β”€ train_loopformer_newton_muon.py  # Newton-Muon + LoopFormer
β”œβ”€β”€ train_loopformer_normuon.py      # NorMuon + LoopFormer
β”œβ”€β”€ train_loopformer_adamuon.py      # AdaMuon + LoopFormer
β”œβ”€β”€ train_loopformer_mano.py         # Mano + LoopFormer
β”œβ”€β”€ train_mor_newton_muon.py         # Newton-Muon + Mixture-of-Recursions
└── results.json                     # All experimental results
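
The training scripts themselves are not reproduced in this card, but any Muon-family run needs the standard parameter split: Muon-type updates on 2-D hidden weight matrices only, with AdamW handling embeddings, the output head, and 1-D parameters (Jordan et al., 2024). A hypothetical sketch of that split, reusing the `LoopedTransformer` class from above (`SomeMuonVariant` is a placeholder for whichever variant class a script imports):

```python
import torch

model = LoopedTransformer()

# Muon-family optimizers apply only to 2-D hidden weights; everything else
# (embeddings, output head, biases, norm gains) stays on AdamW.
hidden = [p for n, p in model.named_parameters()
          if p.ndim == 2 and "embed" not in n and "head" not in n]
other  = [p for n, p in model.named_parameters()
          if p.ndim != 2 or "embed" in n or "head" in n]

adamw = torch.optim.AdamW(other, lr=3e-4)
# muon = SomeMuonVariant(hidden, lr=0.02, momentum=0.95)  # variant-specific class
# Training loop: loss.backward(); adamw.step(); muon.step(); zero both grads.
```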

References

  • Newton-Muon: arXiv:2604.01472
  • LoopFormer: arXiv:2602.11451
  • NorMuon: arXiv:2510.05491 (code: zichongli5/NorMuon)
  • AdaMuon: arXiv:2507.11005 (code: Chongjie-Si/AdaMuon)
  • Mano: arXiv:2601.23000 (code: xie-lab-ml/Mano)
  • Jordan et al., 2024: Muon, https://kellerjordan.github.io/posts/muon/

Future Work

  • Scale to larger models (124M+) on real data (FineWeb-Edu)
  • Tune Newton-Muon hyperparameters for looped setting
  • Test on Mixture-of-Recursions with routing
  • Compare with Hyperloop Transformers architecture
  • Add µP analysis for transfer scaling