arxiv:2505.02222

Practical Efficiency of Muon for Pretraining

Published on May 4, 2025

Submitted by AK on May 6, 2025

Authors: Essential AI, Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Ashish Tanwer, Khoi Nguyen, Michael Callahan, Michael Pust, Mohit Parmar, Peter Rushton, Platon Mazarakis, Ritvik Kapila, Somanshu Singla, Tim Romanski

AI-generated summary

Muon, a second-order optimizer, retains data efficiency better than AdamW at large batch sizes while remaining computationally efficient. Combined with muP, it supports efficient hyperparameter transfer with only a modest resource overhead.

Abstract

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
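
For readers unfamiliar with Muon, the following is a minimal sketch of the idea as it is publicly described: the momentum matrix of a 2D weight is approximately orthogonalized with a Newton-Schulz iteration before it is applied as the update. This is not the paper's implementation. It assumes the classical cubic Newton-Schulz iteration rather than any tuned polynomial, uses plain Python lists rather than a tensor library, and all names and defaults (`orthogonalize`, `muon_like_step`, `lr`, `beta`, `steps`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the semi-orthogonal polar factor of a full-rank matrix g
    with the classical cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    (Assumption: the cubic form is used here for clarity only.)"""
    # The Frobenius norm bounds the largest singular value, so after scaling
    # every singular value is at most 1, inside the convergence region.
    fro = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)]
             for ra, rb in zip(x, xxt_x)]
    return x


def muon_like_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative update: accumulate momentum, orthogonalize it,
    then take a step. lr and beta are placeholder values, not tuned ones."""
    momentum = [[beta * m + gr for m, gr in zip(rm, rg)]
                for rm, rg in zip(momentum, grad)]
    update = orthogonalize(momentum)
    weight = [[w - lr * u for w, u in zip(rw, ru)]
              for rw, ru in zip(weight, update)]
    return weight, momentum


if __name__ == "__main__":
    grad = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]]
    zeros = [[0.0] * 3 for _ in range(2)]
    new_weight, new_momentum = muon_like_step(zeros, grad, zeros)
    print(new_weight)
```

Because the update is orthogonalized, all of its singular values are close to 1 regardless of the gradient's scale, so the step size is governed by `lr` alone in this sketch.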