Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm • Paper 2602.11543 • Published Feb 12