- Language Models are Few-Shot Learners
  Paper • 2005.14165 • Published • 20
- Evaluating Large Language Models Trained on Code
  Paper • 2107.03374 • Published • 10
- Training language models to follow instructions with human feedback
  Paper • 2203.02155 • Published • 24
- GPT-4 Technical Report
  Paper • 2303.08774 • Published • 7
Collections
Collections including paper arxiv:2502.16982
- CAME: Confidence-guided Adaptive Memory Efficient Optimization
  Paper • 2307.02047 • Published • 2
- Practical Efficiency of Muon for Pretraining
  Paper • 2505.02222 • Published • 41
- AdaMuon: Adaptive Muon Optimizer
  Paper • 2507.11005 • Published • 2
- Muon is Scalable for LLM Training
  Paper • 2502.16982 • Published • 12
- Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch
  Paper • 2501.18512 • Published • 29
- DiLoCo: Distributed Low-Communication Training of Language Models
  Paper • 2311.08105 • Published • 16
- Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo
  Paper • 2503.09799 • Published • 15
- Muon is Scalable for LLM Training
  Paper • 2502.16982 • Published • 12
- Textbooks Are All You Need
  Paper • 2306.11644 • Published • 154
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Paper • 2501.12948 • Published • 447
- Muon is Scalable for LLM Training
  Paper • 2502.16982 • Published • 12
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
  Paper • 2406.17557 • Published • 102