Collections including paper arxiv:2001.08361

- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- Analogy Generation by Prompting Large Language Models: A Case Study of InstructGPT
  Paper • 2210.04186 • Published

- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10
- Scaling Laws for Autoregressive Generative Modeling
  Paper • 2010.14701 • Published • 1
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- A Survey on Data Selection for Language Models
  Paper • 2402.16827 • Published • 4

- STaR: Bootstrapping Reasoning With Reasoning
  Paper • 2203.14465 • Published • 9
- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10
- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 108
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Paper • 2501.12948 • Published • 447

- Recurrent Neural Network Regularization
  Paper • 1409.2329 • Published • 1
- Pointer Networks
  Paper • 1506.03134 • Published • 1
- Order Matters: Sequence to sequence for sets
  Paper • 1511.06391 • Published • 1
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
  Paper • 1811.06965 • Published • 1

- Neural Machine Translation by Jointly Learning to Align and Translate
  Paper • 1409.0473 • Published • 7
- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  Paper • 1810.04805 • Published • 26
- Hierarchical Reasoning Model
  Paper • 2506.21734 • Published • 50

- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  Paper • 2206.10789 • Published • 4
- Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
  Paper • 2401.00448 • Published • 30
- Training Compute-Optimal Large Language Models
  Paper • 2203.15556 • Published • 11
- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10

- Attention Is All You Need
  Paper • 1706.03762 • Published • 120
- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10
- RoFormer: Enhanced Transformer with Rotary Position Embedding
  Paper • 2104.09864 • Published • 17
- LoRA Learns Less and Forgets Less
  Paper • 2405.09673 • Published • 91

- Scaling Laws for Neural Language Models
  Paper • 2001.08361 • Published • 10
- An Empirical Model of Large-Batch Training
  Paper • 1812.06162 • Published • 3
- Measuring the Effects of Data Parallelism on Neural Network Training
  Paper • 1811.03600 • Published • 2
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
  Paper • 1804.04235 • Published • 2