Papers
Beyond Language Models: Byte Models are Digital World Simulators • arXiv:2402.19155 • 53 upvotes
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models • arXiv:2402.19427 • 57 upvotes
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks • arXiv:2403.00522 • 46 upvotes
Resonance RoPE: Improving Context Length Generalization of Large Language Models • arXiv:2403.00071 • 24 upvotes
Learning and Leveraging World Models in Visual Representation Learning • arXiv:2403.00504 • 33 upvotes
AtP*: An efficient and scalable method for localizing LLM behaviour to components • arXiv:2403.00745 • 14 upvotes
Learning to Decode Collaboratively with Multiple Language Models • arXiv:2403.03870 • 21 upvotes
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect • arXiv:2403.03853 • 66 upvotes
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection • arXiv:2403.03507 • 190 upvotes
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment • arXiv:2403.05135 • 45 upvotes
DeepSeek-VL: Towards Real-World Vision-Language Understanding • arXiv:2403.05525 • 49 upvotes
Stealing Part of a Production Language Model • arXiv:2403.06634 • 91 upvotes
MoAI: Mixture of All Intelligence for Large Language and Vision Models • arXiv:2403.07508 • 77 upvotes
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM • arXiv:2403.07816 • 45 upvotes
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings • arXiv:2403.07750 • 23 upvotes
Chronos: Learning the Language of Time Series • arXiv:2403.07815 • 49 upvotes
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training • arXiv:2403.09611 • 129 upvotes
Veagle: Advancements in Multimodal Representation Learning • arXiv:2403.08773 • 10 upvotes
Qwen Technical Report • arXiv:2309.16609 • 38 upvotes
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities • arXiv:2308.12966 • 11 upvotes
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer • arXiv:2403.10301 • 54 upvotes
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models • arXiv:2403.13372 • 183 upvotes
The Unreasonable Ineffectiveness of the Deeper Layers • arXiv:2403.17887 • 82 upvotes
InternLM2 Technical Report • arXiv:2403.17297 • 34 upvotes
Jamba: A Hybrid Transformer-Mamba Language Model • arXiv:2403.19887 • 112 upvotes
Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs • arXiv:2403.20041 • 34 upvotes
Localizing Paragraph Memorization in Language Models • arXiv:2403.19851 • 15 upvotes
DiJiang: Efficient Large Language Models through Compact Kernelization • arXiv:2403.19928 • 12 upvotes
Long-form factuality in large language models • arXiv:2403.18802 • 26 upvotes
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models • arXiv:2404.02258 • 107 upvotes
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention • arXiv:2404.07143 • 111 upvotes
Pre-training Small Base LMs with Fewer Tokens • arXiv:2404.08634 • 36 upvotes
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length • arXiv:2404.08801 • 66 upvotes
SnapKV: LLM Knows What You are Looking for Before Generation • arXiv:2404.14469 • 27 upvotes
FlowMind: Automatic Workflow Generation with LLMs • arXiv:2404.13050 • 34 upvotes
Qwen2.5 Technical Report • arXiv:2412.15115 • 377 upvotes
YuLan-Mini: An Open Data-efficient Language Model • arXiv:2412.17743 • 66 upvotes
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs • arXiv:2412.18925 • 107 upvotes
Token-Budget-Aware LLM Reasoning • arXiv:2412.18547 • 46 upvotes
DeepSeek-V3 Technical Report • arXiv:2412.19437 • 82 upvotes
MiniMax-01: Scaling Foundation Models with Lightning Attention • arXiv:2501.08313 • 302 upvotes
Evolving Deeper LLM Thinking • arXiv:2501.09891 • 115 upvotes
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model • arXiv:2502.02737 • 258 upvotes
Demystifying Long Chain-of-Thought Reasoning in LLMs • arXiv:2502.03373 • 58 upvotes
LIMO: Less is More for Reasoning • arXiv:2502.03387 • 62 upvotes
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling • arXiv:2502.06703 • 153 upvotes
The Differences Between Direct Alignment Algorithms are a Blur • arXiv:2502.01237 • 113 upvotes
s1: Simple test-time scaling • arXiv:2501.19393 • 125 upvotes
Qwen2.5-1M Technical Report • arXiv:2501.15383 • 72 upvotes
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning • arXiv:2501.12948 • 447 upvotes
Kimi k1.5: Scaling Reinforcement Learning with LLMs • arXiv:2501.12599 • 128 upvotes
Transformers without Normalization • arXiv:2503.10622 • 172 upvotes
A Survey on Post-training of Large Language Models • arXiv:2503.06072 • 11 upvotes
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models • arXiv:2503.16419 • 77 upvotes
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't • arXiv:2503.16219 • 52 upvotes
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax • arXiv:2504.20966 • 31 upvotes