- VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
  Paper • 2602.10693 • Published • 220
- Reinforced Attention Learning
  Paper • 2602.04884 • Published • 29
- Learning to Reason in 13 Parameters
  Paper • 2602.04118 • Published • 6
- LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
  Paper • 2405.17604 • Published • 3
Collections
Collections including paper arxiv:2412.06464
- Provable Benefits of In-Tool Learning for Large Language Models
  Paper • 2508.20755 • Published • 11
- CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
  Paper • 2508.15868 • Published • 3
- Gated Delta Networks: Improving Mamba2 with Delta Rule
  Paper • 2412.06464 • Published • 17
- Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
  Paper • 2401.02994 • Published • 52
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 59
- Repeat After Me: Transformers are Better than State Space Models at Copying
  Paper • 2402.01032 • Published • 24
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25
- Higher-order Linear Attention
  Paper • 2510.27258 • Published • 15
- RWKV-7 "Goose" with Expressive Dynamic State Evolution
  Paper • 2503.14456 • Published • 154
- xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
  Paper • 2503.13427 • Published • 3
- MoM: Linear Sequence Modeling with Mixture-of-Memories
  Paper • 2502.13685 • Published • 36
- Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
  Paper • 2402.04248 • Published • 32
- Scavenging Hyena: Distilling Transformers into Long Convolution Models
  Paper • 2401.17574 • Published • 17
- Scalable Autoregressive Image Generation with Mamba
  Paper • 2408.12245 • Published • 26
- Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
  Paper • 2408.12570 • Published • 32