-
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Paper • 2602.10693 • Published • 220 -
Reinforced Attention Learning
Paper • 2602.04884 • Published • 29 -
Learning to Reason in 13 Parameters
Paper • 2602.04118 • Published • 6 -
LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
Paper • 2405.17604 • Published • 3
Collections
Collections including paper arxiv:2503.10622
-
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Paper • 1502.01852 • Published • 1 -
Deep Residual Learning for Image Recognition
Paper • 1512.03385 • Published • 16 -
Focal Loss for Dense Object Detection
Paper • 1708.02002 • Published -
Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
Paper • 2409.20537 • Published • 13
-
You Do Not Fully Utilize Transformer's Representation Capacity
Paper • 2502.09245 • Published • 37 -
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
Paper • 2502.15007 • Published • 175 -
Transformers without Normalization
Paper • 2503.10622 • Published • 172 -
Forgetting Transformer: Softmax Attention with a Forget Gate
Paper • 2503.02130 • Published • 32
-
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Paper • 2503.03601 • Published • 233 -
Transformers without Normalization
Paper • 2503.10622 • Published • 172 -
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper • 2503.14456 • Published • 154 -
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper • 2503.11647 • Published • 148
-
A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
Paper • 2601.22966 • Published -
STEM: Scaling Transformers with Embedding Modules
Paper • 2601.10639 • Published • 2 -
Deep Delta Learning
Paper • 2601.00417 • Published • 34 -
mHC: Manifold-Constrained Hyper-Connections
Paper • 2512.24880 • Published • 322