Collections
Discover the best community collections!
Collections including paper arxiv:2502.07864

- The Art of Scaling Reinforcement Learning Compute for LLMs
  Paper • 2510.13786 • Published • 33
- Attention Is All You Need for KV Cache in Diffusion LLMs
  Paper • 2510.14973 • Published • 42
- BitNet Distillation
  Paper • 2510.13998 • Published • 59
- GigaBrain-0: A World Model-Powered Vision-Language-Action Model
  Paper • 2510.19430 • Published • 53

- CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
  Paper • 2502.08639 • Published • 43
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69
- Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
  Paper • 2502.07737 • Published • 9
- Enhance-A-Video: Better Generated Video for Free
  Paper • 2502.07508 • Published • 21

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 32
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 110
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 27
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 107

- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 108
- Causal Diffusion Transformers for Generative Modeling
  Paper • 2412.12095 • Published • 23
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69

- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69
- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
  Paper • 2503.02495 • Published • 9
- BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
  Paper • 2504.18415 • Published • 50

- deepseek-ai/DeepSeek-V3-Base
  Updated • 16.9k • 1.69k
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69
- Qwen2.5 Bakeneko 32b Instruct Awq
  ⚡ 2 • Generate AI-powered Japanese assistant replies
- Deepseek R1 Distill Qwen2.5 Bakeneko 32b Awq
  ⚡ 3 • Chat with an AI to get detailed text responses

- Selective Attention Improves Transformer
  Paper • 2410.02703 • Published • 25
- Differential Transformer
  Paper • 2410.05258 • Published • 182
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
  Paper • 2410.05076 • Published • 8
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
  Paper • 2410.13276 • Published • 29