Collections
Discover the best community collections!
Collections including paper arxiv:2502.07864

- The Art of Scaling Reinforcement Learning Compute for LLMs
  Paper • 2510.13786 • Published • 33
- Attention Is All You Need for KV Cache in Diffusion LLMs
  Paper • 2510.14973 • Published • 42
- BitNet Distillation
  Paper • 2510.13998 • Published • 59
- GigaBrain-0: A World Model-Powered Vision-Language-Action Model
  Paper • 2510.19430 • Published • 53

- CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
  Paper • 2502.08639 • Published • 43
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69
- Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
  Paper • 2502.07737 • Published • 9
- Enhance-A-Video: Better Generated Video for Free
  Paper • 2502.07508 • Published • 21

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 32
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 110
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 27
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 107

- Byte Latent Transformer: Patches Scale Better Than Tokens
  Paper • 2412.09871 • Published • 108
- Causal Diffusion Transformers for Generative Modeling
  Paper • 2412.12095 • Published • 23
- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69

- Tensor Product Attention Is All You Need
  Paper • 2501.06425 • Published • 90
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69
- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
  Paper • 2503.02495 • Published • 9
- BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
  Paper • 2504.18415 • Published • 50

- deepseek-ai/DeepSeek-V3-Base
  Updated • 16.9k • 1.69k
- TransMLA: Multi-head Latent Attention Is All You Need
  Paper • 2502.07864 • Published • 69
- Qwen2.5 Bakeneko 32b Instruct Awq
  ⚡ 2 • Generate AI-powered Japanese assistant replies
- Deepseek R1 Distill Qwen2.5 Bakeneko 32b Awq
  ⚡ 3 • Chat with an AI to get detailed text responses

- Selective Attention Improves Transformer
  Paper • 2410.02703 • Published • 25
- Differential Transformer
  Paper • 2410.05258 • Published • 182
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
  Paper • 2410.05076 • Published • 8
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
  Paper • 2410.13276 • Published • 29