Collections including paper arxiv:2311.01282

-
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
Paper • 2602.24286 • Published • 98
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
Paper • 2505.22758 • Published • 1
Liger Kernel: Efficient Triton Kernels for LLM Training
Paper • 2410.10989 • Published • 3
FlashDecoding++: Faster Large Language Model Inference on GPUs
Paper • 2311.01282 • Published • 38
-
FlashDecoding++: Faster Large Language Model Inference on GPUs
Paper • 2311.01282 • Published • 38
Co-training and Co-distillation for Quality Improvement and Compression of Language Models
Paper • 2311.02849 • Published • 8
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Paper • 2311.04934 • Published • 32
Exponentially Faster Language Modelling
Paper • 2311.10770 • Published • 119
-
FlashDecoding++: Faster Large Language Model Inference on GPUs
Paper • 2311.01282 • Published • 38
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Paper • 2311.03285 • Published • 30
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Paper • 2311.06243 • Published • 21
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Paper • 2311.05908 • Published • 14
-
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks
Paper • 2310.19909 • Published • 21
Memory Augmented Language Models through Mixture of Word Experts
Paper • 2311.10768 • Published • 19
FlashDecoding++: Faster Large Language Model Inference on GPUs
Paper • 2311.01282 • Published • 38
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Paper • 2311.04934 • Published • 32
-
Matryoshka Diffusion Models
Paper • 2310.15111 • Published • 45
Data Filtering Networks
Paper • 2309.17425 • Published • 6
FlashDecoding++: Faster Large Language Model Inference on GPUs
Paper • 2311.01282 • Published • 38
E3 TTS: Easy End-to-End Diffusion-based Text to Speech
Paper • 2311.00945 • Published • 16
-
FlashDecoding++: Faster Large Language Model Inference on GPUs
Paper • 2311.01282 • Published • 38
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Paper • 2401.02669 • Published • 17
Speculative Streaming: Fast LLM Inference without Auxiliary Models
Paper • 2402.11131 • Published • 42