SageAttention 3 Blackwell FP4 - RTX 5090 (Windows)
Pre-built SageAttention 3 FP4 attention kernels for the NVIDIA RTX 5090 (sm_120 Blackwell) on Windows. Uses CUTLASS FP4 TMA kernels for microscaling FP4 quantization, designed for long-sequence attention workloads.
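To make "microscaling FP4 quantization" concrete, here is a rough pure-Python sketch of the idea: each block of values shares one scale factor, and each value is rounded to the nearest FP4 (E2M1) level. The real kernels do this inside CUTLASS on the GPU; the block size, scale format, and helper names here are simplified assumptions for illustration only.

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Quantize one block of floats: shared absmax scale + nearest E2M1 code."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block absmax onto E2M1 max (6.0)
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        level = min(E2M1_LEVELS, key=lambda lv: abs(lv - mag))
        codes.append(-level if x < 0 else level)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.1, -0.8, 2.5, -6.0, 0.0, 1.2, -0.3, 4.4]
scale, codes = quantize_block_fp4(block)
recon = dequantize_block(scale, codes)
```

Each element thus costs 4 bits plus an amortized share of the per-block scale, which is where the bandwidth savings over 16-bit bfloat16 come from.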
Wheel
```
sageattn3-1.0.0+cu131torch2.12.mengqin-cp312-cp312-win_amd64.whl
```
Requirements
| Component | Version |
|---|---|
| GPU | NVIDIA RTX 50 series only (sm_120 Blackwell) |
| OS | Windows 10/11 x64 |
| Python | 3.12 |
| PyTorch | 2.12.0 nightly cu128 |
| Triton | 3.6.0 |
| Input dtype | bfloat16 only (fp16 not supported) |
Installation
```
pip install sageattn3-1.0.0+cu131torch2.12.mengqin-cp312-cp312-win_amd64.whl --no-deps
```
Important: always pass `--no-deps` to prevent pip from overwriting your PyTorch installation.
Verify
```python
import torch
from sageattn3 import sageattn3_blackwell

q = torch.randn(1, 24, 128, 128, device="cuda", dtype=torch.bfloat16)
k = q.clone()
v = q.clone()
o = sageattn3_blackwell(q, k, v)
print("SA3 OK, output shape:", o.shape)
```
Usage
```python
from sageattn3 import sageattn3_blackwell

# q, k, v: (batch, heads, seq_len, head_dim) in bfloat16
# Batched attention only (uniform sequence lengths)
output = sageattn3_blackwell(q, k, v, per_block_mean=False)
```
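Because the batched API assumes uniform-length bfloat16 inputs and only pays off at long sequences (see the next section), callers typically gate on those conditions before dispatching. A hypothetical dispatch helper, not part of the sageattn3 package; the 4096-token threshold is an assumption you should tune for your workload:

```python
def pick_attention_backend(seq_lens, dtype="bfloat16", sa3_min_seq=4096):
    """Return "sa3" or "sa2" given per-sequence lengths and input dtype.

    Illustrative heuristic only: SA3 for long, uniform-length bf16 batches,
    SA2 otherwise.
    """
    uniform = len(set(seq_lens)) == 1
    if dtype != "bfloat16" or not uniform:
        return "sa2"  # SA3 batched API needs uniform bf16 inputs
    return "sa3" if seq_lens[0] >= sa3_min_seq else "sa2"
```

For example, `pick_attention_backend([32768] * 4)` selects SA3, while a mixed-length batch or a short-window workload falls back to SA2.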
When to Use SA3
SA3 FP4 kernels provide speedup on long-sequence attention workloads:
| Use Case | Sequence Length | SA3 Benefit |
|---|---|---|
| Video generation (WanVideo) | 10k-100k+ tokens | Significant speedup |
| Image upscaling (SeedVR2) | 144-472 tokens/window | No speedup (use SA2) |
| LLM inference | 4k+ tokens | Potential speedup |
SA3 quantizes Q/K to FP4, reducing memory bandwidth 4x compared to BF16. This advantage shows at long sequences where attention is memory-bandwidth-bound. At short sequences, the FP4 quantization overhead (via Triton kernel) exceeds the bandwidth savings.
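A back-of-envelope check of the 4x figure (ignoring the small per-block scale-factor overhead of microscaling): bytes moved for Q and K at a given sequence length. The head count and head dimension below are assumptions for illustration.

```python
def qk_bytes(seq_len, heads=24, head_dim=128, bits_per_elem=16):
    """Total bytes for Q and K tensors at the given element width."""
    return 2 * heads * seq_len * head_dim * bits_per_elem // 8  # 2 = Q and K

seq = 50_000                              # long video-generation sequence
bf16 = qk_bytes(seq, bits_per_elem=16)    # bfloat16 traffic
fp4 = qk_bytes(seq, bits_per_elem=4)      # FP4 traffic
ratio = bf16 / fp4                        # bandwidth reduction factor
```

At 50k tokens this is hundreds of megabytes of Q/K traffic per layer in BF16, which is why the savings dominate there, while at a few hundred tokens the absolute saving is too small to cover the quantization kernel's launch and compute overhead.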
MSVC Alignment Fix
This wheel is built from the mengqin/SageAttention branch which fixes a critical MSVC compilation issue (PR #323):
Problem: CUTLASS FP4 TMA kernels require 128-byte aligned kernel parameters. Without `/Zc:__cplusplus`, MSVC reports `__cplusplus` as 199711 (C++98), so the CUTLASS alignment macros (`CUTE_GRID_CONSTANT`) are disabled; the compiled kernels then carry misaligned TMA descriptors and crash at runtime (`CUDA error: misaligned address`). With `/Zc:__cplusplus` enabled, MSVC correctly reports C++17, but it cannot pass 128-byte aligned parameters by value, producing compile error C2719.
Fix: mengqin's branch modifies kernel parameter passing to use pointers instead of by-value 128-byte aligned structs, compatible with MSVC limitations.
Build from Source
```bat
:: Use mengqin branch (NOT thu-ml main)
git clone https://github.com/mengqin/SageAttention.git
cd SageAttention\sageattention3_blackwell

:: CRITICAL: Open a completely fresh CMD window.
:: vcvars64 adds ~2000 chars to PATH; reusing a CMD session causes overflow.
call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"

:: Activate your venv AFTER vcvars
call path\to\venv\Scripts\activate.bat

set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1
set TORCH_CUDA_ARCH_LIST=12.0
set DISTUTILS_USE_SDK=1
set MAX_JOBS=4

:: Subst to avoid spaces in the path
subst Z: "path\to\sageattention3_blackwell"
Z:
pip install . --no-build-isolation
subst Z: /d
```
Build takes 15-30 minutes due to CUTLASS FP4 template instantiation (2 CUDA files: api.cu + fp4_quantization_4d.cu).
Known Issues
- bfloat16 only: fp16 input crashes the Triton quantization kernel. Always cast to bfloat16 before calling SA3.
- CUDA_LAUNCH_BLOCKING: do not set `CUDA_LAUNCH_BLOCKING=1` when using SA3; it causes TMA descriptor timing issues that lead to misaligned-address errors.
- Uniform sequence lengths: the SA3 batched API requires all sequences in a batch to have the same length. For variable-length workloads, use a wrapper that pads or falls back to SA2.
- PyTorch ABI: built against PyTorch 2.12 nightly. Other versions may cause `ImportError: DLL load failed`.
- VRAM: SA3 uses ~2-3 GB more VRAM than SA2 for FP4 quantization buffers.
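The constraints above can be caught before a kernel launch rather than as an opaque CUDA error. A minimal pre-flight guard, illustrative only and not part of the sageattn3 API:

```python
import os

def sa3_preflight(dtype: str, seq_lens) -> None:
    """Raise early on the known SA3 pitfalls listed above.

    dtype: torch dtype name as a string, e.g. "bfloat16".
    seq_lens: per-sequence lengths of the batch.
    """
    if os.environ.get("CUDA_LAUNCH_BLOCKING") == "1":
        raise RuntimeError("unset CUDA_LAUNCH_BLOCKING before using SA3")
    if dtype != "bfloat16":
        raise TypeError("SA3 accepts bfloat16 only; cast fp16 inputs first")
    if len(set(seq_lens)) != 1:
        raise ValueError("batched SA3 requires uniform sequence lengths")
```

Calling this once per batch is cheap and turns each known failure mode into a descriptive Python exception.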
References
- thu-ml/SageAttention β Official repository
- mengqin PR #323 β MSVC alignment fix
- NVIDIA/CUTLASS #2906 β SM120 NVF4 TMA alignment analysis
- SageAttention3 paper β NeurIPS 2025 Spotlight
License
Apache 2.0