SageAttention 3 Blackwell FP4 - RTX 5090 (Windows)
Pre-built SageAttention 3 FP4 attention kernels for the NVIDIA RTX 5090 (sm_120 Blackwell) on Windows. Uses CUTLASS FP4 TMA kernels for microscaling FP4 quantization, designed for long-sequence attention workloads.
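To make "microscaling FP4 quantization" concrete, here is a rough pure-Python sketch of the idea: each block of values shares one scale factor, and each value is rounded to the nearest FP4 (E2M1) level. The real kernels do this inside CUTLASS on the GPU; the block size, scale format, and helper names here are simplified assumptions for illustration only.

```python
# Representable magnitudes of FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit)
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block):
    """Quantize one block of floats: shared absmax scale + nearest E2M1 code."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block absmax onto E2M1 max (6.0)
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        level = min(E2M1_LEVELS, key=lambda lv: abs(lv - mag))
        codes.append(-level if x < 0 else level)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

block = [0.1, -0.8, 2.5, -6.0, 0.0, 1.2, -0.3, 4.4]
scale, codes = quantize_block_fp4(block)
recon = dequantize_block(scale, codes)
```

Each element thus costs 4 bits plus an amortized share of the per-block scale, which is where the bandwidth savings over 16-bit bfloat16 come from.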
Wheel
```
sageattn3-1.0.0+cu131torch2.12.mengqin-cp312-cp312-win_amd64.whl
```
Requirements
| Component | Version |
|---|---|
| GPU | NVIDIA RTX 50 series only (sm_120 Blackwell) |
| OS | Windows 10/11 x64 |
| Python | 3.12 |
| PyTorch | 2.12.0 nightly cu128 |
| Triton | 3.6.0 |
| Input dtype | bfloat16 only (fp16 not supported) |
Installation
```
pip install sageattn3-1.0.0+cu131torch2.12.mengqin-cp312-cp312-win_amd64.whl --no-deps
```
Important: always pass `--no-deps` to prevent pip from overwriting your PyTorch installation.
Verify
```python
import torch
from sageattn3 import sageattn3_blackwell

q = torch.randn(1, 24, 128, 128, device="cuda", dtype=torch.bfloat16)
k = q.clone()
v = q.clone()
o = sageattn3_blackwell(q, k, v)
print("SA3 OK, output shape:", o.shape)
```
Usage
```python
from sageattn3 import sageattn3_blackwell

# q, k, v: (batch, heads, seq_len, head_dim) in bfloat16
# Batched attention only (uniform sequence lengths)
output = sageattn3_blackwell(q, k, v, per_block_mean=False)
```
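Because the batched API assumes uniform-length bfloat16 inputs and only pays off at long sequences (see the next section), callers typically gate on those conditions before dispatching. A hypothetical dispatch helper, not part of the sageattn3 package; the 4096-token threshold is an assumption you should tune for your workload:

```python
def pick_attention_backend(seq_lens, dtype="bfloat16", sa3_min_seq=4096):
    """Return "sa3" or "sa2" given per-sequence lengths and input dtype.

    Illustrative heuristic only: SA3 for long, uniform-length bf16 batches,
    SA2 otherwise.
    """
    uniform = len(set(seq_lens)) == 1
    if dtype != "bfloat16" or not uniform:
        return "sa2"  # SA3 batched API needs uniform bf16 inputs
    return "sa3" if seq_lens[0] >= sa3_min_seq else "sa2"
```

For example, `pick_attention_backend([32768] * 4)` selects SA3, while a mixed-length batch or a short-window workload falls back to SA2.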
When to Use SA3
SA3 FP4 kernels provide speedup on long-sequence attention workloads:
| Use Case | Sequence Length | SA3 Benefit |
|---|---|---|
| Video generation (WanVideo) | 10k-100k+ tokens | Significant speedup |
| Image upscaling (SeedVR2) | 144-472 tokens/window | No speedup (use SA2) |
| LLM inference | 4k+ tokens | Potential speedup |
SA3 quantizes Q/K to FP4, reducing memory bandwidth 4x compared to BF16. This advantage shows at long sequences where attention is memory-bandwidth-bound. At short sequences, the FP4 quantization overhead (via Triton kernel) exceeds the bandwidth savings.
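A back-of-envelope check of the 4x figure (ignoring the small per-block scale-factor overhead of microscaling): bytes moved for Q and K at a given sequence length. The head count and head dimension below are assumptions for illustration.

```python
def qk_bytes(seq_len, heads=24, head_dim=128, bits_per_elem=16):
    """Total bytes for Q and K tensors at the given element width."""
    return 2 * heads * seq_len * head_dim * bits_per_elem // 8  # 2 = Q and K

seq = 50_000                              # long video-generation sequence
bf16 = qk_bytes(seq, bits_per_elem=16)    # bfloat16 traffic
fp4 = qk_bytes(seq, bits_per_elem=4)      # FP4 traffic
ratio = bf16 / fp4                        # bandwidth reduction factor
```

At 50k tokens this is hundreds of megabytes of Q/K traffic per layer in BF16, which is why the savings dominate there, while at a few hundred tokens the absolute saving is too small to cover the quantization kernel's launch and compute overhead.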
MSVC Alignment Fix
This wheel is built from the mengqin/SageAttention branch which fixes a critical MSVC compilation issue (PR #323):
Problem: CUTLASS FP4 TMA kernels require 128-byte aligned kernel parameters. Without `/Zc:__cplusplus`, MSVC reports `__cplusplus` as 199711 (C++98), so the CUTLASS alignment macros (`CUTE_GRID_CONSTANT`) are disabled; the compiled kernels then carry misaligned TMA descriptors and crash at runtime (`CUDA error: misaligned address`). With `/Zc:__cplusplus` enabled, MSVC correctly reports C++17, but it cannot pass 128-byte aligned parameters by value, producing compile error C2719.
Fix: mengqin's branch modifies kernel parameter passing to use pointers instead of by-value 128-byte aligned structs, compatible with MSVC limitations.
Build from Source
```bat
:: Use mengqin branch (NOT thu-ml main)
git clone https://github.com/mengqin/SageAttention.git
cd SageAttention\sageattention3_blackwell

:: CRITICAL: Open a completely fresh CMD window.
:: vcvars64 adds ~2000 chars to PATH; reusing a CMD session causes overflow.
call "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"

:: Activate your venv AFTER vcvars
call path\to\venv\Scripts\activate.bat

set CUDA_HOME=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1
set TORCH_CUDA_ARCH_LIST=12.0
set DISTUTILS_USE_SDK=1
set MAX_JOBS=4

:: Subst to avoid spaces in the path
subst Z: "path\to\sageattention3_blackwell"
Z:
pip install . --no-build-isolation
subst Z: /d
```
Build takes 15-30 minutes due to CUTLASS FP4 template instantiation (2 CUDA files: api.cu + fp4_quantization_4d.cu).
Known Issues
- bfloat16 only: fp16 input crashes the Triton quantization kernel. Always cast to bfloat16 before calling SA3.
- CUDA_LAUNCH_BLOCKING: do not set `CUDA_LAUNCH_BLOCKING=1` when using SA3; it causes TMA descriptor timing issues that lead to misaligned-address errors.
- Uniform sequence lengths: the SA3 batched API requires all sequences in a batch to have the same length. For variable-length workloads, use a wrapper that pads or falls back to SA2.
- PyTorch ABI: built against PyTorch 2.12 nightly. Other versions may cause `ImportError: DLL load failed`.
- VRAM: SA3 uses ~2-3 GB more VRAM than SA2 for FP4 quantization buffers.
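The constraints above can be caught before a kernel launch rather than as an opaque CUDA error. A minimal pre-flight guard, illustrative only and not part of the sageattn3 API:

```python
import os

def sa3_preflight(dtype: str, seq_lens) -> None:
    """Raise early on the known SA3 pitfalls listed above.

    dtype: torch dtype name as a string, e.g. "bfloat16".
    seq_lens: per-sequence lengths of the batch.
    """
    if os.environ.get("CUDA_LAUNCH_BLOCKING") == "1":
        raise RuntimeError("unset CUDA_LAUNCH_BLOCKING before using SA3")
    if dtype != "bfloat16":
        raise TypeError("SA3 accepts bfloat16 only; cast fp16 inputs first")
    if len(set(seq_lens)) != 1:
        raise ValueError("batched SA3 requires uniform sequence lengths")
```

Calling this once per batch is cheap and turns each known failure mode into a descriptive Python exception.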
References
- thu-ml/SageAttention β Official repository
- mengqin PR #323 β MSVC alignment fix
- NVIDIA/CUTLASS #2906 β SM120 NVF4 TMA alignment analysis
- SageAttention3 paper β NeurIPS 2025 Spotlight
License
Apache 2.0