Size differences...
I found your compiled wheel and another, both same name but big difference in size. Why is yours 100mb bigger?
I'm just curious if there are extras in your version. "flash_attn-2.8.3+cu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl"
See links below.
Reply to Discussion #4: Size Differences
Hi, thanks for the question!
Actually, I think you may have the files mixed up: my build is the smaller one (228 MB), not the larger one (350 MB). The other wheel you're comparing against is approximately 122 MB larger than mine, not the other way around.
I did a thorough binary-level analysis of both wheels to give you a definitive answer on exactly why the size difference exists.
TL;DR
The ~122 MB size difference is caused entirely by one extra CUDA GPU architecture (sm_110) compiled into the other wheel. There is no functional difference on any desktop or server GPU.
Detailed Analysis
1. File Sizes
| | My Build | The Other Build |
|---|---|---|
| Compressed (.whl) | 228.48 MB | 350.21 MB |
| Difference | baseline | +121.73 MB |
Both wheels contain a single compiled binary (flash_attn_2_cuda.cp312-win_amd64.pyd) that accounts for 99.8%+ of the total size.
| | My Build | The Other Build |
|---|---|---|
| .pyd uncompressed | 919.60 MB | 1,239.79 MB |
| .pyd compressed | 228.12 MB | 349.76 MB |
| Difference (compressed) | baseline | +121.64 MB |
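Since a wheel is just a zip archive, the per-member sizes above can be read directly from the archive's central directory without extracting anything. Here is a minimal sketch of how to do that; the function name `wheel_member_sizes` is mine, not part of any library.

```python
import zipfile

def wheel_member_sizes(path):
    """Return {member_name: (uncompressed_bytes, compressed_bytes)} for a wheel.

    Wheels are ordinary zip files, so ZipFile exposes each member's
    stored size fields (file_size / compress_size) directly.
    """
    with zipfile.ZipFile(path) as whl:
        return {
            info.filename: (info.file_size, info.compress_size)
            for info in whl.infolist()
        }
```

Sorting the result by uncompressed size immediately shows that the single `flash_attn_2_cuda.cp312-win_amd64.pyd` member dominates both wheels.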
2. Root Cause: CUDA Architecture Targets
I used cuobjdump --list-elf to inspect the compiled CUDA kernels (cubins) embedded in each .pyd binary. Here are the results:
My Build: 4 architectures, 288 total ELF cubins

| Architecture | Kernel Count | Target GPUs |
|---|---|---|
| sm_80 | 72 | Ampere (A100, RTX 3060–3090) |
| sm_90 | 72 | Hopper (H100, H200) |
| sm_100 | 72 | Blackwell Datacenter (B200, GB200) |
| sm_120 | 72 | Blackwell Consumer (RTX 5090, 5080, etc.) |
| **Total** | **288** | |
The Other Build: 5 architectures, 360 total ELF cubins

| Architecture | Kernel Count | Target GPUs |
|---|---|---|
| sm_80 | 72 | Ampere (A100, RTX 3060–3090) |
| sm_90 | 72 | Hopper (H100, H200) |
| sm_100 | 72 | Blackwell Datacenter (B200, GB200) |
| sm_110 | 72 | NVIDIA Thor (automotive SoC, not a desktop/server GPU) |
| sm_120 | 72 | Blackwell Consumer (RTX 5090, 5080, etc.) |
| **Total** | **360** | |
The entire 122 MB difference comes from the 72 extra sm_110 kernels in the other build. sm_110 corresponds to the NVIDIA Thor automotive SoC (used in self-driving cars), not a desktop or server GPU. These kernels are completely irrelevant for any PC or datacenter user.
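The per-architecture counts above can be tallied straight from the `cuobjdump --list-elf` text, since each embedded cubin's name carries its SM target. A small sketch (the function name and the exact line format in the test are my illustration; cuobjdump's formatting can vary slightly by CUDA version):

```python
import re
from collections import Counter

def count_cubins_by_arch(listing: str) -> Counter:
    """Tally embedded ELF cubins per SM architecture from
    `cuobjdump --list-elf` output, keyed on the `.sm_NN.cubin` suffix."""
    return Counter(re.findall(r"\.(sm_\d+)\.cubin", listing))
```

As a sanity check on the numbers: the other build's `.pyd` is 1,239.79 − 919.60 ≈ 320 MB larger uncompressed, which is exactly one architecture's worth of cubins (≈ 920 MB / 4 per arch in my build, plus per-arch overhead), and it deflates to the ≈122 MB wheel-size gap.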
3. Python File Differences
Beyond the binary, I also compared every Python file inside both wheels:
| Category | Files Changed | Impact on Users |
|---|---|---|
| AMD ROCm Triton backend | 4 files | ❌ Irrelevant on Windows/NVIDIA |
| CuteDSL (FA3 experimental) | 18 modified + 14 extra files | ❌ Pure Python, not compiled into the FA2 .pyd |
| Hopper FA3 interface | 2 files | ❌ FA3 code path, does not affect FA2 |
| Tests / Benchmarks / Build scripts | 3 files | ❌ Not runtime code |
| Core Runtime (flash_attn_interface.py) | 1 file | See below |
The sole core runtime file (flash_attn/flash_attn_interface.py) has only 4 minor changes, all of which are AMD ROCm (HIP) conditional branches. On NVIDIA GPUs, the exact same code path is executed in both builds. The changes are:

- 3 changes: Added `if torch.version.hip:` branches for AMD-specific tensor shape handling; on NVIDIA, the original `else` path runs identically.
- 1 change: The `cache_seqlens` shape reference changed from `k_cache.shape[0]` to `q.shape[0]`; in practice these values are always equal.
None of these changes affect functionality on NVIDIA GPUs.
The 14 extra Python files in the other build are all in flash_attn/cute/ (CuteDSL experimental code) and hopper/ (test files). These are auxiliary files related to FlashAttention-3 development and are not used by the FlashAttention-2 CUDA binary.
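The Python-file comparison above amounts to diffing the `.py` member lists of the two archives. A minimal sketch of that comparison (the helper name `diff_wheel_py_files` is mine):

```python
import zipfile

def diff_wheel_py_files(path_a, path_b):
    """Return (only_in_a, only_in_b): sets of .py member names
    present in one wheel but not the other."""
    def py_members(path):
        with zipfile.ZipFile(path) as whl:
            return {name for name in whl.namelist() if name.endswith(".py")}
    a, b = py_members(path_a), py_members(path_b)
    return a - b, b - a
```

Files appearing in both wheels can then be compared content-wise (e.g. with `difflib`) to find the 4 ROCm changes described above.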
4. Metadata Comparison
Both wheels have identical metadata:
- Same version: `2.8.3`
- Same author: Tri Dao
- Same dependencies: `torch`, `einops`
- Same wheel generator: `setuptools (70.2.0)`
- Same Python tag: `cp312-cp312-win_amd64`
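Each of these fields lives in the wheel's `.dist-info/METADATA` file, which can be pulled out and diffed directly. A sketch (the function name `wheel_metadata` is my own):

```python
import zipfile

def wheel_metadata(path):
    """Return the METADATA text from a wheel's .dist-info directory."""
    with zipfile.ZipFile(path) as whl:
        name = next(n for n in whl.namelist()
                    if n.endswith(".dist-info/METADATA"))
        return whl.read(name).decode("utf-8")
```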
Summary
| Aspect | My Build (228 MB) | The Other Build (350 MB) |
|---|---|---|
| CUDA architectures | sm_80, sm_90, sm_100, sm_120 | sm_80, sm_90, sm_100, sm_110, sm_120 |
| Extra arch (sm_110) | Not included | Included (+122 MB) |
| sm_110 useful for PC users? | N/A (not included) | ❌ No (automotive SoC) |
| FA2 functionality | ✅ Complete | ✅ Complete |
| Works on Ampere (RTX 30xx)? | ✅ | ✅ |
| Works on Hopper (H100)? | ✅ | ✅ |
| Works on Blackwell (RTX 50xx)? | ✅ | ✅ |
| Extra code / hidden features? | None | None |
My build is smaller because it doesn't include kernels for sm_110 (NVIDIA Thor, an automotive SoC for self-driving cars, not a desktop or server GPU). There are no extra features, no hidden code, and no functional differences. You get the exact same Flash Attention 2 performance and behavior from both wheels on every desktop and server NVIDIA GPU (Ampere, Hopper, and Blackwell).
Hope this clears things up! 🙂