Size differences...

#4
by essence25 - opened

I found your compiled wheel and another one with the same name, but there's a big difference in size. Why is yours 100 MB bigger?
I'm just curious whether there are extras in your version: "flash_attn-2.8.3+cu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl"
See links below.

https://huggingface.co/Wildminder/AI-windows-whl/blob/main/flash_attn-2.8.3%2Bcu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl

https://huggingface.co/ussoewwin/Flash-Attention-2_for_Windows/blob/main/flash_attn-2.8.3%2Bcu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl

Reply to Discussion #4: Size Differences


Hi, thanks for the question!

Actually, I think you may have the files mixed up: my build is the smaller one (228 MB), not the larger one (350 MB). The other wheel you're comparing against is approximately 122 MB larger than mine, not the other way around.

I did a thorough binary-level analysis of both wheels to give you a definitive answer on exactly why the size difference exists.

TL;DR

The ~122 MB size difference is caused entirely by one extra CUDA GPU architecture (sm_110) compiled into the other wheel. There is zero functional difference for any currently existing GPU.


Detailed Analysis

1. File Sizes

| | My Build | The Other Build |
|---|---|---|
| Compressed (.whl) | 228.48 MB | 350.21 MB |
| Difference | – | +121.73 MB |

Both wheels contain a single compiled binary (flash_attn_2_cuda.cp312-win_amd64.pyd) that accounts for 99.8%+ of the total size.

| | My Build | The Other Build |
|---|---|---|
| .pyd uncompressed | 919.60 MB | 1,239.79 MB |
| .pyd compressed | 228.12 MB | 349.76 MB |
| Difference (compressed) | – | +121.64 MB |
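You can verify these numbers yourself: a wheel is just a zip archive, so a few lines of Python show which members dominate the size (the file path in the usage comment is illustrative):

```python
import zipfile

def largest_members(wheel_path, top=3):
    """Return (name, compressed_size, uncompressed_size) for the
    largest members of a wheel, biggest first."""
    with zipfile.ZipFile(wheel_path) as whl:
        infos = sorted(whl.infolist(), key=lambda i: i.file_size, reverse=True)
        return [(i.filename, i.compress_size, i.file_size) for i in infos[:top]]

# Example (path is hypothetical):
# for name, comp, uncomp in largest_members("flash_attn-2.8.3-....whl"):
#     print(f"{name}: {comp/2**20:.2f} MB compressed, {uncomp/2**20:.2f} MB raw")
```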

2. Root Cause: CUDA Architecture Targets

I used cuobjdump --list-elf to inspect the compiled CUDA kernels (cubins) embedded in each .pyd binary. Here are the results:

My Build: 4 architectures, 288 total ELF cubins

| Architecture | Kernel Count | Target GPUs |
|---|---|---|
| sm_80 | 72 | Ampere (A100, RTX 3060–3090) |
| sm_90 | 72 | Hopper (H100, H200) |
| sm_100 | 72 | Blackwell Datacenter (B200, GB200) |
| sm_120 | 72 | Blackwell Consumer (RTX 5090, 5080, etc.) |
| Total | 288 | |

The Other Build: 5 architectures, 360 total ELF cubins

| Architecture | Kernel Count | Target GPUs |
|---|---|---|
| sm_80 | 72 | Ampere (A100, RTX 3060–3090) |
| sm_90 | 72 | Hopper (H100, H200) |
| sm_100 | 72 | Blackwell Datacenter (B200, GB200) |
| sm_110 | 72 | NVIDIA Thor (automotive SoC, not a desktop/server GPU) |
| sm_120 | 72 | Blackwell Consumer (RTX 5090, 5080, etc.) |
| Total | 360 | |

The entire 122 MB difference comes from the 72 extra sm_110 kernels in the other build. sm_110 corresponds to the NVIDIA Thor automotive SoC (used in self-driving cars), not a desktop or server GPU. These kernels are completely irrelevant for any PC or datacenter user.
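If you want to reproduce the cubin tally yourself, here is a minimal Python sketch that counts `sm_XX` tokens in `cuobjdump --list-elf` output. The exact listing format can vary between CUDA toolkit versions, so the regex is a deliberately loose assumption rather than a precise parser:

```python
import re
from collections import Counter

def count_cubins_per_arch(listing: str) -> Counter:
    """Count embedded cubins per SM architecture in `cuobjdump --list-elf` output.

    Assumes each cubin line contains an `sm_<NN>` token, e.g.
    `ELF file    1: flash_attn_2_cuda.1.sm_80.cubin`
    (format may differ slightly across CUDA versions).
    """
    return Counter(re.findall(r"\bsm_\d+\b", listing))

# Usage sketch (command and file names are illustrative):
#   cuobjdump --list-elf flash_attn_2_cuda.cp312-win_amd64.pyd > listing.txt
#   counts = count_cubins_per_arch(open("listing.txt").read())
```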

3. Python File Differences

Beyond the binary, I also compared every Python file inside both wheels:

| Category | Files Changed | Impact on Users |
|---|---|---|
| AMD ROCm Triton backend | 4 files | ❌ Irrelevant on Windows/NVIDIA |
| CuteDSL (FA3 experimental) | 18 modified + 14 extra files | ❌ Pure Python, not compiled into the FA2 .pyd |
| Hopper FA3 interface | 2 files | ❌ FA3 code path, does not affect FA2 |
| Tests / Benchmarks / Build scripts | 3 files | ❌ Not runtime code |
| Core Runtime (flash_attn_interface.py) | 1 file | ❌ See below |

The sole core runtime file (flash_attn/flash_attn_interface.py) has only 4 minor changes, all of which are AMD ROCm (HIP) conditional branches. On NVIDIA GPUs, the exact same code path is executed in both builds. The changes are:

  1. Three changes add `if torch.version.hip:` branches for AMD-specific tensor shape handling; on NVIDIA, the original `else` path runs identically.
  2. One change switches the `cache_seqlens` shape reference from `k_cache.shape[0]` to `q.shape[0]`; in practice these values are always equal.

None of these changes affect functionality on NVIDIA GPUs.

The 14 extra Python files in the other build are all in flash_attn/cute/ (CuteDSL experimental code) and hopper/ (test files). These are auxiliary files related to FlashAttention-3 development and are not used by the FlashAttention-2 CUDA binary.

4. Metadata Comparison

Both wheels have identical metadata:

- Same version: 2.8.3
- Same author: Tri Dao
- Same dependencies: torch, einops
- Same wheel generator: setuptools (70.2.0)
- Same Python tag: cp312-cp312-win_amd64
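The metadata check is easy to reproduce: a wheel is just a zip archive, and its `METADATA` file lives in the `*.dist-info` directory. A small sketch (wheel path is illustrative):

```python
import zipfile
from email.parser import Parser

def wheel_metadata(wheel_path):
    """Parse the METADATA file from a wheel's .dist-info directory.

    Wheel METADATA uses RFC 822-style headers, so the stdlib email
    parser reads it directly.
    """
    with zipfile.ZipFile(wheel_path) as whl:
        meta_name = next(n for n in whl.namelist()
                         if n.endswith(".dist-info/METADATA"))
        return Parser().parsestr(whl.read(meta_name).decode("utf-8"))

# meta = wheel_metadata("flash_attn-2.8.3-....whl")  # path illustrative
# print(meta["Version"], meta["Author"], meta.get_all("Requires-Dist"))
```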

Summary

| Aspect | My Build (228 MB) | The Other Build (350 MB) |
|---|---|---|
| CUDA architectures | sm_80, sm_90, sm_100, sm_120 | sm_80, sm_90, sm_100, sm_110, sm_120 |
| Extra arch (sm_110) | Not included | Included (+122 MB) |
| sm_110 useful for PC users? | – | No (automotive SoC) |
| FA2 functionality | ✅ Complete | ✅ Complete |
| Works on Ampere (RTX 30xx)? | ✅ | ✅ |
| Works on Hopper (H100)? | ✅ | ✅ |
| Works on Blackwell (RTX 50xx)? | ✅ | ✅ |
| Extra code / hidden features? | None | None |

My build is smaller because it doesn't include kernels for sm_110 (NVIDIA Thor, an automotive SoC for self-driving cars, not a desktop or server GPU). There are no extra features, no hidden code, and no functional differences. You get the exact same Flash Attention 2 performance and behavior from both wheels on every desktop and server NVIDIA GPU (Ampere, Hopper, and Blackwell).
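As a rough self-check, you can map your GPU's compute capability (what `torch.cuda.get_device_capability()` returns as a `(major, minor)` tuple) onto this wheel's architecture list. This is a simplified rule-of-thumb sketch: cubins built for the base of a major generation (e.g. sm_80) generally run on higher minor revisions of that generation (e.g. sm_86 on an RTX 3090), though family-specific variants like sm_90a are an exception, and PTX JIT fallback is not modeled here:

```python
WHEEL_ARCHS = {"sm_80", "sm_90", "sm_100", "sm_120"}  # this build's targets

def sm_code(major: int, minor: int) -> str:
    """Translate a CUDA compute capability tuple into an sm_XX tag."""
    return f"sm_{major}{minor}"

def is_covered(major: int, minor: int, archs=WHEEL_ARCHS) -> bool:
    """True if some compiled arch shares the GPU's major generation
    with an equal or lower minor revision (simplified compatibility rule)."""
    for arch in archs:
        a_major, a_minor = divmod(int(arch[3:]), 10)
        if a_major == major and a_minor <= minor:
            return True
    return False

# e.g. an RTX 3090 reports (8, 6): covered via the sm_80 cubins;
# NVIDIA Thor reports (11, 0): not covered by this build, by design.
```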

Hope this clears things up! 😊
