Size differences...

#4
by essence25 - opened

I found your compiled wheel and another one with the same name, but there's a big difference in size. Why is yours 100 MB bigger?
I'm just curious whether there are extras in your version: "flash_attn-2.8.3+cu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl"
See links below.

https://huggingface.co/Wildminder/AI-windows-whl/blob/main/flash_attn-2.8.3%2Bcu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl

https://huggingface.co/ussoewwin/Flash-Attention-2_for_Windows/blob/main/flash_attn-2.8.3%2Bcu130torch2.9.1cxx11abiTRUE-cp312-cp312-win_amd64.whl

Reply to Discussion #4: Size Differences


Hi, thanks for the question!

Actually, I think you may have the files mixed up: my build is the smaller one (228 MB), not the larger one (350 MB). The other wheel you're comparing against is approximately 122 MB larger than mine, not the other way around.

I did a thorough binary-level analysis of both wheels to give you a definitive answer on exactly why the size difference exists.

TL;DR

The ~122 MB size difference is caused entirely by one extra CUDA GPU architecture (sm_110) compiled into the other wheel. There is zero functional difference for any currently existing GPU.


Detailed Analysis

1. File Sizes

| | My Build | The Other Build |
|---|---|---|
| Compressed (.whl) | 228.48 MB | 350.21 MB |
| Difference | – | +121.73 MB |

Both wheels contain a single compiled binary (flash_attn_2_cuda.cp312-win_amd64.pyd) that accounts for 99.8%+ of the total size.

| | My Build | The Other Build |
|---|---|---|
| .pyd uncompressed | 919.60 MB | 1,239.79 MB |
| .pyd compressed | 228.12 MB | 349.76 MB |
| Difference (compressed) | – | +121.64 MB |
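You can verify these numbers yourself: a wheel is just a zip archive, so a few lines of Python show which members dominate the size (the file path in the usage comment is illustrative):

```python
import zipfile

def largest_members(wheel_path, top=3):
    """Return (name, compressed_size, uncompressed_size) for the
    largest members of a wheel, biggest first."""
    with zipfile.ZipFile(wheel_path) as whl:
        infos = sorted(whl.infolist(), key=lambda i: i.file_size, reverse=True)
        return [(i.filename, i.compress_size, i.file_size) for i in infos[:top]]

# Example (path is hypothetical):
# for name, comp, uncomp in largest_members("flash_attn-2.8.3-....whl"):
#     print(f"{name}: {comp/2**20:.2f} MB compressed, {uncomp/2**20:.2f} MB raw")
```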

2. Root Cause: CUDA Architecture Targets

I used cuobjdump --list-elf to inspect the compiled CUDA kernels (cubins) embedded in each .pyd binary. Here are the results:

My Build: 4 architectures, 288 total ELF cubins

| Architecture | Kernel Count | Target GPUs |
|---|---|---|
| sm_80 | 72 | Ampere (A100, RTX 3060–3090) |
| sm_90 | 72 | Hopper (H100, H200) |
| sm_100 | 72 | Blackwell Datacenter (B200, GB200) |
| sm_120 | 72 | Blackwell Consumer (RTX 5090, 5080, etc.) |
| Total | 288 | |

The Other Build: 5 architectures, 360 total ELF cubins

| Architecture | Kernel Count | Target GPUs |
|---|---|---|
| sm_80 | 72 | Ampere (A100, RTX 3060–3090) |
| sm_90 | 72 | Hopper (H100, H200) |
| sm_100 | 72 | Blackwell Datacenter (B200, GB200) |
| sm_110 | 72 | NVIDIA Thor (automotive SoC, not a desktop/server GPU) |
| sm_120 | 72 | Blackwell Consumer (RTX 5090, 5080, etc.) |
| Total | 360 | |

The entire 122 MB difference comes from the 72 extra sm_110 kernels in the other build. sm_110 corresponds to the NVIDIA Thor automotive SoC (used in self-driving cars), not a desktop or server GPU. These kernels are completely irrelevant for any PC or datacenter user.
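If you want to reproduce the cubin tally yourself, here is a minimal Python sketch that counts `sm_XX` tokens in `cuobjdump --list-elf` output. The exact listing format can vary between CUDA toolkit versions, so the regex is a deliberately loose assumption rather than a precise parser:

```python
import re
from collections import Counter

def count_cubins_per_arch(listing: str) -> Counter:
    """Count embedded cubins per SM architecture in `cuobjdump --list-elf` output.

    Assumes each cubin line contains an `sm_<NN>` token, e.g.
    `ELF file    1: flash_attn_2_cuda.1.sm_80.cubin`
    (format may differ slightly across CUDA versions).
    """
    return Counter(re.findall(r"\bsm_\d+\b", listing))

# Usage sketch (command and file names are illustrative):
#   cuobjdump --list-elf flash_attn_2_cuda.cp312-win_amd64.pyd > listing.txt
#   counts = count_cubins_per_arch(open("listing.txt").read())
```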

3. Python File Differences

Beyond the binary, I also compared every Python file inside both wheels:

| Category | Files Changed | Impact on Users |
|---|---|---|
| AMD ROCm Triton backend | 4 files | ❌ Irrelevant on Windows/NVIDIA |
| CuteDSL (FA3 experimental) | 18 modified + 14 extra files | ❌ Pure Python, not compiled into the FA2 .pyd |
| Hopper FA3 interface | 2 files | ❌ FA3 code path, does not affect FA2 |
| Tests / Benchmarks / Build scripts | 3 files | ❌ Not runtime code |
| Core Runtime (flash_attn_interface.py) | 1 file | ❌ See below |

The sole core runtime file (flash_attn/flash_attn_interface.py) has only 4 minor changes, all of which are AMD ROCm (HIP) conditional branches. On NVIDIA GPUs, the exact same code path is executed in both builds. The changes are:

  1. Three changes add `if torch.version.hip:` branches for AMD-specific tensor shape handling; on NVIDIA, the original `else` path runs identically.
  2. One change switches the `cache_seqlens` shape reference from `k_cache.shape[0]` to `q.shape[0]`; in practice these values are always equal.

None of these changes affect functionality on NVIDIA GPUs.

The 14 extra Python files in the other build are all in flash_attn/cute/ (CuteDSL experimental code) and hopper/ (test files). These are auxiliary files related to FlashAttention-3 development and are not used by the FlashAttention-2 CUDA binary.

4. Metadata Comparison

Both wheels have identical metadata:

- Same version: 2.8.3
- Same author: Tri Dao
- Same dependencies: torch, einops
- Same wheel generator: setuptools (70.2.0)
- Same Python tag: cp312-cp312-win_amd64
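The metadata check is easy to reproduce: a wheel is just a zip archive, and its `METADATA` file lives in the `*.dist-info` directory. A small sketch (wheel path is illustrative):

```python
import zipfile
from email.parser import Parser

def wheel_metadata(wheel_path):
    """Parse the METADATA file from a wheel's .dist-info directory.

    Wheel METADATA uses RFC 822-style headers, so the stdlib email
    parser reads it directly.
    """
    with zipfile.ZipFile(wheel_path) as whl:
        meta_name = next(n for n in whl.namelist()
                         if n.endswith(".dist-info/METADATA"))
        return Parser().parsestr(whl.read(meta_name).decode("utf-8"))

# meta = wheel_metadata("flash_attn-2.8.3-....whl")  # path illustrative
# print(meta["Version"], meta["Author"], meta.get_all("Requires-Dist"))
```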

Summary

| Aspect | My Build (228 MB) | The Other Build (350 MB) |
|---|---|---|
| CUDA architectures | sm_80, sm_90, sm_100, sm_120 | sm_80, sm_90, sm_100, sm_110, sm_120 |
| Extra arch (sm_110) | Not included | Included (+122 MB) |
| sm_110 useful for PC users? | – | No (automotive SoC) |
| FA2 functionality | ✅ Complete | ✅ Complete |
| Works on Ampere (RTX 30xx)? | ✅ | ✅ |
| Works on Hopper (H100)? | ✅ | ✅ |
| Works on Blackwell (RTX 50xx)? | ✅ | ✅ |
| Extra code / hidden features? | None | None |

My build is smaller because it doesn't include kernels for sm_110 (NVIDIA Thor, an automotive SoC for self-driving cars, not a desktop or server GPU). There are no extra features, no hidden code, and no functional differences. You get the exact same Flash Attention 2 performance and behavior from both wheels on every desktop and server NVIDIA GPU (Ampere, Hopper, and Blackwell).
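As a rough self-check, you can map your GPU's compute capability (what `torch.cuda.get_device_capability()` returns as a `(major, minor)` tuple) onto this wheel's architecture list. This is a simplified rule-of-thumb sketch: cubins built for the base of a major generation (e.g. sm_80) generally run on higher minor revisions of that generation (e.g. sm_86 on an RTX 3090), though family-specific variants like sm_90a are an exception, and PTX JIT fallback is not modeled here:

```python
WHEEL_ARCHS = {"sm_80", "sm_90", "sm_100", "sm_120"}  # this build's targets

def sm_code(major: int, minor: int) -> str:
    """Translate a CUDA compute capability tuple into an sm_XX tag."""
    return f"sm_{major}{minor}"

def is_covered(major: int, minor: int, archs=WHEEL_ARCHS) -> bool:
    """True if some compiled arch shares the GPU's major generation
    with an equal or lower minor revision (simplified compatibility rule)."""
    for arch in archs:
        a_major, a_minor = divmod(int(arch[3:]), 10)
        if a_major == major and a_minor <= minor:
            return True
    return False

# e.g. an RTX 3090 reports (8, 6): covered via the sm_80 cubins;
# NVIDIA Thor reports (11, 0): not covered by this build, by design.
```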

Hope this clears things up! 😊
