flash-attention: Triton kernel

A pure-Triton implementation of Flash Attention 1 (Dao et al., 2022) packaged for the Hugging Face Kernel Hub.

Unlike hand-written CUDA implementations, this kernel is written entirely in Python/Triton and is JIT-compiled at runtime, making it easy to read, modify, and experiment with.

Algorithm

Flash Attention avoids materialising the full N × N attention matrix in HBM by fusing the softmax and the value-weighted sum into a single tiled pass using the online softmax trick (Milakov & Gimelshein, 2018):

O_i ← softmax(Q_i · Kᵀ) · V        (tiled over K/V, never storing the full score matrix S)

Memory usage drops from O(N²) to O(N · d), removing what is otherwise the primary memory bottleneck for long-context inference and training.
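
For intuition, here is a minimal, unoptimised PyTorch sketch of the same tiling idea on a single head (shapes [N, d]). It keeps only running row-wise max and normaliser statistics instead of the full score matrix; the block size and names are illustrative rather than the kernel's actual parameters, and the 1/√d scaling is omitted to match the formula above.

import torch

def tiled_attention(q, k, v, block_n=128):
    """Reference sketch of online-softmax attention over K/V tiles (q, k, v: [N, d])."""
    q, k, v = q.float(), k.float(), v.float()                  # fp32 for a simple reference
    N, d = q.shape
    out = torch.zeros_like(q)
    m = torch.full((N, 1), float("-inf"), device=q.device)     # running row-wise max
    l = torch.zeros((N, 1), device=q.device)                   # running softmax normaliser
    for start in range(0, N, block_n):
        kb, vb = k[start:start + block_n], v[start:start + block_n]
        s = q @ kb.T                                            # scores for this K/V block only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                                # probabilities in the new scale
        alpha = torch.exp(m - m_new)                            # rescale contributions of earlier blocks
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        m = m_new
    return out / l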

Usage

Via the kernels package

import torch
from kernels import get_kernel

fa = get_kernel("kernels-community/flash-attention", version=1)

B, H, N, d = 2, 8, 1024, 64
q = torch.randn(B, H, N, d, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, N, d, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, N, d, device="cuda", dtype=torch.float16)

out = fa.flash_attention_forward(q, k, v, causal=False)
print(out.shape)  # [2, 8, 1024, 64]
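
To sanity-check the result you can compare it against PyTorch's built-in attention on the same tensors. This assumes the kernel applies the standard 1/√d softmax scaling (as torch's SDPA does), and the fp16 tolerance below is a loose assumption, so treat it as a smoke test rather than a documented guarantee.

import torch.nn.functional as F

ref = F.scaled_dot_product_attention(q, k, v, is_causal=False)
torch.testing.assert_close(out, ref, atol=2e-2, rtol=2e-2)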

Local development

# 1. Clone
git clone https://huggingface.co/kernels-community/flash-attention
cd flash-attention

# 2. Install dependencies
pip install torch triton pytest

# 3. Run tests
pytest tests/ -v

# 4. Run benchmark
python benchmarks/bench_flash_attention.py
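
A new correctness test can follow the same pattern as the existing ones. The sketch below is illustrative: the import path mirrors the repository layout but is an assumption, so adapt it to whatever tests/test_flash_attention.py actually imports.

# illustrative test; adjust the import to the project's actual entry point
import math
import pytest
import torch

from flash_attention_kernel.flash_attention import flash_attention_forward  # assumed path

@pytest.mark.parametrize("seq_len", [128, 1024])
def test_matches_naive_reference(seq_len):
    torch.manual_seed(0)
    q, k, v = (torch.randn(2, 8, seq_len, 64, device="cuda", dtype=torch.float16)
               for _ in range(3))
    out = flash_attention_forward(q, k, v, causal=False)
    # naive fp32 reference with the usual 1/sqrt(d) scaling (assumed to match the kernel)
    scores = q.float() @ k.float().transpose(-2, -1) / math.sqrt(q.shape[-1])
    ref = torch.softmax(scores, dim=-1) @ v.float()
    torch.testing.assert_close(out.float(), ref, atol=2e-2, rtol=2e-2)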

Performance

Sequence length   Flash-Attn (Triton)   PyTorch reference   Speedup
128               0.11 ms               0.19 ms             1.70×
256               0.15 ms               0.24 ms             1.57×
512               0.17 ms               0.43 ms             2.47×
1024              0.29 ms               1.78 ms             6.15×
2048              0.79 ms               7.11 ms             8.98×
4096              2.54 ms               27.01 ms            10.63×
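
The table comes from benchmarks/bench_flash_attention.py; a simplified measurement in the same spirit can be made with triton.testing.do_bench. The batch/head/head-dim values and the naive reference implementation below are assumptions, and absolute timings depend heavily on the GPU.

import math
import torch
import triton.testing
from kernels import get_kernel

fa = get_kernel("kernels-community/flash-attention", version=1)

def naive_attention(q, k, v):
    # plain softmax(QKᵀ/√d)V reference, computed in fp32 for stability
    s = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(s.float(), dim=-1).to(q.dtype) @ v

for n in (128, 512, 1024, 4096):
    q, k, v = (torch.randn(2, 8, n, 64, device="cuda", dtype=torch.float16)
               for _ in range(3))
    ms_fa = triton.testing.do_bench(lambda: fa.flash_attention_forward(q, k, v, causal=False))
    ms_ref = triton.testing.do_bench(lambda: naive_attention(q, k, v))
    print(f"N={n}: triton {ms_fa:.2f} ms  |  PyTorch ref {ms_ref:.2f} ms  |  {ms_ref / ms_fa:.2f}x")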

Repository structure

flash-attention-1-triton/
├── build.toml                        # kernel-builder configuration
├── flake.nix                         # Nix build environment
├── flash_attention_kernel/
│   └── flash_attention.py            # Triton forward/backward kernels + launcher
├── torch-ext/
│   ├── torch_binding.h               # C++ op declaration
│   ├── torch_binding.cpp             # Torch op registration
│   └── flash_attention/
│       └── __init__.py               # Python-level wrapper (uses _ops alias)
├── tests/
│   └── test_flash_attention.py       # pytest correctness & smoke tests
├── benchmarks/
│   └── bench_flash_attention.py      # triton.testing perf report
└── README.md

References

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135.

Milakov, M., & Gimelshein, N. (2018). Online normalizer calculation for softmax. arXiv:1805.02867.