# Twill: Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Implementation of **["Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs"](https://arxiv.org/abs/2512.18134)** by Rupanshu Soi et al. (NVIDIA Research, 2024).

## What is Twill?

Twill is the first system to automatically derive **provably optimal** Software Pipelining (SWP) + Warp Specialization (WS) schedules for tensor core GPU kernels. It formulates the joint optimization as a constraint satisfaction problem and solves it with off-the-shelf ILP and SMT solvers.

**Key result:** Twill automatically rediscovers the expert-designed schedules of FlashAttention-3 (Hopper) and FlashAttention-4 (Blackwell), proving that these human-designed schedules are optimal.

## Architecture

The solver runs in two phases, following Algorithm 1 from the paper:

```
Phase 1: ILP Modulo Scheduling (CBC solver via PuLP)
  → Finds the optimal initiation interval I and an initial schedule M

Phase 2: SMT Joint SWP + WS (Z3 solver)
  → Finds the optimal schedule M* and warp assignment A*
  → Encodes the constraints from Figures 4, 5, and 6 of the paper

Cost Normalization (Section 5.2):
  → Reduces cycle counts while preserving their ratios
  → Makes the ILP/SMT problems tractable for real GPU cycle counts (~1000 cycles)
```
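
The Phase 1 search can be sketched in plain Python. This is a brute-force stand-in for the ILP formulation, not Twill's actual implementation: it tries initiation intervals I in increasing order and returns the first I admitting a feasible modulo schedule (start cycles are searched within a 2·I window, enough for this toy example).

```python
from itertools import product

def find_min_ii(ops, edges, num_units, max_ii=8):
    """Brute-force stand-in for Phase 1: search initiation intervals I
    upward and return the first (I, schedule) that is feasible.

    ops:   {name: latency}
    edges: (src, dst, delay, distance) -- distance > 0 is loop-carried
    num_units: how many ops may share an issue slot (capacity)
    """
    names = list(ops)
    for ii in range(1, max_ii + 1):
        for starts in product(range(2 * ii), repeat=len(names)):
            sched = dict(zip(names, starts))
            # Dependence: dst must start >= src + delay - distance * I.
            if any(sched[d] < sched[s] + delay - dist * ii
                   for s, d, delay, dist in edges):
                continue
            # Capacity: at most num_units ops may occupy a cycle mod I.
            slots = [start % ii for start in sched.values()]
            if any(slots.count(r) > num_units for r in set(slots)):
                continue
            return ii, sched
    return None

# Toy graph: load -> gemm -> relu, plus a loop-carried relu self-edge.
ops = {"load": 1, "gemm": 2, "relu": 1}
edges = [("load", "gemm", 1, 0), ("gemm", "relu", 2, 0), ("relu", "relu", 1, 1)]
print(find_min_ii(ops, edges, num_units=1))  # minimal I is 3 here
```

With a single issue slot, three operations force I ≥ 3 regardless of dependences; the real solver derives such bounds through the ILP rather than enumeration.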

## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib
```

```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve
from twill.visualization import visualize_schedule

# Build the simplified Flash Attention dependence graph (Figure 1)
graph = flash_attention_forward_simplified()

# Run the full Twill solver
result = twill_solve(graph, max_I=5, verbose=True)

# Visualize the result
print(visualize_schedule(graph, result.joint_result))
```

Output:
```
SOLUTION FOUND in 0.12s
Initiation Interval I = 2 → optimal!
Schedule Length L = 4
Overlapping copies = 2
Schedule: S@0, P@2, O@3 → S extracted into prologue
Warp Assignment: all on warp 0 (no variable-latency ops)
```
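
The initiation interval determines steady-state throughput: a new iteration launches every I cycles, so n iterations take roughly L + (n − 1)·I cycles, and ⌈L/I⌉ iteration copies are in flight at once. A back-of-the-envelope model (ignoring prologue/epilogue details; helper names are illustrative):

```python
from math import ceil

def pipeline_cycles(n_iters, ii, length):
    """Approximate total cycles for a software-pipelined loop:
    the first iteration takes `length` cycles, each later one adds `ii`."""
    return length + (n_iters - 1) * ii

def overlapping_copies(ii, length):
    """How many iteration copies are in flight at once."""
    return ceil(length / ii)

print(overlapping_copies(2, 4))    # 2, matching the solver output above
print(pipeline_cycles(100, 2, 4))  # 202 cycles, vs. 400 without pipelining
```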

## Pre-built Kernel Descriptions

| Kernel | Section | Architecture | Key Result |
|--------|---------|--------------|------------|
| `flash_attention_forward_simplified()` | §3 (Figure 1) | Hopper | I=2, SWP extracts S into prologue |
| `flash_attention_forward_hopper()` | §6.2.1 | Hopper (H100) | Rediscovers FA3 pipeline + ping-pong |
| `flash_attention_forward_blackwell()` | §6.2.2 | Blackwell (B200) | Rediscovers FA4 strategy |
| `simple_gemm_pipeline()` | – | Hopper | Load-compute overlap, TMA on producer warp |

## Custom Kernels

Define your own kernel dependence graph:

```python
from twill.graph import hopper_machine
from twill.kernels import custom_kernel
from twill.twill_solver import twill_solve

machine = hopper_machine()
graph = custom_kernel(
    machine=machine,
    instructions=[
        {"name": "load_A", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
        {"name": "load_B", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
        {"name": "gemm", "cycles": 2, "fu": "TC"},
        {"name": "relu", "cycles": 1, "fu": "EXP"},
    ],
    edges=[
        {"src": "load_A", "dst": "gemm", "delay": 1},
        {"src": "load_B", "dst": "gemm", "delay": 1},
        {"src": "gemm", "dst": "relu", "delay": 2},
        {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},  # loop-carried
    ],
)

result = twill_solve(graph, verbose=True)
```
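
Loop-carried edges (`delta > 0`) bound the initiation interval from below: for every dependence cycle, I ≥ ⌈total delay / total distance⌉ (the classic recurrence-constrained MII). A quick way to sanity-check a custom graph before invoking the solver; `rec_mii` is a hypothetical helper, not part of Twill's API:

```python
from math import ceil

def rec_mii(edges):
    """Recurrence-constrained lower bound on the initiation interval:
    I >= ceil(sum of delays / sum of distances) over every dependence cycle.
    edges: list of dicts with "src", "dst", "delay", and optional "delta"."""
    adj = {}
    for e in edges:
        adj.setdefault(e["src"], []).append(e)

    best = 1
    def dfs(node, start, delay, dist, seen):
        nonlocal best
        for e in adj.get(node, []):
            d, k = delay + e["delay"], dist + e.get("delta", 0)
            if e["dst"] == start:
                if k > 0:
                    best = max(best, ceil(d / k))
            elif e["dst"] not in seen:
                dfs(e["dst"], start, d, k, seen | {e["dst"]})

    for v in adj:
        dfs(v, v, 0, 0, {v})
    return best

edges = [
    {"src": "load_A", "dst": "gemm", "delay": 1},
    {"src": "load_B", "dst": "gemm", "delay": 1},
    {"src": "gemm", "dst": "relu", "delay": 2},
    {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},
]
print(rec_mii(edges))  # the relu self-cycle only forces I >= ceil(1/1) = 1

# A longer recurrence tightens the bound: gemm -> relu -> gemm carries
# total delay 5 over distance 1, forcing I >= 5.
edges.append({"src": "relu", "dst": "gemm", "delay": 3, "delta": 1})
print(rec_mii(edges))  # 5
```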

## Code Generation

Twill generates three output formats:

```python
from twill.codegen import generate_pseudocode, generate_cuda_skeleton, generate_pipelined_code

# Human-readable pseudocode with warp annotations
print(generate_pseudocode(graph, result.joint_result))

# CUDA C++ skeleton with warp-specialized structure
print(generate_cuda_skeleton(graph, result.joint_result))

# Structured PipelinedCode object for further processing
code = generate_pipelined_code(graph, result.joint_result)
```

## Module Structure

```
twill/
├── __init__.py            # Package exports
├── graph.py               # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py  # Section 5.2: ILP-based cycle count normalization
├── modulo_scheduler.py    # Phase 1: ILP modulo scheduling (CBC solver)
├── smt_joint.py           # Phase 2: SMT joint SWP+WS (Z3 solver)
├── twill_solver.py        # Algorithm 1: Main search procedure
├── codegen.py             # Code generation (pseudocode, CUDA skeleton)
├── visualization.py       # Schedule visualization (text + matplotlib)
└── kernels.py             # Pre-built kernel descriptions (FMHA, GEMM)
```

## Constraint Groups (from the paper)

### Figure 4: Modulo Scheduling Constraints
- **Uniqueness**: Each operation scheduled exactly once per iteration copy
- **Consistency**: Modulo structure preserved across copies (offset by I)
- **Completion**: Operations must finish before the end of the schedule
- **Dependence**: Data dependencies respected across iterations
- **Capacity**: Functional unit capacities not exceeded

### Figure 5: Memory Allocation Constraints
- **Memory Capacity**: Working set fits in on-chip memory (SMEM, TMEM, registers)
- **Liveness**: SSA-based backward dataflow for variable lifetimes

### Figure 6: Warp Assignment Constraints
- **WarpUniqueness**: Each instruction assigned to exactly one warp
- **VariableLatency**: Variable-latency ops go to a dedicated producer warp
- **WarpCapacity**: Per-warp resource budget respected
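
The Figure 6 groups can be illustrated with a tiny pure-Python checker. This is an illustrative sketch with hypothetical names (`check_warp_assignment`, `reg_budget`); Twill encodes these conditions as SMT constraints rather than checking candidate assignments:

```python
def check_warp_assignment(assign, instrs, num_warps, reg_budget, producer_warp=0):
    """Check a candidate warp assignment against the Figure 6 groups.

    assign: {instr: warp id}
    instrs: {instr: {"variable_latency": bool, "regs": int}}
    """
    # WarpUniqueness: every instruction maps to exactly one valid warp.
    if set(assign) != set(instrs):
        return False
    if any(not 0 <= w < num_warps for w in assign.values()):
        return False
    # VariableLatency: variable-latency ops must sit on the producer warp.
    if any(meta.get("variable_latency") and assign[name] != producer_warp
           for name, meta in instrs.items()):
        return False
    # WarpCapacity: per-warp register usage stays within budget.
    usage = {}
    for name, meta in instrs.items():
        usage[assign[name]] = usage.get(assign[name], 0) + meta.get("regs", 0)
    return all(u <= reg_budget for u in usage.values())

instrs = {
    "load_A": {"variable_latency": True, "regs": 8},
    "gemm":   {"variable_latency": False, "regs": 64},
    "relu":   {"variable_latency": False, "regs": 16},
}
print(check_warp_assignment({"load_A": 0, "gemm": 1, "relu": 1},
                            instrs, num_warps=2, reg_budget=128))  # True
print(check_warp_assignment({"load_A": 1, "gemm": 1, "relu": 1},
                            instrs, num_warps=2, reg_budget=128))  # False: TMA off producer warp
```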

## Solvers Used

| Component | Solver | Theory | Paper Reference |
|-----------|--------|--------|-----------------|
| Cost Normalization | CBC (PuLP) | ILP | Section 5.2 (paper uses SCIP) |
| Modulo Scheduling | CBC (PuLP) | ILP | Section 3.1, Stoutchinin et al. |
| Joint SWP + WS | Z3 | QF_LIA (SMT) | Section 4 (paper uses Yices2) |

## Tests

```bash
python test_twill.py
```

```
✓ PASS Cost Normalization
✓ PASS Modulo Scheduling Only
✓ PASS Simplified FA (Figure 1)
✓ PASS Simple GEMM
✓ PASS Hopper FMHA Forward
✓ PASS Blackwell FMHA Forward

Passed: 6/6
Total time: ~5s
```

## Limitations

Following the paper (Section 5.4):
- Only singly-nested loops without control flow are supported
- Tile sizes are not determined automatically (an external concern)
- Code generation produces skeletons, not fully compilable CUDA; the paper notes that even the authors' implementation required "hand-compilation" to CUDA C++ because Triton made incorrect decisions during code generation

## Citation

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) – Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) – Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) – warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) – NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) – automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) – task-based GPU programming model