# Twill: Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Implementation of **["Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs"](https://arxiv.org/abs/2512.18134)** by Rupanshu Soi et al. (NVIDIA Research, 2025).

## What is Twill?

Twill is the first system to automatically derive **provably optimal** software pipelining (SWP) + warp specialization (WS) schedules for tensor core GPU kernels. It formulates the joint optimization as a constraint satisfaction problem and solves it with off-the-shelf ILP and SMT solvers.

**Key result:** Twill automatically rediscovers the expert-designed schedules of FlashAttention-3 (Hopper) and FlashAttention-4 (Blackwell), proving that these hand-crafted schedules are optimal.

## Architecture

The solver runs in two phases, following Algorithm 1 from the paper:

```
Phase 1: ILP Modulo Scheduling (CBC solver via PuLP)
  → Finds the optimal initiation interval I and an initial schedule M

Phase 2: SMT Joint SWP + WS (Z3 solver)
  → Finds the optimal schedule M* and warp assignment A*
  → Encodes the constraints from Figures 4, 5, and 6 of the paper

Cost Normalization (Section 5.2):
  → Reduces cycle counts while preserving their ratios
  → Makes the ILP/SMT problems tractable for real GPU cycle counts (~1000 cycles)
```

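Before Phase 1's ILP search begins, classic modulo scheduling theory gives a lower bound on the initiation interval: `MII = max(ResMII, RecMII)`, the larger of the resource-constrained and recurrence-constrained bounds. The sketch below is not part of Twill's API; it is a hypothetical, self-contained illustration of that bound on a toy kernel (the op names, functional units, and cycle counts are made up for the example).

```python
from math import ceil

def res_mii(ops, capacity):
    """Resource bound: each functional unit's busy cycles per iteration,
    divided by how many ops that unit can service per cycle."""
    busy = {}
    for op in ops:
        busy[op["fu"]] = busy.get(op["fu"], 0) + op["cycles"]
    return max(ceil(c / capacity[fu]) for fu, c in busy.items())

def rec_mii(recurrences):
    """Recurrence bound: for each dependence cycle in the graph,
    total delay divided by total loop-carried distance."""
    return max(ceil(delay / dist) for delay, dist in recurrences)

# Toy kernel: a TMA load, a tensor-core GEMM, and a softmax-style exp op.
ops = [
    {"name": "load", "fu": "TMA", "cycles": 2},
    {"name": "gemm", "fu": "TC", "cycles": 4},
    {"name": "exp", "fu": "EXP", "cycles": 2},
]
capacity = {"TMA": 1, "TC": 1, "EXP": 1}
recurrences = [(4, 2)]  # one cycle: 4 cycles of delay over distance 2

mii = max(res_mii(ops, capacity), rec_mii(recurrences))
print(mii)  # 4: the tensor core is busy 4 cycles per iteration
```

The ILP then searches upward from this bound, so the first feasible I it finds is optimal.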
## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib
```

```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve
from twill.visualization import visualize_schedule

# Build the simplified Flash Attention dependence graph (Figure 1)
graph = flash_attention_forward_simplified()

# Run the full Twill solver
result = twill_solve(graph, max_I=5, verbose=True)

# Visualize the result
print(visualize_schedule(graph, result.joint_result))
```

Output:
```
SOLUTION FOUND in 0.12s
Initiation Interval I = 2  ← optimal!
Schedule Length L = 4
Overlapping copies = 2
Schedule: S@0, P@2, O@3  ← S extracted into prologue
Warp Assignment: all on warp 0 (no variable-latency ops)
```

## Pre-built Kernel Descriptions

| Kernel | Section | Architecture | Key Result |
|--------|---------|--------------|------------|
| `flash_attention_forward_simplified()` | §3 (Figure 1) | Hopper | I=2, SWP extracts S into prologue |
| `flash_attention_forward_hopper()` | §6.2.1 | Hopper (H100) | Rediscovers FA3 pipeline + ping-pong |
| `flash_attention_forward_blackwell()` | §6.2.2 | Blackwell (B200) | Rediscovers FA4 strategy |
| `simple_gemm_pipeline()` | – | Hopper | Load–compute overlap, TMA on producer warp |

## Custom Kernels

Define your own kernel dependence graph:

```python
from twill.graph import hopper_machine
from twill.kernels import custom_kernel
from twill.twill_solver import twill_solve

machine = hopper_machine()
graph = custom_kernel(
    machine=machine,
    instructions=[
        {"name": "load_A", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
        {"name": "load_B", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
        {"name": "gemm", "cycles": 2, "fu": "TC"},
        {"name": "relu", "cycles": 1, "fu": "EXP"},
    ],
    edges=[
        {"src": "load_A", "dst": "gemm", "delay": 1},
        {"src": "load_B", "dst": "gemm", "delay": 1},
        {"src": "gemm", "dst": "relu", "delay": 2},
        {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},  # loop-carried
    ],
)

result = twill_solve(graph, verbose=True)
```

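What makes a candidate schedule valid for a graph like the one above? Each edge requires `start[dst] - start[src] >= delay - I * delta` (where `delta` is the loop-carried distance, 0 if omitted), and ops issuing at the same cycle modulo I must not oversubscribe a functional unit. The checker below is a hypothetical sketch, not part of Twill; it verifies those two constraint groups directly, using the same edge list as the example (the per-unit capacities are assumptions for illustration).

```python
def check_schedule(I, start, edges, ops, capacity):
    # Dependence: start[dst] - start[src] >= delay - I * delta,
    # where delta is the loop-carried distance (0 for same-iteration edges).
    for e in edges:
        if start[e["dst"]] - start[e["src"]] < e["delay"] - I * e.get("delta", 0):
            return False
    # Capacity: ops issuing at the same cycle modulo I share a
    # functional unit only up to its capacity.
    used = {}
    for op in ops:
        key = (op["fu"], start[op["name"]] % I)
        used[key] = used.get(key, 0) + 1
        if used[key] > capacity[op["fu"]]:
            return False
    return True

ops = [
    {"name": "load_A", "fu": "TMA"}, {"name": "load_B", "fu": "TMA"},
    {"name": "gemm", "fu": "TC"}, {"name": "relu", "fu": "EXP"},
]
edges = [
    {"src": "load_A", "dst": "gemm", "delay": 1},
    {"src": "load_B", "dst": "gemm", "delay": 1},
    {"src": "gemm", "dst": "relu", "delay": 2},
    {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},  # loop-carried
]
capacity = {"TMA": 2, "TC": 1, "EXP": 1}

ok = {"load_A": 0, "load_B": 0, "gemm": 1, "relu": 3}
bad = {"load_A": 0, "load_B": 0, "gemm": 0, "relu": 3}
print(check_schedule(2, ok, edges, ops, capacity))   # True
print(check_schedule(2, bad, edges, ops, capacity))  # False: gemm issues before its loads land
```

Twill's ILP encodes exactly this kind of feasibility question as constraints and lets the solver find `start` instead of checking a guess.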
## Code Generation

Twill generates three output formats:

```python
from twill.codegen import generate_pseudocode, generate_cuda_skeleton, generate_pipelined_code

# Human-readable pseudocode with warp annotations
print(generate_pseudocode(graph, result.joint_result))

# CUDA C++ skeleton with warp-specialized structure
print(generate_cuda_skeleton(graph, result.joint_result))

# Structured PipelinedCode object for further processing
code = generate_pipelined_code(graph, result.joint_result)
```

## Module Structure

```
twill/
├── __init__.py           # Package exports
├── graph.py              # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py # Section 5.2: ILP-based cycle count normalization
├── modulo_scheduler.py   # Phase 1: ILP modulo scheduling (CBC solver)
├── smt_joint.py          # Phase 2: SMT joint SWP+WS (Z3 solver)
├── twill_solver.py       # Algorithm 1: Main search procedure
├── codegen.py            # Code generation (pseudocode, CUDA skeleton)
├── visualization.py      # Schedule visualization (text + matplotlib)
└── kernels.py            # Pre-built kernel descriptions (FMHA, GEMM)
```

## Constraint Groups (from the paper)

### Figure 4: Modulo Scheduling Constraints
- **Uniqueness**: each operation is scheduled exactly once per iteration copy
- **Consistency**: the modulo structure is preserved across copies (offset by I)
- **Completion**: operations must finish before the end of the schedule
- **Dependence**: data dependencies are respected across iterations
- **Capacity**: functional unit capacities are not exceeded

### Figure 5: Memory Allocation Constraints
- **Memory Capacity**: the working set fits in on-chip memory (SMEM, TMEM, registers)
- **Liveness**: SSA-based backward dataflow computes variable lifetimes

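The memory-capacity check reduces to a classic interval problem: once liveness gives each value a live range [def, last use], the working set fits on chip iff the peak number of simultaneously live bytes stays under the hardware limit. This is a hypothetical sketch, not Twill's SMT encoding; the tile names and sizes are invented, and the 228 KB figure is the H100 per-SM shared-memory ceiling.

```python
def peak_live_bytes(values):
    """values: {name: (def_cycle, last_use_cycle, size_bytes)}.
    Sweep over birth/death events and track the running live total."""
    events = []
    for start, end, size in values.values():
        events.append((start, size))      # value becomes live at its definition
        events.append((end + 1, -size))   # value dies after its last use
    live, peak = 0, 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

tiles = {
    "K_tile": (0, 2, 64 * 1024),  # loaded at cycle 0, last read at cycle 2
    "V_tile": (1, 4, 64 * 1024),
    "S_tile": (2, 4, 32 * 1024),
}
smem_capacity = 228 * 1024  # H100 shared memory per SM

print(peak_live_bytes(tiles))                   # 163840 (160 KB, at cycle 2)
print(peak_live_bytes(tiles) <= smem_capacity)  # True: this allocation fits
```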
### Figure 6: Warp Assignment Constraints
- **WarpUniqueness**: each instruction is assigned to exactly one warp
- **VariableLatency**: variable-latency ops go to a dedicated producer warp
- **WarpCapacity**: the per-warp resource budget is respected

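Checked directly rather than encoded in SMT, the three Figure 6 constraint groups look like the hypothetical sketch below (not Twill's encoding; the register counts and budget are made-up illustration values).

```python
def check_warp_assignment(assign, ops, regs_per_warp):
    # WarpUniqueness: assign maps each instruction to exactly one warp.
    if set(assign) != {op["name"] for op in ops}:
        return False
    # VariableLatency: variable-latency ops (e.g. TMA loads) must sit on a
    # producer warp that no fixed-latency compute op shares.
    producers = {assign[op["name"]] for op in ops if op.get("variable_latency")}
    consumers = {assign[op["name"]] for op in ops if not op.get("variable_latency")}
    if producers & consumers:
        return False
    # WarpCapacity: the per-warp register budget is respected.
    used = {}
    for op in ops:
        used[assign[op["name"]]] = used.get(assign[op["name"]], 0) + op.get("regs", 0)
    return all(r <= regs_per_warp for r in used.values())

ops = [
    {"name": "load", "variable_latency": True, "regs": 8},
    {"name": "gemm", "regs": 160},
    {"name": "softmax", "regs": 64},
]

print(check_warp_assignment({"load": 0, "gemm": 1, "softmax": 1}, ops, 240))  # True
print(check_warp_assignment({"load": 1, "gemm": 1, "softmax": 1}, ops, 240))  # False: load shares a warp with compute
```

The separation enforced here is the heart of warp specialization: the producer warp can block on asynchronous loads without stalling the compute warps.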
## Solvers Used

| Component | Solver | Theory | Paper Reference |
|-----------|--------|--------|-----------------|
| Cost Normalization | CBC (PuLP) | ILP | Section 5.2 (paper uses SCIP) |
| Modulo Scheduling | CBC (PuLP) | ILP | Section 3.1, Stoutchinin et al. |
| Joint SWP + WS | Z3 | QFLIA (SMT) | Section 4 (paper uses Yices2) |

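To see why cost normalization matters: solver time grows with the magnitude of the cycle counts, but schedules depend only on their ratios. The paper's Section 5.2 formulation is itself an ILP; the sketch below shows only the simplest special case of the idea, exact rescaling by the greatest common divisor, with invented cycle counts.

```python
from math import gcd
from functools import reduce

def normalize(costs):
    """Divide all cycle counts by their GCD: solver variables shrink
    while every latency ratio is preserved exactly."""
    g = reduce(gcd, costs.values())
    return {name: c // g for name, c in costs.items()}

raw = {"tma_load": 800, "mma": 400, "softmax": 200}  # raw cycle counts
print(normalize(raw))  # {'tma_load': 4, 'mma': 2, 'softmax': 1}
```

The real normalization also handles counts with no useful common divisor by allowing small, ILP-bounded perturbations of the ratios.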
## Tests

```bash
python test_twill.py
```

```
✓ PASS  Cost Normalization
✓ PASS  Modulo Scheduling Only
✓ PASS  Simplified FA (Figure 1)
✓ PASS  Simple GEMM
✓ PASS  Hopper FMHA Forward
✓ PASS  Blackwell FMHA Forward

Passed: 6/6
Total time: ~5s
```

## Limitations

Following the paper (Section 5.4):
- Only singly-nested loops without control flow are supported
- Tile sizes are not determined automatically (an external concern)
- Code generation produces skeletons, not fully compilable CUDA
  (the paper notes that even the authors' implementation required "hand-compilation"
  to CUDA C++ because Triton made incorrect decisions during code generation)

## Citation

```bibtex
@article{soi2025twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2025}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) – the Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) – the Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) – warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) – NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) – automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) – task-based GPU programming model