# Twill: Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Implementation of **["Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs"](https://arxiv.org/abs/2512.18134)** by Rupanshu Soi et al. (NVIDIA Research, 2024).

## What is Twill?

Twill is the first system to automatically derive **provably optimal** Software Pipelining (SWP) + Warp Specialization (WS) schedules for tensor core GPU kernels. It formulates the joint optimization as a constraint satisfaction problem and solves it with off-the-shelf ILP and SMT solvers.

**Key result:** Twill automatically rediscovers the expert-designed schedules of FlashAttention-3 (Hopper) and FlashAttention-4 (Blackwell), proving that these human-designed schedules are optimal.

## Architecture

The solver runs in two phases, following Algorithm 1 from the paper:

```
Phase 1: ILP Modulo Scheduling (CBC solver via PuLP)
  → Finds the optimal initiation interval I and an initial schedule M

Phase 2: SMT Joint SWP + WS (Z3 solver)
  → Finds the optimal schedule M* and warp assignment A*
  → Encodes the constraints from Figures 4, 5, and 6 of the paper

Cost Normalization (Section 5.2):
  → Reduces cycle counts while preserving their ratios
  → Makes the ILP/SMT problems tractable for real GPU cycle counts (~1000 cycles)
```
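
The Phase 1 search can be sketched in plain Python. This is a brute-force stand-in for the ILP formulation, not Twill's actual implementation: it tries initiation intervals I in increasing order and returns the first I admitting a feasible modulo schedule (start cycles are searched within a 2·I window, enough for this toy example).

```python
from itertools import product

def find_min_ii(ops, edges, num_units, max_ii=8):
    """Brute-force stand-in for Phase 1: search initiation intervals I
    upward and return the first (I, schedule) that is feasible.

    ops:   {name: latency}
    edges: (src, dst, delay, distance) -- distance > 0 is loop-carried
    num_units: how many ops may share an issue slot (capacity)
    """
    names = list(ops)
    for ii in range(1, max_ii + 1):
        for starts in product(range(2 * ii), repeat=len(names)):
            sched = dict(zip(names, starts))
            # Dependence: dst must start >= src + delay - distance * I.
            if any(sched[d] < sched[s] + delay - dist * ii
                   for s, d, delay, dist in edges):
                continue
            # Capacity: at most num_units ops may occupy a cycle mod I.
            slots = [start % ii for start in sched.values()]
            if any(slots.count(r) > num_units for r in set(slots)):
                continue
            return ii, sched
    return None

# Toy graph: load -> gemm -> relu, plus a loop-carried relu self-edge.
ops = {"load": 1, "gemm": 2, "relu": 1}
edges = [("load", "gemm", 1, 0), ("gemm", "relu", 2, 0), ("relu", "relu", 1, 1)]
print(find_min_ii(ops, edges, num_units=1))  # minimal I is 3 here
```

With a single issue slot, three operations force I ≥ 3 regardless of dependences; the real solver derives such bounds through the ILP rather than enumeration.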

## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib
```

```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve
from twill.visualization import visualize_schedule

# Build the simplified Flash Attention dependence graph (Figure 1)
graph = flash_attention_forward_simplified()

# Run the full Twill solver
result = twill_solve(graph, max_I=5, verbose=True)

# Visualize the result
print(visualize_schedule(graph, result.joint_result))
```

Output:
```
SOLUTION FOUND in 0.12s
Initiation Interval I = 2 → optimal!
Schedule Length L = 4
Overlapping copies = 2
Schedule: S@0, P@2, O@3 → S extracted into prologue
Warp Assignment: all on warp 0 (no variable-latency ops)
```
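
The initiation interval determines steady-state throughput: a new iteration launches every I cycles, so n iterations take roughly L + (n − 1)·I cycles, and ⌈L/I⌉ iteration copies are in flight at once. A back-of-the-envelope model (ignoring prologue/epilogue details; helper names are illustrative):

```python
from math import ceil

def pipeline_cycles(n_iters, ii, length):
    """Approximate total cycles for a software-pipelined loop:
    the first iteration takes `length` cycles, each later one adds `ii`."""
    return length + (n_iters - 1) * ii

def overlapping_copies(ii, length):
    """How many iteration copies are in flight at once."""
    return ceil(length / ii)

print(overlapping_copies(2, 4))    # 2, matching the solver output above
print(pipeline_cycles(100, 2, 4))  # 202 cycles, vs. 400 without pipelining
```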

## Pre-built Kernel Descriptions

| Kernel | Section | Architecture | Key Result |
|--------|---------|--------------|------------|
| `flash_attention_forward_simplified()` | §3 (Figure 1) | Hopper | I=2, SWP extracts S into prologue |
| `flash_attention_forward_hopper()` | §6.2.1 | Hopper (H100) | Rediscovers FA3 pipeline + ping-pong |
| `flash_attention_forward_blackwell()` | §6.2.2 | Blackwell (B200) | Rediscovers FA4 strategy |
| `simple_gemm_pipeline()` | – | Hopper | Load-compute overlap, TMA on producer warp |

## Custom Kernels

Define your own kernel dependence graph:

```python
from twill.graph import hopper_machine
from twill.kernels import custom_kernel
from twill.twill_solver import twill_solve

machine = hopper_machine()
graph = custom_kernel(
    machine=machine,
    instructions=[
        {"name": "load_A", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
        {"name": "load_B", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
        {"name": "gemm", "cycles": 2, "fu": "TC"},
        {"name": "relu", "cycles": 1, "fu": "EXP"},
    ],
    edges=[
        {"src": "load_A", "dst": "gemm", "delay": 1},
        {"src": "load_B", "dst": "gemm", "delay": 1},
        {"src": "gemm", "dst": "relu", "delay": 2},
        {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},  # loop-carried
    ],
)

result = twill_solve(graph, verbose=True)
```
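
Loop-carried edges (`delta > 0`) bound the initiation interval from below: for every dependence cycle, I ≥ ⌈total delay / total distance⌉ (the classic recurrence-constrained MII). A quick way to sanity-check a custom graph before invoking the solver; `rec_mii` is a hypothetical helper, not part of Twill's API:

```python
from math import ceil

def rec_mii(edges):
    """Recurrence-constrained lower bound on the initiation interval:
    I >= ceil(sum of delays / sum of distances) over every dependence cycle.
    edges: list of dicts with "src", "dst", "delay", and optional "delta"."""
    adj = {}
    for e in edges:
        adj.setdefault(e["src"], []).append(e)

    best = 1
    def dfs(node, start, delay, dist, seen):
        nonlocal best
        for e in adj.get(node, []):
            d, k = delay + e["delay"], dist + e.get("delta", 0)
            if e["dst"] == start:
                if k > 0:
                    best = max(best, ceil(d / k))
            elif e["dst"] not in seen:
                dfs(e["dst"], start, d, k, seen | {e["dst"]})

    for v in adj:
        dfs(v, v, 0, 0, {v})
    return best

edges = [
    {"src": "load_A", "dst": "gemm", "delay": 1},
    {"src": "load_B", "dst": "gemm", "delay": 1},
    {"src": "gemm", "dst": "relu", "delay": 2},
    {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},
]
print(rec_mii(edges))  # the relu self-cycle only forces I >= ceil(1/1) = 1

# A longer recurrence tightens the bound: gemm -> relu -> gemm carries
# total delay 5 over distance 1, forcing I >= 5.
edges.append({"src": "relu", "dst": "gemm", "delay": 3, "delta": 1})
print(rec_mii(edges))  # 5
```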

## Code Generation

Twill generates three output formats:

```python
from twill.codegen import generate_pseudocode, generate_cuda_skeleton, generate_pipelined_code

# Human-readable pseudocode with warp annotations
print(generate_pseudocode(graph, result.joint_result))

# CUDA C++ skeleton with warp-specialized structure
print(generate_cuda_skeleton(graph, result.joint_result))

# Structured PipelinedCode object for further processing
code = generate_pipelined_code(graph, result.joint_result)
```

## Module Structure

```
twill/
├── __init__.py            # Package exports
├── graph.py               # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py  # Section 5.2: ILP-based cycle count normalization
├── modulo_scheduler.py    # Phase 1: ILP modulo scheduling (CBC solver)
├── smt_joint.py           # Phase 2: SMT joint SWP+WS (Z3 solver)
├── twill_solver.py        # Algorithm 1: Main search procedure
├── codegen.py             # Code generation (pseudocode, CUDA skeleton)
├── visualization.py       # Schedule visualization (text + matplotlib)
└── kernels.py             # Pre-built kernel descriptions (FMHA, GEMM)
```

## Constraint Groups (from the paper)

### Figure 4: Modulo Scheduling Constraints
- **Uniqueness**: Each operation scheduled exactly once per iteration copy
- **Consistency**: Modulo structure preserved across copies (offset by I)
- **Completion**: Operations must finish before the end of the schedule
- **Dependence**: Data dependencies respected across iterations
- **Capacity**: Functional unit capacities not exceeded

### Figure 5: Memory Allocation Constraints
- **Memory Capacity**: Working set fits in on-chip memory (SMEM, TMEM, registers)
- **Liveness**: SSA-based backward dataflow for variable lifetimes

### Figure 6: Warp Assignment Constraints
- **WarpUniqueness**: Each instruction assigned to exactly one warp
- **VariableLatency**: Variable-latency ops go to a dedicated producer warp
- **WarpCapacity**: Per-warp resource budget respected
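
The Figure 6 groups can be illustrated with a tiny pure-Python checker. This is an illustrative sketch with hypothetical names (`check_warp_assignment`, `reg_budget`); Twill encodes these conditions as SMT constraints rather than checking candidate assignments:

```python
def check_warp_assignment(assign, instrs, num_warps, reg_budget, producer_warp=0):
    """Check a candidate warp assignment against the Figure 6 groups.

    assign: {instr: warp id}
    instrs: {instr: {"variable_latency": bool, "regs": int}}
    """
    # WarpUniqueness: every instruction maps to exactly one valid warp.
    if set(assign) != set(instrs):
        return False
    if any(not 0 <= w < num_warps for w in assign.values()):
        return False
    # VariableLatency: variable-latency ops must sit on the producer warp.
    if any(meta.get("variable_latency") and assign[name] != producer_warp
           for name, meta in instrs.items()):
        return False
    # WarpCapacity: per-warp register usage stays within budget.
    usage = {}
    for name, meta in instrs.items():
        usage[assign[name]] = usage.get(assign[name], 0) + meta.get("regs", 0)
    return all(u <= reg_budget for u in usage.values())

instrs = {
    "load_A": {"variable_latency": True, "regs": 8},
    "gemm":   {"variable_latency": False, "regs": 64},
    "relu":   {"variable_latency": False, "regs": 16},
}
print(check_warp_assignment({"load_A": 0, "gemm": 1, "relu": 1},
                            instrs, num_warps=2, reg_budget=128))  # True
print(check_warp_assignment({"load_A": 1, "gemm": 1, "relu": 1},
                            instrs, num_warps=2, reg_budget=128))  # False: TMA off producer warp
```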

## Solvers Used

| Component | Solver | Theory | Paper Reference |
|-----------|--------|--------|-----------------|
| Cost Normalization | CBC (PuLP) | ILP | Section 5.2 (paper uses SCIP) |
| Modulo Scheduling | CBC (PuLP) | ILP | Section 3.1, Stoutchinin et al. |
| Joint SWP + WS | Z3 | QF_LIA (SMT) | Section 4 (paper uses Yices2) |

## Tests

```bash
python test_twill.py
```

```
✓ PASS Cost Normalization
✓ PASS Modulo Scheduling Only
✓ PASS Simplified FA (Figure 1)
✓ PASS Simple GEMM
✓ PASS Hopper FMHA Forward
✓ PASS Blackwell FMHA Forward

Passed: 6/6
Total time: ~5s
```

## Limitations

Following the paper (Section 5.4):
- Only singly-nested loops without control flow are supported
- Tile sizes are not determined automatically (an external concern)
- Code generation produces skeletons, not fully compilable CUDA; the paper notes that even the authors' implementation required "hand-compilation" to CUDA C++ because Triton made incorrect decisions during code generation

## Citation

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) – Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) – Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) – warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) – NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) – automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) – task-based GPU programming model