AshenNav committed on
Commit 9e6c8b9 · verified · 1 Parent(s): 1717684

Upload README.md

Files changed (1):
  1. README.md +103 -128
README.md CHANGED
@@ -1,185 +1,150 @@
- # Twill: Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs
-
- Implementation of **["Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs"](https://arxiv.org/abs/2512.18134)** by Rupanshu Soi et al. (NVIDIA Research, 2024).
-
- ## What is Twill?
-
- Twill is the first system that automatically derives **provably optimal** Software Pipelining (SWP) + Warp Specialization (WS) schedules for tensor core GPU kernels. It formulates the joint optimization as a constraint satisfaction problem solved by off-the-shelf ILP and SMT solvers.
-
- **Key result:** Twill automatically rediscovers the expert-designed schedules of FlashAttention-3 (Hopper) and FlashAttention-4 (Blackwell) — proving these human-designed schedules are optimal.
-
- ## Architecture
-
- The solver has two phases, following Algorithm 1 from the paper:
-
  ```
- Phase 1: ILP Modulo Scheduling (CBC solver via PuLP)
-   → Finds optimal initiation interval I and initial schedule M
-
- Phase 2: SMT Joint SWP + WS (Z3 solver)
-   → Finds optimal schedule M* and warp assignment A*
-   → Encodes constraints from Figures 4, 5, 6 of the paper
-
- Cost Normalization (Section 5.2):
-   → Reduces cycle counts while preserving ratios
-   → Makes the ILP/SMT problems tractable for real GPU cycle counts (~1000 cycles)
  ```

 ## Quick Start

 ```bash
- pip install pulp z3-solver numpy matplotlib
 ```

 ```python
 from twill.kernels import flash_attention_forward_simplified
 from twill.twill_solver import twill_solve
- from twill.visualization import visualize_schedule

- # Build the simplified Flash Attention dependence graph (Figure 1)
 graph = flash_attention_forward_simplified()
-
- # Run the full Twill solver
 result = twill_solve(graph, max_I=5, verbose=True)
-
- # Visualize the result
- print(visualize_schedule(graph, result.joint_result))
 ```

- Output:
- ```
- SOLUTION FOUND in 0.12s
- Initiation Interval I = 2      ← optimal!
- Schedule Length L = 4
- Overlapping copies = 2
- Schedule: S@0, P@2, O@3        ← S extracted into prologue
- Warp Assignment: all on warp 0 (no variable-latency ops)
- ```
-
- ## Pre-built Kernel Descriptions
-
- | Kernel | Section | Architecture | Key Result |
- |--------|---------|--------------|------------|
- | `flash_attention_forward_simplified()` | §3 (Figure 1) | Hopper | I=2, SWP extracts S into prologue |
- | `flash_attention_forward_hopper()` | §6.2.1 | Hopper (H100) | Rediscovers FA3 pipeline + ping-pong |
- | `flash_attention_forward_blackwell()` | §6.2.2 | Blackwell (B200) | Rediscovers FA4 strategy |
- | `simple_gemm_pipeline()` | — | Hopper | Load-compute overlap, TMA on producer warp |
-
- ## Custom Kernels
-
- Define your own kernel dependence graph:
-
 ```python
- from twill.graph import hopper_machine
- from twill.kernels import custom_kernel
- from twill.twill_solver import twill_solve
-
- machine = hopper_machine()
- graph = custom_kernel(
-     machine=machine,
-     instructions=[
-         {"name": "load_A", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
-         {"name": "load_B", "cycles": 1, "fu": "TMA", "variable_latency": True, "streaming": True},
-         {"name": "gemm", "cycles": 2, "fu": "TC"},
-         {"name": "relu", "cycles": 1, "fu": "EXP"},
-     ],
-     edges=[
-         {"src": "load_A", "dst": "gemm", "delay": 1},
-         {"src": "load_B", "dst": "gemm", "delay": 1},
-         {"src": "gemm", "dst": "relu", "delay": 2},
-         {"src": "relu", "dst": "relu", "delay": 1, "delta": 1},  # loop-carried dependence
-     ],
- )
-
- result = twill_solve(graph, verbose=True)
 ```

- ## Code Generation
-
- Twill generates three output formats:
-
 ```python
- from twill.codegen import generate_pseudocode, generate_cuda_skeleton, generate_pipelined_code
-
- # Human-readable pseudocode with warp annotations
- print(generate_pseudocode(graph, result.joint_result))
-
- # CUDA C++ skeleton with warp-specialized structure
- print(generate_cuda_skeleton(graph, result.joint_result))
-
- # Structured PipelinedCode object for further processing
- code = generate_pipelined_code(graph, result.joint_result)
- ```


 ## Module Structure

 ```
 twill/
- ├── __init__.py            # Package exports
 ├── graph.py               # DependenceGraph, Instruction, RRT, MachineDescription
- ├── cost_normalization.py  # Section 5.2: ILP-based cycle count normalization
- ├── modulo_scheduler.py    # Phase 1: ILP modulo scheduling (CBC solver)
- ├── smt_joint.py           # Phase 2: SMT joint SWP+WS (Z3 solver)
- ├── twill_solver.py        # Algorithm 1: Main search procedure
 ├── codegen.py             # Code generation (pseudocode, CUDA skeleton)
 ├── visualization.py       # Schedule visualization (text + matplotlib)
 └── kernels.py             # Pre-built kernel descriptions (FMHA, GEMM)
 ```

- ## Constraint Groups (from the paper)
-
- ### Figure 4: Modulo Scheduling Constraints
- - **Uniqueness**: Each operation scheduled exactly once per iteration copy
- - **Consistency**: Modulo structure preserved across copies (offset by I)
- - **Completion**: Operations must finish before end of schedule
- - **Dependence**: Data dependencies respected across iterations
- - **Capacity**: Functional unit capacities not exceeded
-
- ### Figure 5: Memory Allocation Constraints
- - **Memory Capacity**: Working set fits in on-chip memory (SMEM, TMEM, registers)
- - **Liveness**: SSA-based backward dataflow for variable lifetimes
-
- ### Figure 6: Warp Assignment Constraints
- - **WarpUniqueness**: Each instruction assigned to exactly one warp
- - **VariableLatency**: Variable-latency ops go to dedicated producer warp
- - **WarpCapacity**: Per-warp resource budget respected
-
- ## Solvers Used
-
- | Component | Solver | Theory | Paper Reference |
- |-----------|--------|--------|-----------------|
- | Cost Normalization | CBC (PuLP) | ILP | Section 5.2 (paper uses SCIP) |
- | Modulo Scheduling | CBC (PuLP) | ILP | Section 3.1, Stoutchinin et al. |
- | Joint SWP + WS | Z3 | QFLIA (SMT) | Section 4 (paper uses Yices2) |
 

 ## Tests

 ```bash
- python test_twill.py
- ```
-
- ```
- ✓ PASS Cost Normalization
- ✓ PASS Modulo Scheduling Only
- ✓ PASS Simplified FA (Figure 1)
- ✓ PASS Simple GEMM
- ✓ PASS Hopper FMHA Forward
- ✓ PASS Blackwell FMHA Forward
-
- Passed: 6/6
- Total time: ~5s
 ```

- ## Limitations
-
- Following the paper (Section 5.4):
- - Only supports singly-nested loops without control flow
- - Tile size is not automatically determined (external concern)
- - Code generation produces skeletons, not fully compilable CUDA
-   (the paper notes that even their implementation required "hand-compilation" to CUDA C++
-   because Triton made incorrect decisions during code generation)
- ## Citation

 ```bibtex
 @article{soi2024twill,
@@ -188,6 +153,13 @@ Following the paper (Section 5.4):
   journal={arXiv preprint arXiv:2512.18134},
   year={2024}
 }
 ```

 ## Related Work
@@ -198,3 +170,6 @@ Following the paper (Section 5.4):
 - [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) — NVIDIA GEMM templates with WS
 - [Tawa](https://arxiv.org/abs/2510.14719) — Automatic WS compiler (downstream of Twill)
 - [Cypress](https://arxiv.org/abs/2504.07004) — Task-based GPU programming model
+ # Twill + GauS: Optimal GPU Kernel Scheduling

+ Implementation of two complementary papers on GPU kernel scheduling optimization:

+ 1. **[Twill](https://arxiv.org/abs/2512.18134)** — "Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs" (Soi et al., NVIDIA Research, 2024). An exact ILP+SMT solver that finds provably optimal schedules.

+ 2. **[GauS](https://arxiv.org/abs/2602.20427)** — "Differentiable Scheduling Optimization via Gaussian Reparameterization" (Cai et al., 2026). A scalable gradient-based solver using Gaussian distributions and the Augmented Lagrangian Method.

+ ## When to Use Which

+ | Property | Twill (ILP+SMT) | GauS (Differentiable) |
+ |----------|-----------------|-----------------------|
+ | **Optimality** | Provably optimal | Approximate (Pareto-optimal) |
+ | **Speed on small graphs** | Fast (< 1 s for 3-7 nodes) | Slower (~10 s for 3-7 nodes) |
+ | **Scalability** | Exponential in graph size | O(\|V\|) parameters, GPU-accelerable |
+ | **Warp specialization** | Full joint SWP+WS | Schedule only (no WS) |
+ | **Best for** | Kernels with < 50 ops | Large graphs with 100+ ops |
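
Read as a decision rule, the table above reduces to a one-line dispatcher. This is illustrative only: `choose_solver` is not part of either tool's API, and the 50-op threshold simply restates the "Best for" row.

```python
# Illustrative dispatcher; the 50-op cutoff restates the table's "Best for" row.
def choose_solver(num_ops: int) -> str:
    # Twill's exact ILP+SMT search scales exponentially with graph size,
    # so fall back to the gradient-based GauS solver for large graphs.
    return "twill" if num_ops < 50 else "gaus"

print(choose_solver(7), choose_solver(1000))
```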
+ ## Architecture

 ```
+ ┌──────────────────────────────────────────────────────────┐
+ │                    DependenceGraph                       │
+ │          G = (V, E), Machine Description, RRTs           │
+ ├────────────────────┬─────────────────────────────────────┤
+ │ Twill Solver       │ GauS Solver                         │
+ │                    │                                     │
+ │ Phase 1: ILP       │ Gaussian Reparameterization         │
+ │  (CBC/PuLP)        │  X_i ~ N(μ_i, σ_i²)                 │
+ │  → Modulo Schedule │  → P_i^d = Φ((d+0.5-μ)/σ)           │
+ │                    │           - Φ((d-0.5-μ)/σ)          │
+ │ Phase 2: SMT       │                                     │
+ │  (Z3/QFLIA)        │ Augmented Lagrangian Method         │
+ │  → Joint SWP + WS  │  → Adam optimizer on (μ, σ)         │
+ │                    │  → Dependency + Resource +          │
+ │ Cost Norm (§5.2)   │    Modulo + Recurrence losses       │
+ │  → Ratio-preserving│                                     │
+ │    cycle count     │ Legalization Heuristics             │
+ │    reduction       │  → Topological pass (regular)       │
+ │                    │  → Fixed-point iteration (modulo)   │
+ ├────────────────────┴─────────────────────────────────────┤
+ │                     Code Generation                      │
+ │      Pseudocode · CUDA Skeleton · Pipelined Schedule     │
+ └──────────────────────────────────────────────────────────┘
 ```

 ## Quick Start

 ```bash
+ pip install pulp z3-solver numpy matplotlib torch
 ```

+ ### Twill (exact solver)
 ```python
 from twill.kernels import flash_attention_forward_simplified
 from twill.twill_solver import twill_solve

 graph = flash_attention_forward_simplified()
 result = twill_solve(graph, max_I=5, verbose=True)
+ # → I=2, schedule: S@0, P@2, O@3 (proves FA3 optimal)
 ```

+ ### GauS (differentiable solver)
 ```python
+ from twill.kernels import flash_attention_forward_hopper
+ from twill.gaus_solver import gaus_solve_twill_graph

+ graph = flash_attention_forward_hopper()
+ result = gaus_solve_twill_graph(graph, target_II=4, D=20, verbose=True)
+ # → II=4, feasible schedule found via gradient descent
 ```

+ ### GauS on custom large graphs

 ```python
+ from twill.gaus_solver import GauSSolver, generate_random_dag

+ graph = generate_random_dag(num_nodes=1000, edge_density=0.01, num_back_edges=50)
+ solver = GauSSolver(graph, D=500, lr=0.01)
+ result = solver.solve_modulo(II=10, R_cap=50.0, max_iters=2000)
+ ```

+ ## Pre-built Kernel Descriptions

+ | Kernel | Architecture | Twill Result | GauS Result |
+ |--------|--------------|--------------|-------------|
+ | `flash_attention_forward_simplified()` | Hopper | I=2 ✓ (optimal) | II=2 ✓ (feasible) |
+ | `flash_attention_forward_hopper()` | Hopper (H100) | I=4, FA3 pipeline ✓ | II=4 ✓ (feasible) |
+ | `flash_attention_forward_blackwell()` | Blackwell (B200) | I=2, FA4 strategy ✓ | II=2 ✓ (feasible) |
+ | `simple_gemm_pipeline()` | Hopper | I=2, TMA on producer ✓ | II=2 ✓ (feasible) |

 ## Module Structure

 ```
 twill/
+ ├── __init__.py            # Package exports (v0.2.0)
 ├── graph.py               # DependenceGraph, Instruction, RRT, MachineDescription
+ ├── cost_normalization.py  # §5.2: ILP-based cycle count normalization
+ ├── modulo_scheduler.py    # Twill Phase 1: ILP modulo scheduling (CBC)
+ ├── smt_joint.py           # Twill Phase 2: SMT joint SWP+WS (Z3)
+ ├── twill_solver.py        # Twill Algorithm 1: Main search procedure
+ ├── gaus_solver.py         # GauS: Differentiable scheduling (PyTorch)
 ├── codegen.py             # Code generation (pseudocode, CUDA skeleton)
 ├── visualization.py       # Schedule visualization (text + matplotlib)
 └── kernels.py             # Pre-built kernel descriptions (FMHA, GEMM)
 ```

+ ## GauS: Technical Details

+ ### Gaussian Reparameterization (§3.1)
+ Each operator `v_i` is modeled as `X_i ~ N(μ_i, σ_i²)`, with only **2|V| parameters** (vs. D·|V| for categorical approaches). The probability of scheduling at step `d` is:

+ ```
+ P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i)
+ ```
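
A minimal sketch of this binning, in plain Python with `math.erf` standing in for Φ (the repository's `gaus_solver` would use differentiable PyTorch ops instead; `schedule_probs` is an illustrative name, not part of the package):

```python
import math

def norm_cdf(x: float) -> float:
    # Standard normal CDF Φ expressed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def schedule_probs(mu: float, sigma: float, D: int) -> list[float]:
    # P_i^d: mass of N(mu, sigma^2) falling in the unit bin [d-0.5, d+0.5)
    return [norm_cdf((d + 0.5 - mu) / sigma) - norm_cdf((d - 0.5 - mu) / sigma)
            for d in range(D)]

p = schedule_probs(mu=3.0, sigma=1.0, D=8)
print(max(range(8), key=lambda d: p[d]))  # most likely step is d = 3
```

As σ shrinks, the mass concentrates on a single bin, so the soft schedule converges to a hard integer assignment.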
 

+ ### Differentiable Loss Functions
+ - **Dependency**: Expected topological violations via CDF products
+ - **Resource**: LogSumExp smooth-max of per-step usage + ReLU violations
+ - **Modulo Resource**: Gaussian probabilities wrapped into an II-slot reservation table
+ - **Recurrence**: Expected back-edge constraint violations
+ - **Memory**: CDF-based estimation of active lifetimes
+ - **Communication**: Expected edge lengths (compactness)
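
Two of these losses can be sketched in a few lines: the resource loss's temperature-weighted LogSumExp with a ReLU hinge, and the modulo folding of per-step probabilities into an II-slot table. Function names and the τ default here are illustrative stand-ins, not the package's API.

```python
import math

def smooth_max(xs: list[float], tau: float = 0.01) -> float:
    # Temperature-weighted LogSumExp: a differentiable upper bound on max(xs)
    m = max(xs)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in xs))

def resource_loss(expected_usage: list[float], capacity: float) -> float:
    # Penalize the smoothed peak of per-step expected usage above capacity
    return max(0.0, smooth_max(expected_usage) - capacity)  # ReLU hinge

def modulo_usage(probs: list[float], II: int) -> list[float]:
    # Fold per-step probabilities into an II-slot modulo reservation table
    slots = [0.0] * II
    for d, p in enumerate(probs):
        slots[d % II] += p
    return slots

print(resource_loss([1.0, 2.0, 3.0], capacity=4.0))   # within budget → 0.0
print(modulo_usage([0.5, 0.25, 0.25, 0.0], II=2))     # → [0.75, 0.25]
```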

+ ### Augmented Lagrangian Method (ALM)
+ ```
+ L_total = L_primary + Σ_i (λ_i · V_i + ρ/2 · ||V_i||²)
+ λ_i ← λ_i + ρ · V_i    (dual update)
+ ```
+ Hyperparameters (from the paper): `lr=0.01, ρ=1e-4, τ=1e-2, κ=1/6`
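
One ALM step for a single constraint group can be sketched with scalars (in the actual solver, `V_i` would be one of the differentiable violation losses above and the penalty term would join the Adam objective; these helper names are illustrative):

```python
def alm_penalty(violation: float, lam: float, rho: float) -> float:
    # Augmented Lagrangian term for one constraint: λ·V + (ρ/2)·V²
    return lam * violation + 0.5 * rho * violation ** 2

def dual_update(lam: float, rho: float, violation: float) -> float:
    # λ_i ← λ_i + ρ·V_i : the multiplier grows while violations persist
    return lam + rho * violation

lam, rho = 0.0, 1e-4
pen = alm_penalty(2.0, lam, rho)  # 0.5 * 1e-4 * 2² = 2e-4
lam = dual_update(lam, rho, 2.0)  # λ rises to 2e-4
print(pen, lam)
```

Repeated dual updates steadily raise the price of any constraint that stays violated, pushing the optimizer toward feasibility without a huge fixed penalty weight.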
 ## Tests

 ```bash
+ python test_twill.py   # Twill: 6/6 pass (~5s)
+ python test_gaus.py    # GauS: 7/7 pass (~30s)
 ```

+ ## Solvers Used

+ | Component | Solver | Theory | Paper |
+ |-----------|--------|--------|-------|
+ | Twill Cost Normalization | CBC (PuLP) | ILP | Soi et al. §5.2 |
+ | Twill Modulo Scheduling | CBC (PuLP) | ILP | Soi et al. §3.1 |
+ | Twill Joint SWP+WS | Z3 | QFLIA (SMT) | Soi et al. §4 |
+ | GauS Scheduling | PyTorch (Adam) | Differentiable | Cai et al. §3 |

+ ## Citations

 ```bibtex
 @article{soi2024twill,
   journal={arXiv preprint arXiv:2512.18134},
   year={2024}
 }
+
+ @article{cai2026gaus,
+   title={GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization},
+   author={Cai, Yaohui and others},
+   journal={arXiv preprint arXiv:2602.20427},
+   year={2026}
+ }
 ```

 ## Related Work

 - [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) — NVIDIA GEMM templates with WS
 - [Tawa](https://arxiv.org/abs/2510.14719) — Automatic WS compiler (downstream of Twill)
 - [Cypress](https://arxiv.org/abs/2504.07004) — Task-based GPU programming model
+ - [Nautilus](https://arxiv.org/abs/2604.14825) — Auto-scheduling tensor compiler (fills Twill's tile-size gap)
+ - [MPK](https://arxiv.org/abs/2512.22219) — Cross-kernel software pipelining
+ - [GS-Schedule](https://github.com/Yu-Maryland/Differentiable_Scheduler_ICML24) — GauS predecessor (categorical approach)