# Twill + GauS: Optimal GPU Kernel Scheduling

Implementation of two complementary papers for GPU kernel scheduling optimization:

1. **[Twill](https://arxiv.org/abs/2512.18134)** — "Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs" (Soi et al., NVIDIA Research, 2024). An exact ILP+SMT solver that finds provably optimal schedules.
2. **[GauS](https://arxiv.org/abs/2602.20427)** — "Differentiable Scheduling Optimization via Gaussian Reparameterization" (Cai et al., 2026). A scalable gradient-based solver using Gaussian distributions and the Augmented Lagrangian Method.

## When to Use Which

| Property | Twill (ILP+SMT) | GauS (Differentiable) |
|----------|-----------------|-----------------------|
| **Optimality** | Provably optimal | Approximate (Pareto-optimal) |
| **Speed on small graphs** | Fast (< 1s for 3-7 nodes) | Slower (~10s for 3-7 nodes) |
| **Scalability** | Exponential in graph size | O(\|V\|) parameters, GPU-accelerated |
| **Warp specialization** | Full joint SWP+WS | Schedule only (no WS) |
| **Best for** | Kernels with < 50 ops | Large graphs with 100+ ops |

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│ DependenceGraph                                         │
│ G = (V, E), Machine Description, RRTs                   │
├────────────────────┬────────────────────────────────────┤
│ Twill Solver       │ GauS Solver                        │
│                    │                                    │
│ Phase 1: ILP       │ Gaussian Reparameterization        │
│ (CBC/PuLP)         │ X_i ~ N(μ_i, σ_i²)                 │
│ → Modulo Schedule  │ → P_i^d = Φ((d+0.5-μ)/σ)           │
│                    │         - Φ((d-0.5-μ)/σ)           │
│ Phase 2: SMT       │                                    │
│ (Z3/QFLIA)         │ Augmented Lagrangian Method        │
│ → Joint SWP + WS   │ → Adam optimizer on (μ, σ)         │
│                    │ → Dependency + Resource +          │
│ Cost Norm (§5.2)   │   Modulo + Recurrence losses       │
│ → Ratio-preserving │                                    │
│   cycle count      │ Legalization Heuristics            │
│   reduction        │ → Topological pass (regular)       │
│                    │ → Fixed-point iteration (modulo)   │
├────────────────────┴────────────────────────────────────┤
│ Code Generation                                         │
│ Pseudocode · CUDA Skeleton · Pipelined Schedule         │
└─────────────────────────────────────────────────────────┘
```

## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib torch
```

### Twill (exact solver)

```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve

graph = flash_attention_forward_simplified()
result = twill_solve(graph, max_I=5, verbose=True)
# → I=2, schedule: S@0, P@2, O@3 (proves FA3 optimal)
```

### GauS (differentiable solver)

```python
from twill.kernels import flash_attention_forward_hopper
from twill.gaus_solver import gaus_solve_twill_graph

graph = flash_attention_forward_hopper()
result = gaus_solve_twill_graph(graph, target_II=4, D=20, verbose=True)
# → II=4, feasible schedule found via gradient descent
```

### GauS on custom large graphs

```python
from twill.gaus_solver import GauSSolver, generate_random_dag

graph = generate_random_dag(num_nodes=1000, edge_density=0.01, num_back_edges=50)
solver = GauSSolver(graph, D=500, lr=0.01)
result = solver.solve_modulo(II=10, R_cap=50.0, max_iters=2000)
```

## Pre-built Kernel Descriptions

| Kernel | Architecture | Twill Result | GauS Result |
|--------|--------------|--------------|-------------|
| `flash_attention_forward_simplified()` | Hopper | I=2 ✓ (optimal) | II=2 ✓ (feasible) |
| `flash_attention_forward_hopper()` | Hopper (H100) | I=4, FA3 pipeline ✓ | II=4 ✓ (feasible) |
| `flash_attention_forward_blackwell()` | Blackwell (B200) | I=2, FA4 strategy ✓ | II=2 ✓ (feasible) |
| `simple_gemm_pipeline()` | Hopper | I=2, TMA on producer ✓ | II=2 ✓ (feasible) |

## Module Structure

```
twill/
├── __init__.py             # Package exports (v0.2.0)
├── graph.py                # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py   # §5.2: ILP-based cycle count normalization
├── modulo_scheduler.py     # Twill Phase 1: ILP modulo scheduling (CBC)
├── smt_joint.py            # Twill Phase 2: SMT joint SWP+WS (Z3)
├── twill_solver.py         # Twill Algorithm 1: Main search procedure
├── gaus_solver.py          # GauS: Differentiable scheduling (PyTorch)
├── codegen.py              # Code generation (pseudocode, CUDA skeleton)
├── visualization.py        # Schedule visualization (text + matplotlib)
└── kernels.py              # Pre-built kernel descriptions (FMHA, GEMM)
```

## GauS: Technical Details

### Gaussian Reparameterization (§3.1)

Each operator `v_i` is modeled as `X_i ~ N(μ_i, σ_i²)` with only **2|V| parameters** (vs. D·|V| for categorical approaches). The probability of scheduling at step `d`:

```
P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i)
```

### Differentiable Loss Functions

- **Dependency**: expected topological violations via CDF products
- **Resource**: LogSumExp smooth-max of per-step usage + ReLU violations
- **Modulo Resource**: Gaussian probabilities wrapped into an II-slot reservation table
- **Recurrence**: expected back-edge constraint violations
- **Memory**: CDF-based active-lifetime estimation
- **Communication**: expected edge lengths (compactness)

### Augmented Lagrangian Method (ALM)

```
L_total = L_primary + Σ_i (λ_i · V_i + ρ/2 · ||V_i||²)
λ_i ← λ_i + ρ · V_i   (dual update)
```

Hyperparameters (from the paper): `lr=0.01, ρ=1e-4, τ=1e-2, κ=1/6`

## Tests

```bash
python test_twill.py   # Twill: 6/6 pass (~5s)
python test_gaus.py    # GauS: 7/7 pass (~30s)
```

## Solvers Used

| Component | Solver | Theory | Paper |
|-----------|--------|--------|-------|
| Twill Cost Normalization | CBC (PuLP) | ILP | Soi et al. §5.2 |
| Twill Modulo Scheduling | CBC (PuLP) | ILP | Soi et al. §3.1 |
| Twill Joint SWP+WS | Z3 | QFLIA (SMT) | Soi et al. §4 |
| GauS Scheduling | PyTorch (Adam) | Differentiable | Cai et al. §3 |

## Citations

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}

@article{cai2026gaus,
  title={GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization},
  author={Cai, Yaohui and others},
  journal={arXiv preprint arXiv:2602.20427},
  year={2026}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) — Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) — Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) — warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) — NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) — automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) — task-based GPU programming model
- [Nautilus](https://arxiv.org/abs/2604.14825) — auto-scheduling tensor compiler (fills Twill's tile-size gap)
- [MPK](https://arxiv.org/abs/2512.22219) — cross-kernel software pipelining
- [GS-Schedule](https://github.com/Yu-Maryland/Differentiable_Scheduler_ICML24) — GauS predecessor (categorical approach)
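## Appendix: Worked Examples

The §3.1 reparameterization is easy to sanity-check in plain Python. The sketch below is illustrative only, not the package API: `norm_cdf`, `step_probs`, and the toy μ/σ/latency values are assumptions. It discretizes two Gaussians into step probabilities and estimates the expected violation of a latency-3 dependence edge by summing joint probability mass, in the spirit of the CDF-product dependency loss:

```python
import math

def norm_cdf(x):
    # Standard normal CDF Φ, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def step_probs(mu, sigma, D):
    # P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i), for steps d = 0..D-1.
    return [norm_cdf((d + 0.5 - mu) / sigma) - norm_cdf((d - 0.5 - mu) / sigma)
            for d in range(D)]

D = 12
p_a = step_probs(mu=2.0, sigma=0.7, D=D)   # producer op, centered on step 2
p_b = step_probs(mu=7.0, sigma=0.7, D=D)   # consumer op, centered on step 7

# Expected violation of edge a→b with latency 3: total joint probability
# mass on pairs (da, db) where b starts before a's result is ready.
viol = sum(p_a[da] * p_b[db]
           for da in range(D) for db in range(D) if db < da + 3)
# viol is small here because b's mass sits well past a's latency window.
```

Because only 2 parameters per operator enter `step_probs`, shrinking σ sharpens the distribution toward a one-hot schedule while the loss stays differentiable in μ and σ.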
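The modulo-resource loss folds each operator's step probabilities into an II-slot reservation table. A standalone sketch of that wrapping (illustrative; the helper names and toy μ/σ values are assumptions, not the package API):

```python
import math

def norm_cdf(x):
    # Standard normal CDF Φ, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def step_probs(mu, sigma, D):
    # P^d = Φ((d+0.5-μ)/σ) - Φ((d-0.5-μ)/σ), for steps d = 0..D-1.
    return [norm_cdf((d + 0.5 - mu) / sigma) - norm_cdf((d - 0.5 - mu) / sigma)
            for d in range(D)]

def modulo_usage(ops, II, D):
    # Fold each op's step probabilities into its residue class mod II and
    # accumulate expected usage per slot of the reservation table.
    slots = [0.0] * II
    for mu, sigma in ops:
        p = step_probs(mu, sigma, D)
        for d in range(D):
            slots[d % II] += p[d]
    return slots

# Two ops centered on steps 2 and 6 share residue 2 mod II=4, so the
# expected usage of slot 2 exceeds 1: a contended reservation-table slot
# that a resource loss would penalize.
usage = modulo_usage([(2.0, 0.5), (6.0, 0.5)], II=4, D=16)
```

In the differentiable setting this expected-usage table is compared against capacities with a smooth penalty, so gradients push contended operators into different residue classes.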
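The ALM recursion can be exercised end-to-end on a toy two-op instance. A minimal sketch in plain Python, assuming the standard inequality-constraint multiplier `max(0, λ + ρ·h)` and toy hyperparameters chosen for fast convergence of this small example (not the paper's values, and nothing here is the package API):

```python
# Toy schedule: start times x0, x1 with a compactness-style primary loss
# x0² + x1², subject to the dependence x1 - x0 ≥ 3 (i.e. h = 3 - (x1 - x0) ≤ 0).
x0, x1 = 0.0, 0.0
lam, rho, lr = 0.0, 1.0, 0.05   # toy values for this example only

for step in range(4000):
    h = 3.0 - (x1 - x0)              # positive ⇒ dependence violated
    mult = max(0.0, lam + rho * h)   # augmented multiplier for the inequality
    # Gradient step on x0² + x1² + mult·h, with ∂h/∂x0 = +1, ∂h/∂x1 = -1.
    x0 -= lr * (2.0 * x0 + mult)
    x1 -= lr * (2.0 * x1 - mult)
    if step % 200 == 199:            # periodic dual update: λ ← max(0, λ + ρ·h)
        lam = max(0.0, lam + rho * (3.0 - (x1 - x0)))

# Converges to x0 ≈ -1.5, x1 ≈ +1.5: the dependence ends up tight and the
# multiplier settles near its optimal value λ ≈ 3.
```

Each dual update raises λ while the constraint is still violated, so the penalty term does not need a huge ρ to enforce feasibility; that is the practical advantage of ALM over a plain quadratic penalty.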