# Twill + GauS: Optimal GPU Kernel Scheduling

Implementation of two complementary papers for GPU kernel scheduling optimization:

1. **[Twill](https://arxiv.org/abs/2512.18134)** — "Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs" (Soi et al., NVIDIA Research, 2024). Exact ILP+SMT solver that finds provably optimal schedules.

2. **[GauS](https://arxiv.org/abs/2602.20427)** — "Differentiable Scheduling Optimization via Gaussian Reparameterization" (Cai et al., 2026). Scalable gradient-based solver using Gaussian distributions and the Augmented Lagrangian Method.

## When to Use Which

| Property | Twill (ILP+SMT) | GauS (Differentiable) |
|----------|-----------------|----------------------|
| **Optimality** | Provably optimal | Approximate (Pareto-optimal) |
| **Speed on small graphs** | Fast (< 1s for 3-7 nodes) | Slower (~10s for 3-7 nodes) |
| **Scalability** | Exponential in graph size | O(\|V\|) parameters, GPU-accelerated |
| **Warp specialization** | Full joint SWP+WS | Schedule only (no WS) |
| **Best for** | Kernels with < 50 ops | Large graphs with 100+ ops |

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                     DependenceGraph                      │
│          G = (V, E), Machine Description, RRTs           │
├────────────────────┬─────────────────────────────────────┤
│ Twill Solver       │ GauS Solver                         │
│                    │                                     │
│ Phase 1: ILP       │ Gaussian Reparameterization         │
│   (CBC/PuLP)       │   X_i ~ N(μ_i, σ_i²)                │
│   → Modulo Schedule│   P_i^d = Φ((d+0.5-μ)/σ)            │
│                    │         - Φ((d-0.5-μ)/σ)            │
│ Phase 2: SMT       │                                     │
│   (Z3/QFLIA)       │ Augmented Lagrangian Method         │
│   → Joint SWP + WS │   → Adam optimizer on (μ, σ)        │
│                    │   → Dependency + Resource +         │
│ Cost Norm (§5.2)   │     Modulo + Recurrence losses      │
│   → Ratio-         │                                     │
│     preserving     │ Legalization Heuristics             │
│     cycle count    │   → Topological pass (regular)      │
│     reduction      │   → Fixed-point iteration (modulo)  │
├────────────────────┴─────────────────────────────────────┤
│                     Code Generation                      │
│     Pseudocode · CUDA Skeleton · Pipelined Schedule      │
└──────────────────────────────────────────────────────────┘
```

## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib torch
```

### Twill (exact solver)
```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve

graph = flash_attention_forward_simplified()
result = twill_solve(graph, max_I=5, verbose=True)
# → I=2, schedule: S@0, P@2, O@3 (proves FA3 optimal)
```

### GauS (differentiable solver)
```python
from twill.kernels import flash_attention_forward_hopper
from twill.gaus_solver import gaus_solve_twill_graph

graph = flash_attention_forward_hopper()
result = gaus_solve_twill_graph(graph, target_II=4, D=20, verbose=True)
# → II=4, feasible schedule found via gradient descent
```

### GauS on custom large graphs
```python
from twill.gaus_solver import GauSSolver, generate_random_dag

graph = generate_random_dag(num_nodes=1000, edge_density=0.01, num_back_edges=50)
solver = GauSSolver(graph, D=500, lr=0.01)
result = solver.solve_modulo(II=10, R_cap=50.0, max_iters=2000)
```

## Pre-built Kernel Descriptions

| Kernel | Architecture | Twill Result | GauS Result |
|--------|-------------|-------------|-------------|
| `flash_attention_forward_simplified()` | Hopper | I=2 ✓ (optimal) | II=2 ✓ (feasible) |
| `flash_attention_forward_hopper()` | Hopper (H100) | I=4, FA3 pipeline ✓ | II=4 ✓ (feasible) |
| `flash_attention_forward_blackwell()` | Blackwell (B200) | I=2, FA4 strategy ✓ | II=2 ✓ (feasible) |
| `simple_gemm_pipeline()` | Hopper | I=2, TMA on producer ✓ | II=2 ✓ (feasible) |

## Module Structure

```
twill/
├── __init__.py            # Package exports (v0.2.0)
├── graph.py               # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py  # §5.2: ILP-based cycle count normalization
├── modulo_scheduler.py    # Twill Phase 1: ILP modulo scheduling (CBC)
├── smt_joint.py           # Twill Phase 2: SMT joint SWP+WS (Z3)
├── twill_solver.py        # Twill Algorithm 1: Main search procedure
├── gaus_solver.py         # GauS: Differentiable scheduling (PyTorch)
├── codegen.py             # Code generation (pseudocode, CUDA skeleton)
├── visualization.py       # Schedule visualization (text + matplotlib)
└── kernels.py             # Pre-built kernel descriptions (FMHA, GEMM)
```

## GauS: Technical Details

### Gaussian Reparameterization (§3.1)
Each operator `v_i` is modeled as `X_i ~ N(μ_i, σ_i²)` with only **2|V| parameters** (vs. D·|V| for categorical approaches). The probability of scheduling at step `d` is:

```
P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i)
```

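To make the formula above concrete, here is a minimal stdlib-only sketch (the helper names are illustrative, not this repo's API): each `P_i^d` is the mass of `N(μ_i, σ_i²)` falling into the unit-width bin centered at step `d`.

```python
import math

def phi(x):
    """Standard normal CDF, via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def schedule_probs(mu, sigma, D):
    """P_i^d for d = 0..D-1: mass of N(mu, sigma^2) in the bin [d-0.5, d+0.5]."""
    return [phi((d + 0.5 - mu) / sigma) - phi((d - 0.5 - mu) / sigma)
            for d in range(D)]

p = schedule_probs(mu=3.0, sigma=0.8, D=8)
# Mass concentrates around step 3; the bins sum to ~1 when the
# Gaussian lies well inside [0, D).
```

Because `phi` is smooth in `μ` and `σ`, the same construction stays differentiable when written with tensor operations, which is what enables gradient descent on the schedule.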
### Differentiable Loss Functions
- **Dependency**: Expected topological violations via CDF products
- **Resource**: LogSumExp smooth-max of per-step usage + ReLU violations
- **Modulo Resource**: Wrapped Gaussian probabilities into II-slot reservation table
- **Recurrence**: Expected back-edge constraint violations
- **Memory**: CDF-based active lifetime estimation
- **Communication**: Expected edge lengths (compactness)

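As an illustration of how one such loss can be made differentiable, the sketch below implements a resource loss in PyTorch: expected per-step usage from the Gaussian bin probabilities, a LogSumExp smooth-max over steps, and a ReLU penalty above the cap. This is a sketch under assumptions (function name, `tau`, and tensor shapes are ours), not the repo's exact implementation.

```python
import torch

def resource_loss(mu, sigma, res, D, R_cap, tau=0.1):
    """Smooth penalty for exceeding resource cap R_cap at any step."""
    d = torch.arange(D, dtype=torch.float32)
    normal = torch.distributions.Normal(mu[:, None], sigma[:, None])
    # P[i, d]: probability that op i lands at step d (unit-width bins)
    P = normal.cdf(d + 0.5) - normal.cdf(d - 0.5)
    usage = (res[:, None] * P).sum(dim=0)           # expected usage per step
    smooth_max = tau * torch.logsumexp(usage / tau, dim=0)
    return torch.relu(smooth_max - R_cap)

mu = torch.tensor([0.0, 0.1, 0.2], requires_grad=True)
sigma = torch.tensor([0.5, 0.5, 0.5])
loss = resource_loss(mu, sigma, res=torch.tensor([1.0, 1.0, 1.0]),
                     D=8, R_cap=2.0)
loss.backward()   # gradients flow to mu, spreading contended ops apart
```

The LogSumExp temperature `tau` trades smoothness against tightness: as `tau → 0` the smooth-max approaches the true per-step maximum.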
### Augmented Lagrangian Method (ALM)
```
L_total = L_primary + Σ_i (λ_i · V_i + ρ/2 · ||V_i||²)
λ_i ← λ_i + ρ · V_i        (dual update)
```
Hyperparameters (from the paper): `lr=0.01, ρ=1e-4, σ=1e-2, κ=1/6`

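A toy, self-contained illustration of these updates (plain Python with hand-written gradient descent instead of Adam; not the repo's solver): minimize `x²` subject to `x ≥ 1`, with violation `V(x) = max(0, 1-x)`.

```python
def alm_minimize(x=0.0, lam=0.0, rho=1.0, lr=0.05, outer=20, inner=200):
    """ALM on a 1-D toy problem: inner primal descent, outer dual update."""
    for _ in range(outer):
        for _ in range(inner):
            v = max(0.0, 1.0 - x)
            # d/dx [x^2 + lam*V + (rho/2)*V^2]; dV/dx = -1 where V > 0
            grad = 2.0 * x + (-(lam + rho * v) if v > 0 else 0.0)
            x -= lr * grad
        lam += rho * max(0.0, 1.0 - x)   # dual update: λ ← λ + ρ·V
    return x

x_star = alm_minimize()
# x converges toward the constrained optimum x = 1
```

The multiplier `λ` ratchets up only while the constraint is violated, so the minimizer of the augmented loss drifts toward feasibility without needing `ρ → ∞` as a pure quadratic penalty would.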
## Tests

```bash
python test_twill.py   # Twill: 6/6 pass (~5s)
python test_gaus.py    # GauS: 7/7 pass (~30s)
```

## Solvers Used

| Component | Solver | Theory | Paper |
|-----------|--------|--------|-------|
| Twill Cost Normalization | CBC (PuLP) | ILP | Soi et al. §5.2 |
| Twill Modulo Scheduling | CBC (PuLP) | ILP | Soi et al. §3.1 |
| Twill Joint SWP+WS | Z3 | QFLIA (SMT) | Soi et al. §4 |
| GauS Scheduling | PyTorch (Adam) | Differentiable | Cai et al. §3 |

## Citations

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}

@article{cai2026gaus,
  title={GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization},
  author={Cai, Yaohui and others},
  journal={arXiv preprint arXiv:2602.20427},
  year={2026}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) — Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) — Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) — Warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) — NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) — Automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) — Task-based GPU programming model
- [Nautilus](https://arxiv.org/abs/2604.14825) — Auto-scheduling tensor compiler (fills Twill's tile-size gap)
- [MPK](https://arxiv.org/abs/2512.22219) — Cross-kernel software pipelining
- [GS-Schedule](https://github.com/Yu-Maryland/Differentiable_Scheduler_ICML24) — GauS predecessor (categorical approach)