# Twill + GauS: Optimal GPU Kernel Scheduling

This repository implements two complementary papers on GPU kernel scheduling optimization:

1. **[Twill](https://arxiv.org/abs/2512.18134)** — "Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs" (Soi et al., NVIDIA Research, 2024). Exact ILP+SMT solver that finds provably optimal schedules.

2. **[GauS](https://arxiv.org/abs/2602.20427)** — "Differentiable Scheduling Optimization via Gaussian Reparameterization" (Cai et al., 2026). Scalable gradient-based solver using Gaussian distributions and the Augmented Lagrangian Method.

## When to Use Which

| Property | Twill (ILP+SMT) | GauS (Differentiable) |
|----------|-----------------|----------------------|
| **Optimality** | Provably optimal | Approximate (Pareto-optimal) |
| **Speed on small graphs** | Fast (< 1s for 3-7 nodes) | Slower (~10s for 3-7 nodes) |
| **Scalability** | Exponential in graph size | O(\|V\|) parameters, GPU-accelerated |
| **Warp specialization** | Full joint SWP+WS | Schedule only (no WS) |
| **Best for** | Kernels with < 50 ops | Large graphs with 100+ ops |
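
The tradeoffs above boil down to a simple dispatch rule of thumb. The helper below is hypothetical (not part of the package API) and hard-codes the table's ~50-op crossover:

```python
def choose_solver(num_ops: int) -> str:
    """Pick a solver per the table above: exact ILP+SMT (Twill) stays
    tractable below roughly 50 ops; beyond that, gradient-based GauS
    scales with O(|V|) parameters. The threshold is a rule of thumb,
    not a hard limit."""
    return "twill" if num_ops < 50 else "gaus"
```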

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                    DependenceGraph                       │
│  G = (V, E), Machine Description, RRTs                   │
├────────────────────┬─────────────────────────────────────┤
│    Twill Solver    │         GauS Solver                 │
│                    │                                     │
│  Phase 1: ILP      │  Gaussian Reparameterization        │
│  (CBC/PuLP)        │  X_i ~ N(μ_i, σ_i²)                 │
│  → Modulo Schedule │  → P_i^d = Φ((d+0.5-μ)/σ)           │
│                    │          - Φ((d-0.5-μ)/σ)           │
│  Phase 2: SMT      │                                     │
│  (Z3/QFLIA)        │  Augmented Lagrangian Method        │
│  → Joint SWP + WS  │  → Adam optimizer on (μ, σ)         │
│                    │  → Dependency + Resource +          │
│  Cost Norm (§5.2)  │    Modulo + Recurrence losses       │
│  → Ratio-preserving│                                     │
│    cycle count     │  Legalization Heuristics            │
│    reduction       │  → Topological pass (regular)       │
│                    │  → Fixed-point iteration (modulo)   │
├────────────────────┴─────────────────────────────────────┤
│               Code Generation                            │
│  Pseudocode · CUDA Skeleton · Pipelined Schedule         │
└──────────────────────────────────────────────────────────┘
```

## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib torch
```

### Twill (exact solver)
```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve

graph = flash_attention_forward_simplified()
result = twill_solve(graph, max_I=5, verbose=True)
# → I=2, schedule: S@0, P@2, O@3 (proves FA3 optimal)
```

### GauS (differentiable solver)
```python
from twill.kernels import flash_attention_forward_hopper
from twill.gaus_solver import gaus_solve_twill_graph

graph = flash_attention_forward_hopper()
result = gaus_solve_twill_graph(graph, target_II=4, D=20, verbose=True)
# → II=4, feasible schedule found via gradient descent
```

### GauS on custom large graphs
```python
from twill.gaus_solver import GauSSolver, generate_random_dag

graph = generate_random_dag(num_nodes=1000, edge_density=0.01, num_back_edges=50)
solver = GauSSolver(graph, D=500, lr=0.01)
result = solver.solve_modulo(II=10, R_cap=50.0, max_iters=2000)
```

## Pre-built Kernel Descriptions

| Kernel | Architecture | Twill Result | GauS Result |
|--------|-------------|-------------|-------------|
| `flash_attention_forward_simplified()` | Hopper | I=2 ✓ (optimal) | II=2 ✓ (feasible) |
| `flash_attention_forward_hopper()` | Hopper (H100) | I=4, FA3 pipeline ✓ | II=4 ✓ (feasible) |
| `flash_attention_forward_blackwell()` | Blackwell (B200) | I=2, FA4 strategy ✓ | II=2 ✓ (feasible) |
| `simple_gemm_pipeline()` | Hopper | I=2, TMA on producer ✓ | II=2 ✓ (feasible) |

## Module Structure

```
twill/
├── __init__.py              # Package exports (v0.2.0)
├── graph.py                 # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py    # §5.2: ILP-based cycle count normalization
├── modulo_scheduler.py      # Twill Phase 1: ILP modulo scheduling (CBC)
├── smt_joint.py             # Twill Phase 2: SMT joint SWP+WS (Z3)
├── twill_solver.py          # Twill Algorithm 1: Main search procedure
├── gaus_solver.py           # GauS: Differentiable scheduling (PyTorch)
├── codegen.py               # Code generation (pseudocode, CUDA skeleton)
├── visualization.py         # Schedule visualization (text + matplotlib)
└── kernels.py               # Pre-built kernel descriptions (FMHA, GEMM)
```

## GauS: Technical Details

### Gaussian Reparameterization (§3.1)
Each operator `v_i` is modeled as `X_i ~ N(μ_i, σ_i²)` with only **2|V| parameters** (vs. D·|V| for categorical approaches). The probability of scheduling at step `d`:

```
P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i)
```
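
The discretized-Gaussian probabilities can be reproduced in a few lines of plain Python. This is a minimal sketch of the formula above, not the package's implementation; `step_probs` and its arguments are illustrative names:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def step_probs(mu: float, sigma: float, D: int) -> list[float]:
    """P_i^d = Phi((d+0.5-mu)/sigma) - Phi((d-0.5-mu)/sigma), d = 0..D-1."""
    return [
        normal_cdf((d + 0.5 - mu) / sigma) - normal_cdf((d - 0.5 - mu) / sigma)
        for d in range(D)
    ]

probs = step_probs(mu=3.2, sigma=0.8, D=10)
# Mass concentrates around d=3; the D probabilities sum to ~1 as long as
# the Gaussian lies well inside [0, D).
```

As `σ_i` shrinks during optimization, the distribution collapses toward a one-hot assignment of the operator to a single step.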

### Differentiable Loss Functions
- **Dependency**: Expected topological violations via CDF products
- **Resource**: LogSumExp smooth-max of per-step usage + ReLU violations
- **Modulo Resource**: Wrapped Gaussian probabilities into II-slot reservation table
- **Recurrence**: Expected back-edge constraint violations
- **Memory**: CDF-based active lifetime estimation
- **Communication**: Expected edge lengths (compactness)
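
As one concrete example from the list above, the resource loss can be sketched as expected per-step usage with a ReLU penalty on capacity violations. Names are mine, not the package's API, and the plain per-step sum stands in for the LogSumExp smooth-max used in the actual tensorized loss:

```python
def resource_loss(probs_per_op: list[list[float]],
                  costs: list[float],
                  cap: float) -> float:
    """Expected usage at step d is E[u_d] = sum_i P_i^d * r_i; violations
    above the capacity `cap` are penalized with a ReLU."""
    D = len(probs_per_op[0])
    loss = 0.0
    for d in range(D):
        expected_usage = sum(p[d] * r for p, r in zip(probs_per_op, costs))
        loss += max(0.0, expected_usage - cap)  # ReLU on the violation
    return loss
```

In the solver proper these would be autograd-tracked tensor operations, so gradients of the penalty flow back to (μ, σ).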

### Augmented Lagrangian Method (ALM)
```
L_total = L_primary + Σ_i (λ_i · V_i + ρ/2 · ||V_i||²)
λ_i ← λ_i + ρ · V_i    (dual update)
```
Hyperparameters (from the paper): `lr=0.01, ρ=1e-4, τ=1e-2, κ=1/6`

## Tests

```bash
python test_twill.py   # Twill: 6/6 pass (~5s)
python test_gaus.py    # GauS:  7/7 pass (~30s)
```

## Solvers Used

| Component | Solver | Theory | Paper |
|-----------|--------|--------|-------|
| Twill Cost Normalization | CBC (PuLP) | ILP | Soi et al. §5.2 |
| Twill Modulo Scheduling | CBC (PuLP) | ILP | Soi et al. §3.1 |
| Twill Joint SWP+WS | Z3 | QFLIA (SMT) | Soi et al. §4 |
| GauS Scheduling | PyTorch (Adam) | Differentiable | Cai et al. §3 |

## Citations

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}

@article{cai2026gaus,
  title={GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization},
  author={Cai, Yaohui and others},
  journal={arXiv preprint arXiv:2602.20427},
  year={2026}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) — Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) — Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) — Warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) — NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) — Automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) — Task-based GPU programming model
- [Nautilus](https://arxiv.org/abs/2604.14825) — Auto-scheduling tensor compiler (fills Twill's tile-size gap)
- [MPK](https://arxiv.org/abs/2512.22219) — Cross-kernel software pipelining
- [GS-Schedule](https://github.com/Yu-Maryland/Differentiable_Scheduler_ICML24) — GauS predecessor (categorical approach)