# Twill + GauS: Optimal GPU Kernel Scheduling

Implementation of two complementary papers for GPU kernel scheduling optimization:

1. **[Twill](https://arxiv.org/abs/2512.18134)** — "Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs" (Soi et al., NVIDIA Research, 2024). An exact ILP+SMT solver that finds provably optimal schedules.
2. **[GauS](https://arxiv.org/abs/2602.20427)** — "Differentiable Scheduling Optimization via Gaussian Reparameterization" (Cai et al., 2026). A scalable gradient-based solver using Gaussian distributions and the Augmented Lagrangian Method.

## When to Use Which

| Property | Twill (ILP+SMT) | GauS (Differentiable) |
|----------|-----------------|-----------------------|
| **Optimality** | Provably optimal | Approximate (Pareto-optimal) |
| **Speed on small graphs** | Fast (< 1s for 3-7 nodes) | Slower (~10s for 3-7 nodes) |
| **Scalability** | Exponential in graph size | O(\|V\|) parameters, GPU-accelerated |
| **Warp specialization** | Full joint SWP+WS | Schedule only (no WS) |
| **Best for** | Kernels with < 50 ops | Large graphs with 100+ ops |

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│ DependenceGraph                                         │
│ G = (V, E), Machine Description, RRTs                   │
├────────────────────┬────────────────────────────────────┤
│ Twill Solver       │ GauS Solver                        │
│                    │                                    │
│ Phase 1: ILP       │ Gaussian Reparameterization        │
│ (CBC/PuLP)         │ X_i ~ N(μ_i, σ_i²)                 │
│ → Modulo Schedule  │ → P_i^d = Φ((d+0.5-μ)/σ)           │
│                    │         - Φ((d-0.5-μ)/σ)           │
│ Phase 2: SMT       │                                    │
│ (Z3/QFLIA)         │ Augmented Lagrangian Method        │
│ → Joint SWP + WS   │ → Adam optimizer on (μ, σ)         │
│                    │ → Dependency + Resource +          │
│ Cost Norm (§5.2)   │   Modulo + Recurrence losses       │
│ → Ratio-preserving │                                    │
│   cycle count      │ Legalization Heuristics            │
│   reduction        │ → Topological pass (regular)       │
│                    │ → Fixed-point iteration (modulo)   │
├────────────────────┴────────────────────────────────────┤
│ Code Generation                                         │
│ Pseudocode · CUDA Skeleton · Pipelined Schedule         │
└─────────────────────────────────────────────────────────┘
```

## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib torch
```

### Twill (exact solver)

```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve

graph = flash_attention_forward_simplified()
result = twill_solve(graph, max_I=5, verbose=True)
# → I=2, schedule: S@0, P@2, O@3 (proves FA3 optimal)
```

### GauS (differentiable solver)

```python
from twill.kernels import flash_attention_forward_hopper
from twill.gaus_solver import gaus_solve_twill_graph

graph = flash_attention_forward_hopper()
result = gaus_solve_twill_graph(graph, target_II=4, D=20, verbose=True)
# → II=4, feasible schedule found via gradient descent
```

### GauS on custom large graphs

```python
from twill.gaus_solver import GauSSolver, generate_random_dag

graph = generate_random_dag(num_nodes=1000, edge_density=0.01, num_back_edges=50)
solver = GauSSolver(graph, D=500, lr=0.01)
result = solver.solve_modulo(II=10, R_cap=50.0, max_iters=2000)
```

## Pre-built Kernel Descriptions

| Kernel | Architecture | Twill Result | GauS Result |
|--------|--------------|--------------|-------------|
| `flash_attention_forward_simplified()` | Hopper | I=2 ✓ (optimal) | II=2 ✓ (feasible) |
| `flash_attention_forward_hopper()` | Hopper (H100) | I=4, FA3 pipeline ✓ | II=4 ✓ (feasible) |
| `flash_attention_forward_blackwell()` | Blackwell (B200) | I=2, FA4 strategy ✓ | II=2 ✓ (feasible) |
| `simple_gemm_pipeline()` | Hopper | I=2, TMA on producer ✓ | II=2 ✓ (feasible) |

## Module Structure

```
twill/
├── __init__.py             # Package exports (v0.2.0)
├── graph.py                # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py   # §5.2: ILP-based cycle count normalization
├── modulo_scheduler.py     # Twill Phase 1: ILP modulo scheduling (CBC)
├── smt_joint.py            # Twill Phase 2: SMT joint SWP+WS (Z3)
├── twill_solver.py         # Twill Algorithm 1: Main search procedure
├── gaus_solver.py          # GauS: Differentiable scheduling (PyTorch)
├── codegen.py              # Code generation (pseudocode, CUDA skeleton)
├── visualization.py        # Schedule visualization (text + matplotlib)
└── kernels.py              # Pre-built kernel descriptions (FMHA, GEMM)
```

## GauS: Technical Details

### Gaussian Reparameterization (§3.1)

Each operator `v_i` is modeled as `X_i ~ N(μ_i, σ_i²)` with only **2|V| parameters** (vs. D·|V| for categorical approaches). The probability of scheduling at step `d`:

```
P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i)
```

### Differentiable Loss Functions

- **Dependency**: expected topological violations via CDF products
- **Resource**: LogSumExp smooth-max of per-step usage + ReLU violations
- **Modulo Resource**: Gaussian probabilities wrapped into an II-slot reservation table
- **Recurrence**: expected back-edge constraint violations
- **Memory**: CDF-based active-lifetime estimation
- **Communication**: expected edge lengths (compactness)

### Augmented Lagrangian Method (ALM)

```
L_total = L_primary + Σ_i (λ_i · V_i + ρ/2 · ||V_i||²)
λ_i ← λ_i + ρ · V_i   (dual update)
```

Hyperparameters (from the paper): `lr=0.01, ρ=1e-4, τ=1e-2, κ=1/6`

## Tests

```bash
python test_twill.py   # Twill: 6/6 pass (~5s)
python test_gaus.py    # GauS: 7/7 pass (~30s)
```

## Solvers Used

| Component | Solver | Theory | Paper |
|-----------|--------|--------|-------|
| Twill Cost Normalization | CBC (PuLP) | ILP | Soi et al. §5.2 |
| Twill Modulo Scheduling | CBC (PuLP) | ILP | Soi et al. §3.1 |
| Twill Joint SWP+WS | Z3 | QFLIA (SMT) | Soi et al. §4 |
| GauS Scheduling | PyTorch (Adam) | Differentiable | Cai et al. §3 |

## Citations

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}

@article{cai2026gaus,
  title={GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization},
  author={Cai, Yaohui and others},
  journal={arXiv preprint arXiv:2602.20427},
  year={2026}
}
```

## Related Work

- [FlashAttention-3](https://arxiv.org/abs/2407.08608) — Hopper FMHA schedule that Twill rediscovers
- [FlashAttention-4](https://arxiv.org/abs/2603.05451) — Blackwell FMHA schedule that Twill rediscovers
- [ThunderKittens](https://github.com/HazyResearch/ThunderKittens) — warp-level kernel framework
- [CUTLASS 3.x](https://github.com/NVIDIA/cutlass) — NVIDIA GEMM templates with WS
- [Tawa](https://arxiv.org/abs/2510.14719) — automatic WS compiler (downstream of Twill)
- [Cypress](https://arxiv.org/abs/2504.07004) — task-based GPU programming model
- [Nautilus](https://arxiv.org/abs/2604.14825) — auto-scheduling tensor compiler (fills Twill's tile-size gap)
- [MPK](https://arxiv.org/abs/2512.22219) — cross-kernel software pipelining
- [GS-Schedule](https://github.com/Yu-Maryland/Differentiable_Scheduler_ICML24) — GauS predecessor (categorical approach)
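## Appendix: Worked Examples

The §3.1 reparameterization is easy to sanity-check in plain Python. The sketch below is illustrative only, not the package API: `norm_cdf`, `step_probs`, and the toy μ/σ/latency values are assumptions. It discretizes two Gaussians into step probabilities and estimates the expected violation of a latency-3 dependence edge by summing joint probability mass, in the spirit of the CDF-product dependency loss:

```python
import math

def norm_cdf(x):
    # Standard normal CDF Φ, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def step_probs(mu, sigma, D):
    # P_i^d = Φ((d+0.5-μ_i)/σ_i) - Φ((d-0.5-μ_i)/σ_i), for steps d = 0..D-1.
    return [norm_cdf((d + 0.5 - mu) / sigma) - norm_cdf((d - 0.5 - mu) / sigma)
            for d in range(D)]

D = 12
p_a = step_probs(mu=2.0, sigma=0.7, D=D)   # producer op, centered on step 2
p_b = step_probs(mu=7.0, sigma=0.7, D=D)   # consumer op, centered on step 7

# Expected violation of edge a→b with latency 3: total joint probability
# mass on pairs (da, db) where b starts before a's result is ready.
viol = sum(p_a[da] * p_b[db]
           for da in range(D) for db in range(D) if db < da + 3)
# viol is small here because b's mass sits well past a's latency window.
```

Because only 2 parameters per operator enter `step_probs`, shrinking σ sharpens the distribution toward a one-hot schedule while the loss stays differentiable in μ and σ.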
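The modulo-resource loss folds each operator's step probabilities into an II-slot reservation table. A standalone sketch of that wrapping (illustrative; the helper names and toy μ/σ values are assumptions, not the package API):

```python
import math

def norm_cdf(x):
    # Standard normal CDF Φ, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def step_probs(mu, sigma, D):
    # P^d = Φ((d+0.5-μ)/σ) - Φ((d-0.5-μ)/σ), for steps d = 0..D-1.
    return [norm_cdf((d + 0.5 - mu) / sigma) - norm_cdf((d - 0.5 - mu) / sigma)
            for d in range(D)]

def modulo_usage(ops, II, D):
    # Fold each op's step probabilities into its residue class mod II and
    # accumulate expected usage per slot of the reservation table.
    slots = [0.0] * II
    for mu, sigma in ops:
        p = step_probs(mu, sigma, D)
        for d in range(D):
            slots[d % II] += p[d]
    return slots

# Two ops centered on steps 2 and 6 share residue 2 mod II=4, so the
# expected usage of slot 2 exceeds 1: a contended reservation-table slot
# that a resource loss would penalize.
usage = modulo_usage([(2.0, 0.5), (6.0, 0.5)], II=4, D=16)
```

In the differentiable setting this expected-usage table is compared against capacities with a smooth penalty, so gradients push contended operators into different residue classes.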
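The ALM recursion can be exercised end-to-end on a toy two-op instance. A minimal sketch in plain Python, assuming the standard inequality-constraint multiplier `max(0, λ + ρ·h)` and toy hyperparameters chosen for fast convergence of this small example (not the paper's values, and nothing here is the package API):

```python
# Toy schedule: start times x0, x1 with a compactness-style primary loss
# x0² + x1², subject to the dependence x1 - x0 ≥ 3 (i.e. h = 3 - (x1 - x0) ≤ 0).
x0, x1 = 0.0, 0.0
lam, rho, lr = 0.0, 1.0, 0.05   # toy values for this example only

for step in range(4000):
    h = 3.0 - (x1 - x0)              # positive ⇒ dependence violated
    mult = max(0.0, lam + rho * h)   # augmented multiplier for the inequality
    # Gradient step on x0² + x1² + mult·h, with ∂h/∂x0 = +1, ∂h/∂x1 = -1.
    x0 -= lr * (2.0 * x0 + mult)
    x1 -= lr * (2.0 * x1 - mult)
    if step % 200 == 199:            # periodic dual update: λ ← max(0, λ + ρ·h)
        lam = max(0.0, lam + rho * (3.0 - (x1 - x0)))

# Converges to x0 ≈ -1.5, x1 ≈ +1.5: the dependence ends up tight and the
# multiplier settles near its optimal value λ ≈ 3.
```

Each dual update raises λ while the constraint is still violated, so the penalty term does not need a huge ρ to enforce feasibility; that is the practical advantage of ALM over a plain quadratic penalty.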