| # 🌊 LiquidDiffusion: Attention-Free Image Generation with Liquid Neural Networks |
|
|
| A **novel image generation architecture** that replaces all attention mechanisms with Parallel CfC (Closed-form Continuous-time) blocks from Liquid Neural Networks. |
|
|
| **To our knowledge, this is novel research:** no existing paper uses CfC/LTC as a diffusion model backbone. |
|
|
| [Open in Colab](https://colab.research.google.com/github/krystv/liquid-diffusion/blob/main/LiquidDiffusion_Training.ipynb) |
|
|
| ## 🔬 Key Innovations |
|
|
| | Feature | Description | |
| |---------|-------------| |
| | **No Attention** | All spatial mixing via multi-scale depthwise convolutions (3×3, 5×5, 7×7) + global average pooling | |
| | **Fully Parallelizable** | No sequential ODE solving — CfC closed-form solution eliminates the computational bottleneck of Neural ODEs | |
| | **CfC × Diffusion Bridge** | The diffusion noise level `t` is fed in as CfC's time variable, so the liquid gate responds directly to the noise level | |
| | **Liquid Relaxation Residuals** | Time-aware skip connections: `α·input + (1-α)·output` where `α = exp(-λ·t)` adapts to noise level | |
| | **Fits 16GB VRAM** | Tiny model (8M params) fits in ~4GB; designed for Colab free tier T4 | |
|
|
| ## 📐 Architecture |
|
|
| ``` |
| Input: noisy image [B, 3, H, W] + timestep t ∈ [0, 1] |
| |
| Time Embedding: Sinusoidal PE → MLP → t_emb [B, dim] |
| |
| Conv Stem: 3×3 conv → SiLU → 3×3 conv |
| |
| Encoder: |
| Stage 1: [LiquidDiffusionBlock × N₁] → DownSample (stride-2 conv) |
| Stage 2: [LiquidDiffusionBlock × N₂] → DownSample |
| Stage 3: [LiquidDiffusionBlock × N₃] |
| |
| Bottleneck: [LiquidDiffusionBlock × 2] |
| |
| Decoder (mirror of encoder): |
| Stage 3: UpSample → SkipFusion → [LiquidDiffusionBlock × N₃] |
| Stage 2: UpSample → SkipFusion → [LiquidDiffusionBlock × N₂] |
| Stage 1: [LiquidDiffusionBlock × N₁] |
| |
| Output: GroupNorm → SiLU → 3×3 conv → velocity prediction [B, 3, H, W] |
| ``` |
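
The time-embedding step in the diagram can be sketched as follows. The class name `TimeEmbeddingSketch`, the frequency base of 10000, the ×1000 scaling of `t`, and the 4× hidden width are assumptions borrowed from common diffusion-model practice; none of them is stated in this README, so treat this as an illustration rather than the repository's code.

```python
import math

import torch
import torch.nn as nn


class TimeEmbeddingSketch(nn.Module):
    """Sinusoidal features of t followed by a small MLP (illustrative only)."""

    def __init__(self, dim: int, max_period: float = 10000.0):
        super().__init__()
        assert dim % 2 == 0, "dim must be even to pair sin and cos features"
        half = dim // 2
        # Geometrically spaced frequencies, decreasing from 1 toward 1/max_period.
        freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
        self.register_buffer("freqs", freqs)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.SiLU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: [B] in [0, 1]; scaled (assumed factor) so the sinusoids vary over that range.
        args = (t[:, None] * 1000.0) * self.freqs[None, :]
        pe = torch.cat([args.sin(), args.cos()], dim=-1)  # [B, dim]
        return self.mlp(pe)                               # t_emb: [B, dim]
```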
|
|
| ### LiquidDiffusionBlock |
| ``` |
| x → AdaLN(t) → ParallelCfC(t) → +residual |
| → MultiScaleSpatialMix(t) → +residual |
| → AdaLN(t) → FeedForward → +residual |
| ``` |
|
|
| ### ParallelCfC (Core Innovation) |
| ```python |
| # CfC Eq.10 adapted for 2D spatial features: |
| backbone = SiLU(Conv1x1(DWConv7x7(x))) # shared spatial context |
| f = Conv1x1(backbone) # time-constant gate |
| g = DWConv→SiLU→Conv1x1(backbone) # "from" state |
| h = DWConv→SiLU→Conv1x1(backbone) # "to" state (attractor) |
| gate = σ(time_a(t_emb) · f - time_b(t_emb)) # liquid time gate |
| cfc_out = gate · g + (1-gate) · h # CfC interpolation |
| |
| # Liquid relaxation residual: |
| α = exp(-softplus(ρ) · |t|) # time-aware weight |
| output = α · input + (1-α) · cfc_out # noise-adaptive residual |
| ``` |
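
A minimal PyTorch sketch of the pseudocode above, for illustration only. The class name `ParallelCfCSketch`, the 3×3 depthwise kernels in the `g`/`h` heads, the linear `time_a`/`time_b` projections, and the scalar `rho` parameter are assumptions that the README does not pin down; the repository's implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelCfCSketch(nn.Module):
    """Illustrative CfC block for [B, C, H, W] features (not the repo's code)."""

    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        # Shared backbone: depthwise 7x7 for spatial context, 1x1 channel mixing, SiLU.
        self.backbone = nn.Sequential(
            nn.Conv2d(dim, dim, 7, padding=3, groups=dim),
            nn.Conv2d(dim, dim, 1),
            nn.SiLU(),
        )
        self.f = nn.Conv2d(dim, dim, 1)  # time-constant gate head
        self.g = self._head(dim)         # "from" state
        self.h = self._head(dim)         # "to" state (attractor)
        # Assumed: per-channel scale and shift of the gate, predicted from t_emb.
        self.time_a = nn.Linear(t_dim, dim)
        self.time_b = nn.Linear(t_dim, dim)
        self.rho = nn.Parameter(torch.zeros(1))  # relaxation rate before softplus

    @staticmethod
    def _head(dim: int) -> nn.Sequential:
        # Assumed 3x3 depthwise kernel; the README only says "DWConv".
        return nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.SiLU(),
            nn.Conv2d(dim, dim, 1),
        )

    def forward(self, x, t, t_emb):
        # x: [B, C, H, W], t: [B] in [0, 1], t_emb: [B, t_dim]
        feat = self.backbone(x)
        a = self.time_a(t_emb)[:, :, None, None]
        b = self.time_b(t_emb)[:, :, None, None]
        gate = torch.sigmoid(a * self.f(feat) - b)            # liquid time gate
        cfc_out = gate * self.g(feat) + (1 - gate) * self.h(feat)
        # Liquid relaxation residual: alpha = exp(-softplus(rho) * |t|).
        alpha = torch.exp(-F.softplus(self.rho) * t.abs())[:, None, None, None]
        return alpha * x + (1 - alpha) * cfc_out
```

Every operation here is a convolution or an elementwise function, so the block runs in a single forward pass with no solver loop. At `t = 0` the relaxation weight is exactly 1 and the block returns its input unchanged.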
|
|
| ## 📊 Model Configurations |
|
|
| | Config | Channels | Blocks | Params | Resolution | VRAM (fp16) | |
| |--------|----------|--------|--------|-----------|-------------| |
| | **tiny** | [64, 128, 256] | [2, 2, 4] | ~8M | 256×256 | ~4GB | |
| | **small** | [96, 192, 384] | [2, 3, 6] | ~25M | 256×256 | ~8GB | |
| | **base** | [128, 256, 512] | [2, 4, 8] | ~65M | 512×512 | ~14GB | |
| | **large** | [128, 256, 512, 768] | [2, 4, 8, 4] | ~120M | 512×512 | ~24GB | |
|
|
| ## 🏋️ Training |
|
|
| ### Rectified Flow (simplest effective objective) |
| ``` |
| x_t = (1-t) · x_data + t · noise, t ~ U[0,1] |
| Loss = ||model(x_t, t) - (noise - x_data)||² |
| ``` |
| No noise schedule. No variance. Just MSE on a straight-line velocity. |
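
The objective can be sketched in a few lines of PyTorch. The function name `rectified_flow_loss` and the assumption that `model(x_t, t)` accepts a per-sample timestep tensor of shape `[B]` are illustrative; `RectifiedFlowTrainer.train_step` may be organized differently.

```python
import torch
import torch.nn.functional as F


def rectified_flow_loss(model, x_data: torch.Tensor) -> torch.Tensor:
    """MSE between the predicted velocity and the straight-line target."""
    b = x_data.shape[0]
    t = torch.rand(b, device=x_data.device)   # t ~ U[0, 1], one value per sample
    noise = torch.randn_like(x_data)
    t_b = t.view(b, 1, 1, 1)                  # broadcast over C, H, W
    x_t = (1 - t_b) * x_data + t_b * noise    # point on the straight path
    target = noise - x_data                   # d x_t / d t, constant along the path
    return F.mse_loss(model(x_t, t), target)
```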
|
|
| ### Sampling (Euler ODE) |
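| ```python |
| import torch |
| |
| z = torch.randn(B, 3, H, W)      # start from pure noise (t = 1) |
| for i in range(N, 0, -1): |
|     t = i / N                    # current time, stepping from 1 down to 1/N |
|     z = z - model(z, t) / N      # Euler step of size 1/N toward t = 0 |
| ``` |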
| Typically 25-50 steps. |
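
As a sanity check on the update rule, the toy example below uses plain Python, a scalar stand-in for an image, and a hypothetical oracle `true_velocity` in place of a trained model. With the exact velocity for a single (data, noise) pair, Euler steps from `t = 1` to `t = 0` recover the data point for any step count, because the path is a straight line. A trained model only approximates a velocity averaged over many such pairs, which is why multiple steps are used in practice.

```python
import random


def euler_sample(velocity, z: float, num_steps: int) -> float:
    """Integrate dz/dt = velocity(z, t) from t = 1 down to t = 0."""
    for i in range(num_steps, 0, -1):
        t = i / num_steps
        z = z - velocity(z, t) / num_steps
    return z


random.seed(0)
x_data = 0.7                      # a scalar stand-in for an image
noise = random.gauss(0.0, 1.0)    # the sample we start from at t = 1


def true_velocity(z: float, t: float) -> float:
    # Oracle for this single (x_data, noise) pair: the path is a straight line,
    # so the velocity is the same at every point along it.
    return noise - x_data


for steps in (1, 5, 50):
    x_rec = euler_sample(true_velocity, noise, steps)
    print(steps, abs(x_rec - x_data) < 1e-9)
```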
|
|
| ### Quick Start |
| ```python |
| from liquid_diffusion import liquid_diffusion_tiny, RectifiedFlowTrainer |
| |
| model = liquid_diffusion_tiny() |
| trainer = RectifiedFlowTrainer(model, lr=1e-4, device='cuda') |
| |
| # Training step |
| images = get_batch() # [B, 3, 256, 256] in [-1, 1] |
| metrics = trainer.train_step(images) |
| print(f"Loss: {metrics['loss']:.4f}") |
| |
| # Generate |
| samples = trainer.sample(batch_size=4, image_size=256, num_steps=50) |
| ``` |
|
|
| ### Recommended Datasets |
| - **CelebA-HQ** (`huggan/CelebA-HQ`) — 30K face images, 256px |
| - **Flowers-102** (`huggan/flowers-102-categories`) — botanical images |
| - **AFHQ** — 15K animal faces (cats, dogs, wildlife) |
| - Any folder of images |
|
|
| ## 🧮 Mathematical Foundation |
|
|
| ### Liquid Time-Constant Networks (LTC) |
| *Hasani et al., AAAI 2021 — [arxiv:2006.04439](https://arxiv.org/abs/2006.04439)* |
|
|
| The fundamental ODE: |
| ``` |
| dx/dt = -[1/τ + f(x,I,θ)] · x + f(x,I,θ) · A |
| ``` |
| Key: system time constant `τ_sys = τ/(1 + τ·f)` is **input-dependent** — neurons adapt their response speed. |
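
To see where `τ_sys` comes from, hold `f` constant: the ODE becomes linear, `dx/dt = -(1/τ + f)·x + f·A`, with decay rate `1/τ + f = 1/τ_sys` and steady state `x* = f·A·τ_sys`. The short script below (plain Python, with arbitrary illustrative values for `tau`, `f`, and `A`) integrates the ODE with forward Euler and compares the result against that closed form.

```python
import math

tau, f, A = 2.0, 1.5, 1.0           # illustrative constants, not from the paper
tau_sys = tau / (1 + tau * f)       # effective time constant
x_star = f * A * tau_sys            # steady state of the linear ODE


def ltc_rhs(x: float) -> float:
    """Right-hand side of the LTC ODE with f held constant."""
    return -(1 / tau + f) * x + f * A


def euler(x0: float, t_end: float, dt: float = 1e-4) -> float:
    """Forward-Euler integration of the ODE from 0 to t_end."""
    x = x0
    for _ in range(round(t_end / dt)):
        x += dt * ltc_rhs(x)
    return x


def closed_form(x0: float, t: float) -> float:
    # Exponential relaxation toward x_star with time constant tau_sys.
    return x_star + (x0 - x_star) * math.exp(-t / tau_sys)


print(euler(0.0, 1.0), closed_form(0.0, 1.0))
```

In the full LTC model `f` depends on the state and the input, so the effective time constant changes from moment to moment and a numerical solver is needed.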
|
|
| ### CfC: Closed-form Solution |
| *Hasani et al., Nature Machine Intelligence 2022 — [arxiv:2106.13898](https://arxiv.org/abs/2106.13898)* |
|
|
| Approximates the solution of the LTC ODE in closed form: |
| ``` |
| x(t) = σ(-f(x,I;θf)·t) ⊙ g(x,I;θg) + [1 - σ(-f(x,I;θf)·t)] ⊙ h(x,I;θh) |
| ``` |
| Eliminates the ODE solver → **fully parallelizable**; the paper reports speedups of at least one order of magnitude over ODE-based counterparts. |
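
A scalar illustration of the closed-form expression, with `f`, `g`, and `h` fixed to arbitrary numbers rather than computed by networks (an assumption made purely for readability):

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cfc_state(t: float, f: float, g: float, h: float) -> float:
    """x(t) = sigma(-f*t) * g + (1 - sigma(-f*t)) * h for scalar f, g, h."""
    gate = sigmoid(-f * t)
    return gate * g + (1.0 - gate) * h


f, g, h = 2.0, -1.0, 1.0   # illustrative values; in CfC these are network outputs
for t in (0.0, 0.5, 1.0, 5.0):
    print(f"t={t:>3}: x={cfc_state(t, f, g, h):+.3f}")
```

At `t = 0` the gate is exactly 0.5, so the state is the average of `g` and `h`; for positive `f` it moves toward `h` as `t` grows. Each `t` is evaluated independently, with no dependence on earlier time points.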
|
|
| ### Our CfC-Diffusion Bridge |
| We observe that CfC's time parameter `t` and diffusion's noise level `t` serve analogous roles: |
| - CfC: `t` controls interpolation between "from" (g) and "to" (h) states |
| - Diffusion: `t` controls the noise level the denoiser must handle |
|
|
| By using the diffusion timestep directly as CfC's time parameter: |
| - `t≈0` (clean): gate ≈ 0.5 → balanced g/h → flexible detail processing |
| - `t≈1` (noisy): gate saturates → specialized denoising behavior |
| - The gate function `f` is **input-dependent** → each image region gets adaptive time response |
|
|
| ### Parallel Liquid Relaxation (from LiquidTAD) |
| *[arxiv:2604.18274](https://arxiv.org/abs/2604.18274)* |
|
|
| ``` |
| α = exp(-softplus(ρ) · t_diff) |
| output = α · input + (1-α) · gated_transform(input) |
| ``` |
| When `t` is large (noisy): α is small (it approaches 0 only when the learned rate `softplus(ρ)` is large) → rely more on the CfC output (needs strong processing). |
| When `t` is small (clean): α ≈ 1 → preserve input (only minor refinement needed). |
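
For intuition about the scale of `α`, the snippet below evaluates the weight for a few timesteps and values of `ρ`. It is plain Python, and the chosen `ρ` values are arbitrary, since `ρ` is a learned parameter.

```python
import math


def softplus(rho: float) -> float:
    return math.log1p(math.exp(rho))


def relaxation_alpha(t: float, rho: float) -> float:
    """alpha = exp(-softplus(rho) * |t|): weight on the untouched input."""
    return math.exp(-softplus(rho) * abs(t))


for rho in (0.0, 2.0, 4.0):
    row = ", ".join(
        f"t={t}: {relaxation_alpha(t, rho):.3f}" for t in (0.0, 0.5, 1.0)
    )
    print(f"rho={rho}: {row}")
```

With `ρ = 0` the rate is `ln 2 ≈ 0.693`, so `α = 0.5` at `t = 1`: input and transform are weighted equally even at full noise. Larger `ρ` pushes `α` closer to 0 at high noise levels.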
|
|
| ## 📚 References |
|
|
| 1. Hasani et al., "Liquid Time-constant Networks", AAAI 2021 — [arxiv:2006.04439](https://arxiv.org/abs/2006.04439) |
| 2. Hasani et al., "Closed-form Continuous-time Neural Networks", Nature MI 2022 — [arxiv:2106.13898](https://arxiv.org/abs/2106.13898) |
| 3. Lechner et al., "Neural Circuit Policies Enabling Auditable Autonomy", Nature MI 2020 |
| 4. LiquidTAD: Parallel liquid relaxation — [arxiv:2604.18274](https://arxiv.org/abs/2604.18274) |
| 5. USM: U-Shape Mamba for diffusion — [arxiv:2504.13499](https://arxiv.org/abs/2504.13499) |
| 6. DiffuSSM: Diffusion without attention — [arxiv:2311.18257](https://arxiv.org/abs/2311.18257) |
| 7. Liu et al., "Flow Straight and Fast: Rectified Flow", ICLR 2023 — [arxiv:2209.03003](https://arxiv.org/abs/2209.03003) |
| 8. Lee et al., "Improving the Training of Rectified Flows" — [arxiv:2405.20320](https://arxiv.org/abs/2405.20320) |
|
|
| ## License |
|
|
| MIT |
|
|