🌊 LiquidDiffusion: Attention-Free Image Generation with Liquid Neural Networks

A novel image generation architecture that replaces all attention mechanisms with Parallel CfC (Closed-form Continuous-time) blocks from Liquid Neural Networks.

To our knowledge, no existing paper uses CfC or LTC networks as a diffusion model backbone.

🔬 Key Innovations

Feature	Description
No Attention	All spatial mixing via multi-scale depthwise convolutions (3×3, 5×5, 7×7) + global average pooling
Fully Parallelizable	No sequential ODE solving — CfC closed-form solution eliminates the computational bottleneck of Neural ODEs
CfC × Diffusion Bridge	The diffusion noise level `t` is fed in as CfC's time variable, so the liquid time gate responds directly to the noise level
Liquid Relaxation Residuals	Time-aware skip connections: `α·input + (1-α)·output` where `α = exp(-λ·t)` adapts to noise level
Fits 16GB VRAM	Tiny model (8M params) fits in ~4GB; designed for Colab free tier T4

📐 Architecture

Input: noisy image [B, 3, H, W] + timestep t ∈ [0, 1]

Time Embedding: Sinusoidal PE → MLP → t_emb [B, dim]

Conv Stem: 3×3 conv → SiLU → 3×3 conv

Encoder:
  Stage 1: [LiquidDiffusionBlock × N₁] → DownSample (stride-2 conv)
  Stage 2: [LiquidDiffusionBlock × N₂] → DownSample
  Stage 3: [LiquidDiffusionBlock × N₃]

Bottleneck: [LiquidDiffusionBlock × 2]

Decoder (mirror of encoder):
  Stage 3: UpSample → SkipFusion → [LiquidDiffusionBlock × N₃]
  Stage 2: UpSample → SkipFusion → [LiquidDiffusionBlock × N₂]
  Stage 1: [LiquidDiffusionBlock × N₁]

Output: GroupNorm → SiLU → 3×3 conv → velocity prediction [B, 3, H, W]
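
To make the layout above concrete, here is a minimal sketch that traces the spatial resolution through the network. It assumes each DownSample halves H and W and each UpSample doubles them; `trace_resolutions` is a hypothetical helper for illustration, not part of the package.

```python
def trace_resolutions(num_stages, height, width):
    """List (label, H, W) for each stage of the U-shaped layout described above."""
    trace = [("stem", height, width)]
    h, w = height, width
    # Encoder: a stride-2 downsample follows every stage except the last.
    for stage in range(1, num_stages + 1):
        trace.append((f"encoder stage {stage}", h, w))
        if stage < num_stages:
            h, w = h // 2, w // 2
    trace.append(("bottleneck", h, w))
    # Decoder: stages num_stages..2 start with an upsample; stage 1 does not.
    for stage in range(num_stages, 0, -1):
        if stage > 1:
            h, w = h * 2, w * 2
        trace.append((f"decoder stage {stage}", h, w))
    trace.append(("output", h, w))
    return trace


for label, h, w in trace_resolutions(num_stages=3, height=256, width=256):
    print(f"{label:>16}: {h}x{w}")
```

With three stages there are two downsamples, so a 256×256 input reaches the bottleneck at 64×64 and returns to 256×256 at the output.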

LiquidDiffusionBlock

x → AdaLN(t) → ParallelCfC(t) → +residual
  → MultiScaleSpatialMix(t) → +residual
  → AdaLN(t) → FeedForward → +residual

ParallelCfC (Core Innovation)

# CfC Eq.10 adapted for 2D spatial features:
backbone = SiLU(Conv1x1(DWConv7x7(x)))     # shared spatial context
f = Conv1x1(backbone)                        # time-constant gate  
g = DWConv→SiLU→Conv1x1(backbone)           # "from" state
h = DWConv→SiLU→Conv1x1(backbone)           # "to" state (attractor)
gate = σ(time_a(t_emb) · f - time_b(t_emb)) # liquid time gate
cfc_out = gate · g + (1-gate) · h            # CfC interpolation

# Liquid relaxation residual:
α = exp(-softplus(ρ) · |t|)                  # time-aware weight
output = α · input + (1-α) · cfc_out         # noise-adaptive residual
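
The two formulas above can be followed on plain numbers. The sketch below is a scalar walk-through, not the actual module: `f`, `g`, `h` stand in for the outputs of the convolutional branches at one pixel, and `time_a`, `time_b` are treated as fixed scalars rather than projections of `t_emb`. All values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softplus(z):
    return math.log1p(math.exp(z))


def cfc_interpolate(f, g, h, time_a, time_b):
    """Blend the "from" state g and the "to" state h with the liquid time gate."""
    gate = sigmoid(time_a * f - time_b)
    return gate * g + (1.0 - gate) * h


def liquid_relaxation(x_in, cfc_out, rho, t):
    """Time-aware residual: keep more of the input when t is small."""
    alpha = math.exp(-softplus(rho) * abs(t))
    return alpha * x_in + (1.0 - alpha) * cfc_out


# Hypothetical scalar values standing in for one feature at one pixel.
x_in, f, g, h = 0.8, 0.3, -1.0, 2.0
cfc_out = cfc_interpolate(f, g, h, time_a=1.0, time_b=0.0)
for t in (0.0, 0.5, 1.0):
    print(t, liquid_relaxation(x_in, cfc_out, rho=0.0, t=t))
```

At `t = 0` the block returns its input unchanged. With `rho = 0`, `softplus(rho) = ln 2`, so at `t = 1` the input and the CfC output are mixed in equal parts.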

📊 Model Configurations

Config	Channels	Blocks	Params	Resolution	VRAM (fp16)
tiny	[64, 128, 256]	[2, 2, 4]	~8M	256×256	~4GB
small	[96, 192, 384]	[2, 3, 6]	~25M	256×256	~8GB
base	[128, 256, 512]	[2, 4, 8]	~65M	512×512	~14GB
large	[128, 256, 512, 768]	[2, 4, 8, 4]	~120M	512×512	~24GB
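
For scripting against these presets, the table can be transcribed into a small config structure. This is only a sketch: `LiquidDiffusionConfig` and `CONFIGS` are hypothetical names, not the package's API, and the parameter counts and VRAM figures are left out because they are measurements rather than settings.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LiquidDiffusionConfig:
    channels: Tuple[int, ...]   # width of each stage
    blocks: Tuple[int, ...]     # LiquidDiffusionBlock count per stage
    resolution: int             # square image resolution in pixels

    def __post_init__(self):
        if len(self.channels) != len(self.blocks):
            raise ValueError("channels and blocks must list one entry per stage")


# Values copied from the table above.
CONFIGS = {
    "tiny": LiquidDiffusionConfig((64, 128, 256), (2, 2, 4), 256),
    "small": LiquidDiffusionConfig((96, 192, 384), (2, 3, 6), 256),
    "base": LiquidDiffusionConfig((128, 256, 512), (2, 4, 8), 512),
    "large": LiquidDiffusionConfig((128, 256, 512, 768), (2, 4, 8, 4), 512),
}

for name, cfg in CONFIGS.items():
    print(name, len(cfg.channels), "stages,", sum(cfg.blocks), "blocks")
```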

🏋️ Training

Rectified Flow (simplest effective objective)

x_t = (1-t) · x_data + t · noise,   t ~ U[0,1]
Loss = ||model(x_t, t) - (noise - x_data)||²

No noise schedule and no variance term to tune or predict. Just MSE on a straight-line velocity.
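
As a rough illustration of the objective, the sketch below builds one training pair on a short list of floats standing in for a flattened image. `rectified_flow_pair` and `mse` are hypothetical helper names; the real trainer works on batched tensors.

```python
import random


def rectified_flow_pair(x_data, noise, t):
    """Return the interpolated sample x_t and the velocity target noise - x_data."""
    x_t = [(1.0 - t) * d + t * n for d, n in zip(x_data, noise)]
    target = [n - d for d, n in zip(x_data, noise)]
    return x_t, target


def mse(prediction, target):
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(0)
x_data = [0.5, -0.25, 1.0]                      # stand-in for a flattened image
noise = [rng.gauss(0.0, 1.0) for _ in x_data]   # Gaussian noise sample
t = rng.random()                                # t drawn uniformly from [0, 1)
x_t, target = rectified_flow_pair(x_data, noise, t)

# A model that predicted the target exactly would incur zero loss.
print(mse(target, target))
```

Note that the target does not depend on `t`: the velocity is constant along the straight line from data to noise, and only the input `x_t` changes with the timestep.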

Sampling (Euler ODE)

z = randn(B, 3, H, W)  # start from noise
for i in range(N, 0, -1):
    t = i / N
    z = z - model(z, t) / N  # Euler step

Typically 25-50 steps.
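
The loop above can be sanity-checked without a trained model. In the sketch below the network is replaced by the exact velocity field for a toy case where the data distribution is a single point `c`; this stand-in, and the names `euler_sample` and `point_mass_velocity`, are assumptions for illustration only.

```python
import random


def euler_sample(velocity, z, num_steps):
    """Integrate dz/dt = velocity(z, t) from t = 1 down to t = 0 with Euler steps."""
    for i in range(num_steps, 0, -1):
        t = i / num_steps
        z = z - velocity(z, t) / num_steps
    return z


# Toy stand-in for the trained model: if every data sample equals c, then
# x_t = (1 - t) * c + t * noise, so the velocity noise - c equals (x_t - c) / t.
c = 0.7


def point_mass_velocity(z, t):
    return (z - c) / t


z0 = random.Random(0).gauss(0.0, 1.0)   # start from noise
sample = euler_sample(point_mass_velocity, z0, num_steps=25)
print(sample)
```

Because the trajectories of this toy problem are exactly straight, the Euler sampler lands on `c` up to floating-point error for any number of steps. A learned velocity field is only approximately straight, which is why more steps generally help in practice.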

Quick Start

from liquid_diffusion import liquid_diffusion_tiny, RectifiedFlowTrainer

model = liquid_diffusion_tiny()
trainer = RectifiedFlowTrainer(model, lr=1e-4, device='cuda')

# Training step
images = get_batch()  # [B, 3, 256, 256] in [-1, 1]
metrics = trainer.train_step(images)
print(f"Loss: {metrics['loss']:.4f}")

# Generate
samples = trainer.sample(batch_size=4, image_size=256, num_steps=50)

Recommended Datasets

CelebA-HQ (huggan/CelebA-HQ) — 30K face images, 256px
Flowers-102 (huggan/flowers-102-categories) — botanical images
AFHQ — 15K animal faces (cats, dogs, wildlife)
Any folder of images

🧮 Mathematical Foundation

Liquid Time-Constant Networks (LTC)

Hasani et al., AAAI 2021 — arxiv:2006.04439

The fundamental ODE:

dx/dt = -[1/τ + f(x,I,θ)] · x + f(x,I,θ) · A

Key: system time constant τ_sys = τ/(1 + τ·f) is input-dependent — neurons adapt their response speed.
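
To see the effective time constant at work, the sketch below integrates the scalar ODE with `f` held constant, which is a simplification since in an LTC `f` depends on the state and the input. Under that assumption the ODE is linear and relaxes exponentially with time constant τ_sys toward the steady state f·A / (1/τ + f). The function names are hypothetical.

```python
import math


def ltc_euler(x0, tau, f, A, t_end, dt=1e-3):
    """Integrate dx/dt = -(1/tau + f) * x + f * A with explicit Euler, f held constant."""
    x = x0
    for _ in range(round(t_end / dt)):
        x += dt * (-(1.0 / tau + f) * x + f * A)
    return x


def ltc_exact(x0, tau, f, A, t):
    """Exact solution for constant f: exponential relaxation with time constant tau_sys."""
    tau_sys = tau / (1.0 + tau * f)
    x_star = tau_sys * f * A          # steady state f*A / (1/tau + f)
    return x_star + (x0 - x_star) * math.exp(-t / tau_sys)


tau, A = 1.0, 2.0
for f in (0.5, 4.0):                  # larger f gives a shorter effective time constant
    tau_sys = tau / (1.0 + tau * f)
    print(f, tau_sys, ltc_euler(0.0, tau, f, A, 1.0), ltc_exact(0.0, tau, f, A, 1.0))
```

The numerical and exact values agree closely, and the run with the larger `f` is much nearer its steady state at the same time `t`, which is the adaptive response speed described above.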

CfC: Closed-form Solution

Hasani et al., Nature Machine Intelligence 2022 — arxiv:2106.13898

Approximates the solution of the LTC ODE in closed form:

x(t) = σ(-f(x,I;θf)·t) ⊙ g(x,I;θg) + [1 - σ(-f(x,I;θf)·t)] ⊙ h(x,I;θh)

Eliminates the ODE solver → fully parallelizable, and reported to be at least one order of magnitude faster than ODE-solver-based counterparts.

Our CfC-Diffusion Bridge

We observe that CfC's time parameter t and diffusion's noise level t serve analogous roles:

CfC: t controls interpolation between "from" (g) and "to" (h) states
Diffusion: t controls the noise level the denoiser must handle

By using the diffusion timestep directly as CfC's time parameter, and taking the idealized gate σ(-f·t) from the closed form:

t≈0 (clean): gate ≈ 0.5 → balanced g/h → flexible detail processing
t≈1 (noisy): gate moves toward 0 or 1 wherever |f| is large → specialized denoising behavior
The gate function f is input-dependent → each image region gets adaptive time response
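
The gate behavior listed above can be checked numerically for the idealized gate σ(-f·t). This is a sketch on scalars with hypothetical values of `f`, `g` and `h`; in the model the gate also passes through learned time projections, so the exact numbers will differ.

```python
import math


def cfc_gate(f, t):
    """Idealized CfC gate sigma(-f * t) from the closed-form expression."""
    return 1.0 / (1.0 + math.exp(f * t))


def cfc_state(f, g, h, t):
    """Closed-form state: gate * g + (1 - gate) * h."""
    gate = cfc_gate(f, t)
    return gate * g + (1.0 - gate) * h


g, h = -1.0, 2.0                       # hypothetical "from" and "to" states
for f in (-4.0, 0.5, 4.0):             # stand-ins for input-dependent gate values
    print(f, cfc_gate(f, 0.0), cfc_gate(f, 1.0), cfc_state(f, g, h, 1.0))
```

At `t = 0` the gate is 0.5 regardless of `f`, so the state is the average of `g` and `h`. At `t = 1` a strongly positive `f` pushes the state toward `h`, a strongly negative `f` pushes it toward `g`, and a small `|f|` leaves the mix close to balanced.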

Parallel Liquid Relaxation (from LiquidTAD)

arxiv:2604.18274

α = exp(-softplus(ρ) · t_diff)
output = α · input + (1-α) · gated_transform(input)

When t is large (noisy): α decreases, reaching exp(-softplus(ρ)) at t = 1 → rely more on the CfC output (needs strong processing); α ≈ 0 only if softplus(ρ) is large. When t is small (clean): α ≈ 1 → preserve input (only minor refinement needed).

📚 References

Hasani et al., "Liquid Time-constant Networks", AAAI 2021 — arxiv:2006.04439
Hasani et al., "Closed-form Continuous-time Neural Networks", Nature MI 2022 — arxiv:2106.13898
Lechner et al., "Neural Circuit Policies Enabling Auditable Autonomy", Nature MI 2020
LiquidTAD: Parallel liquid relaxation — arxiv:2604.18274
USM: U-Shape Mamba for diffusion — arxiv:2504.13499
DiffuSSM: Diffusion without attention — arxiv:2311.18257
Liu et al., "Flow Straight and Fast: Rectified Flow", ICLR 2023 — arxiv:2209.03003
Lee et al., "Improving the Training of Rectified Flows" — arxiv:2405.20320

License

MIT