# 🧪 LiquidGen: Liquid Neural Network Image Generator
|
|
**A novel attention-free image generation model based on Liquid Neural Network dynamics from MIT CSAIL.**
|
|
LiquidGen replaces self-attention in diffusion models with **Closed-form Continuous-depth (CfC)** liquid dynamics, making it fully parallelizable, memory-efficient, and trainable on a single consumer GPU (Colab free-tier T4).
|
|
## 🏗️ Architecture
|
|
```
Input Image → Flux VAE Encoder → Noisy Latent → LiquidGen Backbone → Predicted Velocity → Euler ODE → Clean Latent → VAE Decoder → Output Image
```
|
|
### Key Components
|
|
| Component | What it does | Replaces |
|-----------|--------------|----------|
| **LiquidTimeConstant** | `α·x + (1-α)·stimulus` with learnable decay `α = exp(-softplus(τ))` | Residual connections |
| **GatedDepthwiseStimulusConv** | Local spatial context via gated DW-conv | Self-attention (local) |
| **ZigzagScan1D** | Global context via zigzag-ordered 1D conv | Self-attention (global) |
| **AdaptiveGroupNorm** | Timestep conditioning via scale/shift | AdaLN in DiT |
| **U-Net Long Skips** | Skip connections from shallow to deep blocks | Standard residual |
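As an illustration of the conditioning mechanism in the table, an AdaGN-style layer can be sketched as follows (a minimal sketch; class and argument names are illustrative, not necessarily the exact code in `model.py`):

```python
import torch
import torch.nn as nn

class AdaptiveGroupNorm(nn.Module):
    """GroupNorm whose scale/shift are predicted from a timestep embedding."""
    def __init__(self, channels: int, emb_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Predict per-channel scale and shift from the timestep embedding
        scale, shift = self.proj(t_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return self.norm(x) * (1 + scale) + shift
```

The `1 + scale` form keeps the layer close to a plain GroupNorm when the projection outputs are near zero at initialization.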
|
|
### Core Innovation: Liquid Time Constants
|
|
From the CfC paper (Hasani et al., Nature Machine Intelligence 2022):
|
|
```
x_{t+1} = exp(-Δt/τ_t) · x_t + (1 - exp(-Δt/τ_t)) · h(x_t, u_t)
```
|
|
Our parallelizable version:
```python
alpha = exp(-softplus(tau))                      # per-channel learnable retention, in (0, 1)
output = alpha * state + (1 - alpha) * stimulus  # exponential relaxation toward the stimulus
```
|
|
**No sequential ODE solving.** No attention. Fully parallelizable.
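In module form, this update might look like the following (a minimal sketch assuming per-channel `[1, C, 1, 1]` parameters; not necessarily the exact repository code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiquidTimeConstant(nn.Module):
    """Per-channel exponential relaxation: alpha*state + (1-alpha)*stimulus."""
    def __init__(self, channels: int):
        super().__init__()
        # tau = 0 at init, so alpha = exp(-softplus(0)) = exp(-ln 2) = 0.5:
        # the block starts as an even blend of old state and new stimulus.
        self.tau = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, state: torch.Tensor, stimulus: torch.Tensor) -> torch.Tensor:
        alpha = torch.exp(-F.softplus(self.tau))  # learnable retention in (0, 1)
        return alpha * state + (1.0 - alpha) * stimulus
```

Because `alpha` depends only on parameters, not on the input or a solver state, every position can be computed in parallel, which is the point of the static-decay formulation.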
|
|
## 📊 Model Sizes
|
|
| Model | Params | VRAM (train) | Best For |
|-------|--------|--------------|----------|
| **LiquidGen-S** | ~55M | ~4-6 GB | 256px, fast experiments |
| **LiquidGen-B** | ~140M | ~8-10 GB | 256/512px, balanced |
| **LiquidGen-L** | ~280M | ~12-14 GB | 512px, high quality |
|
|
All three model sizes fit within the **16 GB VRAM** of a Colab free-tier T4 GPU.
|
|
## 🚀 Quick Start
|
|
### Using the Colab Notebook
Open `LiquidGen_Colab_Notebook.ipynb` in Google Colab and follow the steps. It includes:
- Complete model code (no external dependencies beyond PyTorch + diffusers)
- Configurable training on the WikiArt dataset (artistic paintings)
- Support for 256px and 512px generation
- Class-conditional generation (27 art styles)
- Loss plotting and sample visualization
|
|
### Using the Python Scripts
|
|
```python
from model import liquidgen_base
import torch

# Create model
model = liquidgen_base(num_classes=27).cuda()
print(f"Parameters: {model.count_params()/1e6:.1f}M")

# Forward pass (predict velocity for flow matching)
x = torch.randn(4, 16, 32, 32).cuda()  # 256px latent
t = torch.rand(4).cuda()  # Timesteps
labels = torch.randint(0, 27, (4,)).cuda()
v = model(x, t, labels)  # Predicted velocity
```
|
|
## 🔧 Training
|
|
### Default Configuration
```python
from train import TrainConfig, train

config = TrainConfig(
    model_size="base",             # "small", "base", or "large"
    image_size=256,                # 256 or 512
    dataset_name="huggan/wikiart",
    label_column="style",          # 27 art styles
    num_classes=27,
    batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_epochs=50,
)
train(config)
```
|
|
### Training Details
- **VAE**: FLUX.1-schnell (frozen, 16-channel latent, 8× compression, Apache 2.0)
- **Objective**: flow matching (velocity prediction) with target `v = noise - x_0`
- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01)
- **Gradient clipping**: 2.0 (critical for stability, following the ZigMa paper)
- **EMA**: decay 0.9999
- **Sampling**: Euler ODE, 50 steps, classifier-free guidance
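The objective and sampler above amount to only a few lines each. A hedged sketch, assuming the `model(x, t, labels)` signature from the Quick Start (function names here are illustrative, and classifier-free guidance is omitted for brevity):

```python
import torch

def flow_matching_loss(model, x0, labels):
    """Velocity-prediction loss: x_t = (1-t)*x0 + t*noise, target v = noise - x0."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)
    noise = torch.randn_like(x0)
    tb = t[:, None, None, None]
    x_t = (1 - tb) * x0 + tb * noise       # linear interpolant between data and noise
    target = noise - x0                    # its time derivative, the velocity
    return torch.mean((model(x_t, t, labels) - target) ** 2)

@torch.no_grad()
def euler_sample(model, shape, labels, steps=50, device="cpu"):
    """Integrate dx/dt = v from t=1 (pure noise) back to t=0 (clean latent)."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = model(x, t, labels)
        x = x + (ts[i + 1] - ts[i]) * v    # negative step: move toward data
    return x
```

Note that because the interpolant is linear, the target velocity `noise - x0` is constant along each trajectory, which is what makes the Euler ODE with few steps viable.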

## 📁 Files

```
├── model.py                        # Complete LiquidGen model architecture
├── train.py                        # Training pipeline with FlowMatching + EMA
├── LiquidGen_Colab_Notebook.ipynb  # Ready-to-run Colab notebook
└── README.md                       # This file
```

## 🔬 Research Background

This architecture synthesizes ideas from multiple research lineages:

### Liquid Neural Networks
- **Liquid Time-constant Networks** (Hasani et al., AAAI 2021) – ODE-based neurons with input-dependent τ
- **Closed-form Continuous-depth Models** (Hasani et al., Nature Machine Intelligence 2022) – analytical solution eliminating ODE solvers
- **Neural Circuit Policies** (Lechner et al., Nature Machine Intelligence 2020) – sparse wiring: sensory → inter → command → motor

### Attention-Free Image Generation
- **ZigMa** (ECCV 2024) – zigzag scanning for SSM-based diffusion (FID 14.27, CelebA-256)
- **DiMSUM** (NeurIPS 2024) – spatial-frequency Mamba (FID 2.11, ImageNet 256)
- **DiffuSSM** (2023) – first attention-free diffusion model (FID 2.28, ImageNet 256)
- **DiM** (2024) – multi-directional Mamba with padding tokens

### Parallelization
- **LiquidTAD** (2025) – static decay `α = exp(-softplus(τ))` for fully parallel liquid dynamics (100× speedup vs. ODE solving)

### Flow Matching
- **Flow Matching for Generative Modeling** (Lipman et al., 2023)
- **SiT** (2024) – Scalable Interpolant Transformers
## 📐 Architecture Diagram

```
Input Latent [B, 16, H/8, W/8]
 │
 ├─── Patch Embed (Conv2d, stride=2) ─── [B, D, H/16, W/16]
 ├─── + Learnable Position Embedding
 ├─── Input Projection (DW-Conv + PW-Conv + GELU)
 │
 ├─── LiquidBlock × (depth/2) ─── save skip connections
 │      ├── AdaGN (timestep conditioned)
 │      ├── GatedDepthwiseStimulusConv (local spatial)
 │      ├── + ZigzagScan1D (global context)
 │      ├── LiquidTimeConstant #1 (CfC blend)
 │      ├── AdaGN (timestep conditioned)
 │      ├── ChannelMixMLP (GELU)
 │      └── LiquidTimeConstant #2 (CfC blend)
 │
 ├─── LiquidBlock × (depth/2) ─── add skip connections
 │      └── (same structure as above)
 │
 ├─── GroupNorm + Conv + GELU
 └─── Unpatchify (ConvTranspose2d) ─── [B, 16, H/8, W/8]
```

## ⚡ Key Design Decisions

1. **No Attention** – O(n) vs O(n²). Enables training on longer sequences / higher-resolution latents.
2. **Liquid Dynamics over Residuals** – instead of `x + f(x)`, we use `α·x + (1-α)·f(x)` where α is learned per channel. This gives the model explicit control over how much old vs. new information to retain.
3. **Zigzag Scanning** – preserves spatial continuity (adjacent pixels stay adjacent in the sequence); a simple raster scan breaks this at row boundaries.
4. **Frozen Flux VAE** – 16-channel latent with best-in-class reconstruction quality. Only 160 MB, ~1 GB VRAM.
5. **Flow Matching** – straighter ODE trajectories than DDPM → fewer sampling steps needed, better quality.
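The zigzag decision is easiest to see in the index ordering itself. A sketch of a row-wise zigzag ("boustrophedon") order follows; the repository's `ZigzagScan1D` may differ in detail:

```python
import torch

def zigzag_indices(h: int, w: int) -> torch.Tensor:
    """Flatten an h*w grid so consecutive sequence positions are always
    spatially adjacent: odd rows are traversed right-to-left."""
    idx = torch.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2].flip(-1)  # reverse every other row
    return idx.reshape(-1)

# For a 3x3 grid this yields [0, 1, 2, 5, 4, 3, 6, 7, 8]: steps 2->5 and
# 3->6 stay vertically adjacent, where a raster scan would jump 2->3
# across the row boundary.
```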

## 📄 License

MIT

## 🙏 Acknowledgments

- MIT CSAIL for Liquid Neural Networks research
- Black Forest Labs for the FLUX.1-schnell VAE (Apache 2.0)
- WikiArt dataset contributors
- ZigMa, DiMSUM, DiffuSSM, and DiM authors for attention-free diffusion insights
| |