# 🧪 LiquidGen: Liquid Neural Network Image Generator
|
|
**A novel attention-free image generation model based on Liquid Neural Network dynamics from MIT CSAIL.**
|
|
LiquidGen replaces self-attention in diffusion models with **Closed-form Continuous-depth (CfC)** liquid dynamics, making it fully parallelizable, memory-efficient, and trainable on a single consumer GPU (Colab free-tier T4).
|
|
## 🏗️ Architecture
|
|
```
Input Image → Flux VAE Encoder → Noisy Latent → LiquidGen Backbone → Predicted Velocity → Euler ODE → Clean Latent → VAE Decoder → Output Image
```
|
|
### Key Components
|
|
| Component | What it does | Replaces |
|-----------|--------------|----------|
| **LiquidTimeConstant** | `α·x + (1-α)·stimulus` with learnable decay `α = exp(-softplus(τ))` | Residual connections |
| **GatedDepthwiseStimulusConv** | Local spatial context via gated DW-conv | Self-attention (local) |
| **ZigzagScan1D** | Global context via zigzag-ordered 1D conv | Self-attention (global) |
| **AdaptiveGroupNorm** | Timestep conditioning via scale/shift | AdaLN in DiT |
| **U-Net Long Skips** | Skip connections from shallow to deep blocks | Standard residual |
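As an illustration of the conditioning mechanism in the table, an AdaGN-style layer can be sketched as follows (a minimal sketch; class and argument names are illustrative, not necessarily the exact code in `model.py`):

```python
import torch
import torch.nn as nn

class AdaptiveGroupNorm(nn.Module):
    """GroupNorm whose scale/shift are predicted from a timestep embedding."""
    def __init__(self, channels: int, emb_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.proj = nn.Linear(emb_dim, 2 * channels)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Predict per-channel scale and shift from the timestep embedding
        scale, shift = self.proj(t_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return self.norm(x) * (1 + scale) + shift
```

The `1 + scale` form keeps the layer close to a plain GroupNorm when the projection outputs are near zero at initialization.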
|
|
### Core Innovation: Liquid Time Constants
|
|
From the CfC paper (Hasani et al., Nature Machine Intelligence 2022):
|
|
```
x_{t+1} = exp(-Δt/τ_t) · x_t + (1 - exp(-Δt/τ_t)) · h(x_t, u_t)
```
|
|
Our parallelizable version:
```python
alpha = exp(-softplus(tau))                      # per-channel learnable retention, in (0, 1)
output = alpha * state + (1 - alpha) * stimulus  # exponential relaxation toward the stimulus
```
|
|
**No sequential ODE solving.** No attention. Fully parallelizable.
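In module form, this update might look like the following (a minimal sketch assuming per-channel `[1, C, 1, 1]` parameters; not necessarily the exact repository code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiquidTimeConstant(nn.Module):
    """Per-channel exponential relaxation: alpha*state + (1-alpha)*stimulus."""
    def __init__(self, channels: int):
        super().__init__()
        # tau = 0 at init, so alpha = exp(-softplus(0)) = exp(-ln 2) = 0.5:
        # the block starts as an even blend of old state and new stimulus.
        self.tau = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, state: torch.Tensor, stimulus: torch.Tensor) -> torch.Tensor:
        alpha = torch.exp(-F.softplus(self.tau))  # learnable retention in (0, 1)
        return alpha * state + (1.0 - alpha) * stimulus
```

Because `alpha` depends only on parameters, not on the input or a solver state, every position can be computed in parallel, which is the point of the static-decay formulation.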
|
|
## 📊 Model Sizes
|
|
| Model | Params | VRAM (train) | Best For |
|-------|--------|--------------|----------|
| **LiquidGen-S** | ~55M | ~4-6 GB | 256px, fast experiments |
| **LiquidGen-B** | ~140M | ~8-10 GB | 256/512px, balanced |
| **LiquidGen-L** | ~280M | ~12-14 GB | 512px, high quality |
|
|
All three model sizes fit within the **16 GB VRAM** of a Colab free-tier T4 GPU.
|
|
## 🚀 Quick Start
|
|
### Using the Colab Notebook
Open `LiquidGen_Colab_Notebook.ipynb` in Google Colab and follow the steps. It includes:
- Complete model code (no external dependencies beyond PyTorch + diffusers)
- Configurable training on the WikiArt dataset (artistic paintings)
- Support for 256px and 512px generation
- Class-conditional generation (27 art styles)
- Loss plotting and sample visualization
|
|
### Using the Python Scripts
|
|
```python
from model import liquidgen_base
import torch

# Create model
model = liquidgen_base(num_classes=27).cuda()
print(f"Parameters: {model.count_params()/1e6:.1f}M")

# Forward pass (predict velocity for flow matching)
x = torch.randn(4, 16, 32, 32).cuda()  # 256px latent
t = torch.rand(4).cuda()  # Timesteps
labels = torch.randint(0, 27, (4,)).cuda()
v = model(x, t, labels)  # Predicted velocity
```
|
|
## 🔧 Training
|
|
### Default Configuration
```python
from train import TrainConfig, train

config = TrainConfig(
    model_size="base",             # "small", "base", or "large"
    image_size=256,                # 256 or 512
    dataset_name="huggan/wikiart",
    label_column="style",          # 27 art styles
    num_classes=27,
    batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    num_epochs=50,
)
train(config)
```
|
|
### Training Details
- **VAE**: FLUX.1-schnell (frozen, 16-channel latent, 8× compression, Apache 2.0)
- **Objective**: flow matching (velocity prediction) with target `v = noise - x_0`
- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01)
- **Gradient clipping**: 2.0 (critical for stability, following the ZigMa paper)
- **EMA**: decay 0.9999
- **Sampling**: Euler ODE, 50 steps, classifier-free guidance
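The objective and sampler above amount to only a few lines each. A hedged sketch, assuming the `model(x, t, labels)` signature from the Quick Start (function names here are illustrative, and classifier-free guidance is omitted for brevity):

```python
import torch

def flow_matching_loss(model, x0, labels):
    """Velocity-prediction loss: x_t = (1-t)*x0 + t*noise, target v = noise - x0."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)
    noise = torch.randn_like(x0)
    tb = t[:, None, None, None]
    x_t = (1 - tb) * x0 + tb * noise       # linear interpolant between data and noise
    target = noise - x0                    # its time derivative, the velocity
    return torch.mean((model(x_t, t, labels) - target) ** 2)

@torch.no_grad()
def euler_sample(model, shape, labels, steps=50, device="cpu"):
    """Integrate dx/dt = v from t=1 (pure noise) back to t=0 (clean latent)."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t = ts[i].expand(shape[0])
        v = model(x, t, labels)
        x = x + (ts[i + 1] - ts[i]) * v    # negative step: move toward data
    return x
```

Note that because the interpolant is linear, the target velocity `noise - x0` is constant along each trajectory, which is what makes the Euler ODE with few steps viable.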

## 📁 Files

```
├── model.py                        # Complete LiquidGen model architecture
├── train.py                        # Training pipeline with FlowMatching + EMA
├── LiquidGen_Colab_Notebook.ipynb  # Ready-to-run Colab notebook
└── README.md                       # This file
```

## 🔬 Research Background

This architecture synthesizes ideas from multiple research lineages:

### Liquid Neural Networks
- **Liquid Time-constant Networks** (Hasani et al., AAAI 2021) – ODE-based neurons with input-dependent τ
- **Closed-form Continuous-depth Models** (Hasani et al., Nature Machine Intelligence 2022) – analytical solution eliminating ODE solvers
- **Neural Circuit Policies** (Lechner et al., Nature Machine Intelligence 2020) – sparse wiring: sensory → inter → command → motor

### Attention-Free Image Generation
- **ZigMa** (ECCV 2024) – zigzag scanning for SSM-based diffusion (FID 14.27, CelebA-256)
- **DiMSUM** (NeurIPS 2024) – spatial-frequency Mamba (FID 2.11, ImageNet 256)
- **DiffuSSM** (2023) – first attention-free diffusion model (FID 2.28, ImageNet 256)
- **DiM** (2024) – multi-directional Mamba with padding tokens

### Parallelization
- **LiquidTAD** (2025) – static decay `α = exp(-softplus(τ))` for fully parallel liquid dynamics (100× speedup vs. ODE solving)

### Flow Matching
- **Flow Matching for Generative Modeling** (Lipman et al., 2023)
- **SiT** (2024) – Scalable Interpolant Transformers
## 📐 Architecture Diagram

```
Input Latent [B, 16, H/8, W/8]
 │
 ├─── Patch Embed (Conv2d, stride=2) ─── [B, D, H/16, W/16]
 ├─── + Learnable Position Embedding
 ├─── Input Projection (DW-Conv + PW-Conv + GELU)
 │
 ├─── LiquidBlock × (depth/2) ─── save skip connections
 │      ├── AdaGN (timestep conditioned)
 │      ├── GatedDepthwiseStimulusConv (local spatial)
 │      ├── + ZigzagScan1D (global context)
 │      ├── LiquidTimeConstant #1 (CfC blend)
 │      ├── AdaGN (timestep conditioned)
 │      ├── ChannelMixMLP (GELU)
 │      └── LiquidTimeConstant #2 (CfC blend)
 │
 ├─── LiquidBlock × (depth/2) ─── add skip connections
 │      └── (same structure as above)
 │
 ├─── GroupNorm + Conv + GELU
 └─── Unpatchify (ConvTranspose2d) ─── [B, 16, H/8, W/8]
```

## ⚡ Key Design Decisions

1. **No Attention** – O(n) vs O(n²). Enables training on longer sequences / higher-resolution latents.
2. **Liquid Dynamics over Residuals** – instead of `x + f(x)`, we use `α·x + (1-α)·f(x)` where α is learned per channel. This gives the model explicit control over how much old vs. new information to retain.
3. **Zigzag Scanning** – preserves spatial continuity (adjacent pixels stay adjacent in the sequence); a simple raster scan breaks this at row boundaries.
4. **Frozen Flux VAE** – 16-channel latent with best-in-class reconstruction quality. Only 160 MB, ~1 GB VRAM.
5. **Flow Matching** – straighter ODE trajectories than DDPM → fewer sampling steps needed, better quality.
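The zigzag decision is easiest to see in the index ordering itself. A sketch of a row-wise zigzag ("boustrophedon") order follows; the repository's `ZigzagScan1D` may differ in detail:

```python
import torch

def zigzag_indices(h: int, w: int) -> torch.Tensor:
    """Flatten an h*w grid so consecutive sequence positions are always
    spatially adjacent: odd rows are traversed right-to-left."""
    idx = torch.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2].flip(-1)  # reverse every other row
    return idx.reshape(-1)

# For a 3x3 grid this yields [0, 1, 2, 5, 4, 3, 6, 7, 8]: steps 2->5 and
# 3->6 stay vertically adjacent, where a raster scan would jump 2->3
# across the row boundary.
```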

## 📄 License

MIT

## 🙏 Acknowledgments

- MIT CSAIL for Liquid Neural Networks research
- Black Forest Labs for the FLUX.1-schnell VAE (Apache 2.0)
- WikiArt dataset contributors
- ZigMa, DiMSUM, DiffuSSM, and DiM authors for attention-free diffusion insights
| |