# LiquidGen: Liquid Neural Network Image Generator

**A novel attention-free image generation model based on Liquid Neural Network dynamics from MIT CSAIL.**

LiquidGen replaces self-attention in diffusion models with **Closed-form Continuous-depth (CfC)** liquid dynamics, making it fully parallelizable, memory-efficient, and trainable on a single consumer GPU (Colab free tier T4).

## Quick Start (Colab)

1. Open `LiquidGen_Colab_Notebook.ipynb` in Google Colab
2. Select a dataset preset (see table below)
3. Run all cells: latents are pre-cached automatically, then training starts

**Training is optimized for the Colab free tier:**
- **Latent pre-caching**: encode all images with the VAE once → save to disk → train on pure tensors
- **No VAE during training**: saves ~1GB VRAM and enables larger batches (32+)
- **Small curated datasets** that download in seconds (not the full 5GB WikiArt!)

### Dataset Presets

| Preset | Images | Download | Classes | Description |
|--------|--------|----------|---------|-------------|
| `paintings_mini` | ~200 | 1.7MB | 27 styles | Instant smoke test |
| `paintings` | ~8K | 204MB | 27 styles | **Recommended**: best quality/speed tradeoff |
| `cartoon` | ~2.5K | 181MB | unconditional | Cartoon/anime images |
| `flowers` | ~8K | 331MB | unconditional | Flower photography |
| `wikiart_stream` | ~80K | streaming | 27 styles | Full WikiArt via streaming (set `max_images`) |

## Architecture

```
Input Image → Flux VAE Encoder → Noisy Latent → LiquidGen Backbone → Predicted Velocity → Euler ODE → VAE Decoder → Output
```

### Key Components

| Component | What it does | Replaces |
|-----------|--------------|----------|
| **LiquidTimeConstant** | `α·x + (1-α)·stimulus` with learnable decay α = exp(-softplus(τ)) | Residual connections |
| **GatedDepthwiseStimulusConv** | Local spatial context via gated DW-conv | Self-attention (local) |
| **ZigzagScan1D** | Global context via zigzag-ordered 1D conv | Self-attention (global) |
| **AdaptiveGroupNorm** | Timestep conditioning via scale/shift | AdaLN in DiT |
| **U-Net Long Skips** | Skip connections from shallow to deep blocks | Standard residual |
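
For intuition, here is a minimal, illustrative sketch of what the gated depthwise stimulus convolution could look like. The kernel size and gating path are assumptions, not the exact module in `model.py`:

```python
import torch
import torch.nn as nn

class GatedDepthwiseStimulusConv(nn.Module):
    """Illustrative sketch: local context from a depthwise conv, modulated by a learned gate."""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # Depthwise conv gathers a local neighbourhood per channel (no channel mixing).
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        # Pointwise conv predicts a per-position, per-channel gate.
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, H, W] feature map; output has the same shape.
        return self.dw(x) * torch.sigmoid(self.gate(x))
```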

### Core Innovation: Liquid Time Constants

From the CfC paper (Hasani et al., Nature Machine Intelligence 2022):
```
x_{t+1} = exp(-Δt/τ_t) · x_t + (1 - exp(-Δt/τ_t)) · h(x_t, u_t)
```

Our parallelizable version (inspired by LiquidTAD 2025):
```python
α = exp(-softplus(τ))                    # Per-channel learnable retention
output = α * state + (1 - α) * stimulus  # Exponential relaxation
```

**No sequential ODE solving.** No attention. Fully parallelizable.
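
As a concrete reference, here is a minimal PyTorch sketch of this blend as a module. The class name matches the component table above, but the per-channel parameter shape and the 4D broadcast are assumptions about how `model.py` applies it to feature maps:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiquidTimeConstant(nn.Module):
    """Parallel CfC-style blend: output = α·state + (1-α)·stimulus."""

    def __init__(self, channels: int):
        super().__init__()
        # One learnable time constant τ per channel.
        self.tau = nn.Parameter(torch.zeros(channels))

    def forward(self, state: torch.Tensor, stimulus: torch.Tensor) -> torch.Tensor:
        # state, stimulus: [B, C, H, W]
        # α = exp(-softplus(τ)) is always in (0, 1), so the blend is a stable
        # exponential relaxation toward the stimulus.
        alpha = torch.exp(-F.softplus(self.tau)).view(1, -1, 1, 1)
        return alpha * state + (1.0 - alpha) * stimulus
```

Because α is a static learned parameter rather than the output of a per-step ODE solve, the blend is a single elementwise operation that parallelizes across every spatial position.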

## Model Sizes

| Model | Params | VRAM (train) | Best For |
|-------|--------|--------------|----------|
| **LiquidGen-S** | ~55M | ~4-6 GB | 256px, fast experiments |
| **LiquidGen-B** | ~140M | ~8-10 GB | 256/512px, balanced |
| **LiquidGen-L** | ~280M | ~12-14 GB | 512px, high quality |

All fit in **16GB VRAM** (Colab free T4). Training on cached latents = no VAE overhead.

## Training

```python
from train import TrainConfig, train

config = TrainConfig(
    model_size="small",
    dataset_preset="paintings",  # 8K paintings, 204MB, 27 styles
    image_size=256,
    batch_size=32,               # Large batches OK with cached latents!
    num_epochs=100,
    learning_rate=1e-4,
)
train(config)
```

### Training Pipeline
1. **Pre-cache**: load the dataset → encode all images with the frozen Flux VAE → save latents to disk → unload the VAE (see the sketch after this list)
2. **Train**: load cached tensors → train the LiquidGen backbone with flow matching → fast iterations!
3. **Sample**: load the VAE only when generating sample images (lazy loading)
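
A hedged sketch of step 1, assuming the `diffusers` `AutoencoderKL` interface for the FLUX VAE; the function name, file layout, and the omission of any latent scaling/shift factors are simplifications rather than the exact code in `train.py`:

```python
import torch
from diffusers import AutoencoderKL

@torch.no_grad()
def cache_latents(image_batches, cache_path, device="cuda"):
    """Encode every image once with the frozen VAE, save the latents, then free the VAE."""
    vae = AutoencoderKL.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", subfolder="vae"
    ).to(device).eval()

    latents = []
    for batch in image_batches:                    # batch: [B, 3, H, W] in [-1, 1]
        posterior = vae.encode(batch.to(device)).latent_dist
        latents.append(posterior.sample().cpu())   # keep only the 16-channel latents

    torch.save(torch.cat(latents), cache_path)     # training reads these tensors directly
    del vae                                        # no VAE in memory while training
    torch.cuda.empty_cache()
```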

### Details
- **VAE**: FLUX.1-schnell (frozen, 16-channel latents, 8x compression, Apache 2.0)
- **Objective**: flow matching (velocity prediction), `v = noise - x_0` (see the sketch below)
- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01)
- **Gradient clipping**: 2.0 (critical for stability; from the ZigMa paper)
- **EMA**: 0.9999 decay
- **Sampling**: Euler ODE, 50 steps, classifier-free guidance
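
To make the objective and sampler concrete, here is a hedged sketch of one flow-matching training step and the plain Euler sampling loop. The function names and the `model(x_t, t)` call signature are assumptions, and classifier-free guidance and the EMA update are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, x0, optimizer):
    """One training step: regress the velocity v = noise - x0 at a random time t."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)            # t ~ U(0, 1)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1 - t_) * x0 + t_ * noise                          # linear interpolation path
    v_target = noise - x0                                     # constant velocity along that path
    loss = F.mse_loss(model(x_t, t), v_target)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0)   # gradient clipping at 2.0
    optimizer.step()
    return loss.item()

@torch.no_grad()
def euler_sample(model, shape, steps=50, device="cuda"):
    """Integrate dx/dt = v from t = 1 (pure noise) down to t = 0 (data) with Euler steps."""
    x = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        v = model(x, ts[i].expand(shape[0]))
        x = x + (ts[i + 1] - ts[i]) * v                       # dt is negative, so x moves toward data
    return x
```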

## Files

```
├── model.py                         # LiquidGen model architecture (~55-280M params)
├── train.py                         # Training pipeline with latent pre-caching
├── LiquidGen_Colab_Notebook.ipynb   # Ready-to-run Colab notebook
└── README.md
```

## Architecture Diagram

```
Input Latent [B, 16, H/8, W/8]
      │
      ├── Patch Embed (Conv2d, stride=2) ──→ [B, D, H/16, W/16]
      ├── + Learnable Position Embedding
      ├── Input Projection (DW-Conv + PW-Conv + GELU)
      │
      ├── LiquidBlock × (depth/2) ──→ save skip connections
      │     ├── AdaGN (timestep conditioned)
      │     ├── GatedDepthwiseStimulusConv (local spatial)
      │     ├── + ZigzagScan1D (global context)
      │     ├── LiquidTimeConstant #1 (CfC blend)
      │     ├── AdaGN
      │     ├── ChannelMixMLP (GELU)
      │     └── LiquidTimeConstant #2 (CfC blend)
      │
      ├── LiquidBlock × (depth/2) ──→ add skip connections
      │
      ├── GroupNorm + Conv + GELU
      └── Unpatchify (ConvTranspose2d) ──→ [B, 16, H/8, W/8]
```
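
The AdaGN stages above condition each block on the timestep. A minimal sketch, assuming the standard scale/shift formulation used by AdaLN-style conditioning (the projection layer and group count here are assumptions, not the exact module in `model.py`):

```python
import torch
import torch.nn as nn

class AdaptiveGroupNorm(nn.Module):
    """GroupNorm whose scale and shift are predicted from the timestep embedding."""

    def __init__(self, channels: int, t_dim: int, groups: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * channels)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: [B, C, H, W], t_emb: [B, t_dim]
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return self.norm(x) * (1 + scale) + shift
```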

## Research Background

### Liquid Neural Networks
- **Liquid Time-constant Networks** (Hasani et al., AAAI 2021) – ODE-based neurons with input-dependent τ
- **Closed-form Continuous-depth Models** (Hasani et al., Nature Machine Intelligence 2022) – Analytical solution eliminating ODE solvers
- **Neural Circuit Policies** (Lechner et al., Nature Machine Intelligence 2020) – Sparse wiring: sensory→inter→command→motor
- **LiquidTAD** (2025) – Static decay α = exp(-softplus(τ)) for fully parallel liquid dynamics (100× speedup)

### Attention-Free Image Generation
- **ZigMa** (ECCV 2024) – Zigzag scanning for SSM-based diffusion
- **DiMSUM** (NeurIPS 2024) – Spatial-frequency Mamba (FID 2.11 on ImageNet 256)
- **DiffuSSM** (2023) – First attention-free diffusion model
- **DiM** (2024) – Multi-directional Mamba with padding tokens

### Flow Matching
- **Flow Matching for Generative Modeling** (Lipman et al., 2023)
- **SiT** (2024) – Scalable Interpolant Transformers

## Design Decisions

1. **No Attention** – O(n) complexity. Liquid dynamics + zigzag conv replace self-attention entirely.
2. **Liquid over Residual** – `α·x + (1-α)·f(x)` instead of `x + f(x)`. Explicit control over retention per channel.
3. **Zigzag Scanning** – Preserves spatial continuity at row boundaries (critical insight from ZigMa); see the sketch after this list.
4. **Latent Pre-caching** – Encode once, train forever. No VAE overhead during training.
5. **Flow Matching** – Straighter ODE trajectories → fewer sampling steps, better quality.
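
To illustrate decision 3, here is a small sketch of a zigzag (boustrophedon) flattening order. The helper is purely illustrative; the actual `ZigzagScan1D` may use additional scan directions:

```python
import torch

def zigzag_indices(h: int, w: int) -> torch.Tensor:
    """Row-major order with every other row reversed, so consecutive tokens in the
    1D sequence are always spatial neighbours (no jump at row boundaries)."""
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)             # reverse odd rows
    return idx.flatten()

# Flatten a [B, C, H, W] feature map into a zigzag-ordered sequence, process it
# with any 1D operator, then scatter the result back to the original layout.
B, C, H, W = 2, 8, 4, 4
x = torch.randn(B, C, H, W)
order = zigzag_indices(H, W)
seq = x.flatten(2)[:, :, order]                # [B, C, H*W] in zigzag order
restored = seq[:, :, torch.argsort(order)].view(B, C, H, W)
assert torch.equal(restored, x)                # the scan is a pure permutation
```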

## License

MIT