krystv
/

liquid-diffusion

Model card Files Files and versions

xet

Community

krystv commited on 8 days ago

Commit

1a08b06

verified ·

1 Parent(s): b48de91

Update README with VAE and verified datasets

Browse files

Files changed (1) hide show

README.md +70 -131

README.md CHANGED Viewed

@@ -1,166 +1,105 @@
-# 🌊 LiquidDiffusion: Attention-Free Image Generation with Liquid Neural Networks
-A **novel image generation architecture** that replaces all attention mechanisms with Parallel CfC (Closed-form Continuous-depth) blocks from Liquid Neural Networks.
-**This is genuinely novel research** — no existing paper uses CfC/LTC as a diffusion model backbone.
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/krystv/liquid-diffusion/blob/main/LiquidDiffusion_Training.ipynb)
-## 🔬 Key Innovations
-| Feature | Description |
-|---------|-------------|
-| **No Attention** | All spatial mixing via multi-scale depthwise convolutions (3×3, 5×5, 7×7) + global average pooling |
-| **Fully Parallelizable** | No sequential ODE solving — CfC closed-form solution eliminates the computational bottleneck of Neural ODEs |
-| **CfC × Diffusion Bridge** | The diffusion noise level `t` IS the liquid time constant — natural mathematical correspondence |
-| **Liquid Relaxation Residuals** | Time-aware skip connections: `α·input + (1-α)·output` where `α = exp(-λ·t)` adapts to noise level |
-| **Fits 16GB VRAM** | Tiny model (8M params) fits in ~4GB; designed for Colab free tier T4 |
-## 📐 Architecture
-```
-Input: noisy image [B, 3, H, W] + timestep t ∈ [0, 1]
-Time Embedding: Sinusoidal PE → MLP → t_emb [B, dim]
-Conv Stem: 3×3 conv → SiLU → 3×3 conv
-Encoder:
-  Stage 1: [LiquidDiffusionBlock × N₁] → DownSample (stride-2 conv)
-  Stage 2: [LiquidDiffusionBlock × N₂] → DownSample
-  Stage 3: [LiquidDiffusionBlock × N₃]
-Bottleneck: [LiquidDiffusionBlock × 2]
-Decoder (mirror of encoder):
-  Stage 3: UpSample → SkipFusion → [LiquidDiffusionBlock × N₃]
-  Stage 2: UpSample → SkipFusion → [LiquidDiffusionBlock × N₂]
-  Stage 1: [LiquidDiffusionBlock × N₁]
-Output: GroupNorm → SiLU → 3×3 conv → velocity prediction [B, 3, H, W]
 ```
-### LiquidDiffusionBlock
-```
-x → AdaLN(t) → ParallelCfC(t) → +residual
-  → MultiScaleSpatialMix(t) → +residual
-  → AdaLN(t) → FeedForward → +residual
 ```
-### ParallelCfC (Core Innovation)
-```python
-# CfC Eq.10 adapted for 2D spatial features:
-backbone = SiLU(Conv1x1(DWConv7x7(x)))     # shared spatial context
-f = Conv1x1(backbone)                        # time-constant gate
-g = DWConv→SiLU→Conv1x1(backbone)           # "from" state
-h = DWConv→SiLU→Conv1x1(backbone)           # "to" state (attractor)
-gate = σ(time_a(t_emb) · f - time_b(t_emb)) # liquid time gate
-cfc_out = gate · g + (1-gate) · h            # CfC interpolation
-# Liquid relaxation residual:
-α = exp(-softplus(ρ) · |t|)                  # time-aware weight
-output = α · input + (1-α) · cfc_out         # noise-adaptive residual
-```
-## 📊 Model Configurations
-| Config | Channels | Blocks | Params | Resolution | VRAM (fp16) |
-|--------|----------|--------|--------|-----------|-------------|
-| **tiny** | [64, 128, 256] | [2, 2, 4] | ~8M | 256×256 | ~4GB |
-| **small** | [96, 192, 384] | [2, 3, 6] | ~25M | 256×256 | ~8GB |
-| **base** | [128, 256, 512] | [2, 4, 8] | ~65M | 512×512 | ~14GB |
-| **large** | [128, 256, 512, 768] | [2, 4, 8, 4] | ~120M | 512×512 | ~24GB |
-## 🏋️ Training
-### Rectified Flow (simplest effective objective)
-```
-x_t = (1-t) · x_data + t · noise,   t ~ U[0,1]
-Loss = ||model(x_t, t) - (noise - x_data)||²
-```
-No noise schedule. No variance. Just MSE on a straight-line velocity.
-### Sampling (Euler ODE)
 ```python
-z = randn(B, 3, H, W)  # start from noise
-for i in range(N, 0, -1):
-    t = i / N
-    z = z - model(z, t) / N  # Euler step
-```
-Typically 25-50 steps.
-### Quick Start
-```python
-from liquid_diffusion import liquid_diffusion_tiny, RectifiedFlowTrainer
-model = liquid_diffusion_tiny()
-trainer = RectifiedFlowTrainer(model, lr=1e-4, device='cuda')
-# Training step
-images = get_batch()  # [B, 3, 256, 256] in [-1, 1]
-metrics = trainer.train_step(images)
-print(f"Loss: {metrics['loss']:.4f}")
-# Generate
-samples = trainer.sample(batch_size=4, image_size=256, num_steps=50)
 ```
-### Recommended Datasets
-- **CelebA-HQ** (`huggan/CelebA-HQ`) — 30K face images, 256px
-- **Flowers-102** (`huggan/flowers-102-categories`) — botanical images
-- **AFHQ** — 15K animal faces (cats, dogs, wildlife)
-- Any folder of images
-## 🧮 Mathematical Foundation
-### Liquid Time-Constant Networks (LTC)
-*Hasani et al., AAAI 2021 — [arxiv:2006.04439](https://arxiv.org/abs/2006.04439)*
-The fundamental ODE:
-```
-dx/dt = -[1/τ + f(x,I,θ)] · x + f(x,I,θ) · A
-```
-Key: system time constant `τ_sys = τ/(1 + τ·f)` is **input-dependent** — neurons adapt their response speed.
-### CfC: Closed-form Solution
-*Hasani et al., Nature Machine Intelligence 2022 — [arxiv:2106.13898](https://arxiv.org/abs/2106.13898)*
-Solves the LTC ODE analytically:
-```
-x(t) = σ(-f(x,I;θf)·t) ⊙ g(x,I;θg) + [1 - σ(-f(x,I;θf)·t)] ⊙ h(x,I;θh)
 ```
-Eliminates ODE solver → **fully parallelizable**, one order of magnitude faster.
-### Our CfC-Diffusion Bridge
-We observe that CfC's time parameter `t` and diffusion's noise level `t` serve analogous roles:
-- CfC: `t` controls interpolation between "from" (g) and "to" (h) states
-- Diffusion: `t` controls the noise level the denoiser must handle
-By using the diffusion timestep directly as CfC's time parameter:
-- `t≈0` (clean): gate ≈ 0.5 → balanced g/h → flexible detail processing
-- `t≈1` (noisy): gate saturates → specialized denoising behavior
-- The gate function `f` is **input-dependent** → each image region gets adaptive time response
-### Parallel Liquid Relaxation (from LiquidTAD)
-*[arxiv:2604.18274](https://arxiv.org/abs/2604.18274)*
 ```
-α = exp(-softplus(ρ) · t_diff)
-output = α · input + (1-α) · gated_transform(input)
 ```
-When `t` is large (noisy): α ≈ 0 → rely on CfC output (needs strong processing).
-When `t` is small (clean): α ≈ 1 → preserve input (only minor refinement needed).
-## ��� References
-1. Hasani et al., "Liquid Time-constant Networks", AAAI 2021 — [arxiv:2006.04439](https://arxiv.org/abs/2006.04439)
-2. Hasani et al., "Closed-form Continuous-time Neural Networks", Nature MI 2022 — [arxiv:2106.13898](https://arxiv.org/abs/2106.13898)
-3. Lechner et al., "Neural Circuit Policies", Nature MI 2020
-4. LiquidTAD: Parallel liquid relaxation — [arxiv:2604.18274](https://arxiv.org/abs/2604.18274)
-5. USM: U-Shape Mamba for diffusion — [arxiv:2504.13499](https://arxiv.org/abs/2504.13499)
-6. DiffuSSM: Diffusion without attention — [arxiv:2311.18257](https://arxiv.org/abs/2311.18257)
-7. Liu et al., "Flow Straight and Fast: Rectified Flow", ICLR 2023 — [arxiv:2209.03003](https://arxiv.org/abs/2209.03003)
-8. Lee et al., "Improving the Training of Rectified Flows" — [arxiv:2405.20320](https://arxiv.org/abs/2405.20320)
 ## License

+# 🌊 LiquidDiffusion
+**A novel attention-free image generation model based on Liquid Neural Networks**
+## What is this?
+LiquidDiffusion is a **first-of-its-kind** image generation model that replaces attention with **Parallel CfC (Closed-form Continuous-depth) blocks** from Liquid Neural Network research. No existing paper combines LNNs with image generation — this fills that gap.
+### Key Properties
+- ✅ **Zero attention layers** — fully convolutional + liquid time-gating
+- ✅ **Fully parallelizable** — no ODE solvers, no sequential scanning, no recurrence
+- ✅ **Pretrained VAE** — uses `stabilityai/sd-vae-ft-mse` for efficient latent-space training
+- ✅ **Fits 16GB VRAM** — tiny config runs 256px at batch=8 on T4 GPU
+- ✅ **Simple training** — Rectified Flow (MSE velocity prediction, no noise schedule)
+- ✅ **6 verified datasets** ready to use
+## Quick Start
+Open the Colab notebook, pick your dataset from the dropdown, run all cells:
+**`LiquidDiffusion_Training.ipynb`**
+### Verified Datasets (all tested ✓)
+| Dataset | Size | Content |
+|---------|------|---------|
+| `nielsr/CelebA-faces` | 202K | Celebrity faces |
+| `huggan/flowers-102-categories` | 8K | Flowers |
+| `reach-vb/pokemon-blip-captions` | 833 | Pokemon art |
+| `huggan/anime-faces` | 21K | Anime faces |
+| `huggan/AFHQv2` | 16K | Cat/dog/wild animals |
+| `Norod78/cartoon-blip-captions` | 2K | Cartoon characters |
+## Architecture
 ```
+Input (noisy latent 4ch) → Conv Stem
+    → Encoder [LiquidDiffusionBlock × N, with downsampling]
+        → Bottleneck [LiquidDiffusionBlock × 2]
+    → Decoder [LiquidDiffusionBlock × N, with upsampling + skip fusion]
+→ Conv Head → Velocity prediction
 ```
+### VAE Integration
+- **Encoder**: `stabilityai/sd-vae-ft-mse` (83M params, frozen)
+- **Latent space**: 4 channels, 8× spatial downscale
+- **256px image → 32×32×4 latent** (64× fewer pixels to process!)
+- **Pre-caching**: Encode dataset once, then train without VAE on GPU (saves ~160MB VRAM)
+### ParallelCfCBlock (Novel Contribution)
+Based on CfC Eq.10: `x(t) = σ(-f·t) ⊙ g + (1 - σ(-f·t)) ⊙ h`
 ```python
+# Three CfC heads from shared backbone
+gate = sigmoid(time_a(t_emb) * f(features) - time_b(t_emb))
+cfc_out = gate * g(features) + (1 - gate) * h(features)
+# Liquid relaxation residual
+α = exp(-softplus(ρ) * |t_emb_mean|)
+output = α * input + (1 - α) * cfc_out
 ```
+**Key insight**: Diffusion timestep `t` IS the liquid time constant. CfC gate naturally adapts to noise level.
+## Model Configs
+| Config | Channels | Blocks | Params | 256px VRAM | Best For |
+|--------|----------|--------|--------|------------|----------|
+| tiny | [64, 128, 256] | [2, 2, 4] | ~23M | ~6 GB | Quick experiments, T4 |
+| small | [96, 192, 384] | [2, 3, 6] | ~69M | ~10 GB | Quality 256px, T4/A10G |
+## Training Objective: Rectified Flow
+```python
+x_t = (1 - t) * x0 + t * noise      # linear interpolation
+v_target = noise - x0                 # constant velocity
+loss = MSE(model(x_t, t), v_target)  # simple MSE — no noise schedule!
 ```
+## References
+| Paper | Contribution |
+|-------|-------------|
+| [CfC Networks (Nature MI 2022)](https://arxiv.org/abs/2106.13898) | CfC Eq.10, parallelizable closed-form |
+| [LTC Networks (AAAI 2021)](https://arxiv.org/abs/2006.04439) | Liquid time-constant ODE, stability |
+| [LiquidTAD (2024)](https://arxiv.org/abs/2604.18274) | Parallel liquid relaxation |
+| [USM (CVPR 2025)](https://arxiv.org/abs/2504.13499) | U-Net + SSM for diffusion |
+| [DiffuSSM (2023)](https://arxiv.org/abs/2311.18257) | SSM beats attention in diffusion |
+| [Rectified Flow (ICLR 2023)](https://arxiv.org/abs/2209.03003) | Simple velocity training |
+## Files
 ```
+├── liquid_diffusion/
+│   ├── __init__.py
+│   ├── model.py             # Full model architecture
+│   └── trainer.py           # Rectified Flow trainer + dataset utils
+├── LiquidDiffusion_Training.ipynb  # Complete Colab notebook (VAE + 6 datasets)
+├── test_model.py
+└── README.md
 ```
 ## License