krystv committed on
Commit
1a08b06
·
verified ·
1 Parent(s): b48de91

Update README with VAE and verified datasets

Files changed (1)
  1. README.md +70 -131
README.md CHANGED
@@ -1,166 +1,105 @@
1
- # 🌊 LiquidDiffusion: Attention-Free Image Generation with Liquid Neural Networks
2
 
3
- A **novel image generation architecture** that replaces all attention mechanisms with Parallel CfC (Closed-form Continuous-depth) blocks from Liquid Neural Networks.
4
 
5
- **This is genuinely novel research** — no existing paper uses CfC/LTC as a diffusion model backbone.
6
 
7
- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/krystv/liquid-diffusion/blob/main/LiquidDiffusion_Training.ipynb)
8
 
9
- ## 🔬 Key Innovations
 
 
 
 
 
 
10
 
11
- | Feature | Description |
12
- |---------|-------------|
13
- | **No Attention** | All spatial mixing via multi-scale depthwise convolutions (3×3, 5×5, 7×7) + global average pooling |
14
- | **Fully Parallelizable** | No sequential ODE solving — CfC closed-form solution eliminates the computational bottleneck of Neural ODEs |
15
- | **CfC × Diffusion Bridge** | The diffusion noise level `t` IS the liquid time constant — natural mathematical correspondence |
16
- | **Liquid Relaxation Residuals** | Time-aware skip connections: `α·input + (1-α)·output` where `α = exp(-λ·t)` adapts to noise level |
17
- | **Fits 16GB VRAM** | Tiny model (8M params) fits in ~4GB; designed for Colab free tier T4 |
18
 
19
- ## 📐 Architecture
20
 
21
- ```
22
- Input: noisy image [B, 3, H, W] + timestep t ∈ [0, 1]
23
-
24
- Time Embedding: Sinusoidal PE → MLP → t_emb [B, dim]
25
-
26
- Conv Stem: 3×3 conv → SiLU → 3×3 conv
27
 
28
- Encoder:
29
- Stage 1: [LiquidDiffusionBlock × N₁] → DownSample (stride-2 conv)
30
- Stage 2: [LiquidDiffusionBlock × N₂] → DownSample
31
- Stage 3: [LiquidDiffusionBlock × N₃]
32
 
33
- Bottleneck: [LiquidDiffusionBlock × 2]
 
 
 
 
 
 
 
34
 
35
- Decoder (mirror of encoder):
36
- Stage 3: UpSample → SkipFusion → [LiquidDiffusionBlock × N₃]
37
- Stage 2: UpSample → SkipFusion → [LiquidDiffusionBlock × N₂]
38
- Stage 1: [LiquidDiffusionBlock × N₁]
39
 
40
- Output: GroupNorm → SiLU → 3×3 conv → velocity prediction [B, 3, H, W]
41
  ```
42
-
43
- ### LiquidDiffusionBlock
44
- ```
45
- x → AdaLN(t) → ParallelCfC(t) → +residual
46
- → MultiScaleSpatialMix(t) → +residual
47
- → AdaLN(t) → FeedForward → +residual
48
  ```
49
 
50
- ### ParallelCfC (Core Innovation)
51
- ```python
52
- # CfC Eq.10 adapted for 2D spatial features:
53
- backbone = SiLU(Conv1x1(DWConv7x7(x))) # shared spatial context
54
- f = Conv1x1(backbone) # time-constant gate
55
- g = DWConv→SiLU→Conv1x1(backbone) # "from" state
56
- h = DWConv→SiLU→Conv1x1(backbone) # "to" state (attractor)
57
- gate = σ(time_a(t_emb) · f - time_b(t_emb)) # liquid time gate
58
- cfc_out = gate · g + (1-gate) · h # CfC interpolation
59
-
60
- # Liquid relaxation residual:
61
- α = exp(-softplus(ρ) · |t|) # time-aware weight
62
- output = α · input + (1-α) · cfc_out # noise-adaptive residual
63
- ```
64
-
65
- ## 📊 Model Configurations
66
-
67
- | Config | Channels | Blocks | Params | Resolution | VRAM (fp16) |
68
- |--------|----------|--------|--------|-----------|-------------|
69
- | **tiny** | [64, 128, 256] | [2, 2, 4] | ~8M | 256×256 | ~4GB |
70
- | **small** | [96, 192, 384] | [2, 3, 6] | ~25M | 256×256 | ~8GB |
71
- | **base** | [128, 256, 512] | [2, 4, 8] | ~65M | 512×512 | ~14GB |
72
- | **large** | [128, 256, 512, 768] | [2, 4, 8, 4] | ~120M | 512×512 | ~24GB |
73
 
74
- ## 🏋️ Training
75
 
76
- ### Rectified Flow (simplest effective objective)
77
- ```
78
- x_t = (1-t) · x_data + t · noise, t ~ U[0,1]
79
- Loss = ||model(x_t, t) - (noise - x_data)||²
80
- ```
81
- No noise schedule. No variance. Just MSE on a straight-line velocity.
82
 
83
- ### Sampling (Euler ODE)
84
  ```python
85
- z = randn(B, 3, H, W) # start from noise
86
- for i in range(N, 0, -1):
87
-     t = i / N
88
-     z = z - model(z, t) / N # Euler step
89
- ```
90
- Typically 25-50 steps.
91
-
92
- ### Quick Start
93
- ```python
94
- from liquid_diffusion import liquid_diffusion_tiny, RectifiedFlowTrainer
95
-
96
- model = liquid_diffusion_tiny()
97
- trainer = RectifiedFlowTrainer(model, lr=1e-4, device='cuda')
98
-
99
- # Training step
100
- images = get_batch() # [B, 3, 256, 256] in [-1, 1]
101
- metrics = trainer.train_step(images)
102
- print(f"Loss: {metrics['loss']:.4f}")
103
 
104
- # Generate
105
- samples = trainer.sample(batch_size=4, image_size=256, num_steps=50)
 
106
  ```
107
 
108
- ### Recommended Datasets
109
- - **CelebA-HQ** (`huggan/CelebA-HQ`) — 30K face images, 256px
110
- - **Flowers-102** (`huggan/flowers-102-categories`) — botanical images
111
- - **AFHQ** — 15K animal faces (cats, dogs, wildlife)
112
- - Any folder of images
113
 
114
- ## 🧮 Mathematical Foundation
115
 
116
- ### Liquid Time-Constant Networks (LTC)
117
- *Hasani et al., AAAI 2021 — [arxiv:2006.04439](https://arxiv.org/abs/2006.04439)*
 
 
118
 
119
- The fundamental ODE:
120
- ```
121
- dx/dt = -[1/τ + f(x,I,θ)] · x + f(x,I,θ) · A
122
- ```
123
- Key: system time constant `τ_sys = τ/(1 + τ·f)` is **input-dependent** — neurons adapt their response speed.
124
-
125
- ### CfC: Closed-form Solution
126
- *Hasani et al., Nature Machine Intelligence 2022 — [arxiv:2106.13898](https://arxiv.org/abs/2106.13898)*
127
 
128
- Solves the LTC ODE analytically:
129
- ```
130
- x(t) = σ(-f(x,I;θf)·t) ⊙ g(x,I;θg) + [1 - σ(-f(x,I;θf)·t)] ⊙ h(x,I;θh)
 
131
  ```
132
- Eliminates ODE solver → **fully parallelizable**, one order of magnitude faster.
133
 
134
- ### Our CfC-Diffusion Bridge
135
- We observe that CfC's time parameter `t` and diffusion's noise level `t` serve analogous roles:
136
- - CfC: `t` controls interpolation between "from" (g) and "to" (h) states
137
- - Diffusion: `t` controls the noise level the denoiser must handle
138
 
139
- By using the diffusion timestep directly as CfC's time parameter:
140
- - `t≈0` (clean): gate ≈ 0.5 → balanced g/h → flexible detail processing
141
- - `t≈1` (noisy): gate saturates → specialized denoising behavior
142
- - The gate function `f` is **input-dependent** → each image region gets adaptive time response
 
 
 
 
143
 
144
- ### Parallel Liquid Relaxation (from LiquidTAD)
145
- *[arxiv:2604.18274](https://arxiv.org/abs/2604.18274)*
146
 
147
  ```
148
- α = exp(-softplus(ρ) · t_diff)
149
- output = α · input + (1-α) · gated_transform(input)
 
 
 
 
 
150
  ```
151
- When `t` is large (noisy): α ≈ 0 → rely on CfC output (needs strong processing).
152
- When `t` is small (clean): α ≈ 1 → preserve input (only minor refinement needed).
153
-
154
- ## References
155
-
156
- 1. Hasani et al., "Liquid Time-constant Networks", AAAI 2021 — [arxiv:2006.04439](https://arxiv.org/abs/2006.04439)
157
- 2. Hasani et al., "Closed-form Continuous-time Neural Networks", Nature MI 2022 — [arxiv:2106.13898](https://arxiv.org/abs/2106.13898)
158
- 3. Lechner et al., "Neural Circuit Policies", Nature MI 2020
159
- 4. LiquidTAD: Parallel liquid relaxation — [arxiv:2604.18274](https://arxiv.org/abs/2604.18274)
160
- 5. USM: U-Shape Mamba for diffusion — [arxiv:2504.13499](https://arxiv.org/abs/2504.13499)
161
- 6. DiffuSSM: Diffusion without attention — [arxiv:2311.18257](https://arxiv.org/abs/2311.18257)
162
- 7. Liu et al., "Flow Straight and Fast: Rectified Flow", ICLR 2023 — [arxiv:2209.03003](https://arxiv.org/abs/2209.03003)
163
- 8. Lee et al., "Improving the Training of Rectified Flows" — [arxiv:2405.20320](https://arxiv.org/abs/2405.20320)
164
 
165
  ## License
166
 
 
1
+ # 🌊 LiquidDiffusion
2
 
3
+ **A novel attention-free image generation model based on Liquid Neural Networks**
4
 
5
+ ## What is this?
6
 
7
+ LiquidDiffusion is an image generation model that replaces attention with **Parallel CfC (Closed-form Continuous-depth) blocks** from Liquid Neural Network (LNN) research. To the author's knowledge, no existing paper uses LNNs as the backbone of an image generation model; this project explores that gap.
8
 
9
+ ### Key Properties
10
+ - ✅ **Zero attention layers** — fully convolutional + liquid time-gating
11
+ - ✅ **Fully parallelizable** — no ODE solvers, no sequential scanning, no recurrence
12
+ - ✅ **Pretrained VAE** — uses `stabilityai/sd-vae-ft-mse` for efficient latent-space training
13
+ - ✅ **Fits 16GB VRAM** — tiny config runs 256px at batch=8 on T4 GPU
14
+ - ✅ **Simple training** — Rectified Flow (MSE velocity prediction, no noise schedule)
15
+ - ✅ **6 verified datasets** ready to use
16
 
17
+ ## Quick Start
 
 
 
 
 
 
18
 
19
+ Open the Colab notebook, pick your dataset from the dropdown, run all cells:
20
 
21
+ **`LiquidDiffusion_Training.ipynb`**
 
 
 
 
 
22
 
23
+ ### Verified Datasets (all tested ✓)
 
 
 
24
 
25
+ | Dataset | Size | Content |
26
+ |---------|------|---------|
27
+ | `nielsr/CelebA-faces` | 202K | Celebrity faces |
28
+ | `huggan/flowers-102-categories` | 8K | Flowers |
29
+ | `reach-vb/pokemon-blip-captions` | 833 | Pokemon art |
30
+ | `huggan/anime-faces` | 21K | Anime faces |
31
+ | `huggan/AFHQv2` | 16K | Cat/dog/wild animals |
32
+ | `Norod78/cartoon-blip-captions` | 2K | Cartoon characters |
33
 
34
+ ## Architecture
 
 
 
35
 
 
36
  ```
37
+ Input (noisy latent 4ch) → Conv Stem
38
+ → Encoder [LiquidDiffusionBlock × N, with downsampling]
39
+ → Bottleneck [LiquidDiffusionBlock × 2]
40
+ → Decoder [LiquidDiffusionBlock × N, with upsampling + skip fusion]
41
+ → Conv Head → Velocity prediction
 
42
  ```
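To make the encoder and decoder stages concrete, here is a minimal sketch of how feature-map shapes might flow through the U-shaped backbone. It assumes the tiny config's channels `[64, 128, 256]`, a 32×32 latent input, and a stride-2 downsample between stages but not after the last one (as described in the previous README version). `stage_shapes` is a hypothetical helper for illustration, not part of the `liquid_diffusion` package.

```python
def stage_shapes(channels, latent_size):
    """Return (channels, height, width) for each encoder stage.

    Assumption: resolution halves between consecutive stages and is
    left unchanged after the final stage.
    """
    shapes = []
    size = latent_size
    for index, ch in enumerate(channels):
        shapes.append((ch, size, size))
        if index < len(channels) - 1:
            size //= 2  # stride-2 downsample before the next stage
    return shapes


encoder = stage_shapes([64, 128, 256], latent_size=32)
bottleneck = encoder[-1]            # bottleneck keeps the last encoder shape
decoder = list(reversed(encoder))   # decoder mirrors the encoder

print(encoder)  # [(64, 32, 32), (128, 16, 16), (256, 8, 8)]
```

Under these assumptions the bottleneck runs on an 8×8 grid, which is where most of the blocks of the tiny config (`[2, 2, 4]`) are placed.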
43
 
44
+ ### VAE Integration
45
+ - **Encoder**: `stabilityai/sd-vae-ft-mse` (83M params, frozen)
46
+ - **Latent space**: 4 channels, 8× spatial downscale
47
+ - **256px image → 32×32×4 latent** (64× fewer spatial positions to process!)
48
+ - **Pre-caching**: Encode dataset once, then train without VAE on GPU (saves ~160MB VRAM)
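The size reduction quoted above can be checked with a few lines of arithmetic. This is only a sketch: it assumes an 8× spatial downscale (implied by 256px → 32×32) and 3-channel RGB input, and the helper names `latent_shape` and `reduction_factors` are made up for illustration.

```python
def latent_shape(image_size, downscale=8, latent_channels=4):
    """Shape (C, H, W) of the VAE latent for a square image."""
    if image_size % downscale != 0:
        raise ValueError("image_size must be divisible by the downscale factor")
    side = image_size // downscale
    return (latent_channels, side, side)


def reduction_factors(image_size, image_channels=3, downscale=8, latent_channels=4):
    """Return (spatial reduction, total value reduction) from image to latent."""
    c, h, w = latent_shape(image_size, downscale, latent_channels)
    spatial = (image_size * image_size) / (h * w)
    values = (image_channels * image_size * image_size) / (c * h * w)
    return spatial, values


print(latent_shape(256))       # (4, 32, 32)
print(reduction_factors(256))  # (64.0, 48.0)
```

The 64× figure counts spatial positions; counting individual values (3 image channels versus 4 latent channels) gives 48×.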
49
 
50
+ ### ParallelCfCBlock (Novel Contribution)
51
 
52
+ Based on CfC Eq.10: `x(t) = σ(-f·t) ⊙ g + (1 - σ(-f·t)) ⊙ h`
 
 
 
 
 
53
 
 
54
  ```python
55
+ # Three CfC heads from shared backbone
56
+ gate = sigmoid(time_a(t_emb) * f(features) - time_b(t_emb))
57
+ cfc_out = gate * g(features) + (1 - gate) * h(features)
58
 
59
+ # Liquid relaxation residual
60
+ α = exp(-softplus(ρ) * |t_emb_mean|)
61
+ output = α * input + (1 - α) * cfc_out
62
  ```
63
 
64
+ **Key insight**: The diffusion timestep `t` plays the role of CfC's time variable, so the CfC gate adapts to the noise level.
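As a rough illustration of the gate and the relaxation residual, the sketch below applies both formulas to plain scalars. It is not the model code: the real block operates on feature maps, `f`, `g`, `h` stand in for the three head outputs, `a` and `b` stand in for `time_a(t_emb)` and `time_b(t_emb)`, and `t` stands in for the time value used in the residual. Note that choosing `a = -t` and `b = 0` recovers the `σ(-f·t)` gate of CfC Eq.10.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


def parallel_cfc(x, f, g, h, a, b, rho, t):
    """Scalar sketch of one ParallelCfC update (all arguments are floats)."""
    gate = sigmoid(a * f - b)                   # liquid time gate
    cfc_out = gate * g + (1.0 - gate) * h       # interpolate between g and h
    alpha = math.exp(-softplus(rho) * abs(t))   # relaxation weight in (0, 1]
    out = alpha * x + (1.0 - alpha) * cfc_out   # liquid relaxation residual
    return out, gate, alpha


# t = 0 (clean): alpha is exactly 1, so the input passes through unchanged.
out_clean, gate, alpha_clean = parallel_cfc(
    x=1.0, f=2.0, g=-1.0, h=3.0, a=-0.5, b=0.0, rho=0.0, t=0.0)

# Large t: alpha is close to 0, so the output is dominated by the CfC mix.
out_large_t, _, alpha_large_t = parallel_cfc(
    x=1.0, f=2.0, g=-1.0, h=3.0, a=-0.5, b=0.0, rho=0.0, t=50.0)
```

With these inputs, `out_clean` equals the input `x`, while `out_large_t` is essentially `gate * g + (1 - gate) * h`.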
 
 
 
 
65
 
66
+ ## Model Configs
67
 
68
+ | Config | Channels | Blocks | Params | 256px VRAM | Best For |
69
+ |--------|----------|--------|--------|------------|----------|
70
+ | tiny | [64, 128, 256] | [2, 2, 4] | ~23M | ~6 GB | Quick experiments, T4 |
71
+ | small | [96, 192, 384] | [2, 3, 6] | ~69M | ~10 GB | Quality 256px, T4/A10G |
72
 
73
+ ## Training Objective: Rectified Flow
 
 
 
 
 
 
 
74
 
75
+ ```python
76
+ x_t = (1 - t) * x0 + t * noise # linear interpolation
77
+ v_target = noise - x0 # constant velocity
78
+ loss = MSE(model(x_t, t), v_target) # simple MSE — no noise schedule!
79
  ```
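The objective and the matching Euler sampler (from the previous README version) can be sanity-checked on a toy problem. The sketch below uses scalars instead of image tensors and, as an assumption for illustration, replaces the trained network with a closed-form `oracle_velocity` that is exact when the dataset is a single point `X0`. All function names here are hypothetical and not part of the `liquid_diffusion` package.

```python
import random


def interpolate(x0, noise, t):
    """Point on the straight line from data (t=0) to noise (t=1)."""
    return (1.0 - t) * x0 + t * noise


def velocity_target(x0, noise):
    """Constant velocity along that line; this is the regression target."""
    return noise - x0


def euler_sample(velocity_fn, z, num_steps):
    """Integrate dz/dt = v(z, t) from t=1 (noise) down to t=0 (data)."""
    for i in range(num_steps, 0, -1):
        t = i / num_steps
        z = z - velocity_fn(z, t) / num_steps  # one Euler step backwards in t
    return z


X0 = 0.7  # toy "dataset" consisting of a single point


def oracle_velocity(z, t):
    """Exact velocity field for the single-point dataset (stand-in for a model)."""
    return (z - X0) / t


rng = random.Random(0)
noise = rng.gauss(0.0, 1.0)
t = 0.3
x_t = interpolate(X0, noise, t)

# Squared error between the oracle prediction and the training target.
loss = (oracle_velocity(x_t, t) - velocity_target(X0, noise)) ** 2

# Sampling from the same noise should land back on the data point.
sample = euler_sample(oracle_velocity, noise, num_steps=25)
```

Here `loss` is zero up to floating-point error and `sample` returns to `X0`, which is the behavior a well-trained velocity model is meant to approximate.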
 
80
 
81
+ ## References
 
 
 
82
 
83
+ | Paper | Contribution |
84
+ |-------|-------------|
85
+ | [CfC Networks (Nature MI 2022)](https://arxiv.org/abs/2106.13898) | CfC Eq.10, parallelizable closed-form |
86
+ | [LTC Networks (AAAI 2021)](https://arxiv.org/abs/2006.04439) | Liquid time-constant ODE, stability |
87
+ | [LiquidTAD (2024)](https://arxiv.org/abs/2604.18274) | Parallel liquid relaxation |
88
+ | [USM (CVPR 2025)](https://arxiv.org/abs/2504.13499) | U-Net + SSM for diffusion |
89
+ | [DiffuSSM (2023)](https://arxiv.org/abs/2311.18257) | SSM beats attention in diffusion |
90
+ | [Rectified Flow (ICLR 2023)](https://arxiv.org/abs/2209.03003) | Simple velocity training |
91
 
92
+ ## Files
 
93
 
94
  ```
95
+ ├── liquid_diffusion/
96
+ │ ├── __init__.py
97
+ │ ├── model.py # Full model architecture
98
+ │ └── trainer.py # Rectified Flow trainer + dataset utils
99
+ ├── LiquidDiffusion_Training.ipynb # Complete Colab notebook (VAE + 6 datasets)
100
+ ├── test_model.py
101
+ └── README.md
102
  ```
 
103
 
104
  ## License
105