# 🧪 LiquidGen: Liquid Neural Network Image Generator

**A novel attention-free image generation model based on Liquid Neural Network dynamics from MIT CSAIL.**

LiquidGen replaces self-attention in diffusion models with **Closed-form Continuous-depth (CfC)** liquid dynamics, making it fully parallelizable, memory-efficient, and trainable on a single consumer GPU (Colab free tier T4).

## 🚀 Quick Start (Colab)

1. Open `LiquidGen_Colab_Notebook.ipynb` in Google Colab
2. Select a dataset preset (see table below)
3. Run all cells – latents are pre-cached automatically, then training starts

**Training is optimized for Colab free tier:**
- **Latent pre-caching**: Encode all images with VAE once → save to disk → train on pure tensors
- **No VAE during training** → saves ~1GB VRAM, enables larger batches (32+)
- **Small curated datasets** that download in seconds (not 5GB WikiArt!)

### Dataset Presets

| Preset | Images | Download | Classes | Description |
|--------|--------|----------|---------|-------------|
| `paintings_mini` | ~200 | 1.7MB | 27 styles | Instant smoke test |
| `paintings` | ~8K | 204MB | 27 styles | **Recommended** – best quality/speed tradeoff |
| `cartoon` | ~2.5K | 181MB | unconditional | Cartoon/anime images |
| `flowers` | ~8K | 331MB | unconditional | Flower photography |
| `wikiart_stream` | ~80K | streaming | 27 styles | Full WikiArt via streaming (set `max_images`) |

## πŸ—οΈ Architecture

```
Input Image → Flux VAE Encoder → Noisy Latent → LiquidGen Backbone → Predicted Velocity → Euler ODE → VAE Decoder → Output
```

### Key Components

| Component | What it does | Replaces |
|-----------|-------------|----------|
| **LiquidTimeConstant** | `α·x + (1-α)·stimulus` with learnable decay α = exp(-softplus(ρ)) | Residual connections |
| **GatedDepthwiseStimulusConv** | Local spatial context via gated DW-conv | Self-attention (local) |
| **ZigzagScan1D** | Global context via zigzag-ordered 1D conv (sketch below) | Self-attention (global) |
| **AdaptiveGroupNorm** | Timestep conditioning via scale/shift | AdaLN in DiT |
| **U-Net Long Skips** | Skip connections from shallow to deep blocks | Standard residual |
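
The zigzag ordering itself fits in a few lines. A minimal sketch, assuming a boustrophedon (row-reversing) scan; `zigzag_indices` is our illustrative name, not necessarily what `model.py` uses:

```python
import torch

def zigzag_indices(h: int, w: int) -> torch.Tensor:
    # Boustrophedon order: left-to-right on even rows, right-to-left on odd rows,
    # so consecutive 1D positions are always 2D spatial neighbors (no jump at row ends).
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)   # reverse every other row
    return idx.flatten()

# Flatten [B, C, H, W] features into a spatially continuous sequence for a 1D conv:
x = torch.randn(2, 8, 4, 4)
order = zigzag_indices(4, 4)
seq = x.flatten(2)[:, :, order]      # [B, C, H*W]
```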

### Core Innovation: Liquid Time Constants

From the CfC paper (Hasani et al., Nature Machine Intelligence 2022):
```
x_{t+1} = exp(-Δt/τ_t) · x_t + (1 - exp(-Δt/τ_t)) · h(x_t, u_t)
```

Our parallelizable version (inspired by LiquidTAD 2025):
```python
α = exp(-softplus(ρ))              # Per-channel learnable retention
output = α * state + (1 - α) * stimulus  # Exponential relaxation
```

**No sequential ODE solving.** No attention. Fully parallelizable.
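
A self-contained PyTorch sketch of this update as a module (class and parameter names are our assumptions, not necessarily those in `model.py`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiquidTimeConstant(nn.Module):
    """Parallel liquid blend: α·state + (1-α)·stimulus with learnable per-channel α."""
    def __init__(self, channels: int):
        super().__init__()
        self.rho = nn.Parameter(torch.zeros(channels))  # α = exp(-softplus(ρ)) ∈ (0, 1)

    def forward(self, state: torch.Tensor, stimulus: torch.Tensor) -> torch.Tensor:
        # Per-channel retention, broadcast over spatial dims of [B, C, H, W] tensors
        alpha = torch.exp(-F.softplus(self.rho)).view(1, -1, 1, 1)
        # Exponential relaxation toward the stimulus -- no sequential ODE solve
        return alpha * state + (1 - alpha) * stimulus
```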

## 📊 Model Sizes

| Model | Params | VRAM (train) | Best For |
|-------|--------|-------------|----------|
| **LiquidGen-S** | ~55M | ~4-6 GB | 256px, fast experiments |
| **LiquidGen-B** | ~140M | ~8-10 GB | 256/512px, balanced |
| **LiquidGen-L** | ~280M | ~12-14 GB | 512px, high quality |

All fit in **16GB VRAM** (Colab free T4). Training on cached latents = no VAE overhead.

## 🔧 Training

```python
from train import TrainConfig, train

config = TrainConfig(
    model_size="small",
    dataset_preset="paintings",   # 8K paintings, 204MB, 27 styles
    image_size=256,
    batch_size=32,                # Large batches OK with cached latents!
    num_epochs=100,
    learning_rate=1e-4,
)
train(config)
```

### Training Pipeline
1. **Pre-cache**: Load dataset → encode all images with frozen Flux VAE → save latents to disk → unload VAE (sketched below)
2. **Train**: Load cached tensors → train LiquidGen backbone with flow matching → fast iterations!
3. **Sample**: Load VAE only when generating sample images (lazy loading)
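
A hedged sketch of step 1. The Hub id points at the FLUX.1-schnell VAE; `dataloader` and the output path are placeholders, and production code would also apply the VAE's scaling/shift factors:

```python
import torch
from diffusers import AutoencoderKL

# Load the frozen FLUX VAE once, encode everything, then drop it from memory.
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", subfolder="vae", torch_dtype=torch.float16
).to("cuda").eval()

latents = []
with torch.no_grad():
    for images, _ in dataloader:   # images in [-1, 1], shape [B, 3, H, W]
        z = vae.encode(images.half().to("cuda")).latent_dist.sample()
        latents.append(z.cpu())

torch.save(torch.cat(latents), "cached_latents.pt")  # 16-ch latents, 8x downsampled
del vae
torch.cuda.empty_cache()   # training now touches only the cached tensors
```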

### Details
- **VAE**: FLUX.1-schnell (frozen, 16ch latent, 8x compression, Apache 2.0)
- **Objective**: Flow matching (velocity prediction) – `v = noise - x_0` (see the sketch after this list)
- **Optimizer**: AdamW (lr=1e-4, weight_decay=0.01)
- **Gradient clipping**: 2.0 (critical for stability, from ZigMa paper)
- **EMA**: 0.9999 decay
- **Sampling**: Euler ODE, 50 steps, classifier-free guidance
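
Putting the objective and sampler together, a minimal sketch; the `model(x_t, t)` call signature is our assumption, and classifier-free guidance is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0):
    # Linear interpolant x_t = (1-t)·x0 + t·noise, whose velocity is v = noise - x0.
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1 - t_) * x0 + t_ * noise
    return F.mse_loss(model(x_t, t), noise - x0)

@torch.no_grad()
def euler_sample(model, shape, steps=50, device="cuda"):
    # Integrate dx/dt = v from t=1 (pure noise) back to t=0 (data).
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        x = x - dt * model(x, t)
    return x
```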

## πŸ“ Files

```
├── model.py                        # LiquidGen model architecture (~55-280M params)
├── train.py                        # Training pipeline with latent pre-caching
├── LiquidGen_Colab_Notebook.ipynb  # Ready-to-run Colab notebook
└── README.md
```

## πŸ“ Architecture Diagram

```
Input Latent [B, 16, H/8, W/8]
    │
    ├─── Patch Embed (Conv2d, stride=2) ──→ [B, D, H/16, W/16]
    ├─── + Learnable Position Embedding
    ├─── Input Projection (DW-Conv + PW-Conv + GELU)
    │
    ├─── LiquidBlock × (depth/2)  ←── save skip connections
    │       ├── AdaGN (timestep conditioned)
    │       ├── GatedDepthwiseStimulusConv (local spatial)
    │       ├── + ZigzagScan1D (global context)
    │       ├── LiquidTimeConstant #1 (CfC blend)
    │       ├── AdaGN
    │       ├── ChannelMixMLP (GELU)
    │       └── LiquidTimeConstant #2 (CfC blend)
    │
    ├─── LiquidBlock × (depth/2)  ←── add skip connections
    │
    ├─── GroupNorm + Conv + GELU
    └─── Unpatchify (ConvTranspose2d) ──→ [B, 16, H/8, W/8]
```

## 🔬 Research Background

### Liquid Neural Networks
- **Liquid Time-constant Networks** (Hasani et al., NeurIPS 2020) – ODE-based neurons with input-dependent τ
- **Closed-form Continuous-depth Models** (Hasani et al., Nature Machine Intelligence 2022) – Analytical solution eliminating ODE solvers
- **Neural Circuit Policies** (Lechner et al., Nature Machine Intelligence 2020) – Sparse wiring: sensory → inter → command → motor
- **LiquidTAD** (2025) – Static decay α = exp(-softplus(ρ)) for fully parallel liquid dynamics (100× speedup)

### Attention-Free Image Generation
- **ZigMa** (ECCV 2024) – Zigzag scanning for SSM-based diffusion
- **DiMSUM** (NeurIPS 2024) – Spatial-frequency Mamba (FID 2.11 ImageNet 256)
- **DiffuSSM** (2023) – First attention-free diffusion model
- **DiM** (2024) – Multi-directional Mamba with padding tokens

### Flow Matching
- **Flow Matching for Generative Modeling** (Lipman et al., 2023)
- **SiT** (2024) – Scalable Interpolant Transformers

## ⚡ Design Decisions

1. **No Attention** – O(n) complexity. Liquid dynamics + zigzag conv replace self-attention entirely.
2. **Liquid over Residual** – `α·x + (1-α)·f(x)` instead of `x + f(x)`. Explicit control over retention per channel.
3. **Zigzag Scanning** – Preserves spatial continuity at row boundaries (critical insight from ZigMa).
4. **Latent Pre-caching** – Encode once, train forever. No VAE overhead during training.
5. **Flow Matching** – Straighter ODE trajectories → fewer sampling steps, better quality.

## 📜 License

MIT