# πŸ”¨ MicroForge: A Novel Mobile-First Image Generation Architecture

> **Recurrent Latent Planning Γ— SSM-Conv Hybrid Backbone Γ— Deep Compression**

MicroForge is a genuinely new image generation architecture designed from scratch for consumer devices (3-4 GB RAM), trainable on a single 16 GB GPU. It combines the best ideas from recent research into an efficient, compact, editing-ready system.

**Key numbers:**
- **MicroForge-tiny**: 28M params, ~56 MB fp16, ~0.13s/image on CPU
- **MicroForge-small**: 114M params, ~228 MB fp16
- **MicroForge-base**: 193M params, ~386 MB fp16
- **Editing-ready**: Same backbone handles generation, editing, inpainting, super-res

---

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Paper Shortlist & Critique](#2-paper-shortlist--critique)
3. [Module-by-Module Design](#3-module-by-module-design)
4. [Mathematical Formulation](#4-mathematical-formulation)
5. [Training Objective](#5-training-objective)
6. [Memory & Compute Budget](#6-memory--compute-budget)
7. [Training Curriculum](#7-training-curriculum-16-gb-gpu)
8. [Mobile Deployment](#8-mobile-deployment)
9. [Failure Modes](#9-failure-modes)
10. [Ablation Plan](#10-ablation-plan)
11. [Editing Roadmap](#11-editing-roadmap)
12. [Quick Start](#12-quick-start)

---

## 1. Architecture Overview

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     MicroForge Pipeline                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                  β”‚
β”‚  Text ──→ [Text Encoder (CLIP/TinyCLIP)] ──→ text_emb, pooled   β”‚
β”‚                     β”‚                                            β”‚
β”‚                     β–Ό                                            β”‚
β”‚  Noise z_T ──→ [Recurrent Latent Planner]                       β”‚
β”‚                  β”‚  K=32 plan tokens (49 KB state)               β”‚
β”‚                  β”‚  READ: cross-attn(plan, z_t) β€” O(KΒ·N)        β”‚
β”‚                  β”‚  REASON: self-attn(plan) β€” O(KΒ²)             β”‚
β”‚                  β”‚  Self-condition from previous step            β”‚
β”‚                  β–Ό                                               β”‚
β”‚  z_t ──→ [SSM-Conv Hybrid Backbone] ◄── planner_tokens          β”‚
β”‚           β”‚ Per block (Γ—6/12/18):                                β”‚
β”‚           β”‚   1. AdaLN-Group(z_t, t_emb + text_pool)            β”‚
β”‚           β”‚   2. BiSSM(zigzag scan) β€” O(N)                      β”‚
β”‚           β”‚   3. CrossAttn(z_t, text_emb βˆ₯ plan) β€” O(NΒ·M)      β”‚
β”‚           β”‚   4. FFN(expansion=3) β€” O(NΒ·D)                      β”‚
β”‚           β”‚ Every K blocks: SharedMQA(z_t) β€” single instance    β”‚
β”‚           β–Ό                                                      β”‚
β”‚  v_pred = backbone(z_t, t, text, plan)                          β”‚
β”‚  z_{t-1} = z_t + Ξ”t Β· v_pred       (Euler ODE step)            β”‚
β”‚                                                                  β”‚
β”‚  z_0 ──→ [DC-VAE Decoder (32Γ— upsample)] ──→ Image [3,H,W]    β”‚
β”‚                                                                  β”‚
β”‚  β”Œβ”€β”€β”€ Editing Mode (same backbone) ────────────────────┐        β”‚
β”‚  β”‚ z_input = [z_target_noise βˆ₯ z_source] (width-concat) β”‚        β”‚
β”‚  β”‚ Task token: [Generate] / [Edit] / [Inpaint] / [SR]   β”‚        β”‚
β”‚  β”‚ No extra parameters needed                            β”‚        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### What's Novel

1. **Recurrent Latent Planner (RLP)**: Persistent latent tokens that carry "memory" across denoising steps. The planner reasons at a higher level before the backbone commits to pixel changes. Inspired by RIN (Jabri et al., 2022) but adapted for diffusion: plan tokens READ from the noised latent, REASON internally via self-attention, then inject guidance into the backbone via cross-attention. Self-conditioning carries plan state across steps.

2. **SSM-Conv Hybrid Backbone**: Replaces O(NΒ²) self-attention with bidirectional SSM scanning (O(N)) plus local DWConv. One globally-shared lightweight MQA attention block provides in-context learning capability. This hybrid achieves the global receptive field of attention with linear complexity.

3. **Deep Compression VAE with Residual Shortcuts**: 32Γ— spatial compression using space-to-channel rearrangement as non-parametric skip connections. 512px β†’ 16Γ—16Γ—32 latent = only 256 spatial tokens (vs 4096 in SD-VAE).

4. **Editing by Design**: DreamLite-style spatial concatenation enables generation, editing, inpainting, and super-resolution with zero extra parameters. The same backbone processes all tasks; a minimal sketch of the input layout follows this list.
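
The width-concatenation in editing mode is just a tensor `cat` plus a task embedding. A minimal sketch, assuming the tiny config's `[B, 16, 16, 16]` latent shape and an illustrative task vocabulary (neither is pinned down by this README):

```python
import torch

# Hypothetical tiny-config latent shape at 512px after the 32x DC-VAE.
B, C, H, W = 1, 16, 16, 16
z_target_noise = torch.randn(B, C, H, W)  # noised target latent z_t
z_source = torch.randn(B, C, H, W)        # clean latent of the source image

# Width-concatenation: the backbone sees one [B, C, H, 2W] latent and
# learns, cued by the task token, to denoise only the left half.
z_input = torch.cat([z_target_noise, z_source], dim=-1)  # [B, C, H, 2W]

# Task conditioning as a learned embedding index (illustrative vocabulary).
TASKS = {"generate": 0, "edit": 1, "inpaint": 2, "sr": 3}
task_token = torch.tensor([TASKS["edit"]])
```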

---

## 2. Paper Shortlist & Critique

### A. Efficient Image Generation

| Paper | Problem Solved | What to Borrow | Limitations |
|-------|---------------|----------------|---------------|
| **SANA-Sprint** (2503.09641) | 1-step generation, 0.6B params | Linear DiT + DC-AE latent + sCM+LADD distillation | Text encoder dominates memory |
| **SnapGen** (2412.09619) | Mobile T2I, 0.38B, iPhone 15 | Remove SA from high-res, MQA, expanded separable conv | No public weights |
| **SnapGen++** (2601.08303) | 360ms/step iPhone, 0.4B | ASSA, elastic supernetwork, tiny VAE | Proprietary |
| **DreamLite** (2603.28713) | Mobile gen+edit unified | Spatial concat, task-progressive training | No public weights |

### B. Subquadratic Backbones

| Paper | Problem Solved | What to Borrow | Limitations |
|-------|---------------|----------------|---------------|
| **DiMSUM** (2411.04168) | Best FID with Mamba, 3Γ— faster convergence | Wavelet+Mamba, shared attention block | Complex implementation |
| **ZigMa** (2403.13802) | Spatial continuity for SSM | Zigzag-8 scan, heterogeneous layers | Only class-conditional |
| **LiT** (2501.12976) | Pure linear DiT | DWConv inside linear attn, weight inheritance | Small quality drop at low res |

### C. Compact Latent Spaces

| Paper | Problem Solved | What to Borrow | Limitations |
|-------|---------------|----------------|---------------|
| **DC-AE** (2410.10733) | 32-128Γ— compression | Residual space-to-channel shortcuts | High-channel needs bigger backbone |
| **TiTok** (2406.07550) | 32-128 1D tokens | Break 2D grid, proxy-code VQ | Resolution-fixed |

### D. Editing Patterns

| Paper | Problem Solved | What to Borrow | Limitations |
|-------|---------------|----------------|---------------|
| **DreamLite** (2603.28713) | Mobile gen+edit | Spatial concat (+14 GenEval vs channel) | Editing data at scale |
| **FLUX Kontext** (2506.15742) | Best editing quality | 3D RoPE offset, multi-reference | 12B, not mobile |
| **RIN** (2212.11972) | Decoupled computation | Latent tokens + cross-attn, self-cond | Pixel-space only |

---

## 3. Module-by-Module Design

### Module A: Deep Compression VAE (`microforge/vae.py`)

32Γ— spatial compression with space-to-channel residual shortcuts (DC-AE technique).

| Config | Channels | Latent C | Params | FP16 |
|--------|----------|----------|--------|------|
| tiny | [32,64,128,256] | 16 | 16M | 32 MB |
| small | [64,128,256,512] | 32 | 77M | 154 MB |
| base | [128,256,512,512] | 32 | 110M | 220 MB |
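
The space-to-channel shortcut is the load-bearing trick: each learned downsampling stage is mirrored by a non-parametric `pixel_unshuffle` path, so the residual route never loses information to learned layers. A minimal sketch of one encoder stage; the channel-grouping details here are an assumption, and DC-AE's exact scheme may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownShortcutBlock(nn.Module):
    """One 2x downsampling stage with a space-to-channel residual shortcut.

    The learned path halves H,W with a strided conv; the shortcut does the
    same non-parametrically via pixel_unshuffle + channel averaging, so an
    identity-like route always exists through the encoder.
    """
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
        assert (c_in * 4) % c_out == 0
        self.group = (c_in * 4) // c_out  # channels averaged per output channel

    def forward(self, x):
        # Non-parametric shortcut: [B,C,H,W] -> [B,4C,H/2,W/2] -> [B,c_out,H/2,W/2]
        s = F.pixel_unshuffle(x, 2)
        B, _, H, W = s.shape
        s = s.view(B, -1, self.group, H, W).mean(dim=2)
        return self.conv(x) + s

x = torch.randn(1, 32, 64, 64)
y = DownShortcutBlock(32, 64)(x)   # -> [1, 64, 32, 32]
```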

### Module B: SSM-Conv Hybrid Backbone (`microforge/backbone.py`)

Bidirectional SSM + local DWConv + one globally-shared MQA attention.

| Config | Depth | Dim | Params | FP16 |
|--------|-------|-----|--------|------|
| tiny | 6 | 256 | 8M | 16 MB |
| small | 12 | 384 | 29M | 58 MB |
| base | 18 | 512 | 71M | 142 MB |
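
Before the SSM can scan a 2D latent it must be flattened, and the scan order matters: a plain raster scan tears spatial neighbours apart at row boundaries. A minimal sketch of one zigzag ordering for the diagram's `BiSSM(zigzag scan)` step (the boustrophedon row scan; ZigMa's zigzag-8 cycles through eight such directions across layers):

```python
import torch

def zigzag_row_order(h: int, w: int) -> torch.Tensor:
    """Boustrophedon (snake) scan: even rows left-to-right, odd rows reversed.

    Returns a permutation of [0, h*w) such that consecutive sequence
    positions are always spatial neighbours.
    """
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)   # reverse every other row
    return idx.reshape(-1)

# Flatten a [B, C, H, W] latent into a [B, N, C] sequence for the SSM,
# scan in zigzag order, then undo the permutation afterwards.
B, C, H, W = 1, 256, 16, 16
z = torch.randn(B, C, H, W)
order = zigzag_row_order(H, W)
seq = z.flatten(2).transpose(1, 2)[:, order]   # [B, N, C] in zigzag order
inverse = torch.empty_like(order)
inverse[order] = torch.arange(order.numel())
restored = seq[:, inverse]                     # back to raster order
```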

### Module C: Recurrent Latent Planner (`microforge/planner.py`)

32 persistent plan tokens (~49 KB of state per plan). Per-layer cost is O(K² + K·N), versus O(N²) for full self-attention over the latent.
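
One planner layer is two attention calls: READ (plan tokens query the latent tokens, O(K·N)) and REASON (plan self-attention, O(K²)). A minimal sketch with illustrative dimensions; the real module also conditions on text and timestep:

```python
import torch
import torch.nn as nn

class PlannerLayer(nn.Module):
    """One RLP layer: READ then REASON. Names and dims are illustrative."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reason = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, plan, latent_tokens):
        # READ: queries are the K plan tokens, keys/values the N latent tokens
        p = self.norm1(plan)
        plan = plan + self.read(p, latent_tokens, latent_tokens)[0]
        # REASON: plan tokens exchange information among themselves
        p = self.norm2(plan)
        plan = plan + self.reason(p, p, p)[0]
        return plan

K, N, D = 32, 256, 384
plan = torch.randn(1, K, D)           # 32 x 384 fp32 values ~ 49 KB of state
latent = torch.randn(1, N, D)         # projected z_t tokens (16x16 grid)
plan = PlannerLayer(D)(plan, latent)  # updated plan, injected into the backbone
```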

### Module D: Text Encoder (pluggable)
- Mobile: TinyCLIP ~60M
- Quality: CLIP-L ~428M
- Best: Gemma-2-2B ~2B

---

## 4. Mathematical Formulation

**Rectified Flow**: z_t = (1-t)Β·z_0 + tΒ·Ξ΅

**Velocity target**: v* = Ξ΅ - z_0

**Training loss**: L = E[w(t) Β· ||v_ΞΈ(z_t, t, c) - v*||Β²] where w(t) = 1/(1+|2t-1|)

**Sampling**: z_{t-Δt} = z_t - Δt · v_θ(z_t, t, c)   (Euler step, integrating t from 1 to 0)

**Planner self-conditioning**: p_t = σ(w)·p_prev + (1-σ(w))·p_init(text), where p_prev is the plan carried over from the previous denoising step and σ(w) is a learned gate

**CFG**: vΜ‚ = v_βˆ… + sΒ·(v_c - v_βˆ…)
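
Putting the sampling and CFG formulas together, a minimal Euler sampler sketch; the model signature and the guidance scale are assumptions:

```python
import torch

@torch.no_grad()
def sample(model, z, cond, uncond, num_steps=4, cfg_scale=4.5):
    """Euler integration of the rectified flow from t=1 (noise) to t=0.

    model(z, t, c) is assumed to return the velocity v_theta; guidance
    follows the CFG formula above: v = v_null + s * (v_cond - v_null).
    """
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]   # dt > 0, stepping toward data
        v_c = model(z, t, cond)
        v_u = model(z, t, uncond)
        v = v_u + cfg_scale * (v_c - v_u)
        z = z - dt * v                     # z_{t-dt} = z_t - dt * v
    return z
```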

---

## 5. Training Objective

- **Stage 1 (VAE)**: L1 + Ξ»_KLΒ·KL + LPIPS + GAN
- **Stage 2-3 (Flow)**: w(t)Β·||v_ΞΈ - v*||Β²
- **Stage 4 (KD)**: L_flow + Ξ»_tΒ·Ξ±(t)Β·||v_student - v_teacher||Β²
- **Stage 5 (Edit)**: ||v_ΞΈ([z_t|z_src], t, c_edit) - v*||Β²
- **Stage 6 (Distill)**: ||f_θ(z_t, t) - f_{θ⁻}(z_t', t')||²
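
A minimal sketch of the Stage 2-3 loss, combining the interpolant, velocity target, and weighting from Section 4 (the model signature is an assumption):

```python
import torch

def flow_matching_loss(model, z0, text_cond):
    """Weighted velocity regression with w(t) = 1 / (1 + |2t - 1|)."""
    B = z0.shape[0]
    t = torch.rand(B, device=z0.device)          # t ~ U[0, 1]
    eps = torch.randn_like(z0)
    t_ = t.view(B, 1, 1, 1)
    z_t = (1 - t_) * z0 + t_ * eps               # rectified-flow interpolant
    v_star = eps - z0                            # velocity target
    v_pred = model(z_t, t, text_cond)
    w = 1.0 / (1.0 + (2 * t - 1).abs())          # emphasizes mid-trajectory t
    per_sample = (v_pred - v_star).pow(2).flatten(1).mean(1)
    return (w * per_sample).mean()
```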

---

## 6. Memory & Compute Budget

### Total System Memory (FP16, no text encoder)
- **Tiny**: ~76 MB inference @ 512px
- **Small**: ~308 MB inference @ 512px
- **Base**: ~530 MB inference @ 512px

With TinyCLIP (+120 MB), the tiny config totals roughly 196 MB, comfortably under 500 MB.
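
The weight footprints above follow directly from the parameter counts (2 bytes per fp16 parameter):

```python
def fp16_mb(params_millions: float) -> float:
    """fp16 weight footprint: 2 bytes per parameter -> 2 MB per 1M params."""
    return params_millions * 2

assert fp16_mb(28) == 56    # MicroForge-tiny: 28M params -> ~56 MB
assert fp16_mb(60) == 120   # TinyCLIP text encoder: ~60M params -> ~120 MB
```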

---

## 7. Training Curriculum (16 GB GPU)

| Stage | Freeze | Train | Data | Res | Steps | LR | Time (T4) |
|-------|--------|-------|------|-----|-------|----|-----------|
| 1. VAE | β€” | VAE | ImageNet-50K | 128β†’256 | 50K | 1e-4 | 6h |
| 2. Low-Res | VAE | Backbone+Plan | Synthetic 100K | 128β†’256 | 100K | 1e-4 | 12h |
| 3. High-Res | VAE | Backbone+Plan | Same+high-res | 256β†’512 | 50K | 5e-5 | 8h |
| 4. Distill | VAE | Backbone+Plan | Teacher cached | 512 | 30K | 2e-5 | 6h |
| 5. Edit | VAE | All (low LR) | IP2P+MagicBrush | 256β†’512 | 20K | 1e-5 | 4h |

---

## 8. Mobile Deployment

1. Step distill to 4 steps (consistency/LADD)
2. Export ONNX with static shapes
3. INT8 weight quantization
4. Convert to CoreML/NNAPI/QNN
5. Profile on-device
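
A sketch of steps 2-3, assuming the Quick Start modules; the dummy shapes, forward signature, opset, and filenames are illustrative:

```python
import torch
from microforge.backbone import MicroForgeBackbone
from onnxruntime.quantization import quantize_dynamic, QuantType

backbone = MicroForgeBackbone(latent_channels=16, config='tiny').eval()

# Static shapes on purpose: mobile runtimes prefer fixed tensor sizes.
dummy = (
    torch.randn(1, 16, 16, 16),   # z_t (tiny config at 512px)
    torch.tensor([0.5]),          # timestep t
    torch.randn(1, 77, 768),      # text embeddings (assumed [B, L, D])
)
torch.onnx.export(
    backbone, dummy, "microforge_backbone.onnx",
    input_names=["z_t", "t", "text_emb"], output_names=["v_pred"],
    opset_version=17,             # no dynamic_axes: shapes stay static
)

# Step 3: INT8 weight (dynamic) quantization via onnxruntime.
quantize_dynamic("microforge_backbone.onnx",
                 "microforge_backbone_int8.onnx",
                 weight_type=QuantType.QInt8)
```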

---

## 9. Failure Modes

| Failure | Fix |
|---------|-----|
| SSM scan artifacts | More scan directions + larger DWConv |
| Planner collapse | Diversity loss on plan tokens |
| VAE blur | Reduce Ξ»_KL + adversarial loss |
| Training instability | Grad clip=2.0 + separate SSM LR |
| Editing forgetting | Spatial concat + progressive training |
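
For the planner-collapse row, one plausible form of the diversity loss penalizes pairwise cosine similarity among plan tokens (a sketch, not necessarily the intended formulation):

```python
import torch
import torch.nn.functional as F

def plan_diversity_loss(plan: torch.Tensor) -> torch.Tensor:
    """Push the K plan tokens apart so they don't collapse onto one
    direction. plan: [B, K, D]."""
    p = F.normalize(plan, dim=-1)            # unit-norm tokens
    sim = p @ p.transpose(1, 2)              # [B, K, K] cosine similarities
    K = plan.shape[1]
    off_diag = sim - torch.eye(K, device=plan.device)
    return off_diag.pow(2).mean()            # drive off-diagonals toward 0

loss = plan_diversity_loss(torch.randn(2, 32, 384))
```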

---

## 10. Ablation Plan

| ID | Change | Expected |
|----|--------|----------|
| A1 | No Planner | FID 2-5% worse |
| A2 | Full attention (no SSM) | Better@256, worse@1024, 2-4× slower |
| A3 | No shared MQA | FID 1-3% worse |
| A4 | No DWConv in SSM | FID 2-4% worse |
| A5 | No self-conditioning | More step jitter |
| A6 | Full vs grouped adaLN | +46% params, marginal gain |
| A7 | f16 vs f32 vs f64 VAE | f32 sweet spot |
| A8 | Spatial vs channel concat | Spatial preserves gen quality |

---

## 11. Editing Roadmap

- βœ… Phase 1: Architecture supports spatial concatenation
- Phase 2: Image editing (InstructPix2Pix data)
- Phase 3: Inpainting (masked spatial concat; see the sketch after this list)
- Phase 4: Super-resolution
- Phase 5: Style/reference (add IP-Adapter, +22M params)
- Phase 6: Local editing (region-aware planner)
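
Phase 3's masked spatial concat can reuse the Phase 1 input layout without new parameters; one parameter-free encoding (an assumption, the README does not fix it) zeroes the masked region of the source half and lets the [Inpaint] task token cue the backbone:

```python
import torch

B, C, H, W = 1, 16, 16, 16
z_t = torch.randn(B, C, H, W)        # noised target latent
z_src = torch.randn(B, C, H, W)      # latent of the image to inpaint
mask = torch.zeros(B, 1, H, W)
mask[..., 4:12, 4:12] = 1.0          # 1 = region to regenerate

# Masked content is zeroed; the backbone, cued by the [Inpaint] task
# token, regenerates only there while copying the unmasked context.
z_src_masked = z_src * (1 - mask)                  # broadcast over C
z_input = torch.cat([z_t, z_src_masked], dim=-1)   # [B, C, H, 2W]
```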

---

## 12. Quick Start

```python
import torch
from microforge.vae import MicroForgeVAE
from microforge.backbone import MicroForgeBackbone
from microforge.planner import RecurrentLatentPlanner
from microforge.pipeline import MicroForgePipeline, SimpleTextEncoder

vae = MicroForgeVAE(config='tiny')  # 32x compression, 16-channel latent
backbone = MicroForgeBackbone(latent_channels=16, config='tiny')  # must match the VAE latent
planner = RecurrentLatentPlanner(num_plan_tokens=16, dim=256, text_dim=768, latent_channels=16)
text_enc = SimpleTextEncoder(embed_dim=768, num_layers=2)  # toy stand-in; swap in TinyCLIP for quality
pipeline = MicroForgePipeline(vae, backbone, text_enc, planner)

tokens = torch.randint(0, 8192, (1, 10))  # dummy token ids for a smoke test
images = pipeline.text2img(tokens, height=256, width=256, num_steps=4)  # 4-step Euler sampling
```

---

## License

MIT License

## Citation

```bibtex
@software{microforge2025,
  title={MicroForge: Mobile-First Image Generation with Recurrent Latent Planning},
  year={2025},
  url={https://huggingface.co/asdf98/microforge}
}
```