# MicroForge: A Novel Mobile-First Image Generation Architecture

> **Recurrent Latent Planning × SSM-Conv Hybrid Backbone × Deep Compression**

MicroForge is a new image generation architecture designed from scratch for consumer devices (3-4 GB RAM) and trainable on a single 16 GB GPU. It combines ideas from recent efficiency research into a compact, editing-ready system.

**Key numbers:**
- **MicroForge-tiny**: 28M params, ~56 MB fp16, ~0.13s/image on CPU
- **MicroForge-small**: 114M params, ~228 MB fp16
- **MicroForge-base**: 193M params, ~386 MB fp16
- **Editing-ready**: the same backbone handles generation, editing, inpainting, and super-resolution

---

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Paper Shortlist & Critique](#2-paper-shortlist--critique)
3. [Module-by-Module Design](#3-module-by-module-design)
4. [Mathematical Formulation](#4-mathematical-formulation)
5. [Training Objective](#5-training-objective)
6. [Memory & Compute Budget](#6-memory--compute-budget)
7. [Training Curriculum](#7-training-curriculum-16-gb-gpu)
8. [Mobile Deployment Plan](#8-mobile-deployment-plan)
9. [Failure Mode Analysis](#9-failure-mode-analysis)
10. [Ablation Plan](#10-ablation-plan)
11. [Editing Roadmap](#11-editing-roadmap)
12. [Quick Start](#12-quick-start)

---
## 1. Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                       MicroForge Pipeline                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Text ──▶ [Text Encoder (CLIP/TinyCLIP)] ──▶ text_emb, pooled    │
│                         │                                       │
│                         ▼                                       │
│ Noise z_T ──▶ [Recurrent Latent Planner]                        │
│                 │ K=32 plan tokens (49 KB state)                │
│                 │ READ: cross-attn(plan, z_t) → O(K·N)          │
│                 │ REASON: self-attn(plan) → O(K²)               │
│                 │ Self-condition from previous step             │
│                 ▼                                               │
│ z_t ──▶ [SSM-Conv Hybrid Backbone] ◀── planner_tokens           │
│           │ Per block (×6/12/18):                               │
│           │ 1. AdaLN-Group(z_t, t_emb + text_pool)              │
│           │ 2. BiSSM(zigzag scan) → O(N)                        │
│           │ 3. CrossAttn(z_t, text_emb ∥ plan) → O(N·M)         │
│           │ 4. FFN(expansion=3) → O(N·D)                        │
│           │ Every K blocks: SharedMQA(z_t) ← single instance    │
│           ▼                                                     │
│ v_pred = backbone(z_t, t, text, plan)                           │
│ z_{t-Δt} = z_t - Δt · v_pred (Euler ODE step)                   │
│                                                                 │
│ z_0 ──▶ [DC-VAE Decoder (32× upsample)] ──▶ Image [3,H,W]       │
│                                                                 │
│ ┌─── Editing Mode (same backbone) ───────────────────────┐      │
│ │ z_input = [z_target_noise ∥ z_source] (width-concat)   │      │
│ │ Task token: [Generate] / [Edit] / [Inpaint] / [SR]     │      │
│ │ No extra parameters needed                             │      │
│ └─────────────────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────┘
```

### What's Novel

1. **Recurrent Latent Planner (RLP)**: Persistent latent tokens that carry "memory" across denoising steps. The planner reasons at a higher level before the backbone commits to pixel changes. Inspired by RIN (Jabri et al., 2022) but adapted for diffusion: plan tokens READ from the noised latent, REASON internally via self-attention, then inject guidance into the backbone via cross-attention. Self-conditioning carries plan state across steps (a minimal sketch follows this list).

2. **SSM-Conv Hybrid Backbone**: Replaces O(N²) self-attention with bidirectional SSM scanning (O(N)) plus a local DWConv. One globally shared lightweight MQA attention block provides in-context learning capability. This hybrid achieves the global receptive field of attention with linear complexity.

3. **Deep Compression VAE with Residual Shortcuts**: 32× spatial compression using space-to-channel rearrangement as non-parametric skip connections. 512px → 16×16×32 latent = only 256 spatial tokens (vs 4096 in SD-VAE).

4. **Editing by Design**: DreamLite-style spatial concatenation enables generation, editing, inpainting, and super-resolution with zero extra parameters. The same backbone processes all tasks.
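
Below is a minimal sketch of one planner READ/REASON step in plain PyTorch. The class and attribute names (`PlannerStepSketch`, `read_attn`, `reason_attn`) are illustrative assumptions, not the actual `microforge.planner` API; the sketch only shows the attention pattern and the self-conditioning hand-off.

```python
import torch
import torch.nn as nn

class PlannerStepSketch(nn.Module):
    """One READ -> REASON planner step (illustrative names, not the real API)."""
    def __init__(self, dim=256, num_tokens=32, heads=4):
        super().__init__()
        self.plan_init = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        self.read_attn = nn.MultiheadAttention(dim, heads, batch_first=True)    # O(K·N)
        self.reason_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # O(K²)
        self.norm_read = nn.LayerNorm(dim)
        self.norm_reason = nn.LayerNorm(dim)

    def forward(self, latent_tokens, prev_plan=None):
        # Self-conditioning: start from the previous step's plan when available.
        if prev_plan is None:
            prev_plan = self.plan_init.expand(latent_tokens.size(0), -1, -1)
        # READ: K plan tokens attend to the N noised latent tokens.
        read, _ = self.read_attn(self.norm_read(prev_plan), latent_tokens, latent_tokens)
        plan = prev_plan + read
        # REASON: plan tokens refine each other via self-attention.
        h = self.norm_reason(plan)
        reason, _ = self.reason_attn(h, h, h)
        return plan + reason

# 256 latent tokens (16x16 grid at 512px with 32x compression), K=32 plan tokens.
planner = PlannerStepSketch()
plan = planner(torch.randn(1, 256, 256))   # -> [1, 32, 256]
```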

---

## 2. Paper Shortlist & Critique

### A. Efficient Image Generation

| Paper | Problem Solved | What to Borrow | Failure Modes |
|-------|---------------|----------------|---------------|
| **SANA-Sprint** (2503.09641) | 1-step generation, 0.6B params | Linear DiT + DC-AE latent + sCM+LADD distillation | Text encoder dominates memory |
| **SnapGen** (2412.09619) | Mobile T2I, 0.38B, iPhone 15 | Remove self-attention at high res, MQA, expanded separable conv | No public weights |
| **SnapGen++** (2601.08303) | 360 ms/step on iPhone, 0.4B | ASSA, elastic supernetwork, tiny VAE | Proprietary |
| **DreamLite** (2603.28713) | Unified mobile generation + editing | Spatial concat, task-progressive training | No public weights |

### B. Subquadratic Backbones

| Paper | Problem Solved | What to Borrow | Failure Modes |
|-------|---------------|----------------|---------------|
| **DiMSUM** (2411.04168) | Best FID with Mamba, 3× faster convergence | Wavelet + Mamba, shared attention block | Complex implementation |
| **ZigMa** (2403.13802) | Spatial continuity for SSMs | Zigzag-8 scan, heterogeneous layers | Only class-conditional |
| **LiT** (2501.12976) | Pure linear DiT | DWConv inside linear attention, weight inheritance | Small quality drop at low res |

### C. Compact Latent Spaces

| Paper | Problem Solved | What to Borrow | Failure Modes |
|-------|---------------|----------------|---------------|
| **DC-AE** (2410.10733) | 32-128× compression | Residual space-to-channel shortcuts | High channel counts need a bigger backbone |
| **TiTok** (2406.07550) | 32-128 1D tokens | Breaks the 2D grid, proxy-code VQ | Resolution-fixed |

### D. Editing Patterns

| Paper | Problem Solved | What to Borrow | Failure Modes |
|-------|---------------|----------------|---------------|
| **DreamLite** (2603.28713) | Mobile generation + editing | Spatial concat (+14 GenEval vs channel concat) | Editing data at scale |
| **FLUX Kontext** (2506.15742) | Best editing quality | 3D RoPE offset, multi-reference | 12B params, not mobile |
| **RIN** (2212.11972) | Decoupled computation | Latent tokens + cross-attention, self-conditioning | Pixel-space only |

---

## 3. Module-by-Module Design

### Module A: Deep Compression VAE (`microforge/vae.py`)

32× spatial compression with space-to-channel residual shortcuts (the DC-AE technique); a minimal sketch of the shortcut follows the table.

| Config | Channels | Latent C | Params | FP16 |
|--------|----------|----------|--------|------|
| tiny | [32,64,128,256] | 16 | 16M | 32 MB |
| small | [64,128,256,512] | 32 | 77M | 154 MB |
| base | [128,256,512,512] | 32 | 110M | 220 MB |
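
A minimal sketch of the space-to-channel residual shortcut, assuming the DC-AE recipe of pairing a strided conv with a non-parametric `pixel_unshuffle` plus channel-group averaging. `DownBlockWithShortcut` is an illustrative name, not the `microforge.vae` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownBlockWithShortcut(nn.Module):
    """2x downsample where the residual path is non-parametric:
    space-to-channel (pixel_unshuffle) followed by channel-group averaging."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
        assert (4 * c_in) % c_out == 0, "4*c_in must be divisible by c_out"
        self.group = (4 * c_in) // c_out   # input channels averaged per output channel

    def forward(self, x):
        learned = self.conv(x)                        # parametric path
        s2c = F.pixel_unshuffle(x, 2)                 # [B,C,H,W] -> [B,4C,H/2,W/2]
        b, c, h, w = s2c.shape
        shortcut = s2c.view(b, -1, self.group, h, w).mean(dim=2)  # -> [B,c_out,H/2,W/2]
        return learned + shortcut                     # residual space-to-channel skip

x = torch.randn(1, 32, 64, 64)
y = DownBlockWithShortcut(32, 64)(x)   # -> [1, 64, 32, 32]
```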

### Module B: SSM-Conv Hybrid Backbone (`microforge/backbone.py`)

Bidirectional SSM + local DWConv + one globally shared MQA attention block; see the block sketch after the table.

| Config | Depth | Dim | Params | FP16 |
|--------|-------|-----|--------|------|
| tiny | 6 | 256 | 8M | 16 MB |
| small | 12 | 384 | 29M | 58 MB |
| base | 18 | 512 | 71M | 142 MB |
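
A sketch of one hybrid block. A deliberately naive bidirectional scan (a per-channel exponential moving average run in both directions, via a Python loop far too slow for production) stands in for the real zigzag SSM, and the shared MQA block is omitted; all class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiSSMSketch(nn.Module):
    """Toy bidirectional diagonal SSM (h_t = a*h_{t-1} + x_t) plus local DWConv."""
    def __init__(self, dim):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(dim))            # per-channel decay a
        self.dwconv = nn.Conv1d(dim, dim, 3, padding=1, groups=dim)  # local mixing
        self.out = nn.Linear(2 * dim, dim)

    def scan(self, x):                       # x: [B, N, D]; O(N) recurrence
        a = torch.sigmoid(self.decay_logit)
        h, outs = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):
            h = a * h + x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, x):
        fwd = self.scan(x)                   # forward scan
        bwd = self.scan(x.flip(1)).flip(1)   # backward scan
        y = self.out(torch.cat([fwd, bwd], dim=-1))
        return y + self.dwconv(x.transpose(1, 2)).transpose(1, 2)

class HybridBlockSketch(nn.Module):
    """AdaLN-Group -> BiSSM -> CrossAttn(text and plan) -> FFN(expansion=3)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.ssm = BiSSMSketch(dim)
        self.xattn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 3 * dim), nn.GELU(), nn.Linear(3 * dim, dim))
        self.mod = nn.Linear(dim, 2 * dim)   # scale/shift from t_emb + text_pool
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, z, cond, ctx):         # z:[B,N,D], cond:[B,D], ctx:[B,M,D]
        scale, shift = self.mod(cond).chunk(2, dim=-1)
        h = self.norms[0](z) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        z = z + self.ssm(h)                                  # linear-time global mixing
        z = z + self.xattn(self.norms[1](z), ctx, ctx)[0]    # read text ∥ plan tokens
        return z + self.ffn(self.norms[2](z))

block = HybridBlockSketch()
z = block(torch.randn(1, 256, 256), torch.randn(1, 256), torch.randn(1, 40, 256))
```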

### Module C: Recurrent Latent Planner (`microforge/planner.py`)

32 persistent plan tokens (49 KB of state per image); cost is O(K² + K·N) per layer. See the sketch under "What's Novel" above.

### Module D: Text Encoder (pluggable)
- Mobile: TinyCLIP, ~60M params
- Quality: CLIP-L, ~428M params
- Best: Gemma-2-2B, ~2B params

---

## 4. Mathematical Formulation

**Rectified flow**: z_t = (1-t)·z_0 + t·ε

**Velocity target**: v* = ε - z_0

**Training loss**: L = E[w(t) · ||v_θ(z_t, t, c) - v*||²] where w(t) = 1/(1+|2t-1|)

**Sampling**: z_{t-Δt} = z_t - Δt · v_θ(z_t, t, c) (Euler step, integrating t from 1 down to 0)

**Planner self-conditioning**: p_t = σ(w)·p_{t+1} + (1-σ(w))·p_init(text), blending the plan from the previous, noisier step with a fresh text-initialized plan

**CFG**: v̂ = v_∅ + s·(v_c - v_∅)
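
The sampling and CFG equations above, written as a minimal Euler loop. `v_theta(z, t, cond)` is a hypothetical velocity-model signature used only for illustration:

```python
import torch

@torch.no_grad()
def euler_sample(v_theta, cond, uncond, shape, num_steps=4, cfg_scale=4.5):
    """Rectified-flow Euler sampler with classifier-free guidance."""
    z = torch.randn(shape)                       # z_1 ~ N(0, I); t runs 1 -> 0
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, dt = ts[i], ts[i] - ts[i + 1]         # dt > 0
        v_c = v_theta(z, t, cond)                # conditional velocity
        v_u = v_theta(z, t, uncond)              # unconditional velocity
        v = v_u + cfg_scale * (v_c - v_u)        # CFG: v̂ = v_∅ + s·(v_c - v_∅)
        z = z - dt * v                           # z_{t-Δt} = z_t - Δt·v̂
    return z                                     # ≈ z_0; decode with the VAE
```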

---

## 5. Training Objective

- **Stage 1 (VAE)**: L1 + λ_KL·KL + LPIPS + GAN
- **Stages 2-3 (Flow)**: w(t)·||v_θ - v*||² (sketched below)
- **Stage 4 (KD)**: L_flow + λ_t·α(t)·||v_student - v_teacher||²
- **Stage 5 (Edit)**: ||v_θ([z_t ∥ z_src], t, c_edit) - v*||²
- **Stage 6 (Distill)**: ||f_θ(z_t, t) - f_{θ⁻}(z_t', t')||²
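
A minimal sketch of the Stage 2-3 flow-matching objective with the w(t) weighting from Section 4; `v_theta` is again a hypothetical stand-in for the backbone call:

```python
import torch

def flow_loss(v_theta, z0, cond):
    """Weighted rectified-flow loss: E[w(t)·||v_θ(z_t, t, c) - (ε - z_0)||²]."""
    b = z0.size(0)
    t = torch.rand(b, device=z0.device)       # t ~ U(0, 1)
    t_ = t.view(b, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = (1 - t_) * z0 + t_ * eps            # rectified-flow interpolation
    v_star = eps - z0                         # velocity target v*
    w = 1.0 / (1.0 + (2 * t - 1).abs())       # w(t): emphasizes mid-schedule t
    err = (v_theta(z_t, t, cond) - v_star).pow(2).mean(dim=(1, 2, 3))
    return (w * err).mean()
```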

---

## 6. Memory & Compute Budget

### Total System Memory (FP16, no text encoder)
- **Tiny**: ~76 MB inference @ 512px
- **Small**: ~308 MB inference @ 512px
- **Base**: ~530 MB inference @ 512px

With TinyCLIP (+120 MB), the tiny config stays under 500 MB.

---

## 7. Training Curriculum (16 GB GPU)

| Stage | Freeze | Train | Data | Res | Steps | LR | Time (T4) |
|-------|--------|-------|------|-----|-------|----|-----------|
| 1. VAE | ✗ | VAE | ImageNet-50K | 128→256 | 50K | 1e-4 | 6h |
| 2. Low-Res | VAE | Backbone+Planner | Synthetic 100K | 128→256 | 100K | 1e-4 | 12h |
| 3. High-Res | VAE | Backbone+Planner | Same + high-res | 256→512 | 50K | 5e-5 | 8h |
| 4. Distill | VAE | Backbone+Planner | Teacher cached | 512 | 30K | 2e-5 | 6h |
| 5. Edit | VAE | All (low LR) | IP2P+MagicBrush | 256→512 | 20K | 1e-5 | 4h |

---

## 8. Mobile Deployment Plan

1. Step-distill to 4 steps (consistency/LADD)
2. Export to ONNX with static shapes (sketched below)
3. INT8 weight quantization
4. Convert to CoreML/NNAPI/QNN
5. Profile on-device
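
A sketch of step 2, assuming the Quick Start objects below and that the backbone's forward signature is `(z_t, t, text_emb, plan)`; the input names and shapes are illustrative for the tiny config (one 16-channel 16×16 latent, i.e. 512px at 32× compression):

```python
import torch

backbone.eval()  # backbone from the Quick Start section
dummy_inputs = (
    torch.randn(1, 16, 16, 16),   # z_t: [B, C_latent, 16, 16]
    torch.tensor([0.5]),          # timestep t
    torch.randn(1, 77, 768),      # text_emb (assumed CLIP-style length 77)
    torch.randn(1, 32, 256),      # K=32 plan tokens
)
torch.onnx.export(
    backbone, dummy_inputs, "microforge_backbone.onnx",
    input_names=["z_t", "t", "text_emb", "plan"],
    output_names=["v_pred"],
    opset_version=17,
    # no dynamic_axes -> every shape is baked in as static
)
```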

---

## 9. Failure Mode Analysis

| Failure | Fix |
|---------|-----|
| SSM scan artifacts | More scan directions + larger DWConv |
| Planner collapse | Diversity loss on plan tokens |
| VAE blur | Reduce λ_KL + adversarial loss |
| Training instability | Grad clip = 2.0 + separate SSM LR |
| Editing forgetting | Spatial concat + progressive training |

---

## 10. Ablation Plan

| ID | Change | Expected |
|----|--------|----------|
| A1 | No planner | FID 2-5% worse |
| A2 | Full attention (no SSM) | Better @256, worse @1024, 2-4× slower |
| A3 | No shared MQA | FID 1-3% worse |
| A4 | No DWConv in SSM | FID 2-4% worse |
| A5 | No self-conditioning | More step-to-step jitter |
| A6 | Full vs grouped AdaLN | +46% params, marginal gain |
| A7 | f16 vs f32 vs f64 VAE | f32 is the sweet spot |
| A8 | Spatial vs channel concat | Spatial preserves generation quality |

---

## 11. Editing Roadmap

- ✅ Phase 1: Architecture supports spatial concatenation (sketched below)
- Phase 2: Image editing (InstructPix2Pix data)
- Phase 3: Inpainting (masked spatial concat)
- Phase 4: Super-resolution
- Phase 5: Style/reference (add IP-Adapter, +22M params)
- Phase 6: Local editing (region-aware planner)
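
A sketch of the Phase 1 mechanism: the noised target latent and the clean source latent are concatenated along width, and a task embedding switches modes with no extra backbone parameters. All tensor and variable names here are illustrative, not the pipeline API:

```python
import torch
import torch.nn as nn

# Width-concatenate noised target and clean source latents (tiny config shapes).
z_target_noise = torch.randn(1, 16, 16, 16)    # latent being denoised [B, C, H, W]
z_source = torch.randn(1, 16, 16, 16)          # clean latent of the image to edit
z_input = torch.cat([z_target_noise, z_source], dim=-1)   # -> [1, 16, 16, 32]

# One embedding per task keeps the backbone itself unchanged across tasks.
task_emb = nn.Embedding(4, 256)                # [Generate, Edit, Inpaint, SR]
task_token = task_emb(torch.tensor([1]))       # select [Edit] -> [1, 256]
```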

---

## 12. Quick Start

```python
import torch
from microforge.vae import MicroForgeVAE
from microforge.backbone import MicroForgeBackbone
from microforge.planner import RecurrentLatentPlanner
from microforge.pipeline import MicroForgePipeline, SimpleTextEncoder

# Tiny config: 16-channel latents, 256-dim backbone, K=32 plan tokens.
vae = MicroForgeVAE(config='tiny')
backbone = MicroForgeBackbone(latent_channels=16, config='tiny')
planner = RecurrentLatentPlanner(num_plan_tokens=32, dim=256, text_dim=768, latent_channels=16)
text_enc = SimpleTextEncoder(embed_dim=768, num_layers=2)
pipeline = MicroForgePipeline(vae, backbone, text_enc, planner)

# Dummy token ids; use a real tokenizer for actual prompts.
tokens = torch.randint(0, 8192, (1, 10))
images = pipeline.text2img(tokens, height=256, width=256, num_steps=4)
```

---

## License

MIT License

## Citation

```bibtex
@software{microforge2025,
  title={MicroForge: Mobile-First Image Generation with Recurrent Latent Planning},
  year={2025},
  url={https://huggingface.co/asdf98/microforge}
}
```