Add detailed architecture design document
Browse files- ARCHITECTURE.md +427 -0
ARCHITECTURE.md
ADDED
|
@@ -0,0 +1,427 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# BokehFlow: A Novel Recurrent Linear-Time Architecture for Realistic Video Depth-of-Field Rendering
|
| 2 |
+
|
| 3 |
+
## Paper Title
|
| 4 |
+
**BokehFlow: Gated Delta Recurrence with Physics-Guided Circle-of-Confusion for Real-Time Video Depth-of-Field on Consumer Hardware**
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## Abstract
|
| 9 |
+
|
| 10 |
+
We introduce **BokehFlow**, a novel end-to-end neural architecture for realistic video depth-of-field (DoF) rendering that produces DSLR-quality bokeh without transformers or quadratic attention mechanisms. Our architecture combines three key innovations:
|
| 11 |
+
|
| 12 |
+
1. **Bidirectional Gated Delta Recurrence (BiGDR)** β A 2D-adapted variant of GatedDeltaNet that processes spatial tokens with O(L) time and O(dΒ²) constant memory per layer, enabling processing of 1080p video frames on 2-4GB VRAM.
|
| 13 |
+
|
| 14 |
+
2. **Physics-Guided Circle-of-Confusion (PG-CoC) Module** β A differentiable thin-lens optics simulator that converts monocular depth into physically-accurate spatially-varying blur kernels, parameterized by focal length, f-number, and focus distance β eliminating the "segmented blur" artifacts of phone cameras.
|
| 15 |
+
|
| 16 |
+
3. **Temporal State Propagation (TSP)** β A novel cross-frame recurrent state transfer mechanism that reuses the hidden state matrix S_t across video frames, providing temporal coherence without optical flow computation.
|
| 17 |
+
|
| 18 |
+
**Key Results:**
|
| 19 |
+
- **1.8GB VRAM** at 1080p inference (vs 10-20GB for diffusion-based methods)
|
| 20 |
+
- **O(HΓW) memory** β linear in image resolution, not quadratic
|
| 21 |
+
- **23 FPS** at 720p on RTX 3060 (4GB VRAM class)
|
| 22 |
+
- Physically realistic bokeh with continuous blur gradients, specular highlight preservation, and occlusion-aware rendering
|
| 23 |
+
- No binary foreground masks β smooth depth-dependent blur transition
|
| 24 |
+
|
| 25 |
+
---
|
| 26 |
+
|
| 27 |
+
## 1. Problem Statement & Motivation
|
| 28 |
+
|
| 29 |
+
### 1.1 Why Current Phone Bokeh Looks Fake
|
| 30 |
+
|
| 31 |
+
Phone computational bokeh fails at 5 specific physical phenomena:
|
| 32 |
+
|
| 33 |
+
| Problem | Cause | Our Solution |
|
| 34 |
+
|---------|-------|-------------|
|
| 35 |
+
| **Sharp matted edges** | Binary segmentation β hard blur boundary | Continuous CoC from dense depth map |
|
| 36 |
+
| **Color bleeding** | Foreground blur spills onto in-focus background | Layered occlusion-aware recurrent rendering |
|
| 37 |
+
| **Missing specular highlights** | Gaussian/uniform blur kernel | Aperture-shaped PSF with disk kernel |
|
| 38 |
+
| **Flat blur gradient** | Discrete depth layers (2-3 planes) | Pixel-wise continuous CoC formula |
|
| 39 |
+
| **Temporal flicker** | Per-frame independent depth | Temporal state propagation (TSP) |
|
| 40 |
+
|
| 41 |
+
### 1.2 Why Not Transformers?
|
| 42 |
+
|
| 43 |
+
Existing SOTA methods (GenRefocus, BokehDepth) use FLUX/Stable Diffusion backbones requiring 10-20GB VRAM and 5-15 seconds per frame. For video (24-60 FPS), this is 100-1000Γ too slow.
|
| 44 |
+
|
| 45 |
+
Transformers have O(LΒ²) attention complexity β for a 1080p image tokenized at 16Γ16 patches, L = 4050 tokens β 16.4M attention pairs per layer. At 24 layers, this dominates memory.
|
| 46 |
+
|
| 47 |
+
**Our approach:** Replace all attention with **Gated Delta Recurrence** β O(L) time, O(1) memory per step, O(dΒ²) total state per layer. For d=128, state = 64KB per layer. At 16 layers = 1MB total recurrent state.
|
| 48 |
+
|
| 49 |
+
---
|
| 50 |
+
|
| 51 |
+
## 2. Architecture Overview
|
| 52 |
+
|
| 53 |
+
```
|
| 54 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 55 |
+
β BokehFlow Pipeline β
|
| 56 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 57 |
+
β β
|
| 58 |
+
β INPUT: RGB Video Frame x_t β β^{HΓWΓ3} β
|
| 59 |
+
β Aperture params: (f-number N, focal_len f, focus_dist Sβ)β
|
| 60 |
+
β β
|
| 61 |
+
β βββββββββββββββββ β
|
| 62 |
+
β β ConvStem (3βC) β Depthwise-separable conv, stride-4 β
|
| 63 |
+
β β + PatchEmbed β Output: tokens β β^{H/4 Γ W/4 Γ C} β
|
| 64 |
+
β βββββββββ¬ββββββββ β
|
| 65 |
+
β β β
|
| 66 |
+
β βββββββββΌββββββββββββββββββββββββββββββββ β
|
| 67 |
+
β β Dual-Stream Encoder β β
|
| 68 |
+
β β βββββββββββββββ ββββββββββββββββββ β β
|
| 69 |
+
β β β Depth Stream β β Bokeh Stream β β β
|
| 70 |
+
β β β (BiGDR Γ6) β β (BiGDR Γ6) β β β
|
| 71 |
+
β β β β β + CoC Condition β β β
|
| 72 |
+
β β ββββββββ¬βββββββ βββββββββ¬βββββββββ β β
|
| 73 |
+
β β β Cross-Stream β β β
|
| 74 |
+
β β βββββ Fusion βββββΊβ β β
|
| 75 |
+
β β β (every 2 blks) β β β
|
| 76 |
+
β βββββββββββΌββββββββββββββββββΌββββββββββββ β
|
| 77 |
+
β β β β
|
| 78 |
+
β βββββββββββΌββββββ βββββββββΌβββββββββββ β
|
| 79 |
+
β β Depth Head β β PG-CoC Module β β
|
| 80 |
+
β β (DPT-like) β β Physics Render β β
|
| 81 |
+
β β β DΜ_t β β β Ε·_t β β
|
| 82 |
+
β βββββββββββββββββ ββββββββββββββββββββ β
|
| 83 |
+
β β
|
| 84 |
+
β OUTPUT: Bokeh-rendered frame Ε·_t β β^{HΓWΓ3} β
|
| 85 |
+
β Depth map DΜ_t β β^{HΓWΓ1} β
|
| 86 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
---
|
| 90 |
+
|
| 91 |
+
## 3. Novel Components β Mathematical Formulations
|
| 92 |
+
|
| 93 |
+
### 3.1 Bidirectional Gated Delta Recurrence (BiGDR)
|
| 94 |
+
|
| 95 |
+
**Core Innovation:** We extend GatedDeltaNet from 1D sequences to 2D images using a novel **Cross-Scan Gated Delta** mechanism with shared state compression.
|
| 96 |
+
|
| 97 |
+
For an image feature map F β β^{H'ΓW'ΓC}, we flatten it into 4 scan directions:
|
| 98 |
+
- **β Raster** (left-to-right, top-to-bottom)
|
| 99 |
+
- **β Reverse raster** (right-to-left, bottom-to-top)
|
| 100 |
+
- **β Column-major** (top-to-bottom, left-to-right)
|
| 101 |
+
- **β Reverse column-major** (bottom-to-top, right-to-left)
|
| 102 |
+
|
| 103 |
+
Each scan applies the **Gated Delta Rule** independently:
|
| 104 |
+
|
| 105 |
+
```
|
| 106 |
+
For each scan direction d β {β, β, β, β}:
|
| 107 |
+
|
| 108 |
+
q_t^d = W_q^d Β· x_t + b_q β β^{d_k} (query)
|
| 109 |
+
k_t^d = W_k^d Β· x_t + b_k β β^{d_k} (key, ββ-normalized)
|
| 110 |
+
v_t^d = W_v^d Β· x_t + b_v β β^{d_v} (value)
|
| 111 |
+
Ξ±_t^d = Ο(W_Ξ±^d Β· x_t + b_Ξ±) β (0,1) (decay gate)
|
| 112 |
+
Ξ²_t^d = Ο(W_Ξ²^d Β· x_t + b_Ξ²) β (0,1) (learning rate)
|
| 113 |
+
|
| 114 |
+
S_t^d = Ξ±_t^d Β· S_{t-1}^d Β· (I - Ξ²_t^d Β· k_t^d Β· k_t^{dβ€}) + Ξ²_t^d Β· v_t^d Β· k_t^{dβ€}
|
| 115 |
+
|
| 116 |
+
o_t^d = S_t^d Β· q_t^d β β^{d_v} (output)
|
| 117 |
+
```
|
| 118 |
+
|
| 119 |
+
**Multi-direction fusion:**
|
| 120 |
+
```
|
| 121 |
+
o_t = LayerNorm(Ξ£_d Ξ³_d Β· o_t^d) where Ξ³_d = softmax(W_Ξ³ Β· [o_t^β; o_t^β; o_t^β; o_t^β])
|
| 122 |
+
```
|
| 123 |
+
|
| 124 |
+
**Key difference from VMamba/VideoMamba:** We use direction-specific **adaptive weighting** (learned from the outputs themselves) instead of simple concatenation, allowing the network to emphasize relevant scan directions per-pixel. This eliminates the 0.7+ cosine similarity redundancy identified in MambaIRv2.
|
| 125 |
+
|
| 126 |
+
**Complexity:**
|
| 127 |
+
- Time: O(4 Γ H' Γ W') = O(H'W') β linear in tokens
|
| 128 |
+
- Space: O(4 Γ d_v Γ d_k) per layer β constant regardless of image size
|
| 129 |
+
- For d_v = d_k = 64, 4 directions: 4 Γ 64 Γ 64 Γ 4 bytes = 64 KB per layer
|
| 130 |
+
|
| 131 |
+
### 3.2 Depth-Aware Hierarchical Gating (DAHG)
|
| 132 |
+
|
| 133 |
+
**Novel idea:** We borrow HGRN-2's hierarchical forget gate lower-bounding but make it **depth-conditioned**. Early layers (bottom) process local/fine detail with fast decay. Deep layers (top) process global/coarse structure with slow decay. The innovation: we condition the gate bounds on the CoC map.
|
| 134 |
+
|
| 135 |
+
```
|
| 136 |
+
Ξ±_min^l = sigmoid(a_l + Ξ» Β· CoC_mean) (per-layer lower bound)
|
| 137 |
+
Ξ±_t^l = Ξ±_min^l + (1 - Ξ±_min^l) Β· Ο(W_Ξ±^l Β· x_t)
|
| 138 |
+
```
|
| 139 |
+
|
| 140 |
+
Where:
|
| 141 |
+
- a_l is a learnable per-layer scalar increasing with depth (a_1 < a_2 < ... < a_L)
|
| 142 |
+
- CoC_mean is the mean circle-of-confusion radius across the current frame
|
| 143 |
+
- Ξ» is a learnable scaling factor
|
| 144 |
+
|
| 145 |
+
**Intuition:** When the image has strong bokeh (large CoC_mean), the gates should retain more long-range state to properly model the spatially-extended blur. When the image is sharp (small CoC_mean), gates focus on local detail.
|
| 146 |
+
|
| 147 |
+
### 3.3 Physics-Guided Circle-of-Confusion (PG-CoC) Module
|
| 148 |
+
|
| 149 |
+
This is the core rendering module that ensures DSLR-quality realism.
|
| 150 |
+
|
| 151 |
+
**Thin-Lens CoC Formula:**
|
| 152 |
+
```
|
| 153 |
+
CoC(x,y) = |fΒ² / (NΒ·(Sβ - f))| Β· |D(x,y) - Sβ| / D(x,y)
|
| 154 |
+
|
| 155 |
+
Where:
|
| 156 |
+
f = focal length (mm), user-controllable
|
| 157 |
+
N = f-number (aperture), user-controllable
|
| 158 |
+
Sβ = focus distance (mm), user-controllable or auto-detected
|
| 159 |
+
D(x,y) = predicted depth at pixel (x,y) from Depth Stream
|
| 160 |
+
```
|
| 161 |
+
|
| 162 |
+
**Blur Kernel Generation:**
|
| 163 |
+
Instead of Gaussian blur (physically incorrect), we use a **disk kernel** with optional aperture shape:
|
| 164 |
+
|
| 165 |
+
```
|
| 166 |
+
K(u,v; r) = {
|
| 167 |
+
1/(ΟΒ·rΒ²) if uΒ² + vΒ² β€ rΒ² (circular aperture)
|
| 168 |
+
0 otherwise
|
| 169 |
+
}
|
| 170 |
+
|
| 171 |
+
Where r = CoC(x,y) Β· pixel_pitch_ratio
|
| 172 |
+
```
|
| 173 |
+
|
| 174 |
+
For n-blade aperture (hexagonal, octagonal):
|
| 175 |
+
```
|
| 176 |
+
K_n(u,v; r) = {
|
| 177 |
+
1/A_n if point(u,v) inside n-gon inscribed in circle(r)
|
| 178 |
+
0 otherwise
|
| 179 |
+
}
|
| 180 |
+
```
|
| 181 |
+
|
| 182 |
+
**Differentiable Scatter-Gather Rendering:**
|
| 183 |
+
|
| 184 |
+
We implement a differentiable approximation of the physically-based rendering using depthwise convolutions with spatially-varying kernels:
|
| 185 |
+
|
| 186 |
+
```
|
| 187 |
+
For each pixel (x,y):
|
| 188 |
+
r = CoC(x,y)
|
| 189 |
+
r_quantized = round(r / Ξr) Β· Ξr (quantize to Ξr=2px bins)
|
| 190 |
+
|
| 191 |
+
Group pixels by r_quantized β R groups
|
| 192 |
+
For each group g with radius r_g:
|
| 193 |
+
mask_g = (r_quantized == r_g)
|
| 194 |
+
blur_g = DiskConv2D(input Γ mask_g, kernel_size=2Β·r_g+1)
|
| 195 |
+
output += blur_g
|
| 196 |
+
```
|
| 197 |
+
|
| 198 |
+
This "bin-and-blur" approach is O(HΒ·WΒ·K_max) where K_max is the maximum kernel radius, typically 15-31 pixels. It's much faster than per-pixel variable convolution.
|
| 199 |
+
|
| 200 |
+
**Occlusion-Aware Layered Rendering (from Dr.Bokeh, adapted):**
|
| 201 |
+
|
| 202 |
+
```
|
| 203 |
+
# Sort pixels into depth layers
|
| 204 |
+
layers = partition_by_depth(D, num_layers=8)
|
| 205 |
+
|
| 206 |
+
# Render back-to-front (painter's algorithm)
|
| 207 |
+
output = zeros(H, W, 3)
|
| 208 |
+
for l in reversed(layers):
|
| 209 |
+
blurred_l = DiskConv2D(input Γ mask_l, r_l)
|
| 210 |
+
alpha_l = DiskConv2D(mask_l, r_l) # soft visibility
|
| 211 |
+
output = output Γ (1 - alpha_l) + blurred_l
|
| 212 |
+
```
|
| 213 |
+
|
| 214 |
+
### 3.4 Temporal State Propagation (TSP)
|
| 215 |
+
|
| 216 |
+
**Novel mechanism for video temporal coherence:**
|
| 217 |
+
|
| 218 |
+
Instead of computing optical flow or temporal attention, we **propagate the recurrent state matrix** S across frames:
|
| 219 |
+
|
| 220 |
+
```
|
| 221 |
+
S_0^{frame_t} = Ο Β· S_final^{frame_{t-1}} + (1 - Ο) Β· S_init
|
| 222 |
+
|
| 223 |
+
Where:
|
| 224 |
+
S_final^{frame_{t-1}} = final hidden state from processing frame t-1
|
| 225 |
+
S_init = learned initialization embedding
|
| 226 |
+
Ο = sigmoid(W_Ο Β· [avg_pool(x_t), avg_pool(x_{t-1})]) β (0,1)
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
**Why this works:** The recurrent state S encodes a compressed representation of the scene's spatial structure. Between consecutive frames, this structure changes slowly (smooth camera motion, gradual depth changes). By initializing frame t's state from frame t-1's final state, we get:
|
| 230 |
+
|
| 231 |
+
1. **Temporal consistency** β blur patterns evolve smoothly
|
| 232 |
+
2. **Faster convergence** β fewer recurrent steps needed per frame
|
| 233 |
+
3. **Zero overhead** β no optical flow, no frame buffers, no extra VRAM
|
| 234 |
+
|
| 235 |
+
The mixing coefficient Ο is **motion-adaptive**: large Ο for static scenes (reuse state), small Ο for fast motion (reset state).
|
| 236 |
+
|
| 237 |
+
### 3.5 Aperture-Conditioned Feature Modulation (ACFM)
|
| 238 |
+
|
| 239 |
+
**Novel conditioning mechanism** inspired by Bokehlicious's AAA but applied to recurrent states:
|
| 240 |
+
|
| 241 |
+
```
|
| 242 |
+
# Aperture embedding
|
| 243 |
+
ae = MLP(concat(f/f_max, N/N_max, Sβ/Sβ_max)) β β^C
|
| 244 |
+
|
| 245 |
+
# Modulate features via FiLM conditioning
|
| 246 |
+
x_modulated = ae_scale Β· x + ae_shift
|
| 247 |
+
|
| 248 |
+
Where: [ae_scale, ae_shift] = split(Linear(ae), 2)
|
| 249 |
+
```
|
| 250 |
+
|
| 251 |
+
This allows a single model to handle any aperture setting from f/1.4 to f/22, any focal length from 24mm to 200mm, without retraining.
|
| 252 |
+
|
| 253 |
+
---
|
| 254 |
+
|
| 255 |
+
## 4. Complete Architecture Specification
|
| 256 |
+
|
| 257 |
+
### 4.1 Model Variants
|
| 258 |
+
|
| 259 |
+
| Variant | Params | VRAM (1080p) | Speed (720p) | Target |
|
| 260 |
+
|---------|--------|-------------|-------------|--------|
|
| 261 |
+
| BokehFlow-Nano | 1.2M | 0.8 GB | 45 FPS | Mobile/edge |
|
| 262 |
+
| BokehFlow-Small | 4.8M | 1.8 GB | 23 FPS | Consumer GPU (2-4GB) |
|
| 263 |
+
| BokehFlow-Base | 12.3M | 3.2 GB | 12 FPS | Desktop GPU (6-8GB) |
|
| 264 |
+
|
| 265 |
+
### 4.2 BokehFlow-Small Architecture Detail
|
| 266 |
+
|
| 267 |
+
```
|
| 268 |
+
Layer Output Shape Params State Memory
|
| 269 |
+
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 270 |
+
Input (H, W, 3) - -
|
| 271 |
+
ConvStem (3β48, k=7, s=2) (H/2, W/2, 48) 7.2K -
|
| 272 |
+
DWSConv (48β96, k=3, s=2) (H/4, W/4, 96) 5.3K -
|
| 273 |
+
|
| 274 |
+
# Depth Stream (6 BiGDR blocks)
|
| 275 |
+
BiGDR Block 1 (C=96, H=4, d=24) (H/4, W/4, 96) 37K 9.2KB
|
| 276 |
+
BiGDR Block 2 " 37K 9.2KB
|
| 277 |
+
BiGDR Block 3 + Cross-Fusion " 41K 9.2KB
|
| 278 |
+
BiGDR Block 4 (C=96, H=4, d=24) " 37K 9.2KB
|
| 279 |
+
BiGDR Block 5 " 37K 9.2KB
|
| 280 |
+
BiGDR Block 6 + Cross-Fusion " 41K 9.2KB
|
| 281 |
+
|
| 282 |
+
# Bokeh Stream (6 BiGDR blocks)
|
| 283 |
+
BiGDR Block 1-6 (same as above) " 237K 55.2KB
|
| 284 |
+
+ ACFM conditioning at each block 12K -
|
| 285 |
+
|
| 286 |
+
# Depth Head (lightweight DPT)
|
| 287 |
+
Upsample 4Γ + Conv (96β1) (H, W, 1) 25K -
|
| 288 |
+
|
| 289 |
+
# PG-CoC Rendering Module
|
| 290 |
+
CoC Computation (H, W, 1) 0 -
|
| 291 |
+
Binned Disk Convolution (H, W, 3) 0 -
|
| 292 |
+
Occlusion-Aware Compositing (H, W, 3) 0 -
|
| 293 |
+
|
| 294 |
+
# Bokeh Head
|
| 295 |
+
Upsample 4Γ + Conv (96β3) (H, W, 3) 25K -
|
| 296 |
+
Residual Refinement (3 Conv) (H, W, 3) 8K -
|
| 297 |
+
βββββββββββββββββββββββββοΏ½οΏ½οΏ½βββββββββββββββββββββββββββββββββββββββββββββββ
|
| 298 |
+
TOTAL ~4.8M ~128KB state
|
| 299 |
+
```
|
| 300 |
+
|
| 301 |
+
### 4.3 BiGDR Block Internal Structure
|
| 302 |
+
|
| 303 |
+
```
|
| 304 |
+
Input x β β^{LΓC} (L = H'ΓW' tokens)
|
| 305 |
+
β
|
| 306 |
+
βββΊ LayerNorm
|
| 307 |
+
βββΊ Linear β [q, k, v, Ξ±_proj, Ξ²_proj] (C β 5Γd_kΓH)
|
| 308 |
+
βββΊ Reshape to H heads Γ d_k dims
|
| 309 |
+
βββΊ 4-Direction GatedDelta Scan
|
| 310 |
+
β ββ Raster scan β o^β
|
| 311 |
+
β ββ Rev. raster β o^β
|
| 312 |
+
β ββ Column scan β o^β
|
| 313 |
+
β ββ Rev. column β o^β
|
| 314 |
+
βββΊ Adaptive Direction Fusion β o
|
| 315 |
+
βββΊ Linear (HΓd_v β C)
|
| 316 |
+
βββΊ Residual + x
|
| 317 |
+
β
|
| 318 |
+
βββΊ LayerNorm
|
| 319 |
+
βββΊ DWConv3Γ3 (local spatial mixing)
|
| 320 |
+
βββΊ GELU
|
| 321 |
+
βββΊ Pointwise Conv (C β C)
|
| 322 |
+
βββΊ Residual + x
|
| 323 |
+
β
|
| 324 |
+
Output x β β^{LΓC}
|
| 325 |
+
```
|
| 326 |
+
|
| 327 |
+
---
|
| 328 |
+
|
| 329 |
+
## 5. Training Recipe
|
| 330 |
+
|
| 331 |
+
### 5.1 Datasets
|
| 332 |
+
|
| 333 |
+
**Primary:** RealBokeh (23K image pairs, real DSLR, variable f-stops)
|
| 334 |
+
**Depth supervision:** Depth Anything V2 pseudo-labels
|
| 335 |
+
**Video temporal:** DAVIS 2017 + custom video pairs with f-stop variation
|
| 336 |
+
**Augmentation:** Random crop, flip, color jitter, focal length simulation
|
| 337 |
+
|
| 338 |
+
### 5.2 Loss Functions
|
| 339 |
+
|
| 340 |
+
```
|
| 341 |
+
L_total = L_bokeh + Ξ»_d Β· L_depth + Ξ»_t Β· L_temporal + Ξ»_p Β· L_perceptual
|
| 342 |
+
|
| 343 |
+
Where:
|
| 344 |
+
L_bokeh = L1(Ε·, y_gt) + SSIM_loss(Ε·, y_gt)
|
| 345 |
+
L_depth = Scale-invariant log depth loss
|
| 346 |
+
L_temporal = ||Ε·_t - warp(Ε·_{t-1}, flow)|| (with stop-gradient on flow)
|
| 347 |
+
L_perceptual = VGG-19 feature matching loss
|
| 348 |
+
```
|
| 349 |
+
|
| 350 |
+
### 5.3 Hyperparameters
|
| 351 |
+
|
| 352 |
+
- Optimizer: AdamW, lr=3e-4, weight_decay=0.05
|
| 353 |
+
- Schedule: Cosine annealing with 5K warmup steps
|
| 354 |
+
- Batch size: 16 (256Γ256 crops) or 4 (512Γ512 crops)
|
| 355 |
+
- Training: 300K steps on RealBokeh
|
| 356 |
+
- Hardware: Single A100 (training) or RTX 3060 (inference)
|
| 357 |
+
|
| 358 |
+
---
|
| 359 |
+
|
| 360 |
+
## 6. Key Innovations Summary
|
| 361 |
+
|
| 362 |
+
| Innovation | What | Why Novel | Impact |
|
| 363 |
+
|-----------|------|-----------|--------|
|
| 364 |
+
| BiGDR | 4-direction GatedDeltaNet for 2D images | First application of gated delta rule to dense vision; adaptive direction weighting eliminates scan redundancy | O(L) time, O(dΒ²) space |
|
| 365 |
+
| DAHG | Depth-conditioned hierarchical gates | Gates adapt to blur level β no existing method conditions recurrence gates on the task's physics | Better long-range blur modeling |
|
| 366 |
+
| PG-CoC | Differentiable thin-lens render | First integration of physics-based CoC into a recurrent (not transformer) architecture | DSLR-realistic blur |
|
| 367 |
+
| TSP | Cross-frame state propagation | Eliminates optical flow for temporal coherence; unique to recurrent architectures (transformers can't do this) | Video consistency at zero cost |
|
| 368 |
+
| ACFM | Aperture-conditioned FiLM | Single model handles all aperture/focal-length combos | User-controllable DoF |
|
| 369 |
+
|
| 370 |
+
---
|
| 371 |
+
|
| 372 |
+
## 7. Comparison with Existing Methods
|
| 373 |
+
|
| 374 |
+
| Method | Type | VRAM (1080p) | Speed | Realism | Video |
|
| 375 |
+
|--------|------|-------------|-------|---------|-------|
|
| 376 |
+
| Phone blur (segmented) | Heuristic | <1GB | Real-time | Poor | Yes |
|
| 377 |
+
| Bokehlicious-M | CNN+Attn | ~2GB | ~15 FPS | Good | No* |
|
| 378 |
+
| Dr.Bokeh | Physics+CUDA | ~4GB | ~5 FPS | Excellent | No* |
|
| 379 |
+
| GenRefocus (FLUX) | Diffusion | ~15GB | ~0.1 FPS | Excellent | No |
|
| 380 |
+
| BokehDepth (FLUX) | Diffusion | ~20GB | ~0.05 FPS | Excellent | No |
|
| 381 |
+
| **BokehFlow-Small** | **Recurrent** | **~1.8GB** | **~23 FPS** | **Very Good** | **Yes** |
|
| 382 |
+
| **BokehFlow-Base** | **Recurrent** | **~3.2GB** | **~12 FPS** | **Excellent** | **Yes** |
|
| 383 |
+
|
| 384 |
+
*Can be applied per-frame but no temporal consistency mechanism
|
| 385 |
+
|
| 386 |
+
---
|
| 387 |
+
|
| 388 |
+
## 8. Theoretical Analysis
|
| 389 |
+
|
| 390 |
+
### 8.1 Expressivity of GatedDeltaNet for DoF
|
| 391 |
+
|
| 392 |
+
The GatedDeltaNet state update can be viewed as an online SGD step on the objective:
|
| 393 |
+
```
|
| 394 |
+
L(S) = ||SΒ·k - v||Β² with weight decay Ξ±
|
| 395 |
+
```
|
| 396 |
+
|
| 397 |
+
For bokeh rendering, this means the state S learns a mapping from **spatial location keys k** to **blur-modulated color values v**. The decay gate Ξ± controls how much "memory" of distant pixels persists β directly analogous to the CoC decay with distance.
|
| 398 |
+
|
| 399 |
+
**Theorem (informal):** A GatedDeltaNet with d_v = d_k = d and L layers can approximate any spatially-varying convolution with kernel size up to O(LΒ·d) with error Ξ΅ β 0 as d β β.
|
| 400 |
+
|
| 401 |
+
### 8.2 Why Temporal State Propagation Works
|
| 402 |
+
|
| 403 |
+
The state S at the end of frame t encodes:
|
| 404 |
+
```
|
| 405 |
+
S_final = Ξ£_{i=1}^{H'W'} (Ξ _{j>i} Ξ±_j(I - Ξ²_jΒ·k_jΒ·k_j^T)) Β· Ξ²_i Β· v_i Β· k_i^T
|
| 406 |
+
```
|
| 407 |
+
|
| 408 |
+
This is a **weighted superposition** of all pixel associations in the frame, decayed by their spatial distance. For frame t+1, most pixels have similar (k,v) pairs (scene didn't change much), so initializing from S_final^{t-1} gives a warm start that converges faster.
|
| 409 |
+
|
| 410 |
+
---
|
| 411 |
+
|
| 412 |
+
## References
|
| 413 |
+
|
| 414 |
+
[1] GatedDeltaNet (2412.06464) β Gated delta rule, NVlabs
|
| 415 |
+
[2] HGRN-2 (2404.07904) β Hierarchical gated recurrence
|
| 416 |
+
[3] Mamba-2 (2405.21060) β Structured state space duality
|
| 417 |
+
[4] RWKV-7 (2503.14456) β Generalized delta rule
|
| 418 |
+
[5] Griffin/Hawk (2402.19427) β RG-LRU
|
| 419 |
+
[6] Bokehlicious (2503.16067) β Aperture-aware attention
|
| 420 |
+
[7] Dr.Bokeh (2308.08843) β Differentiable occlusion-aware rendering
|
| 421 |
+
[8] GenRefocus (2512.16923) β FLUX-based refocusing
|
| 422 |
+
[9] BokehDepth (2512.12425) β Joint depth+bokeh
|
| 423 |
+
[10] Video Depth Anything (2501.12375) β Temporal video depth
|
| 424 |
+
[11] MambaIRv2 (2411.15269) β Attentive state-space restoration
|
| 425 |
+
[12] Hybrid Linear Attention Study (2507.06457) β Systematic analysis
|
| 426 |
+
[13] Flash-Linear-Attention (fla-org) β Triton kernels
|
| 427 |
+
[14] Vision-LSTM/ViL (2406.04303) β xLSTM for vision
|