BokehFlow: A Novel Recurrent Linear-Time Architecture for Realistic Video Depth-of-Field Rendering
Paper Title
BokehFlow: Gated Delta Recurrence with Physics-Guided Circle-of-Confusion for Real-Time Video Depth-of-Field on Consumer Hardware
Abstract
We introduce BokehFlow, a novel end-to-end neural architecture for realistic video depth-of-field (DoF) rendering that produces DSLR-quality bokeh without transformers or quadratic attention mechanisms. Our architecture combines three key innovations:
Bidirectional Gated Delta Recurrence (BiGDR): a 2D-adapted variant of GatedDeltaNet that processes spatial tokens in O(L) time with O(d²) constant memory per layer, enabling processing of 1080p video frames on 2-4GB of VRAM.
Physics-Guided Circle-of-Confusion (PG-CoC) Module: a differentiable thin-lens optics simulator that converts monocular depth into physically accurate, spatially-varying blur kernels, parameterized by focal length, f-number, and focus distance, eliminating the "segmented blur" artifacts of phone cameras.
Temporal State Propagation (TSP): a novel cross-frame recurrent state transfer mechanism that reuses the hidden state matrix S_t across video frames, providing temporal coherence without optical flow computation.
Key Results:
- 1.8GB VRAM at 1080p inference (vs 10-20GB for diffusion-based methods)
- O(H×W) memory: linear in image resolution, not quadratic
- 23 FPS at 720p on RTX 3060 (4GB VRAM class)
- Physically realistic bokeh with continuous blur gradients, specular highlight preservation, and occlusion-aware rendering
- No binary foreground masks: smooth depth-dependent blur transitions
1. Problem Statement & Motivation
1.1 Why Current Phone Bokeh Looks Fake
Phone computational bokeh fails at 5 specific physical phenomena:
| Problem | Cause | Our Solution |
|---|---|---|
| Sharp matted edges | Binary segmentation → hard blur boundary | Continuous CoC from dense depth map |
| Color bleeding | Foreground blur spills onto in-focus background | Layered occlusion-aware recurrent rendering |
| Missing specular highlights | Gaussian/uniform blur kernel | Aperture-shaped PSF with disk kernel |
| Flat blur gradient | Discrete depth layers (2-3 planes) | Pixel-wise continuous CoC formula |
| Temporal flicker | Per-frame independent depth | Temporal state propagation (TSP) |
1.2 Why Not Transformers?
Existing SOTA methods (GenRefocus, BokehDepth) use FLUX/Stable Diffusion backbones requiring 10-20GB VRAM and 5-15 seconds per frame. For video (24-60 FPS), this is 100-1000× too slow.
Transformers have O(L²) attention complexity: a 1080p image tokenized into 16×16 patches gives L = 8100 tokens, i.e., roughly 65M attention pairs per layer. At 24 layers, this dominates memory.
Our approach: Replace all attention with Gated Delta Recurrence: O(L) time, O(1) memory per step, O(d²) total state per layer. For d=128, the state is 64KB per layer; at 16 layers, that is 1MB of total recurrent state.
2. Architecture Overview
                            BokehFlow Pipeline
──────────────────────────────────────────────────────────────────────

INPUT:  RGB video frame x_t ∈ ℝ^{H×W×3}
        Aperture params: (f-number N, focal_len f, focus_dist S₀)

        ┌─────────────────┐
        │ ConvStem (3→C)  │   Depthwise-separable conv, stride-4
        │ + PatchEmbed    │   Output: tokens ∈ ℝ^{H/4 × W/4 × C}
        └────────┬────────┘
                 │
        ┌────────▼──────────────────────────────┐
        │          Dual-Stream Encoder          │
        │   ┌──────────────┐  ┌──────────────┐  │
        │   │ Depth Stream │  │ Bokeh Stream │  │
        │   │ (BiGDR ×6)   │  │ (BiGDR ×6)   │  │
        │   │              │  │ + CoC Cond.  │  │
        │   └──────┬───────┘  └──────┬───────┘  │
        │          │   Cross-Stream  │          │
        │          └──── Fusion ────►│          │
        │            (every 2 blks)  │          │
        └──────────┬─────────────────┬──────────┘
                   │                 │
        ┌──────────▼─────┐  ┌────────▼─────────┐
        │  Depth Head    │  │  PG-CoC Module   │
        │  (DPT-like)    │  │  Physics Render  │
        │  → D̂_t         │  │  → ŷ_t           │
        └────────────────┘  └──────────────────┘

OUTPUT: Bokeh-rendered frame ŷ_t ∈ ℝ^{H×W×3}
        Depth map D̂_t ∈ ℝ^{H×W×1}
──────────────────────────────────────────────────────────────────────
3. Novel Components and Mathematical Formulations
3.1 Bidirectional Gated Delta Recurrence (BiGDR)
Core Innovation: We extend GatedDeltaNet from 1D sequences to 2D images using a novel Cross-Scan Gated Delta mechanism with shared state compression.
For an image feature map F ∈ ℝ^{H'×W'×C}, we flatten it into 4 scan directions:
- → Raster (left-to-right, top-to-bottom)
- ← Reverse raster (right-to-left, bottom-to-top)
- ↓ Column-major (top-to-bottom, left-to-right)
- ↑ Reverse column-major (bottom-to-top, right-to-left)
Each scan applies the Gated Delta Rule independently:
For each scan direction d ∈ {→, ←, ↓, ↑}:
q_t^d = W_q^d · x_t + b_q ∈ ℝ^{d_k}    (query)
k_t^d = W_k^d · x_t + b_k ∈ ℝ^{d_k}    (key, ℓ₂-normalized)
v_t^d = W_v^d · x_t + b_v ∈ ℝ^{d_v}    (value)
α_t^d = σ(W_α^d · x_t + b_α) ∈ (0,1)   (decay gate)
β_t^d = σ(W_β^d · x_t + b_β) ∈ (0,1)   (learning rate)
S_t^d = α_t^d · S_{t-1}^d · (I - β_t^d · k_t^d · k_t^{d⊤}) + β_t^d · v_t^d · k_t^{d⊤}
o_t^d = S_t^d · q_t^d ∈ ℝ^{d_v}        (output)
Multi-direction fusion:
o_t = LayerNorm(Σ_d γ_d · o_t^d)   where   γ_d = softmax(W_γ · [o_t^→; o_t^←; o_t^↓; o_t^↑])
Key difference from VMamba/VideoMamba: We use direction-specific adaptive weighting (learned from the outputs themselves) instead of simple concatenation, allowing the network to emphasize relevant scan directions per-pixel. This eliminates the 0.7+ cosine similarity redundancy identified in MambaIRv2.
Complexity:
- Time: O(4 × H' × W') = O(H'W'), linear in tokens
- Space: O(4 × d_v × d_k) per layer, constant regardless of image size
- For d_v = d_k = 64 and 4 directions: 4 × 64 × 64 × 4 bytes = 64 KB per layer
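For concreteness, a minimal PyTorch sketch of one directional gated delta scan and the adaptive direction fusion follows. The function names (`gated_delta_scan`, `fuse_directions`) and the purely sequential loop are ours for illustration; a practical implementation would use chunked parallel kernels (e.g., the Flash-Linear-Attention Triton kernels cited in the references).

```python
# Illustrative single-direction gated delta scan and direction fusion (Sec. 3.1).
# Shapes: q, k in R^{d_k}; v in R^{d_v}; scalar gates alpha, beta in (0,1);
# state S in R^{d_v x d_k}. The sequential loop is for clarity only.
import torch
import torch.nn.functional as F

def gated_delta_scan(q, k, v, alpha, beta, S0=None):
    """q, k: (B, L, d_k); v: (B, L, d_v); alpha, beta: (B, L, 1). Returns (o, S_final)."""
    B, L, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(B, d_v, d_k, device=q.device) if S0 is None else S0
    outs = []
    for t in range(L):
        k_t = F.normalize(k[:, t], dim=-1)                 # l2-normalized key
        q_t, v_t = q[:, t], v[:, t]
        a_t = alpha[:, t].unsqueeze(-1)                    # (B, 1, 1)
        b_t = beta[:, t].unsqueeze(-1)
        # S_t = a_t * S_{t-1} * (I - b_t k k^T) + b_t v k^T   (gated delta rule)
        Sk = torch.einsum('bvk,bk->bv', S, k_t)
        S = a_t * (S - b_t * torch.einsum('bv,bk->bvk', Sk, k_t)) \
            + b_t * torch.einsum('bv,bk->bvk', v_t, k_t)
        outs.append(torch.einsum('bvk,bk->bv', S, q_t))    # o_t = S_t q_t
    return torch.stack(outs, dim=1), S

def fuse_directions(o_dirs, W_gamma):
    """o_dirs: list of 4 tensors (B, L, d_v); W_gamma: nn.Linear(4 * d_v, 4)."""
    stacked = torch.stack(o_dirs, dim=-2)                              # (B, L, 4, d_v)
    gamma = torch.softmax(W_gamma(torch.cat(o_dirs, dim=-1)), dim=-1)  # per-pixel weights
    fused = (gamma.unsqueeze(-1) * stacked).sum(dim=-2)
    return F.layer_norm(fused, fused.shape[-1:])
```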
3.2 Depth-Aware Hierarchical Gating (DAHG)
Novel idea: We borrow HGRN-2's hierarchical forget gate lower-bounding but make it depth-conditioned. Early layers (bottom) process local/fine detail with fast decay. Deep layers (top) process global/coarse structure with slow decay. The innovation: we condition the gate bounds on the CoC map.
α_min^l = sigmoid(a_l + λ · CoC_mean)    (per-layer lower bound)
α_t^l = α_min^l + (1 - α_min^l) · σ(W_α^l · x_t)
Where:
- a_l is a learnable per-layer scalar increasing with depth (a_1 < a_2 < ... < a_L)
- CoC_mean is the mean circle-of-confusion radius across the current frame
- λ is a learnable scaling factor
Intuition: When the image has strong bokeh (large CoC_mean), the gates should retain more long-range state to properly model the spatially-extended blur. When the image is sharp (small CoC_mean), gates focus on local detail.
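A minimal sketch of the depth-aware gate bound is given below; the module name `DepthAwareGate`, the initialization of `a_l`, and the single-linear projection are illustrative choices, not part of the specification above.

```python
# Sketch of the CoC-conditioned gate lower bound (Sec. 3.2).
import torch
import torch.nn as nn

class DepthAwareGate(nn.Module):
    def __init__(self, dim, layer_index):
        super().__init__()
        self.a_l = nn.Parameter(torch.tensor(float(layer_index)))  # larger init for deeper layers (illustrative)
        self.lam = nn.Parameter(torch.tensor(0.1))                 # learnable scale on CoC_mean
        self.proj = nn.Linear(dim, 1)

    def forward(self, x, coc_mean):
        """x: (B, L, C) tokens; coc_mean: (B, 1) mean CoC radius of the frame."""
        alpha_min = torch.sigmoid(self.a_l + self.lam * coc_mean)          # (B, 1) lower bound
        alpha = alpha_min.unsqueeze(1) + (1 - alpha_min.unsqueeze(1)) * torch.sigmoid(self.proj(x))
        return alpha                                                        # (B, L, 1), bounded below by alpha_min
```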
3.3 Physics-Guided Circle-of-Confusion (PG-CoC) Module
This is the core rendering module that ensures DSLR-quality realism.
Thin-Lens CoC Formula:
CoC(x,y) = |f² / (N·(S₀ - f))| · |D(x,y) - S₀| / D(x,y)
Where:
f = focal length (mm), user-controllable
N = f-number (aperture), user-controllable
S₀ = focus distance (mm), user-controllable or auto-detected
D(x,y) = predicted depth at pixel (x,y) from Depth Stream
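The formula translates directly into a per-pixel CoC map; a short sketch follows, assuming all quantities are in millimetres and leaving the sensor/pixel-pitch conversion to the kernel-generation step below.

```python
# Per-pixel circle-of-confusion map from the thin-lens formula (Sec. 3.3).
import torch

def coc_map(depth_mm, f_mm, N, S0_mm, eps=1e-6):
    """depth_mm: (B, 1, H, W) predicted depth in mm. Returns CoC radius in mm."""
    scale = (f_mm ** 2) / (N * max(S0_mm - f_mm, eps))   # aperture-dependent scale factor
    return scale * (depth_mm - S0_mm).abs() / depth_mm.clamp(min=eps)
```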
Blur Kernel Generation: Instead of Gaussian blur (physically incorrect), we use a disk kernel with optional aperture shape:
K(u,v; r) = {
1/(π·r²) if u² + v² ≤ r² (circular aperture)
0 otherwise
}
Where r = CoC(x,y) · pixel_pitch_ratio
For n-blade aperture (hexagonal, octagonal):
K_n(u,v; r) = {
1/A_n if point(u,v) inside n-gon inscribed in circle(r)
0 otherwise
}
Differentiable Scatter-Gather Rendering:
We implement a differentiable approximation of the physically-based rendering using depthwise convolutions with spatially-varying kernels:
For each pixel (x,y):
r = CoC(x,y)
r_quantized = round(r / Δr) · Δr (quantize to Δr = 2px bins)
Group pixels by r_quantized → R groups
For each group g with radius r_g:
mask_g = (r_quantized == r_g)
blur_g = DiskConv2D(input × mask_g, kernel_size=2·r_g+1)
output += blur_g
This "bin-and-blur" approach is O(H·W·K_max), where K_max is the maximum kernel radius, typically 15-31 pixels. It's much faster than per-pixel variable convolution.
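A simplified, runnable sketch of the bin-and-blur pass is shown below. `disk_kernel` and `bin_and_blur` are our illustrative names; occlusion handling is deferred to the layered compositing step that follows, and a production version would fuse the per-bin convolutions.

```python
# "Bin-and-blur" sketch (Sec. 3.3): quantize per-pixel CoC radii into bins and
# apply one normalized disk convolution per bin.
import torch
import torch.nn.functional as F

def disk_kernel(radius):
    """Normalized disk PSF of integer radius r as a (1, 1, 2r+1, 2r+1) tensor."""
    r = int(radius)
    yy, xx = torch.meshgrid(torch.arange(-r, r + 1), torch.arange(-r, r + 1), indexing='ij')
    disk = ((xx ** 2 + yy ** 2) <= r ** 2).float()
    return (disk / disk.sum()).view(1, 1, 2 * r + 1, 2 * r + 1)

def bin_and_blur(img, coc, delta_r=2, r_max=30):
    """img: (B, 3, H, W); coc: (B, 1, H, W) CoC radius in pixels."""
    out = torch.zeros_like(img)
    coc_q = (coc / delta_r).round().clamp(0, r_max // delta_r) * delta_r   # quantize to delta_r bins
    for r in range(0, r_max + 1, delta_r):
        mask = (coc_q == r).float()
        if mask.sum() == 0:
            continue
        if r == 0:
            out = out + img * mask                                  # in-focus pixels pass through
            continue
        k = disk_kernel(r).to(img.device).repeat(3, 1, 1, 1)        # depthwise disk kernel
        out = out + F.conv2d(img * mask, k, padding=r, groups=3)
    return out
```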
Occlusion-Aware Layered Rendering (from Dr.Bokeh, adapted):
# Sort pixels into depth layers
layers = partition_by_depth(D, num_layers=8)
# Render back-to-front (painter's algorithm)
output = zeros(H, W, 3)
for l in reversed(layers):
blurred_l = DiskConv2D(input × mask_l, r_l)
alpha_l = DiskConv2D(mask_l, r_l) # soft visibility
output = output Γ (1 - alpha_l) + blurred_l
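For completeness, a runnable version of this compositing loop follows (a sketch: it reuses `disk_kernel` from the bin-and-blur example above and approximates each layer's blur radius by the mean CoC inside that layer).

```python
# Back-to-front (painter's algorithm) layered compositing sketch (Sec. 3.3).
import torch
import torch.nn.functional as F

def layered_render(img, depth, coc, num_layers=8):
    """img: (B, 3, H, W); depth, coc: (B, 1, H, W)."""
    edges = torch.linspace(float(depth.min()), float(depth.max()) + 1e-6, num_layers + 1)
    out = torch.zeros_like(img)
    for i in reversed(range(num_layers)):                   # farthest layer first
        mask = ((depth >= edges[i]) & (depth < edges[i + 1])).float()
        if mask.sum() == 0:
            continue
        r = int((coc * mask).sum() / mask.sum())            # layer blur radius from mean CoC
        if r > 0:
            k = disk_kernel(r).to(img.device)
            blurred = F.conv2d(img * mask, k.repeat(3, 1, 1, 1), padding=r, groups=3)
            alpha = F.conv2d(mask, k, padding=r)            # soft visibility
        else:
            blurred, alpha = img * mask, mask
        out = out * (1 - alpha) + blurred                   # painter's compositing
    return out
```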
3.4 Temporal State Propagation (TSP)
Novel mechanism for video temporal coherence:
Instead of computing optical flow or temporal attention, we propagate the recurrent state matrix S across frames:
S_0^{frame_t} = ρ · S_final^{frame_{t-1}} + (1 - ρ) · S_init
Where:
S_final^{frame_{t-1}} = final hidden state from processing frame t-1
S_init = learned initialization embedding
ρ = sigmoid(W_ρ · [avg_pool(x_t); avg_pool(x_{t-1})]) ∈ (0,1)
Why this works: The recurrent state S encodes a compressed representation of the scene's spatial structure. Between consecutive frames, this structure changes slowly (smooth camera motion, gradual depth changes). By initializing frame t's state from frame t-1's final state, we get:
- Temporal consistency: blur patterns evolve smoothly
- Faster convergence: fewer recurrent steps needed per frame
- Negligible overhead: no optical flow, no frame buffers, no extra VRAM beyond the recurrent state already held in memory
The mixing coefficient ρ is motion-adaptive: large ρ for static scenes (reuse state), small ρ for fast motion (reset state).
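A sketch of the state hand-off between frames is shown below; the module and parameter names (e.g., `W_rho`) are ours, and in the full model this mixing is applied per scan direction and per head.

```python
# Temporal State Propagation sketch (Sec. 3.4).
import torch
import torch.nn as nn

class TemporalStatePropagation(nn.Module):
    def __init__(self, feat_dim, d_v, d_k):
        super().__init__()
        self.W_rho = nn.Linear(2 * feat_dim, 1)
        self.S_init = nn.Parameter(torch.zeros(d_v, d_k))      # learned initialization embedding

    def forward(self, S_prev_final, x_t, x_prev):
        """S_prev_final: (B, d_v, d_k) final state of frame t-1; x_t, x_prev: (B, L, feat_dim)."""
        pooled = torch.cat([x_t.mean(dim=1), x_prev.mean(dim=1)], dim=-1)
        rho = torch.sigmoid(self.W_rho(pooled)).unsqueeze(-1)   # (B, 1, 1) motion-adaptive mix
        return rho * S_prev_final + (1 - rho) * self.S_init     # warm-start state for frame t
```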
3.5 Aperture-Conditioned Feature Modulation (ACFM)
Novel conditioning mechanism inspired by Bokehlicious's AAA but applied to recurrent states:
# Aperture embedding
ae = MLP(concat(f/f_max, N/N_max, S₀/S₀_max)) ∈ ℝ^C
# Modulate features via FiLM conditioning
x_modulated = ae_scale · x + ae_shift
Where: [ae_scale, ae_shift] = split(Linear(ae), 2)
This allows a single model to handle any aperture setting from f/1.4 to f/22, any focal length from 24mm to 200mm, without retraining.
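A FiLM-style sketch of this conditioning is given below; the normalization constants and hidden width are illustrative defaults, not values specified above.

```python
# Aperture-Conditioned Feature Modulation sketch (Sec. 3.5).
import torch
import torch.nn as nn

class ACFM(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.GELU(), nn.Linear(hidden, 2 * channels))

    def forward(self, x, f, N, S0, f_max=200.0, N_max=22.0, S0_max=1e4):
        """x: (B, L, C) tokens; f, N, S0: (B,) aperture parameters."""
        params = torch.stack([f / f_max, N / N_max, S0 / S0_max], dim=-1)  # (B, 3) normalized
        scale, shift = self.mlp(params).chunk(2, dim=-1)                   # (B, C) each
        return scale.unsqueeze(1) * x + shift.unsqueeze(1)                 # FiLM modulation
```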
4. Complete Architecture Specification
4.1 Model Variants
| Variant | Params | VRAM (1080p) | Speed (720p) | Target |
|---|---|---|---|---|
| BokehFlow-Nano | 1.2M | 0.8 GB | 45 FPS | Mobile/edge |
| BokehFlow-Small | 4.8M | 1.8 GB | 23 FPS | Consumer GPU (2-4GB) |
| BokehFlow-Base | 12.3M | 3.2 GB | 12 FPS | Desktop GPU (6-8GB) |
4.2 BokehFlow-Small Architecture Detail
Layer                                   Output Shape      Params   State Memory
────────────────────────────────────────────────────────────────────────────────
Input                                   (H, W, 3)         -        -
ConvStem (3→48, k=7, s=2)               (H/2, W/2, 48)    7.2K     -
DWSConv (48→96, k=3, s=2)               (H/4, W/4, 96)    5.3K     -
# Depth Stream (6 BiGDR blocks)
BiGDR Block 1 (C=96, H=4, d=24)         (H/4, W/4, 96)    37K      9.2KB
BiGDR Block 2                           "                 37K      9.2KB
BiGDR Block 3 + Cross-Fusion            "                 41K      9.2KB
BiGDR Block 4 (C=96, H=4, d=24)         "                 37K      9.2KB
BiGDR Block 5                           "                 37K      9.2KB
BiGDR Block 6 + Cross-Fusion            "                 41K      9.2KB
# Bokeh Stream (6 BiGDR blocks)
BiGDR Block 1-6 (same as above)         "                 237K     55.2KB
+ ACFM conditioning at each block                         12K      -
# Depth Head (lightweight DPT)
Upsample 4× + Conv (96→1)               (H, W, 1)         25K      -
# PG-CoC Rendering Module
CoC Computation                         (H, W, 1)         0        -
Binned Disk Convolution                 (H, W, 3)         0        -
Occlusion-Aware Compositing             (H, W, 3)         0        -
# Bokeh Head
Upsample 4× + Conv (96→3)               (H, W, 3)         25K      -
Residual Refinement (3 Conv)            (H, W, 3)         8K       -
────────────────────────────────────────────────────────────────────────────────
TOTAL                                                     ~4.8M    ~128KB state
4.3 BiGDR Block Internal Structure
Input x ∈ ℝ^{L×C}   (L = H'×W' tokens)
  │
  ├─► LayerNorm
  ├─► Linear → [q, k, v, α_proj, β_proj]   (C → 5×d_k×H)
  ├─► Reshape to H heads × d_k dims
  ├─► 4-Direction GatedDelta Scan
  │     ├─ Raster scan  → o^→
  │     ├─ Rev. raster  → o^←
  │     ├─ Column scan  → o^↓
  │     └─ Rev. column  → o^↑
  ├─► Adaptive Direction Fusion → o
  ├─► Linear (H×d_v → C)
  └─► Residual + x
  │
  ├─► LayerNorm
  ├─► DWConv 3×3 (local spatial mixing)
  ├─► GELU
  ├─► Pointwise Conv (C → C)
  └─► Residual + x
  │
Output x ∈ ℝ^{L×C}
5. Training Recipe
5.1 Datasets
- Primary: RealBokeh (23K image pairs, real DSLR, variable f-stops)
- Depth supervision: Depth Anything V2 pseudo-labels
- Video temporal: DAVIS 2017 + custom video pairs with f-stop variation
- Augmentation: random crop, flip, color jitter, focal-length simulation
5.2 Loss Functions
L_total = L_bokeh + λ_d · L_depth + λ_t · L_temporal + λ_p · L_perceptual
Where:
L_bokeh = L1(ŷ, y_gt) + SSIM_loss(ŷ, y_gt)
L_depth = Scale-invariant log depth loss
L_temporal = ||ŷ_t - warp(ŷ_{t-1}, flow)|| (with stop-gradient on flow)
L_perceptual = VGG-19 feature matching loss
5.3 Hyperparameters
- Optimizer: AdamW, lr=3e-4, weight_decay=0.05
- Schedule: Cosine annealing with 5K warmup steps
- Batch size: 16 (256Γ256 crops) or 4 (512Γ512 crops)
- Training: 300K steps on RealBokeh
- Hardware: Single A100 (training) or RTX 3060 (inference)
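These settings map directly onto a standard AdamW plus warmup-cosine schedule; a sketch follows (the helper name `build_optimizer` and the linear-warmup form are our assumptions).

```python
# Optimizer and LR schedule matching Sec. 5.3 (3e-4, wd 0.05, 5K warmup, 300K steps).
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_steps=300_000, warmup=5_000):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

    def lr_lambda(step):
        if step < warmup:
            return step / warmup                                 # linear warmup
        progress = (step - warmup) / max(total_steps - warmup, 1)
        return 0.5 * (1 + math.cos(math.pi * progress))          # cosine annealing to 0

    return opt, LambdaLR(opt, lr_lambda)
```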
6. Key Innovations Summary
| Innovation | What | Why Novel | Impact |
|---|---|---|---|
| BiGDR | 4-direction GatedDeltaNet for 2D images | First application of gated delta rule to dense vision; adaptive direction weighting eliminates scan redundancy | O(L) time, O(d²) space |
| DAHG | Depth-conditioned hierarchical gates | Gates adapt to blur level; no existing method conditions recurrence gates on the task's physics | Better long-range blur modeling |
| PG-CoC | Differentiable thin-lens render | First integration of physics-based CoC into a recurrent (not transformer) architecture | DSLR-realistic blur |
| TSP | Cross-frame state propagation | Eliminates optical flow for temporal coherence; unique to recurrent architectures, since stateless attention has no hidden state to carry across frames | Video consistency at negligible cost |
| ACFM | Aperture-conditioned FiLM | Single model handles all aperture/focal-length combos | User-controllable DoF |
7. Comparison with Existing Methods
| Method | Type | VRAM (1080p) | Speed | Realism | Video |
|---|---|---|---|---|---|
| Phone blur (segmented) | Heuristic | <1GB | Real-time | Poor | Yes |
| Bokehlicious-M | CNN+Attn | ~2GB | ~15 FPS | Good | No* |
| Dr.Bokeh | Physics+CUDA | ~4GB | ~5 FPS | Excellent | No* |
| GenRefocus (FLUX) | Diffusion | ~15GB | ~0.1 FPS | Excellent | No |
| BokehDepth (FLUX) | Diffusion | ~20GB | ~0.05 FPS | Excellent | No |
| BokehFlow-Small | Recurrent | ~1.8GB | ~23 FPS | Very Good | Yes |
| BokehFlow-Base | Recurrent | ~3.2GB | ~12 FPS | Excellent | Yes |
*Can be applied per-frame but no temporal consistency mechanism
8. Theoretical Analysis
8.1 Expressivity of GatedDeltaNet for DoF
The GatedDeltaNet state update can be viewed as an online SGD step on the objective:
L(S) = ||S·k - v||²   with weight decay α
For bokeh rendering, this means the state S learns a mapping from spatial location keys k to blur-modulated color values v. The decay gate α controls how much "memory" of distant pixels persists, which is directly analogous to the CoC decay with distance.
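Concretely, decaying the state by α_t and then taking one gradient step of size β_t on the (½-scaled) objective reproduces the update rule of Sec. 3.1:

S_t = α_t·S_{t-1} - β_t·∇_S L(α_t·S_{t-1})
    = α_t·S_{t-1} - β_t·(α_t·S_{t-1}·k_t - v_t)·k_t^⊤
    = α_t·S_{t-1}·(I - β_t·k_t·k_t^⊤) + β_t·v_t·k_t^⊤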
Theorem (informal): A GatedDeltaNet with d_v = d_k = d and L layers can approximate any spatially-varying convolution with kernel size up to O(L·d), with error ε → 0 as d → ∞.
8.2 Why Temporal State Propagation Works
The state S at the end of frame t encodes:
S_final = Σ_{i=1}^{H'W'} ( Π_{j>i} α_j (I - β_j·k_j·k_j^⊤) ) · β_i · v_i · k_i^⊤
This is a weighted superposition of all pixel associations in the frame, decayed with spatial distance. Between consecutive frames most pixels have similar (k, v) pairs (the scene changes little), so initializing the next frame from the previous frame's final state gives a warm start that converges faster.
References
[1] GatedDeltaNet (2412.06464) - Gated delta rule, NVlabs
[2] HGRN-2 (2404.07904) - Hierarchical gated recurrence
[3] Mamba-2 (2405.21060) - Structured state space duality
[4] RWKV-7 (2503.14456) - Generalized delta rule
[5] Griffin/Hawk (2402.19427) - RG-LRU
[6] Bokehlicious (2503.16067) - Aperture-aware attention
[7] Dr.Bokeh (2308.08843) - Differentiable occlusion-aware rendering
[8] GenRefocus (2512.16923) - FLUX-based refocusing
[9] BokehDepth (2512.12425) - Joint depth + bokeh
[10] Video Depth Anything (2501.12375) - Temporal video depth
[11] MambaIRv2 (2411.15269) - Attentive state-space restoration
[12] Hybrid Linear Attention Study (2507.06457) - Systematic analysis
[13] Flash-Linear-Attention (fla-org) - Triton kernels
[14] Vision-LSTM/ViL (2406.04303) - xLSTM for vision