phanerozoic committed · verified
Commit 07261ff · 1 Parent(s): fa272ca

Tighter README

Files changed (1)
  README.md +13 -28
README.md CHANGED
@@ -14,48 +14,37 @@ pipeline_tag: depth-estimation

  # deep-plantain

- A LoRA adapter on FLUX.2 Klein (4B) for monocular depth estimation. Tests one claim from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) using parameter-efficient tuning.

  ## The paper's claim

- Vision Banana argues that image generation training plays the same foundational role for vision that next-token pretraining plays for language. The latent capability for visual understanding is already inside any sufficiently strong image generator; lightweight instruction-tuning aligns it to produce decodable RGB outputs (segmentation masks, depth maps, surface normals, etc.). The paper demonstrates this on Nano Banana Pro across five tasks (referring, semantic, and instance segmentation; metric depth; surface normals) and matches or beats domain specialists (SAM 3, Depth Anything 3, Lotus-2) without sacrificing the base model's generation quality. The thesis is paradigm-level: **image generation as a universal interface for vision**, analogous to text generation in language.

  ## What this LoRA tests

- One axis of the paper's claim:

  - One task of the five (monocular depth)
  - Open base (FLUX.2 Klein 4B)
  - LoRA, not full instruction-tuning of the original training mixture

- Question: does the thesis (depth understanding latent in image generators, surfaced by instruction-tuning) survive parameter-efficient adaptation on an open backbone?
-
  ## Method

- Both pieces are preserved exactly from the paper:
-
- 1. **Reframe depth as image-to-image generation.** Input RGB → output RGB depth visualization.
- 2. **Bijective RGB↔depth encoding.** Barron (2025) power transform compresses metric depth to a curve parameter `u ∈ [0, 1)`; piecewise-linear interpolation along a 7-segment Hamiltonian path through the corners of the RGB cube produces the visualization (black → blue → cyan → green → yellow → red → magenta → white). Decoded by projecting predicted RGB onto the nearest cube edge (see the sketch after this section).
-
- Training data: Hypersim (synthetic indoor) + NYU Depth V2 train split (real indoor). Maximum encoded depth 15 m by bijection cap.
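
As a rough illustration of the encoding half, here is a minimal sketch of the `u → RGB` mapping along that 7-segment path. The Barron power transform itself is not reproduced here; as a labelled stand-in, depth → `u` is approximated by piecewise-linear interpolation between the anchor depths quoted in the Usage prompt below, with an arbitrary 30 m placeholder for the "far approaching white" end.

```python
import numpy as np

# 7-segment Hamiltonian path through the RGB cube corners, as described above.
PATH = np.array([
    [0, 0, 0],  # black   (u = 0)
    [0, 0, 1],  # blue    (u = 1/7)
    [0, 1, 1],  # cyan    (u = 2/7)
    [0, 1, 0],  # green   (u = 3/7)
    [1, 1, 0],  # yellow  (u = 4/7)
    [1, 0, 0],  # red     (u = 5/7)
    [1, 0, 1],  # magenta (u = 6/7)
    [1, 1, 1],  # white   (u -> 1)
], dtype=np.float32)

# Anchor depths at the corners, taken from the README's prompt; 30 m is an
# arbitrary placeholder for "far". The real encoder uses the Barron (2025)
# power transform rather than this piecewise-linear stand-in.
DEPTH_ANCHORS = np.array([0.0, 0.8, 1.8, 3.2, 5.3, 8.7, 16.5, 30.0])

def depth_to_u(depth_m):
    """Approximate metric depth -> curve parameter u in [0, 1)."""
    return np.interp(depth_m, DEPTH_ANCHORS, np.linspace(0.0, 1.0, 8))

def u_to_rgb(u):
    """Piecewise-linear interpolation of u along the corner path."""
    x = np.clip(u, 0.0, 1.0 - 1e-6) * 7.0
    seg = x.astype(int)                  # which of the 7 segments
    frac = (x - seg)[..., None]          # position within that segment
    return (1.0 - frac) * PATH[seg] + frac * PATH[seg + 1]

depth = np.array([[0.5, 2.0], [5.0, 12.0]])   # toy 2x2 depth map in metres
rgb = u_to_rgb(depth_to_u(depth))             # (2, 2, 3) RGB values in [0, 1]
```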

  ## Demos

- Three pictures the model has never seen.
-
  ![cat](readme/cat.jpg)

- *Indoor portrait, close to training distribution. The cat is read as foreground (cyan, ~1–2 m), the wall as background (green, ~3 m), the blanket as nearer foreground (deep blue). Internal depth ordering and subject/background separation correct.*

  ![beach](readme/beach.png)

- *Outdoor scene, outside the indoor training distribution. Sky and ground are misencoded: the model has no learned representation for "sky" and pins it to ~5 m yellow rather than infinity. But the salient subjects survive: each distant figure, the kite, and the foreground bucket are individually segmented from the global gradient.*

  ![skier](readme/skier.jpg)

- *Outdoor mountain scene, also out-of-distribution. The subject is crisply isolated from snow, mountain, sky. Relative depth ordering of background layers is roughly correct (sky > mountain > snow > subject), compressed into the bijection's 15 m range.*
-
- Across all three, a recurring pattern: the visually prominent subject reads more prominently than its actual metric depth would predict (most clearly the cat's tie). When the depth signal is ambiguous or out-of-distribution, the model falls back on saliency-shaped outputs rather than predicting noise. The behavior is consistent with the paper's argument that the base model carries latent representations of image structure (subjects, prominence, attention) which a depth-only LoRA inherits but does not overwrite.

  ## Training

@@ -78,7 +67,7 @@ Across all three, a recurring pattern: the visually prominent subject reads more

  ## Status

- This is an early checkpoint. Improved weights from broader training data (full 47 k NYU + outdoor sources) and longer schedules will replace it as they land.

  ## Usage

@@ -95,21 +84,17 @@ pipe.load_lora_weights("phanerozoic/deep-plantain")
  prompt = (
      "Generate a metric depth visualization of this image. Color scheme: "
      "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
-     "~8.7 m red, ~16.5 m magenta, far approaching white. Smooth gradients "
-     "along this path; every pixel follows this depth-to-color scheme."
  )

  depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
  ```

- The decoder for predicted RGB → metric depth (nearest-segment projection + inverse Barron transform) is in `decode_rgb_to_depth.py`.
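
The script's entry point is not shown in this diff; assuming it exposes a function along the lines of the hypothetical `decode_rgb_to_depth` below, wiring it to the pipeline output might look roughly like this:

```python
import numpy as np

# Hypothetical import; the actual name and signature inside
# decode_rgb_to_depth.py may differ.
from decode_rgb_to_depth import decode_rgb_to_depth

rgb = np.asarray(depth_pil).astype(np.float32) / 255.0   # (H, W, 3) in [0, 1]
depth_m = decode_rgb_to_depth(rgb)                        # (H, W) depth in metres
```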
-
- ## License
-
- Apache 2.0; matches base FLUX.2 Klein 4B.
-
  ## References

  - Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
  - Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).
- - Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).
 
 
 
 

  # deep-plantain

+ LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) on an open backbone.

  ## The paper's claim

+ Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing outputs as decodable RGB visualizations. It beats SAM 3 / Depth Anything 3 / Lotus-2 while preserving the base model's generation quality.

  ## What this LoRA tests

+ One axis of the claim:

  - One task of the five (monocular depth)
  - Open base (FLUX.2 Klein 4B)
  - LoRA, not full instruction-tuning of the original training mixture

  ## Method

+ Barron (2025) power transform (λ = −3, c = 10/3) maps metric depth to `u ∈ [0, 1)`; `u` is piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black → blue → cyan → green → yellow → red → magenta → white). The decoder projects predicted RGB onto the nearest cube edge and inverts the transform. Decoder: `decode_rgb_to_depth.py`.
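
A minimal sketch of this nearest-edge decode is below; it is not the actual `decode_rgb_to_depth.py`. The segment projection follows the description above, while the inverse power transform is again approximated with the anchor depths from the Usage prompt (and an arbitrary 30 m "far" placeholder), as in the encoding sketch earlier in this diff.

```python
import numpy as np

# Corner path and anchor depths as in the encoding sketch earlier in this diff.
PATH = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
                 [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1]], dtype=np.float32)
DEPTH_ANCHORS = np.array([0.0, 0.8, 1.8, 3.2, 5.3, 8.7, 16.5, 30.0])

def rgb_to_depth(rgb):
    """Project each pixel onto the nearest path segment, recover u, map u to depth."""
    p = rgb.reshape(-1, 3).astype(np.float32)                    # (N, 3), values in [0, 1]
    a, ab = PATH[:-1], PATH[1:] - PATH[:-1]                      # segment starts and directions, (7, 3)
    # Closest-point parameter t on every segment for every pixel.
    t = np.clip(((p[:, None, :] - a) * ab).sum(-1) / (ab * ab).sum(-1), 0.0, 1.0)  # (N, 7)
    proj = a + t[..., None] * ab                                 # (N, 7, 3) projected points
    seg = ((p[:, None, :] - proj) ** 2).sum(-1).argmin(axis=1)   # nearest segment per pixel
    u = (seg + t[np.arange(len(p)), seg]) / 7.0                  # curve parameter in [0, 1]
    # Stand-in for the inverse Barron power transform: interpolate the
    # prompt's anchor depths back to metres.
    return np.interp(u, np.linspace(0.0, 1.0, 8), DEPTH_ANCHORS).reshape(rgb.shape[:-1])
```

Called on a normalized (H, W, 3) array taken from the model's output image, it returns an (H, W) array of metric depth.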
 
 
 
 
 

  ## Demos

  ![cat](readme/cat.jpg)

+ *Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).*

  ![beach](readme/beach.png)

+ *Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.*

  ![skier](readme/skier.jpg)

+ *Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed to 15 m.*

  ## Training
 
  ## Status

+ Early checkpoint. Improved weights from broader training data and longer schedules coming.

  ## Usage

  prompt = (
      "Generate a metric depth visualization of this image. Color scheme: "
      "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
+     "~8.7 m red, ~16.5 m magenta, far approaching white."
  )

  depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
  ```

  ## References

  - Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
  - Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).
+
+ ## License
+
+ Apache 2.0.