Tighter README
README.md

# deep-plantain

LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) on an open backbone.

## The paper's claim

Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing outputs as decodable RGB visualizations. It beats SAM 3 / Depth Anything 3 / Lotus-2 while preserving the base model's generation quality.

## What this LoRA tests

One axis of the claim:

- One task of the five (monocular depth)
- Open base (FLUX.2 Klein 4B)
- LoRA, not full instruction-tuning of the original training mixture

Question: does the thesis (depth understanding latent in image generators, surfaced by instruction-tuning) survive parameter-efficient adaptation on an open backbone?

## Method

1. **Reframe depth as image-to-image generation.** Input RGB → output RGB depth visualization.
2. **Bijective RGB↔depth encoding.** Barron (2025) power transform (λ = −3, c = 10/3) compresses metric depth to a curve parameter `u ∈ [0, 1)`; piecewise-linear interpolation along a 7-segment Hamiltonian path through the corners of the RGB cube produces the visualization (black → blue → cyan → green → yellow → red → magenta → white). Decoded by projecting predicted RGB onto the nearest cube edge.

Training data: Hypersim (synthetic indoor) + NYU Depth V2 train split (real indoor). Maximum encoded depth is capped at 15 m by the bijection.
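
The path interpolation above can be sketched in a few lines. `depth_to_u` below is a placeholder compressive map (a simple `d / (d + c)` with `c = 10/3`) standing in for the actual Barron (2025) power transform, whose exact form is not reproduced here; the corner ordering follows the Hamiltonian path described above.

```python
# Sketch of the depth -> RGB encoding. NOTE: depth_to_u is a placeholder
# compressive map, not the actual Barron (2025) power transform.

# Hamiltonian path through the RGB cube corners:
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white
CORNERS = [
    (0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
    (1, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1),
]

def depth_to_u(d: float, c: float = 10 / 3) -> float:
    """Placeholder: compress metric depth [0, inf) into u in [0, 1)."""
    return d / (d + c)

def u_to_rgb(u: float) -> tuple:
    """Piecewise-linear interpolation along the 7-segment path."""
    seg = min(int(u * 7), 6)  # which cube edge u falls on
    t = u * 7 - seg           # fractional position along that edge
    a, b = CORNERS[seg], CORNERS[seg + 1]
    return tuple((1 - t) * x + t * y for x, y in zip(a, b))

print(u_to_rgb(0.5))  # midway along the green -> yellow edge: (0.5, 1.0, 0.0)
```

Eight corners give seven unit-length edges, so `u` maps linearly onto position along the path, and each segment varies exactly one color channel.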
## Demos
Three pictures the model has never seen.

*Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).*

*Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.*

*Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed to 15 m.*
Across all three, a recurring pattern: the visually prominent subject reads more prominently than its actual metric depth would predict (most clearly the cat's tie). When the depth signal is ambiguous or out-of-distribution, the model falls back on saliency-shaped outputs rather than predicting noise. The behavior is consistent with the paper's argument that the base model carries latent representations of image structure (subjects, prominence, attention) which a depth-only LoRA inherits but does not overwrite.
## Training
## Status

Early checkpoint. Improved weights from broader training data and longer schedules are coming.

## Usage

```python
# Pipeline setup is elided in this excerpt; assume `pipe` is a diffusers
# image-to-image pipeline on the FLUX.2 Klein 4B base and `src` is the
# input PIL image.
pipe.load_lora_weights("phanerozoic/deep-plantain")

prompt = (
    "Generate a metric depth visualization of this image. Color scheme: "
    "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
    "~8.7 m red, ~16.5 m magenta, far approaching white."
)

depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
```
The decoder for predicted RGB → metric depth (nearest-segment projection + inverse Barron transform) is in `decode_rgb_to_depth.py`.
## License
Apache 2.0, matching the base FLUX.2 Klein 4B.
## References
- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).