phanerozoic committed · verified
Commit 07261ff · 1 Parent(s): fa272ca

Tighter README

Files changed (1)
  README.md +13 -28
README.md CHANGED
@@ -14,48 +14,37 @@ pipeline_tag: depth-estimation

  # deep-plantain

- A LoRA adapter on FLUX.2 Klein (4B) for monocular depth estimation. Tests one claim from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) using parameter-efficient tuning.

  ## The paper's claim

- Vision Banana argues that image generation training plays the same foundational role for vision that next-token pretraining plays for language. The latent capability for visual understanding is already inside any sufficiently strong image generator; lightweight instruction-tuning aligns it to produce decodable RGB outputs (segmentation masks, depth maps, surface normals, etc.). The paper demonstrates this on Nano Banana Pro across five tasks (referring, semantic, and instance segmentation; metric depth; surface normals) and matches or beats domain specialists (SAM 3, Depth Anything 3, Lotus-2) without sacrificing the base model's generation quality. The thesis is paradigm-level: **image generation as a universal interface for vision**, analogous to text generation in language.

  ## What this LoRA tests

- One axis of the paper's claim:

  - One task of the five (monocular depth)
  - Open base (FLUX.2 Klein 4B)
  - LoRA, not full instruction-tuning of the original training mixture

- Question: does the thesis (depth understanding latent in image generators, surfaced by instruction-tuning) survive parameter-efficient adaptation on an open backbone?
-
  ## Method

- Both pieces are preserved exactly from the paper:
-
- 1. **Reframe depth as image-to-image generation.** Input RGB → output RGB depth visualization.
- 2. **Bijective RGB↔depth encoding.** Barron (2025) power transform compresses metric depth to a curve parameter `u ∈ [0, 1)`; piecewise-linear interpolation along a 7-segment Hamiltonian path through the corners of the RGB cube produces the visualization (black → blue → cyan → green → yellow → red → magenta → white). Decoded by projecting predicted RGB onto the nearest cube edge (see the sketch after this section).
-
- Training data: Hypersim (synthetic indoor) + NYU Depth V2 train split (real indoor). Maximum encoded depth 15 m by bijection cap.
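
As a rough illustration of the encoding half, here is a minimal sketch of the `u → RGB` mapping along that 7-segment path. The Barron power transform itself is not reproduced here; as a labelled stand-in, depth → `u` is approximated by piecewise-linear interpolation between the anchor depths quoted in the Usage prompt below, with an arbitrary 30 m placeholder for the "far approaching white" end.

```python
import numpy as np

# 7-segment Hamiltonian path through the RGB cube corners, as described above.
PATH = np.array([
    [0, 0, 0],  # black   (u = 0)
    [0, 0, 1],  # blue    (u = 1/7)
    [0, 1, 1],  # cyan    (u = 2/7)
    [0, 1, 0],  # green   (u = 3/7)
    [1, 1, 0],  # yellow  (u = 4/7)
    [1, 0, 0],  # red     (u = 5/7)
    [1, 0, 1],  # magenta (u = 6/7)
    [1, 1, 1],  # white   (u -> 1)
], dtype=np.float32)

# Anchor depths at the corners, taken from the README's prompt; 30 m is an
# arbitrary placeholder for "far". The real encoder uses the Barron (2025)
# power transform rather than this piecewise-linear stand-in.
DEPTH_ANCHORS = np.array([0.0, 0.8, 1.8, 3.2, 5.3, 8.7, 16.5, 30.0])

def depth_to_u(depth_m):
    """Approximate metric depth -> curve parameter u in [0, 1)."""
    return np.interp(depth_m, DEPTH_ANCHORS, np.linspace(0.0, 1.0, 8))

def u_to_rgb(u):
    """Piecewise-linear interpolation of u along the corner path."""
    x = np.clip(u, 0.0, 1.0 - 1e-6) * 7.0
    seg = x.astype(int)                  # which of the 7 segments
    frac = (x - seg)[..., None]          # position within that segment
    return (1.0 - frac) * PATH[seg] + frac * PATH[seg + 1]

depth = np.array([[0.5, 2.0], [5.0, 12.0]])   # toy 2x2 depth map in metres
rgb = u_to_rgb(depth_to_u(depth))             # (2, 2, 3) RGB values in [0, 1]
```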

  ## Demos

- Three pictures the model has never seen.
-
  ![cat](readme/cat.jpg)

- *Indoor portrait, close to training distribution. The cat is read as foreground (cyan, ~1–2 m), the wall as background (green, ~3 m), the blanket as nearer foreground (deep blue). Internal depth ordering and subject/background separation correct.*

  ![beach](readme/beach.png)

- *Outdoor scene, outside the indoor training distribution. Sky and ground are misencoded: the model has no learned representation for "sky" and pins it to ~5 m yellow rather than infinity. But the salient subjects survive: each distant figure, the kite, and the foreground bucket are individually segmented from the global gradient.*

  ![skier](readme/skier.jpg)

- *Outdoor mountain scene, also out-of-distribution. The subject is crisply isolated from snow, mountain, sky. Relative depth ordering of background layers is roughly correct (sky > mountain > snow > subject), compressed into the bijection's 15 m range.*
-
- Across all three, a recurring pattern: the visually prominent subject reads more prominently than its actual metric depth would predict (most clearly the cat's tie). When the depth signal is ambiguous or out-of-distribution, the model falls back on saliency-shaped outputs rather than predicting noise. The behavior is consistent with the paper's argument that the base model carries latent representations of image structure (subjects, prominence, attention) which a depth-only LoRA inherits but does not overwrite.

  ## Training

@@ -78,7 +67,7 @@ Across all three, a recurring pattern: the visually prominent subject reads more

  ## Status

- This is an early checkpoint. Improved weights from broader training data (full 47 k NYU + outdoor sources) and longer schedules will replace it as they land.

  ## Usage

@@ -95,21 +84,17 @@ pipe.load_lora_weights("phanerozoic/deep-plantain")
  prompt = (
      "Generate a metric depth visualization of this image. Color scheme: "
      "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
-     "~8.7 m red, ~16.5 m magenta, far approaching white. Smooth gradients "
-     "along this path; every pixel follows this depth-to-color scheme."
  )

  depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
  ```

- The decoder for predicted RGB → metric depth (nearest-segment projection + inverse Barron transform) is in `decode_rgb_to_depth.py`.
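
The script's entry point is not shown in this diff; assuming it exposes a function along the lines of the hypothetical `decode_rgb_to_depth` below, wiring it to the pipeline output might look roughly like this:

```python
import numpy as np

# Hypothetical import; the actual name and signature inside
# decode_rgb_to_depth.py may differ.
from decode_rgb_to_depth import decode_rgb_to_depth

rgb = np.asarray(depth_pil).astype(np.float32) / 255.0   # (H, W, 3) in [0, 1]
depth_m = decode_rgb_to_depth(rgb)                        # (H, W) depth in metres
```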
-
- ## License
-
- Apache 2.0; matches base FLUX.2 Klein 4B.
-
  ## References

  - Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
  - Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).
- - Black Forest Labs. *FLUX.2 Klein.* https://bfl.ai/models/flux-2-klein (2025).
 
 
 
 

  # deep-plantain

+ LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) on an open backbone.

  ## The paper's claim

+ Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing outputs as decodable RGB visualizations. It beats SAM 3 / Depth Anything 3 / Lotus-2 while preserving the base model's generation quality.

  ## What this LoRA tests

+ One axis of the claim:

  - One task of the five (monocular depth)
  - Open base (FLUX.2 Klein 4B)
  - LoRA, not full instruction-tuning of the original training mixture

  ## Method

+ Barron (2025) power transform (λ = −3, c = 10/3) maps metric depth to `u ∈ [0, 1)`; `u` is piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black → blue → cyan → green → yellow → red → magenta → white). The decoder projects predicted RGB onto the nearest cube edge and inverts the transform. Decoder: `decode_rgb_to_depth.py`.
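
A minimal sketch of this nearest-edge decode is below; it is not the actual `decode_rgb_to_depth.py`. The segment projection follows the description above, while the inverse power transform is again approximated with the anchor depths from the Usage prompt (and an arbitrary 30 m "far" placeholder), as in the encoding sketch earlier in this diff.

```python
import numpy as np

# Corner path and anchor depths as in the encoding sketch earlier in this diff.
PATH = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
                 [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1]], dtype=np.float32)
DEPTH_ANCHORS = np.array([0.0, 0.8, 1.8, 3.2, 5.3, 8.7, 16.5, 30.0])

def rgb_to_depth(rgb):
    """Project each pixel onto the nearest path segment, recover u, map u to depth."""
    p = rgb.reshape(-1, 3).astype(np.float32)                    # (N, 3), values in [0, 1]
    a, ab = PATH[:-1], PATH[1:] - PATH[:-1]                      # segment starts and directions, (7, 3)
    # Closest-point parameter t on every segment for every pixel.
    t = np.clip(((p[:, None, :] - a) * ab).sum(-1) / (ab * ab).sum(-1), 0.0, 1.0)  # (N, 7)
    proj = a + t[..., None] * ab                                 # (N, 7, 3) projected points
    seg = ((p[:, None, :] - proj) ** 2).sum(-1).argmin(axis=1)   # nearest segment per pixel
    u = (seg + t[np.arange(len(p)), seg]) / 7.0                  # curve parameter in [0, 1]
    # Stand-in for the inverse Barron power transform: interpolate the
    # prompt's anchor depths back to metres.
    return np.interp(u, np.linspace(0.0, 1.0, 8), DEPTH_ANCHORS).reshape(rgb.shape[:-1])
```

Called on a normalized (H, W, 3) array taken from the model's output image, it returns an (H, W) array of metric depth.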
 
 
 
 
 

  ## Demos

  ![cat](readme/cat.jpg)

+ *Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).*

  ![beach](readme/beach.png)

+ *Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.*

  ![skier](readme/skier.jpg)

+ *Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed to 15 m.*

  ## Training
 
  ## Status

+ Early checkpoint. Improved weights from broader training data and longer schedules coming.

  ## Usage

  prompt = (
      "Generate a metric depth visualization of this image. Color scheme: "
      "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
+     "~8.7 m red, ~16.5 m magenta, far approaching white."
  )

  depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
  ```

  ## References

  - Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
  - Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).
+
+ ## License
+
+ Apache 2.0.