---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- depth-estimation
- lora
- flux2
- vision-banana
- arxiv:2604.20329
pipeline_tag: depth-estimation
---
# deep-plantain
A LoRA adapter on FLUX.2 Klein 4B for monocular metric depth estimation, reproducing the depth task from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) on an open backbone.
## The paper's claim
Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing each task's output as a decodable RGB visualization. The paper reports that it beats SAM 3, Depth Anything 3, and Lotus-2 while preserving the base model's generation quality.
## What this LoRA tests
One axis of the claim:
- One task of the five (monocular depth)
- Open base (FLUX.2 Klein 4B)
- LoRA fine-tuning, rather than full instruction-tuning on the original training mixture
## Method
Barron (2025) power transform (λ=−3, c=10/3) maps metric depth to `u ∈ [0, 1)`; `u` is piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black → blue → cyan → green → yellow → red → magenta → white). The decoder projects predicted RGB onto the nearest cube edge and inverts the transform. Decoder: `decode_rgb_to_depth.py`.
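For concreteness, here is a minimal NumPy sketch of the colorization half of this encoding: it interpolates `u` along the 7-segment corner path and projects a predicted RGB value back onto the nearest segment. The helper names (`u_to_rgb`, `rgb_to_u`, `PATH`) are illustrative rather than the repo's API, and the metric-depth ↔ `u` power transform is deliberately left out; `decode_rgb_to_depth.py` in this repository implements the full decoder.

```python
import numpy as np

# Corners of the 7-segment Hamiltonian path through the RGB cube, in order:
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white.
PATH = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float32)

def u_to_rgb(u):
    """Piecewise-linear interpolation of u in [0, 1) along the corner path."""
    u = np.clip(np.asarray(u, dtype=np.float32), 0.0, 1.0 - 1e-6)
    seg = u * 7.0                      # 7 segments between 8 corners
    i = np.floor(seg).astype(int)      # which segment each u falls in
    t = (seg - i)[..., None]           # position within that segment
    return (1.0 - t) * PATH[i] + t * PATH[i + 1]

def rgb_to_u(rgb):
    """Project predicted RGB onto the nearest path segment and invert to u."""
    rgb = np.asarray(rgb, dtype=np.float32).reshape(-1, 3)
    a, b = PATH[:-1], PATH[1:]                       # segment endpoints (7, 3)
    d = b - a                                        # segment directions
    # Orthogonal projection of every pixel onto every segment, clamped to [0, 1].
    t = ((rgb[:, None, :] - a) * d).sum(-1) / (d * d).sum(-1)
    t = np.clip(t, 0.0, 1.0)
    proj = a + t[..., None] * d
    err = ((rgb[:, None, :] - proj) ** 2).sum(-1)    # squared distance per segment
    best = err.argmin(-1)
    return (best + t[np.arange(len(rgb)), best]) / 7.0
```

Because consecutive corners on the path differ in exactly one channel, each segment is an axis-aligned cube edge, so the nearest-edge projection reduces to clamping the per-segment parameter.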
## Demos

*Indoor, in-distribution. Depth ordering correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).*

*Outdoor OOD. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.*

*Outdoor OOD. Subject isolated from snow/mountain/sky; background ordering roughly right, but compressed toward the 15 m training cap.*
## Training
| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` (4.0 B params, text encoder Qwen3-4B, VAE AutoencoderKLFlux2) |
| Adapter | LoRA, rank 32, alpha 32 |
| Target modules | Transformer attention `to_k`, `to_q`, `to_v`, `to_out.0` (joint blocks) + `to_qkv_mlp_proj` and `attn.to_out` of all 24 single transformer blocks. Text encoder and VAE frozen. See the config sketch after this table. |
| Resolution | 768 × 768 |
| Batch size | 2 (no grad accumulation) |
| Optimizer | AdamW, β=(0.9, 0.999), weight decay 1e-4 |
| Learning rate | 1e-4, cosine schedule, 150-step warmup |
| Steps | 4 000 (snapshot of an in-progress 5 000-step run) |
| Samples seen | ~8 000 |
| Mixed precision | bf16 |
| Training data | Hypersim train (10 582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1 500 frames, real Kinect indoor) = 12 082 frames |
| Depth encoding | Barron 2025 power transform (λ=−3, c=10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
| Hardware | Single NVIDIA RTX 6000 Ada Generation (46 GB VRAM) |
| Wall time | ~5 hours |
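The adapter rows above correspond roughly to the following `peft` configuration. This is a reconstruction for orientation only: the module-name strings are copied from the table, and the checkpoint's `adapter_config.json` remains the authoritative source.

```python
from peft import LoraConfig

# Approximate reconstruction of the adapter described in the table above.
# Verify against adapter_config.json in this repository before relying on it.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",   # joint-block attention projections
        "to_qkv_mlp_proj", "attn.to_out",     # single-block fused projections
    ],
)

# pipe.transformer.add_adapter(lora_config)  # text encoder and VAE stay frozen
```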
## Status
A substantially better checkpoint is staged in `pending/` — rank-256 + text-encoder LoRA trained on 58 k mixed Hypersim + NYU frames. NYU Eigen test (490 frames evaluated): **0.596 m RMSE / 0.745 δ1 / 0.163 AbsRel**, roughly doubling δ1 and more than halving RMSE versus the rank-32 baseline (1.566 m / 0.370 / 0.461). On the 10 hardest NYU frames where the rank-32 baseline scored 3–4 m, the new checkpoint gets 0.436 m / 0.819 δ1. Vision Banana paper reference (full set, full instruction-tune of NBP): 0.948 δ1 / 0.074 AbsRel. See `pending/README.md` for load instructions; canonical root weights will be replaced once a later step beats this one.
## Usage
```python
from diffusers import Flux2KleinPipeline
from diffusers.utils import load_image
import torch

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/deep-plantain")

src = load_image("path/to/your_image.png")  # RGB input image

prompt = (
    "Generate a metric depth visualization of this image. Color scheme: "
    "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
    "~8.7 m red, ~16.5 m magenta, far approaching white."
)

depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
```
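Recovering metric depth from `depth_pil` is the job of `decode_rgb_to_depth.py` in this repository. As a rough illustration of the direction only, the `rgb_to_u` sketch from the Method section maps the predicted colors back to the normalized code `u`; inverting the Barron (2025) power transform (not shown here) then yields metres.

```python
import numpy as np

rgb = np.asarray(depth_pil, dtype=np.float32) / 255.0   # H x W x 3 in [0, 1]
u = rgb_to_u(rgb).reshape(rgb.shape[:2])                 # normalized depth code
# decode_rgb_to_depth.py additionally inverts the Barron (2025) power
# transform to turn u into metric depth; that step is omitted here.
```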
## References
- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).
## License
The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.
### Training data attributions
- **Hypersim** (Roberts et al., 2021). Photorealistic synthetic indoor frames used as a portion of the training data. Licensed under the Apple ML Research License Agreement; non-commercial research use only. The 3D scenes underlying the rendered images were originally licensed from Evermotion. See https://github.com/apple/ml-hypersim for full terms.
- **NYU Depth V2** (Silberman et al., 2012). Real Kinect indoor frames used as a portion of the training data. Released for research use; see https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html.
Use of the LoRA adapter for purposes that would conflict with the source datasets' licenses (e.g., redistributing reconstructed Hypersim imagery commercially) is not authorized by the data holders. The adapter itself does not embed the source images, but downstream use that effectively reconstructs them inherits those constraints.
### Base model
Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.