---
language: en
license: apache-2.0
base_model: black-forest-labs/FLUX.2-klein-base-4B
library_name: diffusers
tags:
- depth-estimation
- lora
- flux2
- vision-banana
- arxiv:2604.20329
pipeline_tag: depth-estimation
---

# deep-plantain

LoRA on FLUX.2 Klein 4B for monocular metric depth, reproducing the depth task from *Image Generators are Generalist Vision Learners* (Gabeur et al., 2026; [arXiv:2604.20329](https://arxiv.org/abs/2604.20329)) on an open backbone.

## The paper's claim

Vision Banana instruction-tunes Nano Banana Pro on five vision tasks (referring, semantic, and instance segmentation; metric depth; surface normals) by reframing outputs as decodable RGB visualizations. It beats SAM 3, Depth Anything 3, and Lotus-2 while preserving the base model's generation quality.

## What this LoRA tests

A deliberately narrowed version of the paper's setup:

- One task of the five (monocular metric depth)
- An open base model (FLUX.2 Klein 4B)
- LoRA fine-tuning rather than full instruction-tuning on the original training mixture

## Method

The Barron (2025) power transform (λ = −3, c = 10/3) maps metric depth to u ∈ [0, 1); u is then piecewise-linearly interpolated along a 7-segment Hamiltonian path through the RGB cube corners (black → blue → cyan → green → yellow → red → magenta → white). The decoder projects each predicted RGB value onto the nearest cube edge and inverts the transform; see `decode_rgb_to_depth.py`.
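The path interpolation can be sketched in a few lines. This is a minimal illustration, not the repository's actual encoder or decoder: it maps u ∈ [0, 1) to RGB along the corner path above and recovers u by projecting onto the nearest path segment, while the depth ↔ u power transform itself is left out (see Barron, 2025).

```python
import numpy as np

# Corners of the 7-segment Hamiltonian path from the Method section.
PATH = np.array([
    [0, 0, 0],  # black
    [0, 0, 1],  # blue
    [0, 1, 1],  # cyan
    [0, 1, 0],  # green
    [1, 1, 0],  # yellow
    [1, 0, 0],  # red
    [1, 0, 1],  # magenta
    [1, 1, 1],  # white
], dtype=np.float64)

def u_to_rgb(u):
    """Piecewise-linear interpolation of u in [0, 1) along the corner path."""
    t = np.clip(u, 0.0, 1.0 - 1e-9) * 7.0   # continuous position along 7 segments
    seg = np.floor(t).astype(int)            # segment index, 0..6
    frac = t - seg                           # position within that segment
    return (1.0 - frac)[..., None] * PATH[seg] + frac[..., None] * PATH[seg + 1]

def rgb_to_u(rgb):
    """Decode one RGB triple: project onto the nearest path segment, recover u."""
    best_u, best_d = 0.0, np.inf
    for s in range(7):
        a, b = PATH[s], PATH[s + 1]
        # Orthogonal projection onto segment s, clamped to the segment ends.
        frac = np.clip(np.dot(rgb - a, b - a) / np.dot(b - a, b - a), 0.0, 1.0)
        d = np.sum((rgb - (a + frac * (b - a))) ** 2)
        if d < best_d:
            best_d, best_u = d, (s + frac) / 7.0
    return best_u
```

A round trip such as `rgb_to_u(u_to_rgb(np.array(0.25)))` returns 0.25 up to floating-point error; off-path colors snap to the nearest segment, which is what makes the decode robust to generation noise.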


## Demos

*Indoor, in-distribution. Depth ordering is correct: cat ~1–2 m (cyan), wall ~3 m (green), blanket nearer (deep blue).*

*Outdoor, out-of-distribution. Sky pins to ~5 m yellow rather than infinity. Salient subjects (figures, kite, bucket) still segment cleanly from the gradient.*

*Outdoor, out-of-distribution. Subject isolated from snow/mountain/sky; background ordering roughly right, compressed toward the 15 m cap.*

## Training

| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` (4.0 B params, Qwen3-4B text encoder, AutoencoderKLFlux2 VAE) |
| Adapter | LoRA, rank 32, alpha 32 |
| Target modules | Transformer attention `to_k`, `to_q`, `to_v`, `to_out.0` (joint blocks) + `to_qkv_mlp_proj` and `attn.to_out` of all 24 single transformer blocks; text encoder and VAE frozen |
| Resolution | 768 × 768 |
| Batch size | 2 (no gradient accumulation) |
| Optimizer | AdamW, β = (0.9, 0.999), weight decay 1e-4 |
| Learning rate | 1e-4, cosine schedule, 150-step warmup |
| Steps | 4,000 (snapshot of an in-progress 5,000-step run) |
| Samples seen | ~8,000 |
| Mixed precision | bf16 |
| Training data | Hypersim train (10,582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1,500 frames, real Kinect indoor) = 12,082 frames |
| Depth encoding | Barron 2025 power transform (λ = −3, c = 10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
| Hardware | Single NVIDIA RTX 6000 Ada Generation (46 GB VRAM) |
| Wall time | ~5 hours |

## Status

A substantially better checkpoint is staged in `pending/`: rank 256 with a text-encoder LoRA, trained on 58k mixed Hypersim + NYU frames. On the NYU Eigen test split (490 frames evaluated) it reaches **0.596 m RMSE / 0.745 δ1 / 0.163 AbsRel**, roughly doubling δ1 and more than halving RMSE versus the rank-32 baseline (1.566 m / 0.370 / 0.461). On the 10 hardest NYU frames, where the rank-32 baseline scored 3–4 m RMSE, the new checkpoint reaches 0.436 m / 0.819 δ1. For reference, the Vision Banana paper reports 0.948 δ1 / 0.074 AbsRel (full test set, full instruction-tune of NBP). See `pending/README.md` for load instructions; the canonical root weights will be replaced once a later step beats this one.

## Usage

```python
import torch
from diffusers import Flux2KleinPipeline
from PIL import Image

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/deep-plantain")

src = Image.open("input.jpg").convert("RGB")  # source image to estimate depth for

prompt = (
    "Generate a metric depth visualization of this image. Color scheme: "
    "0 m black, ~0.8 m blue, ~1.8 m cyan, ~3.2 m green, ~5.3 m yellow, "
    "~8.7 m red, ~16.5 m magenta, far approaching white."
)

depth_pil = pipe(image=src, prompt=prompt, num_inference_steps=20).images[0]
# Convert the RGB visualization back to metric depth with decode_rgb_to_depth.py.
```
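If `decode_rgb_to_depth.py` is not at hand, the visualization can be converted back to approximate meters with a nearest-color lookup along the same corner path, using the corner depths from the prompt above. This is a hypothetical stand-in, not the repository's decoder: `decode_depth` is an illustrative name, the interpolation is linear in depth between corners (the real decoder inverts the power transform), and the 30 m value for white is an assumed far cap.

```python
import numpy as np

# RGB cube corners along the Hamiltonian path, and the approximate metric depth
# at each corner taken from the prompt; 30.0 m for white is an ASSUMED far cap.
CORNERS = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float64)
CORNER_DEPTHS = np.array([0.0, 0.8, 1.8, 3.2, 5.3, 8.7, 16.5, 30.0])

def decode_depth(rgb_image, samples_per_seg=64):
    """rgb_image: (H, W, 3) float array in [0, 1]. Returns (H, W) depth in meters."""
    # Densely sample the path, recording the color and depth at each sample.
    lut_rgb, lut_depth = [], []
    for s in range(7):
        t = np.linspace(0.0, 1.0, samples_per_seg, endpoint=False)
        lut_rgb.append((1 - t)[:, None] * CORNERS[s] + t[:, None] * CORNERS[s + 1])
        lut_depth.append((1 - t) * CORNER_DEPTHS[s] + t * CORNER_DEPTHS[s + 1])
    lut_rgb = np.concatenate(lut_rgb)        # (7 * samples_per_seg, 3)
    lut_depth = np.concatenate(lut_depth)
    # Nearest path color per pixel (for large images, process pixels in chunks).
    flat = rgb_image.reshape(-1, 3)
    d2 = ((flat[:, None, :] - lut_rgb[None, :, :]) ** 2).sum(-1)
    return lut_depth[d2.argmin(1)].reshape(rgb_image.shape[:2])
```

For example, a pure green pixel decodes to ~3.2 m and a pure cyan pixel to ~1.8 m, matching the prompt's color scheme.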

## References

- Gabeur, Long, Peng, et al. *Image Generators are Generalist Vision Learners.* [arXiv:2604.20329](https://arxiv.org/abs/2604.20329) (2026).
- Barron, J. T. *A Power Transform.* [arXiv:2502.10647](https://arxiv.org/abs/2502.10647) (2025).

## License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model, FLUX.2 Klein 4B.

### Training data attributions

- **Hypersim** (Roberts et al., 2021). Photorealistic synthetic indoor frames used as a portion of the training data. Licensed under the Apple ML Research License Agreement; non-commercial research use only. The 3D scenes underlying the rendered images were originally licensed from Evermotion. See https://github.com/apple/ml-hypersim for full terms.
- **NYU Depth V2** (Silberman et al., 2012). Real Kinect indoor frames used as a portion of the training data. Released for research use; see https://cs.nyu.edu/~fergus/datasets/nyu_depth_v2.html.

Use of the LoRA adapter for purposes that would conflict with the source datasets' licenses (e.g., redistributing reconstructed Hypersim imagery commercially) is not authorized by the data holders. The adapter itself does not embed the source images, but downstream use that effectively reconstructs them inherits those constraints.

### Base model

FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.