Add training setup table (rank, targets, LR, batch, hardware)
README.md
CHANGED
@@ -57,9 +57,28 @@ Three pictures the model has never seen.
Across all three, a recurring pattern emerges: the visually prominent subject reads as nearer than its actual metric depth would predict (most clearly the cat's tie). When the depth signal is ambiguous or out-of-distribution, the model falls back on saliency-shaped outputs rather than predicting noise. This behavior is consistent with the paper's argument that the base model carries latent representations of image structure — subjects, prominence, attention — which a depth-only LoRA inherits but does not overwrite.

## Training

| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` (4.0 B params, text encoder Qwen3-4B, VAE AutoencoderKLFlux2) |
| Adapter | LoRA, rank 32, alpha 32 |
| Target modules | Transformer attention `to_k`, `to_q`, `to_v`, `to_out.0` (joint blocks) + `to_qkv_mlp_proj` and `attn.to_out` of all 24 single transformer blocks. Text encoder and VAE frozen. |
| Resolution | 768 × 768 |
| Batch size | 2 (no gradient accumulation) |
| Optimizer | AdamW, β=(0.9, 0.999), weight decay 1e-4 |
| Learning rate | 1e-4, cosine schedule, 150-step warmup |
| Steps | 4 000 (snapshot of an in-progress 5 000-step run) |
| Samples seen | ~8 000 |
| Mixed precision | bf16 |
| Training data | Hypersim train (10 582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1 500 frames, real Kinect indoor) = 12 082 frames |
| Depth encoding | Barron 2025 power transform (λ=−3, c=10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
| Hardware | Single NVIDIA RTX 6000 Ada Generation (46 GB VRAM) |
| Wall time | ~5 hours |
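
The adapter and optimizer rows map onto a fairly standard `peft` + `torch` setup. As a rough sketch (not the actual training script; the module-name patterns, the `add_adapter` call, and the scheduler helper are assumptions about how the table above translates into diffusers-style code):

```python
import torch
from peft import LoraConfig
from diffusers.optimization import get_cosine_schedule_with_warmup

# Adapter configuration from the table: rank 32, alpha 32, attention projections
# in the joint blocks plus the fused qkv/mlp and output projections in the
# single blocks. The exact matching strings depend on the FLUX.2-klein module
# names, so treat them as illustrative.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",   # joint-block attention
        "to_qkv_mlp_proj", "attn.to_out",     # single-block projections
    ],
)

def build_training_objects(transformer, total_steps=5_000):
    """Attach the LoRA adapter and build the optimizer/scheduler described in
    the table. Loading the FLUX.2-klein transformer itself is out of scope."""
    transformer.requires_grad_(False)        # base weights stay frozen
    transformer.add_adapter(lora_config)     # only the LoRA parameters train
    params = [p for p in transformer.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(
        params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4
    )
    lr_scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=150,
        num_training_steps=total_steps,      # this checkpoint is the 4 000-step snapshot
    )
    return optimizer, lr_scheduler
```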
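
The depth-encoding row turns metric depth into a colour image rather than a plain grayscale map: depth is capped at 15 m, squashed by a power transform, normalised, and then used to index a Hamiltonian path through the RGB cube. The exact Barron 2025 transform and the exact path are not reproduced here; the sketch below substitutes a Box-Cox-style power curve and a snake-order lattice path, and the `d_near` lower clip is an assumed value, so read it only as an illustration of the pipeline shape.

```python
import numpy as np

def rgb_snake_path(n: int = 8) -> np.ndarray:
    """Hamiltonian path over an n*n*n RGB lattice in snake (boustrophedon) order:
    every lattice colour is visited exactly once and consecutive points differ
    by a single lattice step. A stand-in for the path actually used."""
    pts = []
    for r in range(n):
        g_sweep = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        for gi, g in enumerate(g_sweep):
            b_sweep = range(n) if (r * n + gi) % 2 == 0 else range(n - 1, -1, -1)
            pts.extend((r, g, b) for b in b_sweep)
    return np.asarray(pts, dtype=np.float32) / (n - 1)   # path points in [0, 1]^3

def power_curve(d, lam=-3.0, c=10.0 / 3.0):
    """Box-Cox-style power curve with the table's lambda and c; the actual
    Barron 2025 transform may differ, so this form is an assumption."""
    return ((d / c) ** lam - 1.0) / lam

def encode_depth(depth_m: np.ndarray, path: np.ndarray,
                 d_near: float = 0.3, d_far: float = 15.0) -> np.ndarray:
    """Turn an HxW metric-depth map into an HxWx3 colour image: clip, compress
    with the power curve, normalise to [0, 1], then linearly interpolate along
    the colour path. d_far is the 15 m cap from the table; d_near is assumed."""
    d = np.clip(depth_m, d_near, d_far)
    t = power_curve(d)
    t = (t - power_curve(d_near)) / (power_curve(d_far) - power_curve(d_near))
    idx = t * (len(path) - 1)                    # fractional index into the path
    i0 = np.floor(idx).astype(int)
    i1 = np.minimum(i0 + 1, len(path) - 1)
    w = (idx - i0)[..., None]
    return (1.0 - w) * path[i0] + w * path[i1]   # RGB values in [0, 1]
```
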
## Status
This is an early checkpoint. Improved weights from broader training data (full 47 k NYU + outdoor sources) and longer schedules will replace it as they land.
## Usage