Add training setup table (rank, targets, LR, batch, hardware)
README.md
CHANGED
@@ -57,9 +57,28 @@ Three pictures the model has never seen.
Across all three, a recurring pattern emerges: the visually prominent subject reads as nearer than its actual metric depth would predict (most clearly the cat's tie). When the depth signal is ambiguous or out-of-distribution, the model falls back on saliency-shaped outputs rather than predicting noise. This behavior is consistent with the paper's argument that the base model carries latent representations of image structure — subjects, prominence, attention — which a depth-only LoRA inherits but does not overwrite.

## Training

| Setting | Value |
|---|---|
| Base | `black-forest-labs/FLUX.2-klein-base-4B` (4.0 B params, text encoder Qwen3-4B, VAE AutoencoderKLFlux2) |
| Adapter | LoRA, rank 32, alpha 32 |
| Target modules | Transformer attention `to_k`, `to_q`, `to_v`, `to_out.0` (joint blocks) + `to_qkv_mlp_proj` and `attn.to_out` of all 24 single transformer blocks. Text encoder and VAE frozen. |
| Resolution | 768 × 768 |
| Batch size | 2 (no gradient accumulation) |
| Optimizer | AdamW, β=(0.9, 0.999), weight decay 1e-4 |
| Learning rate | 1e-4, cosine schedule, 150-step warmup |
| Steps | 4 000 (snapshot of an in-progress 5 000-step run) |
| Samples seen | ~8 000 |
| Mixed precision | bf16 |
| Training data | Hypersim train (10 582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1 500 frames, real Kinect indoor) = 12 082 frames |
| Depth encoding | Barron 2025 power transform (λ=−3, c=10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
| Hardware | Single NVIDIA RTX 6000 Ada Generation (46 GB VRAM) |
| Wall time | ~5 hours |
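
The adapter and optimizer rows map onto a fairly standard `peft` + `torch` setup. As a rough sketch (not the actual training script; the module-name patterns, the `add_adapter` call, and the scheduler helper are assumptions about how the table above translates into diffusers-style code):

```python
import torch
from peft import LoraConfig
from diffusers.optimization import get_cosine_schedule_with_warmup

# Adapter configuration from the table: rank 32, alpha 32, attention projections
# in the joint blocks plus the fused qkv/mlp and output projections in the
# single blocks. The exact matching strings depend on the FLUX.2-klein module
# names, so treat them as illustrative.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=[
        "to_q", "to_k", "to_v", "to_out.0",   # joint-block attention
        "to_qkv_mlp_proj", "attn.to_out",     # single-block projections
    ],
)

def build_training_objects(transformer, total_steps=5_000):
    """Attach the LoRA adapter and build the optimizer/scheduler described in
    the table. Loading the FLUX.2-klein transformer itself is out of scope."""
    transformer.requires_grad_(False)        # base weights stay frozen
    transformer.add_adapter(lora_config)     # only the LoRA parameters train
    params = [p for p in transformer.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(
        params, lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-4
    )
    lr_scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=150,
        num_training_steps=total_steps,      # this checkpoint is the 4 000-step snapshot
    )
    return optimizer, lr_scheduler
```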
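
The depth-encoding row turns metric depth into a colour image rather than a plain grayscale map: depth is capped at 15 m, squashed by a power transform, normalised, and then used to index a Hamiltonian path through the RGB cube. The exact Barron 2025 transform and the exact path are not reproduced here; the sketch below substitutes a Box-Cox-style power curve and a snake-order lattice path, and the `d_near` lower clip is an assumed value, so read it only as an illustration of the pipeline shape.

```python
import numpy as np

def rgb_snake_path(n: int = 8) -> np.ndarray:
    """Hamiltonian path over an n*n*n RGB lattice in snake (boustrophedon) order:
    every lattice colour is visited exactly once and consecutive points differ
    by a single lattice step. A stand-in for the path actually used."""
    pts = []
    for r in range(n):
        g_sweep = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        for gi, g in enumerate(g_sweep):
            b_sweep = range(n) if (r * n + gi) % 2 == 0 else range(n - 1, -1, -1)
            pts.extend((r, g, b) for b in b_sweep)
    return np.asarray(pts, dtype=np.float32) / (n - 1)   # path points in [0, 1]^3

def power_curve(d, lam=-3.0, c=10.0 / 3.0):
    """Box-Cox-style power curve with the table's lambda and c; the actual
    Barron 2025 transform may differ, so this form is an assumption."""
    return ((d / c) ** lam - 1.0) / lam

def encode_depth(depth_m: np.ndarray, path: np.ndarray,
                 d_near: float = 0.3, d_far: float = 15.0) -> np.ndarray:
    """Turn an HxW metric-depth map into an HxWx3 colour image: clip, compress
    with the power curve, normalise to [0, 1], then linearly interpolate along
    the colour path. d_far is the 15 m cap from the table; d_near is assumed."""
    d = np.clip(depth_m, d_near, d_far)
    t = power_curve(d)
    t = (t - power_curve(d_near)) / (power_curve(d_far) - power_curve(d_near))
    idx = t * (len(path) - 1)                    # fractional index into the path
    i0 = np.floor(idx).astype(int)
    i1 = np.minimum(i0 + 1, len(path) - 1)
    w = (idx - i0)[..., None]
    return (1.0 - w) * path[i0] + w * path[i1]   # RGB values in [0, 1]
```
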
## Status
This is an early checkpoint. Improved weights from broader training data (full 47 k NYU + outdoor sources) and longer schedules will replace it as they land.
## Usage