phanerozoic committed on
Commit fa272ca · verified · 1 Parent(s): f97c3fc

Add training setup table (rank, targets, LR, batch, hardware)

Files changed (1)
  1. README.md +20 -1
README.md CHANGED
@@ -57,9 +57,28 @@ Three pictures the model has never seen.
 
 Across all three, a recurring pattern: the visually prominent subject reads more prominently than its actual metric depth would predict (most clearly the cat's tie). When the depth signal is ambiguous or out-of-distribution, the model falls back on saliency-shaped outputs rather than predicting noise. The behavior is consistent with the paper's argument that the base model carries latent representations of image structure — subjects, prominence, attention — which a depth-only LoRA inherits but does not overwrite.
 
+## Training
+
+| Setting | Value |
+|---|---|
+| Base | `black-forest-labs/FLUX.2-klein-base-4B` (4.0 B params, text encoder Qwen3-4B, VAE AutoencoderKLFlux2) |
+| Adapter | LoRA, rank 32, alpha 32 |
+| Target modules | Transformer attention `to_k`, `to_q`, `to_v`, `to_out.0` (joint blocks) + `to_qkv_mlp_proj` and `attn.to_out` of all 24 single transformer blocks. Text encoder and VAE frozen. |
+| Resolution | 768 × 768 |
+| Batch size | 2 (no gradient accumulation) |
+| Optimizer | AdamW, β = (0.9, 0.999), weight decay 1e-4 |
+| Learning rate | 1e-4, cosine schedule, 150-step warmup |
+| Steps | 4 000 (snapshot of an in-progress 5 000-step run) |
+| Samples seen | ~8 000 |
+| Mixed precision | bf16 |
+| Training data | Hypersim train (10 582 frames, photorealistic synthetic indoor) + NYU Depth V2 train subset (1 500 frames, real Kinect indoor) = 12 082 frames |
+| Depth encoding | Barron 2025 power transform (λ = −3, c = 10/3), capped at 15 m, then Hamiltonian-path interpolation across the RGB cube |
+| Hardware | Single NVIDIA RTX 6000 Ada Generation (48 GB VRAM) |
+| Wall time | ~5 hours |
+
 ## Status
 
-This is an early checkpoint. Improved weights from broader training data and longer schedules will replace it as they land.
+This is an early checkpoint. Improved weights from broader training data (full 47 k NYU + outdoor sources) and longer schedules will replace it as they land.
 
 ## Usage
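The "Learning rate" row of the new table (peak 1e-4, cosine decay, 150-step warmup over the 5 000-step run) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the training script behind this commit: the function name `lr_at` and the linear shape of the warmup ramp are assumptions, not something the diff specifies.

```python
import math

def lr_at(step, base_lr=1e-4, warmup=150, total=5000):
    """Cosine learning-rate schedule with linear warmup.

    Mirrors the "Learning rate" row of the table: ramps to the
    peak of 1e-4 over 150 steps, then follows a cosine decay to 0
    at the 5000-step horizon of the full run.
    """
    if step < warmup:
        # Linear ramp from 0 to base_lr over the warmup steps (assumed linear).
        return base_lr * step / warmup
    # Cosine decay from base_lr at the end of warmup down to 0 at `total`.
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(150))   # peak learning rate: 1e-4
print(lr_at(5000))  # end of the full schedule: 0
```

Note that the checkpoint in this commit stops at step 4 000, so it sits partway down the cosine arm rather than at the schedule's end.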