| # Reproducing the Best Checkpoint (HSS=0.382) |
|
|
| ## Quick Start |
|
|
| The `checkpoint.pt` in this repo is the final model. To run inference: |
|
|
| ```bash |
| python script.py |
| ``` |
|
|
| To reproduce from scratch (~3hr on 1x RTX 4090): |
|
|
| ```bash |
| bash reproduce.sh |
| ``` |
|
|
| ## Exact Recipe |
|
|
| Architecture (unchanged across all 3 steps): |
| ``` |
| Perceiver: hidden=256, ff=1024, latent_tokens=256, latent_layers=7 |
| encoder_layers=4, decoder_layers=3, cross_attn_interval=4 |
| num_heads=4, kv_heads_cross=2, kv_heads_self=2 |
| qk_norm=True (L2), rms_norm=True, dropout=0.1 |
| segments=64, segment_param=midpoint_dir_len, segment_conf=True |
| behind_emb_dim=8, vote_features=True, activation=gelu |
| ``` |
|
|
| All shared config lives in `configs/base.json`. |
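
For illustration, a minimal sanity check that a run picked up the shared
config (a sketch: the key names mirror the block above, but the real schema of
`configs/base.json` may differ):

```python
import json

# Hypothetical check, not the repo's loader: all 3 training steps share one
# architecture, so catching config drift early is cheap insurance.
with open("configs/base.json") as f:
    cfg = json.load(f)

for key, expected in [("hidden", 256), ("latent_tokens", 256), ("latent_layers", 7)]:
    assert cfg.get(key) == expected, f"unexpected {key}: {cfg.get(key)}"
```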
|
|
| ## Evaluation sets |
|
|
Three distinct evaluation sets appear in this work. Every HSS / F1 / IoU
number below comes from one of these three, and we tag each number with its
source set wherever possible.
|
|
| - **Dev val** = the last 1024 scenes of the published training set |
| (`hf://usm3d/s23dr-2026-sampled_*_v2:train`). This is what we actually |
| optimized against during development, and it is the set behind every |
| "HSS=0.382"-style number in this document, in `submitted_2048/README.md`, |
| and in the run-history files under `repro_runs/` and the validation-archive. |
| - **Official validation** = `hf://usm3d/s23dr-2026-sampled_*_v2:validation` |
| (equivalently the `*public*` tars in `usm3d/hoho22k_2026_test_x_anon`). |
| We did *not* eval on this split during development. No HSS number in this |
| repo refers to it. |
| - **Public test** = the `*private*` tars in `usm3d/hoho22k_2026_test_x_anon`, |
| scored by the competition harness and posted to the leaderboard. We have |
| two such numbers, both clearly labeled "public test" wherever they appear: |
| **0.4273** (2048 submission, commit `f4487da`) and **0.4470** (4096 |
| submission, commit `4946666`). |
|
|
Because we never validated against the official validation split, there is
some risk that the dev-val numbers are mildly overfit to the last-1024-train
slice. The +0.06 dev-val-to-public-test gap (consistent across both
submissions, see `submitted_2048/README.md`) is reassuring in that public
test scores come out higher rather than lower, but it is not a substitute
for actually scoring on official val.
|
|
### Step 1: 2048 Phase 1 (from scratch) – ~1.5hr
|
|
| ``` |
| Data: hf://usm3d/s23dr-2026-sampled_2048_v2:train (16,508 samples) |
| Steps: 0 -> 125,000 (242 epochs) |
| LR: 3e-4, warmup=10,000 |
| Batch size: 32 |
| Optimizer: AdamW, betas=(0.9, 0.95), weight_decay=0.01 |
| Sinkhorn: eps=0.1, iters=20, dustbin=0.3 |
| Conf: weight=0.1, mode=sinkhorn, head_wd=0.1 |
| Endpoint: OFF |
| Aug: rotate=True, flip=True |
| Seed: 353 |
| ``` |
|
|
| Trains the perceiver from random init on 2048-point samples. The sinkhorn |
| optimal transport loss learns to match predicted segments to ground truth. |
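
For intuition, here is a minimal log-domain Sinkhorn sketch with a
SuperGlue-style dustbin. The function name, the cost-matrix convention, and
treating `dustbin=0.3` as a fixed unmatched cost are all assumptions; the
repo's actual loss code may differ.

```python
import torch

def sinkhorn_assignment(cost, eps=0.1, iters=20, dustbin=0.3):
    """Soft matching of P predicted to G ground-truth segments via entropic OT.

    cost: (P, G) pairwise segment costs. Returns a (P+1, G+1) soft assignment
    whose extra row/column is the dustbin for unmatched segments.
    """
    P, G = cost.shape
    z = torch.full((1, 1), dustbin)
    cost = torch.cat([cost, z.expand(P, 1)], dim=1)       # dustbin column
    cost = torch.cat([cost, z.expand(1, G + 1)], dim=0)   # dustbin row
    log_K = -cost / eps                                   # log Gibbs kernel

    # Marginals: each real segment carries mass 1; each dustbin absorbs slack.
    log_mu = torch.cat([torch.zeros(P), torch.tensor([float(G)]).log()])
    log_nu = torch.cat([torch.zeros(G), torch.tensor([float(P)]).log()])

    u, v = torch.zeros(P + 1), torch.zeros(G + 1)
    for _ in range(iters):                                # log-domain Sinkhorn
        u = log_mu - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_K + u[:, None], dim=0)
    return (log_K + u[:, None] + v[None, :]).exp()
```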
|
|
| **Why 2048 first:** Training directly on 4096 overfits (1.47x train/val ratio |
| vs 1.19x for 2048). The 2048 model learns better-generalized representations. |
|
|
| **Output:** dev val HSS ~0.28. |
|
|
### Step 2: 4096 finetune (constant LR) – ~15min
|
|
| ``` |
| Resume: Step 1 -> step125000.pt |
| Data: hf://usm3d/s23dr-2026-sampled_4096_v2:train (15,892 samples) |
| Steps: 125,001 -> 135,000 (10k steps) |
| LR: 3e-5 (constant, no cooldown) |
| Batch size: 64 |
| Endpoint: OFF |
| ``` |
|
|
Switches the input from 2048 to 4096 points, increasing structural coverage
from 66% to 74%. The gentle LR (3e-5) preserves learned representations while
adapting to the extra input; a higher LR (>1e-4) causes catastrophic forgetting.
|
|
Dev val HSS jumps from 0.28 to 0.35 in ~5k steps, then plateaus by 10k steps.
|
|
| **Output:** dev val HSS ~0.35. |
|
|
### Step 3: Cooldown with endpoint loss – ~1hr
|
|
| ``` |
| Resume: Step 2 -> step135000.pt |
| Data: hf://usm3d/s23dr-2026-sampled_4096_v2:train |
| Steps: 135,001 -> 170,000 (35k steps) |
| LR: 3e-5, cooldown_start=150,000, cooldown_steps=20,000 |
| (constant 3e-5 for 15k steps, then linear decay to ~0 over 20k) |
| Batch size: 64 |
| Endpoint: weight=0.1 |
| ``` |
|
|
| Adds symmetric endpoint L1 loss (using detached sinkhorn assignment) to |
| tighten vertex precision. The sinkhorn loss alone operates on segment |
| midpoint/direction/length and doesn't directly penalize endpoint position error. |
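
One plausible reading of that loss, as a sketch (shapes and function name
assumed, not the repo's code):

```python
import torch

def symmetric_endpoint_l1(pred_ep, gt_ep):
    """L1 between matched segment endpoints, symmetric over endpoint order.

    pred_ep, gt_ep: (N, 2, 3) endpoints for N segment pairs matched by the
    detached Sinkhorn assignment. A segment is orderless, so score both
    endpoint orderings and keep the cheaper one per segment.
    """
    direct = (pred_ep - gt_ep).abs().sum(dim=(1, 2))                  # a->a, b->b
    swapped = (pred_ep - gt_ep.flip(dims=[1])).abs().sum(dim=(1, 2))  # a->b, b->a
    return torch.minimum(direct, swapped).mean()
```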
|
|
| **Output:** dev val HSS=0.382, F1=0.414. Public test HSS=0.4470. |
|
|
| ### Key Numbers |
|
|
Per-stage **dev val** scores from the **original training run** (March 23-26),
which produced the shipped `checkpoint.pt`. Re-running `reproduce.sh` from
scratch does not hit these exact numbers - see "Reproduction Results" below for
the actual ranges. Compiled-mode end-to-end re-runs (scored after Step 3) land
in dev val 0.342-0.379, with the best run from this codebase at 0.376.
|
|
| | Stage | Steps | Dev val HSS | Dev val F1 | What changed | |
| |-------|-------|-------------|------------|--------------| |
| | After Step 1 | 125k | 0.281 | 0.156 | Learned geometry from 2048 pts | |
| After Step 2 | 135k | 0.351 | 0.190 | Coverage 66% -> 74% from 4096 pts |
| | After Step 3 | 170k | **0.382** | **0.411** | Vertex precision from endpoint loss | |
|
|
| ## Why This Works |
|
|
| 1. **2048 training has low overfitting** (1.19x train/val ratio) β the model |
| learns good representations without memorizing training samples. |
|
|
| 2. **4096 data has higher coverage ceiling** (74% vs 66% structural points) β |
| more of the building surface is observed, improving vertex recall. |
|
|
| 3. **Gentle finetuning preserves representations** β at lr=3e-5, the model |
| keeps its learned geometry understanding while adapting to the extra input. |
|
|
| 4. **Endpoint loss tightens vertices** β the symmetric endpoint distance |
| directly penalizes vertex position errors, which sinkhorn loss alone |
| doesn't do (it operates on midpoint/direction/length parametrization). |
|
|
| ## What Doesn't Work (yet) |
|
|
These are informal observations from one-off experiments during development.
The runs, args, and eval logs are mostly [here](https://github.com/JackLangerman/s23dr_2026_example),
but not all of them are preserved perfectly. The specific numbers below come
from contemporaneous notes and are not all trivially reproducible. Take them
as directional guidance, not as benchmarks.
|
|
| - **Training 4096 from scratch:** observed to overfit (~1.47x train/val loss |
| gap, vs ~1.19x for 2048) and peak around dev val HSS 0.346 in a single run. |
| - **BuildingWorld pretraining:** in one experiment, the representations were |
| near-orthogonal to S23DR (cosine sim ~0.05 between learned features) and |
| did not transfer. |
- **Mixed BW+S23DR training:** mixing BW data into the S23DR loader hurt
  dev val HSS in the runs we tried, presumably due to domain gap.
| - **High dropout / weight decay:** lowered the train/val gap but also lowered |
| dev val HSS in the configurations we tried. |
| - **High finetune LR (>1e-4):** dropped dev val HSS sharply in Step 2 in |
| single-run observations, consistent with disrupting the Step 1 representations. |
| - **Steeper cooldown (1e-5, 20x drop):** slightly worse than 3e-5 in the one |
| comparison we ran for this checkpoint. |
|
|
| ## Reproduction Results |
|
|
| ### End-to-end reproductions |
|
|
| All HSS / F1 / IoU below are on **dev val**. |
|
|
| | Model | Dev val HSS | Dev val F1 | Dev val IoU | Notes | |
| |-------|-------------|------------|-------------|-------| |
| | Original | 0.382 | 0.414 | 0.370 | Shipped checkpoint, original training run, not reproducible from this codebase | |
| | E2E repro #4 | 0.379 | 0.409 | 0.369 | Closest E2E, `repro_runs/e2e_repro4_hss379/` | |
| | Compiled repro (from submission codebase) | 0.376 | β | β | Best compiled repro from this codebase, `repro_runs/compiled_repro_hss376/` | |
| | E2E repro #3 | 0.375 | 0.404 | 0.367 | | |
| | Deterministic E2E | 0.372 | 0.398 | 0.368 | Bit-reproducible, `repro_runs/deterministic_hss372/` | |
| | E2E repro #5 | 0.349 | 0.373 | β | Compiled, low end of cluster | |
| | `reproduce.sh` smoketest | 0.342 | β | β | Single run of the published script end-to-end (validation-archive `runs/reproduce_smoketest/`) | |
|
|
| ### Partial reproductions (isolating pipeline stages) |
|
|
| | Test | Starting from | Dev val HSS | Gap to original | |
| |------|--------------|-------------|-----------------| |
| | Step 3 from orig Step 2 (run A) | Original step135000.pt | 0.382 | 0.000 | |
| | Step 3 from orig Step 2 (run B) | Original step135000.pt | 0.384 | +0.002 | |
| | Step 2+3 from orig Step 1 | Original step125000.pt | 0.377 | -0.005 | |
| | Step 1 from orig step 100k | Original step100000.pt | 0.285 (Step 1) | +0.004 vs 0.281 | |
|
|
Step 3 restarted from the same checkpoint reproduces to within 0.002 dev val
HSS. The full E2E dev val HSS variance (0.342-0.379; see "All benchmarks"
below) is dominated by torch.compile nondeterminism in Step 1.
|
|
| ### All benchmarks |
|
|
| The HSS / F1 / IoU columns below are all on **dev val**. Public-test scores |
| appear in the Notes column where available. |
|
|
| | Model | Input | Dev val HSS | Dev val F1 | Dev val IoU | Notes | |
| |-------|-------|-------------|------------|-------------|-------| |
| | Handcrafted baseline | raw views | 0.307 | 0.404 | 0.260 | | |
| | h256+qk+ep (submitted) | 2048 | 0.365 | 0.388 | 0.360 | Public test HSS=0.4273 (commit f4487da) | |
| | Original 3-step | 2048 | 0.373 | 0.404 | 0.363 | | |
| | Original 3-step | 4096 | 0.382 | 0.414 | 0.370 | Best ever, original training. **Public test HSS=0.4470** (commit 4946666) | |
| | Step3 repro from orig S2 | 4096 | 0.384 | 0.414 | β | Near-exact repro from a saved Step 2 ckpt | |
| | E2E repro #4 | 4096 | 0.379 | 0.409 | 0.369 | | |
| | Compiled repro (submission codebase) | 4096 | 0.376 | β | β | Best compiled from this exact codebase | |
| | E2E repro #3 | 4096 | 0.375 | 0.404 | 0.367 | | |
| | Deterministic E2E | 4096 | 0.372 | 0.398 | 0.368 | Bit-reproducible across runs (different trajectory than compiled) | |
| | E2E repro #5 | 4096 | 0.349 | 0.373 | β | Compiled, low end of cluster | |
| | `reproduce.sh` smoketest | 4096 | 0.342 | β | β | Single E2E run of the published `reproduce.sh` | |
|
|
The shipped 0.382 is not exactly reproducible from this codebase (the
original training run is gone). The best compiled repro is **0.376**, the mode
of compiled repros is around 0.375-0.379, and the lower tail extends to 0.342.
|
|
| ## Code Equivalence Verification |
|
|
| | Test | Result | |
| |------|--------| |
| | Forward pass (same checkpoint, same input) | Bit-identical (0.00 diff) | |
| | Loss computation | Bit-identical (0.00 diff) | |
| | Gradient computation | 5e-8 max diff | |
| | Training from same seed | Bit-identical steps 1-44 | |
| | Step 3 from same checkpoint (2 runs) | Dev val HSS=0.382, 0.384 | |
| | Deterministic mode (2 runs) | Bit-identical (0.00 diff) | |
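
The forward-pass row amounts to a check like this (placeholder names; not the
verification script itself):

```python
import torch

@torch.no_grad()
def max_forward_diff(model_a, model_b, batch):
    """Run one fixed batch through both code paths with the same checkpoint
    loaded; 0.0 means the forward passes are bit-identical."""
    model_a.eval()
    model_b.eval()
    return (model_a(batch) - model_b(batch)).abs().max().item()
```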
|
|
| ## Reproducibility Notes |
|
|
| All HSS numbers in this section are on **dev val**. |
|
|
**Default mode** (`reproduce.sh`): uses torch.compile (~3x faster). Each run
gets different Triton kernels, causing ~1e-8 floating-point divergence at a
random step (31-45); this divergence then grows through chaotic SGD dynamics.
Documented compiled E2E runs from this codebase have produced dev val HSS in
**0.342-0.379**; the modal cluster is 0.375-0.379, with low-side excursions
at 0.349 and 0.342 (no high-side outliers, so the distribution is one-sided
rather than symmetric around a mean). The best compiled repro is 0.376, vs the
shipped 0.382 from the original (lost) training run.
|
|
| **Deterministic mode** (`--deterministic` flag): Disables torch.compile and |
| forces CUDA deterministic ops. Bit-identical across runs with the same seed |
| (verified across 3 independent runs). Dev val HSS=0.372. Note: deterministic |
| mode **diverges from compiled mode at step 1** because eager and compiled |
| forward passes use different floating-point reduction orders - it is a |
| different numerical trajectory entirely, not a reproduction of any compiled run. |
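
For reference, roughly what a `--deterministic` flag does in PyTorch projects
(a sketch; the repo's exact flag handling is not reproduced here):

```python
import os
import random

import numpy as np
import torch

def enable_determinism(seed=353):
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some CUDA ops
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                   # seeds CPU and all CUDA devices
    torch.backends.cudnn.benchmark = False    # no autotuned, run-varying kernels
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)  # error on nondeterministic ops
    # torch.compile stays off in this mode; the bit-reproducible trajectory
    # here is eager execution.
```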
|
|
**bad_samples.txt**: The shipped file has 156 entries to match the original training run.
| (Note: `wc -l` reports 155 because the last line lacks a trailing newline.) |
| Two additional bad samples (`47b0e0ce19b`, `4b2d56eb3ef`) were discovered after |
| the original training run. They are legitimately bad (misaligned GT) but were |
| included in the original training data. Adding them changes the batch iteration |
| order and costs ~0.005 dev val HSS in deterministic mode (0.372 -> 0.367) and |
| ~0.04 in compiled mode (0.376 -> 0.335 in our `validate_155_compiled` vs |
| `validate_158_compiled` runs) due to compounded torch.compile variance. |
| Participants training from scratch may wish to add these 2 entries for cleaner |
| training data, but should expect slightly different scores due to the changed |
| iteration order. |
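
The entry-count nuance is easy to get right in code (a sketch; the loader and
path here are assumptions, not the repo's API):

```python
def load_bad_ids(path="bad_samples.txt"):
    """Splitting on whitespace yields all 156 entries even though the last
    line has no trailing newline (which is why `wc -l` reports 155)."""
    with open(path) as f:
        return set(f.read().split())

assert len(load_bad_ids()) == 156
```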
| |
| The shipped `checkpoint.pt` is from the original training run |
| (dev val HSS=0.382, public test HSS=0.4470). |
| |