# Reproducing the Best Checkpoint (HSS=0.382)

## Quick Start

The `checkpoint.pt` in this repo is the final model. To run inference:

```bash
python script.py
```

To reproduce from scratch (~3hr on 1x RTX 4090):

```bash
bash reproduce.sh
```

## Exact Recipe

Architecture (unchanged across all 3 steps):
```
Perceiver: hidden=256, ff=1024, latent_tokens=256, latent_layers=7
  encoder_layers=4, decoder_layers=3, cross_attn_interval=4
  num_heads=4, kv_heads_cross=2, kv_heads_self=2
  qk_norm=True (L2), rms_norm=True, dropout=0.1
  segments=64, segment_param=midpoint_dir_len, segment_conf=True
  behind_emb_dim=8, vote_features=True, activation=gelu
```

All shared config lives in `configs/base.json`.
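The `midpoint_dir_len` parametrization above predicts each segment as a midpoint, a direction, and a length. A minimal sketch of converting those back to endpoints (function name, shapes, and the exact normalization are assumptions, not the repo's code):

```python
import torch

def segments_to_endpoints(mid, direction, length):
    """Hypothetical midpoint/direction/length -> endpoint conversion.

    Assumed shapes: mid (N, 3), direction (N, 3), length (N, 1).
    The repo's actual conversion may differ.
    """
    d = torch.nn.functional.normalize(direction, dim=-1)  # unit direction
    half = 0.5 * length * d                               # half-extent vector
    return mid - half, mid + half                         # endpoints p0, p1
```

The point of this parametrization is that midpoint and length are order-invariant, so the model never has to commit to an endpoint ordering.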

## Evaluation sets

Three distinct evaluation sets appear in this work. Every HSS / F1 / IoU
number below comes from one of these three, and we tag each number with its set.

- **Dev val** = the last 1024 scenes of the published training set
  (`hf://usm3d/s23dr-2026-sampled_*_v2:train`). This is what we actually
  optimized against during development, and it is the set behind every
  "HSS=0.382"-style number in this document, in `submitted_2048/README.md`,
  and in the run-history files under `repro_runs/` and the validation-archive.
- **Official validation** = `hf://usm3d/s23dr-2026-sampled_*_v2:validation`
  (equivalently the `*public*` tars in `usm3d/hoho22k_2026_test_x_anon`).
  We did *not* eval on this split during development. No HSS number in this
  repo refers to it.
- **Public test** = the `*private*` tars in `usm3d/hoho22k_2026_test_x_anon`,
  scored by the competition harness and posted to the leaderboard. We have
  two such numbers, both clearly labeled "public test" wherever they appear:
  **0.4273** (2048 submission, commit `f4487da`) and **0.4470** (4096
  submission, commit `4946666`).

Because we never validated against the official validation split, there is
some risk that the dev-val numbers are mildly overfit to the last-1024-train
slice. The +0.06 dev-val-to-public-test gap (consistent across both
submissions, see `submitted_2048/README.md`) is empirically positive, but it
is not a substitute for actually scoring on official val.
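For concreteness, the dev-val split described above is just a slice off the end of the ordered training set; a hypothetical helper (not the repo's loader):

```python
def split_dev_val(scenes, n_val=1024):
    """Carve the last n_val scenes off the training set as 'dev val'.

    `scenes` is any ordered sequence of scene records; the split depends on
    the published ordering of the training set being stable.
    """
    return scenes[:-n_val], scenes[-n_val:]
```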

### Step 1: 2048 Phase 1 (from scratch) β€” ~1.5hr

```
Data:       hf://usm3d/s23dr-2026-sampled_2048_v2:train (16,508 samples)
Steps:      0 -> 125,000 (242 epochs)
LR:         3e-4, warmup=10,000
Batch size: 32
Optimizer:  AdamW, betas=(0.9, 0.95), weight_decay=0.01
Sinkhorn:   eps=0.1, iters=20, dustbin=0.3
Conf:       weight=0.1, mode=sinkhorn, head_wd=0.1
Endpoint:   OFF
Aug:        rotate=True, flip=True
Seed:       353
```

Trains the perceiver from random init on 2048-point samples. The sinkhorn
optimal transport loss learns to match predicted segments to ground truth.
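The matching step can be sketched as log-domain Sinkhorn iterations over a pairwise cost matrix augmented with a dustbin row and column, so unmatched predictions and GT segments have somewhere to go. This is a hedged approximation of the assignment only; the marginals and exact dustbin handling in the repo's loss are assumptions here:

```python
import torch

def sinkhorn_match(cost, eps=0.1, iters=20, dustbin=0.3):
    """Entropic OT soft assignment of P predicted segments to G GT segments.

    `cost` is a (P, G) pairwise segment cost. Returns a (P+1, G+1) soft
    assignment whose last row/column absorb unmatched items.
    """
    P, G = cost.shape
    scores = -cost / eps
    bin_score = torch.tensor(dustbin / eps)
    scores = torch.cat([scores, bin_score.expand(P, 1)], dim=1)   # dustbin col
    scores = torch.cat([scores, bin_score.expand(1, G + 1)], dim=0)  # dustbin row
    log_mu = torch.zeros(P + 1)  # uniform marginals (a simplification)
    log_nu = torch.zeros(G + 1)
    u, v = torch.zeros(P + 1), torch.zeros(G + 1)
    for _ in range(iters):  # alternating log-domain normalizations
        u = log_mu - torch.logsumexp(scores + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(scores + u[:, None], dim=0)
    return (scores + u[:, None] + v[None, :]).exp()
```

With `iters=20` the column marginals are satisfied exactly after the final `v` update and the row marginals approximately, which is enough for a soft matching target.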

**Why 2048 first:** Training directly on 4096 overfits (1.47x train/val ratio
vs 1.19x for 2048). The 2048 model learns better-generalized representations.

**Output:** dev val HSS ~0.28.

### Step 2: 4096 finetune (constant LR) β€” ~15min

```
Resume:     Step 1 -> step125000.pt
Data:       hf://usm3d/s23dr-2026-sampled_4096_v2:train (15,892 samples)
Steps:      125,001 -> 135,000 (10k steps)
LR:         3e-5 (constant, no cooldown)
Batch size: 64
Endpoint:   OFF
```

Switches input from 2048 to 4096 points, increasing structural coverage from
66% to 74%. The gentle LR (3e-5) preserves learned representations while
adapting to the extra input. A higher LR (>1e-4) causes catastrophic forgetting.

Dev val HSS jumps from 0.28 to 0.35 in ~5k steps. Plateaus by 10k steps.

**Output:** dev val HSS ~0.35.

### Step 3: Cooldown with endpoint loss β€” ~1hr

```
Resume:     Step 2 -> step135000.pt
Data:       hf://usm3d/s23dr-2026-sampled_4096_v2:train
Steps:      135,001 -> 170,000 (35k steps)
LR:         3e-5, cooldown_start=150,000, cooldown_steps=20,000
            (constant 3e-5 for 15k steps, then linear decay to ~0 over 20k)
Batch size: 64
Endpoint:   weight=0.1
```
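The constant-then-linear-cooldown schedule above can be sketched as a pure function of the step (edge handling is an assumption; the repo's scheduler may differ):

```python
def lr_at(step, base_lr=3e-5, cooldown_start=150_000, cooldown_steps=20_000):
    """LR schedule for Step 3: hold base_lr, then decay linearly to 0."""
    if step < cooldown_start:
        return base_lr
    frac = min((step - cooldown_start) / cooldown_steps, 1.0)
    return base_lr * (1.0 - frac)
```

Note the arithmetic: cooldown starts at 150k and runs 20k steps, so the decay finishes exactly at the 170k endpoint of Step 3.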

Adds symmetric endpoint L1 loss (using detached sinkhorn assignment) to
tighten vertex precision. The sinkhorn loss alone operates on segment
midpoint/direction/length and doesn't directly penalize endpoint position error.
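A minimal sketch of a symmetric endpoint L1 term, assuming prediction/GT pairs have already been matched; the repo additionally weights terms by the detached sinkhorn assignment, which is omitted here:

```python
import torch

def endpoint_l1(pred0, pred1, gt0, gt1):
    """Symmetric endpoint L1 over matched segment pairs, shapes (N, 3) each.

    Segments are undirected, so take the cheaper of the two endpoint
    pairings before averaging.
    """
    direct = (pred0 - gt0).abs().sum(-1) + (pred1 - gt1).abs().sum(-1)
    flipped = (pred0 - gt1).abs().sum(-1) + (pred1 - gt0).abs().sum(-1)
    return torch.minimum(direct, flipped).mean()
```

The min over orderings is what makes the term symmetric: a prediction with swapped endpoints incurs zero extra penalty.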

**Output:** dev val HSS=0.382, F1=0.414. Public test HSS=0.4470.

### Key Numbers

Per-stage **dev val** scores from the **original training run** (March 23-26),
which produced the shipped `checkpoint.pt`. Re-running `reproduce.sh` from
scratch does not hit these exact numbers - see "Reproduction Results" below for
the actual ranges. Compiled-mode re-runs of Step 3 land in dev val 0.342-0.379,
with the best run from this codebase at 0.376.

| Stage | Steps | Dev val HSS | Dev val F1 | What changed |
|-------|-------|-------------|------------|--------------|
| After Step 1 | 125k | 0.281 | 0.156 | Learned geometry from 2048 pts |
| After Step 2 | 135k | 0.351 | 0.190 | Coverage 66% -> 74% from 4096 pts |
| After Step 3 | 170k | **0.382** | **0.414** | Vertex precision from endpoint loss |

## Why This Works

1. **2048 training has low overfitting** (1.19x train/val ratio) β€” the model
   learns good representations without memorizing training samples.

2. **4096 data has higher coverage ceiling** (74% vs 66% structural points) β€”
   more of the building surface is observed, improving vertex recall.

3. **Gentle finetuning preserves representations** β€” at lr=3e-5, the model
   keeps its learned geometry understanding while adapting to the extra input.

4. **Endpoint loss tightens vertices** β€” the symmetric endpoint distance
   directly penalizes vertex position errors, which sinkhorn loss alone
   doesn't do (it operates on midpoint/direction/length parametrization).

## What Doesn't Work (yet)

These are informal observations from one-off experiments during development.
The runs, args, and eval logs are mostly [here](https://github.com/JackLangerman/s23dr_2026_example),
but not all of them are preserved perfectly. The specific numbers below come
from contemporaneous notes and are not all trivially reproducible. Take them
as directional guidance, not as benchmarks.

- **Training 4096 from scratch:** observed to overfit (~1.47x train/val loss
  gap, vs ~1.19x for 2048) and peak around dev val HSS 0.346 in a single run.
- **BuildingWorld pretraining:** in one experiment, the representations were
  near-orthogonal to S23DR (cosine sim ~0.05 between learned features) and
  did not transfer.
- **Mixed BW+S23DR training:** mixing BW data into the S23DR loader hurt
  dev val HSS in the runs we tried, presumed to be from domain gap.
- **High dropout / weight decay:** lowered the train/val gap but also lowered
  dev val HSS in the configurations we tried.
- **High finetune LR (>1e-4):** dropped dev val HSS sharply in Step 2 in
  single-run observations, consistent with disrupting the Step 1 representations.
- **Steeper cooldown (1e-5, 20x drop):** slightly worse than 3e-5 in the one
  comparison we ran for this checkpoint.

## Reproduction Results

### End-to-end reproductions

All HSS / F1 / IoU below are on **dev val**.

| Model | Dev val HSS | Dev val F1 | Dev val IoU | Notes |
|-------|-------------|------------|-------------|-------|
| Original | 0.382 | 0.414 | 0.370 | Shipped checkpoint, original training run, not reproducible from this codebase |
| E2E repro #4 | 0.379 | 0.409 | 0.369 | Closest E2E, `repro_runs/e2e_repro4_hss379/` |
| Compiled repro (from submission codebase) | 0.376 | β€” | β€” | Best compiled repro from this codebase, `repro_runs/compiled_repro_hss376/` |
| E2E repro #3 | 0.375 | 0.404 | 0.367 | |
| Deterministic E2E | 0.372 | 0.398 | 0.368 | Bit-reproducible, `repro_runs/deterministic_hss372/` |
| E2E repro #5 | 0.349 | 0.373 | β€” | Compiled, low end of cluster |
| `reproduce.sh` smoketest | 0.342 | β€” | β€” | Single run of the published script end-to-end (validation-archive `runs/reproduce_smoketest/`) |

### Partial reproductions (isolating pipeline stages)

| Test | Starting from | Dev val HSS | Gap to original |
|------|--------------|-------------|-----------------|
| Step 3 from orig Step 2 (run A) | Original step135000.pt | 0.382 | 0.000 |
| Step 3 from orig Step 2 (run B) | Original step135000.pt | 0.384 | +0.002 |
| Step 2+3 from orig Step 1 | Original step125000.pt | 0.377 | -0.005 |
| Step 1 from orig step 100k | Original step100000.pt | 0.285 (Step 1) | +0.004 vs 0.281 |

Step 3 from the same checkpoint reproduces to within 0.002 dev val HSS. The
full E2E dev val HSS variance (0.342-0.379, see All benchmarks below) is
dominated by torch.compile nondeterminism in Step 1.

### All benchmarks

The HSS / F1 / IoU columns below are all on **dev val**. Public-test scores
appear in the Notes column where available.

| Model | Input | Dev val HSS | Dev val F1 | Dev val IoU | Notes |
|-------|-------|-------------|------------|-------------|-------|
| Handcrafted baseline | raw views | 0.307 | 0.404 | 0.260 | |
| h256+qk+ep (submitted) | 2048 | 0.365 | 0.388 | 0.360 | Public test HSS=0.4273 (commit f4487da) |
| Original 3-step | 2048 | 0.373 | 0.404 | 0.363 | |
| Original 3-step | 4096 | 0.382 | 0.414 | 0.370 | Best ever, original training. **Public test HSS=0.4470** (commit 4946666) |
| Step3 repro from orig S2 | 4096 | 0.384 | 0.414 | β€” | Near-exact repro from a saved Step 2 ckpt |
| E2E repro #4 | 4096 | 0.379 | 0.409 | 0.369 | |
| Compiled repro (submission codebase) | 4096 | 0.376 | β€” | β€” | Best compiled from this exact codebase |
| E2E repro #3 | 4096 | 0.375 | 0.404 | 0.367 | |
| Deterministic E2E | 4096 | 0.372 | 0.398 | 0.368 | Bit-reproducible across runs (different trajectory than compiled) |
| E2E repro #5 | 4096 | 0.349 | 0.373 | β€” | Compiled, low end of cluster |
| `reproduce.sh` smoketest | 4096 | 0.342 | β€” | β€” | Single E2E run of the published `reproduce.sh` |

The shipped 0.382 is not exactly reachable from this codebase (the original
training run is gone). Best compiled repro is **0.376**, the mode of
compiled repros is around 0.375-0.379, and the lower tail extends to 0.342.

## Code Equivalence Verification

| Test | Result |
|------|--------|
| Forward pass (same checkpoint, same input) | Bit-identical (0.00 diff) |
| Loss computation | Bit-identical (0.00 diff) |
| Gradient computation | 5e-8 max diff |
| Training from same seed | Bit-identical steps 1-44 |
| Step 3 from same checkpoint (2 runs) | Dev val HSS=0.382, 0.384 |
| Deterministic mode (2 runs) | Bit-identical (0.00 diff) |

## Reproducibility Notes

All HSS numbers in this section are on **dev val**.

**Default mode** (`reproduce.sh`): Uses torch.compile (~3x faster). Each run
gets different Triton kernels, causing ~1e-8 floating-point divergence at a
random step (31-45). This grows through chaotic SGD dynamics. Documented
compiled E2E runs from this codebase have produced dev val HSS in
**0.342-0.379**; the modal cluster is 0.375-0.379, with low-side excursions
at 0.349 and 0.342 (no high-side outliers, so the distribution is one-sided
rather than symmetric around a mean). Best compiled repro is 0.376, vs the
shipped 0.382 from the original (lost) training run.

**Deterministic mode** (`--deterministic` flag): Disables torch.compile and
forces CUDA deterministic ops. Bit-identical across runs with the same seed
(verified across 3 independent runs). Dev val HSS=0.372. Note: deterministic
mode **diverges from compiled mode at step 1** because eager and compiled
forward passes use different floating-point reduction orders - it is a
different numerical trajectory entirely, not a reproduction of any compiled run.
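What `--deterministic` plausibly entails, as a sketch; the actual flag may set more or fewer knobs, and it also skips torch.compile at model-build time (not shown here):

```python
import os
import random

import numpy as np
import torch

def enable_deterministic(seed=353):
    """Sketch of deterministic-mode setup: fixed seeds + deterministic kernels.

    CUBLAS_WORKSPACE_CONFIG is required by some CUDA ops once
    use_deterministic_algorithms(True) is set.
    """
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```

Calling this twice with the same seed replays the same RNG stream, which is the property the bit-identical-runs claim rests on.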

**bad_samples.txt**: The shipped file has 156 entries to match original training.
(Note: `wc -l` reports 155 because the last line lacks a trailing newline.)
Two additional bad samples (`47b0e0ce19b`, `4b2d56eb3ef`) were discovered after
the original training run. They are legitimately bad (misaligned GT) but were
included in the original training data. Adding them changes the batch iteration
order and costs ~0.005 dev val HSS in deterministic mode (0.372 -> 0.367) and
~0.04 in compiled mode (0.376 -> 0.335 in our `validate_155_compiled` vs
`validate_158_compiled` runs) due to compounded torch.compile variance.
Participants training from scratch may wish to add these 2 entries for cleaner
training data, but should expect slightly different scores due to the changed
iteration order.

The shipped `checkpoint.pt` is from the original training run
(dev val HSS=0.382, public test HSS=0.4470).