# Reproducing the Best Checkpoint (HSS=0.382)
## Quick Start
The `checkpoint.pt` in this repo is the final model. To run inference:
```bash
python script.py
```
To reproduce from scratch (~3hr on 1x RTX 4090):
```bash
bash reproduce.sh
```
## Exact Recipe
Architecture (unchanged across all 3 steps):
```
Perceiver: hidden=256, ff=1024, latent_tokens=256, latent_layers=7
encoder_layers=4, decoder_layers=3, cross_attn_interval=4
num_heads=4, kv_heads_cross=2, kv_heads_self=2
qk_norm=True (L2), rms_norm=True, dropout=0.1
segments=64, segment_param=midpoint_dir_len, segment_conf=True
behind_emb_dim=8, vote_features=True, activation=gelu
```
All shared config lives in `configs/base.json`.
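For orientation, here is a minimal sketch of how a run might layer step-specific overrides onto `configs/base.json`. The override key names (`lr`, `batch_size`, `max_steps`, `seed`) are illustrative and not necessarily the schema this codebase uses.
```python
# Hypothetical sketch: merge step-specific overrides onto the shared base config.
# Key names below are illustrative, not the actual schema of this codebase.
import json

with open("configs/base.json") as f:
    cfg = json.load(f)  # shared architecture: hidden, latent_tokens, heads, ...

step1_overrides = {"lr": 3e-4, "batch_size": 32, "max_steps": 125_000, "seed": 353}
cfg.update(step1_overrides)
print(json.dumps(cfg, indent=2))
```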
## Evaluation sets
Three distinct evaluation sets show up in this work. Every HSS / F1 / IoU
number below comes from one of these three; we tag each number with its set
wherever possible.
- **Dev val** = the last 1024 scenes of the published training set
(`hf://usm3d/s23dr-2026-sampled_*_v2:train`). This is what we actually
optimized against during development, and it is the set behind every
"HSS=0.382"-style number in this document, in `submitted_2048/README.md`,
and in the run-history files under `repro_runs/` and the validation-archive.
- **Official validation** = `hf://usm3d/s23dr-2026-sampled_*_v2:validation`
(equivalently the `*public*` tars in `usm3d/hoho22k_2026_test_x_anon`).
We did *not* eval on this split during development. No HSS number in this
repo refers to it.
- **Public test** = the `*private*` tars in `usm3d/hoho22k_2026_test_x_anon`,
scored by the competition harness and posted to the leaderboard. We have
two such numbers, both clearly labeled "public test" wherever they appear:
**0.4273** (2048 submission, commit `f4487da`) and **0.4470** (4096
submission, commit `4946666`).
Because we never validated against the official validation split, there is
some risk that the dev-val numbers are mildly overfit to the last-1024-train
slice. The roughly +0.06 dev-val-to-public-test gap (consistent across both
submissions, see `submitted_2048/README.md`) suggests the dev-val numbers are
not badly inflated, but it is not a substitute for actually scoring on the
official validation split.
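For concreteness, a minimal sketch of how the dev-val slice is carved out, assuming the `hf://` path above maps to a standard `datasets` repo id (the actual loader in this repo may differ):
```python
# Sketch: dev val = last 1024 scenes of the published training split.
# Assumes the hf:// path maps to a plain `datasets` repo id; hedged, not the
# repo's actual data-loading code.
from datasets import load_dataset

train = load_dataset("usm3d/s23dr-2026-sampled_2048_v2", split="train")
dev_val = train.select(range(len(train) - 1024, len(train)))  # last 1024 scenes
dev_train = train.select(range(len(train) - 1024))            # everything else
print(len(dev_train), len(dev_val))
```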
### Step 1: 2048 Phase 1 (from scratch), ~1.5hr
```
Data: hf://usm3d/s23dr-2026-sampled_2048_v2:train (16,508 samples)
Steps: 0 -> 125,000 (242 epochs)
LR: 3e-4, warmup=10,000
Batch size: 32
Optimizer: AdamW, betas=(0.9, 0.95), weight_decay=0.01
Sinkhorn: eps=0.1, iters=20, dustbin=0.3
Conf: weight=0.1, mode=sinkhorn, head_wd=0.1
Endpoint: OFF
Aug: rotate=True, flip=True
Seed: 353
```
Trains the perceiver from random init on 2048-point samples. The sinkhorn
optimal transport loss learns to match predicted segments to ground truth.
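A hedged sketch of the matching step, with `eps`, `iters`, and `dustbin` taken from the config above; the real cost function and dustbin handling may differ from this simplified version.
```python
# Simplified log-domain Sinkhorn with a dustbin row/column for unmatched
# segments. Illustrative only; the codebase's cost and marginals may differ.
import torch

def sinkhorn_match(cost, eps=0.1, iters=20, dustbin=0.3):
    """cost: (P, G) pairwise cost between P predicted and G ground-truth segments."""
    P, G = cost.shape
    cost = torch.nn.functional.pad(cost, (0, 1, 0, 1), value=dustbin)  # add dustbin
    log_K = -cost / eps                          # entropic kernel in log space
    log_u = torch.zeros(P + 1)
    log_v = torch.zeros(G + 1)
    for _ in range(iters):                       # alternating row/column scaling
        log_u = -torch.logsumexp(log_K + log_v[None, :], dim=1)
        log_v = -torch.logsumexp(log_K + log_u[:, None], dim=0)
    return torch.exp(log_K + log_u[:, None] + log_v[None, :])  # soft assignment

assignment = sinkhorn_match(torch.rand(5, 3))    # 5 predictions vs 3 GT segments
```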
**Why 2048 first:** Training directly on 4096 overfits (1.47x train/val ratio
vs 1.19x for 2048). The 2048 model learns better-generalized representations.
**Output:** dev val HSS ~0.28.
### Step 2: 4096 finetune (constant LR), ~15min
```
Resume: Step 1 -> step125000.pt
Data: hf://usm3d/s23dr-2026-sampled_4096_v2:train (15,892 samples)
Steps: 125,001 -> 135,000 (10k steps)
LR: 3e-5 (constant, no cooldown)
Batch size: 64
Endpoint: OFF
```
Switches input from 2048 to 4096 points, increasing structural coverage from
66% to 74%. The gentle lr (3e-5) preserves learned representations while
adapting to the extra input. Higher LR (>1e-4) causes catastrophic forgetting.
Dev val HSS jumps from 0.28 to 0.35 in ~5k steps. Plateaus by 10k steps.
**Output:** dev val HSS ~0.35.
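A hedged sketch of the resume logic, with a stand-in module and an assumed checkpoint key; the optimizer settings shown are the Step 1 values and are assumed to carry over unchanged.
```python
# Sketch: resume from the Step 1 checkpoint and finetune at a constant 3e-5.
# "model" as the state-dict key and the stand-in module are assumptions.
import torch

model = torch.nn.Linear(8, 8)                  # stand-in for the perceiver
ckpt = torch.load("step125000.pt", map_location="cpu")
model.load_state_dict(ckpt["model"], strict=False)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5,
                              betas=(0.9, 0.95), weight_decay=0.01)
# No scheduler: lr stays at 3e-5 for the full 10k-step finetune (no cooldown).
```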
### Step 3: Cooldown with endpoint loss, ~1hr
```
Resume: Step 2 -> step135000.pt
Data: hf://usm3d/s23dr-2026-sampled_4096_v2:train
Steps: 135,001 -> 170,000 (35k steps)
LR: 3e-5, cooldown_start=150,000, cooldown_steps=20,000
(constant 3e-5 for 15k steps, then linear decay to ~0 over 20k)
Batch size: 64
Endpoint: weight=0.1
```
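The schedule itself is easy to write down; a sketch with the values from the block above (function name illustrative):
```python
# Step 3 LR schedule: constant 3e-5 until cooldown_start, then linear decay
# to 0 over cooldown_steps. Values from the config above; name is illustrative.
def lr_at(step, base_lr=3e-5, cooldown_start=150_000, cooldown_steps=20_000):
    if step <= cooldown_start:
        return base_lr
    frac = min((step - cooldown_start) / cooldown_steps, 1.0)
    return base_lr * (1.0 - frac)

assert lr_at(140_000) == 3e-5                    # constant region
assert abs(lr_at(160_000) - 1.5e-5) < 1e-12      # halfway through cooldown
assert lr_at(170_000) == 0.0                     # end of Step 3
```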
Adds symmetric endpoint L1 loss (using detached sinkhorn assignment) to
tighten vertex precision. The sinkhorn loss alone operates on segment
midpoint/direction/length and doesn't directly penalize endpoint position error.
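A minimal sketch of the symmetric endpoint term, assuming predictions and ground truth have already been paired by the detached sinkhorn assignment (tensor layout here is illustrative):
```python
# Symmetric endpoint L1: for each matched pair, take the cheaper of the two
# endpoint orderings. Shapes/interfaces are illustrative, not the repo's API.
import torch

def endpoint_l1(pred, gt):
    """pred, gt: (M, 2, 3) endpoints of M matched segment pairs."""
    direct  = (pred - gt).abs().sum(dim=(1, 2))                  # a->a', b->b'
    flipped = (pred - gt.flip(dims=[1])).abs().sum(dim=(1, 2))   # a->b', b->a'
    return torch.minimum(direct, flipped).mean()

loss = 0.1 * endpoint_l1(torch.rand(8, 2, 3), torch.rand(8, 2, 3))  # weight=0.1
```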
**Output:** dev val HSS=0.382, F1=0.414. Public test HSS=0.4470.
### Key Numbers
Per-stage **dev val** scores from the **original training run** (March 23-26),
which produced the shipped `checkpoint.pt`. Re-running `reproduce.sh` from
scratch does not hit these exact numbers - see "Reproduction Results" below for
the actual ranges. Compiled-mode end-to-end re-runs land at dev val HSS
0.342-0.379, with the best run from this codebase at 0.376.
| Stage | Steps | Dev val HSS | Dev val F1 | What changed |
|-------|-------|-------------|------------|--------------|
| After Step 1 | 125k | 0.281 | 0.156 | Learned geometry from 2048 pts |
| After Step 2 | 135k | 0.351 | 0.190 | Coverage 66% -> 74% from 4096 pts |
| After Step 3 | 170k | **0.382** | **0.411** | Vertex precision from endpoint loss |
## Why This Works
1. **2048 training has low overfitting** (1.19x train/val ratio): the model
   learns good representations without memorizing training samples.
2. **4096 data has a higher coverage ceiling** (74% vs 66% structural points):
   more of the building surface is observed, improving vertex recall.
3. **Gentle finetuning preserves representations**: at lr=3e-5, the model
   keeps its learned geometry understanding while adapting to the extra input.
4. **Endpoint loss tightens vertices**: the symmetric endpoint distance
   directly penalizes vertex position errors, which the sinkhorn loss alone
   doesn't do (it operates on the midpoint/direction/length parametrization).
## What Doesn't Work (yet)
These are informal observations from one-off experiments during development.
The runs, args, and eval logs are mostly preserved [here](https://github.com/JackLangerman/s23dr_2026_example),
but not all of them are complete. The specific numbers below come from
contemporaneous notes and are not all trivially reproducible. Take them as
directional guidance, not as benchmarks.
- **Training 4096 from scratch:** observed to overfit (~1.47x train/val loss
gap, vs ~1.19x for 2048) and peak around dev val HSS 0.346 in a single run.
- **BuildingWorld pretraining:** in one experiment, the representations were
near-orthogonal to S23DR (cosine sim ~0.05 between learned features) and
did not transfer.
- **Mixed BW+S23DR training:** mixing BW data into the S23DR loader hurt
dev val HSS in the runs we tried, presumed to be from domain gap.
- **High dropout / weight decay:** lowered the train/val gap but also lowered
dev val HSS in the configurations we tried.
- **High finetune LR (>1e-4):** dropped dev val HSS sharply in Step 2 in
single-run observations, consistent with disrupting the Step 1 representations.
- **Steeper cooldown (1e-5, 20x drop):** slightly worse than 3e-5 in the one
comparison we ran for this checkpoint.
## Reproduction Results
### End-to-end reproductions
All HSS / F1 / IoU below are on **dev val**.
| Model | Dev val HSS | Dev val F1 | Dev val IoU | Notes |
|-------|-------------|------------|-------------|-------|
| Original | 0.382 | 0.414 | 0.370 | Shipped checkpoint, original training run, not reproducible from this codebase |
| E2E repro #4 | 0.379 | 0.409 | 0.369 | Closest E2E, `repro_runs/e2e_repro4_hss379/` |
| Compiled repro (from submission codebase) | 0.376 | β | β | Best compiled repro from this codebase, `repro_runs/compiled_repro_hss376/` |
| E2E repro #3 | 0.375 | 0.404 | 0.367 | |
| Deterministic E2E | 0.372 | 0.398 | 0.368 | Bit-reproducible, `repro_runs/deterministic_hss372/` |
| E2E repro #5 | 0.349 | 0.373 | β | Compiled, low end of cluster |
| `reproduce.sh` smoketest | 0.342 | β | β | Single run of the published script end-to-end (validation-archive `runs/reproduce_smoketest/`) |
### Partial reproductions (isolating pipeline stages)
| Test | Starting from | Dev val HSS | Gap to original |
|------|--------------|-------------|-----------------|
| Step 3 from orig Step 2 (run A) | Original step135000.pt | 0.382 | 0.000 |
| Step 3 from orig Step 2 (run B) | Original step135000.pt | 0.384 | +0.002 |
| Step 2+3 from orig Step 1 | Original step125000.pt | 0.377 | -0.005 |
| Step 1 from orig step 100k | Original step100000.pt | 0.285 (Step 1) | +0.004 vs 0.281 |
Step 3 from the same checkpoint reproduces to within 0.002 dev val HSS. The
full E2E dev val HSS variance (0.342-0.379, see All benchmarks below) is
dominated by torch.compile nondeterminism in Step 1.
### All benchmarks
The HSS / F1 / IoU columns below are all on **dev val**. Public-test scores
appear in the Notes column where available.
| Model | Input | Dev val HSS | Dev val F1 | Dev val IoU | Notes |
|-------|-------|-------------|------------|-------------|-------|
| Handcrafted baseline | raw views | 0.307 | 0.404 | 0.260 | |
| h256+qk+ep (submitted) | 2048 | 0.365 | 0.388 | 0.360 | Public test HSS=0.4273 (commit f4487da) |
| Original 3-step | 2048 | 0.373 | 0.404 | 0.363 | |
| Original 3-step | 4096 | 0.382 | 0.414 | 0.370 | Best ever, original training. **Public test HSS=0.4470** (commit 4946666) |
| Step3 repro from orig S2 | 4096 | 0.384 | 0.414 | β | Near-exact repro from a saved Step 2 ckpt |
| E2E repro #4 | 4096 | 0.379 | 0.409 | 0.369 | |
| Compiled repro (submission codebase) | 4096 | 0.376 | β | β | Best compiled from this exact codebase |
| E2E repro #3 | 4096 | 0.375 | 0.404 | 0.367 | |
| Deterministic E2E | 4096 | 0.372 | 0.398 | 0.368 | Bit-reproducible across runs (different trajectory than compiled) |
| E2E repro #5 | 4096 | 0.349 | 0.373 | β | Compiled, low end of cluster |
| `reproduce.sh` smoketest | 4096 | 0.342 | β | β | Single E2E run of the published `reproduce.sh` |
The shipped 0.382 cannot be regenerated from this codebase (the original
training run is gone). The best compiled repro is **0.376**, the modal cluster
of compiled repros is 0.375-0.379, and the lower tail extends to 0.342.
## Code Equivalence Verification
| Test | Result |
|------|--------|
| Forward pass (same checkpoint, same input) | Bit-identical (0.00 diff) |
| Loss computation | Bit-identical (0.00 diff) |
| Gradient computation | 5e-8 max diff |
| Training from same seed | Bit-identical steps 1-44 |
| Step 3 from same checkpoint (2 runs) | Dev val HSS=0.382, 0.384 |
| Deterministic mode (2 runs) | Bit-identical (0.00 diff) |
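The forward-pass and loss checks boil down to running the same checkpoint and input through both code paths and comparing tensors; a minimal sketch:
```python
# Compare two outputs for bit-identity and max absolute difference.
import torch

def report(out_a: torch.Tensor, out_b: torch.Tensor) -> None:
    max_diff = (out_a - out_b).abs().max().item()
    print(f"bit-identical: {torch.equal(out_a, out_b)}  max|diff|: {max_diff:.2e}")

# In practice out_a / out_b come from the two codebases, same checkpoint, same input.
report(torch.zeros(4), torch.zeros(4))  # bit-identical: True  max|diff|: 0.00e+00
```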
## Reproducibility Notes
All HSS numbers in this section are on **dev val**.
**Default mode** (`reproduce.sh`): Uses torch.compile (~3x faster). Each run
gets different Triton kernels, causing ~1e-8 floating-point divergence at a
random step (31-45). This grows through chaotic SGD dynamics. Documented
compiled E2E runs from this codebase have produced dev val HSS in
**0.342-0.379**; the modal cluster is 0.375-0.379, with low-side excursions
at 0.349 and 0.342 (no high-side outliers, so the distribution is one-sided
rather than symmetric around a mean). Best compiled repro is 0.376, vs the
shipped 0.382 from the original (lost) training run.
**Deterministic mode** (`--deterministic` flag): Disables torch.compile and
forces CUDA deterministic ops. Bit-identical across runs with the same seed
(verified across 3 independent runs). Dev val HSS=0.372. Note: deterministic
mode **diverges from compiled mode at step 1** because eager and compiled
forward passes use different floating-point reduction orders - it is a
different numerical trajectory entirely, not a reproduction of any compiled run.
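What the flag amounts to is the standard PyTorch recipe; a sketch (the actual wiring lives in the training script):
```python
# Deterministic-mode setup: deterministic CUDA kernels, fixed seeds, and no
# torch.compile. Standard PyTorch knobs; the script may set more than this.
import os, random
import numpy as np
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed by some CUDA ops
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

seed = 353
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seeds CUDA
# ...and do NOT wrap the model in torch.compile on this path.
```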
**bad_samples.txt**: The shipped file has 156 entries to match the original training run.
(Note: `wc -l` reports 155 because the last line lacks a trailing newline.)
Two additional bad samples (`47b0e0ce19b`, `4b2d56eb3ef`) were discovered after
the original training run. They are legitimately bad (misaligned GT) but were
included in the original training data. Adding them changes the batch iteration
order and costs ~0.005 dev val HSS in deterministic mode (0.372 -> 0.367) and
~0.04 in compiled mode (0.376 -> 0.335 in our `validate_155_compiled` vs
`validate_158_compiled` runs) due to compounded torch.compile variance.
Participants training from scratch may wish to add these 2 entries for cleaner
training data, but should expect slightly different scores due to the changed
iteration order.
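A small sketch of counting the entries robustly (and optionally appending the two extra IDs), which sidesteps the `wc -l` off-by-one:
```python
# Count non-empty lines regardless of the missing trailing newline, and
# optionally add the two bad samples found after the original run.
with open("bad_samples.txt") as f:
    entries = [line.strip() for line in f if line.strip()]
print(len(entries))  # 156 for the shipped file

extra = ["47b0e0ce19b", "4b2d56eb3ef"]
entries += [e for e in extra if e not in entries]
```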
The shipped `checkpoint.pt` is from the original training run
(dev val HSS=0.382, public test HSS=0.4470).