# S23DR 2026 Learned Baseline
A learned baseline for the S23DR 2026 challenge (Structured and Semantic 3D Reconstruction, or S^2 3DR), part of the USM3D workshop at CVPR 2026. The model takes a fused point cloud of a building and predicts its wireframe as a set of 3D line segments.
Headline result: HSS = 0.382 on the 1024-sample validation set (shipped checkpoint).
For context, the handcrafted baseline scores HSS = 0.307 on the same split.
## Quick start
Run the submission pipeline directly (matches the competition eval harness):
```bash
python script.py
```
That loads checkpoint.pt, fuses the input views into a 4096-point cloud, runs the model, and writes the predicted wireframe for each scene.
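The fusion step reduces the input views to a fixed 4096-point budget before the model sees them. The actual fusion logic lives in script.py; the sketch below only illustrates the final subsampling step, with a hypothetical helper name:

```python
import numpy as np

def subsample_points(points: np.ndarray, budget: int = 4096, seed: int = 0) -> np.ndarray:
    """Uniform random subsampling to a fixed token budget.

    Hypothetical helper (script.py implements its own fusion/sampling);
    samples with replacement when the cloud has fewer points than the budget.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    idx = rng.choice(n, size=budget, replace=n < budget)
    return points[idx]

cloud = np.random.rand(10_000, 3)
print(subsample_points(cloud).shape)  # (4096, 3)
```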
To reproduce the checkpoint from scratch on a single RTX 4090 (~3 hours):
```bash
bash reproduce.sh
```
Or for a bit-identical deterministic run (~5.5 hours, slower because it disables torch.compile):
```bash
bash reproduce_deterministic.sh
```
Both scripts run the full three-stage recipe described below. See REPRODUCE.md for the exact hyperparameters and the reproducibility notes.
## Architecture
A Perceiver-style transformer that ingests the point cloud as a sequence of per-point tokens, cross-attends them into a latent array, and decodes a fixed set of 3D line segments from that latent.
```
Perceiver: hidden=256, ff=1024
latent_tokens=256, latent_layers=7
encoder_layers=4, decoder_layers=3, cross_attn_interval=4
num_heads=4, kv_heads_cross=2, kv_heads_self=2
qk_norm=L2, rms_norm=True, dropout=0.1
segments=64, segment_param=midpoint_dir_len, segment_conf=True
behind_emb_dim=8, vote_features=True, activation=gelu
```
The decoder predicts 64 candidate segments, each parametrized as midpoint + direction + length with a confidence head. Training uses a Sinkhorn optimal-transport loss to match predicted segments to ground-truth, plus a symmetric endpoint L1 term in the cooldown stage.
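For reference, the midpoint + direction + length parametrization maps to endpoints as follows. This is a minimal numpy sketch, not the in-repo implementation (which lives in the training package and operates on batched torch tensors); it assumes the direction head emits free-scale vectors that need normalizing:

```python
import numpy as np

def segment_to_endpoints(mid: np.ndarray, direction: np.ndarray, length: np.ndarray):
    """Convert a midpoint_dir_len segment to its two 3D endpoints.

    Assumption: `direction` is free-scale and is normalized here; the actual
    head in this repo may already produce unit vectors.
    """
    unit = direction / (np.linalg.norm(direction, axis=-1, keepdims=True) + 1e-8)
    half = 0.5 * np.asarray(length)[..., None] * unit
    return mid - half, mid + half

a, b = segment_to_endpoints(np.zeros(3), np.array([2.0, 0.0, 0.0]), np.array(4.0))
print(a, b)  # approximately [-2, 0, 0] and [2, 0, 0]
```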
All architecture and optimizer settings live in configs/base.json.
## Training recipe
The model ships with a three-stage recipe. Each stage starts from the previous stage's final checkpoint.
| Stage | Input | Steps | LR | Batch | Notes | HSS |
|---|---|---|---|---|---|---|
| 1. 2048 from scratch | 2048 pts | 0 -> 125k | 3e-4, warmup 10k | 32 | Random init, Sinkhorn only | 0.281 |
| 2. 4096 finetune | 4096 pts | 125k -> 135k | 3e-5 constant | 64 | Gentle LR preserves representations | 0.351 |
| 3. Endpoint cooldown | 4096 pts | 135k -> 170k | 3e-5 then linear decay | 64 | Adds endpoint L1 loss, tightens vertices | 0.382 |
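Piecing the LR column together, the full trajectory over the 170k steps can be sketched as follows. The exact shape of the stage-3 decay is an assumption (linear over the whole 135k-170k window, ending at 0); REPRODUCE.md has the authoritative values:

```python
def lr_at(step: int) -> float:
    """Sketch of the 3-stage LR schedule from the table above.

    Assumption: stage 3 decays linearly over the whole 135k-170k window,
    ending at 0; see REPRODUCE.md for the exact hyperparameters.
    """
    if step < 125_000:                 # stage 1: 10k linear warmup, then 3e-4
        return 3e-4 * min(1.0, step / 10_000)
    if step < 135_000:                 # stage 2: gentle constant finetune LR
        return 3e-5
    frac = (step - 135_000) / 35_000   # stage 3: linear decay to 0
    return 3e-5 * (1.0 - frac)

print(lr_at(5_000), lr_at(130_000), lr_at(170_000))
```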
Why 2048 first: training directly on 4096 overfits (1.47x train/val ratio vs 1.19x for 2048). Starting on 2048 produces better-generalized representations that the 4096 finetune can then specialize.
Why a gentle LR on finetune: LR > 1e-4 causes catastrophic forgetting of the 2048 geometry understanding.
Why endpoint loss only in stage 3: the Sinkhorn loss operates on the midpoint/direction/length parametrization and doesn't directly penalize vertex position error. Adding a symmetric endpoint L1 against the detached Sinkhorn assignment tightens vertex precision in the cooldown.
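The endpoint term is symmetric because a segment has no canonical endpoint order, so the loss takes the cheaper of the two pairings. A minimal numpy sketch (the in-repo loss runs on batched torch tensors under the detached Sinkhorn assignment):

```python
import numpy as np

def symmetric_endpoint_l1(pa, pb, ga, gb):
    """L1 endpoint loss, invariant to endpoint ordering.

    pa/pb: predicted endpoints, ga/gb: matched ground-truth endpoints,
    each of shape (..., 3). Returns the per-segment loss.
    """
    direct  = np.abs(pa - ga).sum(-1) + np.abs(pb - gb).sum(-1)
    flipped = np.abs(pa - gb).sum(-1) + np.abs(pb - ga).sum(-1)
    return np.minimum(direct, flipped)

# A prediction identical to the GT segment but with endpoints swapped
# incurs zero loss:
print(symmetric_endpoint_l1(np.array([1.0, 0, 0]), np.zeros(3),
                            np.zeros(3), np.array([1.0, 0, 0])))  # 0.0
```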
Full details, including the "what does not work" list (BuildingWorld pretraining, mixed training, high dropout, etc.), are in REPRODUCE.md.
## Evaluation
About the numbers: all val scores below are HSS at confidence threshold 0.7, averaged over the 1024-sample internal validation split we hold out from the published training data (usm3d/s23dr-2026-sampled_{2048,4096}_v2:validation). They are not test-set numbers. The only test-set number we have is the public leaderboard score of the older 2048 submission (see submitted_2048/ and the last table below). All numbers below were freshly measured for this release against the checkpoints in this repo.
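Concretely, scoring keeps only the segments whose confidence clears the threshold. A sketch assuming per-segment confidences in [0, 1]; the eval harness's exact filtering may differ:

```python
import numpy as np

def filter_by_confidence(segments: np.ndarray, conf: np.ndarray, thr: float = 0.7):
    """Keep predicted segments whose confidence is at least `thr`.

    segments: (N, 2, 3) endpoint pairs, conf: (N,) confidences in [0, 1].
    """
    keep = conf >= thr
    return segments[keep]

segs = np.zeros((64, 2, 3))                # 64 candidates from the decoder
conf = np.linspace(0.0, 1.0, 64)
print(filter_by_confidence(segs, conf).shape)
```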
### Shipped model and reproductions
| Model | Checkpoint | HSS @ 4096 | HSS @ 2048 |
|---|---|---|---|
| Handcrafted baseline | — | 0.307 | — |
| Current release (shipped) | checkpoint.pt | 0.3819 | 0.3734 |
| Closest compiled E2E repro (#4) | repro_runs/e2e_repro4_hss379/ | 0.3736 | 0.3675 |
| Best compiled repro from this codebase | repro_runs/compiled_repro_hss376/ | 0.3757 | 0.3670 |
| Deterministic E2E repro (bit-reproducible) | repro_runs/deterministic_hss372/ | 0.3716 | 0.3665 |
All repros use the exact 3-stage recipe on a single RTX 4090. The shipped checkpoint.pt was trained on the same recipe before this release branch was cut; the ~0.005-0.010 HSS gap between shipped and repros is compiled-mode run-to-run variance (see the Reproducibility section).
### Training progression (deterministic repro, all stages measured fresh)
| Stage | Steps | HSS @ 4096 | HSS @ 2048 |
|---|---|---|---|
| 1. 2048 from-scratch | 125k | 0.2755 | 0.2812 |
| 2. 4096 finetune | 135k | 0.3557 | 0.3510 |
| 3. Endpoint cooldown | 170k | 0.3716 | 0.3665 |
The stage 1 -> stage 2 jump (+0.08 HSS on 4096) is the biggest single improvement and motivates the 2048 -> 4096 transfer. Stage 3 (endpoint cooldown) adds another +0.016. Note how stage 1 is slightly better at 2048 than at 4096 (because it was only trained on 2048), while stages 2 and 3 invert that ordering after being finetuned on 4096.
### Previously submitted model (2048, single-stage)
The submitted_2048/ directory holds the checkpoint we actually sent to the public leaderboard. It was trained in a single stage on 2048-point data and is a direct ancestor of the current release.
| Split | Metric | Score |
|---|---|---|
| Public leaderboard (test) | HSS | 0.427 |
| Internal val @ 2048 | HSS | 0.3692 |
| Internal val @ 4096 | HSS | 0.3665 |
We do not have a test number for the current release, but the val-to-test gap observed on this 2048 submission was about +0.06 HSS (0.37 val -> 0.43 test). A similar gap on the current checkpoint.pt (0.382 val) would suggest a test score in the low 0.44s, though this is extrapolation and unverified.
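The extrapolation above is just the observed gap carried over unchanged, a back-of-the-envelope sketch:

```python
val_2048, test_2048 = 0.3692, 0.427        # submitted 2048 model (table above)
gap = test_2048 - val_2048                 # observed val-to-test gap (~+0.058)
estimate = 0.3819 + gap                    # naive estimate for checkpoint.pt
print(round(gap, 4), round(estimate, 3))   # unverified extrapolation
```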
## Reproducibility
| Test | Result |
|---|---|
| Forward pass (same ckpt, same input) | bit-identical (0.00 diff) |
| Deterministic mode, 3 independent runs | bit-identical (162 tensors, max_diff=0.0) |
| Step 3 from same stage-2 ckpt (2 runs) | HSS=0.382, 0.384 |
| Compiled-mode E2E variance across runs | ~0.03 HSS (Triton kernel nondeterminism) |
reproduce_deterministic.sh produces byte-identical weights across runs with the same seed, at the cost of ~2x slower training (no torch.compile). Compiled mode has small run-to-run variance from Triton kernel selection that grows through chaotic SGD dynamics; E2E compiled repros land in the 0.349-0.379 range.
A subtle iteration-order effect: the shipped bad_samples.txt has 156 non-empty entries (the file lacks a trailing newline so wc -l reports 155). Two additional bad samples were discovered after training - they are legitimately bad GT but adding them changes the batch iteration order and costs ~0.005 HSS in deterministic mode and ~0.04 in compiled mode. See the "Reproducibility Notes" section of REPRODUCE.md for the full story.
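The off-by-one happens because wc -l counts newline characters rather than lines, so a file whose last entry lacks a trailing newline under-reports by one. Counting entries directly sidesteps it (hypothetical helper, not the repo's loader):

```python
def count_entries(text: str) -> int:
    """Count non-empty lines regardless of a trailing newline."""
    return sum(1 for line in text.splitlines() if line.strip())

# "a\nb" has two entries but only one newline, so `wc -l` reports 1:
print(count_entries("a\nb"), count_entries("a\nb\n"))  # 2 2
```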
## Repository layout
```
checkpoint.pt               shipped HSS=0.382 model (step 170000), 4096-point input
script.py                   competition inference entry point (uses checkpoint.pt)
s23dr_2026_example/         training package (model, data, train loop, losses)
configs/base.json           shared training config
reproduce.sh                compiled-mode E2E reproduction (~3 hr)
reproduce_deterministic.sh  bit-reproducible E2E reproduction (~5.5 hr)
REPRODUCE.md                detailed recipe, results, ablations, notes
submitted_2048/             the model we actually sent to the public leaderboard (HSS_test=0.427)
  checkpoint.pt             single-stage 2048 model (step 160000)
  args.json                 full training args
  README.md                 training details and val/test scores
repro_runs/                 evidence that the 3-stage recipe reproduces
  e2e_repro4_hss379/        closest compiled E2E repro (val HSS=0.374)
  compiled_repro_hss376/    best compiled repro from this codebase (val HSS=0.376)
  deterministic_hss372/     bit-reproducible deterministic repro (val HSS=0.372)
```
Each directory under repro_runs/ contains the three stage-final checkpoints (125k / 135k / 170k) plus their args.json, so a participant can resume from any stage. Note the directory names carry the score at the time the directory was created, which may differ by ~0.002 from fresh evals in the table above due to random variation in post-processing and CUDA kernel selection.
## Related branches
- main: this release
- best-4096-transfer: working branch with full commit history and internal dev notes
- validation-archive: cold archive of all validation runs (logs, final checkpoints, args) used to verify the release
## License
CC-BY-NC 4.0. The model weights and code in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International license. You are free to use, share, and adapt this work for non-commercial purposes, provided you give appropriate attribution. The training and validation datasets (usm3d/s23dr-2026-sampled_*) have their own terms - see the S23DR 2026 competition page for details.
## Acknowledgements
This checkpoint is released as a public learned baseline for participants of the S23DR 2026 challenge, part of the USM3D workshop at CVPR 2026.