# S23DR 2026 Learned Baseline

A learned baseline for the S23DR 2026 challenge (Structured and Semantic 3D Reconstruction, or S^2 3DR), part of the USM3D workshop at CVPR 2026. The model takes a fused point cloud of a building and predicts its wireframe as a set of 3D line segments.

Headline result: HSS = 0.382 on the 1024-sample validation set (shipped checkpoint).

For context, the handcrafted baseline scores HSS = 0.307 on the same split.

## Quick start

Run the submission pipeline directly (matches the competition eval harness):

```bash
python script.py
```

That loads checkpoint.pt, fuses the input views into a 4096-point cloud, runs the model, and writes the predicted wireframe for each scene.
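The fusion step just has to produce a fixed-size input for the model. A minimal NumPy sketch of the subsampling half (helper name hypothetical; the real fusion logic lives in `script.py`):

```python
import numpy as np

def subsample_cloud(points: np.ndarray, n: int = 4096, seed: int = 0) -> np.ndarray:
    """Reduce (or pad) a fused point cloud to exactly n points.

    Hypothetical helper for illustration only: it shows the fixed-size
    (n, 3) input the model expects, not the actual fusion code.
    """
    rng = np.random.default_rng(seed)
    # Subsample without replacement when there are enough points,
    # otherwise pad by resampling with replacement.
    replace = len(points) < n
    idx = rng.choice(len(points), size=n, replace=replace)
    return points[idx]

cloud = np.random.rand(100_000, 3)    # fused XYZ points from all views
model_input = subsample_cloud(cloud)  # shape (4096, 3)
```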

To reproduce the checkpoint from scratch on a single RTX 4090 (~3 hours):

```bash
bash reproduce.sh
```

Or for a bit-identical deterministic run (~5.5 hours, slower because it disables torch.compile):

```bash
bash reproduce_deterministic.sh
```

Both scripts run the full three-stage recipe described below. See REPRODUCE.md for the exact hyperparameters and the reproducibility notes.

## Architecture

A Perceiver-style transformer that ingests the point cloud as a sequence of per-point tokens and decodes a fixed set of 3D line segments through cross-attention into a latent.

```text
Perceiver: hidden=256, ff=1024
  latent_tokens=256, latent_layers=7
  encoder_layers=4, decoder_layers=3, cross_attn_interval=4
  num_heads=4, kv_heads_cross=2, kv_heads_self=2
  qk_norm=L2, rms_norm=True, dropout=0.1
  segments=64, segment_param=midpoint_dir_len, segment_conf=True
  behind_emb_dim=8, vote_features=True, activation=gelu
```

The decoder predicts 64 candidate segments, each parametrized as midpoint + direction + length with a confidence head. Training uses a Sinkhorn optimal-transport loss to match predicted segments to ground-truth, plus a symmetric endpoint L1 term in the cooldown stage.
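The midpoint + direction + length parametrization maps to endpoint pairs with a few lines of algebra. A NumPy sketch of that conversion (function name hypothetical; the actual decoder heads live in `s23dr_2026_example/`):

```python
import numpy as np

def segments_to_endpoints(mid, direction, length):
    """Convert (midpoint, direction, length) segments to endpoint pairs.

    `direction` is normalized here, so only its orientation matters;
    whether the model's head emits unit vectors is an assumption.
    """
    d = direction / (np.linalg.norm(direction, axis=-1, keepdims=True) + 1e-8)
    half = 0.5 * length[..., None] * d   # half-length vector along the line
    return mid - half, mid + half        # endpoints p0, p1

mid = np.array([[0.0, 0.0, 1.0]])
direction = np.array([[2.0, 0.0, 0.0]])  # only the orientation is used
length = np.array([4.0])
p0, p1 = segments_to_endpoints(mid, direction, length)
# p0 = [[-2, 0, 1]], p1 = [[2, 0, 1]]
```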

All architecture and optimizer settings live in configs/base.json.

## Training recipe

The model ships with a three-stage recipe. Each stage starts from the previous stage's final checkpoint.

| Stage | Input | Steps | LR | Batch | Notes | HSS |
|---|---|---|---|---|---|---|
| 1. 2048 from scratch | 2048 pts | 0 -> 125k | 3e-4, warmup 10k | 32 | Random init, Sinkhorn only | 0.281 |
| 2. 4096 finetune | 4096 pts | 125k -> 135k | 3e-5 constant | 64 | Gentle LR preserves representations | 0.351 |
| 3. Endpoint cooldown | 4096 pts | 135k -> 170k | 3e-5 then linear decay | 64 | Adds endpoint L1 loss, tightens vertices | 0.382 |
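Read as a single schedule over global steps, the LR column could be sketched as below. This is one plausible reading, not the authoritative schedule: the table only says "warmup 10k" and "3e-5 then linear decay", so the linear warmup shape and decay-to-zero endpoint are assumptions; the exact values live in `configs/base.json` and REPRODUCE.md.

```python
def lr_at(step: int) -> float:
    """Illustrative reading of the three-stage LR column (shapes assumed)."""
    if step < 125_000:                              # stage 1: warmup, then flat
        return 3e-4 * min(step / 10_000, 1.0)
    if step < 135_000:                              # stage 2: gentle constant LR
        return 3e-5
    # stage 3: linear decay, assumed to run 135k -> 170k down to zero
    return 3e-5 * max(0.0, 1.0 - (step - 135_000) / 35_000)
```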

Why 2048 first: training directly on 4096 overfits (1.47x train/val ratio vs 1.19x for 2048). Starting on 2048 produces better-generalized representations that the 4096 finetune can then specialize.

Why a gentle LR on finetune: LR > 1e-4 causes catastrophic forgetting of the 2048 geometry understanding.

Why endpoint loss only in stage 3: the Sinkhorn loss operates on the midpoint/direction/length parametrization and doesn't directly penalize vertex position error. Adding a symmetric endpoint L1 against the detached Sinkhorn assignment tightens vertex precision in the cooldown.
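For intuition, the kind of entropic-OT matching the Sinkhorn loss relies on can be sketched in NumPy. This is illustrative only: the real loss in `s23dr_2026_example/` operates on the segment parametrization, handles confidences, and backpropagates through the cost, and stage 3 then adds the endpoint L1 on the (detached) matched pairs.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropic-OT transport plan for a cost matrix, uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):                           # alternating scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]               # plan P: rows ~ a, cols ~ b

# Toy matching: 4 predicted segments vs 3 ground-truth segments.
cost = np.random.default_rng(0).random((4, 3))
P = sinkhorn_plan(cost)
```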

Full details, including the "what does not work" list (BuildingWorld pretraining, mixed training, high dropout, etc.), are in REPRODUCE.md.

## Evaluation

About the numbers: all val scores below are HSS at confidence threshold 0.7, averaged over the 1024-sample internal validation split we hold out from the published training data (usm3d/s23dr-2026-sampled_{2048,4096}_v2:validation). They are not test-set numbers. The only test-set number we have is the public leaderboard score of the older 2048 submission (see submitted_2048/ and the last table below).

All numbers below are freshly measured in this release against the checkpoints in this repo.
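Concretely, scoring at confidence threshold 0.7 means dropping low-confidence candidate segments before matching. A hypothetical helper (the `>=` convention is an assumption):

```python
import numpy as np

def filter_by_confidence(endpoints, conf, thr=0.7):
    """Keep predicted segments whose confidence clears the eval threshold.

    `endpoints` is (N, 2, 3) (two XYZ endpoints per segment), `conf` is (N,).
    Illustrative only; mirrors the thresholding described above.
    """
    keep = conf >= thr
    return endpoints[keep], conf[keep]

endpoints = np.zeros((64, 2, 3))          # 64 candidates from the decoder
conf = np.linspace(0.0, 1.0, 64)          # dummy confidences
kept, kept_conf = filter_by_confidence(endpoints, conf)
```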

### Shipped model and reproductions

| Model | Checkpoint | HSS @ 4096 | HSS @ 2048 |
|---|---|---|---|
| Handcrafted baseline | — | 0.307 | — |
| Current release (shipped) | `checkpoint.pt` | 0.3819 | 0.3734 |
| Closest compiled E2E repro (#4) | `repro_runs/e2e_repro4_hss379/` | 0.3736 | 0.3675 |
| Best compiled repro from this codebase | `repro_runs/compiled_repro_hss376/` | 0.3757 | 0.3670 |
| Deterministic E2E repro (bit-reproducible) | `repro_runs/deterministic_hss372/` | 0.3716 | 0.3665 |

All repros use the exact 3-stage recipe on a single RTX 4090. The shipped checkpoint.pt was trained on the same recipe before this release branch was cut; the ~0.005-0.010 HSS gap between shipped and repros is compiled-mode run-to-run variance (see the Reproducibility section).

### Training progression (deterministic repro, all stages measured fresh)

| Stage | Steps | HSS @ 4096 | HSS @ 2048 |
|---|---|---|---|
| 1. 2048 from-scratch | 125k | 0.2755 | 0.2812 |
| 2. 4096 finetune | 135k | 0.3557 | 0.3510 |
| 3. Endpoint cooldown | 170k | 0.3716 | 0.3665 |

The stage 1 -> stage 2 jump (+0.08 HSS on 4096) is the biggest single improvement and motivates the 2048 -> 4096 transfer. Stage 3 (endpoint cooldown) adds another +0.016. Note how stage 1 is slightly better at 2048 than at 4096 (because it was only trained on 2048), while stages 2 and 3 invert that ordering after being finetuned on 4096.

### Previously submitted model (2048, single-stage)

The submitted_2048/ directory holds the checkpoint we actually sent to the public leaderboard. It was trained in a single stage on 2048-point data and is a direct ancestor of the current release.

| Split | Metric | Score |
|---|---|---|
| Public leaderboard (test) | HSS | 0.427 |
| Internal val @ 2048 | HSS | 0.3692 |
| Internal val @ 4096 | HSS | 0.3665 |

We do not have a test number for the current release, but the val-to-test gap observed on this 2048 submission was about +0.06 HSS (0.37 val -> 0.43 test). A similar gap on the current checkpoint.pt (0.382 val) would suggest a test score in the low 0.44s, though this is extrapolation and unverified.

## Reproducibility

| Test | Result |
|---|---|
| Forward pass (same ckpt, same input) | bit-identical (0.00 diff) |
| Deterministic mode, 3 independent runs | bit-identical (162 tensors, max_diff=0.0) |
| Stage 3 from same stage-2 ckpt (2 runs) | HSS = 0.382, 0.384 |
| Compiled-mode E2E variance across runs | ~0.03 HSS (Triton kernel nondeterminism) |

reproduce_deterministic.sh produces byte-identical weights across runs with the same seed, at the cost of ~2x slower training (no torch.compile). Compiled mode has small run-to-run variance from Triton kernel selection that grows through chaotic SGD dynamics; E2E compiled repros land in the 0.349-0.379 range.

A subtle iteration-order effect: the shipped bad_samples.txt has 156 non-empty entries (the file lacks a trailing newline so wc -l reports 155). Two additional bad samples were discovered after training - they are legitimately bad GT but adding them changes the batch iteration order and costs ~0.005 HSS in deterministic mode and ~0.04 in compiled mode. See the "Reproducibility Notes" section of REPRODUCE.md for the full story.
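The `wc -l` discrepancy arises because `wc -l` counts newline characters rather than lines, so a file without a trailing newline reports one short. A tiny self-contained illustration:

```python
def count_entries(text: str) -> int:
    """Count non-empty entries, independent of a trailing newline."""
    return sum(1 for line in text.splitlines() if line.strip())

def wc_l(text: str) -> int:
    """What `wc -l` reports: the number of newline characters."""
    return text.count("\n")

# A 3-entry file with no trailing newline: wc -l undercounts by one.
no_trailing = "a\nb\nc"
# count_entries(no_trailing) == 3, wc_l(no_trailing) == 2
```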

## Repository layout

```text
checkpoint.pt                    shipped HSS=0.382 model (step 170000), 4096-point input
script.py                        competition inference entry point (uses checkpoint.pt)
s23dr_2026_example/              training package (model, data, train loop, losses)
configs/base.json                shared training config
reproduce.sh                     compiled-mode E2E reproduction (~3 hr)
reproduce_deterministic.sh       bit-reproducible E2E reproduction (~5.5 hr)
REPRODUCE.md                     detailed recipe, results, ablations, notes

submitted_2048/                  the model we actually sent to the public leaderboard (HSS_test=0.427)
    checkpoint.pt                single-stage 2048 model (step 160000)
    args.json                    full training args
    README.md                    training details and val/test scores

repro_runs/                      evidence that the 3-stage recipe reproduces
    e2e_repro4_hss379/           closest compiled E2E repro (val HSS=0.374)
    compiled_repro_hss376/       best compiled repro from this codebase (val HSS=0.376)
    deterministic_hss372/        bit-reproducible deterministic repro (val HSS=0.372)
```

Each directory under repro_runs/ contains the three stage-final checkpoints (125k / 135k / 170k) plus their args.json, so a participant can resume from any stage. Note the directory names carry the score at the time the directory was created, which may differ by ~0.002 from fresh evals in the table above due to random variation in post-processing and CUDA kernel selection.

## Related branches

- `main` - this release
- `best-4096-transfer` - working branch with full commit history and internal dev notes
- `validation-archive` - cold archive of all validation runs (logs, final checkpoints, args) used to verify the release

## License

CC-BY-NC 4.0. The model weights and code in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International license. You are free to use, share, and adapt this work for non-commercial purposes, provided you give appropriate attribution. The training and validation datasets (usm3d/s23dr-2026-sampled_*) have their own terms - see the S23DR 2026 competition page for details.

## Acknowledgements

This checkpoint is released as a public learned baseline for participants of the S23DR 2026 challenge, part of the USM3D workshop at CVPR 2026.
