| --- |
| license: cc-by-nc-4.0 |
| library_name: pytorch |
| tags: |
| - 3d-reconstruction |
| - wireframe |
| - building |
| - point-cloud |
| - s23dr |
| - cvpr-2026 |
| datasets: |
| - usm3d/hoho22k_2026_trainval |
| - usm3d/s23dr-2026-sampled_4096_v2 |
| - usm3d/s23dr-2026-sampled_2048_v2 |
| metrics: |
| - HSS |
| pipeline_tag: other |
| --- |
| |
| # S23DR 2026 - Learned Submission |
|
|
| A learned-model baseline for the |
| [S23DR 2026](https://huggingface.co/spaces/usm3d/S23DR2026) wireframe-estimation |
| challenge. Counterpart to |
| [`usm3d/handcrafted_submission_2026`](https://huggingface.co/usm3d/handcrafted_submission_2026): |
| the handcrafted entry is rule-based; this one is a Perceiver transformer |
| trained on fused 3D point clouds. |
|
|
| ## Task |
|
|
| Per scene from |
| [`usm3d/hoho22k_2026_trainval`](https://huggingface.co/datasets/usm3d/hoho22k_2026_trainval): |
| multi-view RGB, ADE20K and Gestalt segmaps, MoGe metric depth, COLMAP SfM |
| points and camera poses (BPO + COLMAP). Predict a 3D building wireframe - |
| a list of vertices and the edges connecting them. Scored by [HSS](https://arxiv.org/abs/2503.08208) on a held-out |
| test split. |
|
|
| ## Run inference |
|
|
| ``` |
| python script.py |
| ``` |
|
|
| The challenge harness provides `params.json` (with the test-dataset id) and |
| runs `script.py`. The script downloads the dataset to `/tmp/data`, loads |
| `checkpoint.pt`, iterates over the validation + test splits, and writes |
| `submission.json` - one entry per sample of the form |
| `{order_id, wf_vertices, wf_edges}`. |
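For reference, a minimal sketch of writing entries in that format (the values here are illustrative placeholders, not real predictions):

```python
import json
import numpy as np

# Illustrative values only; the real script fills these from model output.
wf_vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 2.5]])  # (V, 3) world-space xyz
wf_edges = [[0, 1]]  # index pairs into wf_vertices

entries = [{
    "order_id": 0,
    "wf_vertices": wf_vertices.tolist(),
    "wf_edges": wf_edges,
}]

with open("submission.json", "w") as f:
    json.dump(entries, f)
```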
|
|
| Requires a CUDA GPU plus `torch`, `huggingface_hub`, `datasets`, `numpy`, |
| `opencv-python`, `scipy`, and `tqdm`. |
|
|
| ## Pipeline |
|
|
| ``` |
| raw multi-view sample |
| -> point fusion per-view depth unprojection + COLMAP, labeled |
| by ADE/Gestalt (point_fusion.py) |
| -> priority sampling subsample to 4096 pts: 3072 COLMAP + 1024 depth |
| (make_sampled_cache.py) |
| -> Perceiver hidden=256, 256 latents x 7 layers, 64 segment |
| queries; each query -> (midpoint, direction, |
| length) + confidence (model.py) |
| -> postprocess conf > 0.5 -> segments -> iterative vertex |
| merge -> snap to point cloud -> horizontal snap |
| (segment_postprocess.py, |
| postprocess_v2.py) |
| -> {order_id, wf_vertices, wf_edges} -> submission.json |
| ``` |
|
|
| ## Reproduce training |
|
|
| See [REPRODUCE.md](REPRODUCE.md) for the full recipe, ablations, and |
| reproduction notes. Three stages, ~3hr on a single RTX 4090. All HSS numbers |
| in this section are on **dev val** (the last 1024 scenes of the published |
| training set; see "Evaluation sets" in REPRODUCE.md). The per-stage column |
| below is from the **original training run**, not from re-running the |
| script - the original run is gone, and the published recipe does not exactly |
| reproduce it. |
|
|
| | Stage | Steps | LR | Batch | Dev val HSS (orig) | |
| |----------------------------------|------:|-------|------:|--------------------| |
| | 2048 from scratch | 125k | 3e-4 | 32 | 0.281 | |
| | 4096 finetune | 10k | 3e-5 | 64 | 0.351 | |
| | Cooldown + endpoint loss | 35k | 3e-5 | 64 | **0.382** | |
|
|
| ``` |
| bash reproduce.sh # ~3hr, torch.compile, dev val HSS typ. 0.375-0.379 |
| bash reproduce_deterministic.sh # ~5.5hr, no compile, bit-reproducible at 0.372 |
| bash make_datasets.sh # rebuild sampled datasets from raw tars |
| ``` |
|
|
| Documented compiled E2E runs from this codebase fall in **dev val HSS |
| 0.342-0.379**, with the modal cluster at 0.375-0.379 and low-side excursions |
| at 0.349 and 0.342; no run reaches the original 0.382. Best compiled repro is |
| **0.376** (`repro_runs/compiled_repro_hss376/`). The deterministic recipe is |
| bit-identical across runs at 0.372 but follows a different numerical |
| trajectory than compiled mode (it is not a reproduction of any compiled run). |
|
|
| ## Layout |
|
|
| ``` |
| script.py Submission entry: load checkpoint, run pipeline, |
| write submission.json. |
| checkpoint.pt Trained weights (~106 MB, Git LFS). Dev val |
| HSS=0.382 (original training run; not exactly |
| reproducible). Public test HSS=0.4470. |
| configs/base.json Shared training config (architecture + optimizer). |
| reproduce.sh 3-stage retraining (torch.compile, fast). |
| reproduce_deterministic.sh Bit-reproducible variant (no compile, ~2x slower). |
| make_datasets.sh Rebuild sampled datasets from raw hoho22k tars. |
| REPRODUCE.md Recipe, ablations, reproduction numbers. |
| |
| s23dr_2026_example/ The pipeline as an importable package. |
| point_fusion.py Per-view fusion -> labeled 3D point cloud. |
| cache_scenes.py Stream raw shards -> per-scene .pt files. |
| make_sampled_cache.py .pt -> fixed-size .npz priority samples. |
| data.py Dataset wrapper + augmentation. |
| tokenizer.py Per-point tokens (xyz + Fourier + class + src |
| + behind + vote features). |
| model.py Perceiver encoder + segment decoder. |
| attention.py SDPA blocks with optional QK-norm. |
| losses.py Sinkhorn matching + endpoint L1 + confidence BCE. |
| sinkhorn.py Optimal-transport segment matcher. |
| varifold.py Segments -> vertices/edges; varifold loss kernels. |
| wire_varifold_kernels.py Varifold kernels (alternate loss, unused at 0.382). |
| segment_postprocess.py Iterative vertex merging. |
| postprocess_v2.py Snap-to-cloud and horizontal-edge alignment. |
| color_mappings.py ADE20K + Gestalt label palettes. |
| train.py Training loop driven by reproduce.sh. |
| bad_samples.txt 156 scenes excluded from training (misaligned GT). |
| |
| repro_runs/ Three frozen reproduction runs: |
| compiled_repro_hss376/ best compiled run from this codebase |
| e2e_repro4_hss379/ closest end-to-end repro to original |
| deterministic_hss372/ bit-reproducible run |
| |
| submitted_2048/ The earlier public-leaderboard checkpoint |
| (2048-only, public test HSS=0.4273). The current |
| top-level checkpoint.pt scored 0.4470 on test. |
| ``` |
|
|
| ## Submissions and scores |
|
|
| Two submissions have been evaluated on the public test set. Both are kept in |
| this repo. "Dev val" below is the last 1024 scenes of the published training |
| set, which is what we optimized against; we did not eval on the official |
| validation split. See "Evaluation sets" in REPRODUCE.md for the formal |
| definitions. |
|
|
| | Model | Path | script.py settings | Dev val HSS | **Public test HSS** | |
| |---|---|---|---:|---:| |
| | 2048 (single-resolution, earlier) | `submitted_2048/checkpoint.pt` | `SEQ_LEN=2048`, `CONF_THRESH=0.7`, single-pass merge | 0.369 @ 2048 | **0.4273** | |
| | 4096 (3-stage transfer, current) | `checkpoint.pt` | `SEQ_LEN=4096`, `CONF_THRESH=0.5`, iterative merge | 0.382 @ 4096 | **0.4470** | |
|
|
| The 4096 model is the better submission on both dev val (+0.013) and public |
| test (+0.020). The leaderboard is open; the test numbers above are what each |
| checkpoint scored when submitted at its corresponding commit (`f4487da` for |
| 2048, `4946666` for 4096). Note that *three* things change between the two |
| submissions - the model, the inference `SEQ_LEN`, and the merge / confidence |
| postprocessing - so the +0.020 test gain is not attributable to the model |
| alone. |
|
|
| ## Notes |
|
|
| - The model predicts line *segments* in per-scene-normalized coordinates |
| (parametrized as midpoint + half-vector, decoded as endpoint pairs). The |
| script multiplies by the scene scale and adds the center to lift them to |
| world space before extracting vertex/edge lists; merging and snapping |
| postprocessing then run in world space. Output vertices are in the same |
| world frame as the COLMAP / ground-truth reconstruction. |
| - The pipeline returns a 2-vertex / 1-edge dummy wireframe for any sample |
| where fusion fails or no segment passes the confidence threshold. |
| - There is (unfortunately) a fair amount of run-to-run variation: both |
| *interrun* (same seed, different runs of `reproduce.sh` give dev val HSS |
| in 0.342-0.379 due to `torch.compile` picking different Triton kernels - |
| see REPRODUCE.md) and *interseed* (varying `--seed` gives non-trivially |
| different final dev val HSS as well). |
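The first two notes can be sketched as follows (hypothetical function names; the real decoding lives in `script.py` / `model.py`):

```python
import numpy as np

def lift_to_world(midpoints, half_vecs, scene_center, scene_scale):
    """Sketch (not the repo's exact code): decode (midpoint, half-vector)
    segment params in normalized coords into world-space endpoint pairs."""
    p0 = (midpoints - half_vecs) * scene_scale + scene_center
    p1 = (midpoints + half_vecs) * scene_scale + scene_center
    return np.stack([p0, p1], axis=1)  # (N, 2, 3) world-space endpoints

def dummy_wireframe():
    # Fallback when fusion fails or no segment clears the confidence threshold.
    return np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]), [[0, 1]]

segs = lift_to_world(
    midpoints=np.array([[0.0, 0.0, 0.5]]),
    half_vecs=np.array([[0.5, 0.0, 0.0]]),
    scene_center=np.array([10.0, 20.0, 5.0]),
    scene_scale=4.0,
)
```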
|
|
|
|
| ## Ideas / Suggestions / Thoughts |
|
|
- There is a messier companion repo with many more experiments and extra features (and plenty of claudeisms) [here](https://github.com/JackLangerman/s23dr_2026_example).
- The training protocol, initialization, and architecture could probably all be tweaked to improve stability / interseed variation. Increasing the Sinkhorn eps early in training and using a longer warmup both seemed to reduce variation between seeds, but at worse absolute performance; can we improve both?
- Can training on [Building World](https://huggingface.co/datasets/BuildingWorld/BuildingWorld) help? One "pretraining" run on BuildingWorld didn't help, but maybe you can make it work, or perhaps mixing it in would be better?
- What about training on image (Gestalt/ADE) patches + rays instead of fused points? Can you do better with less inductive bias? A quick measurement motivating this: the current priority-sampling-to-2048-points pipeline reaches only ~67% of GT vertices (within 0.5 m), vs. ~85% if the full COLMAP PCD were available - so the input pipeline is a hard ceiling on vertex recall, regardless of how good the model gets.
- What about mixed 2048 / 4096 training (instead of sequential)? Does it help?
- What about other loss functions? A better softHSS?
- Does a bigger model with more data (e.g. from BuildingWorld) help? What about better / different regularization or data augmentation?
| - What about other modern DETR tricks? |
| - Can you output a graph directly instead of post-hoc vertex merging? |
- `bad_samples.txt` doesn't include all the bad samples, but in quick tests, additionally excluding the two bad samples shown in REPRODUCE.md made scores worse. Can you make the score better while still excluding these?
| |
| ### I'm sure there is a ton more cool stuff to do! Be creative! Excited to see what you come up with! |