| # Reproducing the Best Checkpoint (HSS=0.382) |
|
|
| ## Quick Start |
|
|
| The `checkpoint.pt` in this repo is the final model. To run inference: |
|
|
| ```bash |
| python script.py |
| ``` |
|
|
| To reproduce from scratch (~3hr on 1x RTX 4090): |
|
|
| ```bash |
| bash reproduce.sh |
| ``` |
|
|
| ## Exact Recipe |
|
|
| Architecture (unchanged across all 3 steps): |
| ``` |
| Perceiver: hidden=256, ff=1024, latent_tokens=256, latent_layers=7 |
| encoder_layers=4, decoder_layers=3, cross_attn_interval=4 |
| num_heads=4, kv_heads_cross=2, kv_heads_self=2 |
| qk_norm=True (L2), rms_norm=True, dropout=0.1 |
| segments=64, segment_param=midpoint_dir_len, segment_conf=True |
| behind_emb_dim=8, vote_features=True, activation=gelu |
| ``` |
|
|
| All shared config lives in `configs/base.json`. |
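
For illustration, a minimal sanity check that a run picked up the shared
config (a sketch: the key names mirror the block above, but the real schema of
`configs/base.json` may differ):

```python
import json

# Hypothetical check, not the repo's loader: all 3 training steps share one
# architecture, so catching config drift early is cheap insurance.
with open("configs/base.json") as f:
    cfg = json.load(f)

for key, expected in [("hidden", 256), ("latent_tokens", 256), ("latent_layers", 7)]:
    assert cfg.get(key) == expected, f"unexpected {key}: {cfg.get(key)}"
```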
|
|
| ## Evaluation sets |
|
|
Three distinct evaluation sets appear in this work. Every HSS / F1 / IoU
number below comes from one of these three, and we tag each number with its
source set wherever possible.
|
|
| - **Dev val** = the last 1024 scenes of the published training set |
| (`hf://usm3d/s23dr-2026-sampled_*_v2:train`). This is what we actually |
| optimized against during development, and it is the set behind every |
| "HSS=0.382"-style number in this document, in `submitted_2048/README.md`, |
| and in the run-history files under `repro_runs/` and the validation-archive. |
| - **Official validation** = `hf://usm3d/s23dr-2026-sampled_*_v2:validation` |
| (equivalently the `*public*` tars in `usm3d/hoho22k_2026_test_x_anon`). |
| We did *not* eval on this split during development. No HSS number in this |
| repo refers to it. |
| - **Public test** = the `*private*` tars in `usm3d/hoho22k_2026_test_x_anon`, |
| scored by the competition harness and posted to the leaderboard. We have |
| two such numbers, both clearly labeled "public test" wherever they appear: |
| **0.4273** (2048 submission, commit `f4487da`) and **0.4470** (4096 |
| submission, commit `4946666`). |
|
|
Because we never validated against the official validation split, there is
some risk that the dev-val numbers are mildly overfit to the last-1024-train
slice. The +0.06 dev-val-to-public-test gap (consistent across both
submissions, see `submitted_2048/README.md`) is reassuring in that public
test scores come out higher rather than lower, but it is not a substitute
for actually scoring on official val.
|
|
### Step 1: 2048 Phase 1 (from scratch) – ~1.5hr
|
|
| ``` |
| Data: hf://usm3d/s23dr-2026-sampled_2048_v2:train (16,508 samples) |
| Steps: 0 -> 125,000 (242 epochs) |
| LR: 3e-4, warmup=10,000 |
| Batch size: 32 |
| Optimizer: AdamW, betas=(0.9, 0.95), weight_decay=0.01 |
| Sinkhorn: eps=0.1, iters=20, dustbin=0.3 |
| Conf: weight=0.1, mode=sinkhorn, head_wd=0.1 |
| Endpoint: OFF |
| Aug: rotate=True, flip=True |
| Seed: 353 |
| ``` |
|
|
| Trains the perceiver from random init on 2048-point samples. The sinkhorn |
| optimal transport loss learns to match predicted segments to ground truth. |
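
For intuition, here is a minimal log-domain Sinkhorn sketch with a
SuperGlue-style dustbin. The function name, the cost-matrix convention, and
treating `dustbin=0.3` as a fixed unmatched cost are all assumptions; the
repo's actual loss code may differ.

```python
import torch

def sinkhorn_assignment(cost, eps=0.1, iters=20, dustbin=0.3):
    """Soft matching of P predicted to G ground-truth segments via entropic OT.

    cost: (P, G) pairwise segment costs. Returns a (P+1, G+1) soft assignment
    whose extra row/column is the dustbin for unmatched segments.
    """
    P, G = cost.shape
    z = torch.full((1, 1), dustbin)
    cost = torch.cat([cost, z.expand(P, 1)], dim=1)       # dustbin column
    cost = torch.cat([cost, z.expand(1, G + 1)], dim=0)   # dustbin row
    log_K = -cost / eps                                   # log Gibbs kernel

    # Marginals: each real segment carries mass 1; each dustbin absorbs slack.
    log_mu = torch.cat([torch.zeros(P), torch.tensor([float(G)]).log()])
    log_nu = torch.cat([torch.zeros(G), torch.tensor([float(P)]).log()])

    u, v = torch.zeros(P + 1), torch.zeros(G + 1)
    for _ in range(iters):                                # log-domain Sinkhorn
        u = log_mu - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(log_K + u[:, None], dim=0)
    return (log_K + u[:, None] + v[None, :]).exp()
```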
|
|
| **Why 2048 first:** Training directly on 4096 overfits (1.47x train/val ratio |
| vs 1.19x for 2048). The 2048 model learns better-generalized representations. |
|
|
| **Output:** dev val HSS ~0.28. |
|
|
### Step 2: 4096 finetune (constant LR) – ~15min
|
|
| ``` |
| Resume: Step 1 -> step125000.pt |
| Data: hf://usm3d/s23dr-2026-sampled_4096_v2:train (15,892 samples) |
| Steps: 125,001 -> 135,000 (10k steps) |
| LR: 3e-5 (constant, no cooldown) |
| Batch size: 64 |
| Endpoint: OFF |
| ``` |
|
|
Switches the input from 2048 to 4096 points, increasing structural coverage
from 66% to 74%. The gentle LR (3e-5) preserves learned representations while
adapting to the extra input; a higher LR (>1e-4) causes catastrophic forgetting.
|
|
Dev val HSS jumps from 0.28 to 0.35 in ~5k steps, then plateaus by 10k steps.
|
|
| **Output:** dev val HSS ~0.35. |
|
|
### Step 3: Cooldown with endpoint loss – ~1hr
|
|
| ``` |
| Resume: Step 2 -> step135000.pt |
| Data: hf://usm3d/s23dr-2026-sampled_4096_v2:train |
| Steps: 135,001 -> 170,000 (35k steps) |
| LR: 3e-5, cooldown_start=150,000, cooldown_steps=20,000 |
| (constant 3e-5 for 15k steps, then linear decay to ~0 over 20k) |
| Batch size: 64 |
| Endpoint: weight=0.1 |
| ``` |
|
|
| Adds symmetric endpoint L1 loss (using detached sinkhorn assignment) to |
| tighten vertex precision. The sinkhorn loss alone operates on segment |
| midpoint/direction/length and doesn't directly penalize endpoint position error. |
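
One plausible reading of that loss, as a sketch (shapes and function name
assumed, not the repo's code):

```python
import torch

def symmetric_endpoint_l1(pred_ep, gt_ep):
    """L1 between matched segment endpoints, symmetric over endpoint order.

    pred_ep, gt_ep: (N, 2, 3) endpoints for N segment pairs matched by the
    detached Sinkhorn assignment. A segment is orderless, so score both
    endpoint orderings and keep the cheaper one per segment.
    """
    direct = (pred_ep - gt_ep).abs().sum(dim=(1, 2))                  # a->a, b->b
    swapped = (pred_ep - gt_ep.flip(dims=[1])).abs().sum(dim=(1, 2))  # a->b, b->a
    return torch.minimum(direct, swapped).mean()
```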
|
|
| **Output:** dev val HSS=0.382, F1=0.414. Public test HSS=0.4470. |
|
|
| ### Key Numbers |
|
|
Per-stage **dev val** scores from the **original training run** (March 23-26),
which produced the shipped `checkpoint.pt`. Re-running `reproduce.sh` from
scratch does not hit these exact numbers - see "Reproduction Results" below for
the actual ranges. Compiled-mode end-to-end re-runs (scored after Step 3) land
in dev val 0.342-0.379, with the best run from this codebase at 0.376.
|
|
| | Stage | Steps | Dev val HSS | Dev val F1 | What changed | |
| |-------|-------|-------------|------------|--------------| |
| | After Step 1 | 125k | 0.281 | 0.156 | Learned geometry from 2048 pts | |
| After Step 2 | 135k | 0.351 | 0.190 | Coverage 66% -> 74% from 4096 pts |
| | After Step 3 | 170k | **0.382** | **0.411** | Vertex precision from endpoint loss | |
|
|
| ## Why This Works |
|
|
| 1. **2048 training has low overfitting** (1.19x train/val ratio) β the model |
| learns good representations without memorizing training samples. |
|
|
| 2. **4096 data has higher coverage ceiling** (74% vs 66% structural points) β |
| more of the building surface is observed, improving vertex recall. |
|
|
| 3. **Gentle finetuning preserves representations** β at lr=3e-5, the model |
| keeps its learned geometry understanding while adapting to the extra input. |
|
|
| 4. **Endpoint loss tightens vertices** β the symmetric endpoint distance |
| directly penalizes vertex position errors, which sinkhorn loss alone |
| doesn't do (it operates on midpoint/direction/length parametrization). |
|
|
| ## What Doesn't Work (yet) |
|
|
These are informal observations from one-off experiments during development.
The runs, args, and eval logs are mostly [here](https://github.com/JackLangerman/s23dr_2026_example),
but not all of them are preserved perfectly. The specific numbers below come
from contemporaneous notes and are not all trivially reproducible. Take them
as directional guidance, not as benchmarks.
|
|
| - **Training 4096 from scratch:** observed to overfit (~1.47x train/val loss |
| gap, vs ~1.19x for 2048) and peak around dev val HSS 0.346 in a single run. |
| - **BuildingWorld pretraining:** in one experiment, the representations were |
| near-orthogonal to S23DR (cosine sim ~0.05 between learned features) and |
| did not transfer. |
- **Mixed BW+S23DR training:** mixing BW data into the S23DR loader hurt
  dev val HSS in the runs we tried, presumably due to domain gap.
| - **High dropout / weight decay:** lowered the train/val gap but also lowered |
| dev val HSS in the configurations we tried. |
| - **High finetune LR (>1e-4):** dropped dev val HSS sharply in Step 2 in |
| single-run observations, consistent with disrupting the Step 1 representations. |
| - **Steeper cooldown (1e-5, 20x drop):** slightly worse than 3e-5 in the one |
| comparison we ran for this checkpoint. |
|
|
| ## Reproduction Results |
|
|
| ### End-to-end reproductions |
|
|
| All HSS / F1 / IoU below are on **dev val**. |
|
|
| | Model | Dev val HSS | Dev val F1 | Dev val IoU | Notes | |
| |-------|-------------|------------|-------------|-------| |
| | Original | 0.382 | 0.414 | 0.370 | Shipped checkpoint, original training run, not reproducible from this codebase | |
| | E2E repro #4 | 0.379 | 0.409 | 0.369 | Closest E2E, `repro_runs/e2e_repro4_hss379/` | |
| | Compiled repro (from submission codebase) | 0.376 | β | β | Best compiled repro from this codebase, `repro_runs/compiled_repro_hss376/` | |
| | E2E repro #3 | 0.375 | 0.404 | 0.367 | | |
| | Deterministic E2E | 0.372 | 0.398 | 0.368 | Bit-reproducible, `repro_runs/deterministic_hss372/` | |
| | E2E repro #5 | 0.349 | 0.373 | β | Compiled, low end of cluster | |
| | `reproduce.sh` smoketest | 0.342 | β | β | Single run of the published script end-to-end (validation-archive `runs/reproduce_smoketest/`) | |
|
|
| ### Partial reproductions (isolating pipeline stages) |
|
|
| | Test | Starting from | Dev val HSS | Gap to original | |
| |------|--------------|-------------|-----------------| |
| | Step 3 from orig Step 2 (run A) | Original step135000.pt | 0.382 | 0.000 | |
| | Step 3 from orig Step 2 (run B) | Original step135000.pt | 0.384 | +0.002 | |
| | Step 2+3 from orig Step 1 | Original step125000.pt | 0.377 | -0.005 | |
| | Step 1 from orig step 100k | Original step100000.pt | 0.285 (Step 1) | +0.004 vs 0.281 | |
|
|
Step 3 restarted from the same checkpoint reproduces to within 0.002 dev val
HSS. The full E2E dev val HSS variance (0.342-0.379; see "All benchmarks"
below) is dominated by torch.compile nondeterminism in Step 1.
|
|
| ### All benchmarks |
|
|
| The HSS / F1 / IoU columns below are all on **dev val**. Public-test scores |
| appear in the Notes column where available. |
|
|
| | Model | Input | Dev val HSS | Dev val F1 | Dev val IoU | Notes | |
| |-------|-------|-------------|------------|-------------|-------| |
| | Handcrafted baseline | raw views | 0.307 | 0.404 | 0.260 | | |
| | h256+qk+ep (submitted) | 2048 | 0.365 | 0.388 | 0.360 | Public test HSS=0.4273 (commit f4487da) | |
| | Original 3-step | 2048 | 0.373 | 0.404 | 0.363 | | |
| | Original 3-step | 4096 | 0.382 | 0.414 | 0.370 | Best ever, original training. **Public test HSS=0.4470** (commit 4946666) | |
| | Step3 repro from orig S2 | 4096 | 0.384 | 0.414 | β | Near-exact repro from a saved Step 2 ckpt | |
| | E2E repro #4 | 4096 | 0.379 | 0.409 | 0.369 | | |
| | Compiled repro (submission codebase) | 4096 | 0.376 | β | β | Best compiled from this exact codebase | |
| | E2E repro #3 | 4096 | 0.375 | 0.404 | 0.367 | | |
| | Deterministic E2E | 4096 | 0.372 | 0.398 | 0.368 | Bit-reproducible across runs (different trajectory than compiled) | |
| | E2E repro #5 | 4096 | 0.349 | 0.373 | β | Compiled, low end of cluster | |
| | `reproduce.sh` smoketest | 4096 | 0.342 | β | β | Single E2E run of the published `reproduce.sh` | |
|
|
The shipped 0.382 is not exactly reproducible from this codebase (the
original training run is gone). The best compiled repro is **0.376**, the mode
of compiled repros is around 0.375-0.379, and the lower tail extends to 0.342.
|
|
| ## Code Equivalence Verification |
|
|
| | Test | Result | |
| |------|--------| |
| | Forward pass (same checkpoint, same input) | Bit-identical (0.00 diff) | |
| | Loss computation | Bit-identical (0.00 diff) | |
| | Gradient computation | 5e-8 max diff | |
| | Training from same seed | Bit-identical steps 1-44 | |
| | Step 3 from same checkpoint (2 runs) | Dev val HSS=0.382, 0.384 | |
| | Deterministic mode (2 runs) | Bit-identical (0.00 diff) | |
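
The forward-pass row amounts to a check like this (placeholder names; not the
verification script itself):

```python
import torch

@torch.no_grad()
def max_forward_diff(model_a, model_b, batch):
    """Run one fixed batch through both code paths with the same checkpoint
    loaded; 0.0 means the forward passes are bit-identical."""
    model_a.eval()
    model_b.eval()
    return (model_a(batch) - model_b(batch)).abs().max().item()
```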
|
|
| ## Reproducibility Notes |
|
|
| All HSS numbers in this section are on **dev val**. |
|
|
**Default mode** (`reproduce.sh`): uses torch.compile (~3x faster). Each run
gets different Triton kernels, causing ~1e-8 floating-point divergence at a
random step (31-45); this divergence then grows through chaotic SGD dynamics.
Documented compiled E2E runs from this codebase have produced dev val HSS in
**0.342-0.379**; the modal cluster is 0.375-0.379, with low-side excursions
at 0.349 and 0.342 (no high-side outliers, so the distribution is one-sided
rather than symmetric around a mean). The best compiled repro is 0.376, vs the
shipped 0.382 from the original (lost) training run.
|
|
| **Deterministic mode** (`--deterministic` flag): Disables torch.compile and |
| forces CUDA deterministic ops. Bit-identical across runs with the same seed |
| (verified across 3 independent runs). Dev val HSS=0.372. Note: deterministic |
| mode **diverges from compiled mode at step 1** because eager and compiled |
| forward passes use different floating-point reduction orders - it is a |
| different numerical trajectory entirely, not a reproduction of any compiled run. |
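
For reference, roughly what a `--deterministic` flag does in PyTorch projects
(a sketch; the repo's exact flag handling is not reproduced here):

```python
import os
import random

import numpy as np
import torch

def enable_determinism(seed=353):
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some CUDA ops
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                   # seeds CPU and all CUDA devices
    torch.backends.cudnn.benchmark = False    # no autotuned, run-varying kernels
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)  # error on nondeterministic ops
    # torch.compile stays off in this mode; the bit-reproducible trajectory
    # here is eager execution.
```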
|
|
**bad_samples.txt**: The shipped file has 156 entries to match the original training run.
| (Note: `wc -l` reports 155 because the last line lacks a trailing newline.) |
| Two additional bad samples (`47b0e0ce19b`, `4b2d56eb3ef`) were discovered after |
| the original training run. They are legitimately bad (misaligned GT) but were |
| included in the original training data. Adding them changes the batch iteration |
| order and costs ~0.005 dev val HSS in deterministic mode (0.372 -> 0.367) and |
| ~0.04 in compiled mode (0.376 -> 0.335 in our `validate_155_compiled` vs |
| `validate_158_compiled` runs) due to compounded torch.compile variance. |
| Participants training from scratch may wish to add these 2 entries for cleaner |
| training data, but should expect slightly different scores due to the changed |
| iteration order. |
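
The entry-count nuance is easy to get right in code (a sketch; the loader and
path here are assumptions, not the repo's API):

```python
def load_bad_ids(path="bad_samples.txt"):
    """Splitting on whitespace yields all 156 entries even though the last
    line has no trailing newline (which is why `wc -l` reports 155)."""
    with open(path) as f:
        return set(f.read().split())

assert len(load_bad_ids()) == 156
```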
| |
| The shipped `checkpoint.pt` is from the original training run |
| (dev val HSS=0.382, public test HSS=0.4470). |
| |