================================================================================
  Generalized Batched Thin SVD — Profiling Suite
  Device: NVIDIA RTX PRO 6000 Blackwell Server Edition
================================================================================

======================================================================
  CORRECTNESS VALIDATION (B=64, M=1024)
======================================================================
  [auto]   N=  2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS]
  [triton] N=  2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS]
  [auto]   N=  3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS]
  [triton] N=  3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS]
  [auto]   N=  4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS]
  [gram]   N=  4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS]
  [auto]   N=  5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS]
  [gram]   N=  5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS]
  [auto]   N=  6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS]
  [gram]   N=  6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS]
  [auto]   N=  8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
  [gram]   N=  8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
  [newton] N=  8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
  [auto]   N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
  [gram]   N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
  [newton] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
  [auto]   N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
  [gram]   N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
  [newton] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
  [auto]   N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
  [gram]   N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
  [newton] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
  [auto]   N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
  [gram]   N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
  [newton] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
  [auto]   N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
  [gram]   N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
  [newton] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
  [auto]   N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
  [gram]   N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
  [newton] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
  [auto]   N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
  [gram]   N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
  [newton] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]

  ALL PASSED

========================================================================================================================
  PROCRUSTES ALIGNMENT: 5 methods of applying rank-k rotation to N-d space
  cos = mean cosine similarity after alignment (higher = better, full = ceiling)
  NN  = nearest-neighbor agreement with full Procrustes (1.0 = identical downstream)
========================================================================================================================

  N=32:
    k    full    pinv    lerp     (α)   slerp     (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
    8  0.4359  0.2142  0.4248     0.3  0.2142     err  0.4299  0.4215 │  0.177  0.681  0.177  1.000
   16  0.4370  0.2967  0.4259     0.3  0.2967     err  0.4316  0.4252 │  0.300  0.678  0.300  1.000
   24  0.4405  0.3864  0.4365     0.3  0.3864     err  0.4369  0.4384 │  0.555  0.772  0.555  1.000

  N=48:
    k    full    pinv    lerp     (α)   slerp     (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
    8  0.4421  0.1764  0.4306     0.3  0.1764     err  0.4350  0.4192 │  0.102  0.702  0.102  1.000
   16  0.4422  0.2494  0.4290     0.3  0.2494     err  0.4354  0.4292 │  0.230  0.667  0.230  1.000
   24  0.4432  0.3047  0.4294     0.3  0.3047     err  0.4366  0.4315 │  0.326  0.676  0.326  1.000
   32  0.4476  0.3621  0.4397     0.3  0.3621     err  0.4429  0.4425 │  0.454  0.728  0.454  1.000

  N=64:
    k    full    pinv    lerp     (α)   slerp     (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
    8  0.4475  0.1602  0.4356     0.3  0.1602     err  0.4390  0.4323 │  0.102  0.708  0.102  1.000
   16  0.4444  0.2178  0.4300     0.3  0.2178     err  0.4355  0.4299 │  0.164  0.658  0.164  1.000
   24  0.4453  0.2678  0.4295     0.3  0.2678     err  0.4363  0.4332 │  0.241  0.665  0.241  1.000
   32  0.4468  0.3091  0.4324     0.3  0.3091     err  0.4390  0.4374 │  0.312  0.680  0.312  1.000

  N=96:
    k    full    pinv    lerp     (α)   slerp     (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
   16  0.4267  0.1644  0.4035     0.3  0.1644     err  0.4077  0.4020 │  0.132  0.721  0.132  1.000
   24  0.4259  0.2023  0.4014     0.3  0.2023     err  0.4069  0.4034 │  0.200  0.709  0.200  1.000
   32  0.4241  0.2363  0.3996     0.3  0.2363     err  0.4057  0.4056 │  0.241  0.688  0.241  1.000
   48  0.4238  0.2978  0.4050     0.3  0.2978     err  0.4080  0.4139 │  0.394  0.717  0.394  1.000

  N=128:
    k    full    pinv    lerp     (α)   slerp     (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
  ─────────────────────────────────────────────────────────────────────────────────────────────────────────
   16  0.4068  0.1380  0.3740     0.3  0.1380     err  0.3770  0.3763 │  0.129  0.757  0.129  1.000
   24  0.4072  0.1679  0.3733     0.3  0.1679     err  0.3774  0.3778 │  0.169  0.739  0.169  1.000
   32  0.4064  0.1860  0.3730     0.3  0.1860     err  0.3778  0.3736 │  0.217  0.723  0.217  1.000
   48  0.4073  0.2397  0.3783     0.3  0.2397     err  0.3812  0.3868 │  0.310  0.733  0.310  1.000
   64  0.4102  0.2781  0.3853     0.3  0.2781     err  0.3880  0.3937 │  0.394  0.729  0.394  1.000

  ═════════════════════════════════════════════════════════════════════════════════════════════════════════
  WINNER PER CONFIG (closest cos to full, highest NN agreement):
  ═════════════════════════════════════════════════════════════════════════════════════════════════════════
  N= 32 k=  8: best_cos=subspace (0.4299, gap=0.0060)  best_nn=subspace (1.000)
  N= 32 k= 16: best_cos=subspace (0.4316, gap=0.0054)  best_nn=subspace (1.000)
  N= 32 k= 24: best_cos=subspace (0.4369, gap=0.0037)  best_nn=subspace (1.000)
  N= 48 k=  8: best_cos=subspace (0.4350, gap=0.0071)  best_nn=subspace (1.000)
  N= 48 k= 16: best_cos=subspace (0.4354, gap=0.0068)  best_nn=subspace (1.000)
  N= 48 k= 24: best_cos=subspace (0.4366, gap=0.0066)  best_nn=subspace (1.000)
  N= 48 k= 32: best_cos=subspace (0.4429, gap=0.0047)  best_nn=subspace (1.000)
  N= 64 k=  8: best_cos=subspace (0.4390, gap=0.0085)  best_nn=subspace (1.000)
  N= 64 k= 16: best_cos=subspace (0.4355, gap=0.0089)  best_nn=subspace (1.000)
  N= 64 k= 24: best_cos=subspace (0.4363, gap=0.0090)  best_nn=subspace (1.000)
  N= 64 k= 32: best_cos=subspace (0.4390, gap=0.0078)  best_nn=subspace (1.000)
  N= 96 k= 16: best_cos=subspace (0.4077, gap=0.0190)  best_nn=subspace (1.000)
  N= 96 k= 24: best_cos=subspace (0.4069, gap=0.0190)  best_nn=subspace (1.000)
  N= 96 k= 32: best_cos=subspace (0.4057, gap=0.0184)  best_nn=subspace (1.000)
  N= 96 k= 48: best_cos=subspace (0.4080, gap=0.0158)  best_nn=subspace (1.000)
  N=128 k= 16: best_cos=subspace (0.3770, gap=0.0298)  best_nn=subspace (1.000)
  N=128 k= 24: best_cos=subspace (0.3774, gap=0.0298)  best_nn=subspace (1.000)
  N=128 k= 32: best_cos=subspace (0.3778, gap=0.0286)  best_nn=subspace (1.000)
  N=128 k= 48: best_cos=subspace (0.3812, gap=0.0261)  best_nn=subspace (1.000)
  N=128 k= 64: best_cos=subspace (0.3880, gap=0.0222)  best_nn=subspace (1.000)

====================================================================================================
  PROJECTION QUALITY ANALYSIS — B=256, M=1024
  Question: can rank-k SVD approximate rank-N SVD?
====================================================================================================

  N=32:
    k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace     Proj ms     Full ms  Speedup
  ────────────────────────────────────────────────────────────────────────────────────────────────
    8   30.99%    8.65e-01     8.31e-01     0.5622    0.4432     7.849ms     0.508ms     0.1x
   12   44.74%    7.89e-01     7.43e-01     0.4606    0.5508    10.556ms     0.508ms     0.0x
   16   57.56%    7.05e-01     6.51e-01     0.3379    0.6432    11.222ms     0.508ms     0.0x
   24   80.59%    4.41e-01     4.41e-01     0.0000    1.0000     0.510ms     0.508ms     1.0x

  N=48:
    k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace     Proj ms     Full ms  Speedup
  ────────────────────────────────────────────────────────────────────────────────────────────────
    8   22.33%    9.11e-01     8.81e-01     0.7880    0.3642     7.901ms   172.136ms    21.8x
   12   32.39%    8.65e-01     8.22e-01     0.6575    0.4454    10.668ms   172.136ms    16.1x
   16   41.87%    8.15e-01     7.62e-01     0.4125    0.5193    11.490ms   172.136ms    15.0x
   24   59.24%    7.05e-01     6.38e-01     0.3178    0.6433    11.497ms   172.136ms    15.0x
   32   74.71%    5.76e-01     5.03e-01     0.3076    0.7575   180.615ms   172.136ms     1.0x

  N=64:
    k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace     Proj ms     Full ms  Speedup
  ────────────────────────────────────────────────────────────────────────────────────────────────
    8   17.83%    9.34e-01     9.06e-01     0.9635    0.3152     7.917ms   182.058ms    23.0x
   12   25.91%    9.00e-01     8.61e-01     0.6937    0.3898    10.693ms   182.058ms    17.0x
   16   33.58%    8.64e-01     8.15e-01     0.6025    0.4484    11.311ms   182.058ms    16.1x
   24   47.78%    7.89e-01     7.23e-01     0.3495    0.5505    11.207ms   182.058ms    16.2x
   32   60.64%    7.05e-01     6.27e-01     0.3116    0.6438   176.453ms   182.058ms     1.0x
   48   82.74%    4.99e-01     4.15e-01     0.3090    0.8138   204.625ms   182.058ms     0.9x

  N=96:
    k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace     Proj ms     Full ms  Speedup
  ────────────────────────────────────────────────────────────────────────────────────────────────
    8   13.09%    9.56e-01     9.32e-01     1.2033    0.2583     8.035ms   295.451ms    36.8x
   16   24.83%    9.11e-01     8.67e-01     0.8721    0.3637    11.426ms   295.451ms    25.9x
   24   35.57%    8.64e-01     8.02e-01     0.5587    0.4475    11.238ms   295.451ms    26.3x
   32   45.45%    8.15e-01     7.38e-01     0.4710    0.5163   175.186ms   295.451ms     1.7x
   48   62.97%    7.05e-01     6.08e-01     0.3243    0.6407   200.525ms   295.451ms     1.5x
   64   77.83%    5.75e-01     4.71e-01     0.3073    0.7578   306.531ms   295.451ms     1.0x

  N=128:
    k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace     Proj ms     Full ms  Speedup
  ────────────────────────────────────────────────────────────────────────────────────────────────
    8   10.60%    9.68e-01     9.46e-01     1.4678    0.2251     8.085ms   436.551ms    54.0x
   16   20.19%    9.34e-01     8.93e-01     1.0025    0.3145    11.509ms   436.551ms    37.9x
   24   29.04%    9.00e-01     8.42e-01     0.7155    0.3867    11.432ms   436.551ms    38.2x
   32   37.26%    8.64e-01     7.92e-01     0.5374    0.4447   174.994ms   436.551ms     2.5x
   48   52.05%    7.89e-01     6.93e-01     0.3598    0.5498   198.286ms   436.551ms     2.2x
   64   64.91%    7.05e-01     5.92e-01     0.3121    0.6407   305.364ms   436.551ms     1.4x
   96   85.61%    4.99e-01     3.79e-01     0.3011    0.8136   452.623ms   436.551ms     1.0x

  ──────────────────────────────────────────────────────────────────────
  SUMMARY: Recommended target_rank per N
  (≥99% energy, ≥0.99 subspace cos, best speedup)
  ──────────────────────────────────────────────────────────────────────
  N= 32: best k= 24 → 80.6% energy, subspace=1.0000 (below 99% threshold)
  N= 48: best k= 32 → 74.7% energy, subspace=0.7575 (below 99% threshold)
  N= 64: best k= 48 → 82.7% energy, subspace=0.8138 (below 99% threshold)
  N= 96: best k= 64 → 77.8% energy, subspace=0.7578 (below 99% threshold)
  N=128: best k= 96 → 85.6% energy, subspace=0.8136 (below 99% threshold)
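
The log does not show how Energy% and Recon_trunc are computed. The tabulated values are, however, consistent with Recon_trunc = sqrt(1 - Energy): for N=32, k=8, sqrt(1 - 0.3099) ≈ 0.831, and for N=128, k=96, sqrt(1 - 0.8561) ≈ 0.379. That relation holds if Energy% is the share of squared singular values carried by the top k, and Recon_trunc is the relative Frobenius error of the optimal rank-k truncation. The sketch below works under that assumption. NumPy stands in for the suite's GPU code, and `truncation_metrics` is a hypothetical name, not a function from the suite.

```python
import numpy as np

def truncation_metrics(A, k):
    """Energy share and relative Frobenius error of the best rank-k truncation.

    Assumed definitions; the suite's own formulas are not shown in the log.
    """
    U, S, Vh = np.linalg.svd(A, full_matrices=False)
    # Fraction of the squared Frobenius norm captured by the top-k singular values.
    energy = np.sum(S[:k] ** 2) / np.sum(S ** 2)
    # Optimal rank-k approximation (Eckart-Young): keep the top-k triplets.
    A_k = (U[:, :k] * S[:k]) @ Vh[:k]
    recon_trunc = np.linalg.norm(A - A_k) / np.linalg.norm(A)
    return energy, recon_trunc

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 32))   # same M and N as the N=32 block; data is synthetic
energy, recon_trunc = truncation_metrics(A, k=8)

# The two quantities are tied together: recon_trunc == sqrt(1 - energy) up to rounding.
print(f"energy={energy:.4f}  recon_trunc={recon_trunc:.4f}  "
      f"sqrt(1-energy)={np.sqrt(1.0 - energy):.4f}")
```

Under these definitions Recon_trunc is a lower bound on any rank-k reconstruction error, which matches the table: Recon_proj is never below Recon_trunc in any row.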

==============================================================================================================
  N-DIMENSION SWEEP — NVIDIA RTX PRO 6000 Blackwell Server Edition
  B=512, M=1024
==============================================================================================================
    N    Triton        Gram      Newton    Proj→24    Proj→16        Torch    Best   Speedup
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────
    2   0.020ms     0.227ms           —          —          —     79.040ms  triton   3859.1x
    3   0.022ms     0.242ms           —          —          —    118.394ms  triton   5394.2x
    4         —     0.255ms           —          —          —    125.263ms    gram    490.6x
    5         —     0.258ms           —          —          —    144.426ms    gram    560.8x
    6         —     0.269ms           —          —          —    155.042ms    gram    576.9x
    7         —     0.280ms           —          —          —    163.771ms    gram    584.2x
    8         —     0.291ms     0.290ms          —          —    168.934ms  newton    582.1x
   10         —     0.380ms     0.379ms          —          —    190.292ms  newton    502.2x
   12         —     0.400ms     0.400ms          —          —    213.394ms    gram    534.1x
   16         —     0.429ms     0.428ms          —          —    230.670ms  newton    538.6x
   20         —     0.597ms     0.596ms          —          —    253.657ms  newton    425.6x
   24         —     0.651ms     0.651ms          —    0.652ms    272.293ms  newton    418.5x
   32         —     0.795ms     0.794ms    0.800ms   22.025ms    303.023ms  newton    381.8x
   48         —   344.049ms   344.202ms   22.439ms   22.481ms    550.746ms  proj24     24.5x
   64         —   365.206ms   365.148ms   21.749ms   22.173ms    609.352ms  proj24     28.0x
   96         —   590.636ms   590.664ms   21.862ms   22.353ms    973.819ms  proj24     44.5x
  128         —   868.144ms   868.262ms   22.085ms   22.469ms   1421.924ms  proj24     64.4x

================================================================================
  SUMMARY
================================================================================

  Strategy by N:
    N=2:     Fused Triton (closed-form Jacobi rotation)
    N=3:     Fused Triton (cyclic Jacobi in registers)
    N=4-32:  Gram + eigh (bmm + cuSOLVER eigh) — sub-ms
    N=48+:   Projected SVD (N→k, cheap SVD, lift back) — check quality table

  Standalone utilities:
    newton_schulz_invsqrt(G)               — batched G^{-1/2} via pure bmm
    projected_svd(A, target_rank=k)        — rank-k approximate SVD
    projected_svd_quality(A, target_rank)  — measure approximation quality

  Key question answered: energy_ratio and subspace_cos in quality table

  Results saved to svd_general_profile.json
================================================================================
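
The "Gram + eigh" strategy named in the summary can be illustrated with a minimal sketch. It is not the suite's implementation: the log only says "bmm + cuSOLVER eigh", so the code below uses NumPy's batched `eigh` as a CPU stand-in, and the function name `gram_eigh_svd`, the clamping constant, and the test shapes are all assumptions made for illustration.

```python
import numpy as np

def gram_eigh_svd(A, eps=1e-12):
    """Thin SVD of a batch of tall matrices A with shape (B, M, N), M >= N.

    Route: G = A^T A (one batched matmul), eigendecompose the small N x N
    matrix G, then recover U from A V / S. Hypothetical helper.
    """
    G = np.swapaxes(A, -1, -2) @ A                 # (B, N, N), symmetric PSD
    evals, V = np.linalg.eigh(G)                   # eigenvalues in ascending order
    evals = evals[..., ::-1]                       # flip to descending order
    V = V[..., ::-1]                               # reorder eigenvector columns to match
    S = np.sqrt(np.clip(evals, 0.0, None))         # singular values of A
    U = (A @ V) / np.maximum(S[..., None, :], eps) # left singular vectors, (B, M, N)
    return U, S, np.swapaxes(V, -1, -2)            # Vh has shape (B, N, N)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 256, 8))               # small float64 batch for illustration
U, S, Vh = gram_eigh_svd(A)

recon_err = np.abs((U * S[..., None, :]) @ Vh - A).max()
orth_err = np.abs(np.swapaxes(U, -1, -2) @ U - np.eye(8)).max()
print(f"recon_err={recon_err:.2e}  orth_err={orth_err:.2e}")
```

Forming A^T A squares the condition number of A, so this route trades some accuracy for speed on ill-conditioned inputs. The sketch runs in float64 on well-conditioned random data, so its error figures are not comparable to the S_err, recon, and orth values reported in the correctness table.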