Commit 498c445 by AbstractPhil · verified · 1 Parent(s): cf087d8

Create svd_triton_gram_newton_profiled.txt

Files changed (1)
  1. svd_triton_gram_newton_profiled.txt +221 -0
svd_triton_gram_newton_profiled.txt ADDED
@@ -0,0 +1,221 @@
+ ================================================================================
+ Generalized Batched Thin SVD — Profiling Suite
+ Device: NVIDIA RTX PRO 6000 Blackwell Server Edition
+ ================================================================================
+
+ ======================================================================
+ CORRECTNESS VALIDATION (B=64, M=1024)
+ ======================================================================
+ [auto]   N=  2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS]
+ [triton] N=  2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS]
+ [auto]   N=  3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS]
+ [triton] N=  3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS]
+ [auto]   N=  4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS]
+ [gram]   N=  4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS]
+ [auto]   N=  5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS]
+ [gram]   N=  5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS]
+ [auto]   N=  6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS]
+ [gram]   N=  6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS]
+ [auto]   N=  8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
+ [gram]   N=  8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
+ [newton] N=  8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
+ [auto]   N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
+ [gram]   N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
+ [newton] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
+ [auto]   N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
+ [gram]   N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
+ [newton] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
+ [auto]   N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
+ [gram]   N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
+ [newton] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
+ [auto]   N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
+ [gram]   N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
+ [newton] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
+ [auto]   N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
+ [gram]   N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
+ [newton] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
+ [auto]   N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
+ [gram]   N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
+ [newton] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
+ [auto]   N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
+ [gram]   N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
+ [newton] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
+
+ ALL PASSED
+
+ ========================================================================================================================
+ PROCRUSTES ALIGNMENT: 5 methods of applying rank-k rotation to N-d space
+ cos = mean cosine similarity after alignment (higher = better, full = ceiling)
+ NN = nearest-neighbor agreement with full Procrustes (1.0 = identical downstream)
+ ========================================================================================================================
+
+ N=32:
+   k    full    pinv    lerp   (α)   slerp   (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
+ ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+   8  0.4359  0.2142  0.4248   0.3  0.2142   err  0.4299  0.4215 │  0.177  0.681  0.177  1.000
+  16  0.4370  0.2967  0.4259   0.3  0.2967   err  0.4316  0.4252 │  0.300  0.678  0.300  1.000
+  24  0.4405  0.3864  0.4365   0.3  0.3864   err  0.4369  0.4384 │  0.555  0.772  0.555  1.000
+
+ N=48:
+   k    full    pinv    lerp   (α)   slerp   (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
+ ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+   8  0.4421  0.1764  0.4306   0.3  0.1764   err  0.4350  0.4192 │  0.102  0.702  0.102  1.000
+  16  0.4422  0.2494  0.4290   0.3  0.2494   err  0.4354  0.4292 │  0.230  0.667  0.230  1.000
+  24  0.4432  0.3047  0.4294   0.3  0.3047   err  0.4366  0.4315 │  0.326  0.676  0.326  1.000
+  32  0.4476  0.3621  0.4397   0.3  0.3621   err  0.4429  0.4425 │  0.454  0.728  0.454  1.000
+
+ N=64:
+   k    full    pinv    lerp   (α)   slerp   (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
+ ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+   8  0.4475  0.1602  0.4356   0.3  0.1602   err  0.4390  0.4323 │  0.102  0.708  0.102  1.000
+  16  0.4444  0.2178  0.4300   0.3  0.2178   err  0.4355  0.4299 │  0.164  0.658  0.164  1.000
+  24  0.4453  0.2678  0.4295   0.3  0.2678   err  0.4363  0.4332 │  0.241  0.665  0.241  1.000
+  32  0.4468  0.3091  0.4324   0.3  0.3091   err  0.4390  0.4374 │  0.312  0.680  0.312  1.000
+
+ N=96:
+   k    full    pinv    lerp   (α)   slerp   (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
+ ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  16  0.4267  0.1644  0.4035   0.3  0.1644   err  0.4077  0.4020 │  0.132  0.721  0.132  1.000
+  24  0.4259  0.2023  0.4014   0.3  0.2023   err  0.4069  0.4034 │  0.200  0.709  0.200  1.000
+  32  0.4241  0.2363  0.3996   0.3  0.2363   err  0.4057  0.4056 │  0.241  0.688  0.241  1.000
+  48  0.4238  0.2978  0.4050   0.3  0.2978   err  0.4080  0.4139 │  0.394  0.717  0.394  1.000
+
+ N=128:
+   k    full    pinv    lerp   (α)   slerp   (α)  subspc  stay_k │  nn_pv  nn_lr  nn_sl  nn_ss
+ ─────────────────────────────────────────────────────────────────────────────────────────────────────────
+  16  0.4068  0.1380  0.3740   0.3  0.1380   err  0.3770  0.3763 │  0.129  0.757  0.129  1.000
+  24  0.4072  0.1679  0.3733   0.3  0.1679   err  0.3774  0.3778 │  0.169  0.739  0.169  1.000
+  32  0.4064  0.1860  0.3730   0.3  0.1860   err  0.3778  0.3736 │  0.217  0.723  0.217  1.000
+  48  0.4073  0.2397  0.3783   0.3  0.2397   err  0.3812  0.3868 │  0.310  0.733  0.310  1.000
+  64  0.4102  0.2781  0.3853   0.3  0.2781   err  0.3880  0.3937 │  0.394  0.729  0.394  1.000
+
+ ═════════════════════════════════════════════════════════════════════════════════════════════════════════
+ WINNER PER CONFIG (closest cos to full, highest NN agreement):
+ ═════════════════════════════════════════════════════════════════════════════════════════════════════════
+ N= 32 k=  8: best_cos=subspace (0.4299, gap=0.0060) best_nn=subspace (1.000)
+ N= 32 k= 16: best_cos=subspace (0.4316, gap=0.0054) best_nn=subspace (1.000)
+ N= 32 k= 24: best_cos=subspace (0.4369, gap=0.0037) best_nn=subspace (1.000)
+ N= 48 k=  8: best_cos=subspace (0.4350, gap=0.0071) best_nn=subspace (1.000)
+ N= 48 k= 16: best_cos=subspace (0.4354, gap=0.0068) best_nn=subspace (1.000)
+ N= 48 k= 24: best_cos=subspace (0.4366, gap=0.0066) best_nn=subspace (1.000)
+ N= 48 k= 32: best_cos=subspace (0.4429, gap=0.0047) best_nn=subspace (1.000)
+ N= 64 k=  8: best_cos=subspace (0.4390, gap=0.0085) best_nn=subspace (1.000)
+ N= 64 k= 16: best_cos=subspace (0.4355, gap=0.0089) best_nn=subspace (1.000)
+ N= 64 k= 24: best_cos=subspace (0.4363, gap=0.0090) best_nn=subspace (1.000)
+ N= 64 k= 32: best_cos=subspace (0.4390, gap=0.0078) best_nn=subspace (1.000)
+ N= 96 k= 16: best_cos=subspace (0.4077, gap=0.0190) best_nn=subspace (1.000)
+ N= 96 k= 24: best_cos=subspace (0.4069, gap=0.0190) best_nn=subspace (1.000)
+ N= 96 k= 32: best_cos=subspace (0.4057, gap=0.0184) best_nn=subspace (1.000)
+ N= 96 k= 48: best_cos=subspace (0.4080, gap=0.0158) best_nn=subspace (1.000)
+ N=128 k= 16: best_cos=subspace (0.3770, gap=0.0298) best_nn=subspace (1.000)
+ N=128 k= 24: best_cos=subspace (0.3774, gap=0.0298) best_nn=subspace (1.000)
+ N=128 k= 32: best_cos=subspace (0.3778, gap=0.0286) best_nn=subspace (1.000)
+ N=128 k= 48: best_cos=subspace (0.3812, gap=0.0261) best_nn=subspace (1.000)
+ N=128 k= 64: best_cos=subspace (0.3880, gap=0.0222) best_nn=subspace (1.000)
+
+ ====================================================================================================
+ PROJECTION QUALITY ANALYSIS — B=256, M=1024
+ Question: can rank-k SVD approximate rank-N SVD?
+ ====================================================================================================
+
+ N=32:
+   k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace    Proj ms    Full ms  Speedup
+ ────────────────────────────────────────────────────────────────────────────────────────────────
+   8   30.99%    8.65e-01     8.31e-01     0.5622    0.4432    7.849ms    0.508ms     0.1x
+  12   44.74%    7.89e-01     7.43e-01     0.4606    0.5508   10.556ms    0.508ms     0.0x
+  16   57.56%    7.05e-01     6.51e-01     0.3379    0.6432   11.222ms    0.508ms     0.0x
+  24   80.59%    4.41e-01     4.41e-01     0.0000    1.0000    0.510ms    0.508ms     1.0x
+
+ N=48:
+   k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace    Proj ms    Full ms  Speedup
+ ────────────────────────────────────────────────────────────────────────────────────────────────
+   8   22.33%    9.11e-01     8.81e-01     0.7880    0.3642    7.901ms  172.136ms    21.8x
+  12   32.39%    8.65e-01     8.22e-01     0.6575    0.4454   10.668ms  172.136ms    16.1x
+  16   41.87%    8.15e-01     7.62e-01     0.4125    0.5193   11.490ms  172.136ms    15.0x
+  24   59.24%    7.05e-01     6.38e-01     0.3178    0.6433   11.497ms  172.136ms    15.0x
+  32   74.71%    5.76e-01     5.03e-01     0.3076    0.7575  180.615ms  172.136ms     1.0x
+
+ N=64:
+   k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace    Proj ms    Full ms  Speedup
+ ────────────────────────────────────────────────────────────────────────────────────────────────
+   8   17.83%    9.34e-01     9.06e-01     0.9635    0.3152    7.917ms  182.058ms    23.0x
+  12   25.91%    9.00e-01     8.61e-01     0.6937    0.3898   10.693ms  182.058ms    17.0x
+  16   33.58%    8.64e-01     8.15e-01     0.6025    0.4484   11.311ms  182.058ms    16.1x
+  24   47.78%    7.89e-01     7.23e-01     0.3495    0.5505   11.207ms  182.058ms    16.2x
+  32   60.64%    7.05e-01     6.27e-01     0.3116    0.6438  176.453ms  182.058ms     1.0x
+  48   82.74%    4.99e-01     4.15e-01     0.3090    0.8138  204.625ms  182.058ms     0.9x
+
+ N=96:
+   k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace    Proj ms    Full ms  Speedup
+ ────────────────────────────────────────────────────────────────────────────────────────────────
+   8   13.09%    9.56e-01     9.32e-01     1.2033    0.2583    8.035ms  295.451ms    36.8x
+  16   24.83%    9.11e-01     8.67e-01     0.8721    0.3637   11.426ms  295.451ms    25.9x
+  24   35.57%    8.64e-01     8.02e-01     0.5587    0.4475   11.238ms  295.451ms    26.3x
+  32   45.45%    8.15e-01     7.38e-01     0.4710    0.5163  175.186ms  295.451ms     1.7x
+  48   62.97%    7.05e-01     6.08e-01     0.3243    0.6407  200.525ms  295.451ms     1.5x
+  64   77.83%    5.75e-01     4.71e-01     0.3073    0.7578  306.531ms  295.451ms     1.0x
+
+ N=128:
+   k  Energy%  Recon_proj  Recon_trunc  S_rel_err  Subspace    Proj ms    Full ms  Speedup
+ ────────────────────────────────────────────────────────────────────────────────────────────────
+   8   10.60%    9.68e-01     9.46e-01     1.4678    0.2251    8.085ms  436.551ms    54.0x
+  16   20.19%    9.34e-01     8.93e-01     1.0025    0.3145   11.509ms  436.551ms    37.9x
+  24   29.04%    9.00e-01     8.42e-01     0.7155    0.3867   11.432ms  436.551ms    38.2x
+  32   37.26%    8.64e-01     7.92e-01     0.5374    0.4447  174.994ms  436.551ms     2.5x
+  48   52.05%    7.89e-01     6.93e-01     0.3598    0.5498  198.286ms  436.551ms     2.2x
+  64   64.91%    7.05e-01     5.92e-01     0.3121    0.6407  305.364ms  436.551ms     1.4x
+  96   85.61%    4.99e-01     3.79e-01     0.3011    0.8136  452.623ms  436.551ms     1.0x
+
+ ──────────────────────────────────────────────────────────────────────
+ SUMMARY: Recommended target_rank per N
+ (≥99% energy, ≥0.99 subspace cos, best speedup)
+ ──────────────────────────────────────────────────────────────────────
+ N= 32: best k= 24 → 80.6% energy, subspace=1.0000 (below 99% threshold)
+ N= 48: best k= 32 → 74.7% energy, subspace=0.7575 (below 99% threshold)
+ N= 64: best k= 48 → 82.7% energy, subspace=0.8138 (below 99% threshold)
+ N= 96: best k= 64 → 77.8% energy, subspace=0.7578 (below 99% threshold)
+ N=128: best k= 96 → 85.6% energy, subspace=0.8136 (below 99% threshold)
+
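The Energy% and Recon_trunc columns in the quality tables appear to be two views of the same quantity. Assuming Energy% is the share of squared singular values captured by the top k (the log does not state the definition), the Eckart-Young theorem gives the relative Frobenius error of the best rank-k truncation as sqrt(1 - energy). The logged rows are consistent with that reading: for N=32, k=24, sqrt(1 - 0.8059) is about 0.441, matching Recon_trunc=4.41e-01. A minimal pure-Python sketch follows; the helper names are hypothetical and not part of the profiled suite.

```python
import math

def energy_ratio(singular_values, k):
    """Share of squared singular-value mass held by the k largest values."""
    s = sorted(singular_values, reverse=True)
    total = sum(x * x for x in s)
    return sum(x * x for x in s[:k]) / total

def truncation_error(singular_values, k):
    """Relative Frobenius error of the best rank-k approximation (Eckart-Young)."""
    return math.sqrt(max(0.0, 1.0 - energy_ratio(singular_values, k)))

# Consistency check against one logged row (N=32, k=24: Energy%=80.59, Recon_trunc=4.41e-01).
print(round(math.sqrt(1.0 - 0.8059), 3))  # 0.441
```

Under this reading, the 99% energy threshold in the summary corresponds to a truncation error of at most 0.1, which none of the tested k values reach.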
+ ==============================================================================================================
+ N-DIMENSION SWEEP — NVIDIA RTX PRO 6000 Blackwell Server Edition
+ B=512, M=1024
+ ==============================================================================================================
+   N   Triton       Gram     Newton   Proj→24   Proj→16       Torch    Best  Speedup
+ ──────────────────────────────────────────────────────────────────────────────────────────────────────────
+   2  0.020ms    0.227ms          —         —         —    79.040ms  triton  3859.1x
+   3  0.022ms    0.242ms          —         —         —   118.394ms  triton  5394.2x
+   4        —    0.255ms          —         —         —   125.263ms    gram   490.6x
+   5        —    0.258ms          —         —         —   144.426ms    gram   560.8x
+   6        —    0.269ms          —         —         —   155.042ms    gram   576.9x
+   7        —    0.280ms          —         —         —   163.771ms    gram   584.2x
+   8        —    0.291ms    0.290ms         —         —   168.934ms  newton   582.1x
+  10        —    0.380ms    0.379ms         —         —   190.292ms  newton   502.2x
+  12        —    0.400ms    0.400ms         —         —   213.394ms    gram   534.1x
+  16        —    0.429ms    0.428ms         —         —   230.670ms  newton   538.6x
+  20        —    0.597ms    0.596ms         —         —   253.657ms  newton   425.6x
+  24        —    0.651ms    0.651ms         —   0.652ms   272.293ms  newton   418.5x
+  32        —    0.795ms    0.794ms   0.800ms  22.025ms   303.023ms  newton   381.8x
+  48        —  344.049ms  344.202ms  22.439ms  22.481ms   550.746ms  proj24    24.5x
+  64        —  365.206ms  365.148ms  21.749ms  22.173ms   609.352ms  proj24    28.0x
+  96        —  590.636ms  590.664ms  21.862ms  22.353ms   973.819ms  proj24    44.5x
+ 128        —  868.144ms  868.262ms  22.085ms  22.469ms  1421.924ms  proj24    64.4x
+
+ ================================================================================
+ SUMMARY
+ ================================================================================
+
+ Strategy by N:
+   N=2: Fused Triton (closed-form Jacobi rotation)
+   N=3: Fused Triton (cyclic Jacobi in registers)
+   N=4-32: Gram + eigh (bmm + cuSOLVER eigh) — sub-ms
+   N=48+: Projected SVD (N→k, cheap SVD, lift back) — check quality table
+
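The N=2 strategy above mentions a closed-form Jacobi rotation. The kernel source is not part of this log, so the following is only a CPU sketch of one way such a closed form can work, assuming the rotation is applied to the 2x2 Gram matrix of A. `svd_two_columns` is a hypothetical name, and it handles a single full-rank matrix rather than a batch.

```python
import math

def svd_two_columns(A):
    """Thin SVD of an M x 2 matrix (list of [x, y] rows) via its 2x2 Gram matrix.

    Returns (U, S, V) with A ~= U * diag(S) * V^T and S in descending order.
    Assumes A has full column rank, so both singular values are nonzero.
    """
    # Gram matrix G = A^T A = [[a, b], [b, c]].
    a = sum(x * x for x, _ in A)
    b = sum(x * y for x, y in A)
    c = sum(y * y for _, y in A)

    # A single Jacobi rotation diagonalizes a symmetric 2x2 matrix exactly.
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    V = [[cs, -sn], [sn, cs]]  # columns are the right singular vectors

    # Eigenvalues of G; this choice of theta gives lam1 >= lam2.
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    S = [math.sqrt(max(lam1, 0.0)), math.sqrt(max(lam2, 0.0))]

    # Left singular vectors: U = A V diag(1/S).
    U = [[(x * cs + y * sn) / S[0], (-x * sn + y * cs) / S[1]] for x, y in A]
    return U, S, V
```

Because everything reduces to three dot products and one `atan2`, there is no iteration and no data-dependent branching, which is the kind of structure that suits a fused per-matrix kernel. Forming the Gram matrix squares the condition number, so this sketch loses accuracy on nearly rank-deficient inputs.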
+ Standalone utilities:
+   newton_schulz_invsqrt(G) — batched G^{-1/2} via pure bmm
+   projected_svd(A, target_rank=k) — rank-k approximate SVD
+   projected_svd_quality(A, target_rank) — measure approximation quality
+
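The log lists `newton_schulz_invsqrt(G)` as computing a batched G^{-1/2} using only matrix multiplies, but does not show its body. As an illustration of the general technique, here is a coupled Newton-Schulz iteration for one symmetric positive-definite matrix in pure Python. The function name `invsqrt_newton_schulz`, the Frobenius-norm scaling, and the iteration count are assumptions, not the suite's actual choices.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def invsqrt_newton_schulz(G, iters=20):
    """Approximate G^{-1/2} for a symmetric positive-definite matrix G."""
    n = len(G)
    # Scale by the Frobenius norm so the eigenvalues of Y lie in (0, 1],
    # which keeps the iteration inside its convergence region.
    c = math.sqrt(sum(v * v for row in G for v in row))
    Y = [[v / c for v in row] for row in G]  # converges to (G/c)^{1/2}
    Z = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # converges to (G/c)^{-1/2}
    for _ in range(iters):
        ZY = matmul(Z, Y)
        # T = (3I - ZY) / 2; only matrix multiplies and additions are needed.
        T = [[0.5 * ((3.0 if i == j else 0.0) - ZY[i][j]) for j in range(n)]
             for i in range(n)]
        Y = matmul(Y, T)
        Z = matmul(T, Z)
    # Undo the scaling: G^{-1/2} = (G/c)^{-1/2} / sqrt(c).
    scale = 1.0 / math.sqrt(c)
    return [[v * scale for v in row] for row in Z]
```

A quick sanity check is that `Z = invsqrt_newton_schulz(G)` should satisfy Z G Z ≈ I. Convergence slows as G becomes ill-conditioned, so a real implementation would likely tune the iteration count or the scaling to the matrices it expects.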
+ Key question answered: energy_ratio and subspace_cos in quality table
+
+ Results saved to svd_general_profile.json
+ ================================================================================