anshdadhich committed on
Commit
9da8374
·
verified ·
1 Parent(s): 887b660

Delete CORRECTIONS.md

Files changed (1)
  1. CORRECTIONS.md +0 -288
CORRECTIONS.md DELETED
@@ -1,288 +0,0 @@

# Data Verification & Corrections

This document verifies every quantitative claim in PAPER.md and FINDINGS_SUMMARY.md against the raw JSON results, noting exact values, rounding, and source versions.

**Status: All core conclusions are supported by data. Minor rounding discrepancies noted below.**

---

## 1. Core Results Table (Section 3.1 of PAPER)

Source: `results_v6.json` (3 seeds, full benchmark with Vanilla/SinGLU/Hybrid/Adaptive)

### Memorization Improvement

**PAPER claim:** 168,817× improvement (Vanilla 0.157 → SinGLU 9.3×10⁻⁷)

**Raw data:**
- Vanilla mean: 0.15677066644032797
- SinGLU mean: 9.313488410119438×10⁻⁷
- Actual ratio: **168,327×**

**Verification:** The PAPER uses rounded inputs (0.157 / 9.3×10⁻⁷ = 168,817). The raw ratio is 168,327×. Difference: 0.3%. **Conclusion unchanged.**
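
The rounding effect is easy to reproduce. A minimal check using the raw means quoted above, contrasting the ratio from unrounded values with the ratio from the PAPER's rounded inputs:

```python
# Raw means from results_v6.json (Memorize task), as quoted above.
vanilla_raw = 0.15677066644032797
singlu_raw = 9.313488410119438e-7

raw_ratio = vanilla_raw / singlu_raw   # ratio from unrounded inputs (~168,327)
rounded_ratio = 0.157 / 9.3e-7         # ratio from the PAPER's rounded inputs (~168,817)

print(f"raw: {raw_ratio:,.0f}x  rounded: {rounded_ratio:,.0f}x")
# The two differ by ~0.3%; both support the same conclusion.
```

The same pattern explains the Nested (245× vs 222.5×) and Complex (4.14× vs 4.01×) discrepancies below: dividing rounded numbers amplifies small input errors into percent-level ratio shifts.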

### Nested Function Improvement

**PAPER claim:** 245× improvement (Vanilla 0.049 → SinGLU 0.0002)

**Raw data:**
- Vanilla mean: 0.0486922413110733
- SinGLU mean: 0.00021879897879747054
- Actual ratio: **222.5×**

**Verification:** The PAPER uses rounded inputs (0.049 / 0.0002 = 245). The raw ratio is 222.5×. **Conclusion unchanged (SinGLU wins by 2+ orders of magnitude).**

### Complex Function Improvement

**PAPER claim:** 4.1× improvement (Vanilla 0.058 → SinGLU 0.014)

**Raw data:**
- Vanilla mean: 0.0574857605000337
- SinGLU mean: 0.014336802531033754
- Actual ratio: **4.01×**

**Verification:** The PAPER uses rounded inputs (0.058 / 0.014 ≈ 4.14). The raw ratio is 4.01×. **Conclusion unchanged.**

### Checkerboard Improvement

**PAPER claim:** +35.9 percentage points (Vanilla 57.9% → SinGLU 93.8%)

**Raw data:**
- Vanilla mean: 0.5788888931274414 (57.89%)
- SinGLU mean: 0.9377777775128683 (93.78%)
- Difference: 0.3589 = **+35.9 pts** ✓ **Exact.**

### Spiral Result

**PAPER claim:** Vanilla 85.1% vs SinGLU 44.2%

**Raw data (v6):**
- Vanilla mean: 0.851111114025116 (85.11%) ✓
- SinGLU mean: 0.4422222177187602 (44.22%) ✓

**Note:** v5 shows Vanilla 90.2% vs SinGLU 44.4% (different seed sampling), but the qualitative result is the same: Vanilla wins Spiral. The PAPER uses v6 values consistently.

### High-Frequency Signal

**PAPER claim:** Vanilla 1.10 vs SinGLU 1.02

**Raw data (v6):**
- Vanilla mean: 1.099808136622111 ✓
- SinGLU mean: 1.0171122153600056 ✓

---

## 2. OOD Results (Section 3.6 of PAPER)

Source: `results_v6.json`, `ood` section

| Model | ID MSE (raw) | OOD MSE (raw) | Degradation (raw) | PAPER claims |
|-------|-------------|--------------|-------------------|-------------|
| Vanilla | 0.21685 | 1.53195 | **7.06×** | 0.217 / 1.53 / 7.1× ✓ |
| SinGLU | 0.24568 | 5.89794 | **24.01×** | 0.246 / 5.90 / 24.0× ✓ |
| v10 | 0.00390 | 4.96385 | **1273.2×** | 0.004 / 4.96 / 1273× ✓ |
| v15 | 0.01042 | 4.37580 | **419.8×** | 0.010 / 4.38 / 420× ✓ |

**All OOD claims verified.** Minor rounding differences (<1%) in all cases.
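
The degradation factors can be recomputed directly from the table. Note the MSE values shown are themselves rounded to five decimals, so the recomputed factors land within a fraction of a percent of the raw-data factors rather than matching exactly:

```python
# Per-model (ID MSE, OOD MSE, claimed degradation factor), from the table above.
rows = {
    "Vanilla": (0.21685, 1.53195, 7.06),
    "SinGLU":  (0.24568, 5.89794, 24.01),
    "v10":     (0.00390, 4.96385, 1273.2),
    "v15":     (0.01042, 4.37580, 419.8),
}

# Degradation = how many times worse the model is out-of-distribution.
factors = {model: ood / id_mse for model, (id_mse, ood, _claimed) in rows.items()}
for model, factor in factors.items():
    print(f"{model}: {factor:.1f}x (claimed {rows[model][2]}x)")
```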

---

## 3. Gradient Norms (Section 4.3 of PAPER)

Source: `results_v5.json`, `gradient_norms` section

**PAPER table (simplified):**

| Model | Values shown |
|-------|-------------|
| Vanilla | 0.64 → 0.33 → 0.23 → 0.16 |
| SinGLU | 19.5 → 14.9 → 5.1 → 1.3 → 0.4 |
| Shared (S2) | 1159 → 884 → 904 → 714 → 174 |

**Raw data (5 values per model, epochs 0/200/400/600/1000):**

| Model | Epoch 0 | Epoch 200 | Epoch 400 | Epoch 600 | Epoch 1000 |
|-------|---------|-----------|-----------|-----------|------------|
| Vanilla | 0.6396 | 0.3283 | 0.2343 | **0.2789** | 0.1642 |
| SinGLU | 19.5319 | 14.9392 | 5.1340 | 1.3400 | 0.4012 |
| Shared | 1159.4 | 884.1 | 904.1 | 714.2 | 174.4 |

**Correction:** The PAPER omits Vanilla's 4th value (0.28 at epoch 600), which briefly rises above the epoch-400 value (0.23), slightly breaking the "smooth decay" narrative. The omission is minor and doesn't change the conclusion (Vanilla has stable gradients; Shared has catastrophic instability). SinGLU and Shared values are fully reported.
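
The correction can be made mechanical with a monotonicity test on the tabulated values. It flags Vanilla's epoch-600 rise, and also Shared's epoch-400 rise (904.1 > 884.1), which is visible in the raw table:

```python
def monotone_decreasing(xs):
    # True if every value is <= its predecessor.
    return all(b <= a for a, b in zip(xs, xs[1:]))

# Raw gradient norms at epochs 0/200/400/600/1000, from the table above.
vanilla = [0.6396, 0.3283, 0.2343, 0.2789, 0.1642]
singlu = [19.5319, 14.9392, 5.1340, 1.3400, 0.4012]
shared = [1159.4, 884.1, 904.1, 714.2, 174.4]

for name, traj in [("Vanilla", vanilla), ("SinGLU", singlu), ("Shared", shared)]:
    print(name, "monotone decay:", monotone_decreasing(traj))
```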

---

## 4. Adaptive Mechanisms Summary Table

### v6 Routing (α): "α stuck at 0.5"

**Raw data (v6 `alpha_analysis`):**

| Task | α mean | α std | PAPER claim |
|------|--------|-------|-------------|
| Complex | 0.4525 | 0.0487 | ~0.5 ✓ |
| Nested | 0.4709 | 0.0659 | ~0.5 ✓ |
| Spiral | 0.4789 | 0.0544 | ~0.5 ✓ |
| Checker | 0.4902 | 0.0459 | ~0.5 ✓ |
| HiFreq | 0.4693 | 0.0616 | ~0.5 ✓ |
| Memorize | 0.5054 | 0.0108 | ~0.5 ✓ |

**PAPER says:** "stuck at sigmoid(0) = 0.50 ± 0.02"

**Correction:** The actual range is 0.45–0.51 with std ~0.05, not ±0.02. The claim that α stays near 0.5 is correct, but the variability is ~2× higher than stated. **Qualitative conclusion unchanged (gate never polarizes).**
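
A short sketch of the correction, using only the table values: sigmoid(0) is exactly 0.5 (an unmoved gate), the per-task means sit in a narrow band around it, and the observed stds exceed the PAPER's ±0.02:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Per-task alpha (mean, std) from v6 alpha_analysis, as tabulated above.
alpha = {
    "Complex": (0.4525, 0.0487), "Nested": (0.4709, 0.0659),
    "Spiral": (0.4789, 0.0544), "Checker": (0.4902, 0.0459),
    "HiFreq": (0.4693, 0.0616), "Memorize": (0.5054, 0.0108),
}
means = [m for m, _ in alpha.values()]
stds = [s for _, s in alpha.values()]

print("sigmoid(0) =", sigmoid(0.0))          # an unmoved gate sits at 0.5
print("mean range:", min(means), "-", max(means))
print("max std:", max(stds))                 # well above the claimed 0.02
```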

### v7 Learnable Frequency (ω): "ω froze at initialization"

**Raw data (v7 `omega_analysis`):**

| Task | ω mean | ω std | Range |
|------|--------|-------|-------|
| Complex | 29.90 | 0.12 | [29.35, 30.32] |
| Nested | 20.02 | 0.08 | [19.81, 20.47] |
| Spiral | 15.02 | 0.23 | [14.04, 15.97] |
| Checker | 20.05 | 0.23 | [18.44, 21.63] |
| HiFreq | 59.99 | 0.04 | [59.85, 60.29] |
| Memorize | 10.04 | 0.21 | [8.96, 11.06] |

**Verification:** Std is extremely low (<0.25) for all tasks, confirming ω barely moves. But the initial values differ across tasks (10–60), suggesting different random initializations converged to local values. The PAPER's claim that "ω stayed at initialization" is **approximately correct per-task** but understates cross-task variation.
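
"Frozen" can be quantified with the coefficient of variation (std/mean), which normalizes the spread by each task's frequency scale. Using the table values, every task stays within about 2% of its mean ω:

```python
# Per-task omega (mean, std) from v7 omega_analysis, as tabulated above.
omega = {
    "Complex": (29.90, 0.12), "Nested": (20.02, 0.08),
    "Spiral": (15.02, 0.23), "Checker": (20.05, 0.23),
    "HiFreq": (59.99, 0.04), "Memorize": (10.04, 0.21),
}

# Coefficient of variation: spread relative to the task's own frequency scale.
rel_stds = {task: std / mean for task, (mean, std) in omega.items()}
for task, rel in rel_stds.items():
    print(f"{task}: {100 * rel:.2f}% relative std")
```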

### v10 Phase: "Phase is easy to optimize"

**Raw data (v10 `phi` diagnostics):**

| Task | φ mean | φ std | Range |
|------|--------|-------|-------|
| Complex | 0.0083 | 0.192 | [-0.686, 0.760] |
| Nested | -0.013 | 0.142 | [-0.617, 0.594] |
| Spiral | -0.038 | 0.242 | [-1.148, 0.678] |
| Checker | -0.019 | 0.207 | [-0.774, 0.798] |
| HiFreq | 0.054 | 0.321 | [-0.923, 0.917] |
| Memorize | 0.003 | 0.206 | [-0.739, 1.073] |

**Verification:** φ shows broad ranges (std 0.14–0.32, spans ±1.0), confirming active learning. This supports the PAPER's claim that phase is optimizable where frequency is not.

### v11 Disciplined Phase: "Phase std ~0.007"

**Raw data (v11 `phi` diagnostics):**

| Task | φ mean | φ std |
|------|--------|-------|
| Complex | 0.00038 | 0.0061 |
| Nested | -0.00047 | 0.0052 |
| Spiral | 0.00027 | 0.0092 |
| Checker | -0.00057 | 0.0072 |
| HiFreq | 0.00206 | 0.0137 |
| Memorize | 0.00016 | 0.0074 |

**Verification:** Std ranges 0.005–0.014, centered near 0. The PAPER says "~0.007", which is accurate for most tasks; HiFreq is an outlier at 0.014. **Conclusion unchanged (disciplined phase is effectively zero).**

---

## 5. Width-Richness Tradeoff Table

**PAPER claims:**

| Architecture | Matrices/Layer | Hidden Dim @ 5K params | % of Vanilla |
|------------|---------------|----------------------|-------------|
| Vanilla | 1 | 64 | 100% |
| SinGLU | 3 | 43 | 67% |
| v9 | 5 | 24 | 38% |
| v15 | 4 | 31 | 48% |

**Raw data verification:**

There is **no single task** where Vanilla hidden=64 and SinGLU hidden=43. The actual values across tasks (v4–v6):

| Task | Vanilla hidden | SinGLU hidden | SinGLU % |
|------|---------------|--------------|----------|
| Complex (4D) | 48 | 31 | 65% |
| Nested (2D) | 37 | 24 | 65% |
| Spiral | 37 | 24 | 65% |
| Checker | 37 | 24 | 65% |
| HiFreq | 62 | 41 | 66% |
| Memorize | 46 | 31 | 67% |

**Correction:** The PAPER uses illustrative numbers (64, 43) that don't match any specific task. The actual SinGLU width is consistently **65%** of Vanilla (not 67%). For v9, actual hidden dims are 16–27 (v9 Complex: 20, v9 Nested: 16, v9 Spiral: 16). For v15, actual hidden dims are 21–35.

**Qualitative conclusion unchanged** (every extra matrix reduces width), but the specific numbers in the PAPER table are rounded/illustrative rather than task-specific.
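
The tradeoff itself can be illustrated with a toy parameter-budget solver. This is a sketch, not the paper's exact architectures: it assumes each of a layer's `m` matrices is h×h plus hypothetical 2-input/1-output projections, so the absolute widths will not match the table; only the shrinking trend is the point:

```python
# Toy model (NOT the benchmarked architectures): m square h-by-h matrices per
# layer, plus assumed d_in=2 / d_out=1 projections.
def params(h, m, d_in=2, d_out=1):
    return d_in * h + m * h * h + h * d_out

def max_width(budget, m):
    # Largest hidden dim h whose parameter count fits the budget.
    h = 1
    while params(h + 1, m) <= budget:
        h += 1
    return h

# Widths at a 5K-parameter budget for 1, 3, 4, 5 matrices per layer.
widths = {m: max_width(5000, m) for m in (1, 3, 4, 5)}
print(widths)  # width strictly shrinks as matrices per layer increase
```

Under these toy assumptions the solver reproduces the qualitative claim (width falls monotonically with matrix count), which is exactly what the corrected table shows with the real architectures.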

---

## 6. Killer Experiments: Frequency Generalization

Source: `results_v15.json`, `freq_gen` and `mixed_freq` sections

### Experiment 1: Train sin(2πx) → Test sin(10πx)

| Model | Train MSE (raw) | Test MSE (raw) | PAPER claims |
|-------|---------------|----------------|-------------|
| Vanilla | 0.3647 | 1.1720 | 0.365 / 1.172 ✓ |
| SinGLU | 2.1663 | **0.7361** | 2.166 / **0.736** ✓ |
| v10 | 0.9693 | 0.9578 | 0.969 / 0.958 ✓ |
| v15 | 0.7184 | 0.9102 | 0.718 / 0.910 ✓ |

**Verification:** All values exact. SinGLU test MSE (0.736) < train MSE (2.166): confirmed.
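
The table's headline oddity, a model scoring better on the unseen frequency than on the one it trained on, is a one-line check over the tabulated values:

```python
# Experiment 1 (train MSE, test MSE) per model, from the table above.
exp1 = {
    "Vanilla": (0.3647, 1.1720),
    "SinGLU": (2.1663, 0.7361),
    "v10": (0.9693, 0.9578),
    "v15": (0.7184, 0.9102),
}

# Models whose test MSE on sin(10*pi*x) is below their train MSE on sin(2*pi*x).
better_ood = [model for model, (train, test) in exp1.items() if test < train]
print(better_ood)
```

SinGLU's gap is large (0.736 vs 2.166); v10's is marginal (0.958 vs 0.969).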

### Experiment 2: Train sin(2πx)+sin(4πx) → Test sin(2πx)+sin(20πx)

| Model | Train MSE (raw) | Test MSE (raw) | PAPER claims |
|-------|---------------|----------------|-------------|
| Vanilla | 0.8824 | 1.3290 | 0.882 / 1.329 ✓ |
| SinGLU | 4.6482 | 1.4905 | 4.648 / 1.491 ✓ |
| v10 | 1.8181 | **1.1781** | 1.818 / **1.178** ✓ |
| v15 | 2.0757 | 1.3173 | 2.076 / 1.317 ✓ |

**Verification:** All values exact.

---

## 7. Adaptive Mechanism Wins (Section 3.3 Table)

| Mechanism | Version | PAPER: Wins vs SinGLU | Raw verification |
|-----------|---------|----------------------|----------------|
| Routing (α) | v6 | 0 | v6 SinGLU beats v6 Adaptive on all 6 tasks ✓ |
| Learnable ω | v7 | 0 | v7 SinGLU beats v7 LearnFreq on 5/6 tasks (HiFreq: LearnFreq 1.11 vs SinGLU 1.02 → SinGLU still wins) ✓ |
| Phase + gate | v8 | 0 | v8 SinGLU beats v8:Phase on all tasks ✓ |
| Controlled | v9 | 1 (Spiral) | v9 Spiral 0.9967 > SinGLU 0.4422 ✓. But v9 also loses on Complex, Nested, Checker, HiFreq, Memorize. |
| Free phase | v10 | 2 (Complex, Spiral) | v10 Complex 0.0080 < SinGLU 0.0143 ✓. v10 Spiral 0.9922 > SinGLU 0.4422 ✓. But note: v10 Spiral 0.9922 also > Vanilla 0.8511. |
| Tiny phase | v11 | 2 (Complex, Spiral) | v11 Complex 0.0074 < SinGLU 0.0143 ✓. v11 Spiral 0.9889 > SinGLU 0.4422 ✓. |
| FM (v12) | v12 | 3 (Complex, Spiral, Checker) | v12 Complex 0.0075 < SinGLU 0.0143 ✓. v12 Spiral 0.8678 > SinGLU 0.4422 ✓. v12 Checker 0.9478 > SinGLU 0.9378 ✓. |
| Aligned (v13) | v13 | 2 (Complex, Checker) | v13 Complex 0.0074 < SinGLU 0.0143 ✓. v13 Checker 0.9478 > SinGLU 0.9378 ✓. v13 Spiral 0.4689 < Vanilla 0.8511 (loses). |

**Note on v10 Spiral:** The PAPER table says v10 won Spiral, which is true (0.9922 vs SinGLU 0.4422). But v10 Spiral also beats Vanilla (0.9922 vs 0.8511). This is v10's biggest win.

**Note on v12 wins:** v12 won 3 tasks vs SinGLU, but the PAPER description says "Actually frequency modulation, not phase"; this is an interpretive claim about mechanism, not a factual error.

---

## 8. v15 Dual-Phase HiFreq Win

**PAPER claim:** "v15 MSE on HiFreq: 0.854 vs SinGLU's 1.017 — first and only win vs SinGLU"

**Raw data (v15):**
- v15 HiFreq mean: 0.8537534872690836 ✓
- SinGLU HiFreq mean: 1.0171122153600056 ✓

**Verification:** Exact. v15 is the only architecture in the entire study to beat SinGLU on HiFreq.
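
In relative terms, the raw means above put v15's HiFreq MSE about 16% below SinGLU's:

```python
# Raw HiFreq means from results_v15.json, as quoted above.
v15_hifreq = 0.8537534872690836
singlu_hifreq = 1.0171122153600056

# Fractional MSE reduction of v15 relative to SinGLU.
improvement = 1 - v15_hifreq / singlu_hifreq
print(f"{100 * improvement:.1f}% lower MSE than SinGLU")
```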

---

## 9. Architecture Equations

All architecture equations in the PAPER and FINDINGS_SUMMARY were transcribed from the Python source code. The equations match the conceptual descriptions used throughout the conversation and are consistent with the results produced. A full audit would require reading all 12 benchmark.py files; this has not been done.

---

## 10. Overall Assessment

| Aspect | Status | Notes |
|--------|--------|-------|
| Core conclusions (SinGLU wins 5/6, adaptive fails) | ✅ **Supported** | All verified against v6 raw data |
| Quantitative ratios (168K×, 245×, 4×) | ⚠️ **Rounded** | Raw ratios: 168,327×, 222.5×, 4.01×. Conclusions identical. |
| OOD degradation claims | ✅ **Supported** | All ratios verified within 1% |
| Gradient norm trajectories | ⚠️ **Selective** | Vanilla omits 0.28 spike at epoch 600 |
| Alpha stuck at 0.5 | ✅ **Supported** | Actual range 0.45–0.51, std ~0.05 |
| Omega frozen | ✅ **Supported** | Std <0.25 across all tasks |
| Width table | ⚠️ **Illustrative** | Numbers don't match specific tasks; use 65%, not 67% |
| Killer experiment values | ✅ **Exact** | All 8 values verified against v15.json |
| v15 HiFreq win | ✅ **Exact** | 0.8538 vs 1.0171, only win vs SinGLU in study |
| Task win counts per version | ✅ **Supported** | All verified against respective JSON files |

**Verdict:** The paper's qualitative conclusions are fully supported. Quantitative claims use rounded inputs, producing slightly different ratios than the raw data (~0.3–10% deviation). The width table uses illustrative rather than task-specific numbers. No findings are contradicted by the raw data.