anshdadhich committed on
Commit
9da8374
·
verified ·
1 Parent(s): 887b660

Delete CORRECTIONS.md

Files changed (1)
  1. CORRECTIONS.md +0 -288
CORRECTIONS.md DELETED
@@ -1,288 +0,0 @@

# Data Verification & Corrections

This document verifies every quantitative claim in PAPER.md and FINDINGS_SUMMARY.md against the raw JSON results, noting exact values, rounding, and source versions.

**Status: All core conclusions are supported by data. Minor rounding discrepancies noted below.**

---

## 1. Core Results Table (Section 3.1 of PAPER)

Source: `results_v6.json` (3 seeds, full benchmark with Vanilla/SinGLU/Hybrid/Adaptive)

### Memorization Improvement

**PAPER claim:** 168,817× improvement (Vanilla 0.157 → SinGLU 9.3×10⁻⁷)

**Raw data:**
- Vanilla mean: 0.15677066644032797
- SinGLU mean: 9.313488410119438×10⁻⁷
- Actual ratio: **168,327×**

**Verification:** The PAPER uses rounded inputs (0.157 / 9.3×10⁻⁷ = 168,817). The raw ratio is 168,327×. Difference: 0.3%. **Conclusion unchanged.**
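
The rounding effect is easy to reproduce. A minimal check using the raw means quoted above, contrasting the ratio from unrounded values with the ratio from the PAPER's rounded inputs:

```python
# Raw means from results_v6.json (Memorize task), as quoted above.
vanilla_raw = 0.15677066644032797
singlu_raw = 9.313488410119438e-7

raw_ratio = vanilla_raw / singlu_raw   # ratio from unrounded inputs (~168,327)
rounded_ratio = 0.157 / 9.3e-7         # ratio from the PAPER's rounded inputs (~168,817)

print(f"raw: {raw_ratio:,.0f}x  rounded: {rounded_ratio:,.0f}x")
# The two differ by ~0.3%; both support the same conclusion.
```

The same pattern explains the Nested (245× vs 222.5×) and Complex (4.14× vs 4.01×) discrepancies below: dividing rounded numbers amplifies small input errors into percent-level ratio shifts.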

### Nested Function Improvement

**PAPER claim:** 245× improvement (Vanilla 0.049 → SinGLU 0.0002)

**Raw data:**
- Vanilla mean: 0.0486922413110733
- SinGLU mean: 0.00021879897879747054
- Actual ratio: **222.5×**

**Verification:** The PAPER uses rounded inputs (0.049 / 0.0002 = 245). The raw ratio is 222.5×. **Conclusion unchanged (SinGLU wins by 2+ orders of magnitude).**

### Complex Function Improvement

**PAPER claim:** 4.1× improvement (Vanilla 0.058 → SinGLU 0.014)

**Raw data:**
- Vanilla mean: 0.0574857605000337
- SinGLU mean: 0.014336802531033754
- Actual ratio: **4.01×**

**Verification:** The PAPER uses rounded inputs (0.058 / 0.014 ≈ 4.14). The raw ratio is 4.01×. **Conclusion unchanged.**

### Checkerboard Improvement

**PAPER claim:** +35.9 percentage points (Vanilla 57.9% → SinGLU 93.8%)

**Raw data:**
- Vanilla mean: 0.5788888931274414 (57.89%)
- SinGLU mean: 0.9377777775128683 (93.78%)
- Difference: 0.3589 = **+35.9 pts** ✓ **Exact.**

### Spiral Result

**PAPER claim:** Vanilla 85.1% vs SinGLU 44.2%

**Raw data (v6):**
- Vanilla mean: 0.851111114025116 (85.11%) ✓
- SinGLU mean: 0.4422222177187602 (44.22%) ✓

**Note:** v5 shows Vanilla 90.2% vs SinGLU 44.4% (different seed sampling), but the qualitative result is the same: Vanilla wins Spiral. The PAPER uses v6 values consistently.

### High-Frequency Signal

**PAPER claim:** Vanilla 1.10 vs SinGLU 1.02

**Raw data (v6):**
- Vanilla mean: 1.099808136622111 ✓
- SinGLU mean: 1.0171122153600056 ✓

---

## 2. OOD Results (Section 3.6 of PAPER)

Source: `results_v6.json`, `ood` section

| Model | ID MSE (raw) | OOD MSE (raw) | Degradation (raw) | PAPER claims |
|-------|-------------|--------------|-------------------|-------------|
| Vanilla | 0.21685 | 1.53195 | **7.06×** | 0.217 / 1.53 / 7.1× ✓ |
| SinGLU | 0.24568 | 5.89794 | **24.01×** | 0.246 / 5.90 / 24.0× ✓ |
| v10 | 0.00390 | 4.96385 | **1273.2×** | 0.004 / 4.96 / 1273× ✓ |
| v15 | 0.01042 | 4.37580 | **419.8×** | 0.010 / 4.38 / 420× ✓ |

**All OOD claims verified.** Minor rounding differences (<1%) in all cases.
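
The degradation factors can be recomputed directly from the table. Note the MSE values shown are themselves rounded to five decimals, so the recomputed factors land within a fraction of a percent of the raw-data factors rather than matching exactly:

```python
# Per-model (ID MSE, OOD MSE, claimed degradation factor), from the table above.
rows = {
    "Vanilla": (0.21685, 1.53195, 7.06),
    "SinGLU":  (0.24568, 5.89794, 24.01),
    "v10":     (0.00390, 4.96385, 1273.2),
    "v15":     (0.01042, 4.37580, 419.8),
}

# Degradation = how many times worse the model is out-of-distribution.
factors = {model: ood / id_mse for model, (id_mse, ood, _claimed) in rows.items()}
for model, factor in factors.items():
    print(f"{model}: {factor:.1f}x (claimed {rows[model][2]}x)")
```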

---

## 3. Gradient Norms (Section 4.3 of PAPER)

Source: `results_v5.json`, `gradient_norms` section

**PAPER table (simplified):**

| Model | Values shown |
|-------|-------------|
| Vanilla | 0.64 → 0.33 → 0.23 → 0.16 |
| SinGLU | 19.5 → 14.9 → 5.1 → 1.3 → 0.4 |
| Shared (S2) | 1159 → 884 → 904 → 714 → 174 |

**Raw data (5 values per model, epochs 0/200/400/600/1000):**

| Model | Epoch 0 | Epoch 200 | Epoch 400 | Epoch 600 | Epoch 1000 |
|-------|---------|-----------|-----------|-----------|------------|
| Vanilla | 0.6396 | 0.3283 | 0.2343 | **0.2789** | 0.1642 |
| SinGLU | 19.5319 | 14.9392 | 5.1340 | 1.3400 | 0.4012 |
| Shared | 1159.4 | 884.1 | 904.1 | 714.2 | 174.4 |

**Correction:** The PAPER omits Vanilla's 4th value (0.28 at epoch 600), which briefly rises above the epoch-400 value (0.23), slightly breaking the "smooth decay" narrative. The omission is minor and doesn't change the conclusion (Vanilla has stable gradients; Shared has catastrophic instability). SinGLU and Shared values are fully reported.
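
The correction can be made mechanical with a monotonicity test on the tabulated values. It flags Vanilla's epoch-600 rise, and also Shared's epoch-400 rise (904.1 > 884.1), which is visible in the raw table:

```python
def monotone_decreasing(xs):
    # True if every value is <= its predecessor.
    return all(b <= a for a, b in zip(xs, xs[1:]))

# Raw gradient norms at epochs 0/200/400/600/1000, from the table above.
vanilla = [0.6396, 0.3283, 0.2343, 0.2789, 0.1642]
singlu = [19.5319, 14.9392, 5.1340, 1.3400, 0.4012]
shared = [1159.4, 884.1, 904.1, 714.2, 174.4]

for name, traj in [("Vanilla", vanilla), ("SinGLU", singlu), ("Shared", shared)]:
    print(name, "monotone decay:", monotone_decreasing(traj))
```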

---

## 4. Adaptive Mechanisms Summary Table

### v6 Routing (α): "α stuck at 0.5"

**Raw data (v6 `alpha_analysis`):**

| Task | α mean | α std | PAPER claim |
|------|--------|-------|-------------|
| Complex | 0.4525 | 0.0487 | ~0.5 ✓ |
| Nested | 0.4709 | 0.0659 | ~0.5 ✓ |
| Spiral | 0.4789 | 0.0544 | ~0.5 ✓ |
| Checker | 0.4902 | 0.0459 | ~0.5 ✓ |
| HiFreq | 0.4693 | 0.0616 | ~0.5 ✓ |
| Memorize | 0.5054 | 0.0108 | ~0.5 ✓ |

**PAPER says:** "stuck at sigmoid(0) = 0.50 ± 0.02"

**Correction:** The actual range is 0.45–0.51 with std ~0.05, not ±0.02. The claim that α stays near 0.5 is correct, but the variability is ~2× higher than stated. **Qualitative conclusion unchanged (gate never polarizes).**
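
A short sketch of the correction, using only the table values: sigmoid(0) is exactly 0.5 (an unmoved gate), the per-task means sit in a narrow band around it, and the observed stds exceed the PAPER's ±0.02:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Per-task alpha (mean, std) from v6 alpha_analysis, as tabulated above.
alpha = {
    "Complex": (0.4525, 0.0487), "Nested": (0.4709, 0.0659),
    "Spiral": (0.4789, 0.0544), "Checker": (0.4902, 0.0459),
    "HiFreq": (0.4693, 0.0616), "Memorize": (0.5054, 0.0108),
}
means = [m for m, _ in alpha.values()]
stds = [s for _, s in alpha.values()]

print("sigmoid(0) =", sigmoid(0.0))          # an unmoved gate sits at 0.5
print("mean range:", min(means), "-", max(means))
print("max std:", max(stds))                 # well above the claimed 0.02
```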

### v7 Learnable Frequency (ω): "ω froze at initialization"

**Raw data (v7 `omega_analysis`):**

| Task | ω mean | ω std | Range |
|------|--------|-------|-------|
| Complex | 29.90 | 0.12 | [29.35, 30.32] |
| Nested | 20.02 | 0.08 | [19.81, 20.47] |
| Spiral | 15.02 | 0.23 | [14.04, 15.97] |
| Checker | 20.05 | 0.23 | [18.44, 21.63] |
| HiFreq | 59.99 | 0.04 | [59.85, 60.29] |
| Memorize | 10.04 | 0.21 | [8.96, 11.06] |

**Verification:** Std is extremely low (<0.25) for all tasks, confirming ω barely moves. But the initial values differ across tasks (10–60), suggesting different random initializations converged to local values. The PAPER's claim that "ω stayed at initialization" is **approximately correct per-task** but understates cross-task variation.
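
"Frozen" can be quantified with the coefficient of variation (std/mean), which normalizes the spread by each task's frequency scale. Using the table values, every task stays within about 2% of its mean ω:

```python
# Per-task omega (mean, std) from v7 omega_analysis, as tabulated above.
omega = {
    "Complex": (29.90, 0.12), "Nested": (20.02, 0.08),
    "Spiral": (15.02, 0.23), "Checker": (20.05, 0.23),
    "HiFreq": (59.99, 0.04), "Memorize": (10.04, 0.21),
}

# Coefficient of variation: spread relative to the task's own frequency scale.
rel_stds = {task: std / mean for task, (mean, std) in omega.items()}
for task, rel in rel_stds.items():
    print(f"{task}: {100 * rel:.2f}% relative std")
```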

### v10 Phase: "Phase is easy to optimize"

**Raw data (v10 `phi` diagnostics):**

| Task | φ mean | φ std | Range |
|------|--------|-------|-------|
| Complex | 0.0083 | 0.192 | [-0.686, 0.760] |
| Nested | -0.013 | 0.142 | [-0.617, 0.594] |
| Spiral | -0.038 | 0.242 | [-1.148, 0.678] |
| Checker | -0.019 | 0.207 | [-0.774, 0.798] |
| HiFreq | 0.054 | 0.321 | [-0.923, 0.917] |
| Memorize | 0.003 | 0.206 | [-0.739, 1.073] |

**Verification:** φ shows broad ranges (std 0.14–0.32, spans ±1.0), confirming active learning. This supports the PAPER's claim that phase is optimizable where frequency is not.

### v11 Disciplined Phase: "Phase std ~0.007"

**Raw data (v11 `phi` diagnostics):**

| Task | φ mean | φ std |
|------|--------|-------|
| Complex | 0.00038 | 0.0061 |
| Nested | -0.00047 | 0.0052 |
| Spiral | 0.00027 | 0.0092 |
| Checker | -0.00057 | 0.0072 |
| HiFreq | 0.00206 | 0.0137 |
| Memorize | 0.00016 | 0.0074 |

**Verification:** Std ranges 0.005–0.014, centered near 0. The PAPER says "~0.007", which is accurate for most tasks; HiFreq is an outlier at 0.014. **Conclusion unchanged (disciplined phase is effectively zero).**

---

## 5. Width-Richness Tradeoff Table

**PAPER claims:**

| Architecture | Matrices/Layer | Hidden Dim @ 5K params | % of Vanilla |
|------------|---------------|----------------------|-------------|
| Vanilla | 1 | 64 | 100% |
| SinGLU | 3 | 43 | 67% |
| v9 | 5 | 24 | 38% |
| v15 | 4 | 31 | 48% |

**Raw data verification:**

There is **no single task** where Vanilla hidden=64 and SinGLU hidden=43. The actual values across tasks (v4–v6):

| Task | Vanilla hidden | SinGLU hidden | SinGLU % |
|------|---------------|--------------|----------|
| Complex (4D) | 48 | 31 | 65% |
| Nested (2D) | 37 | 24 | 65% |
| Spiral | 37 | 24 | 65% |
| Checker | 37 | 24 | 65% |
| HiFreq | 62 | 41 | 66% |
| Memorize | 46 | 31 | 67% |

**Correction:** The PAPER uses illustrative numbers (64, 43) that don't match any specific task. The actual SinGLU width is consistently **65%** of Vanilla (not 67%). For v9, actual hidden dims are 16–27 (v9 Complex: 20, v9 Nested: 16, v9 Spiral: 16). For v15, actual hidden dims are 21–35.

**Qualitative conclusion unchanged** (every extra matrix reduces width), but the specific numbers in the PAPER table are rounded/illustrative rather than task-specific.
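
The tradeoff itself can be illustrated with a toy parameter-budget solver. This is a sketch, not the paper's exact architectures: it assumes each of a layer's `m` matrices is h×h plus hypothetical 2-input/1-output projections, so the absolute widths will not match the table; only the shrinking trend is the point:

```python
# Toy model (NOT the benchmarked architectures): m square h-by-h matrices per
# layer, plus assumed d_in=2 / d_out=1 projections.
def params(h, m, d_in=2, d_out=1):
    return d_in * h + m * h * h + h * d_out

def max_width(budget, m):
    # Largest hidden dim h whose parameter count fits the budget.
    h = 1
    while params(h + 1, m) <= budget:
        h += 1
    return h

# Widths at a 5K-parameter budget for 1, 3, 4, 5 matrices per layer.
widths = {m: max_width(5000, m) for m in (1, 3, 4, 5)}
print(widths)  # width strictly shrinks as matrices per layer increase
```

Under these toy assumptions the solver reproduces the qualitative claim (width falls monotonically with matrix count), which is exactly what the corrected table shows with the real architectures.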

---

## 6. Killer Experiments: Frequency Generalization

Source: `results_v15.json`, `freq_gen` and `mixed_freq` sections

### Experiment 1: Train sin(2πx) → Test sin(10πx)

| Model | Train MSE (raw) | Test MSE (raw) | PAPER claims |
|-------|---------------|----------------|-------------|
| Vanilla | 0.3647 | 1.1720 | 0.365 / 1.172 ✓ |
| SinGLU | 2.1663 | **0.7361** | 2.166 / **0.736** ✓ |
| v10 | 0.9693 | 0.9578 | 0.969 / 0.958 ✓ |
| v15 | 0.7184 | 0.9102 | 0.718 / 0.910 ✓ |

**Verification:** All values exact. SinGLU test MSE (0.736) < train MSE (2.166): confirmed.
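
The table's headline oddity, a model scoring better on the unseen frequency than on the one it trained on, is a one-line check over the tabulated values:

```python
# Experiment 1 (train MSE, test MSE) per model, from the table above.
exp1 = {
    "Vanilla": (0.3647, 1.1720),
    "SinGLU": (2.1663, 0.7361),
    "v10": (0.9693, 0.9578),
    "v15": (0.7184, 0.9102),
}

# Models whose test MSE on sin(10*pi*x) is below their train MSE on sin(2*pi*x).
better_ood = [model for model, (train, test) in exp1.items() if test < train]
print(better_ood)
```

SinGLU's gap is large (0.736 vs 2.166); v10's is marginal (0.958 vs 0.969).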

### Experiment 2: Train sin(2πx)+sin(4πx) → Test sin(2πx)+sin(20πx)

| Model | Train MSE (raw) | Test MSE (raw) | PAPER claims |
|-------|---------------|----------------|-------------|
| Vanilla | 0.8824 | 1.3290 | 0.882 / 1.329 ✓ |
| SinGLU | 4.6482 | 1.4905 | 4.648 / 1.491 ✓ |
| v10 | 1.8181 | **1.1781** | 1.818 / **1.178** ✓ |
| v15 | 2.0757 | 1.3173 | 2.076 / 1.317 ✓ |

**Verification:** All values exact.

---

## 7. Adaptive Mechanism Wins (Section 3.3 Table)

| Mechanism | Version | PAPER: Wins vs SinGLU | Raw verification |
|-----------|---------|----------------------|----------------|
| Routing (α) | v6 | 0 | v6 SinGLU beats v6 Adaptive on all 6 tasks ✓ |
| Learnable ω | v7 | 0 | v7 SinGLU beats v7 LearnFreq on 5/6 tasks (HiFreq: LearnFreq 1.11 vs SinGLU 1.02 → SinGLU still wins) ✓ |
| Phase + gate | v8 | 0 | v8 SinGLU beats v8:Phase on all tasks ✓ |
| Controlled | v9 | 1 (Spiral) | v9 Spiral 0.9967 > SinGLU 0.4422 ✓. But v9 also loses on Complex, Nested, Checker, HiFreq, Memorize. |
| Free phase | v10 | 2 (Complex, Spiral) | v10 Complex 0.0080 < SinGLU 0.0143 ✓. v10 Spiral 0.9922 > SinGLU 0.4422 ✓. But note: v10 Spiral 0.9922 also > Vanilla 0.8511. |
| Tiny phase | v11 | 2 (Complex, Spiral) | v11 Complex 0.0074 < SinGLU 0.0143 ✓. v11 Spiral 0.9889 > SinGLU 0.4422 ✓. |
| FM (v12) | v12 | 3 (Complex, Spiral, Checker) | v12 Complex 0.0075 < SinGLU 0.0143 ✓. v12 Spiral 0.8678 > SinGLU 0.4422 ✓. v12 Checker 0.9478 > SinGLU 0.9378 ✓. |
| Aligned (v13) | v13 | 2 (Complex, Checker) | v13 Complex 0.0074 < SinGLU 0.0143 ✓. v13 Checker 0.9478 > SinGLU 0.9378 ✓. v13 Spiral 0.4689 < Vanilla 0.8511 (loses). |

**Note on v10 Spiral:** The PAPER table says v10 won Spiral, which is true (0.9922 vs SinGLU 0.4422). But v10 Spiral also beats Vanilla (0.9922 vs 0.8511). This is v10's biggest win.

**Note on v12 wins:** v12 won 3 tasks vs SinGLU, but the PAPER description says "Actually frequency modulation, not phase"; this is an interpretive claim about mechanism, not a factual error.

---

## 8. v15 Dual-Phase HiFreq Win

**PAPER claim:** "v15 MSE on HiFreq: 0.854 vs SinGLU's 1.017 — first and only win vs SinGLU"

**Raw data (v15):**
- v15 HiFreq mean: 0.8537534872690836 ✓
- SinGLU HiFreq mean: 1.0171122153600056 ✓

**Verification:** Exact. v15 is the only architecture in the entire study to beat SinGLU on HiFreq.
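
In relative terms, the raw means above put v15's HiFreq MSE about 16% below SinGLU's:

```python
# Raw HiFreq means from results_v15.json, as quoted above.
v15_hifreq = 0.8537534872690836
singlu_hifreq = 1.0171122153600056

# Fractional MSE reduction of v15 relative to SinGLU.
improvement = 1 - v15_hifreq / singlu_hifreq
print(f"{100 * improvement:.1f}% lower MSE than SinGLU")
```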

---

## 9. Architecture Equations

All architecture equations in the PAPER and FINDINGS_SUMMARY were transcribed from the Python source code. The equations match the conceptual descriptions used throughout the conversation and are consistent with the results produced. A full audit would require reading all 12 benchmark.py files; this has not been done.

---

## 10. Overall Assessment

| Aspect | Status | Notes |
|--------|--------|-------|
| Core conclusions (SinGLU wins 5/6, adaptive fails) | ✅ **Supported** | All verified against v6 raw data |
| Quantitative ratios (168K×, 245×, 4×) | ⚠️ **Rounded** | Raw ratios: 168,327×, 222.5×, 4.01×. Conclusions identical. |
| OOD degradation claims | ✅ **Supported** | All ratios verified within 1% |
| Gradient norm trajectories | ⚠️ **Selective** | Vanilla omits 0.28 spike at epoch 600 |
| Alpha stuck at 0.5 | ✅ **Supported** | Actual range 0.45–0.51, std ~0.05 |
| Omega frozen | ✅ **Supported** | Std <0.25 across all tasks |
| Width table | ⚠️ **Illustrative** | Numbers don't match specific tasks; use 65%, not 67% |
| Killer experiment values | ✅ **Exact** | All 8 values verified against v15.json |
| v15 HiFreq win | ✅ **Exact** | 0.8538 vs 1.0171, only win vs SinGLU in study |
| Task win counts per version | ✅ **Supported** | All verified against respective JSON files |

**Verdict:** The paper's qualitative conclusions are fully supported. Quantitative claims use rounded inputs, producing slightly different ratios than the raw data (~0.3–10% deviation). The width table uses illustrative rather than task-specific numbers. No findings are contradicted by the raw data.