anshdadhich
/

richneuron-vs-vanilla-benchmark

Model card Files Files and versions

xet

Community

anshdadhich commited on 14 days ago

Commit

0608ec0

verified ·

1 Parent(s): 9da8374

Add 'Our Real Contribution' section to FINDINGS_SUMMARY.md

Browse files

Files changed (1) hide show

FINDINGS_SUMMARY.md +36 -3

FINDINGS_SUMMARY.md CHANGED Viewed

@@ -93,7 +93,7 @@ Our hypothesis: `y = sin(W₁x) ⊙ (W₂x)` — each parameter participates in
 ## The Six Biggest Findings
-### Finding 1: SinGLU Stores 168,327× More Information per Parameter
 **Memorization Task:** 200 random 8D→4D mappings, ~5K parameters.
@@ -195,6 +195,40 @@ All periodic architectures extrapolate their learned oscillations outside the tr
 ---
 ## Architecture Deep Dives
 ### SinGLU (The Winner)
@@ -314,14 +348,13 @@ out = Wo @ (core * v)
 | `results_*.json` | Raw per-seed scores for every version |
 | `PAPER.md` | Full technical report |
 | `FINDINGS_SUMMARY.md` | This file |
-| `CORRECTIONS.md` | Data verification and audit trail |
 ---
 ## What We Proved vs. What We Didn't
 ### Proven:
-1. ✓ Richer neurons store more information per parameter (168,327×)
 2. ✓ SinGLU is optimal at 3K-8K parameter scale
 3. ✓ Adaptive mechanisms fail at small scale due to gradient competition
 4. ✓ Fixed-frequency generalizes to unseen frequencies better than adaptive

 ## The Six Biggest Findings
+### Finding 1: SinGLU Stores 168,326× More Information per Parameter
 **Memorization Task:** 200 random 8D→4D mappings, ~5K parameters.
 ---
+## Our Real Contribution: Why the Regime Map Matters More Than 168,326×
+### The 168,326× Is a Symptom. The Regime Map Is the Diagnosis.
+The headline number is striking, but it's a **toy-task outlier** — memorization is the easiest function to overfit, and the ratio mostly reflects Vanilla failing at 5K params rather than SinGLU being revolutionary.
+Our real finding is a **conditional, predictive, falsifiable map**:
+| Task Structure | Winner | Why | Why Others Lose |
+|---------------|--------|-----|-----------------|
+| **Compositional / multiplicative** | **SinGLU** | `sin(ω·W₁x) ⊙ W₂x` creates cross-terms `xᵢ·xⱼ` matching the function's multiplicative structure | Vanilla ReLU needs extra layers for products; v10's phase adds noise to an already-good fit |
+| **Geometric / rotational** (spirals) | **v10** | `sin(ω·g + φ(x))` rotates decision boundaries via phase shifts | SinGLU's fixed frequency creates radial oscillations misaligned with angular spirals |
+| **Multi-scale frequencies** | **v15** | Dual `ω` + `2ω` channels match signal frequency composition | SinGLU's single frequency can't represent 20Hz and 100Hz simultaneously |
+| **Distribution shift** (OOD) | **Vanilla** | Piecewise linear, no hallucinated oscillations | All periodic models extrapolate waves outside domain — 24-1273× worse |
+| **Unseen frequencies** | **SinGLU** | Fixed frequency basis transfers; learns coefficients, not frequency | Adaptive phase (v10) overfits training frequency; flexible = fragile |
+### Why This Is the Real Finding
+1. **It answers the original question precisely.** "Can we replace `y = Wx + b` with something richer?" → **"It depends on what information."**
+2. **It's predictive.** Given a new task, we can pick the architecture before running it — testable, falsifiable.
+3. **It explains the 168K× rather than being explained by it.** Memorization is a multiplicative task; SinGLU's cross-terms match that structure.
+4. **It has practical value.** Medical diagnosis (OOD) → use ReLU. Audio reconstruction (multi-scale) → try v15. NLP embeddings (compositional) → try SinGLU.
+### The Limitation We Also Found
+All adaptive mechanisms (v6-v13) failed at small scale due to **gradient competition**. No architecture learned to *switch* between regimes dynamically. The routing gates (v6) couldn't do it at 3K-8K params. Maybe at 100K+ they can — that's an open question.
+> **In one sentence:** Neuron design is not one-size-fits-all — it's a conditional choice based on task structure, and we mapped those conditions empirically.
+---
 ## Architecture Deep Dives
 ### SinGLU (The Winner)
 | `results_*.json` | Raw per-seed scores for every version |
 | `PAPER.md` | Full technical report |
 | `FINDINGS_SUMMARY.md` | This file |
 ---
 ## What We Proved vs. What We Didn't
 ### Proven:
+1. ✓ Richer neurons store more information per parameter (168,326×)
 2. ✓ SinGLU is optimal at 3K-8K parameter scale
 3. ✓ Adaptive mechanisms fail at small scale due to gradient competition
 4. ✓ Fixed-frequency generalizes to unseen frequencies better than adaptive