# Beyond Linear Neurons: An Empirical Study of Multiplicative Periodic Architectures at Small Scale

**Authors:** anshdadhich, with adversarial review from two LLM collaborators

**Repository:** [huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark](https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark)

---

## Abstract

We investigate whether replacing the standard neuron computation `y = ReLU(Wx + b)` with richer per-neuron functions can increase information storage and accuracy at fixed parameter budgets. Through 15 architecture iterations tested across 9 tasks (regression, classification, memorization, frequency generalization, and out-of-distribution), we find that **multiplicative periodic neurons** (`sin(ω·W₁x) ⊙ W₂x`, i.e., SinGLU) consistently outperform vanilla MLPs by factors of 4× to ~169,000× on structured tasks. We then systematically test eight adaptive mechanisms (routing, learnable frequency, phase gating, controlled combinations, free phase, scaled phase, frequency modulation, and signal-aligned phase) attempting to improve upon SinGLU. All fail to consistently beat it at small scale (3K-8K parameters). We identify the root cause: every parameter spent on meta-computation (deciding *how* to compute) is stolen from actual computation, and the meta-learning signal is too weak at small scale to justify the cost. Our killer experiments reveal that fixed-frequency architectures generalize better to unseen frequencies than adaptive ones — directly contradicting the intuition that more expressive neurons generalize better.

---

## 1. Introduction

### 1.1 The Question

A standard neural network neuron computes `y = σ(Wx + b)` — a linear transformation followed by a fixed nonlinearity. Each weight parameter participates in exactly one multiply-add operation. We ask: **can replacing this with a richer computation store more information per parameter and achieve better accuracy without increasing total parameter count?**

This question is motivated by recent theoretical results showing that standard transformers store approximately 2 bits of knowledge per parameter ([Allen-Zhu & Li, 2024](https://arxiv.org/abs/2404.05405)), and by architectures like KAN ([Liu et al., 2024](https://arxiv.org/abs/2404.19756)), SIREN ([Sitzmann et al., 2020](https://arxiv.org/abs/2006.09661)), and MONet ([Chrysos et al., 2024](https://arxiv.org/abs/2401.17992)) that propose richer neuron computations.

### 1.2 Approach

We conduct a systematic architecture search, starting from simple modifications and iterating based on empirical results and adversarial critique. Each version tests a specific hypothesis:

| Version | Hypothesis | Architecture |
|---------|-----------|-------------|
| v1 | Multiplicative + periodic > linear | `(W₁x) ⊙ sin(ω·W₂x) + W₁x` |
| v4 | Width penalty can be eliminated | Low-rank, shared-weight, GLU-style variants |
| v5 | Honest multi-seed re-evaluation | 3 seeds, gradient norms, OOD |
| v6 | Adaptive routing (α) can select computation type | `α·periodic + (1-α)·linear` |
| v7 | Learnable frequency adapts per input | `sin(ω(x)·Wx)` |
| v8 | Phase + gate replaces frequency | `sin(ω·Wx + φ(x))` with sigmoid gate |
| v9 | Controlled frequency + phase + gate | Bounded ω(x) + φ(x) + α(x) |
| v10 | Minimal: SinGLU + free phase only | `sin(ω·Wg·x + π·tanh(Wφ·x))` |
| v11 | Disciplined phase (scaled down) | `sin(ω·(Wg·x + 0.1·tanh(Wφ·x)))` |
| v12 | Signal-proportional phase | `sin(ω·g·(1 + 0.2·tanh(Wφ·x)))` |
| v13 | Signal-aligned phase | `sin(ω·g + 0.1·g·tanh(Wφ·x))` |
| v15 | Dual-frequency decomposition | `sin(ωg+βφ) ⊙ (1 + α·sin(2ωg+γφ))` |

All comparisons use **strictly matched parameter budgets** via binary search over hidden dimensions.

---

## 2. Experimental Setup

### 2.1 Tasks

We use 9 tasks spanning different computational demands:

**Regression (lower MSE = better):**
- **Complex Fn (4D):** `f(x) = exp(sin(x₁²+x₂²) + sin(x₃²+x₄²))` — compositional structure ([from the KAN paper](https://arxiv.org/abs/2404.19756))
- **Nested Fn (2D):** `f(x) = sin(π(x₁²+x₂²))·cos(3π·x₁x₂)` — multiplicative + periodic
- **High-Frequency Signal:** `f(x) = sin(20x) + sin(50x) + 0.5·sin(100x)` — pure frequency representation
- **Knowledge Memorization:** 200 random 8D→4D mappings — raw storage capacity test

**Classification (higher accuracy = better):**
- **Two-Spiral:** Interleaving spirals — nonlinear decision boundaries
- **Checkerboard (freq=3):** Feature interaction pattern

**Generalization:**
- **OOD:** Train on `[-1,1]`, test on `[1,2]` for `f(x₁,x₂) = sin(3πx₁)·cos(3πx₂) + x₁x₂`
- **Frequency Generalization:** Train on `sin(2πx)`, test on `sin(10πx)` — can the model represent unseen frequencies?
- **Mixed Frequency:** Train on `sin(2πx)+sin(4πx)`, test on `sin(2πx)+sin(20πx)` — can it decompose and generalize frequency components?
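
As a concrete illustration of the frequency-generalization task, a minimal data-generation sketch (the function name and sample count here are illustrative, not taken from the benchmark scripts):

```python
import torch

def freq_gen_split(n=2048, train_k=2, test_k=10, seed=0):
    """Train on sin(2*pi*x), test on sin(10*pi*x), over the same input range [-1, 1]."""
    g = torch.Generator().manual_seed(seed)
    x_train = torch.rand(n, 1, generator=g) * 2 - 1  # uniform on [-1, 1]
    x_test = torch.rand(n, 1, generator=g) * 2 - 1
    y_train = torch.sin(train_k * torch.pi * x_train)  # seen frequency
    y_test = torch.sin(test_k * torch.pi * x_test)     # unseen frequency
    return (x_train, y_train), (x_test, y_test)
```

Note the test inputs stay in-distribution; only the target frequency changes, which isolates frequency transfer from the domain extrapolation tested by the OOD task.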

### 2.2 Protocol

- **Parameter matching:** Binary search finds the hidden dimension giving the closest match to the target budget for each architecture (see the sketch below this list)
- **3 random seeds** per experiment, reporting mean±std
- **Optimizer:** Adam with cosine annealing LR schedule
- **Gradient clipping:** max norm 1.0
- **Parameter budgets:** 3K-8K depending on task input dimensionality
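
A sketch of that parameter-matching search; `build_model` is a hypothetical factory, and monotonicity of parameter count in hidden width is assumed:

```python
def match_params(build_model, target_params, lo=1, hi=512):
    """Binary-search the hidden width whose parameter count best matches the budget.

    `build_model(h)` returns the candidate nn.Module at hidden width h;
    parameter count is assumed to grow monotonically with h.
    """
    def n_params(h):
        return sum(p.numel() for p in build_model(h).parameters())

    while lo < hi:
        mid = (lo + hi) // 2
        if n_params(mid) < target_params:
            lo = mid + 1
        else:
            hi = mid
    # lo is the smallest width at or above the budget; its neighbor may be closer
    candidates = [h for h in (lo - 1, lo) if h >= 1]
    return min(candidates, key=lambda h: abs(n_params(h) - target_params))
```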

### 2.3 Baselines

- **Vanilla MLP:** `Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear`
- **SinGLU:** `sin(ω·Wg·x) ⊙ Wv·x` projected through `Wo` with LayerNorm. Same structure as SwiGLU ([Shazeer, 2020](https://arxiv.org/abs/2002.05202)) but with `sin()` instead of `Swish()`. Uses the 2/3 hidden-dim trick to match parameter count.
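
A minimal PyTorch sketch of this SinGLU layer, assuming a fixed scalar ω and LayerNorm on the hidden activations (the exact ω value and norm placement in the benchmark scripts may differ):

```python
import torch
import torch.nn as nn

class SinGLU(nn.Module):
    """sin(omega * Wg x) ⊙ (Wv x), projected by Wo, as described above."""

    def __init__(self, d_in, d_hidden, d_out, omega=1.0):
        super().__init__()
        self.wg = nn.Linear(d_in, d_hidden)   # gate branch, inside sin()
        self.wv = nn.Linear(d_in, d_hidden)   # value branch, multiplicative
        self.wo = nn.Linear(d_hidden, d_out)  # output projection
        self.norm = nn.LayerNorm(d_hidden)
        self.omega = omega                    # fixed frequency, not learned

    def forward(self, x):
        h = torch.sin(self.omega * self.wg(x)) * self.wv(x)
        return self.wo(self.norm(h))
```

Because each layer carries three matrices instead of one, the matched-budget hidden width shrinks to roughly 2/3 of the vanilla MLP's — the tradeoff analyzed in Section 3.2.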

---

## 3. Results

### 3.1 Core Finding: Multiplicative Periodic Neurons Beat Vanilla (v1-v5)

On matched parameter budgets, SinGLU consistently outperforms Vanilla MLP:

| Task | Vanilla | SinGLU | Improvement |
|------|---------|--------|-------------|
| Nested Fn | 0.049 | **0.0002** | **245×** |
| Memorization | 0.157 | **9.3e-7** | **168,817×** |
| Complex Fn | 0.058 | **0.014** | **4.1×** |
| Checkerboard | 57.9% | **93.8%** | +35.9 pts |
| High-Freq | 1.10 | **1.02** | 1.08× |
| Spiral | **85.1%** | 44.2% | Vanilla wins |

**SinGLU wins 5/6 standard tasks.** The gains on memorization (168K×) and nested function (245×) are not incremental — they demonstrate a fundamentally different information encoding capacity. The sole loss (Spiral) is due to SinGLU's fixed-frequency basis being unable to form the specific nonlinear decision boundary spirals require.

### 3.2 The Width-Richness Tradeoff (v4)

SinGLU uses 3 matrices per layer (Wg, Wv, Wo) vs Vanilla's 1 (W). At matched parameter budgets, SinGLU gets ~2/3 the hidden width. This is acceptable because SinGLU's per-neuron computation is richer, but it creates a fundamental tension: **every additional matrix for adaptive mechanisms further reduces width.**

### 3.3 Adaptive Mechanisms: Systematic Failure (v6-v13)

We tested 8 different adaptive mechanisms. Results:

| Mechanism | Version | Wins vs SinGLU | Root Cause of Failure |
|-----------|---------|----------------|----------------------|
| Sigmoid routing (α) | v6 | 0 | α stuck at 0.5 — gradient competition |
| Learnable frequency ω(x) | v7 | 0 | ω froze at initialization — oscillatory gradient |
| Phase + gate | v8 | 0 | Gate weak, phase underused |
| Controlled freq + phase + gate | v9 | 1 (Spiral) | 5 matrices → hidden dim 20 vs SinGLU's 31 |
| Free phase | v10 | 2 (Complex, Spiral) | Destroyed HiFreq and OOD |
| Tiny phase (0.1 scale) | v11 | 2 (Complex, Spiral) | Phase std ~0.007 — effectively zero |
| Signal-proportional (FM) | v12 | 3 (Complex, Spiral, Checker) | Actually frequency modulation, not phase |
| Signal-aligned | v13 | 2 (Complex, Checker) | Killed Spiral — phase must be orthogonal to signal for geometry |
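
To make the overhead concrete: every mechanism in this table adds at least one prediction head. A hypothetical v7-style layer with an input-dependent frequency might look like the sketch below; the softplus bound and the +1 offset are illustrative assumptions, not the benchmark's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFreqNeuron(nn.Module):
    """v7-style layer: sin(omega(x) * Wg x) ⊙ (Wv x), with omega predicted per input."""

    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.wg = nn.Linear(d_in, d_hidden)
        self.wv = nn.Linear(d_in, d_hidden)
        self.w_freq = nn.Linear(d_in, d_hidden)  # the extra meta-matrix: predicts omega(x)
        self.wo = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        omega = 1.0 + F.softplus(self.w_freq(x))  # keep the frequency positive
        return self.wo(torch.sin(omega * self.wg(x)) * self.wv(x))
```

At a fixed budget, `w_freq` is paid for by shrinking `d_hidden` — the 10-20% capacity loss quantified in point 2 below.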

**Common pattern across all adaptive versions:**

1. **Meta-learning signal is too weak.** The gradient signal for "how should I compute" is second-order — it depends on how well the branches are already performing. The branches learn useful features via direct gradients, while the routing/gating/frequency mechanism receives indirect signal that's too small to overcome initialization.

2. **Parameter overhead kills width.** Each adaptive matrix reduces hidden dimension by ~4 units at a 3K-5K budget. This is a 10-20% capacity loss that the adaptive mechanism never recovers.

3. **Gradient analysis confirms instability.** From v5's gradient norm tracking:

| Model | Gradient norm trajectory |
|-------|------------------------|
| Vanilla | 0.64 → 0.33 → 0.23 → 0.16 (smooth decay) |
| Shared (S2) | **1159 → 884 → 904 → 714 → 174** (1000× too large) |
| SinGLU | 19.5 → 14.9 → 5.1 → 1.3 → 0.4 (steep but converging) |

### 3.4 Dual-Phase Decomposition: Wins High-Freq (v15)

v15 introduces explicit multi-scale structure:

```
low  = sin(ω·g + β·φ)        # structure channel
high = sin(2ω·g + γ·φ)       # detail channel
core = low ⊙ (1 + 0.3·high)  # AM modulation
```

**This is the first and only architecture to beat SinGLU on High-Frequency Signal** — MSE 0.854 vs 1.017. The dual-frequency basis provides the neuron simultaneous access to ω and 2ω, which is exactly what a sum-of-sinusoids signal needs. However, v15 loses on most other tasks because the AM modulation adds nonlinear coupling that hurts simpler problems.
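
Read as a PyTorch layer, the pseudocode above might look like the following sketch; the linear φ head and the default ω, β, γ values are assumptions where the text does not pin them down (`benchmark_v15.py` is authoritative):

```python
import torch
import torch.nn as nn

class DualPhase(nn.Module):
    """v15-style dual-frequency neuron: low ⊙ (1 + 0.3·high)."""

    def __init__(self, d_in, d_hidden, d_out, omega=1.0, beta=1.0, gamma=1.0):
        super().__init__()
        self.wg = nn.Linear(d_in, d_hidden)    # g: main signal
        self.wphi = nn.Linear(d_in, d_hidden)  # phi: phase signal
        self.wo = nn.Linear(d_hidden, d_out)
        self.omega, self.beta, self.gamma = omega, beta, gamma

    def forward(self, x):
        g, phi = self.wg(x), self.wphi(x)
        low = torch.sin(self.omega * g + self.beta * phi)        # structure channel
        high = torch.sin(2 * self.omega * g + self.gamma * phi)  # detail channel
        return self.wo(low * (1 + 0.3 * high))                   # AM modulation
```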

### 3.5 Killer Experiments: Frequency Generalization

The most revealing experiments test whether models can generalize to unseen frequencies:

**Experiment 1: Train sin(2πx) → Test sin(10πx)**

| Model | Train MSE | Test MSE (unseen 10πx) | Gap |
|-------|-----------|----------------------|-----|
| Vanilla | 0.365 | 1.172 | 3.2× |
| **SinGLU** | 2.166 | **0.736** | **0.3×** ← better on test than train |
| v10 | 0.969 | 0.958 | 1.0× |
| v15 | 0.718 | 0.910 | 1.3× |

**SinGLU shows the best frequency generalization despite the worst training fit.** Its fixed-frequency basis acts as an inductive bias that transfers across frequency scales. The adaptive variants (v10, v15) overfit the training frequency and generalize worse.

**Experiment 2: Train sin(2πx)+sin(4πx) → Test sin(2πx)+sin(20πx)**

| Model | Train MSE | Test MSE (unseen 20πx mix) |
|-------|-----------|---------------------------|
| Vanilla | 0.882 | 1.329 |
| SinGLU | 4.648 | 1.491 |
| **v10** | 1.818 | **1.178** |
| v15 | 2.076 | 1.317 |

v10's free phase helps decompose mixed frequency components — but the margins are small and within noise.

### 3.6 OOD Generalization

All periodic architectures fail on out-of-distribution data (train on [-1,1], test on [1,2]):

| Model | ID MSE | OOD MSE | Degradation |
|-------|--------|---------|-------------|
| **Vanilla** | 0.217 | **1.53** | **7.1×** |
| SinGLU | 0.246 | 5.90 | 24.0× |
| v10 | 0.004 | 4.96 | 1,273× |
| v15 | 0.010 | 4.38 | 420× |

**Vanilla's OOD robustness is unmatched.** Periodic activations extrapolate their learned oscillations outside the training domain, producing hallucinated patterns. This is a fundamental limitation of sinusoidal representations, not an implementation issue.

---

## 4. Analysis

### 4.1 Why SinGLU Wins

SinGLU's dominance can be attributed to three factors:

1. **100% of parameters do computation.** No routing, gating, or frequency prediction matrices. Every parameter directly encodes features.

2. **Multiplicative interaction captures cross-terms.** `sin(ω·W₁x) ⊙ W₂x` produces terms of the form `sin(ω·wᵢᵀx)·wⱼᵀx`, which include the products `xₖ·xₗ` that a linear layer cannot represent (see the expansion after this list). This is the same insight behind GLU variants in modern LLMs.

3. **Fixed frequency is a feature, not a bug.** Fixed ω provides a consistent frequency basis that transfers across inputs and even across frequency scales (as shown in the killer experiment). Adaptive frequency mechanisms add flexibility but lose this consistency.
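
To make the cross-term claim in point 2 explicit: for small gate pre-activations, `sin(z) ≈ z`, so a single product term already contains every pairwise input product (higher-order harmonics of the sine add further polynomial terms):

```latex
\sin\!\left(\omega\, w_i^{\top} x\right)\left(w_j^{\top} x\right)
  \;\approx\; \omega \left(w_i^{\top} x\right)\left(w_j^{\top} x\right)
  \;=\; \omega \sum_{k,\ell} w_{ik}\, w_{j\ell}\, x_k\, x_{\ell}
```

A purely linear layer produces only terms linear in each `xₖ`; the quadratic products above are capacity a linear map cannot express at any width.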

### 4.2 The Regime Map

Our experiments reveal five distinct computational regimes:

| Regime | Best Architecture | Why |
|--------|------------------|-----|
| Structured functions (compositional, multiplicative) | SinGLU or v10 (free phase) | Periodic basis + cross-terms match function structure |
| Geometric decision boundaries (spirals, nonlinear classification) | v10 (free phase) | Phase shifts rotate decision boundaries |
| Multi-scale signals (sum of sinusoids) | v15 (dual-phase) | Explicit access to multiple frequency channels |
| Out-of-distribution robustness | Vanilla MLP | Simplicity = less overfitting to training distribution |
| Frequency generalization (unseen frequencies) | SinGLU | Fixed frequency basis transfers; adaptive basis overfits |

**No single architecture dominates all regimes.** This is consistent with the No Free Lunch theorem — every inductive bias that helps on one task class necessarily hurts on another.

### 4.3 Insights on Neural Network Optimization

Our adaptive mechanism experiments (v6-v13) revealed a consistent failure pattern that constitutes a finding in its own right:

> **Neural networks refuse to learn meta-computation when direct computation is available.**

Specifically:
- **Routing gates (v6):** α stays at sigmoid(0) = 0.5 — the network adjusts branch weights instead of the gate.
- **Learnable frequency (v7):** ω stays at initialization — the network adjusts W_per instead of ω.
- **Phase predictors (v8-v13):** Phase learns small perturbations at best — the network adjusts Wg instead of Wφ.

The root cause is **gradient competition**: meta-parameters receive second-order gradient signal (how changing the computation type would improve the already-optimized branches), while branch parameters receive first-order signal (how to directly reduce loss). At small scale with limited training, first-order always wins.

This parallels known results in meta-learning, neural architecture search, and mixture-of-experts, where explicit auxiliary losses (load balancing, architecture rewards) are required to train the meta-mechanism; a sketch of both the v6 gate and one such auxiliary loss follows.
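
A minimal sketch of those two pieces: a v6-style routed neuron whose α starts at sigmoid(0) = 0.5, plus an illustrative decisiveness penalty of the kind such auxiliary losses take. Neither is the benchmark's exact code, and the penalty was not part of the experiments here.

```python
import torch
import torch.nn as nn

class RoutedNeuron(nn.Module):
    """v6-style routing: alpha(x)·periodic + (1 - alpha(x))·linear."""

    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.w_per = nn.Linear(d_in, d_hidden)   # periodic branch
        self.w_lin = nn.Linear(d_in, d_hidden)   # linear branch
        self.w_gate = nn.Linear(d_in, d_hidden)  # meta: decides how to compute
        nn.init.zeros_(self.w_gate.weight)       # gate starts at sigmoid(0) = 0.5
        nn.init.zeros_(self.w_gate.bias)

    def forward(self, x):
        alpha = torch.sigmoid(self.w_gate(x))
        return alpha * torch.sin(self.w_per(x)) + (1 - alpha) * self.w_lin(x)

def gate_decisiveness_penalty(alpha):
    # alpha*(1 - alpha) peaks at 0.5, so minimizing this pushes the gate
    # toward hard routing decisions instead of the stuck 50/50 mixture.
    return (alpha * (1 - alpha)).mean()
```

Adding such a penalty to the task loss gives the gate first-order signal of its own, which is exactly what the open question in Section 5.3 proposes testing.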

---

## 5. Conclusions

### 5.1 Confirmed

1. **Replacing y = ReLU(Wx + b) with richer per-neuron computation increases information density.** The memorization test showed 168,817× lower MSE at matched parameters — each parameter encodes dramatically more information when participating in multiplicative periodic computation.

2. **SinGLU (`sin(ω·W₁x) ⊙ W₂x`) is the optimal neuron design at small scale.** It wins 4-5 out of 9 tasks consistently across all comparisons. The 2/3 width trick from the GLU literature makes it parameter-efficient.

3. **Different tasks favor different neuron types.** Geometric tasks favor free phase (v10), multi-scale signal tasks favor dual-phase (v15), and OOD robustness favors vanilla ReLU. This is a spectrum, not a single optimum.

### 5.2 Refuted

4. **Adaptive per-neuron computation does not pay at small scale (3K-8K params).** Every adaptive mechanism tested (8 variants) either matched or underperformed SinGLU. The meta-learning signal is too weak relative to direct weight learning.

5. **More expressive neurons do NOT generalize better to unseen frequencies.** The killer experiment showed that fixed-frequency SinGLU generalizes better than adaptive variants — directly contradicting the intuition that expressiveness aids generalization.

6. **Periodic activations do NOT improve OOD robustness.** All sinusoidal architectures showed 24-1,273× degradation on OOD data, vs Vanilla's 7×. Periodic neurons hallucinate oscillations outside the training domain.

### 5.3 Open Questions

- Do the adaptive mechanisms (v6-v13) that failed at small scale succeed at 100K+ parameters, where the width penalty becomes negligible?
- Can explicit auxiliary losses (analogous to MoE load balancing) make phase/frequency prediction trainable?
- Does v15's dual-phase decomposition scale to real signal processing tasks (audio, images)?

---

## 6. Reproducibility

All code is available at [huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark](https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark):

| File | Description |
|------|-------------|
| `benchmark.py` | v1: Original RichNeuron vs Vanilla |
| `benchmark_v4.py` | v4: Width-fix strategies (LowRank, Shared, SinGLU) |
| `benchmark_v5.py` | v5: Honest re-eval (multi-seed, grad norms, OOD) |
| `benchmark_v6.py` | v6: Adaptive routing neuron |
| `benchmark_v7.py` | v7: Learnable frequency neuron |
| `benchmark_v8.py` | v8: Adaptive phase + amplitude gate |
| `benchmark_v9.py` | v9: Controlled frequency + phase + gate |
| `benchmark_v10.py` | v10: SinGLU + free phase |
| `benchmark_v11.py` | v11: SinGLU + disciplined phase |
| `benchmark_v12.py` | v12: SinGLU + signal-proportional phase (FM) |
| `benchmark_v13.py` | v13: SinGLU + aligned phase + corr(g,φ) analysis |
| `benchmark_v15.py` | v15: Dual-phase decomposition + killer experiments |
| `results_*.json` | Raw results with per-seed scores |

All experiments run on CPU with PyTorch. Total compute: ~4 hours on a 2-vCPU machine.

---

## References

- Allen-Zhu, Z., & Li, Y. (2024). Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. *arXiv:2404.05405*
- Liu, Z., et al. (2024). KAN: Kolmogorov-Arnold Networks. *arXiv:2404.19756*
- Sitzmann, V., et al. (2020). Implicit Neural Representations with Periodic Activation Functions. *NeurIPS 2020*. *arXiv:2006.09661*
- Chrysos, G., et al. (2024). Multilinear Operator Networks. *ICLR 2024*. *arXiv:2401.17992*
- Shazeer, N. (2020). GLU Variants Improve Transformer. *arXiv:2002.05202*
- Hoff, S., et al. (2024). Efficient Learning with Sine-Activated Low-rank Matrices. *arXiv:2403.19243*
- Xu, J., et al. (2024). Densing Law of LLMs. *arXiv:2412.04315*
- Cho, Y., et al. (2022). FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning. *ICLR 2022*. *arXiv:2108.06098*