# Beyond Linear Neurons: RichNeuron Benchmark

**Can replacing `y = ReLU(Wx + b)` with richer per-neuron computation store more information per parameter?**

**Answer: Yes — up to 168,327× more at matched parameter budgets.** But every adaptive mechanism we tried failed at small scale.

---

## The Core Finding

We replaced the standard neural network neuron with 15 progressively more complex architectures. At matched parameter budgets (3K-8K params):

- **SinGLU** (`sin(ω·W₁x) ⊙ W₂x`) — a multiplicative periodic neuron inspired by the GLU literature — **wins 5/6 standard tasks** against vanilla ReLU MLPs
- **168,327× lower MSE** on a 200-point memorization task
- **222× lower MSE** on nested multiplicative periodic functions
- **+35.9 percentage points** on checkerboard classification

But **every adaptive mechanism failed** — routing gates, learnable frequencies, phase predictors, and multi-scale decompositions all either matched or underperformed SinGLU at this scale.
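
For concreteness, a minimal PyTorch sketch of the SinGLU neuron defined above; layer names and the default ω are illustrative, not the benchmark's exact code:

```python
import torch
import torch.nn as nn

class SinGLU(nn.Module):
    """sin(omega * Wg x) ⊙ (Wv x): a multiplicative periodic neuron."""
    def __init__(self, d_in, d_out, omega=1.0):
        super().__init__()
        self.gate = nn.Linear(d_in, d_out)   # Wg drives the periodic gate
        self.value = nn.Linear(d_in, d_out)  # Wv is the linear value path
        self.omega = omega                   # fixed frequency, not learned

    def forward(self, x):
        return torch.sin(self.omega * self.gate(x)) * self.value(x)
```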

---

## Architecture Search (15 Versions)

| Version | Hypothesis | Key Equation | Verdict |
|---------|-----------|-------------|---------|
| **B1: Vanilla** | Baseline | `ReLU(Wx + b)` | OOD-robust but limited expressivity |
| **B2: SinGLU** | GLU-style periodic | `sin(ω·Wg·x) ⊙ Wv·x` | **Best overall — wins 5/6 tasks** |
| v1 | Multiplicative + periodic + residual | `(W₁x) ⊙ sin(ω·W₂x) + W₁x` | Killed by width penalty |
| v4 | Width-fix strategies | Low-rank, shared-weight, GLU-style | SinGLU emerges as best width fix |
| v6 | Adaptive routing gate | `α(x)·periodic + (1-α)·linear` | α stuck at 0.5 — never learns |
| v7 | Learnable frequency | `sin(ω(x)·Wx)` | ω frozen at initialization |
| v8 | Phase + amplitude gate | `sin(ω·Wx + φ(x))` with sigmoid | Gate weak, phase underused |
| v9 | Controlled freq + phase + gate | Bounded ω(x) + φ(x) + α(x) | 5 matrices → too narrow |
| **v10** | Free phase | `sin(ω·Wg·x + π·tanh(Wφ·x))` | **Best adaptive — wins Spiral + Complex** |
| v11 | Disciplined phase (scaled) | `sin(ω·(g + 0.1·tanh(φ)))` | Phase ~0 — basically SinGLU |
| v12 | Signal-proportional (FM) | `sin(ω·g·(1 + 0.2·tanh(φ)))` | Frequency modulation, not phase |
| v13 | Signal-aligned phase | `sin(ω·g + 0.1·g·tanh(φ))` | Wins Checker, kills Spiral |
| **v15** | Dual-phase decomposition | `sin(ωg+βφ) ⊙ (1+α·sin(2ωg+γφ))` | **First to beat SinGLU on HiFreq** |
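
To make v10's "free phase" concrete, here is a hedged sketch; the table shows only the gate term, so the `⊙ Wv·x` value path is an assumption carried over from SinGLU:

```python
import math
import torch
import torch.nn as nn

class FreePhaseSinGLU(nn.Module):
    """v10 sketch: sin(omega * Wg x + pi * tanh(Wphi x)) ⊙ (Wv x).
    Value path assumed from SinGLU; three weight matrices per layer."""
    def __init__(self, d_in, d_out, omega=1.0):
        super().__init__()
        self.gate = nn.Linear(d_in, d_out)   # Wg
        self.phase = nn.Linear(d_in, d_out)  # Wphi: input-dependent phase
        self.value = nn.Linear(d_in, d_out)  # Wv
        self.omega = omega

    def forward(self, x):
        phi = math.pi * torch.tanh(self.phase(x))  # phase bounded to (-pi, pi)
        return torch.sin(self.omega * self.gate(x) + phi) * self.value(x)
```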

---

## Complete Results (All Versions × 9 Tasks)

### Regression (MSE ↓)

| Model | Complex (4D) | Nested (2D) | HiFreq | Memorize |
|-------|-------------|-------------|--------|----------|
| Vanilla | 0.0575 | 0.0487 | 1.10 | 0.1568 |
| SinGLU | 0.0143 | **0.0002** | 1.02 | **9.3e-7** |
| v10 | **0.0080** | 0.0004 | 1.22 | 1.7e-5 |
| v15 | 0.0316 | 0.0051 | **0.85** | 1.0e-5 |

### Classification (Accuracy ↑)

| Model | Spiral | Checkerboard |
|-------|--------|-------------|
| Vanilla | 85.1% | 57.9% |
| SinGLU | 44.2% | **93.8%** |
| v10 | **99.2%** | **93.8%** |
| v15 | 98.9% | 90.0% |

### Generalization (MSE ↓)

| Model | OOD [-1,1]→[1,2] | Freq Gen 2π→10π | Mixed Freq 4π→20π |
|-------|-------------------|------------------|-------------------|
| **Vanilla** | **1.53** | 1.172 | 1.329 |
| SinGLU | 5.90 | **0.736** | 1.491 |
| v10 | 4.96 | 0.958 | **1.178** |
| v15 | 4.38 | 0.910 | 1.317 |

---

## The Six Biggest Findings

1. **SinGLU stores 168,327× more information per parameter** on memorization tasks
2. **SinGLU wins 5/6 standard tasks** — but no universal best neuron exists (Spiral needs v10's free phase)
3. **Every adaptive mechanism fails at small scale** due to gradient competition between meta-parameters and direct computation
4. **The width-richness tradeoff is severe** — every extra matrix steals ~35% of hidden width
5. **Fixed-frequency generalizes better than adaptive** — SinGLU trains worse on `sin(2πx)` but tests better on `sin(10πx)` (unseen frequency)
6. **Periodic neurons hallucinate on OOD** — all sinusoidal architectures show 24-1273× degradation on distribution shift vs Vanilla's 7×
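
A back-of-envelope for finding 4, ignoring biases and the output head (so the numbers are illustrative only):

```python
# Hidden width affordable at a fixed ~5K-parameter budget when one layer
# uses k parallel d_in -> h weight matrices (biases and head ignored).
budget, d_in = 5000, 8
for name, k in [("Vanilla (1 matrix)", 1), ("SinGLU (2)", 2), ("v10 (3)", 3)]:
    print(f"{name}: hidden width ~{budget // (k * d_in)}")
# Going from 2 matrices to 3 cuts width by 1/3, roughly the ~35% quoted above.
```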

---

## Task-Specific Regime Map

| Task Type | Best Architecture | Why |
|-----------|------------------|-----|
| Compositional / multiplicative | SinGLU | Cross-terms match function structure |
| Geometric / rotational (spirals) | v10 (free phase) | Phase shifts rotate boundaries |
| Multi-scale frequencies (audio) | v15 (dual-phase) | Explicit ω + 2ω channels |
| Pure memorization | SinGLU | Maximum information density |
| OOD / distribution shift | Vanilla ReLU | No hallucinated oscillations |
| Unseen frequency generalization | SinGLU | Fixed basis transfers across scales |

---

## Repository Files

### Benchmarks (one per version)

| File | Contains |
|------|----------|
| `benchmark.py` | v1 — original RichNeuron vs Vanilla |
| `benchmark_v4.py` | v4 — width-fix strategies (LowRank, Shared, SinGLU) |
| `benchmark_v5.py` | v5 — honest re-eval (3 seeds, gradient norms, OOD) |
| `benchmark_v6.py` | v6 — adaptive routing neuron |
| `benchmark_v7.py` | v7 — learnable frequency neuron |
| `benchmark_v8.py` | v8 — adaptive phase + amplitude gate |
| `benchmark_v9.py` | v9 — controlled freq + phase + gate |
| `benchmark_v10.py` | v10 — SinGLU + free phase |
| `benchmark_v11.py` | v11 — SinGLU + disciplined phase |
| `benchmark_v12.py` | v12 — SinGLU + signal-proportional phase (FM) |
| `benchmark_v13.py` | v13 — SinGLU + aligned phase + correlation analysis |
| `benchmark_v15.py` | v15 — dual-phase decomposition + killer experiments |

### Results & Reports

| File | Contains |
|------|----------|
| `results.json` | v1 raw results |
| `results_v4.json` through `results_v15.json` | Per-version raw JSON with per-seed scores |
| `PAPER.md` | Full technical report with analysis |
| `FINDINGS_SUMMARY.md` | Complete architecture catalog and all results tables |
| `CORRECTIONS.md` | Data verification and audit trail |

---

## Quick Start

```bash
pip install torch numpy
python benchmark_v10.py   # Run the best adaptive variant
python benchmark_v15.py   # Run dual-phase + killer experiments
```

All benchmarks use CPU-only PyTorch and complete in ~15 minutes each.

---

## Reproducibility

- **Hardware:** CPU-only (2 vCPU, 8GB RAM)
- **Total runtime:** ~4 hours for all 12 benchmarks
- **Framework:** PyTorch
- **Seeds:** 3 random seeds per experiment
- **Statistical significance:** Mean ± std reported; all claims based on consistent seed-wise ordering
- **Parameter matching:** Binary search over hidden dimensions to match budgets within ~5%
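
A minimal sketch of that parameter-matching step, assuming a `make_model(width)` factory; the function names are hypothetical, not the repo's API:

```python
import torch.nn as nn

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def match_budget(make_model, budget: int, lo: int = 1, hi: int = 4096) -> int:
    """Binary-search the hidden width whose parameter count first reaches
    the budget; counts are monotone in width, so the search converges."""
    while lo < hi:
        mid = (lo + hi) // 2
        if n_params(make_model(mid)) < budget:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Example: a 1-hidden-layer ReLU MLP matched to ~5K parameters.
width = match_budget(
    lambda h: nn.Sequential(nn.Linear(4, h), nn.ReLU(), nn.Linear(h, 1)), 5000
)
```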

---

## Citation

```bibtex
@misc{richneuron2025,
  title={Beyond Linear Neurons: An Empirical Study of Multiplicative Periodic 
         Architectures at Small Scale},
  author={anshdadhich},
  year={2025},
  url={https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark}
}
```

---

## References

- [KAN: Kolmogorov-Arnold Networks](https://arxiv.org/abs/2404.19756) — learnable spline activations
- [MONet: Multilinear Operator Networks](https://arxiv.org/abs/2401.17992) — multiplicative interactions
- [SIREN](https://arxiv.org/abs/2006.09661) — periodic activation functions
- [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) — SwiGLU / the 2/3 width trick
- [Sine-Activated Low-Rank Matrices](https://arxiv.org/abs/2403.19243) — sin() provably increases rank
- [Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405) — ~2 bits/param for standard transformers
- [Densing Law of LLMs](https://arxiv.org/abs/2412.04315) — capability density trends