File size: 7,649 Bytes
887b660 a00a45a 887b660 a00a45a 887b660 a00a45a ebd7a96 a00a45a 887b660 ebd7a96 887b660 ebd7a96 887b660 ebd7a96 887b660 ebd7a96 887b660 ebd7a96 887b660 ebd7a96 887b660 a00a45a 887b660 ebd7a96 887b660 a00a45a 887b660 a00a45a 887b660 a00a45a 887b660 a00a45a ebd7a96 a00a45a 887b660 a00a45a 887b660 a00a45a 887b660 a00a45a 887b660 a00a45a 887b660 a00a45a 887b660 a00a45a ebd7a96 887b660 a00a45a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 | # Beyond Linear Neurons: RichNeuron Benchmark
**Can replacing `y = ReLU(Wx + b)` with richer per-neuron computation store more information per parameter?**
**Answer: Yes — up to 168,327× more at matched parameter budgets.** But every adaptive mechanism we tried failed at small scale.
---
## The Core Finding
We replaced the standard neural network neuron with 15 progressively more complex architectures. At matched parameter budgets (3K-8K params):
- **SinGLU** (`sin(ω·W₁x) ⊙ W₂x`) — a multiplicative periodic neuron inspired by the GLU literature — **wins 5/6 standard tasks** against vanilla ReLU MLPs
- **168,327× lower MSE** on a 200-point memorization task
- **222× lower MSE** on nested multiplicative periodic functions
- **+35.9 percentage points** on checkerboard classification
But **every adaptive mechanism failed** — routing gates, learnable frequencies, phase predictors, and multi-scale decompositions all either matched or underperformed SinGLU at this scale.
---
## Architecture Search (15 Versions)
| Version | Hypothesis | Key Equation | Verdict |
|---------|-----------|-------------|---------|
| **B1: Vanilla** | Baseline | `ReLU(Wx + b)` | OOD-robust but limited expressivity |
| **B2: SinGLU** | GLU-style periodic | `sin(ω·Wg·x) ⊙ Wv·x` | **Best overall — wins 5/6 tasks** |
| v1 | Multiplicative + periodic + residual | `(W₁x) ⊙ sin(ω·W₂x) + W₁x` | Killed by width penalty |
| v4 | Width-fix strategies | Low-rank, shared-weight, GLU-style | SinGLU emerges as best width fix |
| v6 | Adaptive routing gate | `α(x)·periodic + (1-α)·linear` | α stuck at 0.5 — never learns |
| v7 | Learnable frequency | `sin(ω(x)·Wx)` | ω frozen at initialization |
| v8 | Phase + amplitude gate | `sin(ω·Wx + φ(x))` with sigmoid | Gate weak, phase underused |
| v9 | Controlled freq + phase + gate | Bounded ω(x) + φ(x) + α(x) | 5 matrices → too narrow |
| **v10** | Free phase | `sin(ω·Wg·x + π·tanh(Wφ·x))` | **Best adaptive — wins Spiral + Complex** |
| v11 | Disciplined phase (scaled) | `sin(ω·(g + 0.1·tanh(φ)))` | Phase ~0 — basically SinGLU |
| v12 | Signal-proportional (FM) | `sin(ω·g·(1 + 0.2·tanh(φ)))` | Frequency modulation, not phase |
| v13 | Signal-aligned phase | `sin(ω·g + 0.1·g·tanh(φ))` | Wins Checker, kills Spiral |
| **v15** | Dual-phase decomposition | `sin(ωg+βφ) ⊙ (1+α·sin(2ωg+γφ))` | **First to beat SinGLU on HiFreq** |
---
## Complete Results (All Versions × 9 Tasks)
### Regression (MSE ↓)
| Model | Complex (4D) | Nested (2D) | HiFreq | Memorize |
|-------|-------------|-------------|--------|----------|
| Vanilla | 0.0575 | 0.0487 | 1.10 | 0.1568 |
| SinGLU | 0.0143 | **0.0002** | **1.02** | **9.3e-7** |
| v10 | **0.0080** | 0.0004 | 1.22 | 1.7e-5 |
| v15 | 0.0316 | 0.0051 | **0.85** | 1.0e-5 |
### Classification (Accuracy ↑)
| Model | Spiral | Checkerboard |
|-------|--------|-------------|
| Vanilla | **85.1%** | 57.9% |
| SinGLU | 44.2% | **93.8%** |
| v10 | **99.2%** | 93.8% |
| v15 | 98.9% | 90.0% |
### Generalization (MSE ↓)
| Model | OOD [-1,1]→[1,2] | Freq Gen 2π→10π | Mixed Freq 4π→20π |
|-------|-------------------|------------------|-------------------|
| **Vanilla** | **1.53** | 1.172 | 1.329 |
| SinGLU | 5.90 | **0.736** | 1.491 |
| v10 | 4.96 | 0.958 | **1.178** |
| v15 | 4.38 | 0.910 | 1.317 |
---
## The Six Biggest Findings
1. **SinGLU stores 168,327× more information per parameter** on memorization tasks
2. **SinGLU wins 5/6 standard tasks** — but no universal best neuron exists (Spiral needs v10's free phase)
3. **Every adaptive mechanism fails at small scale** due to gradient competition between meta-parameters and direct computation
4. **The width-richness tradeoff is severe** — every extra matrix steals ~35% of hidden width
5. **Fixed-frequency generalizes better than adaptive** — SinGLU trains worse on `sin(2πx)` but tests better on `sin(10πx)` (unseen frequency)
6. **Periodic neurons hallucinate on OOD** — all sinusoidal architectures show 24-1273× degradation on distribution shift vs Vanilla's 7×
---
## Task-Specific Regime Map
| Task Type | Best Architecture | Why |
|-----------|------------------|-----|
| Compositional / multiplicative | SinGLU | Cross-terms match function structure |
| Geometric / rotational (spirals) | v10 (free phase) | Phase shifts rotate boundaries |
| Multi-scale frequencies (audio) | v15 (dual-phase) | Explicit ω + 2ω channels |
| Pure memorization | SinGLU | Maximum information density |
| OOD / distribution shift | Vanilla ReLU | No hallucinated oscillations |
| Unseen frequency generalization | SinGLU | Fixed basis transfers across scales |
---
## Repository Files
### Benchmarks (one per version)
| File | Contains |
|------|----------|
| `benchmark.py` | v1 — original RichNeuron vs Vanilla |
| `benchmark_v4.py` | v4 — width-fix strategies (LowRank, Shared, SinGLU) |
| `benchmark_v5.py` | v5 — honest re-eval (3 seeds, gradient norms, OOD) |
| `benchmark_v6.py` | v6 — adaptive routing neuron |
| `benchmark_v7.py` | v7 — learnable frequency neuron |
| `benchmark_v8.py` | v8 — adaptive phase + amplitude gate |
| `benchmark_v9.py` | v9 — controlled freq + phase + gate |
| `benchmark_v10.py` | v10 — SinGLU + free phase |
| `benchmark_v11.py` | v11 — SinGLU + disciplined phase |
| `benchmark_v12.py` | v12 — SinGLU + signal-proportional phase (FM) |
| `benchmark_v13.py` | v13 — SinGLU + aligned phase + correlation analysis |
| `benchmark_v15.py` | v15 — dual-phase decomposition + killer experiments |
### Results & Reports
| File | Contains |
|------|----------|
| `results.json` | v1 raw results |
| `results_v4.json` — `results_v15.json` | Per-version raw JSON with per-seed scores |
| `PAPER.md` | Full technical report with analysis |
| `FINDINGS_SUMMARY.md` | Complete architecture catalog and all results tables |
| `CORRECTIONS.md` | Data verification and audit trail |
---
## Quick Start
```bash
pip install torch numpy
python benchmark_v10.py # Run the best adaptive variant
python benchmark_v15.py # Run dual-phase + killer experiments
```
All benchmarks use CPU-only PyTorch and complete in ~15 minutes each.
---
## Reproducibility
- **Hardware:** CPU-only (2 vCPU, 8GB RAM)
- **Total runtime:** ~4 hours for all 12 benchmarks
- **Framework:** PyTorch
- **Seeds:** 3 random seeds per experiment
- **Statistical significance:** Mean ± std reported; all claims based on consistent seed-wise ordering
- **Parameter matching:** Binary search over hidden dimensions to match budgets within ~5%
---
## Citation
```bibtex
@misc{richneuron2025,
title={Beyond Linear Neurons: An Empirical Study of Multiplicative Periodic
Architectures at Small Scale},
author={anshdadhich},
year={2025},
url={https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark}
}
```
---
## References
- [KAN: Kolmogorov-Arnold Networks](https://arxiv.org/abs/2404.19756) — learnable spline activations
- [MONet: Multilinear Operator Networks](https://arxiv.org/abs/2401.17992) — multiplicative interactions
- [SIREN](https://arxiv.org/abs/2006.09661) — periodic activation functions
- [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) — SwiGLU / the 2/3 width trick
- [Sine-Activated Low-Rank Matrices](https://arxiv.org/abs/2403.19243) — sin() provably increases rank
- [Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405) — ~2 bits/param for standard transformers
- [Densing Law of LLMs](https://arxiv.org/abs/2412.04315) — capability density trends
|