# Beyond Linear Neurons: RichNeuron Benchmark **Can replacing `y = ReLU(Wx + b)` with richer per-neuron computation store more information per parameter?** **Answer: Yes — up to 168,327× more at matched parameter budgets.** But every adaptive mechanism we tried failed at small scale. --- ## The Core Finding We replaced the standard neural network neuron with 15 progressively more complex architectures. At matched parameter budgets (3K-8K params): - **SinGLU** (`sin(ω·W₁x) ⊙ W₂x`) — a multiplicative periodic neuron inspired by the GLU literature — **wins 5/6 standard tasks** against vanilla ReLU MLPs - **168,327× lower MSE** on a 200-point memorization task - **222× lower MSE** on nested multiplicative periodic functions - **+35.9 percentage points** on checkerboard classification But **every adaptive mechanism failed** — routing gates, learnable frequencies, phase predictors, and multi-scale decompositions all either matched or underperformed SinGLU at this scale. --- ## Architecture Search (15 Versions) | Version | Hypothesis | Key Equation | Verdict | |---------|-----------|-------------|---------| | **B1: Vanilla** | Baseline | `ReLU(Wx + b)` | OOD-robust but limited expressivity | | **B2: SinGLU** | GLU-style periodic | `sin(ω·Wg·x) ⊙ Wv·x` | **Best overall — wins 5/6 tasks** | | v1 | Multiplicative + periodic + residual | `(W₁x) ⊙ sin(ω·W₂x) + W₁x` | Killed by width penalty | | v4 | Width-fix strategies | Low-rank, shared-weight, GLU-style | SinGLU emerges as best width fix | | v6 | Adaptive routing gate | `α(x)·periodic + (1-α)·linear` | α stuck at 0.5 — never learns | | v7 | Learnable frequency | `sin(ω(x)·Wx)` | ω frozen at initialization | | v8 | Phase + amplitude gate | `sin(ω·Wx + φ(x))` with sigmoid | Gate weak, phase underused | | v9 | Controlled freq + phase + gate | Bounded ω(x) + φ(x) + α(x) | 5 matrices → too narrow | | **v10** | Free phase | `sin(ω·Wg·x + π·tanh(Wφ·x))` | **Best adaptive — wins Spiral + Complex** | | v11 | Disciplined phase (scaled) | `sin(ω·(g + 0.1·tanh(φ)))` | Phase ~0 — basically SinGLU | | v12 | Signal-proportional (FM) | `sin(ω·g·(1 + 0.2·tanh(φ)))` | Frequency modulation, not phase | | v13 | Signal-aligned phase | `sin(ω·g + 0.1·g·tanh(φ))` | Wins Checker, kills Spiral | | **v15** | Dual-phase decomposition | `sin(ωg+βφ) ⊙ (1+α·sin(2ωg+γφ))` | **First to beat SinGLU on HiFreq** | --- ## Complete Results (All Versions × 9 Tasks) ### Regression (MSE ↓) | Model | Complex (4D) | Nested (2D) | HiFreq | Memorize | |-------|-------------|-------------|--------|----------| | Vanilla | 0.0575 | 0.0487 | 1.10 | 0.1568 | | SinGLU | 0.0143 | **0.0002** | **1.02** | **9.3e-7** | | v10 | **0.0080** | 0.0004 | 1.22 | 1.7e-5 | | v15 | 0.0316 | 0.0051 | **0.85** | 1.0e-5 | ### Classification (Accuracy ↑) | Model | Spiral | Checkerboard | |-------|--------|-------------| | Vanilla | **85.1%** | 57.9% | | SinGLU | 44.2% | **93.8%** | | v10 | **99.2%** | 93.8% | | v15 | 98.9% | 90.0% | ### Generalization (MSE ↓) | Model | OOD [-1,1]→[1,2] | Freq Gen 2π→10π | Mixed Freq 4π→20π | |-------|-------------------|------------------|-------------------| | **Vanilla** | **1.53** | 1.172 | 1.329 | | SinGLU | 5.90 | **0.736** | 1.491 | | v10 | 4.96 | 0.958 | **1.178** | | v15 | 4.38 | 0.910 | 1.317 | --- ## The Six Biggest Findings 1. **SinGLU stores 168,327× more information per parameter** on memorization tasks 2. **SinGLU wins 5/6 standard tasks** — but no universal best neuron exists (Spiral needs v10's free phase) 3. **Every adaptive mechanism fails at small scale** due to gradient competition between meta-parameters and direct computation 4. **The width-richness tradeoff is severe** — every extra matrix steals ~35% of hidden width 5. **Fixed-frequency generalizes better than adaptive** — SinGLU trains worse on `sin(2πx)` but tests better on `sin(10πx)` (unseen frequency) 6. **Periodic neurons hallucinate on OOD** — all sinusoidal architectures show 24-1273× degradation on distribution shift vs Vanilla's 7× --- ## Task-Specific Regime Map | Task Type | Best Architecture | Why | |-----------|------------------|-----| | Compositional / multiplicative | SinGLU | Cross-terms match function structure | | Geometric / rotational (spirals) | v10 (free phase) | Phase shifts rotate boundaries | | Multi-scale frequencies (audio) | v15 (dual-phase) | Explicit ω + 2ω channels | | Pure memorization | SinGLU | Maximum information density | | OOD / distribution shift | Vanilla ReLU | No hallucinated oscillations | | Unseen frequency generalization | SinGLU | Fixed basis transfers across scales | --- ## Repository Files ### Benchmarks (one per version) | File | Contains | |------|----------| | `benchmark.py` | v1 — original RichNeuron vs Vanilla | | `benchmark_v4.py` | v4 — width-fix strategies (LowRank, Shared, SinGLU) | | `benchmark_v5.py` | v5 — honest re-eval (3 seeds, gradient norms, OOD) | | `benchmark_v6.py` | v6 — adaptive routing neuron | | `benchmark_v7.py` | v7 — learnable frequency neuron | | `benchmark_v8.py` | v8 — adaptive phase + amplitude gate | | `benchmark_v9.py` | v9 — controlled freq + phase + gate | | `benchmark_v10.py` | v10 — SinGLU + free phase | | `benchmark_v11.py` | v11 — SinGLU + disciplined phase | | `benchmark_v12.py` | v12 — SinGLU + signal-proportional phase (FM) | | `benchmark_v13.py` | v13 — SinGLU + aligned phase + correlation analysis | | `benchmark_v15.py` | v15 — dual-phase decomposition + killer experiments | ### Results & Reports | File | Contains | |------|----------| | `results.json` | v1 raw results | | `results_v4.json` — `results_v15.json` | Per-version raw JSON with per-seed scores | | `PAPER.md` | Full technical report with analysis | | `FINDINGS_SUMMARY.md` | Complete architecture catalog and all results tables | | `CORRECTIONS.md` | Data verification and audit trail | --- ## Quick Start ```bash pip install torch numpy python benchmark_v10.py # Run the best adaptive variant python benchmark_v15.py # Run dual-phase + killer experiments ``` All benchmarks use CPU-only PyTorch and complete in ~15 minutes each. --- ## Reproducibility - **Hardware:** CPU-only (2 vCPU, 8GB RAM) - **Total runtime:** ~4 hours for all 12 benchmarks - **Framework:** PyTorch - **Seeds:** 3 random seeds per experiment - **Statistical significance:** Mean ± std reported; all claims based on consistent seed-wise ordering - **Parameter matching:** Binary search over hidden dimensions to match budgets within ~5% --- ## Citation ```bibtex @misc{richneuron2025, title={Beyond Linear Neurons: An Empirical Study of Multiplicative Periodic Architectures at Small Scale}, author={anshdadhich}, year={2025}, url={https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark} } ``` --- ## References - [KAN: Kolmogorov-Arnold Networks](https://arxiv.org/abs/2404.19756) — learnable spline activations - [MONet: Multilinear Operator Networks](https://arxiv.org/abs/2401.17992) — multiplicative interactions - [SIREN](https://arxiv.org/abs/2006.09661) — periodic activation functions - [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) — SwiGLU / the 2/3 width trick - [Sine-Activated Low-Rank Matrices](https://arxiv.org/abs/2403.19243) — sin() provably increases rank - [Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405) — ~2 bits/param for standard transformers - [Densing Law of LLMs](https://arxiv.org/abs/2412.04315) — capability density trends