# Beyond Linear Neurons: RichNeuron Benchmark
**Can a richer per-neuron computation than `y = ReLU(Wx + b)` store more information per parameter?**
**Answer: Yes — up to 168,327× more at matched parameter budgets.** But every adaptive mechanism we tried failed at small scale.
---
## The Core Finding
We replaced the standard neural network neuron with 15 progressively more complex architectures. At matched parameter budgets (3K-8K params):
- **SinGLU** (`sin(ω·Wg·x) ⊙ Wv·x`, sketched in code below) — a multiplicative periodic neuron inspired by the GLU literature — **wins 5/6 standard tasks** against vanilla ReLU MLPs
- **168,327× lower MSE** on a 200-point memorization task
- **222× lower MSE** on nested multiplicative periodic functions
- **+35.9 percentage points** on checkerboard classification
But **every adaptive mechanism failed** — routing gates, learnable frequencies, phase predictors, and multi-scale decompositions all either matched or underperformed SinGLU at this scale.
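For concreteness, a SinGLU layer is only a few lines of PyTorch. This is a minimal sketch, not the repo's exact code; the class name, the fixed `omega=6.0` default, and the layer shapes are illustrative assumptions:
```python
import torch
import torch.nn as nn

class SinGLU(nn.Module):
    """Multiplicative periodic neuron: sin(omega * Wg x) ⊙ (Wv x)."""

    def __init__(self, in_dim: int, hidden_dim: int, omega: float = 6.0):
        super().__init__()
        self.gate = nn.Linear(in_dim, hidden_dim)   # Wg: periodic branch
        self.value = nn.Linear(in_dim, hidden_dim)  # Wv: linear branch
        self.omega = omega                          # fixed frequency, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise product of a sinusoidal gate and a linear value path
        return torch.sin(self.omega * self.gate(x)) * self.value(x)
```
Because each layer now spends two weight matrices instead of one, a parameter-matched SinGLU network is narrower than its ReLU counterpart; that width cost is the tradeoff quantified in finding 4 below.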
---
## Architecture Search (15 Versions)
Key variants are shown below; `FINDINGS_SUMMARY.md` has the complete catalog.
| Version | Hypothesis | Key Equation | Verdict |
|---------|-----------|-------------|---------|
| **B1: Vanilla** | Baseline | `ReLU(Wx + b)` | OOD-robust but limited expressivity |
| **B2: SinGLU** | GLU-style periodic | `sin(ω·Wg·x) ⊙ Wv·x` | **Best overall — wins 5/6 tasks** |
| v1 | Multiplicative + periodic + residual | `(W₁x) ⊙ sin(ω·W₂x) + W₁x` | Killed by width penalty |
| v4 | Width-fix strategies | Low-rank, shared-weight, GLU-style | SinGLU emerges as best width fix |
| v6 | Adaptive routing gate | `α(x)·periodic + (1-α)·linear` | α stuck at 0.5 — never learns |
| v7 | Learnable frequency | `sin(ω(x)·Wx)` | ω frozen at initialization |
| v8 | Phase + amplitude gate | `sin(ω·Wx + φ(x))` with sigmoid | Gate weak, phase underused |
| v9 | Controlled freq + phase + gate | Bounded ω(x) + φ(x) + α(x) | 5 matrices → too narrow |
| **v10** | Free phase | `sin(ω·Wg·x + π·tanh(Wφ·x))` | **Best adaptive — wins Spiral + Complex** (sketched below) |
| v11 | Disciplined phase (scaled) | `sin(ω·(g + 0.1·tanh(φ)))` | Phase ~0 — basically SinGLU |
| v12 | Signal-proportional (FM) | `sin(ω·g·(1 + 0.2·tanh(φ)))` | Frequency modulation, not phase |
| v13 | Signal-aligned phase | `sin(ω·g + 0.1·g·tanh(φ))` | Wins Checker, kills Spiral |
| **v15** | Dual-phase decomposition | `sin(ωg+βφ) ⊙ (1+α·sin(2ωg+γφ))` | **First to beat SinGLU on HiFreq** |
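v10 differs from SinGLU by one extra matrix that produces a bounded, input-dependent phase. A minimal sketch, assuming the multiplicative value branch is kept as in SinGLU (the class name and defaults are again illustrative):
```python
import math
import torch
import torch.nn as nn

class FreePhaseSinGLU(nn.Module):
    """v10 sketch: sin(omega * Wg x + pi * tanh(Wphi x)) ⊙ (Wv x)."""

    def __init__(self, in_dim: int, hidden_dim: int, omega: float = 6.0):
        super().__init__()
        self.gate = nn.Linear(in_dim, hidden_dim)   # Wg
        self.phase = nn.Linear(in_dim, hidden_dim)  # Wphi: input-dependent phase
        self.value = nn.Linear(in_dim, hidden_dim)  # Wv
        self.omega = omega

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        phi = math.pi * torch.tanh(self.phase(x))   # bounded phase in (-pi, pi)
        return torch.sin(self.omega * self.gate(x) + phi) * self.value(x)
```
The third matrix is exactly the width penalty described in finding 4, which is plausibly why v10 only pays off where phase genuinely matters (Spiral, Complex).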
---
## Complete Results (All Versions × 9 Tasks)
### Regression (MSE ↓)
| Model | Complex (4D) | Nested (2D) | HiFreq | Memorize |
|-------|-------------|-------------|--------|----------|
| Vanilla | 0.0575 | 0.0487 | 1.10 | 0.1568 |
| SinGLU | 0.0143 | **0.0002** | 1.02 | **9.3e-7** |
| v10 | **0.0080** | 0.0004 | 1.22 | 1.7e-5 |
| v15 | 0.0316 | 0.0051 | **0.85** | 1.0e-5 |
### Classification (Accuracy ↑)
| Model | Spiral | Checkerboard |
|-------|--------|-------------|
| Vanilla | 85.1% | 57.9% |
| SinGLU | 44.2% | **93.8%** |
| v10 | **99.2%** | **93.8%** |
| v15 | 98.9% | 90.0% |
### Generalization (MSE ↓)
| Model | OOD [-1,1]→[1,2] | Freq Gen 2π→10π | Mixed Freq 4π→20π |
|-------|-------------------|------------------|-------------------|
| Vanilla | **1.53** | 1.172 | 1.329 |
| SinGLU | 5.90 | **0.736** | 1.491 |
| v10 | 4.96 | 0.958 | **1.178** |
| v15 | 4.38 | 0.910 | 1.317 |
---
## The Six Biggest Findings
1. **SinGLU stores 168,327× more information per parameter** on memorization tasks
2. **SinGLU wins 5/6 standard tasks** — but no universal best neuron exists (Spiral needs v10's free phase)
3. **Every adaptive mechanism fails at small scale** due to gradient competition between meta-parameters and direct computation
4. **The width-richness tradeoff is severe** — at a fixed parameter budget, every extra weight matrix per neuron cuts the affordable hidden width by roughly 35%
5. **Fixed frequency generalizes better than adaptive** — SinGLU reaches a worse training fit on `sin(2πx)` than the adaptive variants, yet tests better on the unseen `sin(10πx)` (0.736 vs v10's 0.958)
6. **Periodic neurons hallucinate on OOD** — all sinusoidal architectures degrade 24–1273× under distribution shift, versus Vanilla's 7×
---
## Task-Specific Regime Map
| Task Type | Best Architecture | Why |
|-----------|------------------|-----|
| Compositional / multiplicative | SinGLU | Cross-terms match function structure |
| Geometric / rotational (spirals) | v10 (free phase) | Phase shifts rotate boundaries |
| Multi-scale frequencies (audio) | v15 (dual-phase) | Explicit ω + 2ω channels |
| Pure memorization | SinGLU | Maximum information density |
| OOD / distribution shift | Vanilla ReLU | No hallucinated oscillations |
| Unseen frequency generalization | SinGLU | Fixed basis transfers across scales |
---
## Repository Files
### Benchmarks (one per version)
| File | Contains |
|------|----------|
| `benchmark.py` | v1 — original RichNeuron vs Vanilla |
| `benchmark_v4.py` | v4 — width-fix strategies (LowRank, Shared, SinGLU) |
| `benchmark_v5.py` | v5 — honest re-eval (3 seeds, gradient norms, OOD) |
| `benchmark_v6.py` | v6 — adaptive routing neuron |
| `benchmark_v7.py` | v7 — learnable frequency neuron |
| `benchmark_v8.py` | v8 — adaptive phase + amplitude gate |
| `benchmark_v9.py` | v9 — controlled freq + phase + gate |
| `benchmark_v10.py` | v10 — SinGLU + free phase |
| `benchmark_v11.py` | v11 — SinGLU + disciplined phase |
| `benchmark_v12.py` | v12 — SinGLU + signal-proportional phase (FM) |
| `benchmark_v13.py` | v13 — SinGLU + aligned phase + correlation analysis |
| `benchmark_v15.py` | v15 — dual-phase decomposition + killer experiments |
### Results & Reports
| File | Contains |
|------|----------|
| `results.json` | v1 raw results |
| `results_v4.json` – `results_v15.json` | Per-version raw JSON with per-seed scores |
| `PAPER.md` | Full technical report with analysis |
| `FINDINGS_SUMMARY.md` | Complete architecture catalog and all results tables |
| `CORRECTIONS.md` | Data verification and audit trail |
---
## Quick Start
```bash
pip install torch numpy
python benchmark_v10.py # Run the best adaptive variant
python benchmark_v15.py # Run dual-phase + killer experiments
```
All benchmarks use CPU-only PyTorch and complete in ~15 minutes each.
---
## Reproducibility
- **Hardware:** CPU-only (2 vCPU, 8GB RAM)
- **Total runtime:** ~4 hours for all 12 benchmarks
- **Framework:** PyTorch
- **Seeds:** 3 random seeds per experiment
- **Statistical significance:** Mean ± std reported; all claims based on consistent seed-wise ordering
- **Parameter matching:** Binary search over hidden dimensions to match budgets within ~5% (see the sketch below)
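A minimal version of that search (the helper names, the `make_model(hidden_dim)` factory interface, and the search bounds are illustrative assumptions, not the repo's exact code):
```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

def match_hidden_dim(make_model, target: int, lo: int = 1, hi: int = 4096) -> int:
    """Binary-search the hidden width whose parameter count is closest to target.

    Assumes count_params(make_model(h)) is non-decreasing in h.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if count_params(make_model(mid)) < target:
            lo = mid + 1
        else:
            hi = mid
    # lo is the smallest width at or above target; also consider one step below
    return min((h for h in (lo - 1, lo) if h >= 1),
               key=lambda h: abs(count_params(make_model(h)) - target))
```
For example, `match_hidden_dim(lambda h: nn.Linear(4, h), target=5000)` returns the width whose parameter count is closest to 5,000; a configuration is accepted when it lands within the ~5% tolerance above.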
---
## Citation
```bibtex
@misc{richneuron2025,
  title={Beyond Linear Neurons: An Empirical Study of Multiplicative Periodic
         Architectures at Small Scale},
  author={anshdadhich},
  year={2025},
  url={https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark}
}
```
---
## References
- [KAN: Kolmogorov-Arnold Networks](https://arxiv.org/abs/2404.19756) — learnable spline activations
- [MONet: Multilinear Operator Networks](https://arxiv.org/abs/2401.17992) — multiplicative interactions
- [SIREN](https://arxiv.org/abs/2006.09661) — periodic activation functions
- [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) — SwiGLU / the 2/3 width trick
- [Sine-Activated Low-Rank Matrices](https://arxiv.org/abs/2403.19243) — sin() provably increases rank
- [Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405) — ~2 bits/param for standard transformers
- [Densing Law of LLMs](https://arxiv.org/abs/2412.04315) — capability density trends