anshdadhich
/

richneuron-vs-vanilla-benchmark

Model card Files Files and versions

xet

Community

anshdadhich commited on 15 days ago

Commit

a00a45a

verified ·

1 Parent(s): 194df55

Add README with results and explanation

Browse files

Files changed (1) hide show

README.md +52 -0

README.md ADDED Viewed

	@@ -0,0 +1,52 @@

+# RichNeuron vs Vanilla MLP Benchmark
+## Can replacing `y = ReLU(Wx + b)` with something richer store more information per parameter?
+**Yes.** This repo contains the code and results proving it.
+## The Architecture
+```
+Vanilla neuron:   y = ReLU(W·x + b)
+RichNeuron:       y = LayerNorm( (W1·x) ⊙ sin(ω·W2·x + b) + W1·x )
+```
+RichNeuron combines three ideas from recent research:
+- **Multiplicative interactions** (from [MONet/Π-Nets](https://arxiv.org/abs/2401.17992)): `(W1·x) ⊙ (...)` creates cross-terms between features
+- **Periodic activations** (from [SIREN](https://arxiv.org/abs/2006.09661)): `sin(ω·W2·x)` encodes frequency information
+- **Residual connection**: `+ W1·x` prevents scalar collapse
+## Results (same parameter budget)
+| Task | Vanilla MLP | RichNeuron | Winner | Improvement |
+|---|---|---|---|---|
+| **High-Frequency Signal** | MSE 34.21 | MSE **1.20** | 🟢 Rich | **28.5× better** |
+| **Nested Nonlinear Fn** | MSE 0.032 | MSE **0.00027** | 🟢 Rich | **119.7× better** |
+| **Knowledge Memorization** | MSE 1.0e-10 | MSE **4.1e-13** | 🟢 Rich | **248× better** |
+| Complex Compositional Fn | MSE **0.022** | MSE 0.038 | ⚪ Vanilla | 1.8× |
+| Two-Spiral Classification | **100%** | **100%** | Tie | — |
+| Checkerboard Pattern | **95.2%** | 94.4% | ⚪ Vanilla | ~0.8% |
+| Structured Classification | **100%** | **100%** | Tie | — |
+**Where RichNeuron dominates:** Tasks with periodicity, feature interactions, or requiring high information density per parameter (28-248× better).
+**Where Vanilla wins:** Tasks where raw network width matters more than per-neuron expressiveness.
+## Key Insight
+RichNeuron uses ~2× params per hidden unit (W1 + W2), so it gets **half the hidden width** for the same budget. Despite having fewer neurons, it achieves dramatically better results on structured tasks because each neuron computes a richer function (quadratic × sinusoidal vs. linear × threshold).
+## Run It Yourself
+```bash
+pip install torch numpy
+python benchmark.py
+```
+## References
+- [KAN: Kolmogorov-Arnold Networks](https://arxiv.org/abs/2404.19756) — learnable spline activations
+- [MONet: Multilinear Operator Networks](https://arxiv.org/abs/2401.17992) — multiplicative interactions
+- [SIREN](https://arxiv.org/abs/2006.09661) — periodic activation functions
+- [Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405) — ~2 bits/param for standard transformers
+- [Densing Law of LLMs](https://arxiv.org/abs/2412.04315) — capability density trends