anshdadhich commited on
Commit
a00a45a
·
verified ·
1 Parent(s): 194df55

Add README with results and explanation

Browse files
Files changed (1) hide show
  1. README.md +52 -0
README.md ADDED
@@ -0,0 +1,52 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # RichNeuron vs Vanilla MLP Benchmark
2
+
3
+ ## Can replacing `y = ReLU(Wx + b)` with something richer store more information per parameter?
4
+
5
+ **Yes.** This repo contains the code and results proving it.
6
+
7
+ ## The Architecture
8
+
9
+ ```
10
+ Vanilla neuron: y = ReLU(W·x + b)
11
+ RichNeuron: y = LayerNorm( (W1·x) ⊙ sin(ω·W2·x + b) + W1·x )
12
+ ```
13
+
14
+ RichNeuron combines three ideas from recent research:
15
+ - **Multiplicative interactions** (from [MONet/Π-Nets](https://arxiv.org/abs/2401.17992)): `(W1·x) ⊙ (...)` creates cross-terms between features
16
+ - **Periodic activations** (from [SIREN](https://arxiv.org/abs/2006.09661)): `sin(ω·W2·x)` encodes frequency information
17
+ - **Residual connection**: `+ W1·x` prevents scalar collapse
18
+
19
+ ## Results (same parameter budget)
20
+
21
+ | Task | Vanilla MLP | RichNeuron | Winner | Improvement |
22
+ |---|---|---|---|---|
23
+ | **High-Frequency Signal** | MSE 34.21 | MSE **1.20** | 🟢 Rich | **28.5× better** |
24
+ | **Nested Nonlinear Fn** | MSE 0.032 | MSE **0.00027** | 🟢 Rich | **119.7× better** |
25
+ | **Knowledge Memorization** | MSE 1.0e-10 | MSE **4.1e-13** | 🟢 Rich | **248× better** |
26
+ | Complex Compositional Fn | MSE **0.022** | MSE 0.038 | ⚪ Vanilla | 1.8× |
27
+ | Two-Spiral Classification | **100%** | **100%** | Tie | — |
28
+ | Checkerboard Pattern | **95.2%** | 94.4% | ⚪ Vanilla | ~0.8% |
29
+ | Structured Classification | **100%** | **100%** | Tie | — |
30
+
31
+ **Where RichNeuron dominates:** Tasks with periodicity, feature interactions, or requiring high information density per parameter (28-248× better).
32
+
33
+ **Where Vanilla wins:** Tasks where raw network width matters more than per-neuron expressiveness.
34
+
35
+ ## Key Insight
36
+
37
+ RichNeuron uses ~2× params per hidden unit (W1 + W2), so it gets **half the hidden width** for the same budget. Despite having fewer neurons, it achieves dramatically better results on structured tasks because each neuron computes a richer function (quadratic × sinusoidal vs. linear × threshold).
38
+
39
+ ## Run It Yourself
40
+
41
+ ```bash
42
+ pip install torch numpy
43
+ python benchmark.py
44
+ ```
45
+
46
+ ## References
47
+
48
+ - [KAN: Kolmogorov-Arnold Networks](https://arxiv.org/abs/2404.19756) — learnable spline activations
49
+ - [MONet: Multilinear Operator Networks](https://arxiv.org/abs/2401.17992) — multiplicative interactions
50
+ - [SIREN](https://arxiv.org/abs/2006.09661) — periodic activation functions
51
+ - [Knowledge Capacity Scaling Laws](https://arxiv.org/abs/2404.05405) — ~2 bits/param for standard transformers
52
+ - [Densing Law of LLMs](https://arxiv.org/abs/2412.04315) — capability density trends