Does replacing y = ReLU(Wx + b) with richer per-neuron computation let a network store more information per parameter?

Answer: yes, up to 168,327× more at matched parameter budgets. But every adaptive mechanism we tried failed at this small scale.
We replaced the standard neural network neuron with 15 progressively more complex architectures. At matched parameter budgets (3K-8K params):
SinGLU — sin(ω·W₁x) ⊙ (W₂x), a multiplicative periodic neuron inspired by the GLU literature — wins 5/6 standard tasks against vanilla ReLU MLPs. But every adaptive mechanism failed: routing gates, learnable frequencies, phase predictors, and multi-scale decompositions all either matched or underperformed SinGLU at this scale.
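A minimal sketch of the SinGLU unit, in NumPy rather than the repo's PyTorch, with illustrative weight names (the real implementations live in the `benchmark_*.py` files):

```python
import numpy as np

def singlu(x, W1, W2, omega=1.0):
    """SinGLU sketch: a periodic gate sin(omega * W1 x) multiplied
    elementwise by a linear value path W2 x.
    W1, W2 have shape (d_out, d_in); x has shape (d_in,)."""
    return np.sin(omega * (W1 @ x)) * (W2 @ x)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y = singlu(x, W1, W2, omega=2.0)
print(y.shape)  # (3,)
```

The multiplicative gate is what supplies cross-terms between the two projections; the periodic nonlinearity supplies the oscillatory basis.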
| Version | Hypothesis | Key Equation | Verdict |
|---|---|---|---|
| B1: Vanilla | Baseline | ReLU(Wx + b) | OOD-robust but limited expressivity |
| B2: SinGLU | GLU-style periodic | sin(ω·Wg·x) ⊙ Wv·x | Best overall — wins 5/6 tasks |
| v1 | Multiplicative + periodic + residual | (W₁x) ⊙ sin(ω·W₂x) + W₁x | Killed by width penalty |
| v4 | Width-fix strategies | Low-rank, shared-weight, GLU-style | SinGLU emerges as best width fix |
| v6 | Adaptive routing gate | α(x)·periodic + (1-α)·linear | α stuck at 0.5 — never learns |
| v7 | Learnable frequency | sin(ω(x)·Wx) | ω frozen at initialization |
| v8 | Phase + amplitude gate | sin(ω·Wx + φ(x)) with sigmoid | Gate weak, phase underused |
| v9 | Controlled freq + phase + gate | Bounded ω(x) + φ(x) + α(x) | 5 matrices → too narrow |
| v10 | Free phase | sin(ω·Wg·x + π·tanh(Wφ·x)) | Best adaptive — wins Spiral + Complex |
| v11 | Disciplined phase (scaled) | sin(ω·(g + 0.1·tanh(φ))) | Phase ~0 — basically SinGLU |
| v12 | Signal-proportional (FM) | sin(ω·g·(1 + 0.2·tanh(φ))) | Frequency modulation, not phase |
| v13 | Signal-aligned phase | sin(ω·g + 0.1·g·tanh(φ)) | Wins Checker, kills Spiral |
| v15 | Dual-phase decomposition | sin(ωg+βφ) ⊙ (1+α·sin(2ωg+γφ)) | First to beat SinGLU on HiFreq |
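The v10 (free phase) row can be sketched the same way. A hedged NumPy version, assuming the phase branch simply adds a bounded π·tanh(·) term inside the SinGLU gate (the weight names `Wg`, `Wv`, `Wphi` are illustrative, not taken from the repo):

```python
import numpy as np

def v10_free_phase(x, Wg, Wv, Wphi, omega=1.0):
    """v10 sketch: SinGLU with an input-dependent phase shift,
    bounded to (-pi, pi) by pi * tanh(Wphi x)."""
    phase = np.pi * np.tanh(Wphi @ x)
    return np.sin(omega * (Wg @ x) + phase) * (Wv @ x)

rng = np.random.default_rng(1)
Wg, Wv, Wphi = (rng.standard_normal((3, 4)) for _ in range(3))
x = rng.standard_normal(4)
y = v10_free_phase(x, Wg, Wv, Wphi)
print(y.shape)  # (3,)
```

Because the phase term is bounded, it can rotate decision boundaries (useful on Spiral) without letting the effective frequency run away during training.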
Regression benchmarks (test loss, lower is better):

| Model | Complex (4D) | Nested (2D) | HiFreq | Memorize |
|---|---|---|---|---|
| Vanilla | 0.0575 | 0.0487 | 1.10 | 0.1568 |
| SinGLU | 0.0143 | 0.0002 | 1.02 | 9.3e-7 |
| v10 | 0.0080 | 0.0004 | 1.22 | 1.7e-5 |
| v15 | 0.0316 | 0.0051 | 0.85 | 1.0e-5 |
Classification benchmarks (test accuracy, higher is better):

| Model | Spiral | Checkerboard |
|---|---|---|
| Vanilla | 85.1% | 57.9% |
| SinGLU | 44.2% | 93.8% |
| v10 | 99.2% | 93.8% |
| v15 | 98.9% | 90.0% |
Generalization benchmarks (test error, lower is better):

| Model | OOD [-1,1]→[1,2] | Freq Gen 2π→10π | Mixed Freq 4π→20π |
|---|---|---|---|
| Vanilla | 1.53 | 1.172 | 1.329 |
| SinGLU | 5.90 | 0.736 | 1.491 |
| v10 | 4.96 | 0.958 | 1.178 |
| v15 | 4.38 | 0.910 | 1.317 |
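The OOD column makes the trade-off concrete: outside the training range a ReLU feature extrapolates monotonically, while a periodic feature keeps oscillating. A small illustrative sketch (not code from the benchmarks):

```python
import numpy as np

x_ood = np.linspace(1.0, 2.0, 5)           # shifted inputs; training used [-1, 1]
relu_feat = np.maximum(0.0, 2.0 * x_ood)   # extrapolates linearly past the data
sin_feat = np.sin(10.0 * x_ood)            # keeps oscillating: spurious structure
print(relu_feat)   # strictly increasing
print(sin_feat)    # changes sign inside [1, 2]
```

This is why vanilla ReLU wins the OOD column even while losing almost everything in-distribution: it has no oscillatory basis to hallucinate with.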
Freq Gen trains on sin(2πx) but is tested on sin(10πx), an unseen frequency.

| Task Type | Best Architecture | Why |
|---|---|---|
| Compositional / multiplicative | SinGLU | Cross-terms match function structure |
| Geometric / rotational (spirals) | v10 (free phase) | Phase shifts rotate boundaries |
| Multi-scale frequencies (audio) | v15 (dual-phase) | Explicit ω + 2ω channels |
| Pure memorization | SinGLU | Maximum information density |
| OOD / distribution shift | Vanilla ReLU | No hallucinated oscillations |
| Unseen frequency generalization | SinGLU | Fixed basis transfers across scales |
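The "explicit ω + 2ω channels" idea behind v15 can be sketched as a base-frequency term modulated by its first harmonic. A NumPy sketch with illustrative scalar values for α, β, γ (the trained values will differ):

```python
import numpy as np

def v15_dual_phase(x, Wg, Wphi, omega=1.0, alpha=0.5, beta=0.1, gamma=0.1):
    """v15 sketch: sin(omega*g + beta*phi) modulated by a 2*omega harmonic
    (1 + alpha*sin(2*omega*g + gamma*phi)), where g and phi are linear
    projections of the input."""
    g, phi = Wg @ x, Wphi @ x
    base = np.sin(omega * g + beta * phi)
    harmonic = 1.0 + alpha * np.sin(2.0 * omega * g + gamma * phi)
    return base * harmonic

rng = np.random.default_rng(2)
Wg, Wphi = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
x = rng.standard_normal(4)
y = v15_dual_phase(x, Wg, Wphi)
print(y.shape)  # (3,)
```

With |sin| ≤ 1 and the modulation factor in [1-α, 1+α], the output stays bounded while covering two frequency scales at once, which is the property the HiFreq task rewards.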
| File | Contains |
|---|---|
| `benchmark.py` | v1 — original RichNeuron vs Vanilla |
| `benchmark_v4.py` | v4 — width-fix strategies (LowRank, Shared, SinGLU) |
| `benchmark_v5.py` | v5 — honest re-eval (3 seeds, gradient norms, OOD) |
| `benchmark_v6.py` | v6 — adaptive routing neuron |
| `benchmark_v7.py` | v7 — learnable frequency neuron |
| `benchmark_v8.py` | v8 — adaptive phase + amplitude gate |
| `benchmark_v9.py` | v9 — controlled freq + phase + gate |
| `benchmark_v10.py` | v10 — SinGLU + free phase |
| `benchmark_v11.py` | v11 — SinGLU + disciplined phase |
| `benchmark_v12.py` | v12 — SinGLU + signal-proportional phase (FM) |
| `benchmark_v13.py` | v13 — SinGLU + aligned phase + correlation analysis |
| `benchmark_v15.py` | v15 — dual-phase decomposition + killer experiments |
| File | Contains |
|---|---|
| `results.json` | v1 raw results |
| `results_v4.json` — `results_v15.json` | Per-version raw JSON with per-seed scores |
| `PAPER.md` | Full technical report with analysis |
| `FINDINGS_SUMMARY.md` | Complete architecture catalog and all results tables |
| `CORRECTIONS.md` | Data verification and audit trail |
```bash
pip install torch numpy
python benchmark_v10.py   # run the best adaptive variant
python benchmark_v15.py   # run dual-phase + killer experiments
```
All benchmarks use CPU-only PyTorch and complete in ~15 minutes each.
```bibtex
@misc{richneuron2025,
  title={Beyond Linear Neurons: An Empirical Study of Multiplicative Periodic
         Architectures at Small Scale},
  author={anshdadhich},
  year={2025},
  url={https://huggingface.co/anshdadhich/richneuron-vs-vanilla-benchmark}
}
```