Update README.md

A series of ablations asking what's actually necessary in transformer attention. Each step removed something thought to be essential. Nothing broke. It got better.
---
license: apache-2.0
---

## Key Result

O(N) learned causal convolution beats O(N²) softmax attention on both perplexity **and** throughput, with the advantage growing at longer sequences:
| Model | PPL | PPL change | TPS (seq 128) | TPS (seq 2048) | Speedup |
|---|---|---|---|---|---|
| Learned Conv, O(N) | 8.08 | -3.2% | 378,066 | 1,009,622 | 5.5x |
| Standard QKV, O(N²) | 8.34 | baseline | 317,968 | 183,408 | 1.0x |

TPS = tokens per second at the given sequence length.
At 2048 tokens, the O(N) model is 5.5x faster while achieving better perplexity. The gap widens with sequence length because the convolution's cost grows linearly in sequence length, while softmax attention's grows quadratically.
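For intuition only, here is a minimal sketch of what an O(N) learned causal convolution token mixer can look like in PyTorch. This is an assumption, not the repository's actual implementation: the class name `CausalConvMixer`, the depthwise design, and the kernel size are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvMixer(nn.Module):
    """Hypothetical O(N) token mixer: a depthwise causal 1-D convolution.

    Each output position is a learned weighted sum of a fixed window of
    past positions, so compute grows linearly with sequence length,
    unlike the N x N score matrix of softmax attention.
    """

    def __init__(self, dim: int, kernel_size: int = 16):
        super().__init__()
        self.pad = kernel_size - 1  # left padding enforces causality
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); Conv1d expects (batch, dim, seq_len)
        x = x.transpose(1, 2)
        x = F.pad(x, (self.pad, 0))  # pad only the past, never the future
        return self.conv(x).transpose(1, 2)

# Illustrative usage: a drop-in replacement for an attention sublayer.
mixer = CausalConvMixer(dim=256)
y = mixer(torch.randn(2, 2048, 256))
print(y.shape)  # torch.Size([2, 2048, 256])
```

With a fixed kernel size k, the forward pass costs O(N·k·d) rather than O(N²·d), which is the linear-vs-quadratic scaling behind the throughput numbers above.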
- GitHub: https://github.com/MikeyBeez/DifferentialLR
- Medium: https://medium.com/p/6659a3793322
- DOI: https://doi.org/10.5281/zenodo.18498944