MikeyBeez committed on
Commit d3e8fea · verified · 1 Parent(s): 45bde7c

Update README.md


A series of ablations asking what’s actually necessary in transformer attention. Each step removed something thought to be essential. Nothing broke. It got better.

Files changed (1)
  1. README.md +17 -3
README.md CHANGED
@@ -1,3 +1,17 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ Key Result
+
+ O(N) learned causal convolution beats O(N²) softmax attention on both perplexity AND throughput, with the advantage growing at longer sequences:
+
+ | Model | PPL | Change | TPS (128) | TPS (2048) | Speedup |
+ |---|---|---|---|---|---|
+ | Learned Conv O(N) | 8.08 | -3.2% | 378,066 | 1,009,622 | 5.5x |
+ | Standard QKV O(N²) | 8.34 | baseline | 317,968 | 183,408 | 1.0x |
+
+ At 2048 tokens, the O(N) model is 5.5x faster while achieving better perplexity. The gap widens with sequence length because the O(N) model's cost scales linearly while the O(N²) model's scales quadratically.
+
+ https://github.com/MikeyBeez/DifferentialLR
+ https://medium.com/p/6659a3793322
+ https://doi.org/10.5281/zenodo.18498944
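As a rough illustration of the O(N) idea (a minimal sketch, not the repository's actual implementation), a learned causal convolution can be written in plain Python: each output position mixes only the current and past tokens with a fixed-size learned kernel, so the cost is O(N·K) for kernel size K, i.e. linear in sequence length, unlike the O(N²) pairwise comparisons of softmax attention. The function name and kernel values below are hypothetical, chosen for clarity.

```python
def causal_conv(x, kernel):
    """Causal 1-D convolution over a token sequence.

    x:      list of scalars, one per token position.
    kernel: list of K learned weights; kernel[j] weights the token
            j steps in the past, so output t depends only on x[0..t].

    Cost is O(N*K): one fixed-size inner loop per position, versus
    the O(N^2) all-pairs score matrix of softmax attention.
    """
    k = len(kernel)
    out = []
    for t in range(len(x)):
        s = 0.0
        for j in range(k):
            if t - j >= 0:  # never look into the future (causality)
                s += kernel[j] * x[t - j]
        out.append(s)
    return out

# Identity kernel passes the sequence through unchanged:
# causal_conv([1.0, 2.0, 3.0], [1.0, 0.0]) -> [1.0, 2.0, 3.0]
```

In a real model the kernel weights are learned per channel (and the sequence entries are vectors, not scalars), but the scaling argument is the same: doubling the sequence length doubles the work, rather than quadrupling it.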