# Gzip Text Classifier

**Paper:** ["Less is More: Parameter-Free Text Classification with Gzip"](https://arxiv.org/abs/2212.09410) (Jiang et al., 2022)

## 🏆 The Smallest Implementable Paper

This is an implementation of the smallest ML paper you can run — **zero trainable parameters**, **zero training**, **CPU only**.

The method uses **Normalized Compression Distance (NCD)** with gzip as the compressor, combined with **k-nearest-neighbor** classification:

```
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
```

where `C(x)` is the compressed length of `x` using gzip.

The entire algorithm fits in **15 lines of Python**.
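
The formula translates directly into a few lines of standard-library Python. The helper name `ncd` and the example strings below are purely illustrative:

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings, using gzip."""
    cx = len(gzip.compress(x.encode()))                # C(x)
    cy = len(gzip.compress(y.encode()))                # C(y)
    cxy = len(gzip.compress((x + " " + y).encode()))   # C(xy)
    return (cxy - min(cx, cy)) / max(cx, cy)

# A string is closer to a near-duplicate of itself than to unrelated text.
a = "the stock market rallied as tech shares surged"
b = "tech shares surged as the stock market rallied"
c = "the quarterback threw a late touchdown pass in overtime"
print(ncd(a, b) < ncd(a, c))
```

Note that NCD is not a true metric in practice (gzip is an imperfect approximation of Kolmogorov complexity), but it is close enough to rank neighbors.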
## Results on AG News

| k | Accuracy | Macro F1 |
|---|----------|----------|
| 1 | 0.725 | 0.720 |
| 2 | 0.725 | 0.720 |
| 3 | 0.735 | 0.733 |
| 5 | 0.760 | 0.755 |
| **7** | **0.775** | **0.773** |

**Config:** 500 train samples per class (2,000 total), 200 test samples, gzip compressor

### Comparison with Paper

| Method | Train Size | Accuracy |
|--------|-----------|----------|
| Paper (gzip, k=2) | 120,000 | **0.937** |
| Paper (BERT) | 120,000 | 0.944 |
| Paper (gzip, 100-shot) | 400 | ~0.80 |
| **Ours (gzip, k=7)** | **2,000** | **0.775** |

## How It Works

1. **Compress** each text with gzip to get its compressed length `C(x)`
2. For each test sample, compute the **NCD** to every training sample
3. Take the **k nearest** training samples by NCD
4. The predicted class is the **majority vote** among their labels

No embeddings, no weights, no gradients, no GPU. Just information theory.

## Why This Works

Gzip captures statistical regularities in text. Texts from the same category compress better together (lower NCD) because they share vocabulary, phrasing, and statistical patterns. The compressor acts as an implicit similarity measure grounded in Kolmogorov complexity.
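
This effect is easy to observe directly: concatenating two same-topic sentences compresses into fewer bytes than concatenating two unrelated ones. The sentences below are made up for illustration:

```python
import gzip

def joint_size(x: str, y: str) -> int:
    # Compressed length of the concatenation; shared substrings let
    # gzip emit cheap back-references instead of literal bytes.
    return len(gzip.compress((x + " " + y).encode()))

sports_a = "the striker scored twice as the home team won the league match"
sports_b = "the home team won the league match after the striker scored"
finance = "shares fell sharply after the central bank raised interest rates"

# The same-topic pair shares vocabulary, so its joint size is smaller.
print(joint_size(sports_a, sports_b) < joint_size(sports_a, finance))
```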
52
+
53
+ ## Usage
54
+
55
+ ```python
56
+ import gzip
57
+ from collections import Counter
58
+ import numpy as np
59
+
60
+ def gzip_classify(test_text, train_texts, train_labels, k=7):
61
+ Cx1 = len(gzip.compress(test_text.encode()))
62
+ distances = []
63
+ for x2, label in zip(train_texts, train_labels):
64
+ Cx2 = len(gzip.compress(x2.encode()))
65
+ Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
66
+ ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
67
+ distances.append((ncd, label))
68
+ distances.sort()
69
+ top_k = [label for _, label in distances[:k]]
70
+ return Counter(top_k).most_common(1)[0][0]
71
+ ```
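
As a quick smoke test (repeating the function so the snippet runs standalone), here is a run against a tiny hand-made "training set" — the sentences and labels are illustrative, not AG News:

```python
import gzip
from collections import Counter

def gzip_classify(test_text, train_texts, train_labels, k=7):
    # Same algorithm as above: NCD to every training text, then k-NN vote.
    Cx1 = len(gzip.compress(test_text.encode()))
    distances = []
    for x2, label in zip(train_texts, train_labels):
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
        distances.append(((Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2), label))
    distances.sort()
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

train_texts = [
    "the team won the championship game last night",
    "the striker scored a goal in the final match",
    "stocks rallied after the strong earnings report",
    "the central bank raised interest rates again",
]
train_labels = ["sports", "sports", "business", "business"]

# A training text is always nearest to itself, so k=1 recovers its label.
print(gzip_classify(train_texts[0], train_texts, train_labels, k=1))
```

With a training set this small, small values of `k` make more sense than the `k=7` default used for the AG News runs.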
## Hardware Requirements

- **CPU only** ✅
- **0 parameters** ✅
- **0 training** ✅
- **No GPU needed** ✅
- **No dependencies** beyond `gzip` (stdlib) and `numpy`

## Citation

```bibtex
@article{jiang2022less,
  title={Less is More: Parameter-Free Text Classification with Gzip},
  author={Jiang, Zhiying and Yang, Matthew YR and Tsirlin, Mikhail and Tang, Raphael and Lin, Jimmy},
  journal={arXiv preprint arXiv:2212.09410},
  year={2022}
}
```