# Gzip Text Classifier

**Paper:** ["Less is More: Parameter-Free Text Classification with Gzip"](https://arxiv.org/abs/2212.09410) (Jiang et al., 2022)

## 🏆 The Smallest Implementable Paper

This is an implementation of the smallest ML paper you can run: **zero trainable parameters**, **zero training**, **CPU only**.

The method uses **Normalized Compression Distance (NCD)** with gzip as the compressor, combined with **k-nearest-neighbor** classification:

```
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
```

where `C(x)` is the compressed length of `x` using gzip.

The entire algorithm fits in **15 lines of Python**.
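
The distance can be sanity-checked directly with the standard library. A minimal sketch — the toy sentences below are illustrative, not AG News samples:

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance, with gzip as the compressor C."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Texts that share vocabulary compress well together, giving a lower NCD.
a = "the stock market rallied as tech shares surged on strong earnings"
b = "tech stocks surged and the market rallied after strong earnings reports"
c = "the midfielder scored twice as the home team won the cup final"

print(ncd(a, b))  # lower: same topic, many shared substrings
print(ncd(a, c))  # higher: different topic
```

Note that NCD with a real compressor is only approximately a metric: `ncd(x, x)` is small but not exactly zero, and values can slightly exceed 1.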

## Results on AG News

| k | Accuracy | Macro F1 |
|---|----------|----------|
| 1 | 0.725 | 0.720 |
| 2 | 0.725 | 0.720 |
| 3 | 0.735 | 0.733 |
| 5 | 0.760 | 0.755 |
| **7** | **0.775** | **0.773** |

**Config:** 500 train samples/class (2,000 total), 200 test samples, gzip compressor
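
The two metrics in the table need no extra dependencies to compute. A minimal sketch — the toy predictions below are made up, using the four AG News class names:

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["World", "Sports", "Sports", "Business", "Sci/Tech"]
y_pred = ["World", "Sports", "Business", "Business", "Sci/Tech"]
print(accuracy(y_true, y_pred))  # 0.8
print(macro_f1(y_true, y_pred))  # ≈ 0.833
```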

### Comparison with Paper

| Method | Train Size | Accuracy |
|--------|-----------|----------|
| Paper (gzip, k=2) | 120,000 | **0.937** |
| Paper (BERT) | 120,000 | 0.944 |
| Paper (gzip, 100-shot) | 400 | ~0.80 |
| **Ours (gzip, k=7)** | **2,000** | **0.775** |

## How It Works

1. **Compress** each text with gzip to get its compressed length `C(x)`
2. For each test sample, compute **NCD** to every training sample
3. Use **k-nearest-neighbor** voting on the k closest training samples
4. The predicted class is the majority vote

No embeddings, no weights, no gradients, no GPU. Just information theory.

## Why This Works

Gzip captures statistical regularities in text. Texts from the same category compress better together (lower NCD) because they share vocabulary, phrasing, and statistical patterns. The compressor acts as an implicit similarity measure grounded in Kolmogorov complexity.

## Usage

```python
import gzip
from collections import Counter

def gzip_classify(test_text, train_texts, train_labels, k=7):
    """Predict a label for test_text by k-NN over NCD to the training texts."""
    Cx1 = len(gzip.compress(test_text.encode()))
    distances = []
    for x2, label in zip(train_texts, train_labels):
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distances.append((ncd, label))
    distances.sort(key=lambda d: d[0])  # sort by distance only; labels need not be comparable
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]
```
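
A quick smoke test — the function is repeated so the snippet runs standalone, and the sentences and labels are made up. With only four training texts we call it with `k=1`; the default `k=7` assumes a larger training set:

```python
import gzip
from collections import Counter

def gzip_classify(test_text, train_texts, train_labels, k=7):
    """Predict a label for test_text by k-NN over NCD to the training texts."""
    Cx1 = len(gzip.compress(test_text.encode()))
    distances = []
    for x2, label in zip(train_texts, train_labels):
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
        distances.append(((Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2), label))
    distances.sort(key=lambda d: d[0])
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

train_texts = [
    "stocks rallied as tech shares surged on strong quarterly earnings",
    "the central bank raised interest rates to curb rising inflation",
    "the striker scored a late winner in the championship final",
    "the home side won the league title after a penalty shootout",
]
train_labels = ["Business", "Business", "Sports", "Sports"]

query = "tech shares surged on strong quarterly earnings as stocks rallied"
print(gzip_classify(query, train_texts, train_labels, k=1))  # nearest neighbor is the first Business text
```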

## Hardware Requirements

- **CPU only** ✅
- **0 parameters** ✅
- **0 training** ✅
- **No GPU needed** ✅
- **No dependencies** beyond `gzip` (stdlib) and `numpy`

## Citation

```bibtex
@article{jiang2022less,
  title={Less is More: Parameter-Free Text Classification with Gzip},
  author={Jiang, Zhiying and Yang, Matthew YR and Tsirlin, Mikhail and Tang, Raphael and Lin, Jimmy},
  journal={arXiv preprint arXiv:2212.09410},
  year={2022}
}
```