Gzip Text Classifier
Paper: "Less is More: Parameter-Free Text Classification with Gzip" (Jiang et al., 2022)
The Smallest Implementable Paper
This is an implementation of the smallest ML paper you can run: zero trainable parameters, zero training, CPU only.
The method uses Normalized Compression Distance (NCD) with gzip as the compressor, combined with k-nearest-neighbor classification:
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
where C(x) is the compressed length of x using gzip.
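As a quick sketch, NCD can be computed directly with the standard library (the example strings are made up for illustration):

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance using gzip compressed lengths."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    # C(xy): compressed length of the concatenation.
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Texts sharing vocabulary compress better together, giving a lower NCD.
print(ncd("the stock market rallied today", "stock prices rallied on the market"))
print(ncd("the stock market rallied today", "the team won the championship game"))
```

Note that NCD with a real compressor is only approximately normalized; values can slightly exceed 1 due to gzip header overhead.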
The entire algorithm fits in 15 lines of Python.
Results on AG News
| k | Accuracy | Macro F1 |
|---|---|---|
| 1 | 0.725 | 0.720 |
| 2 | 0.725 | 0.720 |
| 3 | 0.735 | 0.733 |
| 5 | 0.760 | 0.755 |
| 7 | 0.775 | 0.773 |
Config: 500 train samples/class (2000 total), 200 test samples, gzip compressor
Comparison with Paper
| Method | Train Size | Accuracy |
|---|---|---|
| Paper (gzip, k=2) | 120,000 | 0.937 |
| Paper (BERT) | 120,000 | 0.944 |
| Paper (gzip, 100-shot) | 400 | ~0.80 |
| Ours (gzip, k=7) | 2,000 | 0.775 |
How It Works
- Compress each text with gzip to get its compressed length C(x)
- For each test sample, compute NCD to every training sample
- Use k-nearest-neighbor voting on the k closest training samples
- The predicted class is the majority vote
No embeddings, no weights, no gradients, no GPU. Just information theory.
Why This Works
Gzip captures statistical regularities in text. Texts from the same category compress better together (lower NCD) because they share vocabulary, phrasing, and statistical patterns. The compressor acts as an implicit similarity measure grounded in Kolmogorov complexity.
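This effect is easy to observe: appending a related text to a document adds fewer compressed bytes than appending an unrelated one of similar length, because gzip can reuse the shared substrings. A small illustration (toy sentences, assumed for demonstration only):

```python
import gzip

def clen(s: str) -> int:
    # Compressed length in bytes: the C(x) from the NCD formula.
    return len(gzip.compress(s.encode()))

sports = "the team scored a late goal and the fans cheered the team"
sports2 = "the fans cheered as the team scored another goal"
tech = "quantum processors entangle qubits using circuits"

# Extra compressed bytes needed to encode a second text after the first:
print(clen(sports + " " + sports2) - clen(sports))  # shared vocabulary
print(clen(sports + " " + tech) - clen(sports))     # disjoint vocabulary
```

The first difference should be smaller: gzip's LZ77 stage encodes repeated phrases like "the fans cheered" as short back-references instead of literal bytes.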
Usage
```python
import gzip
from collections import Counter

def gzip_classify(test_text, train_texts, train_labels, k=7):
    # Compressed length of the test text, C(x1).
    Cx1 = len(gzip.compress(test_text.encode()))
    distances = []
    for x2, label in zip(train_texts, train_labels):
        Cx2 = len(gzip.compress(x2.encode()))
        # Compressed length of the concatenation, C(x1 x2).
        Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distances.append((ncd, label))
    # Sort by NCD only, so tied distances never fall back to comparing labels.
    distances.sort(key=lambda pair: pair[0])
    top_k = [label for _, label in distances[:k]]
    # Majority vote among the k nearest training samples.
    return Counter(top_k).most_common(1)[0][0]
```
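A minimal end-to-end call looks like the following. The training texts and labels are made up for illustration, and the function definition is repeated so the snippet runs standalone:

```python
import gzip
from collections import Counter

def gzip_classify(test_text, train_texts, train_labels, k=1):
    # Same algorithm as above, condensed: NCD to every training sample,
    # then a majority vote over the k nearest neighbors.
    Cx1 = len(gzip.compress(test_text.encode()))
    distances = []
    for x2, label in zip(train_texts, train_labels):
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
        distances.append(((Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2), label))
    distances.sort(key=lambda pair: pair[0])
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

train_texts = [
    "the team won the game with a late goal",
    "the striker scored twice as the team won",
    "new chip doubles battery life in phones",
    "the startup released a faster mobile chip",
]
train_labels = ["sports", "sports", "tech", "tech"]

print(gzip_classify("the team scored a goal and won the game",
                    train_texts, train_labels, k=1))  # → sports
```

With such tiny toy texts the margins are small; the method only becomes reliable with hundreds of training samples per class, as the results table above shows.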
Hardware Requirements
- CPU only ✓
- 0 parameters ✓
- 0 training ✓
- No GPU needed ✓
- No dependencies beyond `gzip` (stdlib) and `numpy`
Citation
```bibtex
@article{jiang2022less,
  title={Less is More: Parameter-Free Text Classification with Gzip},
  author={Jiang, Zhiying and Yang, Matthew YR and Tsirlin, Mikhail and Tang, Raphael and Lin, Jimmy},
  journal={arXiv preprint arXiv:2212.09410},
  year={2022}
}
```