knoxel's picture
Upload README.md
a7279bf verified

Gzip Text Classifier

Paper: "Less is More: Parameter-Free Text Classification with Gzip" (Jiang et al., 2022)

πŸ† The Smallest Implementable Paper

This is an implementation of the smallest ML paper you can run β€” zero trainable parameters, zero training, CPU only.

The method uses Normalized Compression Distance (NCD) with gzip as the compressor, combined with k-nearest-neighbor classification:

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

where C(x) is the compressed length of x using gzip.

The entire algorithm fits in 15 lines of Python.

Results on AG News

k Accuracy Macro F1
1 0.725 0.720
2 0.725 0.720
3 0.735 0.733
5 0.760 0.755
7 0.775 0.773

Config: 500 train samples/class (2000 total), 200 test samples, gzip compressor

Comparison with Paper

Method Train Size Accuracy
Paper (gzip, k=2) 120,000 0.937
Paper (BERT) 120,000 0.944
Paper (gzip, 100-shot) 400 ~0.80
Ours (gzip, k=7) 2,000 0.775

How It Works

  1. Compress each text with gzip to get its compressed length C(x)
  2. For each test sample, compute NCD to every training sample
  3. Use k-nearest-neighbor voting on the k closest training samples
  4. The predicted class is the majority vote

No embeddings, no weights, no gradients, no GPU. Just information theory.

Why This Works

Gzip captures statistical regularities in text. Texts from the same category compress better together (lower NCD) because they share vocabulary, phrasing, and statistical patterns. The compressor acts as an implicit similarity measure grounded in Kolmogorov complexity.

Usage

import gzip
from collections import Counter
import numpy as np

def gzip_classify(test_text, train_texts, train_labels, k=7):
    Cx1 = len(gzip.compress(test_text.encode()))
    distances = []
    for x2, label in zip(train_texts, train_labels):
        Cx2 = len(gzip.compress(x2.encode()))
        Cx1x2 = len(gzip.compress(" ".join([test_text, x2]).encode()))
        ncd = (Cx1x2 - min(Cx1, Cx2)) / max(Cx1, Cx2)
        distances.append((ncd, label))
    distances.sort()
    top_k = [label for _, label in distances[:k]]
    return Counter(top_k).most_common(1)[0][0]

Hardware Requirements

  • CPU only βœ…
  • 0 parameters βœ…
  • 0 training βœ…
  • No GPU needed βœ…
  • No dependencies beyond gzip (stdlib) and numpy

Citation

@article{jiang2022less,
  title={Less is More: Parameter-Free Text Classification with Gzip},
  author={Jiang, Zhiying and Yang, Matthew YR and Tsirlin, Mikhail and Tang, Raphael and Lin, Jimmy},
  journal={arXiv preprint arXiv:2212.09410},
  year={2022}
}