---
license: apache-2.0
tags:
- bigsmall
- compression
- lossless
- gpt2
---

# GPT-2 (BigSmall compressed)

**548 MB → 414 MB (75.5%). Bit-identical. Under 500 MB peak RAM with streaming.**

This is GPT-2 (117M parameters) compressed with [BigSmall](https://github.com/wpferrell/Bigsmall), a lossless compressor for neural network weights. Not quantization. Every weight is bit-identical to the original.

## Install

```bash
pip install bigsmall
```

## Load and run inference (streaming)

```python
from bigsmall import StreamingLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Streams one layer at a time — under 500MB peak RAM
loader = StreamingLoader("wpferrell/gpt2-bigsmall")
model = loader.load_model(GPT2LMHeadModel)
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")

inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
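The under-500 MB figure is straightforward to sanity-check on your own machine. Here is a minimal sketch (not part of the bigsmall API) that reads the process's peak resident set size after loading, using the standard-library `resource` module; it works on Linux and macOS only, and note that `ru_maxrss` is reported in KiB on Linux but in bytes on macOS.

```python
import sys
import resource

from bigsmall import StreamingLoader
from transformers import GPT2LMHeadModel

loader = StreamingLoader("wpferrell/gpt2-bigsmall")
model = loader.load_model(GPT2LMHeadModel)

# Peak resident set size of this process so far.
# ru_maxrss is in KiB on Linux and in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
print(f"Peak RSS: {peak / divisor:.0f} MiB")
```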

## Or decompress to disk first

```python
from bigsmall import from_pretrained
from transformers import GPT2LMHeadModel

model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
```
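The object returned by `from_pretrained` is a plain `GPT2LMHeadModel`, so if you want an ordinary uncompressed checkpoint on disk you can write one out with the standard `save_pretrained` call (the directory name below is just an example):

```python
# Write a regular, uncompressed Hugging Face checkpoint to disk.
model.save_pretrained("./gpt2-decompressed")

# From here on, vanilla transformers can load it; bigsmall is no longer needed.
from transformers import GPT2LMHeadModel
reloaded = GPT2LMHeadModel.from_pretrained("./gpt2-decompressed")
```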

## What's inside

| File | Original | Compressed | Ratio |
|------|----------|------------|-------|
| model.safetensors (FP32) | 548 MB | 414 MB | 75.5% |

Verified lossless: after decompression, the MD5 hash of every weight tensor matches the original.
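To reproduce that check yourself, one approach (a sketch, assuming the compressed file came from the upstream `openai-community/gpt2` FP32 checkpoint) is to decompress the model, load the reference weights with vanilla `transformers`, and compare an MD5 hash of every tensor's raw bytes:

```python
import hashlib

import torch
from transformers import GPT2LMHeadModel

from bigsmall import from_pretrained

def tensor_md5(t: torch.Tensor) -> str:
    """MD5 of a tensor's raw bytes (dtype and shape assumed to already match)."""
    return hashlib.md5(t.detach().cpu().contiguous().numpy().tobytes()).hexdigest()

decompressed = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
reference = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

ref_state = reference.state_dict()
mismatches = [
    name
    for name, tensor in decompressed.state_dict().items()
    if tensor_md5(tensor) != tensor_md5(ref_state[name])
]
print("all tensors bit-identical" if not mismatches else f"mismatches: {mismatches}")
```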

## Comparison

| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
|------|------------|------------|-------------------|---------|
| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
| **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |

*Lower ratio = better compression (compressed size as a percentage of the original). BigSmall's BF16 ratio was measured on Mistral 7B.*

## About BigSmall

BigSmall compresses at the joint entropy floor for neural network weights: it codes the sign and exponent jointly, and codes the mantissa conditioned on the exponent, approaching the information-theoretic minimum for the empirical weight distribution. The streaming loader decompresses one transformer layer at a time directly into VRAM.
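To make the sign/exponent/mantissa split concrete, here is an illustrative sketch, not BigSmall's actual codec: it slices the three IEEE-754 fields out of an FP32 array with NumPy and measures the empirical entropy of the joint sign+exponent symbol, the part that is heavily skewed in trained networks and therefore cheap to code. All names here are invented for the example.

```python
import numpy as np

def split_fp32_fields(weights: np.ndarray):
    """Return the IEEE-754 sign (1 bit), exponent (8 bits), and mantissa (23 bits) fields."""
    bits = np.ascontiguousarray(weights, dtype=np.float32).view(np.uint32)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def empirical_entropy_bits(symbols: np.ndarray) -> float:
    """Shannon entropy of the observed symbol distribution, in bits per symbol."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Stand-in for a weight tensor; real checkpoints have even more skewed exponents.
weights = (np.random.randn(100_000) * 0.02).astype(np.float32)
sign, exponent, _ = split_fp32_fields(weights)

# Code sign and exponent as one joint 9-bit symbol, as described above.
joint_sign_exp = (sign.astype(np.uint32) << 8) | exponent
print(f"joint sign+exponent entropy: {empirical_entropy_bits(joint_sign_exp):.2f} of 9 bits")
```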

- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
- PyPI: `pip install bigsmall`