Fix comparison table: correct BF16 ratio, add ZipServ, add streaming example

131e635 verified about 16 hours ago

2.65 kB

	---
	license: apache-2.0
	tags:
	- bigsmall
	- compression
	- lossless
	- mistral
	---

	# Mistral 7B Instruct (BigSmall compressed)

	14 GB → 9.3 GB. Under 2 GB peak RAM. Full quality — not quantization.

	This is Mistral-7B-Instruct-v0.3 compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Every weight is bit-identical to the original. No accuracy loss whatsoever.

	## Install

	```bash
	pip install bigsmall
	```

	## Load and run inference (streaming — under 2GB peak RAM)

	```python
	from bigsmall import StreamingLoader
	from transformers import MistralForCausalLM, AutoTokenizer

	# Streams one layer at a time — 9.3GB download, under 2GB peak RAM
	loader = StreamingLoader("wpferrell/mistral-7b-instruct-bigsmall")
	model = loader.load_model(MistralForCausalLM)
	tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

	messages = [{"role": "user", "content": "Explain lossless compression in one paragraph."}]
	inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
	outputs = model.generate(inputs, max_new_tokens=200)
	print(tokenizer.decode(outputs[0]))
	```

	## Or decompress to disk first

	```python
	from bigsmall import from_pretrained
	from transformers import MistralForCausalLM
	model = from_pretrained("wpferrell/mistral-7b-instruct-bigsmall", model_class=MistralForCausalLM)
	```

	## Compression stats

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Original size \| 14.2 GB \|
	\| Compressed size \| 9.3 GB \|
	\| Ratio \| 65.6% (BF16) \|
	\| Format \| BF16 → BigSmall (.bs shards) \|
	\| Lossless verified \| md5 every tensor \|
	\| Peak RAM (streaming) \| < 2 GB \|

	## Comparison

	\| Tool \| BF16 Ratio \| FP32 Ratio \| Inference Overhead \| Hardware \|
	\|------\|------------\|------------\|-------------------\|---------\|
	\| [ZipNN](https://arxiv.org/abs/2411.05239) \| 67% \| 83% \| None \| CPU \|
	\| [DFloat11](https://arxiv.org/abs/2504.11651) \| ~70% \| BF16 only \| ~2x at batch=1 \| CUDA only \|
	\| [ZipServ](https://arxiv.org/abs/2603.17435) \| ~70% \| BF16 only \| 1.22x faster \| GDDR GPU \|
	\| BigSmall \| 65.6% \| 75.5% \| None \| CPU + any GPU \|

	Lower ratio = better compression.

	## About BigSmall

	BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM — making 7B+ models accessible on hardware that couldn't otherwise load them.

	- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
	- PyPI: `pip install bigsmall`