	---
	license: apache-2.0
	tags:
	- bigsmall
	- compression
	- lossless
	- qwen2
	---

	# Qwen 2.5 7B Instruct (BigSmall compressed)

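	15.2 GB → 10.1 GB (the compressed files are 66.0% of the original size). Peak RAM stays under 2 GB with the streaming loader. Full quality: this is lossless compression, not quantization.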

	This is Qwen2.5-7B-Instruct compressed with [BigSmall](https://github.com/wpferrell/Bigsmall), a lossless compressor for neural network weights. After decompression, every weight is bit-identical to the original, so there is no accuracy loss.

	## Install

	```bash
	pip install bigsmall
	```

	## Load and run inference (streaming, under 2 GB peak RAM)

	```python
	from bigsmall import StreamingLoader
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# Stream the compressed shards; weights are decompressed one layer at a time.
	loader = StreamingLoader("wpferrell/qwen2.5-7b-instruct-bigsmall")
	model = loader.load_model(AutoModelForCausalLM)
	# The tokenizer is loaded from the original, uncompressed repository.
	tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

	# Build a chat-formatted prompt and move the inputs to the model's device.
	messages = [{"role": "user", "content": "Explain lossless compression."}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([text], return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=200)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	## Or use `AutoModelForCausalLM` with the transparent hook

	```python
	import bigsmall
	bigsmall.install_hook()  # call this before from_pretrained below

	from transformers import AutoModelForCausalLM
	model = AutoModelForCausalLM.from_pretrained("wpferrell/qwen2.5-7b-instruct-bigsmall")
	```

	## Compression stats

	| Metric | Value |
	|--------|-------|
	| Original size | 15.2 GB |
	| Compressed size | 10.1 GB |
	| Ratio (compressed / original) | 66.0% (BF16) |
	| Format | BF16 → BigSmall (.bs shards) |
	| Lossless verified | MD5 checked for every tensor |
	| Peak RAM (streaming) | < 2 GB |
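The per-tensor MD5 check listed in the table can be illustrated with a minimal sketch. This is not BigSmall's own verification code: the function names `tensor_md5` and `find_mismatches` and the toy tensor names are hypothetical, and tensors are represented here as raw byte buffers so the example needs only the standard library.

```python
import hashlib


def tensor_md5(raw: bytes) -> str:
    """Return the hex MD5 digest of one tensor's raw bytes."""
    return hashlib.md5(raw).hexdigest()


def find_mismatches(original: dict, restored: dict) -> list:
    """Return names of tensors that differ or exist on only one side."""
    names = sorted(set(original) | set(restored))
    return [
        name
        for name in names
        if name not in original
        or name not in restored
        or tensor_md5(original[name]) != tensor_md5(restored[name])
    ]


# Toy stand-ins for real tensors: raw byte buffers keyed by made-up names.
original = {
    "layer0.weight": bytes([0x3F, 0x80, 0x00, 0x40]),
    "layer0.bias": bytes(4),
}
restored = dict(original)  # a lossless round trip returns identical bytes
print(find_mismatches(original, restored))  # []
```

An empty list means every tensor matched. For a real checkpoint, the byte buffers would come from each tensor's underlying storage in the original and the decompressed model.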

	## Comparison

	| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
	|------|------------|------------|-------------------|---------|
	| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
	| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
	| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
	| BigSmall | 65.6% | 75.5% | None | CPU + any GPU |

	## About BigSmall

	BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM.
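The field split described above can be made concrete with a rough sketch that estimates the empirical entropy of BF16 weights as H(sign, exponent) + H(mantissa | exponent). It is an illustration only, not BigSmall's coder: the helper names are hypothetical, the conversion to BF16 truncates float32 rather than rounding, and conditioning the mantissa on the exponent alone (rather than on sign and exponent) is an assumption.

```python
import math
import struct
from collections import Counter


def bf16_fields(x: float):
    """Split x, truncated to BF16, into (sign, exponent, mantissa)."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16            # BF16 keeps the top 16 bits of float32
    sign = bits16 >> 15              # 1 bit
    exponent = (bits16 >> 7) & 0xFF  # 8 bits
    mantissa = bits16 & 0x7F         # 7 bits
    return sign, exponent, mantissa


def entropy(counts: Counter) -> float:
    """Empirical Shannon entropy in bits of a table of symbol counts."""
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())


def bits_per_weight(weights) -> float:
    """Estimate H(sign, exponent) + H(mantissa | exponent) in bits."""
    fields = [bf16_fields(w) for w in weights]
    h_sign_exp = entropy(Counter((s, e) for s, e, _ in fields))
    # Chain rule: H(M | E) = H(E, M) - H(E)
    h_exp = entropy(Counter(e for _, e, _ in fields))
    h_exp_man = entropy(Counter((e, m) for _, e, m in fields))
    return h_sign_exp + (h_exp_man - h_exp)


print(bits_per_weight([1.0, 2.0, 1.0, 2.0]))  # 1.0
```

Dividing the result by 16 gives an estimated compressed-to-original ratio for BF16 under this simple model. On real weights the estimate depends on how many samples are used to build the histograms.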

	- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
	- PyPI: `pip install bigsmall`
	- Paper: [BigSmall: Lossless Neural Network Weight Compression at the Joint Entropy Floor](https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf)