Add model card
README.md
ADDED

---
license: apache-2.0
tags:
- bigsmall
- compression
- lossless
- qwen2
---

# Qwen 2.5 7B Instruct (BigSmall compressed)

**15.2 GB → 10.1 GB (66.0%). Under 2 GB peak RAM. Full quality, not quantization.**

This is Qwen2.5-7B-Instruct compressed with [BigSmall](https://github.com/wpferrell/Bigsmall), a lossless neural-network weight compressor. Every weight is bit-identical to the original: no accuracy loss whatsoever.

## Install

```bash
pip install bigsmall
```

## Load and run inference (streaming, under 2 GB peak RAM)

```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stream-decompress the .bs shards layer by layer; peak RAM stays under 2 GB.
loader = StreamingLoader("wpferrell/qwen2.5-7b-instruct-bigsmall")
model = loader.load_model(AutoModelForCausalLM)

# The tokenizer is unchanged, so load it from the original repo.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [{"role": "user", "content": "Explain lossless compression."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Or use AutoModel with the transparent hook

```python
import bigsmall
bigsmall.install_hook()  # lets from_pretrained load BigSmall (.bs) shards transparently

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("wpferrell/qwen2.5-7b-instruct-bigsmall")
```
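
Once the hook is installed the result is an ordinary `transformers` model, and because decompression is lossless, `generate` and the rest of the API should behave exactly as they do with the uncompressed checkpoint.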

## Compression stats

| Metric | Value |
|--------|-------|
| Original size | 15.2 GB |
| Compressed size | 10.1 GB |
| Compressed / original | 66.0% (BF16) |
| Format | BF16 → BigSmall (.bs shards) |
| Lossless verification | MD5 checksum on every tensor |
| Peak RAM (streaming) | < 2 GB |
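
Losslessness is straightforward to check yourself. A minimal sketch (assuming you have enough RAM to hold both copies; `load_model` is the loader API shown above):

```python
import torch
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM

# Load the BigSmall copy and the original BF16 checkpoint side by side.
compressed = StreamingLoader("wpferrell/qwen2.5-7b-instruct-bigsmall").load_model(AutoModelForCausalLM)
original = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16)

# Bit-identical means every tensor compares exactly equal, not merely close.
ref = original.state_dict()
for name, tensor in compressed.state_dict().items():
    assert torch.equal(tensor, ref[name]), f"mismatch in {name}"
print("all tensors bit-identical")
```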

## Comparison

| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
|------|------------|------------|--------------------|----------|
| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
| **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |

## About BigSmall

BigSmall compresses at the joint entropy floor for neural network weights: it codes sign and exponent jointly, and the mantissa conditioned on the exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time, directly into VRAM.
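
To make the entropy floor concrete, here is a small sketch (illustrative only, not BigSmall's actual coder) that estimates H(sign, exponent) + H(mantissa | exponent) for a BF16 tensor, i.e. the achievable bits per weight:

```python
import torch

def entropy_bits(counts: torch.Tensor) -> float:
    """Shannon entropy, in bits, of an empirical count vector."""
    p = counts.float() / counts.sum()
    p = p[p > 0]
    return float(-(p * p.log2()).sum())

def bf16_floor_bits(w: torch.Tensor) -> float:
    # Reinterpret BF16 as raw 16-bit patterns: 1 sign | 8 exponent | 7 mantissa.
    bits = w.to(torch.bfloat16).flatten().view(torch.int16).to(torch.int64) & 0xFFFF
    sign_exp = bits >> 7                 # 9-bit (sign, exponent) symbol, coded jointly
    exponent = sign_exp & 0xFF
    mantissa = bits & 0x7F
    h_joint = entropy_bits(torch.bincount(sign_exp, minlength=512))
    # H(mantissa | exponent): average mantissa entropy within each exponent bucket.
    h_cond = 0.0
    for e in exponent.unique():
        m = mantissa[exponent == e]
        h_cond += m.numel() / bits.numel() * entropy_bits(torch.bincount(m, minlength=128))
    return h_joint + h_cond              # achievable bits per weight, vs. 16 raw

bpw = bf16_floor_bits(torch.randn(1_000_000))  # stand-in for a real weight tensor
print(f"~{bpw:.2f} bits/weight ({bpw / 16:.1%} of raw BF16)")
```

Real weight tensors typically have more concentrated exponent distributions than this Gaussian stand-in, which is what brings the floor down toward the ratios reported above.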

- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
- PyPI: `pip install bigsmall`
- Paper: [BigSmall: Lossless Neural Network Weight Compression at the Joint Entropy Floor](https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf)