---
license: apache-2.0
tags:
- bigsmall
- compression
- lossless
- qwen
- qwen2.5
---
# Qwen 2.5 7B Instruct (BigSmall compressed)
**15.2 GB -> 10.05 GB. Full quality - not quantization.**
This is Qwen2.5-7B-Instruct compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) - lossless neural network weight compression. Every weight is bit-identical to the original. No accuracy loss whatsoever.
## Install
```bash
pip install bigsmall
```
## Load and run inference (streaming)
```python
from bigsmall import StreamingLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
# Streams one layer at a time - saves significant peak RAM
loader = StreamingLoader("wpferrell/qwen2.5-7b-instruct-bigsmall")
model = loader.load_model(AutoModelForCausalLM)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [{"role": "user", "content": "Explain lossless compression in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
```
## Or decompress to disk first
```python
from bigsmall import from_pretrained
from transformers import AutoModelForCausalLM
model = from_pretrained("wpferrell/qwen2.5-7b-instruct-bigsmall", model_class=AutoModelForCausalLM)
```
## Or use the AutoModel hook (transparent)
```python
import bigsmall
bigsmall.install_hook()
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("wpferrell/qwen2.5-7b-instruct-bigsmall")
```
## Compression stats
| Metric | Value |
|--------|-------|
| Original size | 15.2 GB |
| Compressed size | 10.05 GB |
| Ratio (compressed / original) | 66.0% (BF16) |
| Format | BF16 -> BigSmall (.bs shards) |
| Lossless verified | MD5 checksum on every tensor |
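
To illustrate what "lossless verified" means here, below is a minimal sketch of a bit-identity check. The helper names (`tensor_md5`, `verify_lossless`) are hypothetical, not BigSmall's actual verification code, and it assumes you have both the original and the restored model loaded:

```python
import hashlib
import torch

def tensor_md5(t: torch.Tensor) -> str:
    # Hash the raw bytes so even a single flipped bit is detected.
    data = t.detach().contiguous().view(torch.uint8).cpu().numpy().tobytes()
    return hashlib.md5(data).hexdigest()

def verify_lossless(original, restored):
    # Compare every tensor in the two state dicts by checksum.
    ref = original.state_dict()
    for name, t in restored.state_dict().items():
        assert tensor_md5(t) == tensor_md5(ref[name]), f"mismatch: {name}"
    print(f"all {len(ref)} tensors bit-identical")
```
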
## Comparison
| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
|------|------------|------------|-------------------|---------|
| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
| **BigSmall** | **66.0%** | **75.5%** | **None** | **CPU + any GPU** |
*Lower ratio = better compression.*
## About BigSmall
BigSmall compresses neural network weights to their joint entropy floor: it codes the sign and exponent jointly, and the mantissa conditioned on the exponent, reaching the information-theoretic minimum for the weight distribution. The streaming loader decompresses one transformer layer at a time directly into VRAM.
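
To make the coding scheme concrete, here is an illustrative entropy estimate (not BigSmall's implementation) on synthetic Gaussian weights. It splits each BF16 pattern into its bit fields and measures the two code components the paragraph above describes: a joint sign+exponent symbol, and a mantissa coded conditionally on the exponent:

```python
import numpy as np

def entropy(symbols: np.ndarray) -> float:
    # Shannon entropy in bits per symbol.
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Synthetic stand-in for model weights (real weights are roughly Gaussian).
w = np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)
# BF16 is the top 16 bits of float32: [sign:1][exponent:8][mantissa:7].
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)
sign = bf16 >> 15
exp  = (bf16 >> 7) & 0xFF
man  = bf16 & 0x7F

# Joint sign+exponent code: one symbol over both fields.
joint_se = (sign.astype(np.uint32) << 8) | exp
# Mantissa conditioned on exponent: H(man | exp) = H(man, exp) - H(exp).
joint_me = (exp.astype(np.uint32) << 7) | man
h_man_given_exp = entropy(joint_me) - entropy(exp)

print(f"H(sign, exp)     : {entropy(joint_se):.3f} bits")
print(f"H(man | exp)     : {h_man_given_exp:.3f} bits")
print(f"est. bits/weight : {entropy(joint_se) + h_man_given_exp:.3f} (vs 16 raw)")
```

The sum of the two terms is the entropy floor the paragraph refers to; on real checkpoints the achievable rate depends on the actual weight distribution, not this synthetic one.
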
- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
- PyPI: `pip install bigsmall`