---
license: apache-2.0
tags:
- bigsmall
- compression
- lossless
- mistral
---
# Mistral 7B Instruct (BigSmall compressed)
**14.2 GB → 9.3 GB. Under 2 GB peak RAM. Full quality, not quantization.**
This is Mistral-7B-Instruct-v0.3 compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Every weight is bit-identical to the original. No accuracy loss whatsoever.
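To spot-check the bit-identical claim yourself, here is a minimal sketch. It uses the `from_pretrained` API shown further down, assumes it returns a standard `MistralForCausalLM`, and needs enough RAM to hold both copies (roughly 28 GB):
```python
import torch
from bigsmall import from_pretrained
from transformers import MistralForCausalLM

# Load the original weights and the decompressed copy side by side.
original = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3", torch_dtype=torch.bfloat16
)
restored = from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall", model_class=MistralForCausalLM
)

restored_sd = restored.state_dict()
for name, tensor in original.state_dict().items():
    # Lossless means bit-identical: torch.equal, not torch.allclose.
    assert torch.equal(tensor, restored_sd[name]), f"mismatch in {name}"
print("every tensor bit-identical")
```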
## Install
```bash
pip install bigsmall
```
## Load and run inference (streaming, under 2 GB peak RAM)
```python
from bigsmall import StreamingLoader
from transformers import MistralForCausalLM, AutoTokenizer
# Streams one layer at a time: 9.3 GB download, under 2 GB peak RAM
loader = StreamingLoader("wpferrell/mistral-7b-instruct-bigsmall")
model = loader.load_model(MistralForCausalLM)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
messages = [{"role": "user", "content": "Explain lossless compression in one paragraph."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
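To sanity-check the peak-RAM claim on your own machine, one option is the standard library's `resource` module (Linux reports `ru_maxrss` in kilobytes; macOS uses bytes):
```python
import resource
from bigsmall import StreamingLoader
from transformers import MistralForCausalLM

loader = StreamingLoader("wpferrell/mistral-7b-instruct-bigsmall")
model = loader.load_model(MistralForCausalLM)

# Peak resident set size of this process so far (kilobytes on Linux).
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
print(f"Peak RSS: {peak_gb:.2f} GB")
```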
## Or decompress to disk first
```python
from bigsmall import from_pretrained
from transformers import MistralForCausalLM
model = from_pretrained("wpferrell/mistral-7b-instruct-bigsmall", model_class=MistralForCausalLM)
```
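Since `from_pretrained` hands back the model directly (the streaming example feeds the same object straight into `generate`), the decompressed weights can then be cached in ordinary safetensors form with the standard transformers call:
```python
# One-time cost: write regular shards so later runs skip decompression.
model.save_pretrained("./mistral-7b-instruct-decompressed")
```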
## Compression stats
| Metric | Value |
|--------|-------|
| Original size | 14.2 GB |
| Compressed size | 9.3 GB |
| Ratio | 65.6% (BF16) |
| Format | BF16 → BigSmall (.bs shards) |
| Lossless verified | MD5 checksum on every tensor |
| Peak RAM (streaming) | < 2 GB |
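*Ratio = compressed size ÷ original size: 9.3 GB / 14.2 GB ≈ 65.5%, matching the reported 65.6% up to rounding in the quoted sizes.*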
## Comparison
| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
|------|------------|------------|-------------------|---------|
| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
| **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |
*Lower ratio = better compression.*
## About BigSmall
BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM — making 7B+ models accessible on hardware that couldn't otherwise load them.
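The intuition is easy to reproduce. In BF16 (1 sign bit, 8 exponent bits, 7 mantissa bits), trained weights cluster in a narrow range, so the exponent field carries far fewer than 8 bits of entropy. The toy sketch below is not the BigSmall codec; it only measures marginal per-field entropies, which upper-bound what joint and conditional coding achieve, on a synthetic zero-centered tensor:
```python
import numpy as np
import torch

def entropy_bits(symbols: np.ndarray) -> float:
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = np.bincount(symbols)
    p = counts[counts > 0] / symbols.size
    return float(-(p * np.log2(p)).sum())

# Stand-in for a weight tensor: zero-centered, small scale, like LLM weights.
w = torch.randn(1_000_000, dtype=torch.bfloat16) * 0.02

# Reinterpret the BF16 bits as unsigned 16-bit integers and split the fields.
bits = w.view(torch.int16).numpy().astype(np.uint16)
sign     = (bits >> 15).astype(np.int64)           # 1 bit
exponent = ((bits >> 7) & 0xFF).astype(np.int64)   # 8 bits
mantissa = (bits & 0x7F).astype(np.int64)          # 7 bits

print(f"sign:     {entropy_bits(sign):.2f} of 1 bit")
print(f"exponent: {entropy_bits(exponent):.2f} of 8 bits")
print(f"mantissa: {entropy_bits(mantissa):.2f} of 7 bits")
```
On weights like these, the mantissa is close to incompressible while the exponent field collapses to a few bits, which is where most of the size reduction comes from.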
- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
- PyPI: `pip install bigsmall`