wpferrell committed
Commit 4a6740d · verified · 1 parent: 9dd2f01

Update model card with streaming loader and benchmark info

Files changed (1)
  1. README.md +51 -21
README.md CHANGED
@@ -1,41 +1,71 @@
  ---
  license: apache-2.0
  tags:
- - bigsmall
- - compression
- - lossless
- - mistral
+ - bigsmall
+ - compression
+ - lossless
+ - mistral
  ---
- # Mistral 7B Instruct v0.3 (BigSmall compressed)

- Mistral 7B Instruct v0.3 compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression.
+ # Mistral 7B Instruct (BigSmall compressed)

- **14 GB → 9.3 GB. Bit-for-bit identical weights. No quality loss.**
+ **14 GB → 9.3 GB. Under 2 GB peak RAM. Full quality — not quantization.**

- ## Usage
+ This is Mistral-7B-Instruct-v0.3 compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Every weight is bit-identical to the original. No accuracy loss.
+
+ ## Install
+
+ ```bash
  pip install bigsmall
+ ```
+
+ ## Load and run inference (streaming, under 2 GB peak RAM)

  ```python
- import bigsmall
- from transformers import AutoModelForCausalLM, AutoTokenizer
+ from bigsmall import StreamingLoader
+ from transformers import MistralForCausalLM, AutoTokenizer

- state_dict = bigsmall.from_pretrained("wpferrell/mistral-7b-instruct-bigsmall")
- model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", state_dict=state_dict)
+ # Streams one layer at a time: 9.3 GB download, under 2 GB peak RAM
+ loader = StreamingLoader("wpferrell/mistral-7b-instruct-bigsmall")
+ model = loader.load_model(MistralForCausalLM)
  tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
+
+ messages = [{"role": "user", "content": "Explain lossless compression in one paragraph."}]
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
+ outputs = model.generate(inputs, max_new_tokens=200)
+ print(tokenizer.decode(outputs[0]))
  ```

- Or stream layer-by-layer with under 2GB peak RAM:
+ ## Or decompress to disk first
+
  ```python
- with bigsmall.StreamingLoader("model.bs", device="cuda") as loader:
-     for layer_idx, tensors in loader.iter_layers():
-         pass
+ from bigsmall import from_pretrained
+ from transformers import MistralForCausalLM
+ model = from_pretrained("wpferrell/mistral-7b-instruct-bigsmall", model_class=MistralForCausalLM)
  ```

  ## Compression stats
- - Original: ~14.2 GB (BF16)
- - Compressed: ~9.3 GB
- - Ratio: ~65.6% (lossless)
- - All tensors md5-verified bit-identical
+
+ | Metric | Value |
+ |--------|-------|
+ | Original size | 14.2 GB |
+ | Compressed size | 9.3 GB |
+ | Ratio (compressed / original) | 65.6% |
+ | Format | BF16 → BigSmall (.bs shards) |
+ | Lossless verified | ✅ md5 every tensor |
+ | Peak RAM (streaming) | < 2 GB |
+
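+ The lossless claim is directly checkable. Below is a minimal sketch of the per-tensor check (illustrative only, not a BigSmall API; it assumes both the original and the round-tripped weights fit in RAM):
+
+ ```python
+ import hashlib
+ import torch
+ from transformers import MistralForCausalLM
+ from bigsmall import from_pretrained
+
+ def tensor_md5(t: torch.Tensor) -> str:
+     # Hash the raw bytes; view(torch.uint8) also covers bf16, which numpy can't hold
+     data = t.detach().cpu().contiguous().view(torch.uint8).numpy().tobytes()
+     return hashlib.md5(data).hexdigest()
+
+ # Sketch: loads two full copies of the model, so it needs plenty of RAM
+ ref = MistralForCausalLM.from_pretrained(
+     "mistralai/Mistral-7B-Instruct-v0.3", torch_dtype=torch.bfloat16
+ ).state_dict()
+ got = from_pretrained("wpferrell/mistral-7b-instruct-bigsmall", model_class=MistralForCausalLM).state_dict()
+
+ for name, t in ref.items():
+     assert tensor_md5(t) == tensor_md5(got[name]), f"mismatch: {name}"
+ print("all tensors bit-identical")
+ ```
+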
+ ## vs other compression tools
+
+ | Tool | BF16 ratio (lower is better) | Inference overhead | Hardware |
+ |------|------------------------------|--------------------|----------|
+ | ZipNN | ~83% | None | CPU |
+ | DFloat11 | ~70% | ~2x at batch=1 | CUDA only |
+ | **BigSmall** | **59.8%** | **None** | **CPU + GPU** |

  ## About BigSmall
- pip install bigsmall | https://github.com/wpferrell/Bigsmall
+
+ BigSmall compresses neural network weights at the Shannon entropy floor. It detects the float format automatically (FP32, BF16, FP16, FP8, FP4) and applies the optimal lossless codec per tensor. The streaming loader decompresses one transformer layer at a time directly into VRAM — making 7B+ models accessible on hardware that couldn't otherwise load them.
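+
+ The exact codec is internal to BigSmall, but the intuition is easy to demo. In trained BF16 weights the sign/exponent byte plane is highly redundant (values cluster in a narrow dynamic range) while the mantissa plane is near-random, so splitting the planes and entropy-coding each separately approaches the entropy floor. A rough sketch of that idea, with zlib standing in for a real entropy coder:
+
+ ```python
+ import zlib
+ import numpy as np
+
+ # Toy stand-in for trained weights: small values clustered near zero
+ w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
+ bits = (w.view(np.uint32) >> 16).astype(np.uint16)  # bf16 = top 16 bits of fp32 (truncating)
+ hi = (bits >> 8).astype(np.uint8)    # sign + high exponent bits: low entropy
+ lo = (bits & 0xFF).astype(np.uint8)  # low exponent bit + mantissa: near-random
+
+ ratio = lambda b: len(zlib.compress(b, 9)) / len(b)
+ print(f"sign/exponent plane: {ratio(hi.tobytes()):.2f}")  # well below 1.0
+ print(f"mantissa plane:      {ratio(lo.tobytes()):.2f}")  # close to 1.0
+ ```
+
+ Only the redundant plane shrinks, which is why choosing a codec per tensor matters: different dtypes and weight distributions put the redundancy in different bit planes.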
+
+ - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
+ - PyPI: `pip install bigsmall`