Fix comparison table: correct BF16 ratio, add ZipServ, add streaming example
README.md
CHANGED
@@ -19,13 +19,13 @@ This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall
 pip install bigsmall
 ```
 
-## Load and run inference
+## Load and run inference (streaming)
 
 ```python
 from bigsmall import StreamingLoader
 from transformers import GPT2LMHeadModel, GPT2Tokenizer
 
-#
+# Streams one layer at a time — under 500MB peak RAM
 loader = StreamingLoader("wpferrell/gpt2-bigsmall")
 model = loader.load_model(GPT2LMHeadModel)
 tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
@@ -39,6 +39,7 @@ print(tokenizer.decode(outputs[0]))
 
 ```python
 from bigsmall import from_pretrained
+from transformers import GPT2LMHeadModel
 model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
 ```
 
@@ -50,17 +51,20 @@ model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
 
 Verified lossless: md5 of every weight tensor matches original after decompression.
 
-##
+## Comparison
 
-| Tool | BF16 Ratio | Inference Overhead | Hardware |
-|------|------------|-------------------|---------|
-| ZipNN |
-| DFloat11 | ~70% | ~2x at batch=1 | CUDA only |
-
+| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
+|------|------------|------------|-------------------|---------|
+| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
+| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
+| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
+| **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |
+
+*Lower ratio = better compression. BigSmall BF16 measured on Mistral 7B.*
 
 ## About BigSmall
 
-BigSmall compresses at the
+BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM.
 
 - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
 - PyPI: `pip install bigsmall`