Fix comparison table: correct BF16 ratio, add ZipServ, add streaming example
README.md
CHANGED
@@ -19,13 +19,13 @@ This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall
 pip install bigsmall
 ```
 
-## Load and run inference
+## Load and run inference (streaming)
 
 ```python
 from bigsmall import StreamingLoader
 from transformers import GPT2LMHeadModel, GPT2Tokenizer
 
-#
+# Streams one layer at a time — under 500MB peak RAM
 loader = StreamingLoader("wpferrell/gpt2-bigsmall")
 model = loader.load_model(GPT2LMHeadModel)
 tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
@@ -39,6 +39,7 @@ print(tokenizer.decode(outputs[0]))
 
 ```python
 from bigsmall import from_pretrained
+from transformers import GPT2LMHeadModel
 model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
 ```
 
@@ -50,17 +51,20 @@ model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
 
 Verified lossless: md5 of every weight tensor matches original after decompression.
 
-##
+## Comparison
 
-| Tool | BF16 Ratio | Inference Overhead | Hardware |
-|------|------------|-------------------|---------|
-| ZipNN |
-| DFloat11 | ~70% | ~2x at batch=1 | CUDA only |
-
+| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
+|------|------------|------------|-------------------|---------|
+| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
+| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
+| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
+| **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |
+
+*Lower ratio = better compression. BigSmall BF16 measured on Mistral 7B.*
 
 ## About BigSmall
 
-BigSmall compresses at the
+BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM.
 
 - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
 - PyPI: `pip install bigsmall`