gpt2 · bigsmall · compression · lossless
wpferrell committed · Commit d0677c7 · verified · 1 Parent(s): 6fd0437

Fix comparison table: correct BF16 ratio, add ZipServ, add streaming example

Files changed (1): README.md (+13 -9)
README.md CHANGED
@@ -19,13 +19,13 @@ This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall)
  pip install bigsmall
  ```
 
- ## Load and run inference
+ ## Load and run inference (streaming)
 
  ```python
  from bigsmall import StreamingLoader
  from transformers import GPT2LMHeadModel, GPT2Tokenizer
 
- # Decompress one layer at a time — under 500MB peak RAM
+ # Streams one layer at a time — under 500MB peak RAM
  loader = StreamingLoader("wpferrell/gpt2-bigsmall")
  model = loader.load_model(GPT2LMHeadModel)
  tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
@@ -39,6 +39,7 @@ print(tokenizer.decode(outputs[0]))
 
  ```python
  from bigsmall import from_pretrained
+ from transformers import GPT2LMHeadModel
  model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
  ```
 
@@ -50,17 +51,20 @@ model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
 
  Verified lossless: md5 of every weight tensor matches original after decompression.
 
- ## vs other compression tools
+ ## Comparison
 
- | Tool | BF16 Ratio | Inference Overhead | Hardware |
- |------|------------|--------------------|----------|
- | ZipNN | ~83% | None | CPU |
- | DFloat11 | ~70% | ~2x at batch=1 | CUDA only |
- | **BigSmall** | **59.8%** | **None** | **CPU + GPU** |
+ | Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
+ |------|------------|------------|--------------------|----------|
+ | [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
+ | [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
+ | [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
+ | **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |
+
+ *Lower ratio = better compression. BigSmall BF16 measured on Mistral 7B.*
 
  ## About BigSmall
 
- BigSmall compresses at the Shannon entropy floor for neural network weights. It detects the float format automatically (FP32, BF16, FP16, FP8, FP4) and applies the optimal lossless codec for each tensor. Streaming loader decompresses one layer at a time directly into VRAM — peak RAM stays under 2GB even for 7B models.
+ BigSmall compresses at the joint entropy floor for neural network weights. It codes sign+exponent jointly and mantissa conditioned on exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM.
 
  - GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
  - PyPI: `pip install bigsmall`
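
The "verified lossless" line in the README is easy to reproduce locally. Below is a minimal sketch of that check, assuming only the `bigsmall` calls exactly as the README shows (`from_pretrained` with a `model_class`); the hashing itself is plain `hashlib`, and GPT-2's FP32 tensors convert cleanly to NumPy for byte-level hashing.

```python
# Minimal sketch: MD5 every decompressed tensor and compare against the
# reference openai-community/gpt2 checkpoint. The bigsmall calls are assumed
# to match the README above; nothing else here is bigsmall-specific.
import hashlib

from bigsmall import from_pretrained
from transformers import GPT2LMHeadModel

def md5_of(tensor):
    # Hash the raw bytes of the tensor (FP32 for GPT-2).
    return hashlib.md5(tensor.detach().cpu().contiguous().numpy().tobytes()).hexdigest()

decompressed = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
reference = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

ref_state = reference.state_dict()
for name, tensor in decompressed.state_dict().items():
    assert md5_of(tensor) == md5_of(ref_state[name]), f"{name} differs after decompression"
print(f"all {len(ref_state)} tensors match")
```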
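
The "joint entropy floor" claim in the About section can also be made concrete with a quick measurement. The sketch below is illustrative only, not BigSmall's actual codec: it estimates per-weight entropy for an FP32 tensor using the decomposition the README describes (sign+exponent coded jointly, mantissa conditioned on the exponent), modeling only the top 8 mantissa bits and treating the low 15 as incompressible raw bits. For scale, the table's 65.6% BF16 ratio would take Mistral 7B's roughly 14.5 GB of BF16 weights down to about 9.5 GB.

```python
# Illustrative entropy-floor estimate (NOT BigSmall's codec): sign+exponent
# modeled jointly, top mantissa bits modeled conditioned on the exponent.
import numpy as np

def entropy_bits(counts: np.ndarray) -> float:
    p = counts[counts > 0].astype(np.float64)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

# Stand-in tensor; point this at a real checkpoint tensor for meaningful numbers.
w = np.random.randn(1 << 20).astype(np.float32)
bits = w.view(np.uint32)

sign_exp = bits >> 23             # top 9 bits: sign + 8-bit exponent
exponent = (bits >> 23) & 0xFF    # exponent alone, for the conditional model
mant_hi = (bits >> 15) & 0xFF     # top 8 mantissa bits; low 15 kept raw

h_sign_exp = entropy_bits(np.bincount(sign_exp, minlength=512))

# H(mant_hi | exponent) = H(exponent, mant_hi) - H(exponent)
h_exp = entropy_bits(np.bincount(exponent, minlength=256))
h_joint = entropy_bits(np.bincount(exponent * 256 + mant_hi, minlength=256 * 256))
h_mant_given_exp = h_joint - h_exp

floor = h_sign_exp + h_mant_given_exp + 15  # +15 raw low mantissa bits
print(f"~{floor:.2f} bits/weight ({floor / 32:.1%} of FP32)")
```

On a real checkpoint the exponent distribution is heavily skewed, which is where most of the savings in the FP32 column come from; the mantissa bits of trained weights are close to uniform, so they compress very little.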