Update model card with streaming loader and benchmark info

README.md CHANGED

@@ -1,50 +1,66 @@
 ---
-license:
+license: apache-2.0
 tags:
 - bigsmall
 - compression
 - lossless
 - gpt2
 ---

 # GPT-2 (BigSmall compressed)

-
+**548MB → 414MB (75.5%). Bit-identical. Under 500MB peak RAM with streaming.**

-
+This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Not quantization. Every weight is bit-identical to the original.

-##
+## Install

-```
+```bash
 pip install bigsmall
 ```

+## Load and run inference
+
 ```python
-
+from bigsmall import StreamingLoader
 from transformers import GPT2LMHeadModel, GPT2Tokenizer

-#
-
-
-
-model = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=state_dict)
-tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+# Decompress one layer at a time — under 500MB peak RAM
+loader = StreamingLoader("wpferrell/gpt2-bigsmall")
+model = loader.load_model(GPT2LMHeadModel)
+tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")

-# Run inference - identical to original GPT-2
 inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=
+outputs = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(outputs[0]))
 ```

-##
-
-
-
-
-
+## Or decompress to disk first
+
+```python
+from bigsmall import from_pretrained
+model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
+```
+
+## What's inside
+
+| File | Original | Compressed | Ratio |
+|------|----------|------------|-------|
+| model.safetensors (FP32) | 548 MB | 414 MB | 75.5% |
+
+Verified lossless: md5 of every weight tensor matches original after decompression.
+
+## vs other compression tools
+
+| Tool | BF16 Ratio | Inference Overhead | Hardware |
+|------|------------|-------------------|----------|
+| ZipNN | ~83% | None | CPU |
+| DFloat11 | ~70% | ~2x at batch=1 | CUDA only |
+| **BigSmall** | **59.8%** | **None** | **CPU + GPU** |

 ## About BigSmall
-BigSmall compresses model weights losslessly. Unlike quantization, the decompressed weights are bit-for-bit identical to the originals. Supports BF16, FP16, FP32, FP64, INT8 formats across LLMs and diffusion models.

+BigSmall compresses at the Shannon entropy floor for neural network weights. It detects the float format automatically (FP32, BF16, FP16, FP8, FP4) and applies the optimal lossless codec for each tensor. Streaming loader decompresses one layer at a time directly into VRAM — peak RAM stays under 2GB even for 7B models.
+
+- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
 - PyPI: `pip install bigsmall`
-- GitHub: https://github.com/wpferrell/Bigsmall
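
The card's "Verified lossless" line can be spot-checked directly. The sketch below loads the original checkpoint with `transformers` and the compressed one with the `from_pretrained` helper shown in the card, then compares every tensor exactly. The repo IDs and the helper come from the card above; the comparison itself is my own minimal check, not BigSmall's verification script.

```python
import torch
from transformers import GPT2LMHeadModel
from bigsmall import from_pretrained  # helper shown in the card above

# Reference weights straight from the original repo.
reference = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# The same model reconstructed from the BigSmall-compressed repo.
restored = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)

ref_state = reference.state_dict()
new_state = restored.state_dict()

assert ref_state.keys() == new_state.keys(), "tensor names differ"
for name, ref_tensor in ref_state.items():
    # torch.equal checks shape and exact element equality, with no tolerance.
    assert torch.equal(ref_tensor, new_state[name]), f"mismatch in {name}"

print(f"all {len(ref_state)} tensors match exactly")
```

A stricter variant would compare the raw bytes of each tensor (for example via `Tensor.view(torch.uint8)`), but exact value equality already rules out any rounding that a lossy codec would introduce.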
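
For the streaming loader's memory claim (under 500MB peak RAM for this model), a rough measurement with the standard-library `resource` module is enough to sanity-check it on Linux or macOS. `StreamingLoader` and `load_model` are used exactly as in the card's example; the measurement wrapper is plain plumbing, not part of BigSmall.

```python
import resource
import sys

from bigsmall import StreamingLoader  # loader API as shown in the card
from transformers import GPT2LMHeadModel

def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return rss / 1024 if sys.platform.startswith("linux") else rss / (1024 ** 2)

print(f"peak RSS before load: {peak_rss_mb():.0f} MB")

loader = StreamingLoader("wpferrell/gpt2-bigsmall")
model = loader.load_model(GPT2LMHeadModel)

print(f"peak RSS after load:  {peak_rss_mb():.0f} MB")
```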
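
The "Shannon entropy floor" wording in the About section has a concrete reading: the bytes of trained FP32 weights are far from uniformly distributed (the sign-and-exponent byte in particular is highly skewed), so an order-0 entropy estimate over per-position byte streams already bounds how much a lossless codec can save. The sketch below runs that estimate on one GPT-2 tensor; it is a back-of-the-envelope illustration of the idea, not BigSmall's actual codec, and the tensor chosen is arbitrary.

```python
import math
from collections import Counter

import numpy as np
from transformers import GPT2LMHeadModel

def byte_stream_entropy(stream: np.ndarray) -> float:
    """Order-0 Shannon entropy of a byte stream, in bits per byte."""
    counts = Counter(stream.tolist())
    total = len(stream)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
# One arbitrary FP32 weight tensor from the original checkpoint.
weights = model.transformer.h[0].attn.c_attn.weight.detach().numpy()

# Little-endian FP32: byte 0 holds the lowest mantissa bits, byte 3 the sign
# plus most of the exponent. Split into four per-position byte streams.
raw = np.frombuffer(weights.astype(np.float32).tobytes(), dtype=np.uint8)
streams = [raw[i::4] for i in range(4)]

bits_per_float = sum(byte_stream_entropy(s) for s in streams)
print(f"order-0 entropy floor ~ {bits_per_float:.1f} bits per float "
      f"({bits_per_float / 32:.1%} of 32-bit storage)")
```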