---
license: apache-2.0
tags:
- bigsmall
- compression
- lossless
- gpt2
---
# GPT-2 (BigSmall compressed)
**548 MB → 414 MB (75.5% of the original size). Bit-identical. Under 500 MB peak RAM with streaming.**
This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Not quantization. Every weight is bit-identical to the original.
## Install
```bash
pip install bigsmall
```
## Load and run inference (streaming)
```python
from bigsmall import StreamingLoader
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Streams one layer at a time — under 500MB peak RAM
loader = StreamingLoader("wpferrell/gpt2-bigsmall")
model = loader.load_model(GPT2LMHeadModel)
# The tokenizer is unchanged; load it from the original GPT-2 repo
tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
## Or decompress to disk first
```python
from bigsmall import from_pretrained
from transformers import GPT2LMHeadModel
# Decompresses the full checkpoint, then builds a standard transformers model
model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
```
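If you only want to pay the decompression cost once, you can cache the reconstructed weights with the standard `transformers` API (the local path below is illustrative) and reload them later without `bigsmall`:

```python
# Save the decompressed weights once...
model.save_pretrained("./gpt2-decompressed")  # illustrative local path

# ...then reload with plain transformers, no bigsmall required
model = GPT2LMHeadModel.from_pretrained("./gpt2-decompressed")
```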
## What's inside
| File | Original | Compressed | Ratio |
|------|----------|------------|-------|
| model.safetensors (FP32) | 548 MB | 414 MB | 75.5% |
Verified lossless: the MD5 hash of every weight tensor matches the original after decompression.
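A minimal sketch of that check, assuming you have the original checkpoint and a decompressed round trip on disk (both paths are illustrative):

```python
import hashlib
from safetensors import safe_open

def tensor_md5s(path):
    # MD5 digest of each tensor's raw bytes in a safetensors file
    digests = {}
    with safe_open(path, framework="pt") as f:
        for name in f.keys():
            digests[name] = hashlib.md5(f.get_tensor(name).numpy().tobytes()).hexdigest()
    return digests

# Illustrative paths: the original checkpoint vs. a BigSmall round trip
original  = tensor_md5s("gpt2-original/model.safetensors")
roundtrip = tensor_md5s("gpt2-roundtrip/model.safetensors")
assert original == roundtrip, "round trip is not bit-identical"
```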
## Comparison
| Tool | BF16 Ratio | FP32 Ratio | Inference Overhead | Hardware |
|------|------------|------------|-------------------|---------|
| [ZipNN](https://arxiv.org/abs/2411.05239) | 67% | 83% | None | CPU |
| [DFloat11](https://arxiv.org/abs/2504.11651) | ~70% | BF16 only | ~2x at batch=1 | CUDA only |
| [ZipServ](https://arxiv.org/abs/2603.17435) | ~70% | BF16 only | 1.22x faster | GDDR GPU |
| **BigSmall** | **65.6%** | **75.5%** | **None** | **CPU + any GPU** |
*Ratio = compressed size / original size, so lower is better. BigSmall BF16 ratio measured on Mistral 7B.*
## About BigSmall
BigSmall compresses at the joint entropy floor for neural network weights: it codes the sign and exponent bits jointly, and the mantissa conditioned on the exponent, achieving the information-theoretic minimum. The streaming loader decompresses one transformer layer at a time directly into VRAM.
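For intuition, here is a minimal sketch (not BigSmall's actual codec; the tensor and the field split are illustrative) that estimates the empirical entropy of the FP32 bit fields described above:

```python
import numpy as np

def entropy_bits(symbols):
    # Empirical Shannon entropy (bits per symbol) of a discrete stream
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Stand-in for a weight tensor; real LLM weights are far more structured
# (exponents cluster in a narrow band), which is what an entropy coder exploits
w = np.random.randn(1_000_000).astype(np.float32)
bits = w.view(np.uint32)

sign     = bits >> 31              # 1 sign bit
exponent = (bits >> 23) & 0xFF     # 8 exponent bits
mant_hi  = (bits >> 15) & 0xFF     # top 8 mantissa bits (keeps the estimate tractable)

h_exp      = entropy_bits(exponent)
h_sign_exp = entropy_bits((sign << 8) | exponent)
h_joint    = entropy_bits((exponent << 8) | mant_hi)

print(f"H(exponent)           = {h_exp:.2f} bits (of 8)")
print(f"H(sign, exponent)     = {h_sign_exp:.2f} bits (of 9)")
# Chain rule: H(mant_hi | exponent) = H(exponent, mant_hi) - H(exponent)
print(f"H(mant_hi | exponent) = {h_joint - h_exp:.2f} bits (of 8)")
```

The gap between the raw field widths and the measured entropies is the headroom a lossless coder can reclaim without changing a single bit of the weights.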
- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
- PyPI: `pip install bigsmall`