Update model card with streaming loader and benchmark info

README.md CHANGED

@@ -1,50 +1,66 @@
 ---
-license:
+license: apache-2.0
 tags:
 - bigsmall
 - compression
 - lossless
 - gpt2
 ---

 # GPT-2 (BigSmall compressed)

-
+**548MB → 414MB (75.5%). Bit-identical. Under 500MB peak RAM with streaming.**

-
+This is GPT-2 117M compressed with [BigSmall](https://github.com/wpferrell/Bigsmall) — lossless neural network weight compression. Not quantization. Every weight is bit-identical to the original.

-##
+## Install

-```
+```bash
 pip install bigsmall
 ```

+## Load and run inference
+
 ```python
-
+from bigsmall import StreamingLoader
 from transformers import GPT2LMHeadModel, GPT2Tokenizer

-#
-
-
-
-model = GPT2LMHeadModel.from_pretrained("gpt2", state_dict=state_dict)
-tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+# Decompress one layer at a time — under 500MB peak RAM
+loader = StreamingLoader("wpferrell/gpt2-bigsmall")
+model = loader.load_model(GPT2LMHeadModel)
+tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")

-# Run inference - identical to original GPT-2
 inputs = tokenizer("Hello, I'm a language model", return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=
+outputs = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(outputs[0]))
 ```

-##
-
-
-
-
-
+## Or decompress to disk first
+
+```python
+from bigsmall import from_pretrained
+model = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)
+```
+
+## What's inside
+
+| File | Original | Compressed | Ratio |
+|------|----------|------------|-------|
+| model.safetensors (FP32) | 548 MB | 414 MB | 75.5% |
+
+Verified lossless: md5 of every weight tensor matches original after decompression.
+
+## vs other compression tools
+
+| Tool | BF16 Ratio | Inference Overhead | Hardware |
+|------|------------|-------------------|----------|
+| ZipNN | ~83% | None | CPU |
+| DFloat11 | ~70% | ~2x at batch=1 | CUDA only |
+| **BigSmall** | **59.8%** | **None** | **CPU + GPU** |

 ## About BigSmall
-BigSmall compresses model weights losslessly. Unlike quantization, the decompressed weights are bit-for-bit identical to the originals. Supports BF16, FP16, FP32, FP64, INT8 formats across LLMs and diffusion models.

+BigSmall compresses at the Shannon entropy floor for neural network weights. It detects the float format automatically (FP32, BF16, FP16, FP8, FP4) and applies the optimal lossless codec for each tensor. Streaming loader decompresses one layer at a time directly into VRAM — peak RAM stays under 2GB even for 7B models.
+
+- GitHub: [wpferrell/Bigsmall](https://github.com/wpferrell/Bigsmall)
 - PyPI: `pip install bigsmall`
-- GitHub: https://github.com/wpferrell/Bigsmall
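
The card's "Verified lossless" line can be spot-checked directly. The sketch below loads the original checkpoint with `transformers` and the compressed one with the `from_pretrained` helper shown in the card, then compares every tensor exactly. The repo IDs and the helper come from the card above; the comparison itself is my own minimal check, not BigSmall's verification script.

```python
import torch
from transformers import GPT2LMHeadModel
from bigsmall import from_pretrained  # helper shown in the card above

# Reference weights straight from the original repo.
reference = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")

# The same model reconstructed from the BigSmall-compressed repo.
restored = from_pretrained("wpferrell/gpt2-bigsmall", model_class=GPT2LMHeadModel)

ref_state = reference.state_dict()
new_state = restored.state_dict()

assert ref_state.keys() == new_state.keys(), "tensor names differ"
for name, ref_tensor in ref_state.items():
    # torch.equal checks shape and exact element equality, with no tolerance.
    assert torch.equal(ref_tensor, new_state[name]), f"mismatch in {name}"

print(f"all {len(ref_state)} tensors match exactly")
```

A stricter variant would compare the raw bytes of each tensor (for example via `Tensor.view(torch.uint8)`), but exact value equality already rules out any rounding that a lossy codec would introduce.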
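
For the streaming loader's memory claim (under 500MB peak RAM for this model), a rough measurement with the standard-library `resource` module is enough to sanity-check it on Linux or macOS. `StreamingLoader` and `load_model` are used exactly as in the card's example; the measurement wrapper is plain plumbing, not part of BigSmall.

```python
import resource
import sys

from bigsmall import StreamingLoader  # loader API as shown in the card
from transformers import GPT2LMHeadModel

def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return rss / 1024 if sys.platform.startswith("linux") else rss / (1024 ** 2)

print(f"peak RSS before load: {peak_rss_mb():.0f} MB")

loader = StreamingLoader("wpferrell/gpt2-bigsmall")
model = loader.load_model(GPT2LMHeadModel)

print(f"peak RSS after load:  {peak_rss_mb():.0f} MB")
```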
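
The "Shannon entropy floor" wording in the About section has a concrete reading: the bytes of trained FP32 weights are far from uniformly distributed (the sign-and-exponent byte in particular is highly skewed), so an order-0 entropy estimate over per-position byte streams already bounds how much a lossless codec can save. The sketch below runs that estimate on one GPT-2 tensor; it is a back-of-the-envelope illustration of the idea, not BigSmall's actual codec, and the tensor chosen is arbitrary.

```python
import math
from collections import Counter

import numpy as np
from transformers import GPT2LMHeadModel

def byte_stream_entropy(stream: np.ndarray) -> float:
    """Order-0 Shannon entropy of a byte stream, in bits per byte."""
    counts = Counter(stream.tolist())
    total = len(stream)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

model = GPT2LMHeadModel.from_pretrained("openai-community/gpt2")
# One arbitrary FP32 weight tensor from the original checkpoint.
weights = model.transformer.h[0].attn.c_attn.weight.detach().numpy()

# Little-endian FP32: byte 0 holds the lowest mantissa bits, byte 3 the sign
# plus most of the exponent. Split into four per-position byte streams.
raw = np.frombuffer(weights.astype(np.float32).tobytes(), dtype=np.uint8)
streams = [raw[i::4] for i in range(4)]

bits_per_float = sum(byte_stream_entropy(s) for s in streams)
print(f"order-0 entropy floor ~ {bits_per_float:.1f} bits per float "
      f"({bits_per_float / 32:.1%} of 32-bit storage)")
```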