fraQtl/TinyLlama-1.1B-optimized
KV cache compression, inference optimization, model compression
fraQtl compresses the KV cache of transformers without breaking attention.
We don't compress blindly. Two principles guide fraQtl (see the sketch after this list):

- Compression preserves decisions, not just values.
- Compression can improve models.
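This card doesn't publish the algorithm, so the following is a minimal illustrative sketch, not fraQtl's implementation: it quantizes cached keys to k = 32 levels (the k = 32 from the results table below) and measures whether each query's strongest attention position survives compression. The quantizer, shapes, and k are all assumptions for illustration.

```python
import torch

def quantize(x: torch.Tensor, k: int = 32) -> torch.Tensor:
    # Uniform per-channel quantization to k levels -- a hypothetical stand-in
    # for whatever codebook fraQtl actually uses.
    lo = x.amin(dim=-2, keepdim=True)
    hi = x.amax(dim=-2, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8) / (k - 1)
    return ((x - lo) / scale).round() * scale + lo

def decisions_preserved(q: torch.Tensor, keys: torch.Tensor, k: int = 32) -> float:
    # "Preserve decisions, not just values": check that each query's argmax
    # attention position is unchanged after the cache is compressed.
    before = (q @ keys.transpose(-1, -2)).argmax(dim=-1)
    after = (q @ quantize(keys, k).transpose(-1, -2)).argmax(dim=-1)
    return (before == after).float().mean().item()

q = torch.randn(8, 16, 64)      # (heads, queries, head_dim)
keys = torch.randn(8, 128, 64)  # (heads, cached positions, head_dim)
print(f"attention argmax preserved at k=32: {decisions_preserved(q, keys):.1%}")
```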
| Model | Params | Architecture | PPL Delta (k=32) | Compression ratio |
|---|---|---|---|---|
| Mistral 7B | 7B | GQA-8 | +0.007 | 5× |
| Llama 3.2 3B | 3B | GQA-3 | +0.011 | 5× |
| Llama-2 7B | 7B | MHA-32 | +0.007 | 5× |
| Qwen 2.5 3B | 3B | GQA-2 | +0.010 | 5× |
| Llama 3.1 8B | 8B | GQA-8 | +0.025 | 5× |
| Llama-2 13B | 13B | MHA-40 | +0.005 | 5× |
| Llama 3.1 70B | 70B | GQA-8 | +0.019 | 5× |
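One plausible reading of the PPL Delta column is compressed-model perplexity minus baseline perplexity on the same held-out text. A minimal sketch of that measurement follows; the model ids and `heldout.txt` dataset are placeholders, not necessarily what fraQtl used.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    # Mean cross-entropy per token, exponentiated -- standard LM perplexity.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

text = open("heldout.txt").read()                      # placeholder corpus
base = perplexity("meta-llama/Llama-2-7b-hf", text)    # baseline
comp = perplexity("fraQtl/MODEL", text)                # compressed checkpoint
print(f"PPL delta: {comp - base:+.3f}")
```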
How fraQtl differs from other compression approaches:
| Method | Behavior |
|---|---|
| Quantization | adds noise |
| Low-rank | removes information |
| fraQtl | preserves signal |
Load it like any Hugging Face causal LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fraQtl/TinyLlama-1.1B-optimized")
model = AutoModelForCausalLM.from_pretrained("fraQtl/TinyLlama-1.1B-optimized")
```
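A quick generation check, assuming the checkpoint behaves like a standard causal LM; the prompt and settings are illustrative only:

```python
from transformers import pipeline

# Illustrative smoke test, not a tuned configuration.
pipe = pipeline("text-generation", model="fraQtl/TinyLlama-1.1B-optimized")
print(pipe("KV cache compression works by", max_new_tokens=40)[0]["generated_text"])
```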
Compression that understands attention.