Mistral 7B: fraQtl KV Cache Optimized
KV cache optimized with fraQtl: 3.5x less KV cache memory during inference.
Note: The model file size is the same as the original (~14GB). The optimization modifies V projection weights so that at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.
| Metric | Value |
|---|---|
| Original | mistralai/Mistral-7B-v0.1 |
| File size | Same as original (~14GB) |
| KV cache memory | 3.5x less at runtime |
| Perplexity (PPL) before | 10.4690 |
| Perplexity (PPL) after | 10.6908 |
| PPL delta | +0.222 (weight-level) |
| Config | k=64, INT3 |
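To put the 3.5x figure in absolute terms, the following back-of-the-envelope sketch estimates the baseline KV cache size for Mistral-7B-v0.1 and applies the reduction factor stated in the table. The architecture constants (32 layers, 8 KV heads, head dimension 128) come from the base model's published configuration; the 2-byte cache entries and the helper names `kv_cache_bytes` and `gib` are assumptions made for illustration, and the 3.5x factor is taken at face value rather than derived.

```python
# Architecture values for mistralai/Mistral-7B-v0.1 (from its published config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # assumption: fp16/bf16 cache entries
REDUCTION = 3.5        # factor stated in this model card, not derived here


def kv_cache_bytes(seq_len: int, batch_size: int = 1) -> int:
    """Baseline KV cache size: keys and values for every layer and KV head."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len * batch_size


def gib(n_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


for seq_len in (4096, 32768):
    base = kv_cache_bytes(seq_len)
    print(f"{seq_len:>6} tokens: {gib(base):.2f} GiB baseline, "
          f"{gib(base / REDUCTION):.2f} GiB at {REDUCTION}x reduction")
```

Under these assumptions the baseline is 128 KiB per token, so a single 32,768-token sequence needs about 4 GiB of cache before any reduction.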
How It Works
The model weights are rotated into an eigenbasis that separates important V-cache directions from noise. At inference, the KV cache concentrates information in fewer dimensions, using 3.5x less memory.
Our runtime compression (the real product) achieves a perplexity delta of only +0.01 on the same model. Contact us for integration.
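The general idea in this section can be illustrated with a generic PCA-style sketch in NumPy. This is not fraQtl's actual algorithm, which is not described here. It only shows, on synthetic single-head data, that an orthogonal rotation can be folded into the V and output projections without changing the layer output, and that keeping the top k=64 rotated coordinates as signed 3-bit codes yields an approximation. All shapes, the synthetic spectrum, the stand-in attention weights, and the per-channel quantization scheme are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, head_dim, k = 512, 256, 128, 64

# Synthetic single-head weights; W_v is built so that value vectors have a
# rapidly decaying spectrum (an assumption made purely for illustration).
decay = np.exp(-np.arange(head_dim) / 16.0)
mix = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))[0]
W_v = (rng.standard_normal((d_model, head_dim)) * decay) @ mix.T / np.sqrt(d_model)
W_o = rng.standard_normal((head_dim, d_model)) / np.sqrt(head_dim)
X = rng.standard_normal((n_tokens, d_model))
A = rng.random((n_tokens, n_tokens))
A /= A.sum(axis=1, keepdims=True)           # stand-in attention weights

# Eigenbasis of the value covariance, strongest directions first.
V = X @ W_v
eigvals, eigvecs = np.linalg.eigh(V.T @ V / n_tokens)
basis = eigvecs[:, np.argsort(eigvals)[::-1]]

# Fold the rotation into the weights: the layer output is unchanged.
W_v_rot, W_o_rot = W_v @ basis, basis.T @ W_o
V_rot = X @ W_v_rot
exact = A @ V @ W_o
assert np.allclose(exact, A @ V_rot @ W_o_rot)

# Cache only the top-k rotated coordinates, quantized to signed 3-bit codes.
top = V_rot[:, :k]
scale = np.abs(top).max(axis=0) / 3.0       # per-channel scale (assumption)
codes = np.clip(np.round(top / scale), -4, 3).astype(np.int8)

# Dequantize and measure the error introduced in the layer output.
approx = A @ (codes * scale) @ W_o_rot[:k]
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative output error with k={k}, INT3: {rel_err:.3f}")
```

The rotation itself is lossless; any quality cost comes from discarding the trailing coordinates and from the coarse quantization of the ones that are kept.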
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")
# KV cache uses 3.5x less memory during inference.
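To go from loading the model to producing text, a minimal generation helper might look like the sketch below. The function name `generate`, the half-precision dtype, `device_map="auto"` (which requires the `accelerate` package), and greedy decoding are assumptions for illustration, not settings taken from this model card.

```python
def generate(prompt: str,
             model_id: str = "fraQtl/Mistral-7B-compressed",
             max_new_tokens: int = 64) -> str:
    """Load the checkpoint and return a greedy completion for `prompt`."""
    # Imported lazily so that defining the helper does not require the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,   # assumption: half precision fits the GPU
        device_map="auto",           # assumption: `accelerate` is installed
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Explain how photosynthesis works in simple terms:")` downloads the roughly 14GB checkpoint on first use. The helper reloads the model on every call, so for repeated prompts it is better to load the model and tokenizer once and reuse them.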
Generation Samples
Prompt: Explain how photosynthesis works in simple terms:
Output: Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...
Prompt: The three most important breakthroughs in physics during the 20th century were
Output: The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...
Runtime Compression (the full product)
| Method | PPL Delta | How |
|---|---|---|
| This download (weight-level) | +0.222 | Modified weights, download and use |
| Runtime cache compression | +0.01 | fraQtl applied during inference |
Runtime compression yields a far smaller perplexity increase than this download (+0.01 versus +0.222). Available for production deployment.
fraqtl.ai | contact@fraqtl.ai | Patent pending. Paper: arXiv:2604.11501