Mistral 7B – fraQtl KV Cache Optimized

KV cache optimized with fraQtl: 3.5x less KV cache memory during inference.

Note: The model file size is the same as the original (~14GB). The optimization modifies V projection weights so that at inference time, the KV cache uses 3.5x less GPU memory. The savings happen at runtime, not at download.

| Metric | Value |
|---|---|
| Original | mistralai/Mistral-7B-v0.1 |
| File size | Same as original (~14GB) |
| KV cache memory | 3.5x less at runtime |
| PPL before | 10.4690 |
| PPL after | 10.6908 |
| PPL delta | +0.222 (weight-level) |
| Config | k=64, INT3 |
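As a rough illustration of what the runtime saving means in practice, here is a back-of-envelope sizing sketch using Mistral-7B-v0.1's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128, 32k context). The assumption that the 3.5x factor applies to the combined K+V cache is ours, not stated in this card:

```python
# Back-of-envelope KV cache sizing for Mistral-7B-v0.1 (GQA: 8 KV heads).
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_fp16 = 2

# Per token: one K and one V tensor per layer, each (n_kv_heads * head_dim).
bytes_per_token = n_layers * 2 * n_kv_heads * head_dim * bytes_fp16  # 131072 = 128 KiB

seq_len = 32768  # full context window
full_gib = bytes_per_token * seq_len / 2**30          # 4.0 GiB at fp16
compressed_gib = full_gib / 3.5                        # with the claimed 3.5x saving
print(f"{full_gib:.2f} GiB -> {compressed_gib:.2f} GiB")
```

At a full 32k context this works out to roughly 4 GiB of KV cache at fp16, or about 1.14 GiB with the claimed 3.5x reduction.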

How It Works

The model weights are rotated into an eigenbasis that separates important V-cache directions from noise. At inference, the KV cache concentrates information in fewer dimensions, using 3.5x less memory.
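The eigenbasis idea can be illustrated with a small NumPy sketch. This is not fraQtl's actual algorithm; the synthetic data and variable names are ours. It shows that when the vectors have low-rank structure, rotating into the covariance eigenbasis and keeping only the top k=64 of 128 coordinates loses very little:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 64  # head dim and kept components (mirrors the k=64 config)

# Synthetic "value" vectors: rank-k signal plus small isotropic noise.
basis = rng.normal(size=(d, k))
V = rng.normal(size=(10000, k)) @ basis.T + 0.01 * rng.normal(size=(10000, d))

# Eigenbasis of the covariance, columns sorted by descending eigenvalue.
cov = V.T @ V / len(V)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
R = eigvecs[:, np.argsort(eigvals)[::-1]]

# Rotate, keep only the top-k coordinates, rotate back.
V_rot = V @ R
V_approx = V_rot[:, :k] @ R[:, :k].T

rel_err = np.linalg.norm(V - V_approx) / np.linalg.norm(V)
print(rel_err)  # small: most energy lives in the top-k directions
```

Storing only the k rotated coordinates (plus quantizing them, e.g. to INT3) is what shrinks the cache; the rotation itself is folded into the V projection weights ahead of time, which is why the checkpoint size is unchanged.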

Our runtime compression (the full product) achieves a +0.01 PPL delta on the same model. Contact us for integration.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
tokenizer = AutoTokenizer.from_pretrained("fraQtl/Mistral-7B-compressed")
# KV cache uses 3.5x less memory during inference.

Generation Samples

Prompt: Explain how photosynthesis works in simple terms:

Output: Photosynthesis is the process by which plants use energy from sunlight to make their own food. Plants need carbon dioxide, water, and light to make their own food...

Prompt: The three most important breakthroughs in physics during the 20th century were

Output: The three most important breakthroughs in physics during the 20th century were the theory of relativity, quantum mechanics, and string theory...

Runtime Compression (the full product)

| Method | PPL delta | How |
|---|---|---|
| This download (weight-level) | +0.222 | Modified weights; download and use |
| Runtime cache compression | +0.01 | fraQtl applied during inference |

Runtime compression's PPL delta is over 20x smaller (+0.01 vs +0.222). Available for production deployment.


fraqtl.ai | contact@fraqtl.ai | Patent pending. Paper: arXiv:2604.11501
