AI & ML interests

KV cache compression, inference optimization, model compression

Recent Activity

Zenalyze updated a model 1 day ago
fraQtl/TinyLlama-1.1B-optimized
Zenalyze updated a model 1 day ago
fraQtl/Qwen-2.5-3B-optimized
Zenalyze updated a model 1 day ago
fraQtl/Llama-3.2-3B-optimized

Organization Card

🧠 fraQtl

KV Cache Compression (5×, near-zero loss)


5× smaller KV cache. Same quality. Sometimes better.


⚡ What this is

fraQtl compresses the KV cache of transformers without breaking attention.

  • 5× memory reduction
  • +0.002 PPL (near-zero loss)
  • 100% needle recall (FP16 baseline: 98.7%)
  • 25-second setup
  • zero runtime overhead

🧠 Why it works

We don't compress blindly.

We:

  • detect attention-critical directions
  • preserve them in high precision
  • compress the rest

Compression preserves decisions, not just values.
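The three steps above can be sketched in plain NumPy. This is an illustrative reconstruction under assumed details (SVD-derived directions, an int8 residual), not fraQtl's actual algorithm:

```python
import numpy as np

def compress_kv(kv, rank=8):
    """Split a KV matrix into a high-precision low-rank part
    (the assumed 'attention-critical directions') and a coarse residual."""
    # 1. Detect dominant directions via SVD of the cached keys/values.
    U, S, Vt = np.linalg.svd(kv, full_matrices=False)
    # 2. Preserve the top-`rank` directions in full precision.
    critical = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # 3. Compress the rest with int8 quantization.
    residual = kv - critical
    scale = np.abs(residual).max() / 127 + 1e-12
    q = np.round(residual / scale).astype(np.int8)
    return critical, q, scale

def decompress_kv(critical, q, scale):
    # Reconstruction error is bounded by half a quantization step.
    return critical + q.astype(np.float32) * scale

kv = np.random.randn(128, 64).astype(np.float32)
critical, q, scale = compress_kv(kv)
approx = decompress_kv(critical, q, scale)
err = np.abs(approx - kv).max()
```

The hybrid split is the point: quantization noise lands only on the residual, so the dominant directions that decide where attention goes are untouched.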


🧪 Key result

Compression can improve models.

  • –0.1 PPL improvement
  • no training
  • no fine-tuning

📊 Results

| Model        | Params | Architecture | PPL Delta (k=32) | Compression |
|--------------|--------|--------------|------------------|-------------|
| Mistral 7B   | 7B     | GQA-8        | +0.007           | 5×          |
| Llama 3.2 3B | 3B     | GQA-3        | +0.011           | 5×          |
| Llama-2 7B   | 7B     | MHA-32       | +0.007           | 5×          |
| Qwen 2.5 3B  | 3B     | GQA-2        | +0.010           | 5×          |
| Llama 3.1 8B | 8B     | GQA-8        | +0.025           | 5×          |
| Llama-2 13B  | 13B    | MHA-40       | +0.005           | 5×          |
| Llama 3.1 70B| 70B    | GQA-8        | +0.019           | 5×          |

🧪 Long Context (Needle-in-a-Haystack)

fraQtl achieves:

  • 100% recall (1K → 16K context lengths)
  • beats the FP16 baseline (98.7%)

โš”๏ธ Comparison

Method Behavior
Quantization adds noise
Low-rank removes information
fraQtl preserves signal
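A toy NumPy demonstration of the contrast in the table (illustrative only, not fraQtl's code): uniform quantization perturbs every direction a little, while low-rank truncation deletes the discarded directions entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64)).astype(np.float32)

# Quantization: a small, bounded error spread across all directions.
scale = np.abs(X).max() / 127
Xq = np.round(X / scale) * scale
quant_err = np.abs(X - Xq).max()            # bounded by scale / 2

# Low-rank truncation: everything outside the top-r directions is gone.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
r = 8
Xlr = (U[:, :r] * S[:r]) @ Vt[:r]
lost_energy = (S[r:] ** 2).sum() / (S ** 2).sum()  # fraction of signal removed
```

Neither failure mode is harmless on its own, which is why a direction-aware split of "what to keep exactly" vs. "what to quantize" is attractive.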

🚀 How to Use

```python
from transformers import AutoModelForCausalLM

# Load any fraQtl-optimized checkpoint, e.g. one of the models listed above.
model = AutoModelForCausalLM.from_pretrained("fraQtl/MODEL")
```

⚡ Setup

  • Calibration: ~25 seconds
  • Runtime overhead: ~0%
  • Works across multiple models
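A hedged sketch of what a ~25-second calibration step could look like: collect KV samples from a few prompts, then fit shared compression directions once. The function and shapes are illustrative assumptions, not fraQtl's published procedure.

```python
import time
import numpy as np

def calibrate(kv_samples, rank=8):
    """Fit shared high-precision directions from calibration KV samples."""
    stacked = np.concatenate(kv_samples, axis=0)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:rank]  # directions kept in full precision at inference time

# Stand-in for KV tensors gathered while running calibration prompts.
samples = [np.random.randn(64, 32).astype(np.float32) for _ in range(4)]

t0 = time.perf_counter()
directions = calibrate(samples)
elapsed = time.perf_counter() - t0
```

Because the directions are frozen after calibration, applying them at inference is a fixed projection, which is consistent with the ~0% runtime overhead claim.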

📚 Resources


๐Ÿ” Status

  • Patent filed (April 6, 2026)
  • Multi-model validated
  • Production-ready

🧠 One line

Compression that understands attention.
