Qwen3.5-27B-RotorQuant -- RotorQuant KV Cache Compression

Qwen3.5-27B with RotorQuant KV cache compression applied. RotorQuant compresses KV caches using block-diagonal rotations derived from Clifford algebra, and reports 28% faster decode and 5.3x faster prefill than TurboQuant. At 3-bit precision, it achieves approximately 10x KV cache compression while maintaining strong output quality (perplexity 6.91 vs 7.07 for TurboQuant).

The base model is Qwen/Qwen3.5-27B, a 27B-parameter hybrid transformer combining gated delta networks with sparse mixture-of-experts. It supports a native context of 262K tokens, extensible to 1M+ tokens, and operates in thinking mode by default.

What is RotorQuant?

RotorQuant is a KV cache compression framework that replaces the dense random rotation used in methods like TurboQuant with block-diagonal rotations grounded in Clifford algebra. This architectural choice yields major practical advantages:

  • 28% faster decode and 5.3x faster prefill compared to TurboQuant
  • 44x fewer parameters (128 vs 16,384) for the rotation matrices
  • O(d) complexity vs O(d log d) for the rotation step
  • Lower perplexity: 6.91 vs 7.07 (TurboQuant) on standard benchmarks
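
To make the block-diagonal idea concrete, here is a minimal pure-Python sketch of the simplest case: an independent 2D (Givens) rotation applied to each pair of coordinates. It is an illustration only and not the library's implementation; the function names, the head dimension d = 128, and the random angles are all assumptions. The loop touches each coordinate once, which is where the O(d) cost comes from. If one stores a cosine and a sine per pair, d = 128 gives 128 stored values against 16,384 entries for a dense 128x128 matrix, though the card does not say how its parameter counts are defined.

```python
import math
import random


def make_angles(d, seed=0):
    """One rotation angle per coordinate pair (hypothetical helper; d must be even)."""
    if d % 2:
        raise ValueError("d must be even")
    rng = random.Random(seed)
    return [rng.uniform(0.0, 2.0 * math.pi) for _ in range(d // 2)]


def rotate_pairs(x, angles, inverse=False):
    """Apply an independent 2D rotation to each pair (x[2i], x[2i+1])."""
    out = [0.0] * len(x)
    for i, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        if inverse:
            s = -s  # the inverse of a rotation by theta is a rotation by -theta
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * a - s * b
        out[2 * i + 1] = s * a + c * b
    return out


def norm(v):
    return math.sqrt(sum(c * c for c in v))


d = 128  # illustrative head dimension (assumption)
angles = make_angles(d)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

y = rotate_pairs(x, angles)                     # forward rotation, O(d) work
x_back = rotate_pairs(y, angles, inverse=True)  # exact inverse, also O(d)

block_params = 2 * len(angles)  # one (cos, sin) pair per 2D block
dense_params = d * d            # entries in a dense d x d rotation matrix
print(block_params, dense_params, norm(x), norm(y))
```

Because each 2x2 block is orthogonal, the rotation preserves vector norms and is undone by rotating each pair through the negative angle, so no matrix inverse is ever stored.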

RotorQuant ships three backend implementations, each offering a different speed/quality tradeoff:

| Backend | Algebra | Best For |
|---|---|---|
| PlanarQuant | 2D Givens rotations | Fastest inference -- production deployments |
| IsoQuant | 4D quaternion rotations | Balanced speed and quality |
| RotorQuant | 3D Clifford rotors | Research and maximum quality |
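
As a rough illustration of what a 4D quaternion rotation can look like, the sketch below splits a vector into 4-element blocks and left-multiplies each block by a unit quaternion. Left multiplication by a unit quaternion is a norm-preserving map on 4D vectors, and multiplying by the conjugate undoes it. This is one possible construction under stated assumptions, not a description of how the IsoQuant backend actually builds its rotations; all names here are hypothetical.

```python
import math
import random


def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def quat_conj(q):
    """Conjugate; for a unit quaternion this is also its inverse."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def random_unit_quat(rng):
    """Draw a quaternion with Gaussian components and normalize it."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)


def rotate_blocks(x, quats, inverse=False):
    """Left-multiply each 4-element block of x by its unit quaternion."""
    out = []
    for i, q in enumerate(quats):
        block = tuple(x[4 * i:4 * i + 4])
        r = quat_conj(q) if inverse else q
        out.extend(quat_mul(r, block))
    return out


def norm(v):
    return math.sqrt(sum(c * c for c in v))


d = 128  # illustrative head dimension (assumption), must be a multiple of 4
rng = random.Random(0)
quats = [random_unit_quat(rng) for _ in range(d // 4)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

y = rotate_blocks(x, quats)
x_back = rotate_blocks(y, quats, inverse=True)
print(len(quats), norm(x), norm(y))
```

Each block costs a fixed number of multiplications, so the whole transform is still linear in d, with more mixing per block than the 2D case.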

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-27B-RotorQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("majentik/Qwen3.5-27B-RotorQuant")

# Apply chat template (Qwen3.5 supports thinking mode)
messages = [{"role": "user", "content": "Explain quantum computing"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 3-bit IsoQuant cache -- recommended setting (~10x KV compression)
cache = IsoQuantCache(bits=3)
output = model.generate(**inputs, max_new_tokens=2048, past_key_values=cache, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Switching Backends

from turboquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Fastest -- 2D Givens rotations (production)
cache = PlanarQuantCache(bits=3)

# Balanced -- 4D quaternion rotations
cache = IsoQuantCache(bits=3)

# Research -- 3D Clifford rotors (highest quality)
cache = RotorQuantCache(bits=3)

Configuration

| Bit Width | Quality | Compression | Recommended Use |
|---|---|---|---|
| 4-bit | Near-lossless | ~4x KV cache | Quality-sensitive applications |
| 3-bit | Strong (ppl 6.91) | ~10x KV cache | Recommended default -- best quality/compression tradeoff |
| 2-bit | Moderate degradation | ~16x KV cache | Extreme memory constraints |

The 3-bit setting is recommended as the default. It provides approximately 10x KV cache compression with a perplexity of 6.91, which is lower (better) than TurboQuant's 7.07 at the same bit width.
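
To give a feel for why quality drops as the bit width shrinks, the sketch below round-trips a random vector through a plain uniform scalar quantizer at 4, 3, and 2 bits and measures the reconstruction error. This is a generic textbook quantizer used purely for illustration; it is not RotorQuant's quantizer, it omits the rotation step, and the helper names are hypothetical.

```python
import random


def quantize(x, bits):
    """Map each value to one of 2**bits evenly spaced levels over [min, max]."""
    lo, hi = min(x), max(x)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in x]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate values from integer codes."""
    return [lo + c * scale for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(1024)]

errors = {}
for bits in (4, 3, 2):
    codes, lo, scale = quantize(x, bits)
    errors[bits] = mse(x, dequantize(codes, lo, scale))
    print(f"{bits}-bit: max code {max(codes)}, mse {errors[bits]:.4f}")
```

Each bit removed halves the number of available levels, so the step between levels roughly doubles and the error grows accordingly. Rotation-based schemes aim to reshape the data so that a coarse grid like this loses less information.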

Memory Savings

KV cache memory grows linearly with context length and batch size, so at Qwen3.5-27B's long context lengths the cache can rival or exceed the model weights in size. RotorQuant's 3-bit mode provides approximately 10x compression, making long-context inference practical on fewer GPUs.

| Context Length | FP16 KV Cache | 4-bit RotorQuant | 3-bit RotorQuant | 2-bit RotorQuant |
|---|---|---|---|---|
| 8K | ~3.4 GB | ~0.85 GB | ~0.34 GB | ~0.21 GB |
| 32K | ~13.5 GB | ~3.4 GB | ~1.35 GB | ~0.84 GB |
| 128K | ~54 GB | ~13.5 GB | ~5.4 GB | ~3.4 GB |
| 262K (native) | ~110 GB | ~27.5 GB | ~11 GB | ~6.9 GB |

Estimates based on Qwen3.5-27B KV cache dimensions. Actual savings depend on model configuration and batch size.
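
The table entries can be reproduced by dividing each FP16 figure by the compression ratio quoted for that bit width. The sketch below does that, and also includes the standard formula for an uncompressed KV cache for readers who want to plug in their own configuration. The function names are hypothetical, and the formula's arguments are placeholders: this card does not list the model's layer count, KV head count, or head dimension, and for a hybrid model only the layers that keep a growing KV cache should be counted.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Uncompressed cache size: one key and one value vector per layer, head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Compression ratios as quoted in this card's tables (approximate).
COMPRESSION = {4: 4.0, 3: 10.0, 2: 16.0}


def compressed_gb(fp16_gb, bits):
    """Estimate the compressed cache size from the FP16 baseline."""
    return fp16_gb / COMPRESSION[bits]


# FP16 baselines taken from the table above.
for context, fp16_gb in [("8K", 3.4), ("32K", 13.5), ("128K", 54.0), ("262K", 110.0)]:
    row = ", ".join(
        f"{bits}-bit ~{compressed_gb(fp16_gb, bits):.2f} GB" for bits in (4, 3, 2)
    )
    print(f"{context}: FP16 ~{fp16_gb} GB -> {row}")
```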

Performance vs TurboQuant

| Metric | RotorQuant | TurboQuant |
|---|---|---|
| Decode speed | 28% faster | Baseline |
| Prefill speed | 5.3x faster | Baseline |
| Rotation parameters | 128 | 16,384 |
| Rotation complexity | O(d) | O(d log d) |
| Perplexity (3-bit) | 6.91 | 7.07 |

Thinking Mode

Qwen3.5-27B generates extended chain-of-thought reasoning before producing its final response. These thinking tokens can consume substantial KV cache memory -- often thousands of tokens of internal reasoning before a single output token is emitted. RotorQuant is especially valuable here because:

  • Thinking tokens are generated autoregressively and cached, so KV cache grows rapidly during the reasoning phase.
  • At 3-bit with ~10x compression, you can sustain much longer reasoning chains within the same VRAM budget.
  • The 5.3x faster prefill directly accelerates the initial prompt processing, which matters for long system prompts and multi-turn conversations.
  • The 28% faster decode speeds up the token-by-token generation during both thinking and response phases.
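
To put the VRAM argument in numbers, the sketch below estimates how many cached tokens (prompt plus thinking plus response) fit in a fixed memory budget at each bit width. It derives a per-token FP16 cost from the 8K row of the memory table, treating 8K as 8192 tokens, which is an assumption, and reuses the card's approximate compression ratios. The helper name and the 24 GB budget are hypothetical, and the result ignores model weights, activations, and batch size.

```python
# Per-token FP16 cache cost derived from the "8K -> ~3.4 GB" table row.
# Treating 8K as 8192 tokens is an assumption.
FP16_GB_PER_TOKEN = 3.4 / 8192

# Approximate ratios quoted in this card; 16 stands for the uncompressed FP16 cache.
COMPRESSION = {16: 1.0, 4: 4.0, 3: 10.0, 2: 16.0}


def tokens_that_fit(budget_gb, bits):
    """Rough count of cached tokens that fit in budget_gb of memory."""
    per_token_gb = FP16_GB_PER_TOKEN / COMPRESSION[bits]
    return int(budget_gb / per_token_gb)


budget_gb = 24.0  # hypothetical memory left over for the KV cache
for bits in (16, 4, 3, 2):
    label = "FP16" if bits == 16 else f"{bits}-bit"
    print(f"{label}: about {tokens_that_fit(budget_gb, bits):,} tokens")
```

Under these assumptions the token budget scales directly with the compression ratio, so a reasoning chain that would exhaust the cache in FP16 has roughly ten times the headroom at 3-bit.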
