Qwen3.5-27B-RotorQuant -- RotorQuant KV Cache Compression

Qwen3.5-27B with RotorQuant KV cache compression applied. RotorQuant uses block-diagonal rotations derived from Clifford algebra to compress KV caches faster, and with far fewer rotation parameters, than dense-rotation methods such as TurboQuant. At 3-bit precision, it achieves approximately 10x KV cache compression while maintaining strong output quality.

The base model is Qwen/Qwen3.5-27B, a 27B-parameter hybrid transformer combining gated delta networks with sparse mixture-of-experts. It supports a native context of 262K tokens, extensible to 1M+ tokens, and operates in thinking mode by default.

What is RotorQuant?

RotorQuant is a KV cache compression framework that replaces the dense random rotation used in methods like TurboQuant with block-diagonal rotations grounded in Clifford algebra. This architectural choice yields major practical advantages:

28% faster decode and 5.3x faster prefill compared to TurboQuant
128x fewer parameters (128 vs 16,384) for the rotation matrices
O(d) complexity vs O(d log d) for the rotation step
Lower perplexity at 3-bit: 6.91 vs 7.07 for TurboQuant on standard benchmarks
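
To make the parameter and complexity comparison concrete, here is a minimal NumPy sketch of a block-diagonal rotation built from 2x2 Givens blocks, set against the equivalent dense matrix. It is an illustration only, not RotorQuant's implementation: the function names, the random angles, and the head dimension d = 128 are assumptions chosen for the example. Under this assumed parameterization (one cos/sin pair per block), the counts for d = 128 happen to match the figures quoted above, although the card does not say how its parameters are counted.

```python
import numpy as np

def apply_block_givens(x, angles):
    """Rotate consecutive coordinate pairs of x by the given angles in O(d)."""
    pairs = x.reshape(-1, 2)                 # (d/2, 2): one row per 2x2 block
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(pairs)
    out[:, 0] = c * pairs[:, 0] - s * pairs[:, 1]
    out[:, 1] = s * pairs[:, 0] + c * pairs[:, 1]
    return out.reshape(-1)

def dense_from_angles(angles):
    """Materialize the same rotation as a d x d matrix (for checking only)."""
    d = 2 * len(angles)
    R = np.zeros((d, d))
    for i, t in enumerate(angles):
        c, s = np.cos(t), np.sin(t)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R

d = 128  # assumed head dimension (illustrative)
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=d // 2)
x = rng.standard_normal(d)

y_fast = apply_block_givens(x, angles)       # touches each coordinate once
y_dense = dense_from_angles(angles) @ x      # full matrix-vector product

dense_entries = d * d                        # entries in a dense d x d rotation
block_values = 2 * (d // 2)                  # one (cos, sin) pair per block
print(dense_entries, block_values)
```

Both paths produce the same rotated vector and preserve its norm; the block-diagonal path simply never stores or multiplies the zero entries.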

RotorQuant ships three backend implementations, each offering a different speed/quality tradeoff:

Backend	Algebra	Best For
PlanarQuant	2D Givens rotations	Fastest inference -- production deployments
IsoQuant	4D quaternion rotations	Balanced speed and quality
RotorQuant	3D Clifford rotors	Research and maximum quality

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import IsoQuantCache

model = AutoModelForCausalLM.from_pretrained(
    "majentik/Qwen3.5-27B-RotorQuant",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("majentik/Qwen3.5-27B-RotorQuant")

# Apply chat template (Qwen3.5 supports thinking mode)
messages = [{"role": "user", "content": "Explain quantum computing"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# 3-bit IsoQuant cache -- recommended setting (~10x KV compression)
cache = IsoQuantCache(bits=3)
output = model.generate(**inputs, max_new_tokens=2048, past_key_values=cache, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Switching Backends

from turboquant import PlanarQuantCache, IsoQuantCache, RotorQuantCache

# Fastest -- 2D Givens rotations (production)
cache = PlanarQuantCache(bits=3)

# Balanced -- 4D quaternion rotations
cache = IsoQuantCache(bits=3)

# Research -- 3D Clifford rotors (highest quality)
cache = RotorQuantCache(bits=3)

Configuration

Bit Width	Quality	Compression	Recommended Use
4-bit	Near-lossless	~4x KV cache	Quality-sensitive applications
3-bit	Strong (ppl 6.91)	~10x KV cache	Recommended default -- best quality/compression tradeoff
2-bit	Moderate degradation	~16x KV cache	Extreme memory constraints

The 3-bit setting is recommended as the default. It provides approximately 10x KV cache compression with a perplexity of 6.91, which is lower (better) than TurboQuant's 7.07 at the same bit width.
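
For intuition about why quality falls as the bit width shrinks, the sketch below applies plain symmetric uniform quantization to random vectors at 4, 3, and 2 bits and reports the reconstruction error. This is a generic stand-in, not the quantizer used by RotorQuant or TurboQuant; the function name and the Gaussian test data are assumptions made for illustration.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Symmetric uniform quantization per row, followed by dequantization."""
    levels = 2 ** bits - 1                        # steps across the value range
    absmax = np.abs(x).max(axis=-1, keepdims=True)
    scale = 2.0 * absmax / levels                 # step size per row
    scale = np.where(scale == 0.0, 1.0, scale)    # guard against all-zero rows
    q = np.round((x + absmax) / scale)            # integer codes in [0, levels]
    return q * scale - absmax

rng = np.random.default_rng(0)
kv = rng.standard_normal((1024, 128))  # stand-in for cached key/value vectors

mse = {bits: float(np.mean((kv - quantize_dequantize(kv, bits)) ** 2))
       for bits in (4, 3, 2)}
for bits, err in mse.items():
    print(f"{bits}-bit: MSE = {err:.4f}")
```

Each bit removed roughly doubles the step size, so the error grows quickly; rotation-based methods aim to reduce that error at a given bit width.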

Memory Savings

KV cache size grows linearly with context length and batch size, so at long contexts the FP16 cache for Qwen3.5-27B reaches tens of gigabytes. RotorQuant's 3-bit mode provides approximately 10x compression, making long-context inference practical on fewer GPUs.

Context Length	FP16 KV Cache	4-bit RotorQuant	3-bit RotorQuant	2-bit RotorQuant
8K	~3.4 GB	~0.85 GB	~0.34 GB	~0.21 GB
32K	~13.5 GB	~3.4 GB	~1.35 GB	~0.84 GB
128K	~54 GB	~13.5 GB	~5.4 GB	~3.4 GB
262K (native)	~110 GB	~27.5 GB	~11 GB	~6.9 GB

Estimates based on Qwen3.5-27B KV cache dimensions. Actual savings depend on model configuration and batch size.
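
The compressed figures in the table follow from dividing the FP16 baseline by the compression ratio quoted for each bit width. A short sketch of that arithmetic, using only numbers stated in this card (the dictionary and function names are illustrative):

```python
# FP16 baseline sizes (GB) and compression ratios as quoted in this card.
FP16_KV_GB = {"8K": 3.4, "32K": 13.5, "128K": 54.0, "262K": 110.0}
COMPRESSION = {4: 4.0, 3: 10.0, 2: 16.0}

def compressed_kv_gb(context, bits):
    """Estimated KV cache size in GB for a context length and bit width."""
    return FP16_KV_GB[context] / COMPRESSION[bits]

table = {ctx: {bits: compressed_kv_gb(ctx, bits) for bits in COMPRESSION}
         for ctx in FP16_KV_GB}
for ctx, row in table.items():
    cells = "  ".join(f"{bits}-bit: {gb:.2f} GB" for bits, gb in row.items())
    print(f"{ctx:>5}  FP16: {FP16_KV_GB[ctx]:.1f} GB  {cells}")
```

To estimate other context lengths, scale the FP16 baseline linearly before dividing by the ratio.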

Performance vs TurboQuant

Metric	RotorQuant	TurboQuant
Decode speed	28% faster	Baseline
Prefill speed	5.3x faster	Baseline
Rotation parameters	128	16,384
Rotation complexity	O(d)	O(d log d)
Perplexity (3-bit)	6.91	7.07

Thinking Mode

Qwen3.5-27B generates extended chain-of-thought reasoning before producing its final response. These thinking tokens can consume substantial KV cache memory -- often thousands of tokens of internal reasoning before the first token of the final response is emitted. RotorQuant is especially valuable here because:

Thinking tokens are generated autoregressively and cached, so KV cache grows rapidly during the reasoning phase.
At 3-bit with ~10x compression, you can sustain much longer reasoning chains within the same VRAM budget.
The 5.3x faster prefill (relative to TurboQuant) directly accelerates initial prompt processing, which matters for long system prompts and multi-turn conversations.
The 28% faster decode (relative to TurboQuant) speeds up token-by-token generation during both the thinking and response phases.
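
As a rough illustration of the VRAM point above, the sketch below estimates how many cached tokens fit in a fixed KV cache budget at each setting. It derives a per-token cost from the 8K row of the Memory Savings table and assumes that "8K" means 8,192 tokens, a single sequence, and linear scaling with token count; the 10 GB budget is a hypothetical value, not a measurement.

```python
FP16_GB_AT_8K = 3.4          # FP16 KV cache at 8K context, from this card
TOKENS_AT_8K = 8 * 1024      # assumption: "8K" means 8,192 tokens
COMPRESSION = {"fp16": 1.0, "4-bit": 4.0, "3-bit": 10.0, "2-bit": 16.0}

def max_cached_tokens(budget_gb, mode):
    """Tokens (prompt + thinking + response) that fit in the KV cache budget."""
    gb_per_token = FP16_GB_AT_8K / TOKENS_AT_8K / COMPRESSION[mode]
    return int(budget_gb / gb_per_token)

budget_gb = 10.0  # hypothetical VRAM left for the KV cache after loading weights
for mode in COMPRESSION:
    print(f"{mode:>5}: ~{max_cached_tokens(budget_gb, mode):,} tokens")
```

Whatever the actual budget, the number of tokens that fit scales with the compression ratio, which is what leaves room for longer reasoning chains.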

Paper

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv:2504.19874, published Apr 28, 2025)