# Qwen3.5-27B-TurboQuant-MLX-4bit
MLX 4-bit weight-quantized variant of Qwen/Qwen3.5-27B with TurboQuant KV cache compression for efficient inference on Apple Silicon.
## Overview
This model combines two complementary compression techniques:
- MLX 4-bit weight quantization (affine, group size 64) — reduces model size from ~54GB to ~15GB
- TurboQuant KV cache compression — compresses key-value caches during inference via online vector quantization, enabling longer contexts with less VRAM
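To make the weight-quantization side concrete, here is a minimal NumPy sketch of affine 4-bit quantization with a group size of 64. This is an illustration of the scheme, not MLX's actual kernel: each group of 64 weights gets its own scale and offset, and values are rounded to one of 16 levels.

```python
import numpy as np

def quantize_affine_4bit(w, group_size=64):
    """Affine 4-bit quantization, one (scale, offset) pair per group of 64 weights."""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0          # 4 bits -> 16 levels (0..15)
    q = np.round((w - w_min) / scale).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    """Reconstruct approximate weights from 4-bit codes plus per-group metadata."""
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale, offset = quantize_affine_4bit(w)
w_hat = dequantize(q, scale, offset).reshape(-1)
print(np.abs(w - w_hat).max())  # rounding error is bounded by scale/2 per group
```

Storing 4-bit codes plus per-group FP16 scales and offsets is what brings the ~54 GB of FP16 weights down to roughly 15 GB.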
## Quickstart
```python
from mlx_lm import load, generate
from turboquant import TurboQuantCache  # KV cache compression; exact wiring depends on the turboquant package

model, tokenizer = load("majentik/Qwen3.5-27B-TurboQuant-MLX-4bit")

# Standard generation
prompt = "Explain the theory of relativity"
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048)
print(response)
```
## Specifications
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Parameters | 27B |
| Weight Quantization | MLX 4-bit affine (group size 64) |
| KV Cache Method | TurboQuant (4-bit online vector quantization) |
| Model Size | ~15 GB |
| Context Length | 262K (native), 1M+ (extended) |
| Platform | Apple Silicon (M1/M2/M3/M4/M5) |
## What is TurboQuant?
TurboQuant (arXiv: 2504.19874) applies online vector quantization to key-value caches during inference, compressing them to 4-bit precision with near-lossless quality. This reduces the memory footprint of the KV cache by approximately 4x compared to FP16, allowing significantly longer context windows to fit in memory without retraining or fine-tuning.
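The core idea can be illustrated with a simplified stand-in: quantize each key/value vector to 4-bit precision the moment it enters the cache, keeping only a per-vector scale in full precision. This sketch uses plain absmax scalar quantization rather than TurboQuant's actual online vector quantization, so treat it as a memory-footprint illustration, not the paper's algorithm.

```python
import numpy as np

def compress_kv_step(vec, bits=4):
    """Quantize one incoming key/value vector online (simplified absmax scheme,
    a scalar stand-in for TurboQuant's online vector quantization)."""
    levels = 2 ** (bits - 1) - 1            # symmetric range: -7..7 for 4 bits
    scale = np.abs(vec).max() / levels
    q = np.clip(np.round(vec / scale), -levels, levels).astype(np.int8)
    return q, scale

# Simulate a decode loop: the cache grows by one vector per generated token
rng = np.random.default_rng(0)
cache_q, cache_scales = [], []
for _ in range(8):
    k = rng.standard_normal(128).astype(np.float32)  # one head's key vector
    q, s = compress_kv_step(k)
    cache_q.append(q)
    cache_scales.append(s)

# 4-bit codes need 0.5 bytes per element vs 2 bytes for FP16 -> ~4x smaller
# (int8 is used here for clarity; real implementations pack two codes per byte)
```

Because compression happens online, token by token, no retraining or calibration pass over the model is required.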
## Thinking Mode
Qwen3.5-27B generates extended reasoning before its responses by default. The combination of weight quantization and KV cache compression is especially valuable here, since thinking tokens can add thousands of entries to the KV cache before the final answer is produced, and both techniques reduce the memory those entries consume.
## Memory Estimate
| Configuration | Model Weights | KV Cache (128K ctx) | Total |
|---|---|---|---|
| FP16 (baseline) | ~54 GB | ~13 GB | ~67 GB |
| MLX 4-bit + TurboQuant | ~15 GB | ~3.3 GB | ~18.3 GB |
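The table's figures follow from simple arithmetic. The sketch below reproduces them; note that the layer count, KV-head count, and head dimension are illustrative assumptions chosen to match the table, not Qwen3.5-27B's published configuration.

```python
# Weights: 27B parameters at 2 bytes (FP16) vs 0.5 bytes (4-bit, before
# per-group scale/offset overhead, which brings ~13.5 GB up to ~15 GB)
params = 27e9
fp16_weights_gb = params * 2 / 1e9    # 54.0
q4_weights_gb = params * 0.5 / 1e9    # 13.5

# KV cache: 2 tensors (K and V) per layer, per token, at FP16
ctx = 128_000
n_layers, n_kv_heads, head_dim = 48, 4, 128   # assumed architecture numbers
kv_fp16_gb = ctx * n_layers * n_kv_heads * head_dim * 2 * 2 / 1e9
kv_q4_gb = kv_fp16_gb / 4                     # 4-bit codes vs FP16

print(fp16_weights_gb, q4_weights_gb, kv_fp16_gb, kv_q4_gb)
```

With both techniques the total drops from roughly 67 GB to under 20 GB, which is what lets 128K-token contexts fit on consumer Apple Silicon machines.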