Qwen3.5-35B-A3B — TurboQuant+ GGUF
Near-Q4_K_M size, near-Q8_0 quality. Better perplexity than Q4_K_M at nearly the same size (PPL 6.61 vs 6.86; 21.0 GB vs 20.2 GB).
When combined with TurboQuant+ KV cache compression, this enables significantly larger context windows within the same memory budget.
Best option if Q8_0 doesn't fit in memory but Q4_K_M quality isn't good enough.
TurboQuant+ Config-I quantization of Qwen/Qwen3.5-35B-A3B. Config-I applies WHT-domain compression (TQ4_1S) to attention and gate/up tensors while keeping boundary layers and ffn_down at higher precision for optimal quality. See the getting started guide for details.
Requires the TurboQuant+ llama.cpp fork at tag tqp-v0.1.0. Will NOT work with stock llama.cpp. TurboQuant+ is an independent research project. These quantization types have not been merged into upstream ggml-org/llama.cpp. Do not file issues there for TQ models.
Files
| File | Quant | Size | vs Q8_0 | PPL | PPL vs Q8_0 | PPL vs Q4_K_M |
|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B-Config-I.gguf | Config-I | 21.0 GB | 61% of Q8_0 (34.4 GB) | 6.61 | +0.5% | 3.7% better |
Why Config-I over Q4_K_M?
At nearly the same file size (~4% larger), Config-I achieves substantially better perplexity than Q4_K_M (6.61 vs 6.86) with faster decode.
| Quant | Size | PPL | vs Q8_0 |
|---|---|---|---|
| Q8_0 | 34.4 GB | 6.57 | baseline |
| Q4_K_M | 20.2 GB | 6.86 | +4.4% |
| Config-I | 21.0 GB | 6.61 | +0.5% |
Compatibility
| Field | Value |
|---|---|
| Fork | TheTom/llama-cpp-turboquant |
| Tag | tqp-v0.1.0 |
| Backends | Metal, CUDA, ROCm/HIP, Vulkan |
| Quantized on | 2026-04-08 |
No forward-compatibility guarantee. This model was built and validated against the tag above. Future fork updates may change the format. If decode produces garbage, rebuild from this tag.
Benchmarks (Apple M5 Max 128GB, Metal)
Speed
On Metal, Config-I trades some decode speed (~1.6x slower than Q8_0) for memory savings. The recommended setup below combines Config-I weights with turbo4 KV cache compression.
Metal (Apple M5 Max 128GB, recommended -ctk q8_0 -ctv turbo4)
| Config | pp512 | pp2048 | tg128 | Size |
|---|---|---|---|---|
| Q8_0 baseline (f16 KV) | 2,903 t/s | 2,762 t/s | 68.2 t/s | 34.4 GB |
| Config-I + turbo4 KV | 2,386 t/s | 2,284 t/s | 42.4 t/s | 21.0 GB + ~60% KV cache memory reduction |
This Metal result prioritizes memory efficiency over raw decode speed. On NVIDIA GPUs (Ada/Blackwell), native TQ4_1S dp4a kernels reduce or eliminate much of this decode gap.
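The trade-off can be recomputed directly from the table above. The sketch below uses only the published throughput figures. The helper names are illustrative, and the request-latency estimate assumes prefill and decode throughput stay constant over the whole request, which is a simplification.

```python
# Throughput figures (tokens/s) copied from the Metal table above.
Q8_0 = {"pp512": 2903.0, "pp2048": 2762.0, "tg128": 68.2}
CONFIG_I_TURBO4 = {"pp512": 2386.0, "pp2048": 2284.0, "tg128": 42.4}


def slowdown(baseline, candidate):
    """Return baseline/candidate throughput per benchmark (>1 means candidate is slower)."""
    return {name: baseline[name] / candidate[name] for name in baseline}


def request_seconds(rates, prompt_tokens=2048, new_tokens=128):
    """Rough latency estimate, ASSUMING constant prefill and decode throughput."""
    return prompt_tokens / rates["pp2048"] + new_tokens / rates["tg128"]


ratios = slowdown(Q8_0, CONFIG_I_TURBO4)
for name, ratio in ratios.items():
    print(f"{name}: Q8_0 is {ratio:.2f}x faster")
print(f"Q8_0 request:     {request_seconds(Q8_0):.2f} s")
print(f"Config-I request: {request_seconds(CONFIG_I_TURBO4):.2f} s")
```

The decode ratio is where the "~1.6x" figure comes from; prefill is affected much less.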
Context Scaling (tg128 decode at varying context length, Metal)
| Context | 1K | 4K | 8K | 16K | 32K |
|---|---|---|---|---|---|
| t/s | 40.5 | 39.3 | 41.6 | 40.2 | 42.0 |
Effectively flat decode performance across the tested context lengths (1K to 32K). No degradation cliff.
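One way to quantify "flat" is to compare the spread of the measurements with their mean. This is a small sketch over the five values in the table; the `summarize` helper is illustrative and not part of any tooling.

```python
from statistics import mean

# tg128 decode throughput (tokens/s) by context length, from the table above.
DECODE_TPS = {"1K": 40.5, "4K": 39.3, "8K": 41.6, "16K": 40.2, "32K": 42.0}


def summarize(samples):
    """Return (mean, max-min spread, spread as a fraction of the mean)."""
    values = list(samples.values())
    avg = mean(values)
    spread = max(values) - min(values)
    return avg, spread, spread / avg


avg, spread, relative_spread = summarize(DECODE_TPS)
print(f"mean {avg:.2f} t/s, spread {spread:.1f} t/s ({relative_spread:.1%} of mean)")
```

The fastest and slowest measurements differ by less than 7% of the mean, and the 32K result is the fastest of the five, so there is no downward trend with context length in this data.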
Perplexity (wikitext-2-raw, 512 context, 20 chunks)
| Config | PPL | vs Q8_0 |
|---|---|---|
| Q8_0 | 6.5692 | baseline |
| Q4_K_M | 6.8575 | +4.4% |
| Config-I | 6.6054 | +0.5% |
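The relative figures can be reproduced from the four-decimal perplexities above. A minimal sketch follows; `percent_change` is an illustrative helper.

```python
# Perplexities from the table above (wikitext-2-raw, 512 context, 20 chunks).
PPL = {"Q8_0": 6.5692, "Q4_K_M": 6.8575, "Config-I": 6.6054}


def percent_change(value, reference):
    """Relative change of value against reference, in percent (positive = worse PPL)."""
    return (value / reference - 1.0) * 100.0


for name in ("Q4_K_M", "Config-I"):
    print(f"{name}: {percent_change(PPL[name], PPL['Q8_0']):+.2f}% vs Q8_0")
print(f"Config-I: {percent_change(PPL['Config-I'], PPL['Q4_K_M']):+.2f}% vs Q4_K_M")
```

Computed this way, Config-I lands at roughly +0.55% over Q8_0, which the tables list as +0.5%, and about 3.7% below Q4_K_M.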
Recommended KV Cache Settings
TurboQuant+ also compresses the KV cache at runtime via -ctk and -ctv flags. This doesn't change the model file, just how much memory the context window uses. For long-context workloads, this lets you fit more context in the same memory.
| K cache (-ctk) | V cache (-ctv) | PPL | vs default | Speed impact | KV buffer (16K ctx) | Recommendation |
|---|---|---|---|---|---|---|
| f16 | f16 | 6.61 | baseline | none | 320 MiB | default, short context |
| q8_0 | turbo4 | 6.62 | +0.3% | negligible | 128 MiB (60% smaller) | recommended for long context |
| q8_0 | turbo3 | 6.63 | +0.3% | negligible | ~120 MiB (63% smaller) | aggressive, still good quality |
The recommended config (-ctk q8_0 -ctv turbo4) reduces KV cache memory by 60% at only +0.3% PPL cost. Measured at 16K context: 320 MiB down to 128 MiB.
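To see what the saving means for context length, the 16K measurements can be extrapolated. The sketch below assumes the KV buffer scales linearly with context length; that assumption is not a measured result from this card, and the helper names are illustrative.

```python
# Measured KV buffer sizes at 16K context (MiB), from the table above.
MEASURED_CTX = 16 * 1024
KV_MIB_AT_16K = {("f16", "f16"): 320.0, ("q8_0", "turbo4"): 128.0}


def kv_mib(kv_types, context):
    """Estimate KV buffer size, ASSUMING it scales linearly with context length."""
    return KV_MIB_AT_16K[kv_types] * context / MEASURED_CTX


def max_context(kv_types, budget_mib):
    """Largest context (tokens) fitting in budget_mib under the same linear assumption."""
    return int(budget_mib * MEASURED_CTX / KV_MIB_AT_16K[kv_types])


saving = 1.0 - KV_MIB_AT_16K[("q8_0", "turbo4")] / KV_MIB_AT_16K[("f16", "f16")]
print(f"KV memory saved: {saving:.0%}")
print(f"32K context: f16 {kv_mib(('f16', 'f16'), 32768):.0f} MiB, "
      f"q8_0/turbo4 {kv_mib(('q8_0', 'turbo4'), 32768):.0f} MiB")
print(f"Context in a 320 MiB budget with q8_0/turbo4: "
      f"{max_context(('q8_0', 'turbo4'), 320)} tokens")
```

Under that assumption, the memory that holds 16K tokens of f16 KV cache would hold about 2.5 times as many tokens with q8_0/turbo4.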
Other combinations (q8_0/q8_0, f16/turbo4, turbo4/f16) either crash or have severe prefill regressions. Stick with the configs above.
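Because only some K/V pairs are usable, a launcher script may want to reject the others before starting the model. The following is a hypothetical helper, not part of the fork. It uses only the flags shown in this card and treats the three pairs in the table as the allowed set.

```python
import shlex

# (K, V) cache type pairs that the table above lists as working.
SUPPORTED_KV_PAIRS = {("f16", "f16"), ("q8_0", "turbo4"), ("q8_0", "turbo3")}


def build_llama_cli_args(model_path, context, ctk="f16", ctv="f16",
                         binary="./build/bin/llama-cli"):
    """Build an argv list for llama-cli, refusing KV pairs not listed as working."""
    if (ctk, ctv) not in SUPPORTED_KV_PAIRS:
        raise ValueError(f"unsupported KV cache pair: -ctk {ctk} -ctv {ctv}")
    args = [binary, "-m", model_path, "-ngl", "99", "-c", str(context)]
    # f16/f16 is the default, so the flags are only added for compressed caches.
    if (ctk, ctv) != ("f16", "f16"):
        args += ["-ctk", ctk, "-ctv", ctv]
    return args


args = build_llama_cli_args("Qwen3.5-35B-A3B-Config-I.gguf", 32768, "q8_0", "turbo4")
print(shlex.join(args))
```

The resulting list can be passed to `subprocess.run` as-is; a pair such as q8_0/q8_0 raises `ValueError` instead of reaching the binary.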
Download
# Install huggingface-cli
brew install huggingface-cli
# or: pip install huggingface-hub
# Download
hf download pidtom/Qwen3.5-35B-A3B-TQPlus Qwen3.5-35B-A3B-Config-I.gguf --local-dir .
How to Run
# Clone and build the TurboQuant+ fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout tqp-v0.1.0
# Build for Metal (macOS)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Or build for CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Or build for Vulkan
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Run (default KV cache)
./build/bin/llama-cli -m Qwen3.5-35B-A3B-Config-I.gguf -ngl 99 -c 8192
# Run with recommended KV compression (for long context)
./build/bin/llama-cli -m Qwen3.5-35B-A3B-Config-I.gguf -ngl 99 -c 32768 -ctk q8_0 -ctv turbo4
What is TurboQuant+?
TurboQuant+ applies Walsh-Hadamard Transform (WHT) domain quantization to compress model weights beyond standard GGUF quant types. This achieves lower bits-per-weight at equivalent or better quality by exploiting structured redundancy in the weight matrices.
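As a rough illustration of what "WHT-domain quantization" means, the sketch below transforms a block of weights with an orthonormal Walsh-Hadamard transform, quantizes the coefficients, and transforms back. It is a generic toy and not the TQ4_1S format: the block size, the symmetric 4-bit quantizer, and all function names are assumptions made for the example.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_4bit(values):
    """Toy symmetric quantizer: one scale per block, integer codes in [-7, 7]."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


# A block with one outlier weight (made-up numbers).
block = [0.1, -0.2, 0.05, 3.0, -0.1, 0.15, 0.0, -0.05]
codes, scale = quantize_4bit(fwht(block))
# The orthonormal WHT is its own inverse, so applying it again returns to weight space.
restored = fwht(dequantize(codes, scale))
print("codes:   ", codes)
print("restored:", [round(v, 3) for v in restored])
```

The transform spreads the single outlier across all coefficients of the block, so the per-block scale is no longer set by one extreme value. Because the transform is orthonormal, the quantization error has the same L2 norm in the weight domain as in the WHT domain.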