Very impressed by Qwen3.5-397B-A17B-UD-TQ1_0.gguf quant

#12
by dzupin - opened

I normally prefer to use at least 4-bit quants for local inference, but because this model is so massive, I had to go with the smallest one available. I have to admit, the quality of this highly compressed version really surprised me and made me question if my strict 4-bit "rule" is even necessary.
I'm currently using Qwen3.5-397B-A17B-UD-TQ1_0.gguf with llama-server (llama.cpp) and just wanted to share my appreciation for how well this aggressively quantized model performs at writing Python code. I expected to see a lot of mistakes in the generated code, but it has been flawless so far!
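For anyone wanting to reproduce this setup, a launch command along these lines should work. This is a sketch, not the poster's exact invocation: the model path, port, and `-ngl` value are assumptions, and flag spellings (`-fa`, `-ctk`/`-ctv`) vary slightly between llama.cpp builds, so check `llama-server --help` on your version.

```shell
# Hypothetical llama-server launch mirroring the settings in this post.
# Model path, host, and port are placeholders.
llama-server \
  -m Qwen3.5-397B-A17B-UD-TQ1_0.gguf \
  -c 323072 \            # context size reported in the post
  -ngl 99 \              # offload all layers to the GPU
  -fa \                  # flash attention (some builds take "-fa 1")
  -ctk q8_0 -ctv q8_0 \  # q8_0 KV cache, as used in the benchmark
  --host 127.0.0.1 --port 8080
```

Once running, the server exposes an OpenAI-compatible API on the given port.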

Of all the models I've ever tried that fit entirely into 96 GB of VRAM, this one is indeed "the best". Its knowledge is noticeably degraded, but it's still usable. The largest context size I can run it with is 323072 (94.602Gi/95.593Gi used); with 323073 it never finishes loading. In normal use, GPU utilization is around 90% (some entirely-GPU models run much lower), yet power draw is only around 300 W (out of 600 W), the temperature stays well below 50°C, and token generation runs at around 63 t/s. Only the benchmark pushes both GPU and power utilization to their maximums.
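It's worth noting how much of that 96 GB a 323072-token q8_0 KV cache can consume on its own. The sketch below estimates it; all architecture numbers (layer count, KV heads, head dimension) are hypothetical placeholders, not the real Qwen3.5-397B-A17B config, so treat the result as an order-of-magnitude illustration only.

```python
# Rough KV-cache size estimate for a long context with q8_0 cache types.
# ALL architecture numbers are hypothetical -- check the model's actual
# config.json for the real layer count, KV heads, and head dimension.
n_layers = 94        # hypothetical
n_kv_heads = 4       # hypothetical (grouped-query attention)
head_dim = 128       # hypothetical
ctx = 323072         # context length from the post
bytes_per_elem = 1.0625  # q8_0 stores ~8.5 bits per element

# Each token stores one K and one V vector per layer.
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
total_gib = per_token * ctx / 2**30
print(f"~{per_token:.0f} bytes/token, ~{total_gib:.1f} GiB at ctx={ctx}")
```

With these placeholder numbers the cache alone lands around 30 GiB, which is consistent with the model weights (~88 GiB) plus cache just barely fitting in 96 GB before 323073 tips it over.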

llama-bench -m unsloth_Qwen3.5-397B-A17B-GGUF_Qwen3.5-397B-A17B-UD-TQ1_0.gguf -ctk q8_0 -ctv q8_0 -fa 1
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX PRO 6000 Blackwell Workstation Edition (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

| model | size | params | backend | threads | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe ?B IQ1_S - 1.5625 bpw | 87.68 GiB | 396.35 B | CUDA,Vulkan,BLAS | 16 | q8_0 | q8_0 | 1 | pp512 | 1155.78 ± 6.74 |
| qwen35moe ?B IQ1_S - 1.5625 bpw | 87.68 GiB | 396.35 B | CUDA,Vulkan,BLAS | 16 | q8_0 | q8_0 | 1 | tg128 | 62.71 ± 0.14 |
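A small aside on the "1.5625 bpw" in the bench output: that figure is the nominal rate of the IQ1_S tensor type, while the file as a whole averages higher because dynamic quants like UD-TQ1_0 keep some tensors at higher precision. The effective bits per weight follow directly from the size and parameter count in the table:

```python
# Effective bits per weight from the llama-bench table above.
size_gib = 87.68    # file size reported by llama-bench
params_b = 396.35   # parameter count in billions
bpw = size_gib * 2**30 * 8 / (params_b * 1e9)
print(f"effective bpw: {bpw:.2f}")  # noticeably above the nominal 1.5625
```

So the quant averages roughly 1.9 bits per weight overall, which may partly explain why quality holds up better than the "1-bit" label suggests.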

Thank you. This data point settles the value proposition of an RTX PRO 6000 for me.

There's a quality of response I see in models with >10B active parameters that, I think, carries that emergent spark of intelligence, not just better training.
