Very impressed by Qwen3.5-397B-A17B-UD-TQ1_0.gguf quant
I normally prefer to use at least 4-bit quants for local inference, but because this model is so massive, I had to go with the smallest one available. I have to admit, the quality of this highly compressed version really surprised me and made me question if my strict 4-bit "rule" is even necessary.
I'm currently running Qwen3.5-397B-A17B-UD-TQ1_0.gguf with llama-server (llama.cpp) and just wanted to share my appreciation for how well this aggressively quantized model writes Python code. I expected to see a lot of mistakes in the generated code, but it's been error-free so far!
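For context, a minimal launch sketch (not my exact command: the -ngl and host/port values are assumptions you'd adapt to your setup, while -c, -ctk/-ctv and -fa follow the settings mentioned elsewhere in this post):

```sh
# Sketch of a llama-server launch for this quant, using standard
# llama.cpp flags. -c matches the max context I report below;
# -ngl 99 (offload everything) and the host/port are assumptions.
./llama-server \
  -m unsloth_Qwen3.5-397B-A17B-GGUF_Qwen3.5-397B-A17B-UD-TQ1_0.gguf \
  -c 323072 -ngl 99 -ctk q8_0 -ctv q8_0 -fa 1 \
  --host 127.0.0.1 --port 8080
```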
Of all the models I've tried that fit entirely into 96 GB of VRAM, this one is indeed "the best". Its knowledge is noticeably degraded, but it's still usable. The largest context size I can run it with is 323072 (94.602Gi/95.593Gi used); with 323073 it never finishes loading. In normal use, GPU utilization sits around 90% (some fully-GPU models run much lower), yet power draw is only around 300 W of the 600 W limit, the temperature stays well below 50°C, and tg hovers around 63 t/s. Only the benchmark pushes both GPU and power utilization to their maximums.
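For anyone who wants to watch the same figures on their own card, a standard nvidia-smi polling query is enough (nothing model-specific here):

```sh
# Poll GPU utilization, power draw, temperature and memory use
# once per second while llama-server is generating.
nvidia-smi \
  --query-gpu=utilization.gpu,power.draw,temperature.gpu,memory.used \
  --format=csv -l 1
```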
```
llama-bench -m unsloth_Qwen3.5-397B-A17B-GGUF_Qwen3.5-397B-A17B-UD-TQ1_0.gguf -ctk q8_0 -ctv q8_0 -fa 1
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX PRO 6000 Blackwell Workstation Edition (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
```
| model | size | params | backend | threads | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35moe ?B IQ1_S - 1.5625 bpw | 87.68 GiB | 396.35 B | CUDA,Vulkan,BLAS | 16 | q8_0 | q8_0 | 1 | pp512 | 1155.78 ± 6.74 |
| qwen35moe ?B IQ1_S - 1.5625 bpw | 87.68 GiB | 396.35 B | CUDA,Vulkan,BLAS | 16 | q8_0 | q8_0 | 1 | tg128 | 62.71 ± 0.14 |
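If you'd rather reproduce the coding test than the benchmark, llama-server exposes an OpenAI-compatible chat endpoint; a quick smoke test along these lines works (the prompt, max_tokens and temperature are just example values, not what I actually used):

```sh
# Smoke test against the running llama-server via its
# OpenAI-compatible chat endpoint; prompt and sampling
# parameters here are only examples.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function that parses an ISO 8601 date string and returns a datetime object."}
    ],
    "max_tokens": 512,
    "temperature": 0.2
  }'
```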
Thank you. This data point gives me confidence in the value proposition of an RTX PRO 6000 for my use case.
There's a quality of response I see in models with >10B active parameters that I think holds that emergent thing, what feels to me like a spark of intelligence, not just better training.