Very impressed by Qwen3.5-397B-A17B-UD-TQ1_0.gguf quant
I normally prefer to use at least 4-bit quants for local inference, but because this model is so massive, I had to go with the smallest one available. I have to admit, the quality of this highly compressed version really surprised me and made me question if my strict 4-bit "rule" is even necessary.
I'm currently running Qwen3.5-397B-A17B-UD-TQ1_0.gguf with llama-server (llama.cpp) and just wanted to share my appreciation for how well this aggressively quantized model writes Python code. I expected to see a lot of mistakes in the generated code, but it's been error-free so far!
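For context, a minimal launch sketch (not my exact command: the -ngl and host/port values are assumptions you'd adapt to your setup, while -c, -ctk/-ctv and -fa follow the settings mentioned elsewhere in this post):

```sh
# Sketch of a llama-server launch for this quant, using standard
# llama.cpp flags. -c matches the max context I report below;
# -ngl 99 (offload everything) and the host/port are assumptions.
./llama-server \
  -m unsloth_Qwen3.5-397B-A17B-GGUF_Qwen3.5-397B-A17B-UD-TQ1_0.gguf \
  -c 323072 -ngl 99 -ctk q8_0 -ctv q8_0 -fa 1 \
  --host 127.0.0.1 --port 8080
```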
Of all the models I've tried that fit entirely into 96 GB of VRAM, this one is indeed "the best". Its knowledge is noticeably degraded, but it's still usable. The largest context size I can run it with is 323072 (94.602Gi/95.593Gi used); with 323073 it never finishes loading. In normal use, GPU utilization sits around 90% (some fully-GPU models run much lower), yet power draw is only around 300 W of the 600 W limit, the temperature stays well below 50°C, and tg hovers around 63 t/s. Only the benchmark pushes both GPU and power utilization to their maximums.
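For anyone who wants to watch the same figures on their own card, a standard nvidia-smi polling query is enough (nothing model-specific here):

```sh
# Poll GPU utilization, power draw, temperature and memory use
# once per second while llama-server is generating.
nvidia-smi \
  --query-gpu=utilization.gpu,power.draw,temperature.gpu,memory.used \
  --format=csv -l 1
```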
```
llama-bench -m unsloth_Qwen3.5-397B-A17B-GGUF_Qwen3.5-397B-A17B-UD-TQ1_0.gguf -ctk q8_0 -ctv q8_0 -fa 1
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, compute capability 12.0, VMM: yes
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA RTX PRO 6000 Blackwell Workstation Edition (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
```
| model | size | params | backend | threads | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35moe ?B IQ1_S - 1.5625 bpw | 87.68 GiB | 396.35 B | CUDA,Vulkan,BLAS | 16 | q8_0 | q8_0 | 1 | pp512 | 1155.78 ± 6.74 |
| qwen35moe ?B IQ1_S - 1.5625 bpw | 87.68 GiB | 396.35 B | CUDA,Vulkan,BLAS | 16 | q8_0 | q8_0 | 1 | tg128 | 62.71 ± 0.14 |
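If you'd rather reproduce the coding test than the benchmark, llama-server exposes an OpenAI-compatible chat endpoint; a quick smoke test along these lines works (the prompt, max_tokens and temperature are just example values, not what I actually used):

```sh
# Smoke test against the running llama-server via its
# OpenAI-compatible chat endpoint; prompt and sampling
# parameters here are only examples.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a Python function that parses an ISO 8601 date string and returns a datetime object."}
    ],
    "max_tokens": 512,
    "temperature": 0.2
  }'
```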
Thank you. This data point gives me confidence in the value proposition of an RTX PRO 6000 for my use case.
There's a quality of response I see in models with >10B active parameters that I think holds that emergent thing, what feels to me like a spark of intelligence, not just better training.