Qwen3.5-35B-A3B NVFP4 gguf
A Qwen3.5 quant with the experts quantized to NVFP4. It is still slow, but it shows better accuracy than MXFP4. This makes sense, because both Q4_K and NVFP4 use a super-block scale. Since it is a proprietary format from NVIDIA, I hope it will get better support and end up faster than INT4. Quantized from BF16, no imatrix used. Used llama.cpp version 3a14a542f5ce8666713c6e6ea44f7f3e01dd6e45 to quantize and to calculate the KLD.
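For intuition, here is a minimal Python sketch of the two 4-bit schemes. The E2M1 value grid and block sizes follow the public format descriptions; the E4M3 rounding helper is a crude stand-in that ignores exponent-range limits, and none of this is llama.cpp's actual kernel code:

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def round_to_fp4(x, scale):
    """Scale a block, then snap each value to the nearest signed E2M1 point."""
    y = x / scale
    idx = np.abs(np.abs(y)[:, None] - FP4[None, :]).argmin(axis=1)
    return np.sign(y) * FP4[idx] * scale

def e4m3_round(s):
    """Crude FP8 E4M3 rounding of a positive scale: keep 3 mantissa bits
    (ignores E4M3's exponent limits -- an assumption for illustration)."""
    e = np.floor(np.log2(s))
    return np.round(s / 2.0**e * 8) / 8 * 2.0**e

def mxfp4_block(x):
    """MXFP4: 32 elements, power-of-two (E8M0) block scale, no second level."""
    amax = np.abs(x).max()
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    return round_to_fp4(x, scale)

def nvfp4_block(x, tensor_scale):
    """NVFP4: 16 elements, FP8 (E4M3) block scale times an FP32 tensor scale."""
    amax = np.abs(x).max()
    block_scale = e4m3_round(amax / (6.0 * tensor_scale)) if amax > 0 else 1.0
    return round_to_fp4(x, block_scale * tensor_scale)
```

The second-level FP32 tensor scale, together with non-power-of-two FP8 block scales, is the structural similarity to Q4_K's two-level scaling that MXFP4 lacks.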
Quant config:
```
token_embd=Q8_0
attn_gate=Q8_0
attn_norm=F32
attn_qkv=Q8_0
ffn_down_exps=NVFP4
ffn_down_shexp=Q8_0
ffn_gate_exps=NVFP4
ffn_gate_inp=F32
ffn_gate_inp_shexp=Q8_0
ffn_gate_shexp=Q8_0
ffn_up_exps=NVFP4
ffn_up_shexp=Q8_0
post_attention_norm=F32
ssm_a=F32
ssm_alpha=Q8_0
ssm_beta=Q8_0
ssm_conv1d=F32
ssm_dt.bias=F32
ssm_norm=F32
ssm_out=Q8_0
attn_k=Q8_0
attn_k_norm=F32
attn_q=Q8_0
attn_q_norm=F32
attn_v=Q8_0
attn_output=Q8_0
output=Q8_0
output_norm=F32
```
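To double-check that the overrides landed, the per-tensor types can be read back from the resulting file with the gguf Python package that ships with llama.cpp (a small sketch; model.gguf is a placeholder path):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("model.gguf")  # hypothetical path to the quantized file
for tensor in reader.tensors:
    # Print each tensor's name and its quantization type (Q8_0, F32, ...).
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```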
kld data
UPDATE 20260420: https://github.com/ggml-org/llama.cpp/discussions/22042 It looks like the per-tensor scale is not implemented yet. Hopefully it will be implemented soon, as we can expect a significant KLD improvement from it. Excited to see this!
Baseline f16 run: Final estimate: PPL = 6.8908 +/- 0.04680
| Provider | Quant | Size GB | Mean PPL | Mean KLD | Same Top p |
|---|---|---|---|---|---|
| Unsloth | f16 | | 6.8908 ± 0.04680 | baseline | baseline |
| Unsloth | UD-Q6_K_XL | 32.1 | 5.859151 ± 0.036117 | 0.005158 ± 0.000170 | 96.957 ± 0.045 % |
| Unsloth | UD-Q5_K_XL | 26.4 | 5.856634 ± 0.036095 | 0.006890 ± 0.000211 | 96.543 ± 0.047 % |
| Unsloth | Q5_K_M | 26.2 | 5.857466 ± 0.036094 | 0.006996 ± 0.000206 | 96.496 ± 0.048 % |
| AES | Q5_K_M | 26.3 | 5.854688 ± 0.036066 | 0.007233 ± 0.000248 | 96.499 ± 0.048 % |
| Unsloth | UD-Q6_K_S | 28.5 | 5.875750 ± 0.036290 | 0.007825 ± 0.000240 | 96.223 ± 0.049 % |
| Unsloth | Q5_K_S | 24.8 | 5.851724 ± 0.036036 | 0.008098 ± 0.000241 | 96.338 ± 0.049 % |
| AES | Q4_K_M | 22.2 | 5.900010 ± 0.036455 | 0.010871 ± 0.000245 | 95.731 ± 0.052 % |
| Unsloth | UD-Q4_K_XL | 22.2 | 5.885954 ± 0.036317 | 0.011072 ± 0.000276 | 95.783 ± 0.052 % |
| Unsloth | Q4_K_M | 22 | 5.892703 ± 0.036382 | 0.011525 ± 0.000270 | 95.590 ± 0.053 % |
| mradermacher | Q5_K_M | 24.8 | 5.889589 ± 0.036408 | 0.012260 ± 0.000276 | 95.306 ± 0.055 % |
| me | Q4_K | 20.8 | 5.922957 ± 0.036659 | 0.013359 ± 0.000249 | 95.349 ± 0.055 % |
| Unsloth | Q4_K_S | 20.7 | 5.930525 ± 0.036716 | 0.013894 ± 0.000266 | 95.233 ± 0.055 % |
| AES | IQ4_XS | 17.6 | 5.984069 ± 0.037076 | 0.024742 ± 0.000364 | 93.753 ± 0.063 % |
| Unsloth | UD-IQ4_XS | 17.5 | 5.979239 ± 0.037032 | 0.025096 ± 0.000336 | 93.543 ± 0.064 % |
| Unsloth | UD-IQ4_NL | 17.8 | 5.981326 ± 0.037050 | 0.025159 ± 0.000350 | 93.596 ± 0.064 % |
| me | nvfp4 | 20.8 | 5.863949 ± 0.035996 | 0.027935 ± 0.000391 | 93.281 ± 0.065 % |
| me | mxfp4 | 19.8 | 5.996372 ± 0.036990 | 0.054779 ± 0.000531 | 90.355 ± 0.077 % |
| Unsloth | UD-Q2_K_XL | 12.2 | 6.393534 ± 0.040151 | 0.091808 ± 0.000669 | 87.288 ± 0.086 % |
| Intel AR | q2 mixed | 12.5 | 6.733538 ± 0.043640 | 0.149627 ± 0.000910 | 84.540 ± 0.094 % |
llama-perplexity results
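The Mean KLD and Same Top p columns measure how closely the quantized model's token distribution tracks the full-precision one. A minimal sketch of the two statistics, given full-precision and quantized logits (an illustration of the definitions, not llama.cpp's implementation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kld_stats(base_logits, quant_logits):
    """Mean KL(base || quant) per token, plus the fraction of positions
    where both models agree on the most likely next token.

    base_logits, quant_logits: arrays of shape (n_tokens, vocab_size)."""
    p, q = softmax(base_logits), softmax(quant_logits)
    kld = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean()
    same_top = (base_logits.argmax(-1) == quant_logits.argmax(-1)).mean()
    return kld, same_top
```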
I did not use an imatrix because NVFP4 should not be affected by one: its scales come straight from each block's absolute maximum rather than from an importance-weighted search.
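The contrast with k-quants, where the imatrix does matter, is easy to show in a sketch (hypothetical helper functions, not llama.cpp's actual scale search):

```python
import numpy as np

def absmax_scale(block, qmax=6.0):
    """MXFP4/NVFP4 style: the scale depends only on the block maximum,
    so per-weight importance values cannot change the result."""
    return np.abs(block).max() / qmax

def weighted_scale(block, importance, candidates, qmax=7):
    """k-quant style: pick the scale minimizing importance-weighted error.
    Here the imatrix (importance) directly steers the chosen scale."""
    def err(s):
        q = np.clip(np.round(block / s), -qmax, qmax) * s
        return (importance * (block - q) ** 2).sum()
    return min(candidates, key=err)
```

In the second function, changing the importance weights changes which scale wins; in the first, they never enter the computation.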
performance
On 2x RTX 5060 Ti. (llama-bench shows the same Q8_0 model label for all of the files below; they differ in the expert quantization, as the sizes show.)
```
CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
```
nvfp4 (update 20260403)
Performance improved slightly, making this usable.
| model | size | params | backend | ngl | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 999 | 1 | 0 | pp512 | 2185.12 ± 10.77 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 999 | 1 | 0 | tg128 | 95.05 ± 0.09 |
nvfp4 (update 20260420)
I tested Michael's branch here: https://github.com/ggml-org/llama.cpp/pull/21896. With it, NVFP4 prompt processing is getting close to MXFP4's.
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | pp512 | 2900.56 ± 8.25 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | tg128 | 94.56 ± 0.04 |
mxfp4
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 18.43 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 3018.48 ± 12.24 |
| qwen35moe 35B.A3B Q8_0 | 18.43 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 96.39 ± 0.05 |
q4_k
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 2494.68 ± 14.59 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 96.60 ± 0.08 |
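Putting the three pp512 numbers side by side (values copied from the tables above; the nvfp4 figure is the 20260420 one from the PR 21896 branch):

```python
pp512 = {"nvfp4 (PR 21896)": 2900.56, "mxfp4": 3018.48, "q4_k": 2494.68}
for name, tps in sorted(pp512.items(), key=lambda kv: -kv[1]):
    # Report each quant's prompt-processing speed relative to mxfp4.
    print(f"{name}: {tps:7.2f} t/s ({tps / pp512['mxfp4']:.1%} of mxfp4)")
```

So NVFP4 now trails MXFP4 by only a few percent on prompt processing, while token generation is essentially tied (roughly 95-97 t/s) across all three.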