Qwen3.5-35B-A3B NVFP4 gguf

A Qwen3.5 quant with the experts quantized to NVFP4. It is still slow, but it shows better accuracy than MXFP4. This makes sense, because both Q4_K and NVFP4 use a superblock scale. Since NVFP4 is a proprietary NVIDIA format, I hope it will get better support and end up faster than int4. Quantized from BF16, no imatrix used. Used llama.cpp version 3a14a542f5ce8666713c6e6ea44f7f3e01dd6e45 to quantize and calculate KLD.
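To illustrate the scale-granularity argument, here is a toy sketch (my own simplification, not llama.cpp's kernels): MXFP4 scales each 32-value block by a power of two (E8M0), while NVFP4 scales each 16-value block with a finer FP8 (E4M3) value plus a per-tensor FP32 scale, which usually lands the dequantized values closer to the originals. The E4M3 rounding of the block scale and the per-tensor scale are omitted here for brevity.

```python
import math

# FP4 (E2M1) magnitudes; a sign bit gives the negatives.
FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_fp4(x, scale):
    """Quantize x to FP4 under the given block scale, then dequantize."""
    if scale == 0:
        return 0.0
    q = min(FP4, key=lambda v: abs(v - abs(x) / scale))
    return math.copysign(q * scale, x)

def mxfp4_block(values):
    # MXFP4: one power-of-two (E8M0) scale per 32-value block.
    amax = max(abs(v) for v in values)
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0)) if amax else 1.0
    return [round_fp4(v, scale) for v in values]

def nvfp4_block(values):
    # NVFP4: one FP8 (E4M3) scale per 16-value block; E4M3 rounding of
    # the scale and the extra per-tensor FP32 scale are omitted here.
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax else 1.0
    return [round_fp4(v, scale) for v in values]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

block = [5.0, -2.5, 1.25, 0.6]
print(rmse(block, mxfp4_block(block)))  # coarse power-of-two scale
print(rmse(block, nvfp4_block(block)))  # finer scale, lower error
```

On this toy block the power-of-two scale forces the largest value between two FP4 levels, while the finer NVFP4-style scale hits it exactly; that is the same mechanism that makes the superblock-scaled Q4_K and NVFP4 behave similarly in the KLD table below.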

Quant config:

token_embd=Q8_0
attn_gate=Q8_0
attn_norm=F32
attn_qkv=Q8_0
ffn_down_exps=NVFP4
ffn_down_shexp=Q8_0
ffn_gate_exps=NVFP4
ffn_gate_inp=F32
ffn_gate_inp_shexp=Q8_0
ffn_gate_shexp=Q8_0
ffn_up_exps=NVFP4
ffn_up_shexp=Q8_0
post_attention_norm=F32
ssm_a=F32
ssm_alpha=Q8_0
ssm_beta=Q8_0
ssm_conv1d=F32
ssm_dt.bias=F32
ssm_norm=F32
ssm_out=Q8_0
attn_k=Q8_0
attn_k_norm=F32
attn_q=Q8_0
attn_q_norm=F32
attn_v=Q8_0
attn_output=Q8_0
output=Q8_0
output_norm=F32
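A rough bits-per-weight estimate from the public block layouts of these formats (GGUF metadata and NVFP4's per-tensor FP32 scale add a little on top) shows what the config above costs per tensor type:

```python
# Rough bits-per-weight: each block stores `block_size` quantized
# values plus one shared scale (block layouts from the public specs).
def bpw(value_bits, block_size, scale_bits):
    return value_bits + scale_bits / block_size

q8_0  = bpw(8, 32, 16)  # 32 int8 values + one FP16 scale
mxfp4 = bpw(4, 32, 8)   # 32 FP4 values + one E8M0 (power-of-two) scale
nvfp4 = bpw(4, 16, 8)   # 16 FP4 values + one E4M3 scale
                        # (plus one FP32 scale per tensor, negligible)
print(q8_0, mxfp4, nvfp4)  # 8.5 4.25 4.5
```

The extra 0.25 bpw of finer-grained scales is roughly where the ~1 GiB size difference between the NVFP4 (19.36 GiB) and MXFP4 (18.43 GiB) files in the benchmarks below comes from.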

kld data

UPDATE 20260420: https://github.com/ggml-org/llama.cpp/discussions/22042 It looks like the per-tensor scale is not implemented yet. Hopefully it will be implemented soon, since we can expect a significant KLD improvement from it. Excited to see this!

Final estimate: PPL = 6.8908 +/- 0.04680

| Provider | Quant | Size (GB) | Mean PPL | Mean KLD | Same top p |
|----------|-------|-----------|----------|----------|------------|
| Unsloth | f16 | | 6.8908 ± 0.04680 | baseline | baseline |
| Unsloth | UD-Q6_K_XL | 32.1 | 5.859151 ± 0.036117 | 0.005158 ± 0.000170 | 96.957 ± 0.045 % |
| Unsloth | UD-Q5_K_XL | 26.4 | 5.856634 ± 0.036095 | 0.006890 ± 0.000211 | 96.543 ± 0.047 % |
| Unsloth | Q5_K_M | 26.2 | 5.857466 ± 0.036094 | 0.006996 ± 0.000206 | 96.496 ± 0.048 % |
| AES | Q5_K_M | 26.3 | 5.854688 ± 0.036066 | 0.007233 ± 0.000248 | 96.499 ± 0.048 % |
| Unsloth | UD_Q6_K_S | 28.5 | 5.875750 ± 0.036290 | 0.007825 ± 0.000240 | 96.223 ± 0.049 % |
| Unsloth | Q5_K_S | 24.8 | 5.851724 ± 0.036036 | 0.008098 ± 0.000241 | 96.338 ± 0.049 % |
| AES | Q4_K_M | 22.2 | 5.900010 ± 0.036455 | 0.010871 ± 0.000245 | 95.731 ± 0.052 % |
| Unsloth | UD_Q4_K_XL | 22.2 | 5.885954 ± 0.036317 | 0.011072 ± 0.000276 | 95.783 ± 0.052 % |
| Unsloth | Q4_K_M | 22.0 | 5.892703 ± 0.036382 | 0.011525 ± 0.000270 | 95.590 ± 0.053 % |
| Mraderbache | Q5_K_M | 24.8 | 5.889589 ± 0.036408 | 0.012260 ± 0.000276 | 95.306 ± 0.055 % |
| me | Q4_K | 20.8 | 5.922957 ± 0.036659 | 0.013359 ± 0.000249 | 95.349 ± 0.055 % |
| Unsloth | Q4_K_S | 20.7 | 5.930525 ± 0.036716 | 0.013894 ± 0.000266 | 95.233 ± 0.055 % |
| AES | IQ4_XS | 17.6 | 5.984069 ± 0.037076 | 0.024742 ± 0.000364 | 93.753 ± 0.063 % |
| Unsloth | UD_IQ4_X_S | 17.5 | 5.979239 ± 0.037032 | 0.025096 ± 0.000336 | 93.543 ± 0.064 % |
| Unsloth | UD_IQ4_NL | 17.8 | 5.981326 ± 0.037050 | 0.025159 ± 0.000350 | 93.596 ± 0.064 % |
| me | nvfp4 | 20.8 | 5.863949 ± 0.035996 | 0.027935 ± 0.000391 | 93.281 ± 0.065 % |
| me | mxfp4 | 19.8 | 5.996372 ± 0.036990 | 0.054779 ± 0.000531 | 90.355 ± 0.077 % |
| Unsloth | UD-Q2_K_XL | 12.2 | 6.393534 ± 0.040151 | 0.091808 ± 0.000669 | 87.288 ± 0.086 % |
| Intel | AR q2 mixed | 12.5 | 6.733538 ± 0.043640 | 0.149627 ± 0.000910 | 84.540 ± 0.094 % |

llama-perplexity results

I did not use an imatrix because NVFP4 should not be affected by one.
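For readers unfamiliar with these metrics, here is a minimal sketch of what they measure (my reading of llama-perplexity's output, not its exact implementation): PPL is the exponential of the mean negative log-likelihood of the evaluated tokens, mean KLD is the average KL divergence of the quantized model's next-token distribution from the baseline's, and "Same top p" reports how often both models agree on the top token.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def perplexity(token_logprobs):
    # PPL = exp(mean negative log-likelihood).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def mean_kld(base_logits, quant_logits):
    # Average KL(base || quant) over positions.
    total = 0.0
    for bl, ql in zip(base_logits, quant_logits):
        p, q = softmax(bl), softmax(ql)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(base_logits)

def same_top1(base_logits, quant_logits):
    # Fraction of positions where both models pick the same top token.
    hits = sum(bl.index(max(bl)) == ql.index(max(ql))
               for bl, ql in zip(base_logits, quant_logits))
    return hits / len(base_logits)
```

A perfect quant would give KLD 0 and top-token agreement of 100 %; the table above shows how far each quant drifts from the f16 baseline.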

performance

On 2x 5060ti

CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

nvfp4 (update 20260403)

Performance improved slightly, making this usable.

| model | size | params | backend | ngl | fa | mmap | test | t/s |
|-------|------|--------|---------|-----|----|------|------|-----|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 999 | 1 | 0 | pp512 | 2185.12 ± 10.77 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 999 | 1 | 0 | tg128 | 95.05 ± 0.09 |

nvfp4 (update 20260420)

I tested Michael's branch here: https://github.com/ggml-org/llama.cpp/pull/21896 So NVFP4 is getting closer performance-wise.

| model | size | params | backend | ngl | test | t/s |
|-------|------|--------|---------|-----|------|-----|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | pp512 | 2900.56 ± 8.25 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | tg128 | 94.56 ± 0.04 |

mxfp4

| model | size | params | backend | ngl | fa | test | t/s |
|-------|------|--------|---------|-----|----|------|-----|
| qwen35moe 35B.A3B Q8_0 | 18.43 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 3018.48 ± 12.24 |
| qwen35moe 35B.A3B Q8_0 | 18.43 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 96.39 ± 0.05 |

q4_k

| model | size | params | backend | ngl | fa | test | t/s |
|-------|------|--------|---------|-----|----|------|-----|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 2494.68 ± 14.59 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 96.60 ± 0.08 |