Qwen3.5-35B-A3B NVFP4 gguf
A Qwen3.5 quant with the experts quantized to NVFP4. It is still slow, but it shows better accuracy than MXFP4. This makes sense, because both Q4_K and NVFP4 use a super-block scale. Since it is a proprietary format from NVIDIA, I hope it will get better support and end up faster than INT4. Quantized from BF16, no imatrix used. Used llama.cpp version 3a14a542f5ce8666713c6e6ea44f7f3e01dd6e45 to quantize and to calculate the KLD.
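For intuition, here is a minimal Python sketch of the two 4-bit schemes. The E2M1 value grid and block sizes follow the public format descriptions; the E4M3 rounding helper is a crude stand-in that ignores exponent-range limits, and none of this is llama.cpp's actual kernel code:

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def round_to_fp4(x, scale):
    """Scale a block, then snap each value to the nearest signed E2M1 point."""
    y = x / scale
    idx = np.abs(np.abs(y)[:, None] - FP4[None, :]).argmin(axis=1)
    return np.sign(y) * FP4[idx] * scale

def e4m3_round(s):
    """Crude FP8 E4M3 rounding of a positive scale: keep 3 mantissa bits
    (ignores E4M3's exponent limits -- an assumption for illustration)."""
    e = np.floor(np.log2(s))
    return np.round(s / 2.0**e * 8) / 8 * 2.0**e

def mxfp4_block(x):
    """MXFP4: 32 elements, power-of-two (E8M0) block scale, no second level."""
    amax = np.abs(x).max()
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0)) if amax > 0 else 1.0
    return round_to_fp4(x, scale)

def nvfp4_block(x, tensor_scale):
    """NVFP4: 16 elements, FP8 (E4M3) block scale times an FP32 tensor scale."""
    amax = np.abs(x).max()
    block_scale = e4m3_round(amax / (6.0 * tensor_scale)) if amax > 0 else 1.0
    return round_to_fp4(x, block_scale * tensor_scale)
```

The second-level FP32 tensor scale, together with non-power-of-two FP8 block scales, is the structural similarity to Q4_K's two-level scaling that MXFP4 lacks.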
Quant config:
```
token_embd=Q8_0
attn_gate=Q8_0
attn_norm=F32
attn_qkv=Q8_0
ffn_down_exps=NVFP4
ffn_down_shexp=Q8_0
ffn_gate_exps=NVFP4
ffn_gate_inp=F32
ffn_gate_inp_shexp=Q8_0
ffn_gate_shexp=Q8_0
ffn_up_exps=NVFP4
ffn_up_shexp=Q8_0
post_attention_norm=F32
ssm_a=F32
ssm_alpha=Q8_0
ssm_beta=Q8_0
ssm_conv1d=F32
ssm_dt.bias=F32
ssm_norm=F32
ssm_out=Q8_0
attn_k=Q8_0
attn_k_norm=F32
attn_q=Q8_0
attn_q_norm=F32
attn_v=Q8_0
attn_output=Q8_0
output=Q8_0
output_norm=F32
```
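To double-check that the overrides landed, the per-tensor types can be read back from the resulting file with the gguf Python package that ships with llama.cpp (a small sketch; model.gguf is a placeholder path):

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("model.gguf")  # hypothetical path to the quantized file
for tensor in reader.tensors:
    # Print each tensor's name and its quantization type (Q8_0, F32, ...).
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```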
kld data
UPDATE 20260420: https://github.com/ggml-org/llama.cpp/discussions/22042 It looks like the per-tensor scale is not implemented yet. Hopefully it will be implemented soon, as we can expect a significant KLD improvement from it. Excited to see this!
Baseline f16 run: Final estimate: PPL = 6.8908 +/- 0.04680
| Provider | Quant | Size GB | Mean PPL | Mean KLD | Same Top p |
|---|---|---|---|---|---|
| Unsloth | f16 | | 6.8908 ± 0.04680 | baseline | baseline |
| Unsloth | UD-Q6_K_XL | 32.1 | 5.859151 ± 0.036117 | 0.005158 ± 0.000170 | 96.957 ± 0.045 % |
| Unsloth | UD-Q5_K_XL | 26.4 | 5.856634 ± 0.036095 | 0.006890 ± 0.000211 | 96.543 ± 0.047 % |
| Unsloth | Q5_K_M | 26.2 | 5.857466 ± 0.036094 | 0.006996 ± 0.000206 | 96.496 ± 0.048 % |
| AES | Q5_K_M | 26.3 | 5.854688 ± 0.036066 | 0.007233 ± 0.000248 | 96.499 ± 0.048 % |
| Unsloth | UD-Q6_K_S | 28.5 | 5.875750 ± 0.036290 | 0.007825 ± 0.000240 | 96.223 ± 0.049 % |
| Unsloth | Q5_K_S | 24.8 | 5.851724 ± 0.036036 | 0.008098 ± 0.000241 | 96.338 ± 0.049 % |
| AES | Q4_K_M | 22.2 | 5.900010 ± 0.036455 | 0.010871 ± 0.000245 | 95.731 ± 0.052 % |
| Unsloth | UD-Q4_K_XL | 22.2 | 5.885954 ± 0.036317 | 0.011072 ± 0.000276 | 95.783 ± 0.052 % |
| Unsloth | Q4_K_M | 22 | 5.892703 ± 0.036382 | 0.011525 ± 0.000270 | 95.590 ± 0.053 % |
| mradermacher | Q5_K_M | 24.8 | 5.889589 ± 0.036408 | 0.012260 ± 0.000276 | 95.306 ± 0.055 % |
| me | Q4_K | 20.8 | 5.922957 ± 0.036659 | 0.013359 ± 0.000249 | 95.349 ± 0.055 % |
| Unsloth | Q4_K_S | 20.7 | 5.930525 ± 0.036716 | 0.013894 ± 0.000266 | 95.233 ± 0.055 % |
| AES | IQ4_XS | 17.6 | 5.984069 ± 0.037076 | 0.024742 ± 0.000364 | 93.753 ± 0.063 % |
| Unsloth | UD-IQ4_XS | 17.5 | 5.979239 ± 0.037032 | 0.025096 ± 0.000336 | 93.543 ± 0.064 % |
| Unsloth | UD-IQ4_NL | 17.8 | 5.981326 ± 0.037050 | 0.025159 ± 0.000350 | 93.596 ± 0.064 % |
| me | nvfp4 | 20.8 | 5.863949 ± 0.035996 | 0.027935 ± 0.000391 | 93.281 ± 0.065 % |
| me | mxfp4 | 19.8 | 5.996372 ± 0.036990 | 0.054779 ± 0.000531 | 90.355 ± 0.077 % |
| Unsloth | UD-Q2_K_XL | 12.2 | 6.393534 ± 0.040151 | 0.091808 ± 0.000669 | 87.288 ± 0.086 % |
| Intel AR | q2 mixed | 12.5 | 6.733538 ± 0.043640 | 0.149627 ± 0.000910 | 84.540 ± 0.094 % |
llama-perplexity results
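The Mean KLD and Same Top p columns measure how closely the quantized model's token distribution tracks the full-precision one. A minimal sketch of the two statistics, given full-precision and quantized logits (an illustration of the definitions, not llama.cpp's implementation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kld_stats(base_logits, quant_logits):
    """Mean KL(base || quant) per token, plus the fraction of positions
    where both models agree on the most likely next token.

    base_logits, quant_logits: arrays of shape (n_tokens, vocab_size)."""
    p, q = softmax(base_logits), softmax(quant_logits)
    kld = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1).mean()
    same_top = (base_logits.argmax(-1) == quant_logits.argmax(-1)).mean()
    return kld, same_top
```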
I did not use an imatrix because NVFP4 should not be affected by one: its scales come straight from each block's absolute maximum rather than from an importance-weighted search.
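The contrast with k-quants, where the imatrix does matter, is easy to show in a sketch (hypothetical helper functions, not llama.cpp's actual scale search):

```python
import numpy as np

def absmax_scale(block, qmax=6.0):
    """MXFP4/NVFP4 style: the scale depends only on the block maximum,
    so per-weight importance values cannot change the result."""
    return np.abs(block).max() / qmax

def weighted_scale(block, importance, candidates, qmax=7):
    """k-quant style: pick the scale minimizing importance-weighted error.
    Here the imatrix (importance) directly steers the chosen scale."""
    def err(s):
        q = np.clip(np.round(block / s), -qmax, qmax) * s
        return (importance * (block - q) ** 2).sum()
    return min(candidates, key=err)
```

In the second function, changing the importance weights changes which scale wins; in the first, they never enter the computation.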
performance
On 2x RTX 5060 Ti. (llama-bench shows the same Q8_0 model label for all of the files below; they differ in the expert quantization, as the sizes show.)
```
CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
```
nvfp4 (update 20260403)
Performance improved slightly, making this usable.
| model | size | params | backend | ngl | fa | mmap | test | t/s |
|---|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 999 | 1 | 0 | pp512 | 2185.12 ± 10.77 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 999 | 1 | 0 | tg128 | 95.05 ± 0.09 |
nvfp4 (update 20260420)
I tested Michael's branch here: https://github.com/ggml-org/llama.cpp/pull/21896. With it, NVFP4 prompt processing is getting close to MXFP4's.
| model | size | params | backend | ngl | test | t/s |
|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | pp512 | 2900.56 ± 8.25 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | tg128 | 94.56 ± 0.04 |
mxfp4
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 18.43 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 3018.48 ± 12.24 |
| qwen35moe 35B.A3B Q8_0 | 18.43 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 96.39 ± 0.05 |
q4_k
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | 1 | pp512 | 2494.68 ± 14.59 |
| qwen35moe 35B.A3B Q8_0 | 19.36 GiB | 34.66 B | CUDA | 99 | 1 | tg128 | 96.60 ± 0.08 |
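Putting the three pp512 numbers side by side (values copied from the tables above; the nvfp4 figure is the 20260420 one from the PR 21896 branch):

```python
pp512 = {"nvfp4 (PR 21896)": 2900.56, "mxfp4": 3018.48, "q4_k": 2494.68}
for name, tps in sorted(pp512.items(), key=lambda kv: -kv[1]):
    # Report each quant's prompt-processing speed relative to mxfp4.
    print(f"{name}: {tps:7.2f} t/s ({tps / pp512['mxfp4']:.1%} of mxfp4)")
```

So NVFP4 now trails MXFP4 by only a few percent on prompt processing, while token generation is essentially tied (roughly 95-97 t/s) across all three.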