Qwen3.5-27B-GGUF-4.915bpw
This is a 4.915 BPW quantized model for the GPU poors with 24 GiB of VRAM. It uses the SOTA IQK quants and therefore works only in ik_llama.cpp.
In local testing with llama-perplexity, it shows the best quality among the quants compared in https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/, while being 1 GiB smaller than UD-Q4_K_XL.
There are two variants: one without an imatrix, and one using the imatrix from mradermacher.
With 24 GiB of VRAM, we can fit a context size of 131072 with F16 KV cache:
-c 131072 -ub 256
or a context size of 262144 with quantized KV cache:
-c 262144 -ctk q8_0 -ctv q6_0 -khad
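For context, the flags above slot into a full llama-server invocation roughly like this. This is only a sketch: the model filename, port, and -ngl value are placeholders for your setup; only the context/KV-cache flags are from this card.

```shell
# Long-context variant with quantized KV cache (flags from this card).
# Model path, port, and -ngl are placeholders, not part of the recipe.
./llama-server \
  -m Qwen3.5-27B-4.915bpw.gguf \
  -ngl 99 \
  --port 8080 \
  -c 262144 -ctk q8_0 -ctv q6_0 -khad
```

For the F16 KV-cache variant, swap the last line for `-c 131072 -ub 256` as listed above.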
Size
Size from llama-server output:
llm_load_print_meta: model size = 15.391 GiB (4.915 BPW)
llm_load_print_meta: repeating layers = 13.744 GiB (4.848 BPW, 24.353 B parameters)
...
llm_load_tensors: CUDA_Host buffer size = 682.03 MiB
llm_load_tensors: CUDA0 buffer size = 15077.86 MiB
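The BPW figures in the log are just size-in-bits divided by parameter count; a quick check against the reported numbers for the repeating layers:

```python
# Sanity-check the bits-per-weight (BPW) figure from the llama-server
# log above: BPW = (size in bits) / (parameter count).

GIB = 2**30  # GiB -> bytes


def bpw(size_gib: float, n_params: float) -> float:
    """Bits per weight for a tensor group of size_gib GiB and n_params weights."""
    return size_gib * GIB * 8 / n_params


# Repeating layers: 13.744 GiB over 24.353 B parameters (from the log)
print(round(bpw(13.744, 24.353e9), 3))  # -> 4.848, matching the log
```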
Quality
Recipe
blk\..*\.attn_q\.weight=iq6_k
blk\..*\.attn_k\.weight=iq6_k
blk\..*\.attn_v\.weight=iq6_k
blk\..*\.attn_output\.weight=iq6_k
blk\..*\.attn_gate\.weight=iq6_k
blk\..*\.attn_qkv\.weight=iq5_k
blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq6_k
blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
token_embd\.weight=iq4_k
output\.weight=iq6_k
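A recipe like this can be passed to ik_llama.cpp's llama-quantize as a comma-separated regex=type list via --custom-q. The sketch below is illustrative only: paths and the imatrix filename are placeholders, the list is truncated to two rules for brevity, and the exact flag syntax should be checked against your build.

```shell
# Illustrative sketch; input/output paths and imatrix file are placeholders.
# Append the remaining recipe lines to the --custom-q list, comma-separated.
./llama-quantize \
  --imatrix qwen3.5-27b.imatrix \
  --custom-q "blk\..*\.attn_q\.weight=iq6_k,blk\..*\.ffn_down\.weight=iq4_ks" \
  Qwen3.5-27B-BF16.gguf Qwen3.5-27B-IQ4_KS.gguf iq4_ks
```

Omit --imatrix to reproduce the no-imatrix variant.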
PPL/KLD/RMS result with wikitext2_test.txt (no imatrix):
Mean PPL(Q) : 6.965793 ± 0.048592
Mean PPL(base) : 6.799430 ± 0.046581
Cor(ln(PPL(Q)), ln(PPL(base))): 98.14%
...
Mean KLD: 0.064612 ± 0.001992
...
RMS Δp : 5.419 ± 0.085 %
Same top p: 94.219 ± 0.061 %
PPL/KLD/RMS result with wikitext2_test.txt (with imatrix from mradermacher):
Mean PPL(Q) : 6.749744 ± 0.046150
Mean PPL(base) : 6.799430 ± 0.046581
Cor(ln(PPL(Q)), ln(PPL(base))): 98.46%
...
Mean KLD: 0.054023 ± 0.001835
...
RMS Δp : 5.385 ± 0.090 %
Same top p: 94.866 ± 0.057 %
In general, llama-perplexity results are better with an imatrix, but there is a possibility that the imatrix causes an unexpected token to be chosen in actual tasks (see https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/3).
Model tree for sokann/Qwen3.5-27B-GGUF-4.915bpw
Base model
Qwen/Qwen3.5-27B