Qwen3.5-27B-GGUF-4.915bpw

This is a 4.915 BPW quantized model for GPU-poor users with 24 GiB of VRAM. It uses the state-of-the-art IQK quant types, and thus works only with ik_llama.cpp.

In local testing with llama-perplexity, it achieves the best quality among the quants compared in https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/, while being 1 GiB smaller than UD-Q4_K_XL.

There are two variants: one without an imatrix, and one using the imatrix from mradermacher.

With 24 GiB of VRAM, we can fit a context size of 131072 with F16 KV cache:

-c 131072 -ub 256

or a context size of 262144 with quantized KV cache:

-c 262144 -ctk q8_0 -ctv q6_0 -khad
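As a sketch, a full launch with the long-context settings above might look like the following. The model filename and the `-ngl 99` full GPU offload are assumptions for illustration, not taken from this card:

```shell
# Sketch: serve with the 262144-token context and quantized KV cache
# settings from above. Model path and -ngl value are assumed; adjust
# to your local file and hardware.
llama-server -m Qwen3.5-27B-GGUF-4.915bpw.gguf -ngl 99 \
    -c 262144 -ctk q8_0 -ctv q6_0 -khad
```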

Size

Size from llama-server output:

llm_load_print_meta: model size       = 15.391 GiB (4.915 BPW)
llm_load_print_meta: repeating layers = 13.744 GiB (4.848 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   682.03 MiB
llm_load_tensors:      CUDA0 buffer size = 15077.86 MiB
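The BPW figures in the log are just total bits divided by parameter count (with sizes in GiB = 2^30, as llama.cpp reports them). A quick sanity check against the repeating-layers line above:

```python
GIB = 2 ** 30  # llama.cpp reports sizes in GiB

def bpw(size_gib: float, n_params: float) -> float:
    """Bits per weight: total bits divided by parameter count."""
    return size_gib * GIB * 8 / n_params

# Repeating layers: 13.744 GiB over 24.353 B parameters
# (both numbers from the llama-server log above)
print(round(bpw(13.744, 24.353e9), 3))  # → 4.848
```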

Quality

Recipe
blk\..*\.attn_q\.weight=iq6_k
blk\..*\.attn_k\.weight=iq6_k
blk\..*\.attn_v\.weight=iq6_k
blk\..*\.attn_output\.weight=iq6_k
blk\..*\.attn_gate\.weight=iq6_k
blk\..*\.attn_qkv\.weight=iq5_k

blk\..*\.ssm_alpha\.weight=q8_0
blk\..*\.ssm_beta\.weight=q8_0
blk\..*\.ssm_out\.weight=iq6_k

blk\..*\.ffn_down\.weight=iq4_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k

PPL/KLD/RMS results with wikitext2_test.txt (no imatrix):

Mean PPL(Q)                   :   6.965793 ±   0.048592
Mean PPL(base)                :   6.799430 ±   0.046581
Cor(ln(PPL(Q)), ln(PPL(base))):  98.14%
...
Mean    KLD:   0.064612 ±   0.001992
...
RMS Δp    :  5.419 ± 0.085 %
Same top p: 94.219 ± 0.061 %

PPL/KLD/RMS results with wikitext2_test.txt (with imatrix from mradermacher):

Mean PPL(Q)                   :   6.749744 ±   0.046150
Mean PPL(base)                :   6.799430 ±   0.046581
Cor(ln(PPL(Q)), ln(PPL(base))):  98.46%
...
Mean    KLD:   0.054023 ±   0.001835
...
RMS Δp    :  5.385 ± 0.090 %
Same top p: 94.866 ± 0.057 %
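To put the two tables in perspective, the relative PPL change versus the base model works out as follows (numbers copied from the tables above; a quant landing slightly below the base PPL on this corpus is not unusual with an imatrix, though it does not by itself guarantee better behavior on other tasks):

```python
base = 6.799430          # Mean PPL(base)
no_imatrix = 6.965793    # Mean PPL(Q), no imatrix
with_imatrix = 6.749744  # Mean PPL(Q), mradermacher imatrix

def rel_change_pct(q: float, b: float) -> float:
    """Relative perplexity change versus the base model, in percent."""
    return (q / b - 1) * 100

print(round(rel_change_pct(no_imatrix, base), 2))    # → 2.45
print(round(rel_change_pct(with_imatrix, base), 2))  # → -0.73
```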

In general, llama-perplexity results are better with an imatrix, but the imatrix may occasionally cause an unexpected token to be chosen in real tasks (see https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/3).


Base model: Qwen/Qwen3.5-27B