Qwen3.5-27B-GGUF-4.165bpw

This is a 4.165 BPW quantized model for the GPU-poor with 16 GiB of VRAM. It works in both ik_llama.cpp and mainline llama.cpp.

It was quantized following the old wisdom from https://github.com/ggml-org/llama.cpp/issues/1256#issuecomment-1535758958, specifically:

Quantize first 1/4, then every 3rd layer with more bits

Following this strategy, the FFN tensors were quantized to either q4_K or q3_K.
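The layer-selection rule can be written down directly; a minimal sketch, assuming 64 transformer layers (the first quarter, layers 0-15, plus every 3rd layer after that get the higher-bit q4_K FFNs):

```python
# Reproduce the "first 1/4, then every 3rd layer" selection for a 64-layer model.
n_layers = 64
first_quarter = set(range(n_layers // 4))              # layers 0..15
every_third = set(range(n_layers // 4 + 2, n_layers, 3))  # layers 18, 21, ..., 63
high_bit = sorted(first_quarter | every_third)
print(high_bit)  # the layer list used in the recipe below
```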

PPL is very good, and the model also performs very well in actual Q&A and agentic tasks.

UPDATE: There are now two variants: the original one without an imatrix, and a new one using the imatrix from mradermacher. More llama-perplexity results have been added below.

Size

Size from llama-server output:

llm_load_print_meta: model size       = 13.040 GiB (4.165 BPW)
llm_load_print_meta: repeating layers = 11.708 GiB (4.130 BPW, 24.353 B parameters)
...
llm_load_tensors:  CUDA_Host buffer size =   682.03 MiB
llm_load_tensors:      CUDA0 buffer size = 12671.04 MiB
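As a sanity check, the reported size and BPW are consistent with a ~27B-parameter model; a quick back-of-the-envelope in Python:

```python
# bits-per-weight = total bits / parameter count, so parameters = total bits / BPW.
GIB = 1024**3
size_bits = 13.040 * GIB * 8       # model size from the llama-server log
params = size_bits / 4.165         # reported BPW
print(round(params / 1e9, 1))      # → 26.9, i.e. ~27B parameters
```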

Quality

Recipe
blk\..*\.attn_q\.weight=q4_K
blk\..*\.attn_k\.weight=q4_K
blk\..*\.attn_v\.weight=q4_K
blk\..*\.attn_output\.weight=q4_K
blk\..*\.attn_gate\.weight=q4_K
blk\..*\.attn_qkv\.weight=q4_K

blk\..*\.ssm_alpha\.weight=q4_K
blk\..*\.ssm_beta\.weight=q4_K
blk\..*\.ssm_out\.weight=q4_K

blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|18|21|24|27|30|33|36|39|42|45|48|51|54|57|60|63)\.ffn_(down|gate|up)\.weight=q4_K
blk\..*\.ffn_(down|gate|up)\.weight=q3_K

token_embd\.weight=q4_K
output\.weight=q4_K
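The regex=type lines above match the format accepted by ik_llama.cpp's llama-quantize via its --custom-q flag (mainline llama.cpp offers similar per-tensor overrides via --tensor-type). A hypothetical invocation with placeholder paths, showing only two of the recipe lines joined into one comma-separated argument:

```shell
# Hypothetical sketch (placeholder file names, truncated override list):
./llama-quantize \
  --custom-q 'blk\..*\.attn_q\.weight=q4_K,blk\..*\.ffn_(down|gate|up)\.weight=q3_K' \
  Qwen3.5-27B-F16.gguf Qwen3.5-27B-4.165bpw.gguf q3_K
```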

PPL result with wiki.test.raw (no imatrix):

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.8931 +/- 0.04448

The original variant was quantized without an imatrix because, counterintuitively, PPL came out worse with one.

PPL result with wiki.test.raw (with imatrix from mradermacher):

Final estimate: PPL over 580 chunks for n_ctx=512 = 6.9863 +/- 0.04539
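For reference, both PPL numbers above come from a standard llama-perplexity run; a sketch with a placeholder model path:

```shell
# Hypothetical sketch; -c 512 matches the n_ctx=512 reported above.
./llama-perplexity -m Qwen3.5-27B-4.165bpw.gguf -f wiki.test.raw -c 512
```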

It looks like PPL alone is not a good enough metric for this model. As such, further testing was done using the same methodology as https://www.reddit.com/r/LocalLLaMA/comments/1rk5qmr/qwen3527b_q4_quantization_comparison/. The quant did well enough to keep up, while being significantly smaller.

PPL/KLD/RMS result with wikitext2_test.txt (no imatrix):

Mean PPL(Q)                   :   6.501285 ±   0.042748
Mean PPL(base)                :   6.799430 ±   0.046581
Cor(ln(PPL(Q)), ln(PPL(base))):  95.92%
...
Mean    KLD:   0.135754 ±   0.002773
...
RMS Δp    :  8.422 ± 0.085 %
Same top p: 90.236 ± 0.077 %

PPL/KLD/RMS result with wikitext2_test.txt (with imatrix from mradermacher):

Mean PPL(Q)                   :   6.783163 ±   0.045910
Mean PPL(base)                :   6.799430 ±   0.046581
Cor(ln(PPL(Q)), ln(PPL(base))):  97.26%
...
Mean    KLD:   0.101915 ±   0.002372
...
RMS Δp    :  7.196 ± 0.081 %
Same top p: 91.563 ± 0.072 %
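The KLD, RMS Δp, and same-top-p numbers above follow llama-perplexity's two-pass KL-divergence workflow: save the base model's logits once, then compare each quant against them. A sketch with placeholder paths:

```shell
# 1) Save logits from the unquantized base model (done once):
./llama-perplexity -m Qwen3.5-27B-F16.gguf -f wikitext2_test.txt \
  --kl-divergence-base base_logits.bin
# 2) Compare the quant against the saved logits (prints PPL, KLD, RMS Δp, same-top-p):
./llama-perplexity -m Qwen3.5-27B-4.165bpw.gguf -f wikitext2_test.txt \
  --kl-divergence-base base_logits.bin --kl-divergence
```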

In general, llama-perplexity results are better with the imatrix, but there is a possibility that an imatrix causes an unexpected token to be chosen in actual tasks (see https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/3).

Base model: Qwen/Qwen3.5-27B