How to use from the llama-cpp-python library
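# !pip install llama-cpp-python huggingface_hub

from llama_cpp import Llama

# Note: llama-cpp-python wraps mainline llama.cpp, which this quant has not
# been tested on (see the runtime note in this card).
# Download the GGUF from the Hub (cached after the first call) and load it.
llm = Llama.from_pretrained(
    repo_id="k0valik/Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF",
    filename="Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf",
)

# Run one chat turn and print only the assistant's reply text.
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)
print(response["choices"][0]["message"]["content"])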

Darwin-28B-Coder IQ4_XS (Mixed-Bit, 12.92 GiB)

Runtime note: This quant was built and tested with ikawrakow's ik_llama.cpp fork. It has not been tested on the mainline ggml/llama.cpp. For best results and full feature support (mixed-bit quant loading), use ik_llama.cpp.

Model: Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf (12.92 GiB, 13,240 MiB)
Base: FINAL-Bench/Darwin-28B-Coder (Q8_0, downloaded via gguf-my-repo)
Architecture: qwen35, a hybrid of SSM (Mamba2) and full-attention layers; 64 layers, 26.896 B params
License: Apache-2.0

A custom importance-matrix-guided mixed-bit quantization targeting GPUs with 16 GB of VRAM. It is built from FINAL-Bench's Darwin-28B-Coder, a code-specialized 28B-parameter model whose model card reports 100% on HumanEval and 72% on BigCodeBench (see the original model card).

🌐 Language note: The importance matrix was built from code, math, and function-calling data — nearly all English. The quant is optimized for English-language coding tasks as a result, though I've found other languages work fine in practice.


Why I made this

The FINAL-Bench team's model card claims HumanEval 100%, BigCodeBench 72%, and function-calling 90% — competitive with GPT-4o and Claude Sonnet at 28B parameters. Those scores were achieved with BF16 precision on full GPU setups. My goal was to see how much of that quality I could preserve while squeezing it into 16 GB VRAM with headroom for MTP speculative decoding and reasonable context.

The result: 4.126 BPW from a custom code-focused imatrix, 49 per-tensor regex rules, and essentially no perplexity loss versus a reference Q4_K_M — while being 2.48 GiB (16%) smaller.


Quantization Summary

| Metric | Value |
| --- | --- |
| Source | darwin-28b-coder-q8_0.gguf (27,260 MiB) |
| Quant size | 13,240 MiB (12.92 GiB) |
| BPW | 4.126 (average) |
| Compression (Q8_0 → quant) | 2.06× |
| Rules | 49 regex patterns across 7 precision levels |
| Importance matrix | Custom: 3,200 chunks × 512 ctx, code+tools+math corpus, shuffled for balanced sampling |
| Toolchain | ik_llama.cpp fork via Thireus GGUF-Tool-Suite |
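
The BPW and compression figures follow from the file sizes and the parameter count. A quick check, assuming the sizes quoted in this card are binary units and using the 26.896 B parameter count from the header (the helper name is made up for this example):

```python
# Reproduce the summary figures from the sizes quoted in this card.
GIB = 1024 ** 3

def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Average number of bits stored per model parameter."""
    return size_gib * GIB * 8 / n_params

n_params = 26.896e9             # parameter count from the model header
bpw = bits_per_weight(12.92, n_params)
compression = 27_260 / 13_240   # Q8_0 source size / quant size (same unit)

print(f"BPW: {bpw:.3f}")
print(f"Compression: {compression:.2f}x")
```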

Why I re-quantized from Q8_0

  • This quant starts from Q8_0, not F16. Q8_0 is near-lossless (8-bit block quantization with F16 scales): starting from Q8_0 instead of BF16 costs the final quant roughly 0.01-0.03 PPL. Virtually all GGUF re-quantizers work this way.
  • It also serves as a test against my earlier quants, which were made from full F16.
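
To give a feel for why Q8_0 is treated as near-lossless, here is a minimal pure-Python sketch of the Q8_0 idea: blocks of 32 weights share one F16 scale, and each weight is stored as a signed 8-bit integer. It runs on random stand-in weights, it is not the ggml implementation, and the function names are invented for this example.

```python
import math
import random
import struct

BLOCK = 32  # Q8_0 groups weights in blocks of 32

def to_f16(x: float) -> float:
    """Round a float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def q8_0_roundtrip(weights):
    """Quantize block-wise to int8 with one F16 scale, then dequantize."""
    out = []
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        scale = to_f16(max(abs(v) for v in block) / 127.0)
        if scale == 0.0:
            out.extend(0.0 for _ in block)
            continue
        for v in block:
            q = max(-127, min(127, round(v / scale)))  # signed 8-bit code
            out.append(q * scale)
    return out

random.seed(0)
w = [random.gauss(0.0, 0.02) for _ in range(4096)]  # stand-in weights
w_hat = q8_0_roundtrip(w)

rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_hat))
                    / sum(a * a for a in w))
bits_per_weight = (BLOCK * 8 + 16) / BLOCK  # 32 int8 codes + one F16 scale
print(f"relative error: {rel_err:.4f}, bits/weight: {bits_per_weight}")
```

The relative round-trip error on this toy data stays well below one percent, which is the intuition behind using Q8_0 as a re-quantization source.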

Tensor Distribution

| QType | Tensors | GiB | % of Model | Notes |
| --- | --- | --- | --- | --- |
| f32 | 353 | 0.01 | - | norms, biases, SSM params |
| q8_0 / q6_K / iq6_K | 99 | 0.04 | 0.2% | SSM alpha/beta, deep SSM |
| iq5_k | 37 | 0.35 | 2.0% | attention K/V, last-layer FFN |
| iq4_k | 102 | 2.79 | 19.8% | selected FFN/attention |
| iq4_ks | 40 | 1.27 | 9.5% | selected smaller FFN/attention |
| iq4_kt | 215 | 7.83 | 62.4% | bulk of the model, at ~4.0 bpw |
| iq3_k | 1 | 0.51 | 4.7% | token embeddings |
| iq3_kt | 4 | 0.13 | 1.3% | early-layer FFN gates |

Protected Layers

  • Full-attention layers (3, 7, 11, …, 63): Q/K norms → f32, K/V → iq5_k, Q → iq4_kt
  • Last layer (blk.63): FFN → iq5_k, V → q6_K, gate → iq4_k
  • First layer (blk.0): QKV → iq5_k, gate → iq5_k, SSM out → iq5_k
  • SSM small tensors: a/dt/conv1d/norm → f32, alpha/beta → q8_0/q6_K
  • Output norm → f32 · Token embeddings → iq3_k · Output projection → iq4_k

Early Layer Squeeze

Layers 0-2 have ffn_gate/ffn_up at iq3_kt (3.1 bpw) — the lowest precision in the model. The first 3 layers process raw embeddings; noise there compounds through all 64 layers. I made this trade-off deliberately to hit the 12.92 GiB target. The PPL results suggest it didn't hurt much, but if you run into quality issues on certain tasks, this is the first thing to hand-tweak.


Perplexity Benchmark

Measured with llama-perplexity (ik_llama.cpp fork, CUDA, RTX 5070 Ti, n_ctx=512, n_batch=512, 580 chunks, ~297K tokens).

| Model | Size | BPW | PPL (n_ctx=512) | Δ |
| --- | --- | --- | --- | --- |
| Q4_K_M (reference) | 15.40 GiB | 4.919 | 6.8072 ± 0.0437 | - |
| IQ4_XS (this quant) | 12.92 GiB | 4.126 | 6.8265 ± 0.0439 | +0.0193 |

The difference (0.0193) is well within the error margin — the two quants are statistically indistinguishable on this test. Same quality, 16% smaller.
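
As a rough sanity check of that statement, the difference can be compared with the combined error bar. The sketch below treats the two ± values as independent standard errors, which is a simplifying assumption: both runs score the same text, so their errors are correlated.

```python
import math

ref_ppl, ref_err = 6.8072, 0.0437       # Q4_K_M (reference)
quant_ppl, quant_err = 6.8265, 0.0439   # IQ4_XS (this quant)

delta = quant_ppl - ref_ppl
# Combine the two error bars in quadrature (independence assumed).
combined_err = math.hypot(ref_err, quant_err)
ratio = delta / combined_err

print(f"delta = {delta:+.4f}, combined error = {combined_err:.4f}, "
      f"ratio = {ratio:.2f} sigma")
```

Under that assumption the gap is a fraction of one combined standard error.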

Prompt Processing Speed

| Model | Tokens/s | ms/token |
| --- | --- | --- |
| Q4_K_M (15.40 GiB) | 1,170 | 0.85 |
| IQ4_XS (12.92 GiB) | 1,579 | 0.63 |

The smaller model fits better in VRAM, giving 35% faster prompt processing.


VRAM Usage (RTX 5070 Ti, 16 GB)

| Component | Q4_K_M | IQ4_XS (this) |
| --- | --- | --- |
| Model tensors (GPU) | 13,903 MiB | 12,709 MiB |
| CUDA_Host buffer | 1,867 MiB | 521 MiB |
| KV cache (512 ctx, f16) | 32 MiB | 32 MiB |
| Compute buffers | 505 MiB | 505 MiB |
| Total (reported) | 15,657 MiB | 13,278 MiB |
| Free VRAM remaining | ~640 MiB | ~2,880 MiB |

KV Cache Scaling

Only 16 of 64 layers need KV cache — this model uses hybrid Mamba2+Attention, where only every 4th layer is full attention. The rest are SSM (no KV cache needed). This gives ~75% savings vs a pure attention model.

| Cache type | Bytes/token | 32K context | 50K context |
| --- | --- | --- | --- |
| q8_0 | ~34 KB | ~1.0 GiB | ~1.5 GiB |
| q6_H | ~26 KB | ~0.8 GiB | ~1.1 GiB |
| q4_0 | ~18 KB | ~0.5 GiB | ~0.8 GiB |

Practical max context on 16 GB: ~46K (q8_0 KV) before attention workspace overhead starts competing for VRAM.
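
The per-token figures can be approximated with a few lines of arithmetic. This is a sketch under stated assumptions: the K/V width of 1,024 elements per attention layer is not given in this card and is inferred from the 32 MiB f16 cache at 512 tokens in the VRAM table, and the bytes-per-element values are the standard ggml block layouts for f16, q8_0, and q4_0.

```python
# Rough KV-cache size estimate for a hybrid model in which only every
# fourth layer is full attention. KV_WIDTH is an inferred assumption,
# not a documented value.
N_LAYERS = 64
ATTN_LAYERS = list(range(3, N_LAYERS, 4))  # layers 3, 7, 11, ..., 63
KV_WIDTH = 1024  # elements per K (and per V) per token per layer (assumed)

# f16 uses 2 bytes per element; q8_0 and q4_0 pack 32 elements plus one
# F16 scale into 34 and 18 bytes respectively.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_bytes_per_token(cache_type: str) -> float:
    """Bytes of K plus V cache needed for one token across attention layers."""
    return len(ATTN_LAYERS) * 2 * KV_WIDTH * BYTES_PER_ELEM[cache_type]

def kv_cache_gib(cache_type: str, n_ctx: int) -> float:
    """Total KV-cache size in GiB for a context of n_ctx tokens."""
    return kv_bytes_per_token(cache_type) * n_ctx / 1024 ** 3

for cache_type in ("f16", "q8_0", "q4_0"):
    per_token_kib = kv_bytes_per_token(cache_type) / 1024
    print(f"{cache_type}: {per_token_kib:.0f} KiB/token, "
          f"{kv_cache_gib(cache_type, 32768):.2f} GiB at 32K context")
```

The estimate covers the KV cache only; compute buffers and attention workspace come on top of it.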


Custom Importance Matrix

I generated my own imatrix rather than using a third-party one because I wanted a code-focused calibration that matches the model's coder orientation.

Calibration Corpus

| Dataset | Source | Prompts | Domain |
| --- | --- | --- | --- |
| Code | code_small.parquet | 25K | Code instruction |
| Math | math_medium.parquet | 50K | Math reasoning |
| Tools | tools_medium.parquet | 25K | Function calling |

All 100K prompts were decoded (regex-based, handling unicode escapes and unescaped quotes), concatenated, and randomly shuffled so that the 512-token chunks sample all domains evenly rather than in dataset order.

Domain distribution in the imatrix run:

  • Code: 16.8% · Math: 8.7% · Tools: 5.1% · Code continuations (from long prompts): 69.5%

The imatrix run took 7h 10min to process 1,638,400 tokens (3,200 chunks × 512 tokens), with a calibration PPL of 5.60 ± 0.015.
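
The shuffle-then-chunk preparation can be sketched as follows. This is an illustration, not the script used for this quant: whitespace splitting stands in for the model tokenizer, the toy corpus and the function name are hypothetical, and shuffling at the prompt level before concatenating is one reading of the description above.

```python
import random

def build_calibration_chunks(prompts, chunk_size=512, seed=0):
    """Shuffle prompts, concatenate their tokens, and cut fixed-size chunks.

    Whitespace splitting stands in for the real tokenizer (an assumption of
    this sketch). A trailing partial chunk is dropped.
    """
    shuffled = list(prompts)
    random.Random(seed).shuffle(shuffled)
    tokens = [tok for prompt in shuffled for tok in prompt.split()]
    n_chunks = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]

# Toy corpus with the same 25/50/25 domain ratio as the table above.
corpus = ["code " * 40] * 25 + ["math " * 40] * 50 + ["tools " * 40] * 25
chunks = build_calibration_chunks(corpus, chunk_size=512)

print(f"{len(chunks)} chunks of {len(chunks[0])} tokens")
print("total tokens in the real run:", 3_200 * 512)
```

Without the shuffle, the first chunks would contain only one domain; with it, each chunk draws from whichever prompts happen to land next to each other.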


Quantization Design

Recipe Generation

The recipe was generated by quant_assign.py from the Thireus GGUF-Tool-Suite, using KLD sensitivity data from the Qwen3.6-27B reference (same qwen35 architecture, identical tensor names and shapes).

Key flags:

--gpu-tensors-max-size 12.92 --tolerance 0.005
--use-greedy-quant-assign --harmonization-technique 3
--with-imatrix --quant-degradation-csv <group0/kld_results.csv>

How the bits are allocated

The greedy quant assigner distributes bits by KLD importance: tensors where quantization introduces more distribution drift get higher precision. The 49 regex rules map this allocation to llama-quantize patterns. Within each assigned type, the imatrix further optimizes quantization to minimize output error for the calibration domain (code + math + tools).

The result is dominated by iq4_kt (215 tensors, 62.4%) — a shape-dependent IQ4 variant with ~4.005 BPW that's slightly more compact than iq4_k but higher quality than iq4_xs. The remaining tensors are distributed across 6 other types based on per-tensor importance.
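
The general idea of sensitivity-driven allocation can be shown with a toy greedy loop. This is not the quant_assign.py implementation: the tensor sizes, KLD values, and budget are invented, the bits-per-weight figures are approximate, and the function name is hypothetical. Every tensor starts at the lowest type, and the upgrade with the best KLD reduction per extra bit is applied until nothing more fits in the budget.

```python
# Toy greedy bit allocation under a size budget. All numbers are invented.
BPW = {"iq3_kt": 3.1, "iq4_kt": 4.0, "iq5_k": 5.5}  # approximate bits/weight
LADDER = ["iq3_kt", "iq4_kt", "iq5_k"]              # lowest to highest precision

def greedy_assign(tensors, kld, budget_bits):
    """tensors: {name: n_params}; kld: {name: {qtype: measured KLD}}."""
    choice = {name: LADDER[0] for name in tensors}
    used = sum(n * BPW[LADDER[0]] for n in tensors.values())
    while True:
        best = None
        for name, n in tensors.items():
            idx = LADDER.index(choice[name])
            if idx + 1 == len(LADDER):
                continue  # already at the highest type
            nxt = LADDER[idx + 1]
            cost = n * (BPW[nxt] - BPW[choice[name]])        # extra bits
            gain = kld[name][choice[name]] - kld[name][nxt]  # KLD removed
            if used + cost <= budget_bits and (best is None or gain / cost > best[0]):
                best = (gain / cost, name, nxt, cost)
        if best is None:
            return choice  # no affordable upgrade left
        _, name, nxt, cost = best
        choice[name] = nxt
        used += cost

tensors = {"attn_k": 1_000, "ffn_up": 4_000, "ffn_gate": 4_000}
kld = {
    "attn_k": {"iq3_kt": 0.30, "iq4_kt": 0.10, "iq5_k": 0.02},
    "ffn_up": {"iq3_kt": 0.20, "iq4_kt": 0.08, "iq5_k": 0.05},
    "ffn_gate": {"iq3_kt": 0.05, "iq4_kt": 0.03, "iq5_k": 0.02},
}
assignment = greedy_assign(tensors, kld, budget_bits=9_000 * 4.0)
print(assignment)
```

In this toy run the small, sensitive tensor ends up at the highest type while the least sensitive one stays at the lowest, which mirrors the shape of the tensor distribution in this quant.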

Quantization Flags

| Flag | Purpose |
| --- | --- |
| --allow-requantize | Required because the source is Q8_0, not BF16 |
| --imatrix | Code-focused imatrix (1.6M tokens, shuffled) |
| --ignore-imatrix-rules | Use KLD-based qtype choices; the imatrix still optimizes within each type |
| --custom-q | 49 regex→qtype rules (3,243 chars) |
| Fallback q8_0 | Safe fallback for any unmatched tensors |
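
A regex-to-qtype rule list with a fallback can be illustrated like this. The five patterns below are hypothetical examples written to match the protected-layer choices described earlier in this card; they are not copied from the recipe file, and "first match wins" is an assumption of this sketch rather than documented --custom-q behavior.

```python
import re

# Hypothetical rules in the spirit of the recipe; the real file has 49.
RULES = [
    (r"^token_embd\.weight$", "iq3_k"),
    (r"^output_norm\.weight$", "f32"),
    (r"^output\.weight$", "iq4_k"),
    (r"^blk\.63\.ffn_(down|up)\.weight$", "iq5_k"),
    (r"^blk\.[0-2]\.ffn_(gate|up)\.weight$", "iq3_kt"),
]
FALLBACK = "q8_0"  # used for any tensor that no rule matches

def qtype_for(tensor_name: str) -> str:
    """Return the qtype of the first matching rule, else the fallback."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return FALLBACK

for name in ("token_embd.weight", "blk.1.ffn_gate.weight",
             "blk.63.ffn_down.weight", "blk.40.attn_q.weight"):
    print(f"{name} -> {qtype_for(name)}")
```

Falling back to q8_0 means a missed pattern costs some size but never quality, which is why it is a safe default.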

Avoid --fit: it causes a major performance regression (8 vs 28 t/s) on this architecture.


Usage

llama-server

llama-server \
  -m Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf \
  -c 32768 -ngl 99 -t 12 -ub 512 \
  -ctk q8_0 -ctv q8_0 -fa
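
Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on the default local address 127.0.0.1:8080; the helper names, prompt, and sampling values are illustrative.

```python
import json
import urllib.request

# Assumes llama-server is running locally on its default port (8080).
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request carrying a single user message."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(ask("Write a Python function that reverses a linked list."))
```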

Files in This Repo

| File | Description |
| --- | --- |
| Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf | The quantized model (12.92 GiB) |
| custom-darwin-28b-coder.imatrix.dat | Code-focused importance matrix (3,200 chunks) |
| quantize/Darwin-28B-Coder-12.92GiB.recipe.txt | Full quantization recipe (49 mixed-precision rules) |
| logs/ppl_darwin-28b-coder-custom-12.92GiB_cfull.log | Perplexity benchmark log (this quant) |
| logs/ppl_reference_q4km_cfull.log | Reference Q4_K_M perplexity log |

Acknowledgements

Thanks to:

  • FINAL-Bench — Original Darwin-28B-Coder model. A fantastic code-specialized 28B model that punches way above its weight class.
  • ikawrakow / ik_llama.cpp — Custom quantization scheme and MTP inference engine.
  • Thireus / GGUF-Tool-Suite — The quant_assign.py recipe generation pipeline and importance-aware bit allocation. The KLD-guided greedy optimizer is what makes mixed-IQ quantization practical.
  • eaddario — Parquet calibration datasets for the custom imatrix.
  • llama.cpp community — GGUF format, quantization infrastructure, and the broader ecosystem.
