Darwin-28B-Coder IQ4_XS (Mixed-Bit, 12.92 GiB)

Runtime note: This quant was built and tested with ikawrakow's ik_llama.cpp fork. It has not been tested on the mainline ggml/llama.cpp. For best results and full feature support (mixed-bit quant loading), use ik_llama.cpp.

Model: Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf (12.92 GiB, 13,240 MiB)
Base: FINAL-Bench/Darwin-28B-Coder (Q8_0, downloaded via gguf-my-repo)
Architecture: qwen35, hybrid SSM (Mamba2) + full attention, 64 layers, 26.896 B params
License: Apache-2.0

A custom importance-matrix-guided mixed-bit quantization targeting 16 GB VRAM GPUs. Built from the FINAL-Bench coder model — a code-specialized 28B-parameter model that scores 100% on HumanEval and 72% on BigCodeBench (see original model card).

🌐 Language note: The importance matrix was built from code, math, and function-calling data — nearly all English. The quant is optimized for English-language coding tasks as a result, though I've found other languages work fine in practice.

Why I made this

The FINAL-Bench team's model card claims HumanEval 100%, BigCodeBench 72%, and function-calling 90% — competitive with GPT-4o and Claude Sonnet at 28B parameters. Those scores were achieved with BF16 precision on full GPU setups. My goal was to see how much of that quality I could preserve while squeezing it into 16 GB VRAM with headroom for MTP speculative decoding and reasonable context.

The result: a 4.126 BPW quant built with a custom code-focused imatrix and 49 per-tensor regex rules, with essentially no perplexity loss versus a reference Q4_K_M while being 2.48 GiB (16%) smaller.

Quantization Summary

Metric	Value
Source	`darwin-28b-coder-q8_0.gguf` (27,260 MB)
Quant size	13,240 MB (12.92 GiB)
BPW	4.126 (average)
Compression (Q8_0 → quant)	2.06×
Rules	49 regex patterns across 7 precision levels
Importance matrix	Custom — 3,200 chunks × 512 ctx, code+tools+math corpus, shuffled for balanced sampling
Toolchain	ik_llama.cpp fork via Thireus GGUF-Tool-Suite
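
The BPW and compression figures in the table can be cross-checked from the file sizes and the parameter count in the model summary. This is a minimal sketch, assuming that GiB means binary gibibytes and that BPW is simply total file bits divided by 26.896 B parameters; the helper name `bits_per_weight` is mine and not part of any tool.

```python
GIB = 2**30
N_PARAMS = 26.896e9  # parameter count from the model summary


def bits_per_weight(size_gib: float, n_params: float = N_PARAMS) -> float:
    """Average bits per weight for a file of size_gib gibibytes."""
    return size_gib * GIB * 8 / n_params


quant_bpw = bits_per_weight(12.92)   # this quant
ratio = 27_260 / 13_240              # Q8_0 source size / quant size, both in MB
print(f"{quant_bpw:.3f} bpw, {ratio:.2f}x")  # prints "4.126 bpw, 2.06x"
```

The same helper applied to the 15.40 GiB reference Q4_K_M gives about 4.92 bpw, in line with the perplexity table.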

Why I re-quantized from Q8_0

This quant starts from Q8_0, not F16. Q8_0 is near-lossless (8-bit block quantization with F16 scales): the extra quality loss from Q8_0 → custom quant compared with BF16 → custom quant is ~0.01-0.03 PPL. Virtually all GGUF re-quantizers work this way.
It also doubles as a comparison against my earlier quants, which were made from full F16.
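
To illustrate why Q8_0 loses so little, here is a simplified sketch of the scheme: weights are grouped into blocks of 32, each block stores one F16 scale plus 32 signed 8-bit integers, which works out to 8.5 bits per weight. The function names are mine, the weights are random stand-ins, and the rounding details are simplified relative to the ggml reference code.

```python
import random
import struct

BLOCK = 32  # Q8_0 groups weights into blocks of 32


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(block):
    """Return (scale, int8 values) for one block of floats."""
    amax = max(abs(v) for v in block)
    scale = to_f16(amax / 127.0)
    if scale == 0.0:  # all-zero (or underflowing) block
        return 0.0, [0] * len(block)
    q = [max(-127, min(127, round(v / scale))) for v in block]
    return scale, q


def dequantize_q8_0(scale, q):
    return [scale * v for v in q]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK)]  # stand-in weights
scale, q = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, q)

max_err = max(abs(a - b) for a, b in zip(weights, restored))
bits_per_weight = (BLOCK * 8 + 16) / BLOCK  # 32 int8 values + one F16 scale
print(f"bits per weight: {bits_per_weight}")
print(f"max abs error: {max_err:.2e} (scale: {scale:.2e})")
```

The round-trip error per weight stays below one quantization step, which is small next to the error introduced by a ~4 bpw target type.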

Tensor Distribution

QType	Tensors	GiB	% of Model (by parameters)	Notes
f32	353	0.01	<0.1%	norms, biases, SSM params
q8_0 / q6_K / iq6_K	99	0.04	0.2%	SSM alpha/beta, deep SSM
iq5_k	37	0.35	2.0%	attention K/V, last-layer FFN
iq4_k	102	2.79	19.8%	selected FFN/attention
iq4_ks	40	1.27	9.5%	selected smaller FFN/attention
iq4_kt	215	7.83	62.4%	bulk of the model at ~4.0 bpw
iq3_k	1	0.51	4.7%	token embeddings
iq3_kt	4	0.13	1.3%	early-layer FFN gates

Protected Layers

Full-attention layers (3, 7, 11, …, 63): Q/K norms → f32, K/V → iq5_k, Q → iq4_kt
Last layer (blk.63): FFN → iq5_k, V → q6_K, gate → iq4_k
First layer (blk.0): QKV → iq5_k, gate → iq5_k, SSM out → iq5_k
SSM small tensors: a/dt/conv1d/norm → f32, alpha/beta → q8_0/q6_K
Output norm → f32 · Token embeddings → iq3_k · Output projection → iq4_k

Early Layer Squeeze

Layers 0-2 have ffn_gate/ffn_up at iq3_kt (3.1 bpw) — the lowest precision in the model. The first 3 layers process raw embeddings; noise there compounds through all 64 layers. I made this trade-off deliberately to hit the 12.92 GiB target. The PPL results suggest it didn't hurt much, but if you run into quality issues on certain tasks, this is the first thing to hand-tweak.

Perplexity Benchmark

Measured with llama-perplexity (ik_llama.cpp fork, CUDA, RTX 5070 Ti, n_ctx=512, n_batch=512, 580 chunks, ~297K tokens).

Model	Size	BPW	PPL (n_ctx=512)	Δ
Q4_K_M (reference)	15.40 GiB	4.919	6.8072 ± 0.0437	—
IQ4_XS (this quant)	12.92 GiB	4.126	6.8265 ± 0.0439	+0.0193

The difference (0.0193) is well within the error margin — the two quants are statistically indistinguishable on this test. Same quality, 16% smaller.
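
As a rough check of that statement, the PPL gap can be compared with the combined uncertainty of the two measurements. This sketch treats the two reported errors as independent standard errors, which is my assumption; both runs use the same text, so the errors are in fact correlated and this is only an approximate comparison.

```python
import math

ppl_ref, err_ref = 6.8072, 0.0437   # Q4_K_M (reference)
ppl_new, err_new = 6.8265, 0.0439   # IQ4_XS (this quant)

delta = ppl_new - ppl_ref
combined = math.hypot(err_ref, err_new)   # sqrt(err_ref**2 + err_new**2)
relative = delta / ppl_ref

print(f"delta = {delta:.4f}")
print(f"combined error = {combined:.4f}")
print(f"delta / combined error = {delta / combined:.2f}")
print(f"relative PPL increase = {relative:.2%}")
```

The gap comes out at roughly a third of the combined error, and under 0.3% of the reference perplexity.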

Prompt Processing Speed

Model	Tokens/s	ms/token
Q4_K_M (15.40 GiB)	1,170	0.85
IQ4_XS (12.92 GiB)	1,579	0.63

The smaller model fits better in VRAM, giving 35% faster prompt processing.

VRAM Usage (RTX 5070 Ti, 16 GB)

Component	Q4_K_M	IQ4_XS (this)
Model tensors (GPU)	13,903 MiB	12,709 MiB
CUDA_Host buffer	1,867 MiB	521 MiB
KV cache (512 ctx, f16)	32 MiB	32 MiB
Compute buffers	505 MiB	505 MiB
Total (reported)	15,657 MiB	13,278 MiB
Free VRAM remaining	~640 MiB	~2,880 MiB

KV Cache Scaling

Only 16 of 64 layers need KV cache — this model uses hybrid Mamba2+Attention, where only every 4th layer is full attention. The rest are SSM (no KV cache needed). This gives ~75% savings vs a pure attention model.

Cache type	Bytes/token	32K context	50K context
q8_0	~34 KB	~1.0 GiB	~1.5 GiB
q6_H	~26 KB	~0.8 GiB	~1.1 GiB
q4_0	~18 KB	~0.5 GiB	~0.8 GiB

Practical max context on 16 GB: ~46K (q8_0 KV) before attention workspace overhead starts competing with VRAM.
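
The scaling above can be reproduced with a few lines of arithmetic. This sketch takes the per-token sizes from the table as given and assumes that KB means 1,024 bytes and that 32K means 32,768 tokens; the helper name `kv_cache_gib` is mine.

```python
N_LAYERS = 64
attention_layers = list(range(3, N_LAYERS, 4))   # 3, 7, 11, ..., 63
kv_saving = 1 - len(attention_layers) / N_LAYERS  # share of layers with no KV cache

# Per-token KV cache size in KB, copied from the table above
KB_PER_TOKEN = {"q8_0": 34, "q6_H": 26, "q4_0": 18}


def kv_cache_gib(kb_per_token: float, n_ctx: int) -> float:
    """KV cache size in GiB for n_ctx tokens."""
    return kb_per_token * 1024 * n_ctx / 2**30


print(f"{len(attention_layers)} attention layers, {kv_saving:.0%} fewer KV layers")
for cache_type, kb in KB_PER_TOKEN.items():
    print(f"{cache_type}: {kv_cache_gib(kb, 32768):.4f} GiB at 32K context")
```

Under these assumptions the 32K figures come out at 1.0625, 0.8125, and 0.5625 GiB, close to the rounded values in the 32K column.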

Custom Importance Matrix

I generated my own imatrix for this quant instead of using a third-party one because I wanted code-focused calibration that matches the model's coding orientation.

Calibration Corpus

Dataset	Source	Prompts	Domain
Code	`code_small.parquet`	25K	Code instruction
Math	`math_medium.parquet`	50K	Math reasoning
Tools	`tools_medium.parquet`	25K	Function calling

All 100K prompts were decoded (regex-based, handling unicode escapes and unescaped quotes), concatenated, and randomly shuffled so that the 512-token chunks sample all domains uniformly.

Domain distribution in the imatrix run:

Code: 16.8% · Math: 8.7% · Tools: 5.1% · Code continuations (from long prompts): 69.5%

The imatrix run took 7h 10min to process 1,638,400 tokens (3,200 chunks × 512 tokens), with a calibration PPL of 5.60 ± 0.015.
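
The shuffle-and-chunk step can be pictured with the sketch below. It is not the script used for this quant: the function name `build_chunks` is hypothetical, whitespace splitting stands in for the real tokenizer, and shuffling at prompt level before concatenation is my assumption about the order of operations.

```python
import random


def build_chunks(prompts, chunk_len=512, seed=0, tokenize=str.split):
    """Shuffle prompts, concatenate their tokens, cut into full chunks."""
    rng = random.Random(seed)
    shuffled = list(prompts)
    rng.shuffle(shuffled)  # mix domains before concatenation
    tokens = [tok for prompt in shuffled for tok in tokenize(prompt)]
    n_full = len(tokens) // chunk_len  # drop the incomplete tail chunk
    return [tokens[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


# Tiny stand-in corpus: one prompt per domain
prompts = [
    "def add(a, b): return a + b",
    "Solve 2x + 3 = 11 for x",
    "Call get_weather with city set to Paris",
]
chunks = build_chunks(prompts, chunk_len=4)
print(len(chunks), "chunks of", len(chunks[0]), "tokens")

# Token count of the real run: 3,200 chunks of 512 tokens
print(3_200 * 512)
```

Long prompts span several chunks, which would explain why most chunks in the distribution above are classified as code continuations rather than prompt starts.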

Quantization Design

Recipe Generation

The recipe was generated by quant_assign.py from the Thireus GGUF-Tool-Suite, using KLD sensitivity data from the Qwen3.6-27B reference (same qwen35 architecture, identical tensor names and shapes).

Key flags:

--gpu-tensors-max-size 12.92 --tolerance 0.005
--use-greedy-quant-assign --harmonization-technique 3
--with-imatrix --quant-degradation-csv <group0/kld_results.csv>

How the bits are allocated

The greedy quant assigner distributes bits by KLD importance: tensors where quantization introduces more distribution drift get higher precision. The 49 regex rules map this allocation to llama-quantize patterns. Within each assigned type, the imatrix further optimizes quantization to minimize output error for the calibration domain (code + math + tools).

The result is dominated by iq4_kt (215 tensors, 62.4%) — a shape-dependent IQ4 variant with ~4.005 BPW that's slightly more compact than iq4_k but higher quality than iq4_xs. The remaining tensors are distributed across 6 other types based on per-tensor importance.
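
To make the idea of KLD-guided greedy allocation concrete, here is a toy sketch. It is not the algorithm in quant_assign.py, whose internals are not described here. The tensor names, parameter counts, KLD values, the 5.5 bpw figure for iq5_k, and the selection rule (largest KLD reduction per extra byte that still fits the budget) are all assumptions for illustration.

```python
def greedy_assign(tensors, budget_bytes):
    """tensors: name -> (n_params, [(qtype, bpw, kld), ...] sorted by rising bpw)."""
    level = {name: 0 for name in tensors}  # start every tensor at its cheapest type

    def size(name, idx):
        n_params, options = tensors[name]
        return n_params * options[idx][1] / 8  # bytes

    total = sum(size(name, 0) for name in tensors)
    while True:
        best = None
        for name, (_, options) in tensors.items():
            nxt = level[name] + 1
            if nxt >= len(options):
                continue  # already at the highest precision
            extra = size(name, nxt) - size(name, nxt - 1)
            gain = options[nxt - 1][2] - options[nxt][2]  # KLD reduction
            if gain <= 0 or total + extra > budget_bytes:
                continue
            if best is None or gain / extra > best[0]:
                best = (gain / extra, name, extra)
        if best is None:
            break  # no upgrade fits the budget any more
        _, name, extra = best
        level[name] += 1
        total += extra
    return {name: tensors[name][1][level[name]][0] for name in tensors}, total


# Hypothetical sensitivity data: a small, sensitive tensor and a large, robust one
toy = {
    "attn_k": (1_000_000, [("iq3_kt", 3.1, 0.050), ("iq4_kt", 4.0, 0.020), ("iq5_k", 5.5, 0.005)]),
    "ffn_up": (10_000_000, [("iq3_kt", 3.1, 0.010), ("iq4_kt", 4.0, 0.006), ("iq5_k", 5.5, 0.004)]),
}
assignment, used = greedy_assign(toy, budget_bytes=5_700_000)
print(assignment, round(used))
```

In this toy run the small, sensitive tensor is upgraded first and ends at the highest precision, while the large tensor stops at the ~4.0 bpw type because the next step would exceed the budget. That mirrors the overall shape of the recipe: a ~4.0 bpw bulk with a few protected tensors.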

Quantization Flags

Flag	Purpose
`--allow-requantize`	Required — source is Q8_0, not BF16
`--imatrix`	Code-focused imatrix (1.6M tokens, shuffled)
`--ignore-imatrix-rules`	Use KLD-based qtype choices; imatrix still optimizes within-type
`--custom-q`	49 regex→qtype rules (3,243 chars)
Fallback `q8_0`	Safe fallback for any unmatched tensors
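
The sketch below shows how a list of regex→qtype rules with a fallback can resolve tensor names. The patterns are hypothetical and only loosely mirror the Protected Layers list; they are not the 49 rules from the recipe file, and "first matching rule wins" is my assumption about rule precedence.

```python
import re

# Hypothetical rules for illustration only (the real recipe has 49)
RULES = [
    (r"^token_embd\.weight$", "iq3_k"),
    (r"^output\.weight$", "iq4_k"),
    (r"^blk\.63\.attn_v\.weight$", "q6_K"),
    (r"^blk\.63\.ffn_down\.weight$", "iq5_k"),
    (r"^blk\.[0-2]\.ffn_(gate|up)\.weight$", "iq3_kt"),
    (r"^blk\.\d+\.attn_(k|v)\.weight$", "iq5_k"),
    (r"^blk\.\d+\.ffn_\w+\.weight$", "iq4_kt"),
]
FALLBACK = "q8_0"  # used when no pattern matches


def assign_qtype(tensor_name: str) -> str:
    """Return the qtype of the first rule whose pattern matches."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return FALLBACK


for name in ["token_embd.weight", "blk.1.ffn_gate.weight", "blk.7.attn_k.weight",
             "blk.63.attn_v.weight", "blk.30.ffn_down.weight", "blk.20.ssm_out.weight"]:
    print(f"{name:26s} -> {assign_qtype(name)}")
```

Ordering specific patterns ahead of general ones lets a single layer (here blk.63) override the default for its tensor class.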

Avoid --fit: on this architecture it causes a major performance regression (8 t/s with it vs 28 t/s without).

Usage

llama-server

llama-server \
  -m Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf \
  -c 32768 -ngl 99 -t 12 -ub 512 \
  -ctk q8_0 -ctv q8_0 -fa
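
Once the server is running, it can be queried over HTTP. This is a minimal client sketch, assuming the server listens on the default 127.0.0.1:8080 and exposes the OpenAI-compatible /v1/chat/completions endpoint; the function names and the sampling values are mine.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumed default llama-server address


def build_request(prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a chat-completion POST request for a single user message."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,      # illustrative values, tune as needed
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = BASE_URL) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, base_url), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server from the command above running, `print(ask("Write a Python function that reverses a string."))` should print the model's reply.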

Files in This Repo

File	Description
`Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf`	The quantized model (12.92 GiB)
`custom-darwin-28b-coder.imatrix.dat`	Code-focused importance matrix (3,200 chunks)
`quantize/Darwin-28B-Coder-12.92GiB.recipe.txt`	Full quantization recipe (49 mixed-precision rules)
`logs/ppl_darwin-28b-coder-custom-12.92GiB_cfull.log`	Perplexity benchmark log (this quant)
`logs/ppl_reference_q4km_cfull.log`	Reference Q4_K_M perplexity log

Acknowledgements

Thanks to:

FINAL-Bench — Original Darwin-28B-Coder model. A fantastic code-specialized 28B model that punches way above its weight class.
ikawrakow / ik_llama.cpp — Custom quantization scheme and MTP inference engine.
Thireus / GGUF-Tool-Suite — The quant_assign.py recipe generation pipeline and importance-aware bit allocation. The KLD-guided greedy optimizer is what makes mixed-IQ quantization practical.
eaddario — Parquet calibration datasets for the custom imatrix.
llama.cpp community — GGUF format, quantization infrastructure, and the broader ecosystem.
