Darwin-28B-Coder IQ4_XS (Mixed-Bit, 12.92 GiB)
Runtime note: This quant was built and tested with ikawrakow's ik_llama.cpp fork. It has not been tested on mainline ggml/llama.cpp. For best results and full feature support (mixed-bit quant loading), use ik_llama.cpp.
Model: Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf (12.92 GiB, 13,240 MiB)
Base: FINAL-Bench/Darwin-28B-Coder — Q8_0, downloaded via gguf-my-repo
Architecture: qwen35 — hybrid SSM (Mamba2) + full attention, 64 layers, 26.896 B params
License: Apache-2.0
A custom importance-matrix-guided mixed-bit quantization targeting 16 GB VRAM GPUs. Built from the FINAL-Bench coder model — a code-specialized 28B-parameter model that scores 100% on HumanEval and 72% on BigCodeBench (see original model card).
🌐 Language note: The importance matrix was built from code, math, and function-calling data — nearly all English. The quant is optimized for English-language coding tasks as a result, though I've found other languages work fine in practice.
Why I made this
The FINAL-Bench team's model card claims HumanEval 100%, BigCodeBench 72%, and function-calling 90% — competitive with GPT-4o and Claude Sonnet at 28B parameters. Those scores were achieved with BF16 precision on full GPU setups. My goal was to see how much of that quality I could preserve while squeezing it into 16 GB VRAM with headroom for MTP speculative decoding and reasonable context.
The result: 4.126 BPW from a custom code-focused imatrix, 49 per-tensor regex rules, and essentially no perplexity loss versus a reference Q4_K_M — while being 2.48 GiB (16%) smaller.
Quantization Summary
| Metric | Value |
|---|---|
| Source | darwin-28b-coder-q8_0.gguf (27,260 MiB) |
| Quant size | 13,240 MiB (12.92 GiB) |
| BPW | 4.126 (average) |
| Compression (Q8_0 → quant) | 2.06× |
| Rules | 49 regex patterns across 7 precision levels |
| Importance matrix | Custom — 3,200 chunks × 512 ctx, code+tools+math corpus, shuffled for balanced sampling |
| Toolchain | ik_llama.cpp fork via Thireus GGUF-Tool-Suite |
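As a rough sanity check on the summary table, the average BPW and the compression ratio can be recomputed from the file sizes and the 26.896 B parameter count quoted above. This is a minimal sketch: the helper name `bits_per_weight` is hypothetical, and it assumes that GiB means 2^30 bytes and that both file sizes in the table share the same unit.

```python
GIB = 1024 ** 3  # assumption: sizes are binary gibibytes


def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return size_gib * GIB * 8 / n_params


N_PARAMS = 26.896e9  # parameter count from the model card header

quant_bpw = bits_per_weight(12.92, N_PARAMS)
# Source and quant sizes from the table; the ratio is unit-independent
# as long as both use the same unit.
compression = 27_260 / 13_240

print(f"{quant_bpw:.3f} bpw, {compression:.2f}x compression")
```

The computed values land on the 4.126 BPW and 2.06× figures in the table, which suggests the BPW is averaged over all tensors, including the f32 ones.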
Why I re-quantized from Q8_0
- This quant starts from Q8_0, not F16. Q8_0 is near-lossless (8-bit block quantization with F16 scales) — the quality loss from Q8_0 → custom quant vs BF16 → custom quant is ~0.01-0.03 PPL. Virtually all GGUF re-quantizers work this way.
- It also serves as a comparison against my earlier quants, which were made from full F16 sources.
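To illustrate why Q8_0 is treated as near-lossless, here is a simplified sketch of the scheme described above: blocks of 32 weights, one F16 scale per block, and 8-bit signed integers. It is not the ggml implementation; the function names are hypothetical and the rounding details are an assumption.

```python
import math
import struct

BLOCK_SIZE = 32  # Q8_0 groups weights into blocks of 32


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(block):
    """Return (scale, int8 values) for one block of weights."""
    amax = max(abs(x) for x in block)
    d = to_f16(amax / 127.0)  # per-block scale, stored as F16
    if d == 0.0:
        return 0.0, [0] * len(block)
    q = [max(-127, min(127, round(x / d))) for x in block]
    return d, q


def dequantize_q8_0(d, q):
    """Reconstruct approximate weights from the scale and int8 values."""
    return [d * qi for qi in q]


def q8_0_bits_per_weight() -> float:
    """32 int8 values plus one 16-bit scale, averaged per weight."""
    return (BLOCK_SIZE * 8 + 16) / BLOCK_SIZE


weights = [math.sin(i) for i in range(BLOCK_SIZE)]  # toy data
d, q = quantize_q8_0(weights)
restored = dequantize_q8_0(d, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q8_0_bits_per_weight(), max_err, d)
```

In this toy block the worst-case reconstruction error stays around half a quantization step, which is small compared with the error introduced by the ~4-bit types applied afterwards.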
Tensor Distribution
| QType | Tensors | GiB | % of Model |
|---|---|---|---|
| f32 | 353 | 0.01 | — (norms, biases, SSM params) |
| q8_0 / q6_K / iq6_K | 99 | 0.04 | 0.2% (SSM alpha/beta, deep SSM) |
| iq5_k | 37 | 0.35 | 2.0% (attention K/V, last-layer FFN) |
| iq4_k | 102 | 2.79 | 19.8% (selected FFN/attention) |
| iq4_ks | 40 | 1.27 | 9.5% (selected smaller FFN/attention) |
| iq4_kt | 215 | 7.83 | 62.4% ← bulk at ~4.0 bpw |
| iq3_k | 1 | 0.51 | 4.7% (token embeddings) |
| iq3_kt | 4 | 0.13 | 1.3% (early-layer FFN gates) |
Protected Layers
- Full-attention layers (3, 7, 11, …, 63): Q/K norms → f32, K/V → iq5_k, Q → iq4_kt
- Last layer (blk.63): FFN → iq5_k, V → q6_K, gate → iq4_k
- First layer (blk.0): QKV → iq5_k, gate → iq5_k, SSM out → iq5_k
- SSM small tensors: a/dt/conv1d/norm → f32, alpha/beta → q8_0/q6_K
- Output norm → f32 · Token embeddings → iq3_k · Output projection → iq4_k
Early Layer Squeeze
Layers 0-2 have ffn_gate/ffn_up at iq3_kt (3.1 bpw) — the lowest precision in the model. The first 3 layers process raw embeddings; noise there compounds through all 64 layers. I made this trade-off deliberately to hit the 12.92 GiB target. The PPL results suggest it didn't hurt much, but if you run into quality issues on certain tasks, this is the first thing to hand-tweak.
Perplexity Benchmark
Measured with llama-perplexity (ik_llama.cpp fork, CUDA, RTX 5070 Ti, n_ctx=512, n_batch=512, 580 chunks, ~297K tokens).
| Model | Size | BPW | PPL (n_ctx=512) | Δ |
|---|---|---|---|---|
| Q4_K_M (reference) | 15.40 GiB | 4.919 | 6.8072 ± 0.0437 | — |
| IQ4_XS (this quant) | 12.92 GiB | 4.126 | 6.8265 ± 0.0439 | +0.0193 |
The difference (0.0193) is well within the error margin — the two quants are statistically indistinguishable on this test. Same quality, 16% smaller.
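The comparison above can be reproduced with a few lines of arithmetic. This sketch treats the two reported ± values as independent standard errors and combines them in quadrature, which is an assumption: both runs score the same text, so their errors are likely correlated and this is only a rough check. The helper name is hypothetical.

```python
import math


def delta_in_sigmas(ppl_a, err_a, ppl_b, err_b):
    """Return (delta, combined error, delta / combined error)."""
    delta = ppl_b - ppl_a
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return delta, combined, delta / combined


# Values from the perplexity table: Q4_K_M reference vs this quant.
delta, combined, z = delta_in_sigmas(6.8072, 0.0437, 6.8265, 0.0439)
print(f"delta={delta:.4f}, combined error={combined:.4f}, ratio={z:.2f}")
```

Under these assumptions the gap is a fraction of the combined error, consistent with the claim that the test cannot separate the two quants.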
Prompt Processing Speed
| Model | Tokens/s | ms/token |
|---|---|---|
| Q4_K_M (15.40 GiB) | 1,170 | 0.85 |
| IQ4_XS (12.92 GiB) | 1,579 | 0.63 |
The smaller model fits better in VRAM, giving 35% faster prompt processing.
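The ms/token column and the 35% figure follow directly from the tokens/s measurements. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert throughput into latency per token in milliseconds."""
    return 1000.0 / tokens_per_second


q4km_tps = 1170   # Q4_K_M reference, from the table
iq4xs_tps = 1579  # this quant, from the table

speedup = iq4xs_tps / q4km_tps - 1
print(f"{ms_per_token(q4km_tps):.2f} ms, {ms_per_token(iq4xs_tps):.2f} ms, "
      f"{speedup:.0%} faster")
```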
VRAM Usage (RTX 5070 Ti, 16 GB)
| Component | Q4_K_M | IQ4_XS (this) |
|---|---|---|
| Model tensors (GPU) | 13,903 MiB | 12,709 MiB |
| CUDA_Host buffer | 1,867 MiB | 521 MiB |
| KV cache (512 ctx, f16) | 32 MiB | 32 MiB |
| Compute buffers | 505 MiB | 505 MiB |
| Total (reported) | 15,657 MiB | 13,278 MiB |
| Free VRAM remaining | ~640 MiB | ~2,880 MiB |
KV Cache Scaling
Only 16 of 64 layers need KV cache — this model uses hybrid Mamba2+Attention, where only every 4th layer is full attention. The rest are SSM (no KV cache needed). This gives ~75% savings vs a pure attention model.
| Cache type | Bytes/token | 32K context | 50K context |
|---|---|---|---|
| q8_0 | ~34 KB | ~1.0 GiB | ~1.5 GiB |
| q6_H | ~26 KB | ~0.8 GiB | ~1.1 GiB |
| q4_0 | ~18 KB | ~0.5 GiB | ~0.8 GiB |
Practical max context on 16 GB: ~46K (q8_0 KV) before attention workspace overhead starts competing with VRAM.
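The layer count and the cache sizes in the table can be approximated as follows. This is a sketch under stated assumptions: full-attention layers sit at indices 3, 7, …, 63 as listed under Protected Layers, "KB" means 1024 bytes, and "32K" means 32,768 tokens. The bytes/token values are taken from the table as given, not derived from the model's head dimensions.

```python
N_LAYERS = 64

# Every 4th layer is full attention: 3, 7, ..., 63.
attention_layers = list(range(3, N_LAYERS, 4))
kv_savings = 1 - len(attention_layers) / N_LAYERS


def kv_cache_gib(kb_per_token: float, n_ctx: int) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    return kb_per_token * 1024 * n_ctx / 1024 ** 3


print(len(attention_layers), f"{kv_savings:.0%}")
for cache_type, kb in [("q8_0", 34), ("q4_0", 18)]:
    print(cache_type, f"{kv_cache_gib(kb, 32_768):.2f} GiB at 32K")
```

The results are within rounding of the table's approximate values; the exact figures depend on how "KB" and "K" are defined.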
Custom Importance Matrix
I generated my own imatrix instead of using third-party ones for this one because I wanted a code-focused calibration, matching the coder-orientation of the model.
Calibration Corpus
| Dataset | Source | Prompts | Domain |
|---|---|---|---|
| Code | code_small.parquet | 25K | Code instruction |
| Math | math_medium.parquet | 50K | Math reasoning |
| Tools | tools_medium.parquet | 25K | Function calling |
All 100K prompts were decoded (regex-based, handling unicode escapes and unescaped quotes), concatenated, and randomly shuffled so that every 512-token chunk samples the domains evenly.
Domain distribution in the imatrix run:
- Code: 16.8% · Math: 8.7% · Tools: 5.1% · Code continuations (from long prompts): 69.5%
The imatrix run took 7h 10min processing 1,638,400 tokens (3,200 × 512 chunks), with a calibration PPL of 5.60 ± 0.015.
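The shuffle-then-chunk idea can be sketched as below. This is an illustration only, not the script used for this quant: the function names, the toy prompts, and the fixed seed are all hypothetical. The chunk count and throughput use the figures quoted above.

```python
import random


def shuffled_corpus(prompts_by_domain, seed=0):
    """Pool prompts from every domain and shuffle them together."""
    pool = [(domain, prompt)
            for domain, prompts in prompts_by_domain.items()
            for prompt in prompts]
    random.Random(seed).shuffle(pool)  # fixed seed keeps runs repeatable
    return pool


def n_chunks(n_tokens: int, n_ctx: int = 512) -> int:
    """Number of full n_ctx-token chunks in a token stream."""
    return n_tokens // n_ctx


# Toy stand-ins for the three parquet datasets.
toy = {
    "code": ["def f(): pass"] * 2,
    "math": ["2 + 2 = 4"] * 4,
    "tools": ['{"name": "get_weather"}'] * 2,
}
pool = shuffled_corpus(toy)
calibration_text = "\n".join(prompt for _, prompt in pool)

run_seconds = 7 * 3600 + 10 * 60  # 7h 10min
tokens_per_second = 1_638_400 / run_seconds
print(n_chunks(1_638_400), f"{tokens_per_second:.1f} tok/s")
```

Shuffling before concatenation means that no long run of a single domain dominates a stretch of consecutive chunks.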
Quantization Design
Recipe Generation
The recipe was generated by quant_assign.py from the Thireus GGUF-Tool-Suite, using KLD sensitivity data from the Qwen3.6-27B reference (same qwen35 architecture, identical tensor names and shapes).
Key flags:
--gpu-tensors-max-size 12.92 --tolerance 0.005
--use-greedy-quant-assign --harmonization-technique 3
--with-imatrix --quant-degradation-csv <group0/kld_results.csv>
How the bits are allocated
The greedy quant assigner distributes bits by KLD importance: tensors where quantization introduces more distribution drift get higher precision. The 49 regex rules map this allocation to llama-quantize patterns. Within each assigned type, the imatrix further optimizes quantization to minimize output error for the calibration domain (code + math + tools).
The result is dominated by iq4_kt (215 tensors, 62.4%) — a shape-dependent IQ4 variant with ~4.005 BPW that's slightly more compact than iq4_k but higher quality than iq4_xs. The remaining tensors are distributed across 6 other types based on per-tensor importance.
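To make the idea of KLD-guided greedy allocation concrete, here is a toy allocator. It is not the quant_assign.py algorithm: the function, the tensor names, the KLD numbers, and the bpw values for each type are illustrative assumptions. It starts every tensor at the lowest precision and repeatedly upgrades whichever tensor buys the largest KLD reduction per extra bit, until the size budget is exhausted.

```python
def greedy_assign(tensors, levels, budget_bits):
    """
    tensors: {name: (n_params, {qtype: kld_at_that_qtype})}
    levels:  [(qtype, bpw), ...] sorted from lowest to highest bpw
    Returns ({name: qtype}, bits_used).
    """
    choice = {name: 0 for name in tensors}  # index into levels
    used = sum(n * levels[0][1] for n, _ in tensors.values())
    while True:
        best = None
        for name, (n, kld) in tensors.items():
            i = choice[name]
            if i + 1 >= len(levels):
                continue  # already at the highest precision
            (cur_q, cur_bpw), (nxt_q, nxt_bpw) = levels[i], levels[i + 1]
            extra = n * (nxt_bpw - cur_bpw)
            if used + extra > budget_bits:
                continue  # upgrade would exceed the budget
            gain = (kld[cur_q] - kld[nxt_q]) / extra  # KLD drop per bit
            if best is None or gain > best[0]:
                best = (gain, name, extra)
        if best is None:
            break  # no affordable upgrade left
        _, name, extra = best
        choice[name] += 1
        used += extra
    return {name: levels[i][0] for name, i in choice.items()}, used


# Hypothetical sensitivity data for two tensors.
levels = [("iq3_kt", 3.1), ("iq4_kt", 4.0), ("iq5_k", 5.5)]
tensors = {
    "blk.0.ffn_gate.weight": (1000, {"iq3_kt": 0.05, "iq4_kt": 0.02, "iq5_k": 0.01}),
    "blk.3.attn_k.weight": (100, {"iq3_kt": 0.08, "iq4_kt": 0.03, "iq5_k": 0.005}),
}
assignment, used = greedy_assign(tensors, levels, budget_bits=4500)
print(assignment, used)
```

In this toy run the small, sensitive attention tensor is upgraded to the highest precision while the large FFN tensor stays at the lowest one, mirroring the pattern in the tensor distribution above where attention K/V gets iq5_k and early FFN gates get iq3_kt.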
Quantization Flags
| Flag | Purpose |
|---|---|
| --allow-requantize | Required: source is Q8_0, not BF16 |
| --imatrix | Code-focused imatrix (1.6M tokens, shuffled) |
| --ignore-imatrix-rules | Use KLD-based qtype choices; imatrix still optimizes within-type |
| --custom-q | 49 regex→qtype rules (3,243 chars) |
| Fallback q8_0 | Safe fallback for any unmatched tensors |
Avoid --fit — it causes major performance regression (8 vs 28 t/s) on this architecture.
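The way regex rules with a fallback resolve to a type per tensor can be sketched as follows. The three patterns are hypothetical examples modeled on the Protected Layers and Early Layer Squeeze sections; they are not the 49 rules from the recipe file. First-match-wins and full-string matching are also assumptions about how the rules are applied.

```python
import re

# Hypothetical rules: (regex over the tensor name, quant type).
RULES = [
    (r"blk\.63\.ffn_(down|up)\.weight", "iq5_k"),
    (r"token_embd\.weight", "iq3_k"),
    (r"blk\.[0-2]\.ffn_(gate|up)\.weight", "iq3_kt"),
]


def resolve_qtype(tensor_name, rules=RULES, fallback="q8_0"):
    """Return the type of the first matching rule, else the fallback."""
    for pattern, qtype in rules:
        if re.fullmatch(pattern, tensor_name):
            return qtype
    return fallback


for name in ["token_embd.weight", "blk.1.ffn_gate.weight",
             "blk.63.ffn_down.weight", "blk.10.ssm_out.weight"]:
    print(name, "->", resolve_qtype(name))
```

A high-precision fallback such as q8_0 means that a tensor missed by every rule costs some size rather than quality.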
Usage
llama-server
llama-server \
-m Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf \
-c 32768 -ngl 99 -t 12 -ub 512 \
-ctk q8_0 -ctv q8_0 -fa
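Once the server is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on http://localhost:8080, since the command above sets no port; adjust `base_url` if yours differs. The function names and the `max_tokens` value are hypothetical choices.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1"):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,  # illustrative limit
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(ask("Write a Python function that reverses a string."))` should print the model's reply.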
Files in This Repo
| File | Description |
|---|---|
| Darwin-28B-Coder-IQ4_XS12.9GiB-GGUF.gguf | The quantized model (12.92 GiB) |
| custom-darwin-28b-coder.imatrix.dat | Code-focused importance matrix (3,200 chunks) |
| quantize/Darwin-28B-Coder-12.92GiB.recipe.txt | Full quantization recipe (49 mixed-precision rules) |
| logs/ppl_darwin-28b-coder-custom-12.92GiB_cfull.log | Perplexity benchmark log (this quant) |
| logs/ppl_reference_q4km_cfull.log | Reference Q4_K_M perplexity log |
Acknowledgements
Thanks to:
- FINAL-Bench — Original Darwin-28B-Coder model. A fantastic code-specialized 28B model that punches way above its weight class.
- ikawrakow / ik_llama.cpp — Custom quantization scheme and MTP inference engine.
- Thireus / GGUF-Tool-Suite — The quant_assign.py recipe generation pipeline and importance-aware bit allocation. The KLD-guided greedy optimizer is what makes mixed-IQ quantization practical.
- eaddario — Parquet calibration datasets for the custom imatrix.
- llama.cpp community — GGUF format, quantization infrastructure, and the broader ecosystem.
See Also
- FINAL-Bench/Darwin-28B-Coder — Original model card with benchmark scores (HumanEval 100%, MBPP 84%, BigCodeBench 72%)