# Granite 4.0 Hybrid Small – Dutch Calibrated Q2 GGUF
> ⚠️ **Fluency-first quantization.** This model is optimised for Dutch language fluency on consumer hardware. At Q2 bit depth, fluency and register are preserved far better than factual recall, reasoning, or instruction-following precision. Read the scope section before use.
Quantized variants of IBM Granite 4.0 Hybrid Small at Q2 bit depth, using a Dutch importance matrix derived from the Leesplank corpus.
This is the final and largest model in the Dutch Granite quantization series, which spans six models across three architectures, multiple quantization levels, and three bit depths, from 350M to 32B parameters. Built on an Asus ROG Strix Halo (128 GB unified memory, AMD RDNA4 iGPU, Vulkan/RADV llama.cpp build) specifically to make a 32B Dutch-fluent model available on consumer 16 GB GPU hardware.
## Available files
| File | Size | Use |
|---|---|---|
| granite-4.0-h-small-q2_K_NL.gguf | ~12 GB | ✅ Recommended: imatrix + Unsloth layer map |
| granite-4.0-h-small-q2_K_plain.gguf | ~12 GB | 🔬 Validation only: imatrix, no layer map |
Both fit on a 16 GB GPU with headroom for a practical KV cache (~4 GB).
## Scope: what this model is for
Use this model for:
- Dutch text generation where fluency, register, and grammatical coherence matter
- Office use cases: summarisation, rewriting, clarification, RAG in Dutch
- Applications where a fluent 32B Dutch model needs to fit on consumer hardware
Do not use this model for:
- Factual question answering
- Multi-step reasoning or logic
- Instruction following that requires precise output formats
- Any task where you need to trust the output content, not just its fluency
For quantization fidelity over raw fluency, use granite-4.0-h-micro-dutch-calibrated-gguf (Q4, 7B) or the dense micro series instead.
## Why Q2 at 32B and not Q4 at 7B?
Scale survives aggressive quantization better than quantization degrades fluency.
Fluency (grammatical coherence, register consistency, morphological correctness, natural Dutch sentence rhythm) is encoded redundantly across a large model's weight space. Q2 distributes rounding errors across 32 billion parameters. The result is a model where fluency largely survives while precision does not. This is precisely the rationale behind Unsloth's 2-bit and 1-bit series: at sufficient scale, aggressively quantized models remain useful for generation tasks even when they are unreliable for retrieval or reasoning.
The data confirms this holds for Dutch specifically. The 32B model at Q2 outperforms the 7B model at Q4 on every single one of the 70 evaluation clusters.
## Perplexity results (Q2, 70 Dutch clusters)
Evaluated on held-out Dutch texts from the Leesplank corpus. FP16 baseline mean PPL: 6.405, median: 6.675.
| Model | Mean PPL | Median PPL | Mean Δ vs FP16 | Mean % Δ |
|---|---|---|---|---|
| FP16 baseline | 6.405 | 6.675 | – | – |
| Q2_K_NL (this model) | 7.009 | 7.393 | +0.604 | +9.29% |
| Q2_K_plain | 7.159 | 7.560 | +0.754 | +11.62% |
q_nl beats q_plain in 63 out of 70 clusters (90%), the highest NL win rate in the entire series. The Unsloth layer map reduces mean PPL degradation by 20% and mean KLD by 16.3% relative to calibration alone.
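The perplexity figures above are derived from per-token log-probabilities. As a reminder of what a cluster-level delta means, here is a minimal stdlib-only Python sketch (toy numbers, not from the evaluation; the function name is ours):

```python
import math

def perplexity(logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Toy cluster: the same tokens scored by the FP16 baseline and the Q2 quant
fp16_lp = [-1.8, -2.0, -1.7, -1.9]  # natural-log token probabilities
q2_lp   = [-1.9, -2.2, -1.8, -2.0]  # slightly worse everywhere

ppl_fp16 = perplexity(fp16_lp)
ppl_q2 = perplexity(q2_lp)
delta_pct = 100 * (ppl_q2 - ppl_fp16) / ppl_fp16  # the "Mean % Δ" per cluster
```

The per-cluster "Mean % Δ" columns in the tables are this quantity averaged over the 70 held-out clusters.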
## KL-Divergence results (Q2, 70 Dutch clusters)
KLD measures fidelity of the full output probability distribution and is therefore more sensitive than perplexity, which only captures the most likely token. These are the first complete KLD measurements for a Q2 model in this series.
| Model | Avg mean KLD | Avg flip rate | Avg p99 KLD |
|---|---|---|---|
| Q2_K_NL (this model) | 0.2564 | 27.1% | 1.483 |
| Q2_K_plain | 0.3062 | 29.7% | 1.674 |
| Micro Q4_K_NL (reference) | 0.0489 | 15.0% | 0.192 |
The small Q2 shows 5.29× higher mean KLD than the micro Q4 (median 4.89×). This is the expected and accepted cost of Q2 at this scale: the model produces more distributional noise per token, while still generating fluent Dutch because the overall output register is preserved at the sentence level.
The flip rate of 27.1% means roughly 1 in 4 token positions has a different top-1 prediction than FP16. For generation tasks this manifests as stylistic variation rather than incoherence. For precision tasks it would be disqualifying.
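Both metrics are straightforward to compute given FP16 and quantized output distributions at the same token positions. A minimal stdlib-only sketch of the definitions (toy distributions, not evaluation data; the helper name is ours):

```python
import math

def kld_and_flip_rate(p_ref, p_quant, eps=1e-10):
    """Mean KL(p_ref || p_quant) over token positions, plus top-1 flip rate.

    p_ref, p_quant: lists of per-token probability distributions
    (one list of vocabulary probabilities per token position).
    """
    klds, flips = [], 0
    for ref, quant in zip(p_ref, p_quant):
        klds.append(sum(r * (math.log(r + eps) - math.log(q + eps))
                        for r, q in zip(ref, quant)))
        # top-1 "flip": the most likely token changed under quantization
        flips += ref.index(max(ref)) != quant.index(max(quant))
    return sum(klds) / len(klds), flips / len(p_ref)

# Toy example: 4 token positions, vocabulary of 3
p_fp16 = [[0.8, 0.1, 0.1],
          [0.6, 0.3, 0.1],
          [0.2, 0.7, 0.1],
          [0.5, 0.4, 0.1]]
p_q2 = [[0.4, 0.4, 0.2],   # peak flattened but argmax survives (cluster-012 style)
        [0.3, 0.6, 0.1],   # top-1 flips
        [0.2, 0.7, 0.1],   # unchanged
        [0.4, 0.5, 0.1]]   # top-1 flips
mean_kld, flip_rate = kld_and_flip_rate(p_fp16, p_q2)  # flip_rate = 0.5
```

Note how the first toy position mimics the masked-collapse pattern discussed below: large KLD contribution, zero flip.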
## Comparison to Granite 4.0 Hybrid Micro Q4
| Model | Quant | Params | Mean PPL | Mean % Δ | Avg KLD mean | Avg flip rate |
|---|---|---|---|---|---|---|
| Granite 4.0 Hybrid Small Q2_K_NL | Q2 | 32B | 7.009 | +9.29% | 0.2564 | 27.1% |
| Granite 4.0 Hybrid Micro Q4_K_NL | Q4 | 7B | 11.037 | +1.27% | 0.0489 | 15.0% |
The small Q2 achieves lower absolute perplexity in 70/70 clusters, with a mean advantage of 4.03 PPL points. It pays for this with 5× higher distributional noise. The micro Q4 is far more faithful to its own FP16 distribution. Both models are correct choices, just for different tasks.
## Notable cluster findings
### Cluster 012: the masked distribution collapse
Cluster 012 is the most important finding in this dataset. Its FP16 PPL is 2.58 (very predictable Dutch text). The Q2 PPL barely changes: −0.2%. Perplexity says nothing is wrong.
But KLD tells a completely different story: mean KLD = 1.186, flip rate = 0.500, p99 = 3.733. Half of all token positions change their top prediction. The distribution has collapsed while the top token happens to survive intact.
The micro Q4 on the same cluster shows mean KLD = 0.051, which is normal and clean. This failure is specific to Q2 on high-confidence, structured Dutch text. The model's sharp probability peaks are precisely what Q2 rounds away, producing a flat, distorted distribution even as the argmax survives by chance.
This is a direct, measured demonstration of why perplexity alone is insufficient for evaluating quantization quality. Perplexity is blind to this failure mode entirely.
### Cluster 025: where the layer map hurts
Cluster 025 is one of 7 clusters where q_plain outperforms q_nl on KLD (plain: 0.407, NL: 0.545). Its FP16 PPL of 1.95 is the most predictable text in the dataset. The Unsloth layer map, which promotes attention and FFN weights to higher precision, appears to push the weight space in the wrong direction for this specific text type. The micro Q4 also shows elevated KLD here (0.104, its worst cluster mean). Something about this text type is systematically difficult for the calibration + promotion approach across both models and both bit depths.
### The low-PPL cluster problem
Seven clusters have Small FP16 PPL below 3.0 (clusters 015, 025, 037, 024, 044, 050, 012). These are texts the 32B model finds extremely predictable: formulaic, repetitive, or domain-specific Dutch. As a group they show the worst KLD in the dataset. High model confidence at FP16 is a risk factor for Q2 distributional degradation, not a safety factor.
## Cross-model KLD analysis: two confirmed failure modes
Combining small Q2 KLD with micro Q4 KLD across all 70 clusters confirms the two-pathology hypothesis established earlier in this series.
Correlation between small Q2 KLD and micro Q4 KLD: r = 0.323 (mean), r = 0.429 (p99). This is a weak positive relationship: the same clusters tend to stress both models somewhat, but far from deterministically. Flip rates are essentially uncorrelated (r = −0.102), meaning the two models change different tokens even on the same problematic text.
Type A clusters (micro Q4 distributional tail failure; clusters 017, 020, 025, 036, 050, 058): these showed elevated micro Q4 KLD p99 (>0.35) and were predicted to also stress the small Q2. They do: all show small Q2 KLD means of 0.17–0.51, at ratios of 2.6×–10.2× over micro Q4. The prediction was confirmed. These are text types where the probability distribution is inherently hard to preserve at any aggressive quantization level.
Type B clusters (small Q2 PPL stress with clean micro Q4 KLD; clusters 000, 008, 028, 039, 043, 055, 069): these showed high small Q2 PPL degradation (+13–27%) but clean micro Q4 KLD (p99 all below 0.22). As predicted, they show small Q2 KLD ratios of 3.0×–5.3× over micro Q4: elevated but not catastrophic. The small model's confidence at FP16 makes it more vulnerable at Q2, but the distribution degrades more gracefully than in Type A clusters because the text type itself is not inherently ambiguous.
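The cross-model figures quoted above are plain Pearson r over the 70 per-cluster values. For reference, a self-contained sketch (toy vectors, not the per-cluster data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly aligned per-cluster stress would give r ~ 1.0; the measured
# KLD means correlate far more weakly (r = 0.323), and flip rates not at all.
r = pearson_r([0.1, 0.2, 0.4], [0.05, 0.10, 0.20])  # ~1.0
```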
## Usage
```bash
# llama.cpp (Vulkan/ROCm, recommended for AMD GPUs)
llama-cli -m granite-4.0-h-small-q2_K_NL.gguf -ngl 99 \
  --temp 0.7 \
  -sp "Je bent een behulpzame assistent. Je antwoordt altijd in het Nederlands." \
  --chat-template granite -cnv

# Note: add --cache-ram 0 if using llama-server, to avoid a crash
# in the prompt cache serialisation with hybrid SSM architectures
llama-server -m granite-4.0-h-small-q2_K_NL.gguf -ngl 99 --cache-ram 0
```
| Setting | Value |
|---|---|
| Context length | 128K (as per base model) |
| Chat template | Granite instruct |
| GPU offload | All layers (-ngl 99) |
| Recommended VRAM | 16 GB (model ~12 GB, ~4 GB headroom for KV cache) |
| System prompt | Always use a Dutch system prompt; the model defaults to English otherwise |
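With llama-server running as above, any OpenAI-compatible client can talk to the model. A minimal stdlib-only Python sketch, assuming the server's standard `/v1/chat/completions` endpoint on the default port 8080 (helper name and prompt text are illustrative):

```python
import json
import urllib.request

def chat(user_msg, url="http://localhost:8080/v1/chat/completions"):
    """Send one chat turn to a running llama-server instance."""
    payload = {
        "temperature": 0.7,
        "messages": [
            # A Dutch system prompt is required: the model defaults to English
            {"role": "system",
             "content": "Je bent een behulpzame assistent. "
                        "Je antwoordt altijd in het Nederlands."},
            {"role": "user", "content": user_msg},
        ],
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (needs the server running):
# print(chat("Vat de volgende e-mail samen in twee zinnen: ..."))
```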
## The Dutch Granite quantization series: final overview
This model closes the series. All models evaluated on the same 70-cluster held-out Dutch corpus using the same pipeline.
| Model | Arch | Params | Quant | FP16 PPL | Mean % Δ | NL win rate | Avg KLD mean |
|---|---|---|---|---|---|---|---|
| granite-4.0-h-350m | Hybrid | 350M | Q5 | 26.58 | +3.8% | 67% | – |
| granite-4.0-h-1b | Hybrid | 1B | Q4 | 14.45 | +3.3% | 81% | – |
| granite-4.0-micro (dense) | Dense | 3B | Q4–Q6 | 8.94 | +1.5% to +0.06% | 86% | 0.049 |
| granite-4.0-h-tiny | Hybrid MoE | 7B | Q4 | 9.59 | +1.4% | 59% | – |
| granite-4.0-h-micro | Hybrid | 7B | Q4 | 10.89 | +1.27% | 77% | 0.049 |
| granite-4.0-h-small | Hybrid | 32B | Q2 | 6.40 | +9.29% | 90% | 0.256 |
## Series findings: what was confirmed, what surprised, what remains open
### Confirmed
Scale beats quantization fidelity for fluency. The 32B Q2 outperforms the 7B Q4 on every cluster. This was the central hypothesis of the series and it holds cleanly.
The Dutch imatrix helps consistently. Across all six models and all bit depths, the NL imatrix improves on plain quantization in 59–90% of clusters. The smallest gain (Tiny MoE at 59%) is architectural: a model large enough that the quantizer preserves Dutch weights regardless. The largest gain (Small Q2 at 90%) is the expected ceiling for a bit-starved model where calibration matters most.
Layer map contribution scales with quantization aggressiveness. At Q5 on the dense micro, the layer map reduces PPL degradation by ~10×. At Q4 it contributes 14–23%. At Q2 it contributes 16–20% on KLD. The pattern is consistent: the lower the bit depth, the more selective precision promotion earns its cost.
Perplexity is blind to distribution collapse. Cluster 012 proves this definitively. PPL change: −0.2%. KLD: catastrophic. Any evaluation of aggressive quantization that uses only perplexity is missing a significant failure mode.
The two cluster pathology types are real and structurally distinct. Type A (distributional tail failure, correlated with FP16 model confidence) and Type B (PPL stress from sharp-distribution degradation) have different causes, different scales of impact, and different implications for downstream tasks. Their flip rates are uncorrelated across models (r = −0.102), confirming they represent fundamentally different quantization failure mechanisms.
### Surprises
The hybrid micro degrades less than the dense micro at Q4 (+1.27% vs +1.50%), despite SSM layers being theoretically more fragile. The Mamba-2 recurrent layers appear less weight-sensitive at Q4 than expected. Whether this holds at Q2 is untested.
Cluster 012's KLD catastrophe is invisible to PPL across both models and both bit depths. The micro Q4 shows normal KLD on cluster 012 (0.051), which makes it a purely Q2 failure, yet the text type itself is not pathological by any measure visible in the data. The failure mechanism remains unclear.
The layer map occasionally hurts. Seven clusters show q_plain beating q_nl on KLD for the small Q2, and cluster 025 is the clearest case (NL KLD 0.545 vs plain 0.404). The Unsloth layer map was derived from English/code-dominated training signal. On highly structured, formulaic Dutch text (FP16 PPL < 2.0), the promoted weight subspaces may not align with the Dutch token distribution: a form of calibration mismatch at the extremes.
The 32B model responds to instruction in English by default. Despite Dutch calibration and stronger Dutch fluency, the model's instruction-following defaults to English. A Dutch system prompt is required. The imatrix affects weight sensitivity during quantization but does not shift the model's language selection heuristic.
The llama-server prompt cache crashes on hybrid SSM architectures. The Mamba-2 recurrent state cannot be serialised by the standard KV cache save path (--cache-ram 0 is required). This is a llama.cpp bug specific to hybrid architectures and affects any server-based evaluation pipeline.
### Open questions
What does Q2 KLD look like on the micro? We have Q4 KLD for the micro and Q2 KLD for the small, but no Q2 KLD for a smaller model. The 5× KLD ratio between small Q2 and micro Q4 is a combined effect of bit depth and scale; we cannot separate the two without a micro Q2 measurement.
Why is cluster 012 a masked collapse at Q2 but clean at Q4? The text type (FP16 PPL 2.58, highly predictable Dutch) is not unusual. The micro Q4 handles it cleanly. Something about the interaction of Q2 quantization and the 32B model's sharp distribution on this text type produces catastrophic tail degradation while preserving the argmax. Understanding this would require inspecting which specific weight tensors are responsible.
Does a Dutch LoRA on the 32B Q2 close the PPL gap toward the FP16 baseline? The target for a fluency LoRA on office Dutch (summarisation, rewriting, RAG) would be the Small FP16 median of 6.67, only 0.72 PPL points above the Q2_NL median of 7.39. A Dutch LoRA trained on texts with Small FP16 PPL in the 6.0–8.5 range, using per-row PPL filtering with Levenshtein distance as a difficulty proxy, could plausibly close most of this gap. The cluster-level data needed to design that training set is in this dataset.
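The proposed selection rule can be sketched directly. Everything below is illustrative, not the actual pipeline: the rows, the band defaults, and the helper name are ours, and the Levenshtein-based difficulty proxy is omitted:

```python
def select_lora_rows(rows, lo=6.0, hi=8.5):
    """Keep candidate training rows whose per-row Small FP16 perplexity falls
    in the target band: hard enough to teach fluency, easy enough to stay
    on-register (very low PPL rows carry the Q2 collapse risk noted above)."""
    return [text for text, fp16_ppl in rows if lo <= fp16_ppl <= hi]

# Illustrative rows: (text, per-row Small FP16 PPL)
rows = [
    ("Formulaic boilerplate...", 2.1),   # too predictable: excluded
    ("Typical office Dutch...", 7.2),    # in band: kept
    ("Dense legal prose...", 12.4),      # too hard for a fluency target
]
selected = select_lora_rows(rows)  # -> ["Typical office Dutch..."]
```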
Is the NL win rate at Q2 (90%) a ceiling effect or a genuine Q2 advantage for language-specific calibration? The imatrix matters more at lower bit depths; that much is clear. But 90% is higher than any other model in the series. It may reflect that at Q2, without calibration, the quantizer is simply guessing wrong about which weights matter, and any signal, even imperfect Dutch signal, is substantially better than none.
## Citation / links
- Base model: ibm-granite/granite-4.0-h-small
- Calibration dataset: Leesplank-vloeiend-nl-curriculum-cp2
- Full methodology: granite-4.0-micro-dutch-calibrated-gguf
- Quantization pipeline: dutchdynamicquant (`quant_one_model.py`)