Gemma 4 E2B it – Q5_K_M GGUF

5-bit medium quantized GGUF version of google/gemma-4-e2b-it.
High-fidelity quantization: outputs are very close to F16, with much lower KL divergence than Q4.

Other quantizations in this series:
Q2_K · Q3_K_S · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_S · Q6_K · Q8


File Info

| Property | Value |
|---|---|
| Format | GGUF Q5_K_M |
| Architecture | gemma4 |
| Parameters | 5B |
| File size | 3.63 GB |
| Bits per weight | ~5 |
| Size vs F16 | ~2.4× smaller |
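
For programmatic downloads, the file can be fetched with huggingface_hub. A minimal sketch; the repo id below is a placeholder for this repository's actual id:

```python
# Fetch the Q5_K_M GGUF from the Hub and return its local path.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="your-username/gemma-4-e2b-it-GGUF",  # placeholder: use this repo's id
    filename="gemma-4-e2b-q5km.gguf",             # filename matches the Usage section
)
print(model_path)
```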

Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science), 3 prompts each.
Greedy decoding, 200 max new tokens. Metrics compare logit distributions vs F16 baseline.
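
The exact benchmark harness isn't included here, but the three metrics are standard. A minimal sketch of how they are typically computed, assuming per-token logit arrays of shape (num_tokens, vocab_size) captured from identical greedy runs of the quantized and F16 models:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(logits_f16, logits_q):
    """Mean KL(P_f16 || P_q) over token positions, in nats."""
    p, q = softmax(logits_f16), softmax(logits_q)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def top1_agreement(logits_f16, logits_q):
    """Fraction of positions where both models rank the same token first."""
    return float(np.mean(logits_f16.argmax(-1) == logits_q.argmax(-1)))

def sqnr_db(logits_f16, logits_q):
    """Signal-to-quantization-noise ratio of the raw logits, in dB."""
    noise = logits_f16 - logits_q
    return float(10 * np.log10(np.sum(logits_f16**2) / np.sum(noise**2)))
```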

Results by Category

| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| 🔢 Math | 21.9 | 23.3 dB | 88.7% | 0.0880 |
| 🧠 Logic | 21.8 | 22.7 dB | 86.6% | 0.1607 |
| 💻 Code | 22.0 | 24.6 dB | 88.2% | 0.0869 |
| 🔬 Science | 22.2 | 22.4 dB | 84.3% | 0.1638 |
| Overall | 22.0 | 23.25 dB | 86.9% | 0.1248 |

Quantization Comparison

| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q3_K_M | 2.98 GB | 27.4 | 4.8× | 13.93 dB | 63.2% | 1.6747 |
| Q4_K_M | 3.19 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_S | 3.35 GB | 21.9 | 3.9× | 23.32 dB | 87.7% | 0.1547 |
| Q5_K_M (this) | 3.63 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q6_K | 3.58 GB | 19.9 | 3.5× | 28.72 dB | 94.1% | 0.0743 |
| Q8 | 4.63 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |

Key Findings

- Quality: KL divergence drops sharply versus Q4_K_M (0.34 → 0.12); the probability distributions are much closer to F16
- Speed: 22.0 tok/s, 3.9× faster than F16
- Size: 3.63 GB, only ~440 MB more than Q4_K_M
- vs Q5_K_S: essentially identical speed; Q5_K_M has lower KL divergence (0.12 vs 0.15), while Q5_K_S has marginally better Top-1 agreement (87.7% vs 86.9%). Both are excellent.
- Best for: tasks where output fidelity matters more than raw speed, such as complex reasoning, multi-step math, and detailed code generation

Usage

```bash
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q5km.gguf -p "Explain how a transformer neural network works." -n 200
```

```python
# llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q5km.gguf", n_ctx=2048)
output = llm("Explain how a transformer neural network works.", max_tokens=200)
print(output["choices"][0]["text"])
```
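
Since this is the instruction-tuned variant, routing requests through the chat API (which applies the model's chat template) may give better-formatted answers. A sketch with llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q5km.gguf", n_ctx=2048)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain how a transformer neural network works."}],
    max_tokens=200,
)
print(response["choices"][0]["message"]["content"])
```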

Hardware

Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding
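
To approximate the benchmark settings above (2048-token context, greedy decoding, 200 new tokens) with llama-cpp-python, temperature can be pinned to zero, which makes sampling deterministic. A sketch:

```python
from llama_cpp import Llama

# Matches the benchmark configuration: 2048-token context, greedy decoding.
llm = Llama(model_path="gemma-4-e2b-q5km.gguf", n_ctx=2048)
out = llm(
    "Explain how a transformer neural network works.",
    max_tokens=200,
    temperature=0.0,  # temperature 0 => greedy (argmax) decoding
)
```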
