Gemma 4 E2B it – Q5_K_S GGUF

5-bit "small" K-quant GGUF version of google/gemma-4-e2b-it.
High-fidelity quantization: slightly smaller than Q5_K_M with nearly identical quality.

Other quantizations in this series:
Q2_K · Q3_K_S · Q3_K_M · Q4_K_S · Q4_K_M · Q5_K_M · Q6_K · Q8


File Info

| Property | Value |
|---|---|
| Format | GGUF Q5_K_S |
| File size | 3.35 GB |
| Bits per weight | ~5 |
| Size vs F16 | 2.6× smaller |

Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science), 3 prompts each, with greedy decoding and 200 max new tokens. Quality metrics compare the quantized model's logit distributions against the F16 baseline.
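
The agreement metrics can be reproduced approximately as follows. This is a minimal sketch, not the benchmark's actual harness: it assumes you have already collected per-position logit matrices from the F16 and quantized runs (e.g. via llama-cpp-python with `logits_all=True`), and it computes SQNR over the logits, which may differ from the benchmark's exact definition.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def compare_logits(f16_logits, quant_logits):
    """Both arrays have shape (n_tokens, vocab_size)."""
    p = softmax(f16_logits)    # reference distribution
    q = softmax(quant_logits)  # quantized distribution
    eps = 1e-10                # guard against log(0)

    # Mean KL(P_f16 || P_quant) across token positions.
    kl = float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

    # Fraction of positions where both models agree on the top-1 token.
    top1 = float(np.mean(f16_logits.argmax(axis=-1) == quant_logits.argmax(axis=-1)))

    # Signal-to-quantization-noise ratio of the logits, in dB (assumed definition).
    noise = f16_logits - quant_logits
    sqnr_db = float(10 * np.log10(np.sum(f16_logits**2) / np.sum(noise**2)))

    return kl, top1, sqnr_db
```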

Results by Category

| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| 🔢 Math | 22.3 | 23.0 dB | 88.2% | 0.1086 |
| 🧠 Logic | 21.9 | 22.6 dB | 86.0% | 0.2110 |
| 💻 Code | 22.4 | 24.9 dB | 92.3% | 0.0912 |
| 🔬 Science | 21.2 | 22.9 dB | 84.2% | 0.2079 |
| **Overall** | 21.9 | 23.32 dB | 87.7% | 0.1547 |

Quantization Comparison

| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q3_K_M | 2.98 GB | 27.4 | 4.8× | 13.93 dB | 63.2% | 1.6747 |
| Q4_K_M | 3.19 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| **Q5_K_S (this)** | 3.35 GB | 21.9 | 3.9× | 23.32 dB | 87.7% | 0.1547 |
| Q5_K_M | 3.38 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q6_K | 3.58 GB | 19.9 | 3.5× | 28.72 dB | 94.1% | 0.0743 |
| Q8 | 4.63 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |

Key Findings

- Quality: 87.7% Top-1 agreement, slightly better than Q5_K_M on this metric, with exceptional code quality (92.3% Top-1 on code tasks)
- Speed: 21.9 tok/s, identical to Q5_K_M in practice
- Size: 3.35 GB, 30 MB smaller than Q5_K_M (a negligible difference)
- vs Q5_K_M: Q5_K_S scores marginally higher on Top-1 agreement but also shows a higher KL divergence; pick either one, as the practical difference is minimal
- Best for: the same use cases as Q5_K_M, and particularly strong on code generation tasks

Usage
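
To fetch just this file from the Hub, `hf_hub_download` from huggingface_hub works; the repo id below is a placeholder, since this card doesn't state the hosting repository:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id -- substitute the repository that actually hosts this GGUF.
path = hf_hub_download(
    repo_id="your-namespace/gemma-4-e2b-it-GGUF",
    filename="gemma-4-e2b-q5ks.gguf",
)
print(path)  # local path to pass as model_path / -m below
```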

```bash
# llama.cpp CLI
./llama-cli -m gemma-4-e2b-q5ks.gguf -p "Explain how a transformer neural network works." -n 200
```

```python
# llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q5ks.gguf", n_ctx=2048)
output = llm("Explain how a transformer neural network works.", max_tokens=200)
print(output["choices"][0]["text"])
```
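
Because the underlying model is instruction-tuned, llama-cpp-python's chat-completion API, which applies the model's built-in chat template, is often a better fit than the raw completion call above. A minimal sketch using the same file:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q5ks.gguf", n_ctx=2048)

# create_chat_completion applies the GGUF's embedded chat template
# before generating, which instruction-tuned models expect.
resp = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain how a transformer neural network works."}
    ],
    max_tokens=200,
)
print(resp["choices"][0]["message"]["content"])
```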

Hardware

Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding
