Gemma 4 E2B it – Q3_K_M GGUF

3-bit medium quantized GGUF version of google/gemma-4-e2b-it.
Usable quality at very small size: a large quality step up from Q2_K at essentially the same file size.

Other quantizations in this series:
Q2_K · Q3_K_S · Q4_K_S · Q4_K_M · Q5_K_S · Q5_K_M · Q6_K · Q8


File Info

| Property | Value |
|---|---|
| Format | GGUF Q3_K_M |
| File size | 2.98 GB |
| Bits per weight | ~3 |
| Size vs F16 | 2.9× smaller |
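The "2.9× smaller" figure follows directly from the two file sizes reported in this card (a trivial sanity-check sketch):

```python
f16_gb = 8.67   # F16 baseline size, from the comparison table below
q3km_gb = 2.98  # this file

# Ratio of baseline size to quantized size, rounded to one decimal
print(f"{f16_gb / q3km_gb:.1f}x smaller")
```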

Benchmark Results

Tested across 4 categories (Math, Logic, Code, Science) with 3 prompts each.
Greedy decoding, 200 max new tokens. Metrics compare logit distributions against the F16 baseline.
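The card does not spell out how the three metrics are computed. A minimal sketch of plausible definitions, assuming per-token comparison of quantized logits against F16 logits (softmax for the distributions, KL(F16 ‖ quant), argmax agreement, and SQNR over the raw logits); the function name `quant_metrics` is hypothetical:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def quant_metrics(logits_f16, logits_quant):
    """Compare quantized logits against the F16 baseline.
    Arrays are (tokens, vocab); metric definitions are assumptions."""
    p = softmax(logits_f16)
    q = softmax(logits_quant)
    # KL(P || Q), averaged over token positions
    kl = float(np.mean(np.sum(p * np.log(p / q), axis=-1)))
    # Fraction of positions where the argmax (greedy) token agrees
    top1 = float(np.mean(p.argmax(-1) == q.argmax(-1)))
    # Signal-to-quantization-noise ratio in dB over the raw logits
    err = logits_f16 - logits_quant
    sqnr = float(10.0 * np.log10(np.sum(logits_f16**2) / np.sum(err**2)))
    return kl, top1, sqnr
```

With identical inputs, KL tends to 0, Top-1 agreement to 100%, and SQNR to infinity; quantization error pulls all three toward the values in the tables below.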

Results by Category

| Category | Speed (tok/s) | SQNR | Top-1 Agreement | KL Divergence |
|---|---|---|---|---|
| 🔢 Math | 27.4 | 13.7 dB | 66.1% | 1.4605 |
| 🧠 Logic | 27.7 | 14.1 dB | 59.5% | 2.0306 |
| 💻 Code | 27.3 | 15.1 dB | 65.7% | 1.4831 |
| 🔬 Science | 27.3 | 12.8 dB | 61.3% | 1.7248 |
| **Overall** | 27.4 | 13.93 dB | 63.2% | 1.6747 |

Quantization Comparison

| Model | Size | Speed (tok/s) | vs F16 speed | SQNR | Top-1 Agree | KL Div |
|---|---|---|---|---|---|---|
| F16 (baseline) | 8.67 GB | 5.7 | 1.0× | baseline | baseline | baseline |
| Q2_K | 2.99 GB | 31.6 | 5.6× | 5.85 dB | 32.0% | 4.1149 |
| Q3_K_S | 3.11 GB | 28.9 | 5.1× | 10.12 dB | 63.2% | 1.2605 |
| **Q3_K_M (this)** | 2.98 GB | 27.4 | 4.8× | 13.93 dB | 63.2% | 1.6747 |
| Q4_K_S | 3.37 GB | 25.0 | 4.4× | 19.10 dB | 80.9% | 0.3456 |
| Q4_K_M | 3.43 GB | 24.0 | 4.2× | 20.33 dB | 82.4% | 0.3356 |
| Q5_K_S | 3.6 GB | 21.9 | 3.9× | 23.32 dB | 87.7% | 0.1547 |
| Q5_K_M | 3.63 GB | 22.0 | 3.9× | 23.25 dB | 86.9% | 0.1248 |
| Q8 | 4.97 GB | 16.2 | 2.9× | 37.11 dB | 96.0% | 0.0171 |
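One way to read the table: given a RAM budget, pick the quant with the best quality whose file still fits. The helper below is purely illustrative (`best_quant` and the `overhead_gb` allowance for KV cache and runtime are assumptions); sizes and Top-1 figures are taken from the table above:

```python
# (name, file size in GB, Top-1 agreement %) from the comparison table
QUANTS = [
    ("Q2_K",   2.99, 32.0),
    ("Q3_K_M", 2.98, 63.2),
    ("Q3_K_S", 3.11, 63.2),
    ("Q4_K_S", 3.37, 80.9),
    ("Q4_K_M", 3.43, 82.4),
    ("Q5_K_S", 3.60, 87.7),
    ("Q5_K_M", 3.63, 86.9),
    ("Q8",     4.97, 96.0),
]

def best_quant(ram_gb, overhead_gb=1.0):
    """Return the highest-Top-1 quant whose file plus a rough
    runtime/KV-cache overhead (assumed 1 GB) fits in ram_gb."""
    fitting = [q for q in QUANTS if q[1] + overhead_gb <= ram_gb]
    return max(fitting, key=lambda q: q[2], default=None)

print(best_quant(4.0))  # under 4 GB, Q3_K_M is the best that fits
print(best_quant(8.0))  # with 8 GB, everything fits, so Q8 wins
```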

Key Findings

- Quality: 63.2% Top-1 agreement; outputs are coherent, but noticeably different from F16 on complex prompts
- Speed: 27.4 tok/s, 4.8× faster than F16
- Size: 2.98 GB, fits under 4 GB RAM with room to spare
- vs Q3_K_S: Q3_K_M has higher SQNR (13.93 vs 10.12 dB) but also higher KL divergence (1.67 vs 1.26), and Q3_K_S is slightly faster; in practice, quality is similar
- Best for: low-RAM devices where Q4 variants don't fit; simple chat and factual Q&A
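The SQNR/KL disagreement in the Q3_K_S comparison is not a contradiction: SQNR measures raw logit error energy, while KL divergence only sees the resulting probabilities. A uniform shift of all logits, for example, costs SQNR but leaves the softmax unchanged. A toy demonstration with made-up logits (not model data):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sqnr_db(ref, approx):
    # Signal-to-quantization-noise ratio in dB over raw logits
    err = ref - approx
    return 10 * np.log10(np.sum(ref**2) / np.sum(err**2))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

base = np.array([4.0, 2.0, 0.0])
shifted = base + 1.0                        # big logit error, same softmax
nudged = base + np.array([0.0, 0.3, 0.0])   # small error, probabilities move

p = softmax(base)
print(sqnr_db(base, shifted), kl(p, softmax(shifted)))  # low SQNR, KL ~ 0
print(sqnr_db(base, nudged), kl(p, softmax(nudged)))    # higher SQNR, KL > 0
```

So two quants can rank oppositely on the two metrics, which is exactly what the Q3_K_S vs Q3_K_M numbers show.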

Usage

llama.cpp CLI:

```bash
./llama-cli -m gemma-4-e2b-q3km.gguf -p "Explain the water cycle." -n 200
```

llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(model_path="gemma-4-e2b-q3km.gguf", n_ctx=2048)
output = llm("Explain the water cycle.", max_tokens=200)
print(output["choices"][0]["text"])
```

Hardware

Tested on: CPU inference (llama.cpp)
Context: 2048 tokens | Greedy decoding

Model size: 5B params
Architecture: gemma4