# Gemma 4 E2B Assistant – GGUF (Atomic Chat)
GGUF builds of `google/gemma-4-E2B-it-assistant` – the official Gemma 4 Multi-Token Prediction (MTP) drafter for `google/gemma-4-E2B-it`. Use it as a speculative-decoding draft model alongside the matching Gemma 4 target to get a meaningful decoding speedup at zero quality loss.

Approximate parameter counts: 78M (assistant) / 5.1B (target).
These GGUFs use the custom `gemma4_assistant` architecture and will not load in stock `llama.cpp`. They require the `atomic-llama-cpp-turboquant` fork, which adds:

- the `gemma4_assistant` MTP drafter arch (incl. the centroid LM head for E2B/E4B),
- TurboQuant KV-cache quantization (`-ctk turbo3 -ctv turbo3`),
- the `--mtp-head` / `--spec-type mtp` runtime flags.

Loading these files in upstream `ggml-org/llama.cpp` will fail with an unknown-architecture error.
## Files

| File | Quant | Size | Notes |
|---|---|---|---|
| `gemma-4-E2B-it-assistant.F16.gguf` | F16 | 164.3 MB | reference (smallest quality loss vs. source) |
| `gemma-4-E2B-it-assistant.Q8_0.gguf` | Q8_0 | 94.8 MB | near-lossless 8-bit |
| `gemma-4-E2B-it-assistant.Q5_K_M.gguf` | Q5_K_M | 75.7 MB | balanced k-quant |
| `gemma-4-E2B-it-assistant.Q4_K_M.gguf` | Q4_K_M | 74.5 MB | recommended default for speculative-decoding draft |
| `gemma-4-E2B-it-assistant.Q4_K_S.gguf` | Q4_K_S | 74.3 MB | smallest k-quant |
For E2B/E4B, the assistant uses an ordered-embedding centroid head (`mtp.centroids.weight` + `mtp.token_ordering.weight`) that compresses the LM head over the 262K-token vocabulary into 2048 centroids; this structure is preserved at every quantization level in this repo.
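The card doesn't spell out the decode path through this head, but the general shape of a centroid-compressed LM head can be sketched. Below is a minimal illustration assuming a two-stage lookup (score the 2048 centroids first, then rank only the tokens assigned to the winning centroid); the hidden size, the token-to-centroid assignment, and the use of the ordering weights as a per-token ranking score are all assumptions for illustration, not the fork's actual layout.

```python
import numpy as np

# Sizes from the card: 262K-token vocab, 2048 centroids. D (hidden size) is made up.
VOCAB, N_CENTROIDS, D = 262_144, 2048, 256

rng = np.random.default_rng(0)
centroids = rng.standard_normal((N_CENTROIDS, D)).astype(np.float32)  # stand-in for mtp.centroids.weight
token_to_centroid = rng.integers(0, N_CENTROIDS, VOCAB)               # hypothetical assignment
token_ordering = rng.standard_normal(VOCAB).astype(np.float32)        # stand-in for mtp.token_ordering.weight

def centroid_head_argmax(hidden: np.ndarray) -> int:
    """Two-stage lookup: O(N_CENTROIDS * D) work instead of O(VOCAB * D)."""
    # Stage 1: score only the 2048 centroids against the hidden state.
    best_c = int(np.argmax(centroids @ hidden))
    # Stage 2: rank the tokens mapped to that centroid by their ordering score.
    members = np.flatnonzero(token_to_centroid == best_c)
    return int(members[np.argmax(token_ordering[members])])

print(centroid_head_argmax(rng.standard_normal(D).astype(np.float32)))
```

The point of the structure is the cost line in the docstring: the drafter never materializes a 262K-wide logit matrix, which is part of what keeps a 78M-parameter head cheap enough to run at every draft step.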
## Quick start
Build the fork:
```bash
git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
# Pick one of the platform-specific configurations:
cmake -B build -DGGML_METAL=ON   # Apple Silicon
# cmake -B build -DGGML_CUDA=ON  # NVIDIA
# cmake -B build                 # CPU-only
cmake --build build --target llama-server llama-cli llama-quantize -j
```
Download the assistant drafter (this repo) and the matching Gemma 4 target:
```bash
hf download AtomicChat/gemma-4-E2B-it-assistant-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./models
# Any GGUF build of the matching target model works; e.g. unsloth's:
hf download unsloth/gemma-4-E2B-it-GGUF \
  --include "*Q4_K_M*.gguf" --local-dir ./models
```
Run `llama-server` with MTP speculative decoding + TurboQuant KV cache:
```bash
./build/bin/llama-server \
  -m ./models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mtp-head ./models/gemma-4-E2B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 3 --draft-max 8 --draft-min 0 \
  -ngl 99 -ngld 99 \
  -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
  -fa on -c 16384 --host 127.0.0.1 --port 8080
```
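The fork keeps `llama-server`'s OpenAI-compatible HTTP API, so any standard client can exercise the MTP path once the server is up. A quick smoke test with `requests` (the `model` field is a placeholder; the server serves whichever model it was launched with):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "gemma-4-E2B-it",  # placeholder; the launched model is used regardless
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```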
A ready-made launcher lives at `scripts/run-gemma4-e2b-mtp-server.sh` in the fork (`MTP_PRESET=throughput|lift|balanced|quality`).
## How MTP works here

Gemma 4 ships with a small "assistant" head that predicts several future tokens from the target model's last hidden state. In `atomic-llama-cpp-turboquant` it is loaded as a separate GGUF via `--mtp-head` and drives a custom speculative decoder (`block_size` 2-3, `draft_max` 6-8 typical). The verifier runs the target model in parallel, guaranteeing the same output distribution as plain greedy/sampled decoding.
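For intuition, a toy greedy version of the draft-and-verify loop is sketched below. This is not the fork's implementation: `draft` and `target` are stand-ins for any greedy next-token function, and the real verifier scores the whole drafted block in one parallel forward pass rather than token by token. What the sketch does show is why the output is guaranteed identical to plain greedy decoding of the target alone.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> greedy next token

def speculative_decode(target: NextToken, draft: NextToken,
                       ctx: List[Token], draft_max: int, steps: int) -> List[Token]:
    out = list(ctx)
    while len(out) < len(ctx) + steps:
        # 1. The cheap drafter proposes a block of candidate tokens.
        block: List[Token] = []
        for _ in range(draft_max):
            block.append(draft(out + block))
        # 2. The target verifies: accept the longest matching prefix.
        accepted = 0
        for tok in block:
            if target(out) != tok:
                break
            out.append(tok)
            accepted += 1
        # 3. On the first mismatch, keep the target's own token instead,
        #    so the sequence equals plain greedy decoding of the target.
        if accepted < len(block):
            out.append(target(out))
    return out[: len(ctx) + steps]

# Toy demo: drafter and target agree, so every drafted token is accepted.
step = lambda toks: (toks[-1] + 1) % 100
print(speculative_decode(step, step, [0], draft_max=4, steps=10))
```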
## TurboQuant KV cache

`turbo3` is the KV-cache quantization scheme used in this fork; it significantly reduces KV memory and bandwidth at long contexts with no measurable quality regression on Gemma 4. Apply it to both target and drafter via `-ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3`.
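This card doesn't document the `turbo3` wire format, so the sketch below only illustrates the generic family it belongs to: blockwise low-bit quantization of KV rows with a per-block scale. The block size, symmetric scaling, and 3-bit level count are assumptions chosen to show the memory arithmetic, not turbo3's actual layout.

```python
import numpy as np

BLOCK = 32       # illustrative block size, not turbo3's real layout
LEVELS = 2 ** 3  # 3 bits -> 8 levels per value

def quantize_kv_row(x: np.ndarray):
    """Symmetric blockwise quantization: 3-bit codes + one fp16 scale per block."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / (LEVELS / 2 - 0.5)
    scale[scale == 0] = 1.0
    codes = np.clip(np.round(x / scale + (LEVELS / 2 - 0.5)), 0, LEVELS - 1)
    return codes.astype(np.uint8), scale.astype(np.float16)  # real kernels bit-pack the codes

def dequantize_kv_row(codes, scale):
    return ((codes.astype(np.float32) - (LEVELS / 2 - 0.5)) * scale).ravel()

kv_row = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
codes, scale = quantize_kv_row(kv_row)
err = np.abs(dequantize_kv_row(codes, scale) - kv_row).mean()
# Packed cost: 3 bits/code + 16 bits of scale per 32 values = 3.5 bits/value vs 16 for f16.
print(f"mean abs error: {err:.4f}")
```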
## License & attribution

Released under the Gemma Terms of Use.

- Original model card: `google/gemma-4-E2B-it-assistant`
- Target model: `google/gemma-4-E2B-it`
- License text: https://ai.google.dev/gemma/docs/gemma_4_license
## Acknowledgements

- Google DeepMind – Gemma 4 family and the MTP drafters.
- `ggml-org/llama.cpp` – upstream inference engine.
- TurboQuant primitives – KV-cache quantization scheme integrated in the fork.

– Atomic Chat