# Gemma 4 E2B Assistant – GGUF (Atomic Chat)
GGUF builds of `google/gemma-4-E2B-it-assistant` – the official Gemma 4 Multi-Token Prediction (MTP) drafter for `google/gemma-4-E2B-it`. Use it as a speculative-decoding draft model alongside the matching Gemma 4 target to get a meaningful decoding speedup at zero quality loss.

Approximate parameter counts: 78M (assistant) / 5.1B (target).
These GGUFs use the custom `gemma4_assistant` architecture and will not load in stock `llama.cpp`. They require the `atomic-llama-cpp-turboquant` fork, which adds:

- the `gemma4_assistant` MTP drafter arch (incl. the centroid LM head for E2B/E4B),
- TurboQuant KV-cache quantization (`-ctk turbo3 -ctv turbo3`),
- the `--mtp-head` / `--spec-type mtp` runtime flags.

Loading these files in upstream `ggml-org/llama.cpp` will fail with an unknown-architecture error.
## Files

| File | Quant | Size | Notes |
|---|---|---|---|
| `gemma-4-E2B-it-assistant.F16.gguf` | F16 | 164.3 MB | reference (smallest quality loss vs. source) |
| `gemma-4-E2B-it-assistant.Q8_0.gguf` | Q8_0 | 94.8 MB | near-lossless 8-bit |
| `gemma-4-E2B-it-assistant.Q5_K_M.gguf` | Q5_K_M | 75.7 MB | balanced k-quant |
| `gemma-4-E2B-it-assistant.Q4_K_M.gguf` | Q4_K_M | 74.5 MB | recommended default for speculative-decoding draft |
| `gemma-4-E2B-it-assistant.Q4_K_S.gguf` | Q4_K_S | 74.3 MB | smallest k-quant |
For E2B/E4B, the assistant uses an ordered-embedding centroid head (`mtp.centroids.weight` + `mtp.token_ordering.weight`) that compresses the LM head over the 262K-token vocabulary into 2048 centroids; this structure is preserved at every quantization level in this repo.
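The card doesn't spell out the decode path through this head, but the general shape of a centroid-compressed LM head can be sketched. Below is a minimal illustration assuming a two-stage lookup (score the 2048 centroids first, then rank only the tokens assigned to the winning centroid); the hidden size, the token-to-centroid assignment, and the use of the ordering weights as a per-token ranking score are all assumptions for illustration, not the fork's actual layout.

```python
import numpy as np

# Sizes from the card: 262K-token vocab, 2048 centroids. D (hidden size) is made up.
VOCAB, N_CENTROIDS, D = 262_144, 2048, 256

rng = np.random.default_rng(0)
centroids = rng.standard_normal((N_CENTROIDS, D)).astype(np.float32)  # stand-in for mtp.centroids.weight
token_to_centroid = rng.integers(0, N_CENTROIDS, VOCAB)               # hypothetical assignment
token_ordering = rng.standard_normal(VOCAB).astype(np.float32)        # stand-in for mtp.token_ordering.weight

def centroid_head_argmax(hidden: np.ndarray) -> int:
    """Two-stage lookup: O(N_CENTROIDS * D) work instead of O(VOCAB * D)."""
    # Stage 1: score only the 2048 centroids against the hidden state.
    best_c = int(np.argmax(centroids @ hidden))
    # Stage 2: rank the tokens mapped to that centroid by their ordering score.
    members = np.flatnonzero(token_to_centroid == best_c)
    return int(members[np.argmax(token_ordering[members])])

print(centroid_head_argmax(rng.standard_normal(D).astype(np.float32)))
```

The point of the structure is the cost line in the docstring: the drafter never materializes a 262K-wide logit matrix, which is part of what keeps a 78M-parameter head cheap enough to run at every draft step.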
## Quick start
Build the fork:
```bash
git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
# Pick one of the platform-specific configurations:
cmake -B build -DGGML_METAL=ON   # Apple Silicon
# cmake -B build -DGGML_CUDA=ON  # NVIDIA
# cmake -B build                 # CPU-only
cmake --build build --target llama-server llama-cli llama-quantize -j
```
Download the assistant drafter (this repo) and the matching Gemma 4 target:
```bash
hf download AtomicChat/gemma-4-E2B-it-assistant-GGUF \
  --include "*Q4_K_M.gguf" --local-dir ./models
# Any GGUF build of the matching target model works; e.g. unsloth's:
hf download unsloth/gemma-4-E2B-it-GGUF \
  --include "*Q4_K_M*.gguf" --local-dir ./models
```
Run `llama-server` with MTP speculative decoding + TurboQuant KV cache:
```bash
./build/bin/llama-server \
  -m ./models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mtp-head ./models/gemma-4-E2B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 3 --draft-max 8 --draft-min 0 \
  -ngl 99 -ngld 99 \
  -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
  -fa on -c 16384 --host 127.0.0.1 --port 8080
```
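The fork keeps `llama-server`'s OpenAI-compatible HTTP API, so any standard client can exercise the MTP path once the server is up. A quick smoke test with `requests` (the `model` field is a placeholder; the server serves whichever model it was launched with):

```python
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "gemma-4-E2B-it",  # placeholder; the launched model is used regardless
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```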
A ready-made launcher lives at `scripts/run-gemma4-e2b-mtp-server.sh` in the fork (`MTP_PRESET=throughput|lift|balanced|quality`).
## How MTP works here

Gemma 4 ships with a small "assistant" head that predicts several future tokens from the target model's last hidden state. In `atomic-llama-cpp-turboquant` it is loaded as a separate GGUF via `--mtp-head` and drives a custom speculative decoder (`block_size` 2-3, `draft_max` 6-8 typical). The verifier runs the target model in parallel, guaranteeing the same output distribution as plain greedy/sampled decoding.
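For intuition, a toy greedy version of the draft-and-verify loop is sketched below. This is not the fork's implementation: `draft` and `target` are stand-ins for any greedy next-token function, and the real verifier scores the whole drafted block in one parallel forward pass rather than token by token. What the sketch does show is why the output is guaranteed identical to plain greedy decoding of the target alone.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> greedy next token

def speculative_decode(target: NextToken, draft: NextToken,
                       ctx: List[Token], draft_max: int, steps: int) -> List[Token]:
    out = list(ctx)
    while len(out) < len(ctx) + steps:
        # 1. The cheap drafter proposes a block of candidate tokens.
        block: List[Token] = []
        for _ in range(draft_max):
            block.append(draft(out + block))
        # 2. The target verifies: accept the longest matching prefix.
        accepted = 0
        for tok in block:
            if target(out) != tok:
                break
            out.append(tok)
            accepted += 1
        # 3. On the first mismatch, keep the target's own token instead,
        #    so the sequence equals plain greedy decoding of the target.
        if accepted < len(block):
            out.append(target(out))
    return out[: len(ctx) + steps]

# Toy demo: drafter and target agree, so every drafted token is accepted.
step = lambda toks: (toks[-1] + 1) % 100
print(speculative_decode(step, step, [0], draft_max=4, steps=10))
```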
## TurboQuant KV cache

`turbo3` is the KV-cache quantization scheme used in this fork; it significantly reduces KV memory and bandwidth at long contexts with no measurable quality regression on Gemma 4. Apply it to both target and drafter via `-ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3`.
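This card doesn't document the `turbo3` wire format, so the sketch below only illustrates the generic family it belongs to: blockwise low-bit quantization of KV rows with a per-block scale. The block size, symmetric scaling, and 3-bit level count are assumptions chosen to show the memory arithmetic, not turbo3's actual layout.

```python
import numpy as np

BLOCK = 32       # illustrative block size, not turbo3's real layout
LEVELS = 2 ** 3  # 3 bits -> 8 levels per value

def quantize_kv_row(x: np.ndarray):
    """Symmetric blockwise quantization: 3-bit codes + one fp16 scale per block."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / (LEVELS / 2 - 0.5)
    scale[scale == 0] = 1.0
    codes = np.clip(np.round(x / scale + (LEVELS / 2 - 0.5)), 0, LEVELS - 1)
    return codes.astype(np.uint8), scale.astype(np.float16)  # real kernels bit-pack the codes

def dequantize_kv_row(codes, scale):
    return ((codes.astype(np.float32) - (LEVELS / 2 - 0.5)) * scale).ravel()

kv_row = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
codes, scale = quantize_kv_row(kv_row)
err = np.abs(dequantize_kv_row(codes, scale) - kv_row).mean()
# Packed cost: 3 bits/code + 16 bits of scale per 32 values = 3.5 bits/value vs 16 for f16.
print(f"mean abs error: {err:.4f}")
```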
## License & attribution

Released under the Gemma Terms of Use.

- Original model card: `google/gemma-4-E2B-it-assistant`
- Target model: `google/gemma-4-E2B-it`
- License text: https://ai.google.dev/gemma/docs/gemma_4_license
## Acknowledgements

- Google DeepMind – Gemma 4 family and the MTP drafters.
- `ggml-org/llama.cpp` – upstream inference engine.
- TurboQuant primitives – KV-cache quantization scheme integrated in the fork.

– Atomic Chat