Gemma 4 31B Assistant – GGUF (Atomic Chat)

GGUF builds of google/gemma-4-31B-it-assistant, the official Gemma 4 Multi-Token Prediction (MTP) drafter for google/gemma-4-31B-it. Use it as a speculative-decoding draft model alongside the matching Gemma 4 target to get a meaningful decoding speedup with no loss in output quality.

Approximate size: 0.5B parameters (assistant drafter), paired with the 31B Gemma 4 target.

These GGUFs use the custom gemma4_assistant architecture and will not load in stock llama.cpp. They require the atomic-llama-cpp-turboquant fork, which adds:

  • the gemma4_assistant MTP drafter arch (incl. the centroid LM head for E2B/E4B),
  • TurboQuant KV-cache quantization (-ctk turbo3 -ctv turbo3),
  • the --mtp-head / --spec-type mtp runtime flags.

Loading these files in upstream ggml-org/llama.cpp will fail with an unknown architecture error.

Files

| File | Quant | Size | Notes |
|------|-------|------|-------|
| gemma-4-31B-it-assistant.F16.gguf | F16 | 910.6 MB | reference (smallest quality loss vs source) |
| gemma-4-31B-it-assistant.Q8_0.gguf | Q8_0 | 490.8 MB | near-lossless 8-bit |
| gemma-4-31B-it-assistant.Q5_K_M.gguf | Q5_K_M | 359.1 MB | balanced k-quant |
| gemma-4-31B-it-assistant.Q4_K_M.gguf | Q4_K_M | 337.1 MB | recommended default for speculative-decoding drafting |
| gemma-4-31B-it-assistant.Q4_K_S.gguf | Q4_K_S | 333.0 MB | smallest k-quant |

Quick start

Build the fork:

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
# Pick one of the platform-specific configurations:
cmake -B build -DGGML_METAL=ON          # Apple Silicon
# cmake -B build -DGGML_CUDA=ON         # NVIDIA
# cmake -B build                        # CPU-only
cmake --build build --target llama-server llama-cli llama-quantize -j
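
A quick way to confirm the binaries built is to print their version banners (this assumes the fork keeps upstream llama.cpp's --version flag):

./build/bin/llama-server --version
./build/bin/llama-cli --version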

Download the assistant drafter (this repo) and the matching Gemma 4 target:

hf download AtomicChat/gemma-4-31B-it-assistant-GGUF \
    --include "*Q4_K_M.gguf" --local-dir ./models
# Any GGUF build of the matching target model works; e.g. unsloth's:
hf download unsloth/gemma-4-31B-it-GGUF \
    --include "*Q4_K_M*.gguf" --local-dir ./models
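
Both GGUFs should now sit under ./models; the exact target filename depends on the repo and quant you pulled, so check it and adjust the paths in the next step if needed:

ls -lh ./models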

Run llama-server with MTP speculative decoding + TurboQuant KV cache:

./build/bin/llama-server \
    -m         ./models/gemma-4-31B-it-Q4_K_M.gguf \
    --mtp-head ./models/gemma-4-31B-it-assistant.Q4_K_M.gguf \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 8 --draft-min 0 \
    -ngl 99 -ngld 99 \
    -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
    -fa on -c 16384 --host 127.0.0.1 --port 8080
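
Once the server is up, you can sanity-check it with a request to its OpenAI-compatible chat endpoint (host and port taken from the command above):

curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'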

A ready-made launcher lives at scripts/run-gemma4-31b-mtp-server.sh in the fork (MTP_PRESET=throughput|lift|balanced|quality).
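
For example, to start the server with the quality preset (assuming you run the script from the fork's repo root and it finds the models where it expects them):

MTP_PRESET=quality ./scripts/run-gemma4-31b-mtp-server.sh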

How MTP works here

Gemma 4 ships with a small "assistant" head that predicts several future tokens from the target model's last hidden state. In atomic-llama-cpp-turboquant it is loaded as a separate GGUF via --mtp-head and drives a custom speculative decoder (block size 3; draft_max 16 in the quality preset). The verifier then runs the target model over the drafted tokens in parallel, guaranteeing the same output distribution as plain greedy/sampled decoding.

TurboQuant KV cache

turbo3 is the KV-cache quantization scheme used in this fork; it significantly reduces KV memory and bandwidth at long contexts with no measurable quality regression on Gemma 4. Apply it to both target and drafter via -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3.

License & attribution

Released under the Gemma Terms of Use.

Acknowledgements

  • Google DeepMind – Gemma 4 family and the MTP drafters.
  • ggml-org/llama.cpp – upstream inference engine.
  • TurboQuant primitives – KV-cache quantization scheme integrated in the fork.

– Atomic Chat
