Biogenic committed on
Commit 171ecca · verified · 1 Parent(s): 1e8b207

feat: initial GGUF release (F16 + Q8_0/Q5_K_M/Q4_K_M/Q4_K_S)

.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ gemma-4-26B-A4B-it-assistant.F16.gguf filter=lfs diff=lfs merge=lfs -text
37
+ gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
38
+ gemma-4-26B-A4B-it-assistant.Q4_K_S.gguf filter=lfs diff=lfs merge=lfs -text
39
+ gemma-4-26B-A4B-it-assistant.Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
40
+ gemma-4-26B-A4B-it-assistant.Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,121 @@
1
+ ---
2
+ license: apache-2.0
3
+ license_link: https://ai.google.dev/gemma/docs/gemma_4_license
4
+ library_name: llama.cpp
5
+ pipeline_tag: text-generation
6
+ base_model: google/gemma-4-26B-A4B-it-assistant
7
+ base_model_relation: quantized
8
+ tags:
9
+ - gguf
10
+ - llama.cpp
11
+ - mtp
12
+ - multi-token-prediction
13
+ - speculative-decoding
14
+ - gemma
15
+ - gemma-4
16
+ - atomic-chat
17
+ - turboquant
18
+ ---
19
+
20
+ # Gemma 4 26B-A4B Assistant — GGUF (Atomic Chat)
21
+
22
+ GGUF builds of [`google/gemma-4-26B-A4B-it-assistant`](https://huggingface.co/google/gemma-4-26B-A4B-it-assistant) — the official Gemma 4
23
+ **Multi-Token Prediction (MTP)** drafter for
24
+ [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it). Use it as a speculative-decoding
25
+ draft model alongside the matching Gemma 4 target to speed up decoding without
26
+ changing the target model's output distribution.
27
+
28
+ Approximate size: **0.4B (assistant) / 27B MoE target**.
29
+
30
+ > [!IMPORTANT]
31
+ > These GGUFs use the custom `gemma4_assistant` architecture and **will not
32
+ > load in stock `llama.cpp`**. They require the
33
+ > [`atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant) fork, which adds:
34
+ > - the `gemma4_assistant` MTP drafter architecture (including the centroid LM head for E2B/E4B),
35
+ > - **TurboQuant** KV-cache quantization (`-ctk turbo3 -ctv turbo3`),
36
+ > - the `--mtp-head` / `--spec-type mtp` runtime flags.
37
+ >
38
+ > Loading these files in upstream `ggml-org/llama.cpp` will fail with an
39
+ > unknown architecture error.
40
+
41
+ ## Files
42
+
43
+ | File | Quant | Size | Notes |
44
+ |---|---|---:|---|
45
+ | `gemma-4-26B-A4B-it-assistant.F16.gguf` | F16 | 815.6 MiB | reference (smallest quality loss vs source) |
46
+ | `gemma-4-26B-A4B-it-assistant.Q8_0.gguf` | Q8_0 | 440.4 MiB | near-lossless 8-bit |
47
+ | `gemma-4-26B-A4B-it-assistant.Q5_K_M.gguf` | Q5_K_M | 326.4 MiB | balanced k-quant |
48
+ | `gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf` | Q4_K_M | 310.4 MiB | recommended default for speculative-decoding draft |
49
+ | `gemma-4-26B-A4B-it-assistant.Q4_K_S.gguf` | Q4_K_S | 306.3 MiB | smallest k-quant |
50
+
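The sizes in the table can be cross-checked against the `size` fields of the Git LFS pointers added in this commit. The following is a minimal sketch, assuming `awk` is available; `bytes_to_mib` is a hypothetical helper name, not something the fork provides. The results suggest the table values are mebibytes (1 MiB = 1,048,576 bytes).

```bash
#!/usr/bin/env bash
# bytes_to_mib BYTES
# Convert a byte count (e.g. the `size` field of a Git LFS pointer) to MiB,
# rounded to one decimal place. Hypothetical helper for illustration only.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1048576 }'
}

# Byte sizes copied from the LFS pointers in this commit.
bytes_to_mib 855228832   # F16    -> 815.6
bytes_to_mib 461767072   # Q8_0   -> 440.4
bytes_to_mib 342262176   # Q5_K_M -> 326.4
bytes_to_mib 325452192   # Q4_K_M -> 310.4
bytes_to_mib 321126816   # Q4_K_S -> 306.3
```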
51
+ ## Quick start
52
+
53
+ Build the fork:
54
+
55
+ ```bash
56
+ git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
57
+ cd atomic-llama-cpp-turboquant
58
+ # Pick one of the platform-specific configurations:
59
+ cmake -B build -DGGML_METAL=ON # Apple Silicon
60
+ # cmake -B build -DGGML_CUDA=ON # NVIDIA
61
+ # cmake -B build # CPU-only
62
+ cmake --build build --target llama-server llama-cli llama-quantize -j
63
+ ```
64
+
65
+ Download the assistant drafter (this repo) and the matching Gemma 4 target:
66
+
67
+ ```bash
68
+ hf download AtomicChat/gemma-4-26B-A4B-it-assistant-GGUF \
69
+ --include "*Q4_K_M.gguf" --local-dir ./models
70
+ # Any GGUF build of the matching target model works; e.g. unsloth's:
71
+ hf download unsloth/gemma-4-26B-A4B-it-GGUF \
72
+ --include "*Q4_K_M*.gguf" --local-dir ./models
73
+ ```
74
+
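After downloading, the drafter file can be checked against the `oid sha256:` value in its Git LFS pointer, since that value is the SHA-256 digest of the file contents. This is a sketch under assumptions: `verify_sha256` is a hypothetical helper, and it relies on either `sha256sum` (GNU coreutils) or `shasum` (macOS) being installed.

```bash
#!/usr/bin/env bash
# verify_sha256 FILE EXPECTED_HEX
# Returns 0 when FILE's SHA-256 digest equals EXPECTED_HEX, non-zero otherwise.
# Hypothetical helper for illustration; not part of the fork or of `hf`.
verify_sha256() {
  local file="$1" expected="$2" actual
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | awk '{print $1}')
  else
    actual=$(shasum -a 256 "$file" | awk '{print $1}')   # macOS fallback
  fi
  [ "$actual" = "$expected" ]
}

# Example (digest copied from the Q4_K_M LFS pointer in this commit):
# verify_sha256 ./models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
#   851300493d74f5648f4475681852e3b238f33e97c4b0187734a37b4195b488f9 \
#   && echo "checksum OK"
```

The example call is left commented out so the snippet can be sourced without the model file present.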
75
+ Run `llama-server` with MTP speculative decoding + TurboQuant KV cache:
76
+
77
+ ```bash
78
+ ./build/bin/llama-server \
79
+ -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
80
+ --mtp-head ./models/gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf \
81
+ --spec-type mtp \
82
+ --draft-block-size 3 --draft-max 8 --draft-min 0 \
83
+ -ngl 99 -ngld 99 \
84
+ -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
85
+ -fa on -c 16384 --host 127.0.0.1 --port 8080
86
+ ```
87
+
88
+ A ready-made launcher lives at `scripts/run-gemma4-26b-a4b-mtp-server.sh`
89
+ in the fork (`MTP_PRESET=throughput|lift|balanced|quality`).
90
+
91
+ ## How MTP works here
92
+
93
+ Gemma 4 ships with a small "assistant" head that predicts several future tokens
94
+ from the target model's last hidden state. In `atomic-llama-cpp-turboquant` it
95
+ is loaded as a separate GGUF via `--mtp-head` and drives a custom speculative
96
+ decoder (block_size 3, draft_max 16 for the quality preset). The target model then verifies
97
+ the drafted tokens in parallel and discards rejected ones, so the output distribution is
98
+ the same as with plain greedy or sampled decoding.
99
+
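The verification step described above can be illustrated for the greedy case only. This is a simplified sketch, not the fork's implementation: `accept_draft` is a hypothetical function, tokens are plain space-separated words, and the target's greedy choices are passed in rather than computed. It assumes the target list has one more entry than the draft (the token that follows a fully accepted draft).

```bash
#!/usr/bin/env bash
# accept_draft "DRAFT TOKENS" "TARGET TOKENS"
#   DRAFT:  k tokens proposed by the drafter.
#   TARGET: k+1 tokens the target model would pick greedily at the same
#           positions (the last one follows a fully accepted draft).
# Prints the tokens committed in this step: the longest draft prefix that
# matches the target, followed by one token taken from the target.
accept_draft() {
  local -a draft target out=()
  local i
  read -r -a draft <<< "$1"
  read -r -a target <<< "$2"
  for ((i = 0; i < ${#draft[@]}; i++)); do
    if [[ "${draft[i]}" == "${target[i]}" ]]; then
      out+=("${draft[i]}")      # drafter and target agree: accept
    else
      break                     # first disagreement: stop accepting
    fi
  done
  # On a mismatch this is the target's correction; after a fully accepted
  # draft it is the extra token the target produced.
  out+=("${target[i]}")
  printf '%s\n' "${out[*]}"
}

accept_draft "the cat sat" "the cat ran fast"   # prints: the cat ran
accept_draft "a b c" "a b c d"                  # prints: a b c d
```

Because every committed token is one the target would have chosen itself, the result matches plain greedy decoding; sampled decoding needs a probabilistic acceptance rule that this sketch does not cover.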
100
+ ## TurboQuant KV cache
101
+
102
+ `turbo3` is the KV-cache quantization scheme used in this fork; it significantly
103
+ reduces KV memory and bandwidth at long contexts with no measurable quality
104
+ regression on Gemma 4. Apply it to both target and drafter via
105
+ `-ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3`.
106
+
107
+ ## License & attribution
108
+
109
+ Released under the **Gemma Terms of Use**.
110
+
111
+ - Original model card: [`google/gemma-4-26B-A4B-it-assistant`](https://huggingface.co/google/gemma-4-26B-A4B-it-assistant)
112
+ - Target model: [`google/gemma-4-26B-A4B-it`](https://huggingface.co/google/gemma-4-26B-A4B-it)
113
+ - License text: <https://ai.google.dev/gemma/docs/gemma_4_license>
114
+
115
+ ## Acknowledgements
116
+
117
+ - [Google DeepMind](https://deepmind.google/models/gemma/) — Gemma 4 family and the MTP drafters.
118
+ - [`ggml-org/llama.cpp`](https://github.com/ggml-org/llama.cpp) — upstream inference engine.
119
+ - TurboQuant primitives — KV-cache quantization scheme integrated in the fork.
120
+
121
+ — [Atomic Chat](https://atomic.chat)
gemma-4-26B-A4B-it-assistant.F16.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3dc3f22a85b4e5d13d6382fbf1d965ebf5c521f4fbfb755c17a172798d80159c
3
+ size 855228832
gemma-4-26B-A4B-it-assistant.Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:851300493d74f5648f4475681852e3b238f33e97c4b0187734a37b4195b488f9
3
+ size 325452192
gemma-4-26B-A4B-it-assistant.Q4_K_S.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:06426693a8a4171adb945085b3c77d34c98e157dbf02f39e3c7b18639c8093c1
3
+ size 321126816
gemma-4-26B-A4B-it-assistant.Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7273f2b238ee527d88ff75fb3b65d6c8ad6aa0919c7c18363d2d3d219c084124
3
+ size 342262176
gemma-4-26B-A4B-it-assistant.Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0ff5e851eb69aba552efb7cc5da0b37801b42554b403f8584e0b83b92296f69d
3
+ size 461767072