EXPERIMENTAL, AVOID DEPLOYING TO SENSITIVE ENVIRONMENTS
# Gemma-4-96E-A4B-Heretic-TQ GGUF
TurboQuant GGUF builds of blascotobasco/Gemma-4-96E-A4B-Heretic for the TurboQuant llama.cpp branch.

These files were refreshed on 2026-04-08 so that the embedded `tokenizer.chat_template` matches the updated Gemma 4 Interleaved template. The published TQ GGUFs now work in reasoning mode on the linked branch without requiring `--chat-template-file`.
## Files

| File | Size | Notes |
|---|---|---|
| Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf | 12589895200 bytes (11.73 GiB) | TQ3_1S, updated embedded interleaved template |
| Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf | 13985279520 bytes (13.02 GiB) | TQ4_1S, updated embedded interleaved template |
| chat_template.jinja | | standalone Gemma 4 Interleaved template matching the embedded GGUF template |
| requant_recipe_tq3_1s.txt | | tensor override recipe for TQ3_1S |
| requant_recipe_tq4_1s.txt | | tensor override recipe for TQ4_1S |
## Runtime

Use this exact llama.cpp branch: `feature/turboquant-kv-cache` of `iamwavecut/llama-cpp-turboquant`. Build explicitly from that branch:

```shell
git clone https://github.com/iamwavecut/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -S . -B build
cmake --build build --config Release -j
```

Stock upstream llama.cpp is not the supported runtime for these TurboQuant GGUFs.
## Template and reasoning

- The default embedded template is Gemma 4 Interleaved. `chat_template.jinja` in this repo matches the embedded template.
- The 2026-04-08 refresh adds a safe fallback system message when `--reasoning on` is enabled and no explicit system prompt is provided.
- Because of that change, `--chat-template-file` is no longer required for normal reasoning usage on the supported branch.
- Tensor data was not requantized for this refresh; the published files were repacked to update GGUF metadata and embedded template content.
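The fallback behavior described above can be sketched in plain Python (this is an illustration of the described logic, not the branch's actual implementation; the default message text is a placeholder):

```python
# Placeholder text; the real fallback message lives in the branch's template.
DEFAULT_REASONING_SYSTEM = "You are a helpful assistant."

def apply_reasoning_fallback(messages: list[dict], reasoning_on: bool) -> list[dict]:
    """Prepend a fallback system message when reasoning is on and none is set."""
    has_system = any(m.get("role") == "system" for m in messages)
    if reasoning_on and not has_system:
        return [{"role": "system", "content": DEFAULT_REASONING_SYSTEM}] + messages
    return messages

msgs = [{"role": "user", "content": "Hi"}]
print(apply_reasoning_fallback(msgs, True)[0]["role"])   # system
print(apply_reasoning_fallback(msgs, False)[0]["role"])  # user
```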
## Recommended launch commands

### TQ4_1S

Reasoning enabled:

```shell
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning on
```

Reasoning disabled:

```shell
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none
```
### TQ3_1S

Reasoning enabled:

```shell
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo3 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning on
```

Reasoning disabled:

```shell
/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo3 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none
```
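The two launch variants differ only in the trailing reasoning flags. A small illustrative helper (not part of any tool here) that returns the flags to append for each mode:

```python
def reasoning_flags(enabled: bool) -> list[str]:
    """Return the reasoning-related llama-cli flags used in the commands above."""
    if enabled:
        return ["--reasoning", "on"]
    # Disabled mode also suppresses reasoning formatting, as in the commands above.
    return ["--reasoning", "off", "--reasoning-format", "none"]

print(" ".join(reasoning_flags(True)))   # --reasoning on
print(" ".join(reasoning_flags(False)))  # --reasoning off --reasoning-format none
```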
## Notes

- If you already downloaded an older copy of these TQ files and reasoning mode failed without an external template override, download the refreshed files from this repo.
- Q8_0 is the clean repacked source checkpoint used locally to produce the published TurboQuant requants; it is not distributed from this repo.
- SQ variants are intentionally not published here.
## Credits

- Base model author: blascotobasco
- TurboQuant runtime / GGUF work based on llama.cpp and the linked TurboQuant branch
## Model tree for WaveCut/Gemma-4-96E-A4B-Heretic-TQ

- Base model: google/gemma-4-26B-A4B-it
- Finetuned: coder3101/gemma-4-26B-A4B-it-heretic
- Finetuned: blascotobasco/Gemma-4-96E-A4B-Heretic