EXPERIMENTAL, AVOID DEPLOYING TO SENSITIVE ENVIRONMENTS


Gemma-4-96E-A4B-Heretic-TQ GGUF

TurboQuant GGUF builds of blascotobasco/Gemma-4-96E-A4B-Heretic for the TurboQuant llama.cpp branch.

These files were refreshed on 2026-04-08 so that the embedded tokenizer.chat_template matches the updated Gemma 4 Interleaved template. The published TQ GGUFs now work in reasoning mode on the linked branch without requiring --chat-template-file.

Files

| File | Size | Notes |
| --- | --- | --- |
| Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf | 12589895200 bytes (11.73 GiB) | TQ3_1S, updated embedded interleaved template |
| Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf | 13985279520 bytes (13.02 GiB) | TQ4_1S, updated embedded interleaved template |
| chat_template.jinja | | standalone Gemma 4 Interleaved template matching the embedded GGUF template |
| requant_recipe_tq3_1s.txt | | tensor override recipe for TQ3_1S |
| requant_recipe_tq4_1s.txt | | tensor override recipe for TQ4_1S |
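
A truncated or corrupted download otherwise only fails later at model load time, so it can be worth checking the downloaded files against the byte sizes in the table first. The helper names below (`bytes_to_gib`, `check_size`) are illustrative, not part of this repo:

```shell
# Convert a byte count to GiB for comparison with the table above.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f", b / 1073741824 }'
}

# Check that a downloaded file matches its expected size in bytes.
# $1: path to the GGUF file, $2: expected size in bytes
check_size() {
  actual=$(wc -c < "$1")
  [ "$actual" -eq "$2" ]
}
```

Example usage, with the sizes from the table: `check_size Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf 13985279520`.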

Runtime

Use this exact llama.cpp branch: https://github.com/iamwavecut/llama-cpp-turboquant.git, branch feature/turboquant-kv-cache.

Related project documentation:

Build explicitly from that branch:

git clone https://github.com/iamwavecut/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -S . -B build
cmake --build build --config Release -j

Stock upstream llama.cpp is not the supported runtime for these TurboQuant GGUFs.
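
Since only that branch is supported, a small guard before kicking off a long build can catch an accidental checkout of the wrong branch. The function name `on_required_branch` is illustrative:

```shell
# The branch these GGUFs require, per the build instructions above.
required_branch=feature/turboquant-kv-cache

# Succeed only if the given branch name matches the required branch.
# $1: branch name, e.g. from `git rev-parse --abbrev-ref HEAD`
on_required_branch() {
  [ "$1" = "$required_branch" ]
}
```

Example usage: `on_required_branch "$(git rev-parse --abbrev-ref HEAD)" || echo "wrong branch" >&2`.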

Template and reasoning

  • The default embedded template is Gemma 4 Interleaved.
  • chat_template.jinja in this repo matches the embedded template.
  • The 2026-04-08 refresh adds a safe fallback system message when --reasoning on is enabled and no explicit -sys prompt is provided.
  • Because of that change, --chat-template-file is no longer required for normal reasoning usage on the supported branch.
  • Tensor data was not requantized again for this refresh; the published files were repacked to update GGUF metadata and embedded template content.
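
The fallback behavior described above can be sketched as the following pseudologic. This is an illustration of the decision, not the branch's actual code, and the fallback wording is invented for the example:

```shell
# Illustrative sketch of the 2026-04-08 fallback rule: when --reasoning on
# is set and no -sys prompt was supplied, a safe default system message is
# injected; otherwise the user's prompt (possibly empty) is used as-is.
# $1: reasoning mode ("on"/"off"), $2: user-supplied -sys prompt (may be empty)
build_system_prompt() {
  reasoning=$1
  sys=$2
  if [ "$reasoning" = "on" ] && [ -z "$sys" ]; then
    echo "You are a helpful assistant."  # placeholder text, not the real fallback
  else
    echo "$sys"
  fi
}
```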

Recommended launch commands

TQ4_1S

Reasoning enabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning on

Reasoning disabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

TQ3_1S

Reasoning enabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo3 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning on

Reasoning disabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo3 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

Notes

  • If you already downloaded an older copy of these TQ files and reasoning mode failed without an external template override, download the refreshed files from this repo.
  • Q8_0 is the clean repacked source checkpoint used locally to produce the published TurboQuant requants, but it is not distributed from this repo.
  • SQ variants are intentionally not published here.

Credits

  • Base model author: blascotobasco
  • TurboQuant runtime / GGUF work based on llama.cpp and the linked TurboQuant branch
Model details

  • Model size: 20B params
  • Architecture: gemma4