EXPERIMENTAL, AVOID DEPLOYING TO SENSITIVE ENVIRONMENTS


Gemma-4-96E-A4B-Heretic-TQ GGUF

TurboQuant GGUF builds of blascotobasco/Gemma-4-96E-A4B-Heretic for the TurboQuant llama.cpp branch.

These files were refreshed on 2026-04-08 so that the embedded tokenizer.chat_template matches the updated Gemma 4 Interleaved template. The published TQ GGUFs now work in reasoning mode on the linked branch without requiring --chat-template-file.

Files

| File | Size | Notes |
| --- | --- | --- |
| Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf | 12589895200 bytes (11.73 GiB) | TQ3_1S, updated embedded interleaved template |
| Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf | 13985279520 bytes (13.02 GiB) | TQ4_1S, updated embedded interleaved template |
| chat_template.jinja | | standalone Gemma 4 Interleaved template matching the embedded GGUF template |
| requant_recipe_tq3_1s.txt | | tensor override recipe for TQ3_1S |
| requant_recipe_tq4_1s.txt | | tensor override recipe for TQ4_1S |
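
A truncated or corrupted download otherwise only fails later at model load time, so it can be worth checking the downloaded files against the byte sizes in the table first. The helper names below (`bytes_to_gib`, `check_size`) are illustrative, not part of this repo:

```shell
# Convert a byte count to GiB for comparison with the table above.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%.2f", b / 1073741824 }'
}

# Check that a downloaded file matches its expected size in bytes.
# $1: path to the GGUF file, $2: expected size in bytes
check_size() {
  actual=$(wc -c < "$1")
  [ "$actual" -eq "$2" ]
}
```

Example usage, with the sizes from the table: `check_size Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf 13985279520`.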

Runtime

Use this exact llama.cpp branch: https://github.com/iamwavecut/llama-cpp-turboquant.git, branch feature/turboquant-kv-cache.

Related project documentation:

Build explicitly from that branch:

git clone https://github.com/iamwavecut/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -S . -B build
cmake --build build --config Release -j

Stock upstream llama.cpp is not the supported runtime for these TurboQuant GGUFs.
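
Since only that branch is supported, a small guard before kicking off a long build can catch an accidental checkout of the wrong branch. The function name `on_required_branch` is illustrative:

```shell
# The branch these GGUFs require, per the build instructions above.
required_branch=feature/turboquant-kv-cache

# Succeed only if the given branch name matches the required branch.
# $1: branch name, e.g. from `git rev-parse --abbrev-ref HEAD`
on_required_branch() {
  [ "$1" = "$required_branch" ]
}
```

Example usage: `on_required_branch "$(git rev-parse --abbrev-ref HEAD)" || echo "wrong branch" >&2`.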

Template and reasoning

  • The default embedded template is Gemma 4 Interleaved.
  • chat_template.jinja in this repo matches the embedded template.
  • The 2026-04-08 refresh adds a safe fallback system message when --reasoning on is enabled and no explicit -sys prompt is provided.
  • Because of that change, --chat-template-file is no longer required for normal reasoning usage on the supported branch.
  • Tensor data was not requantized again for this refresh; the published files were repacked to update GGUF metadata and embedded template content.
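
The fallback behavior described above can be sketched as the following pseudologic. This is an illustration of the decision, not the branch's actual code, and the fallback wording is invented for the example:

```shell
# Illustrative sketch of the 2026-04-08 fallback rule: when --reasoning on
# is set and no -sys prompt was supplied, a safe default system message is
# injected; otherwise the user's prompt (possibly empty) is used as-is.
# $1: reasoning mode ("on"/"off"), $2: user-supplied -sys prompt (may be empty)
build_system_prompt() {
  reasoning=$1
  sys=$2
  if [ "$reasoning" = "on" ] && [ -z "$sys" ]; then
    echo "You are a helpful assistant."  # placeholder text, not the real fallback
  else
    echo "$sys"
  fi
}
```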

Recommended launch commands

TQ4_1S

Reasoning enabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning on

Reasoning disabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ4_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo4 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

TQ3_1S

Reasoning enabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo3 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning on

Reasoning disabled:

/path/to/llama-cli \
  -m /path/to/Gemma-4-96E-A4B-Heretic-TQ3_1S.gguf \
  -ngl 99 \
  -fa on \
  -ctk q8_0 \
  -ctv turbo3 \
  -c 8192 \
  -cnv \
  --jinja \
  --reasoning off \
  --reasoning-format none

Notes

  • If you already downloaded an older copy of these TQ files and reasoning mode failed without an external template override, download the refreshed files from this repo.
  • Q8_0 is the clean repacked source checkpoint used locally to produce the published TurboQuant requants, but it is not distributed from this repo.
  • SQ variants are intentionally not published here.

Credits

  • Base model author: blascotobasco
  • TurboQuant runtime / GGUF work based on llama.cpp and the linked TurboQuant branch
Model details

  • Model size: 20B params
  • Architecture: gemma4