This is a requant of Bartowski's google_gemma-4-31B-it-Q8_0.gguf. I used his imatrix and requantized the model with IQ4_NL and TQ4_1S. I was searching for something like this but could not find anything, so I made it myself. In my testing it performed reasonably well, so I thought I would share it with others. Note that this will not work on stock llama.cpp.
You will need a build of this fork on the feature/turboquant-kv-cache branch: https://github.com/TheTom/llama-cpp-turboquant
I hope you like it. This is for testing purposes only; don't use it in production. I run it like this:
./llama-server \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v turbo3 \
  -m ~/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf
The size metrics shown on this page seem to be wrong; the file is 17 GB on my disk.
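Since this file only loads on the patched fork, it can save some debugging time to first confirm the download is at least a valid GGUF container. Below is a minimal sketch in Python that parses the fixed-size GGUF header (magic, version, tensor count, metadata KV count, per the public GGUF spec); the synthetic temp file is only there so the example is self-contained — point `read_gguf_header` at the real model file instead.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Parse the fixed GGUF header: 4-byte magic, u32 version, u64 tensor count, u64 KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, = struct.unpack("<Q", f.read(8))
        kv_count, = struct.unpack("<Q", f.read(8))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Demo with a synthetic header; with the real download you would instead call e.g.
# read_gguf_header(os.path.expanduser("~/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf"))
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3) + struct.pack("<Q", 42) + struct.pack("<Q", 7))
    demo_path = tmp.name

print(read_gguf_header(demo_path))  # -> {'version': 3, 'tensors': 42, 'metadata_kv': 7}
os.unlink(demo_path)
```

If the magic check fails, the download is truncated or corrupt, and no llama.cpp build (fork or otherwise) will load it.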
Model tree for RudiTheRude/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf
Base model: google/gemma-4-31B-it