This is a requant of Bartowski's google_gemma-4-31B-it-Q8_0.gguf. I used his imatrix and requantized the model with IQ4_NL and TQ4_1S. I was searching for something like this but did not find anything, so I made it myself. In my testing it performed quite okay, so I thought I would share it with others. Note that this will not work on stock llama.cpp.
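For reference, the requant step looked roughly like this. This is a sketch only: the imatrix filename is illustrative, and using --output-tensor-type to get the TQ4_1S part of the mix is my assumption (TQ4_1S only exists in the fork's llama-quantize). --allow-requantize is needed because the Q8_0 source is itself already quantized.

# Sketch of the requant; exact flags and filenames may differ.
./llama-quantize \
  --allow-requantize \
  --imatrix google_gemma-4-31B-it.imatrix \
  --output-tensor-type TQ4_1S \
  google_gemma-4-31B-it-Q8_0.gguf \
  google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf \
  IQ4_NL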

You will need a build of this fork on the feature/turboquant-kv-cache branch: https://github.com/TheTom/llama-cpp-turboquant
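The fork builds the same way as upstream llama.cpp, so something along these lines should work (standard CMake flow; add backend flags such as -DGGML_CUDA=ON if you need them):

git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
cmake -B build
cmake --build build --config Release
# binaries end up in build/bin/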

I hope you like it. It is for testing purposes only; don't use it in production. I run it like this:

./llama-server \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v turbo3 \
  -m ~/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf
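
Once the server is up (llama-server listens on port 8080 by default), a quick way to check that it works is its standard OpenAI-compatible endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello."}]}'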

The size metrics shown on this page seem to be wrong. The file is 17 GB on my disk, which is about what you would expect for 31B params at roughly 4.5 bits per weight (31e9 × 4.5 / 8 ≈ 17.4 GB).
