This is a requant of Bartowski's google_gemma-4-31B-it-Q8_0.gguf. I used his imatrix and requantized the model with IQ4_NL and TQ4_1S. I was searching for something like this but could not find anything, so I made it myself. In my testing it performed reasonably well, so I thought I would share it with others. Note that this will not work on stock llama.cpp.
You will need a build of this fork on the feature/turboquant-kv-cache branch: https://github.com/TheTom/llama-cpp-turboquant
I hope you like it. This is for testing purposes only; don't use it in production. I run it like this:
./llama-server \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v turbo3 \
  -m ~/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf
The size metrics shown on this page seem to be wrong; the file is 17 GB on my disk.
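Since this file only loads on the patched fork, it can save some debugging time to first confirm the download is at least a valid GGUF container. Below is a minimal sketch in Python that parses the fixed-size GGUF header (magic, version, tensor count, metadata KV count, per the public GGUF spec); the synthetic temp file is only there so the example is self-contained — point `read_gguf_header` at the real model file instead.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Parse the fixed GGUF header: 4-byte magic, u32 version, u64 tensor count, u64 KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))
        tensor_count, = struct.unpack("<Q", f.read(8))
        kv_count, = struct.unpack("<Q", f.read(8))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

# Demo with a synthetic header; with the real download you would instead call e.g.
# read_gguf_header(os.path.expanduser("~/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf"))
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3) + struct.pack("<Q", 42) + struct.pack("<Q", 7))
    demo_path = tmp.name

print(read_gguf_header(demo_path))  # -> {'version': 3, 'tensors': 42, 'metadata_kv': 7}
os.unlink(demo_path)
```

If the magic check fails, the download is truncated or corrupt, and no llama.cpp build (fork or otherwise) will load it.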
Model tree for RudiTheRude/google_gemma-4-31B-it-IQ4_NL-TQ4_1S.gguf
Base model: google/gemma-4-31B-it