Do NOT use CUDA 13.2
Hey guys, please do not use CUDA 13.2 to run any quantized model or GGUF. Using CUDA 13.2 can lead to gibberish or otherwise incorrect outputs, and tool calling may break on all models, including Gemma 4 and GLM-5.1.
We’ve confirmed this internally, and the issue has also been reported in llama.cpp by 30+ users. This is not an Unsloth-GGUF-specific issue. See here.
We notified NVIDIA 5–6 days ago, but the issue does not appear to be fixed yet. This may explain why some of you have been seeing wildly different results with Gemma 4 or with quants in general. It may also explain why some GGUFs seem broken in llama.cpp, leading people to assume it’s a quant/GGUF problem (it's not), while the same models work fine in Unsloth Studio, Ollama, or LM Studio.
For now, you can:
- use our precompiled llama.cpp binary, which uses CUDA 13,
- use Unsloth Studio, which does not use CUDA 13.2, or
- use any CUDA version lower than 13.2.
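If you're not sure which CUDA toolkit your build will pick up, a quick sanity check is to parse the release number out of `nvcc --version` before compiling. A minimal sketch (the `sample` string below stands in for real `nvcc --version` output, which you'd pipe in instead):

```shell
# Parse the CUDA release number from nvcc's version banner.
# On a real machine, replace the sample string with: nvcc --version
sample='Cuda compilation tools, release 13.2, V13.2.91'
ver=$(printf '%s' "$sample" | grep -o 'release [0-9.]*' | awk '{print $2}')
echo "Detected CUDA toolkit: $ver"

# Warn if the affected version is on PATH.
if [ "$ver" = "13.2" ]; then
  echo "CUDA 13.2 detected - use a lower CUDA version for quantized models"
fi
```

This only checks the toolkit on your PATH; the driver's runtime version can differ, so check `nvidia-smi` as well if you build against a system-wide CUDA install.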
Thanks so much and let me know if you have any questions! :)