LLM performance is unstable
The gemma4 claude opus distilled is unstable when coding: sometimes it does not follow the prompt, sometimes it repeats content, and sometimes it outputs the same small words/tokens in an infinite loop. I am using https://github.com/TheTom/llama-cpp-turboquant
What settings are you using? And is it only while coding?
By the way v2 of this model is out with a fixed GGUF
I'm running it with the command below. The chat template (tool calling) still has some issues too, but that's a llama-cpp issue.
Looking forward to your v2.
#!/bin/bash
F=/home/swei/.cache/huggingface/hub/models--TeichAI--gemma-4-31B-it-Claude-Opus-Distill-GGUF/snapshots/269191f9d73dead54ba941524d121e83fa61aa4c/gemma-4-31B-it-Claude-Opus-Distill.q8_0.gguf
~/llama-cpp-turboquant/build/bin/llama-server -m "$F" \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --host 0.0.0.0 \
  --port 8000 \
  --jinja \
  --chat-template-file chat_template.jinja
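For what it's worth, the infinite token loops can sometimes be tamed with sampling settings. A hedged sketch of the same launch command with standard upstream llama.cpp sampling flags added (the specific values here are guesses, not tested recommendations, and availability may differ in the turboquant fork):

```shell
#!/bin/bash
# Same model path as above, with hypothetical anti-repetition sampling flags:
#   --temp 0.7           lower temperature tends to reduce drift while coding
#   --repeat-penalty 1.1 penalizes recently generated tokens
#   --repeat-last-n 256  window (in tokens) the repeat penalty looks back over
F=/home/swei/.cache/huggingface/hub/models--TeichAI--gemma-4-31B-it-Claude-Opus-Distill-GGUF/snapshots/269191f9d73dead54ba941524d121e83fa61aa4c/gemma-4-31B-it-Claude-Opus-Distill.q8_0.gguf
~/llama-cpp-turboquant/build/bin/llama-server -m "$F" \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8000 \
  --jinja --chat-template-file chat_template.jinja \
  --temp 0.7 --repeat-penalty 1.1 --repeat-last-n 256
```

These can also be set per-request via the server's OpenAI-compatible API instead of at launch, which makes it easier to A/B test whether the loops are a sampling issue or a model issue.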
It’s already out: TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2-GGUF
Aha, it's great to chat with you live. I'll try it out now, thanks for your contribution.
BTW, I tested with the qwen3.5-27B claude distilled chat template, which works better for tool calling, fewer issues!
Here:
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled/tree/main
Personally I like our 27B distill for complex agentic tasks, but thanks for the heads-up. It's possible I made an error in tuning or in my recent chat template adjustments. I'll double-check later.