LLM performance is unstable

#6
by weisunding - opened

The gemma4 Claude Opus distill is unstable when coding: sometimes it does not follow the prompt, sometimes it repeats output, and sometimes it emits duplicated small words/tokens in an infinite loop. I am running it with https://github.com/TheTom/llama-cpp-turboquant
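For what it's worth, infinite token loops are often a sampling problem as much as a model problem. llama.cpp's `llama-server` exposes repetition-control flags that can damp them; a sketch below, reusing the `$F` model path from the launch script in this thread. The flags exist in llama.cpp, but the specific values are illustrative assumptions, not tuned recommendations:

```shell
# Hedged sketch: repetition controls for llama-server.
# --temp: nonzero temperature avoids deterministic greedy loops
# --repeat-penalty: penalizes recently generated tokens (>1.0 discourages repeats)
# --repeat-last-n: how many recent tokens the penalty looks back over
~/llama-cpp-turboquant/build/bin/llama-server -m "$F" \
    --temp 0.7 \
    --repeat-penalty 1.1 \
    --repeat-last-n 256
```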

What settings are you using? And is it only while coding?

By the way, v2 of this model is out with a fixed GGUF.

These are the commands I'm running. The chat template (tool calling) still has some issues too, but that's a llama-cpp issue.
Looking forward to your v2.

#!/bin/bash
F=/home/swei/.cache/huggingface/hub/models--TeichAI--gemma-4-31B-it-Claude-Opus-Distill-GGUF/snapshots/269191f9d73dead54ba941524d121e83fa61aa4c/gemma-4-31B-it-Claude-Opus-Distill.q8_0.gguf
~/llama-cpp-turboquant/build/bin/llama-server -m "$F" \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --host 0.0.0.0 \
    --port 8000 \
    --jinja \
    --chat-template-file chat_template.jinja
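For anyone debugging the tool-calling template, one way to probe it is through llama-server's OpenAI-compatible `/v1/chat/completions` endpoint (the port comes from the launch command above). A minimal sketch of building such a request; the `get_weather` tool is a hypothetical example, not part of this model:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema,
# which llama-server's --jinja chat templates render into the prompt.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example function
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "gemma-4-31B-it-Claude-Opus-Distill",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
}

# POST this JSON to http://localhost:8000/v1/chat/completions and check
# whether the response contains a well-formed tool_calls entry.
body = json.dumps(payload)
print(body[:60])
```

If the template is broken, the failure usually shows up as malformed or missing `tool_calls` in the response rather than a server error.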

It’s already out: TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2-GGUF

Aha, glad to chat with you live. I'll try it out now, thanks for your contribution.
BTW, I tested with the qwen3.5-27B Claude distill's chat template, which works better for tool calling, with fewer issues!

Here:
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled/tree/main

TeichAI org

Personally I like our 27B distill for complex agentic tasks, but thanks for the heads-up. It’s possible I made an error in tuning or in my recent chat-template adjustments. I will double-check later.
