LLM performance is unstable

#6
by weisunding - opened

The gemma4 Claude Opus distill is unstable when coding: sometimes it does not follow the prompt, sometimes it repeats output, and sometimes it emits duplicated small words/tokens in an infinite loop. I am running it with https://github.com/TheTom/llama-cpp-turboquant
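For what it's worth, infinite token loops are often a sampling problem as much as a model problem. llama.cpp's `llama-server` exposes repetition-control flags that can damp them; a sketch below, reusing the `$F` model path from the launch script in this thread. The flags exist in llama.cpp, but the specific values are illustrative assumptions, not tuned recommendations:

```shell
# Hedged sketch: repetition controls for llama-server.
# --temp: nonzero temperature avoids deterministic greedy loops
# --repeat-penalty: penalizes recently generated tokens (>1.0 discourages repeats)
# --repeat-last-n: how many recent tokens the penalty looks back over
~/llama-cpp-turboquant/build/bin/llama-server -m "$F" \
    --temp 0.7 \
    --repeat-penalty 1.1 \
    --repeat-last-n 256
```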

What settings are you using? And is it only while coding?

By the way, v2 of this model is out with a fixed GGUF.

These are the commands I'm running. The chat template (tool calling) still has some issues too, but that's a llama-cpp issue.
Looking forward to your v2.

#!/bin/bash
F=/home/swei/.cache/huggingface/hub/models--TeichAI--gemma-4-31B-it-Claude-Opus-Distill-GGUF/snapshots/269191f9d73dead54ba941524d121e83fa61aa4c/gemma-4-31B-it-Claude-Opus-Distill.q8_0.gguf
~/llama-cpp-turboquant/build/bin/llama-server -m "$F" \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --host 0.0.0.0 \
    --port 8000 \
    --jinja \
    --chat-template-file chat_template.jinja
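For anyone debugging the tool-calling template, one way to probe it is through llama-server's OpenAI-compatible `/v1/chat/completions` endpoint (the port comes from the launch command above). A minimal sketch of building such a request; the `get_weather` tool is a hypothetical example, not part of this model:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema,
# which llama-server's --jinja chat templates render into the prompt.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example function
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "gemma-4-31B-it-Claude-Opus-Distill",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
}

# POST this JSON to http://localhost:8000/v1/chat/completions and check
# whether the response contains a well-formed tool_calls entry.
body = json.dumps(payload)
print(body[:60])
```

If the template is broken, the failure usually shows up as malformed or missing `tool_calls` in the response rather than a server error.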

It’s already out: TeichAI/gemma-4-31B-it-Claude-Opus-Distill-v2-GGUF

Aha, glad to chat with you live. I'll try it out now, thanks for your contribution.
BTW, I tested with the qwen3.5-27B Claude distill's chat template, which works better for tool calling, with fewer issues!

Here:
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled/tree/main

TeichAI org

Personally I like our 27B distill for complex agentic tasks, but thanks for the heads-up. It’s possible I made an error in tuning or in my recent chat-template adjustments. I will double-check later.
