Qwen3.6 seems to fake records too often

#18
by anitman - opened

The model's hallucination rate is extremely high.

The Test Case: Hermes Agent Logic
I compared a Q8_0 quantization of Qwen against a UD_Q4_K_XL quantization of MiniMax. The task was simple:

Setup: In a Hermes Agent session, the model is instructed to read a directory based on a JSON record.

Action: Depending on its "mood," it must select and send a specific GIF from that directory according to the JSON record.
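
For reference, the task boils down to roughly the sketch below. The record layout (a `gif_dir` key plus a `moods` mapping), the file names, and the `pick_gif` helper are assumptions made up for illustration, not the actual Hermes Agent schema. The point is that every step can be checked against what is really on disk, so a made-up record or GIF fails early instead of reaching the backend.

```python
import json
from pathlib import Path


def pick_gif(record_path: Path, mood: str) -> Path:
    """Return the GIF mapped to `mood`, refusing to guess when anything is missing."""
    # Hypothetical record layout: {"gif_dir": "gifs", "moods": {"happy": "happy.gif"}}
    if not record_path.is_file():
        raise FileNotFoundError(f"record not found: {record_path}")
    record = json.loads(record_path.read_text(encoding="utf-8"))

    # Resolve a relative GIF directory against the record's own location
    gif_dir = Path(record["gif_dir"])
    if not gif_dir.is_absolute():
        gif_dir = record_path.parent / gif_dir

    moods = record["moods"]
    if mood not in moods:
        raise KeyError(f"no GIF recorded for mood {mood!r}")

    # Only return a path that actually exists on disk
    gif_path = gif_dir / moods[mood]
    if not gif_path.is_file():
        raise FileNotFoundError(f"GIF listed in record does not exist: {gif_path}")
    return gif_path
```

A model that invents a record or a file name would hit one of these errors, which matches the backend failures described below.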

The Results
MiniMax-M2.7 (MoE): Despite being an MoE model with only about 10B active parameters and running at a lower-precision quantization (Q4), it never failed this task.

Qwen 3.6 27B: Even at the higher precision (Q8), it failed about half of the time. When it failed, it "hallucinated" a JSON record that didn't exist and then tried to send a non-existent GIF based on that imaginary record, resulting in backend errors in the Hermes Agent.

Same instructions in the memory record, very different results.
Really annoying.

It seems that the model was never trained with a specific constitution as a guideline, even though Anthropic has published the Constitutional AI training method.

anitman changed discussion title from Qwen3.6 seem to to Qwen3.6 seem to fake record too often

Try an officially vetted model from Alibaba. The quants need to be made correctly, or they could interfere with the new architecture in these models. If you don’t have the hardware to support that, you can try OpenRouter.

What about the FP16 version?

Try an officially vetted model from Alibaba. The quants need to be made correctly, or they could interfere with the new architecture in these models. If you don’t have the hardware to support that, you can try OpenRouter.

That's not the case, actually. I've tried the official implementation via vLLM in WSL, as well as the Unsloth quants from Q8_0 to UD_Q4_XL. Infinite tool-calling loops and fake tool calls happened all the time during agentic use. None of this happened with MiniMax M2.7. Qwen fails to create a simple cron job in Hermes Agent that checks ticket availability every 20 minutes: it kept trying to add the schedule to the exec and failed every time. This is a simple task, to be honest, and the problem was solved instantly by switching to MiniMax.
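
For context, "every 20 minutes" in standard five-field cron syntax is just `*/20` in the minute field. Here is a minimal sketch of building such a crontab line. The `crontab_line` helper and the script path are hypothetical placeholders, and this is not how Hermes Agent registers jobs internally.

```python
def crontab_line(every_minutes: int, command: str) -> str:
    """Build a standard five-field crontab entry that runs `command` every N minutes."""
    if not 1 <= every_minutes <= 59 or 60 % every_minutes != 0:
        # */N in the minute field only gives an even interval when N divides 60
        raise ValueError("interval must be 1-59 and divide 60 evenly")
    return f"*/{every_minutes} * * * * {command}"


# Hypothetical script path, for illustration only
print(crontab_line(20, "/path/to/check_tickets.sh"))
# prints: */20 * * * * /path/to/check_tickets.sh
```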

What about the FP16 version?

I'm afraid it's not related to quantization. Quantization can't explain why a Q8_0 quant has far more tool-calling errors than a Q4_K_M quant; that doesn't make any sense.
