`.cw` is output at the end of the reasoning content

#16
by owao - opened

Hey guys, using your UD-Q4_K_XL, I noticed the model sometimes ends its reasoning with ".cw". I can't really tell how it impacts the quality of the answer, but something seems wrong.

For now, I can only provide these 3 examples I managed to find again.

Example 1:
"Let's write the response.cw"

Example 2:
"Okay, let's write the response.cw"

Example 3:
"Okay, go.cw"

What I can say so far is:

  • I couldn't reproduce it reliably; I need to learn how to do greedy sampling in llama.cpp and work up from there (would, for example, removing all the samplers from the default chain top_k;typ_p;top_p;min_p;temperature except temperature, and then setting the temperature to 0, work?)
  • I didn't try another quant to see if it's specific to this UD-Q4_K_XL
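On the greedy-sampling question: if memory serves, llama-server treats `--temp 0` as greedy (argmax) decoding, which makes the remaining samplers irrelevant. The intuition is just that as the temperature drops, the softmax collapses onto the highest logit; a minimal sketch with hypothetical logit values (not taken from the model):

```python
import math

def sample_probs(logits, temperature):
    """Softmax over logits divided by temperature (toy sketch)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical token logits

# As temperature drops, almost all probability mass moves onto the argmax
# token, which is exactly what greedy decoding picks deterministically.
for t in (1.0, 0.1, 0.01):
    print(t, [round(p, 4) for p in sample_probs(logits, t)])
```

So identical greedy runs should give identical outputs, making one run per prompt enough for reproduction attempts.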

I can of course provide the prompts/full reasoning content if it can help.

Here are all the relevant parameters (it's just the official general reasoning ones) for which the 3 examples popped up:

      --temp 1
      --top-p 0.95
      --top-k 20
      --min-p 0
      --presence-penalty 1.5
      --repeat-penalty 1
      --flash-attn on
      --backend-sampling

This sometimes (not always) happens with UD_Q8_XL and UD_Q6_XL too (KoboldCpp + SillyTavern). I was wondering what it is, but it does not seem to hurt the output in any way, so I just ignore it.
There are other quirks (probably model, not quant, related), like sometimes it produces an answer after the end of the thinking block, then produces another thinking block and one more answer. But again, it happens rarely enough. That said, it is very smart with reasoning, at least in the Q6/Q8 I tried; I am quite impressed.

I just encountered it right now. This time it was a `Let's go.cw\n`.
Here are the last 6 reasoning-token emissions before switching to answer content:

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" Let"}}],"created":1772984527,"id":"chatcmpl-BpjCAxyJsZgcRqAUoA9eQuMAV1nKAqv9","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8240-d088d5b74","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"'s"}}],"created":1772984527,"id":"chatcmpl-BpjCAxyJsZgcRqAUoA9eQuMAV1nKAqv9","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8240-d088d5b74","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":" go"}}],"created":1772984527,"id":"chatcmpl-BpjCAxyJsZgcRqAUoA9eQuMAV1nKAqv9","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8240-d088d5b74","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":".c"}}],"created":1772984527,"id":"chatcmpl-BpjCAxyJsZgcRqAUoA9eQuMAV1nKAqv9","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8240-d088d5b74","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"w"}}],"created":1772984527,"id":"chatcmpl-BpjCAxyJsZgcRqAUoA9eQuMAV1nKAqv9","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8240-d088d5b74","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"\n"}}],"created":1772984527,"id":"chatcmpl-BpjCAxyJsZgcRqAUoA9eQuMAV1nKAqv9","model":"Qwen3.5-27B-UD-Q4_K_XL.gguf","system_fingerprint":"b8240-d088d5b74","object":"chat.completion.chunk"}
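For reference, stitching the stream back together is just a matter of concatenating the `reasoning_content` deltas from the `data:` lines; a minimal sketch using the payloads above (trimmed down to the relevant field):

```python
import json

# The five non-empty deltas from the stream above, trimmed down.
chunks = [
    'data: {"choices":[{"delta":{"reasoning_content":" Let"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":"\'s"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":" go"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":".c"}}]}',
    'data: {"choices":[{"delta":{"reasoning_content":"w"}}]}',
]

def join_reasoning(lines):
    """Concatenate reasoning_content deltas from SSE 'data:' lines."""
    parts = []
    for line in lines:
        payload = json.loads(line.removeprefix("data: "))
        delta = payload["choices"][0]["delta"]
        parts.append(delta.get("reasoning_content", ""))
    return "".join(parts)

print(join_reasoning(chunks))  # → " Let's go.cw"
```

Note how the trailing `.cw` arrives as two separate tokens, `.c` then `w`, rather than one.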

I am also trying the Q4_K_L quant from bartowski, and so far that one has never output cw there. So it may be something specific to unsloth quants.
That ".c" token looks strange to me; I guess that if ".c" is chosen instead of a simple ".", then the model needs to continue it somehow and produces this.

Interesting. Do you also use a presence penalty or not? I'm suspecting it, with ".c" then "w" being a workaround for the model to be able to end. I don't know how using ".c" then "w" could help it, but I don't see any reason it would use it otherwise. I'll experiment with disabling the presence penalty (keeping temp at 1 to hopefully avoid loops) and see. I can't download new quants for now, but I also have a bartowski one (IQ4_NL), so I'll see what I come back with. But this issue is kind of painful to replicate!
If it turns out you finally observe it with your bartowski quant too, would you mind posting back here @McUH? I'll do the same after some testing on my side.

owao changed discussion title from `cw` is output at the end of the reasoning content to `.cw` is output at the end of the reasoning content

Update: Bartowski's IQ4_NL suffers from the same issue, 2/50 runs. Now I need to try tweaking the sampling params to see where that leads.
Update 2: still using Bartowski's IQ4_NL, lowering the presence penalty to 0.5 --> again, 2/50

Could be related to sampling. Normally I would not use a presence penalty, but here I use it because it was recommended, and I think I had worse results without it (not with cw, but with the reasoning/output). My samplers are a mix of what was recommended and what worked better for me:

Temp: 0.7, TopK: 80, TopP: 0.95, Presence Penalty: 1.5, Smoothing: 0.23, DRY: 0.8/1.75/5/0

If it is 2/50, maybe the Q4_K_L does it too; I did not run enough samples to be sure with such a low probability of occurrence.
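To put the 2/50 numbers in perspective: if the true occurrence rate really is around 4%, a 50-run batch will show zero hits fairly often, so not seeing it in a batch is weak evidence that a quant is unaffected. A quick back-of-the-envelope check:

```python
def prob_zero_hits(rate, runs):
    """Probability of seeing 0 occurrences in `runs` independent samples."""
    return (1 - rate) ** runs

# Assuming a true rate of 4% (i.e. 2/50), roughly 13% of 50-run batches
# would show no occurrence at all.
print(round(prob_zero_hits(0.04, 50), 2))  # → 0.13
```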

I'm going to try your exact set of params for 50 samples too; I'll update here, thanks!

Wait @McUH, what do you call smoothing? Maybe you are using another backend than the regular llama-server one? If so, which one? I can then look up what it corresponds to.

I use SillyTavern; I think it is called quadratic smoothing or something. I got .cw in Q4_K_L now too, and I also tried AesSedai's Q5_K_M quant of 122BA10, which does it as well, so it seems general for Qwen 3.5.
My theory is that ".c" is a result of all new models over-fitting for STEM/coding: that token does not make much sense for language, but I think it is learned as a filename suffix, as part of ".com", ".csv", ".c" itself (C source), etc. A big presence penalty may penalize the single "." while ".c" probably has a relatively high chance because of that training on files/file suffixes, and once the model chooses ".c", maybe ".cw" is the most common continuation in its training data or something.
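Whether or not it is the actual cause here, the presence-penalty half of that theory is easy to illustrate: an OpenAI-style presence penalty subtracts a flat amount from the logit of every token that has already appeared, so a frequently used "." can drop below a rarer ".c" merge. A toy sketch with made-up logits (not real model values):

```python
def apply_presence_penalty(logits, seen, penalty):
    """Subtract `penalty` from the logit of every token already seen."""
    return {tok: (l - penalty if tok in seen else l) for tok, l in logits.items()}

# Hypothetical logits: "." normally wins, ".c" is a close runner-up.
logits = {".": 5.0, ".c": 4.2, " go": 3.0}
seen = {".", " go"}  # tokens already emitted earlier in the reasoning

adjusted = apply_presence_penalty(logits, seen, penalty=1.5)
print(max(adjusted, key=adjusted.get))  # ".c" now outranks the penalized "."
```

That said, the 2/50 rate persisting even at a 0.5 penalty suggests the penalty is at most part of the story.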
Well, it does not really hurt, and since it is in the thinking block, I usually do not include it in further prompts. But I suppose it is something to look out for if one tries to generate a training set from Qwen 3.5 outputs.

I'll try to find out whether there is an equivalent to smoothing in llama.cpp.
Unfortunately, I don't think your hypothesis can stand, because the model doesn't produce the next token from only the previous one; it builds on the whole context it has.
Maybe you are right and it doesn't hurt, but ah... we can't really know; that's hard to tell :/

A bit off topic, but I'm wondering if 35B-A3B does this too. It will be quick to test; I'll do it soon. I'll also retry 27B with a 0 presence penalty to get to the bottom of it. I guess we'll observe exactly the same thing, but I'll try and report back.

But meanwhile, if anyone passing by has the possibility to look for this behavior in the unquantized BF16 model, that would be helpful! It could reveal a potential llama.cpp issue.

Both the BF16 version and EXL3 quants (any bitrate) do the same thing. I think they just accidentally trained the model to do this.

Ah! Thanks for the feedback, mate! Happy to hear it's not on the llama.cpp side. Nevertheless, it is intriguing! I'm going to ask at the official repo then; I'll link this thread.

owao changed discussion status to closed

A bit out of topic but I'm wondering if 35B-A3B does this too.

Yes, it does.

Oh! I don't know if the Qwen team spotted it, but I guess it's valuable information!

It's a quirk of the model...
