Model outputting <|im_end|> tokens
I ran this model through my test that replicates heretic's eval by checking the first 100 tokens of each response against his refusal_markers list. I noticed there are a lot of <|im_end|> tokens in the model responses. For sanity check, I then against mradermacher/Llama-3.3-8B-Instruct-heretic-GGUF from aeon37 and the output from that looks normal. Any idea what could be causing this? Using the latest llama.cpp version: 7641 (da9b8d330). Logs:
mradermacher/Llama-3.3-8B-Opus-Z8-Heretic-GGUF - https://gist.github.com/kth8/c9cdc2bdca8d007b3e77140c8f9f17cb
mradermacher/Llama-3.3-8B-Instruct-heretic-GGUF - https://gist.github.com/kth8/36d891cc832f1cc6eb79cca144d2315d
Hi,
It looks like the issue is a chat template mismatch carried over from the original model, Daemontatox/Llama-Opus-Z8, where it's been trained to end assistant turns with the <|im_end|> token. That model uses a ChatML format.
If llama.cpp is run with the default Llama-3 template, it doesn’t treat <|im_end|> as a stop token, so the model emits it literally and it shows up in your first 100 tokens. The GGUF model you sanity checked with (quantized from aeon37/Llama-3.3-8B-Instruct-heretic) looks normal because it uses native Llama-3 formatting <|eot_id|>, which llama.cpp handles by default.
Try adding the following to your llama.cpp prompt:
--chat-template chatml -r "<|im_end|>"
Alright, adding your suggested arguments worked. Output now properly stopped. It also didn't refuse any prompts, #30 is always a false positive.
https://gist.github.com/kth8/f766fa14e93a2b5d8e6275cb2f8853e5