inference throughput drops by 80% with this template

#28

by froilo - opened 4 days ago

inference throughput drops by 80% with this template

      /llama_mtp/llama.cpp/build/bin/llama-server
      -m /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_S.gguf
      --host 0.0.0.0
      --port 8080
      -c 32768
      --cache-type-k q8_0
      --cache-type-v q8_0
      --temp 0.7
      --top-p 0.95
      --top-k 20
      --presence-penalty 0.0
      --min-p 0.00
      --spec-type draft-mtp
      --spec-draft-n-max 6
      --spec-draft-p-min 0.75
      --jinja
      --chat-template-file /models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/froggeric__Qwen-Fixed-Chat-Templates.jinja

szwedek

4 days ago

Try these params:
--chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2.

spec-draft-n-max 6 - that's not optimal for this model.

froilo

3 days ago

•

edited 3 days ago

Try these params:
--chat-template-kwargs "{\"preserve_thinking\":true}" --spec-type draft-mtp --spec-draft-n-max 2.

spec-draft-n-max 6 - that's not optimal for this model.

thx it works
it may be even about 8-10% faster than my prior MTP setup without the template (not enough data though)

also havent tested agentic flow yet

szwedek

3 days ago

This might be helpful to understand mtp params.

froilo

3 days ago

thats why ive been using --spec-draft-n-max 6

szwedek

3 days ago

•

edited 3 days ago

Your quant had the best perf between 1-2.

Edit: srry, unsloth suggested that value somewhere else

Interpause

about 11 hours ago

•

edited about 11 hours ago

they changed their suggestion to 2 on that page and that you should benchmark yourself

szwedek

about 11 hours ago

Agree - https://unsloth.ai/docs/models/qwen3.6#mtp-benchmarks

froggeric

Owner about 11 hours ago

•

edited about 11 hours ago

I have done extensive tests with real world scenarios, not artificial speed and perplexity measuring which does not tell the whole story. Usually 3 draft tokens is the sweet spot. For 16bit quants 4 is best. For extreme low quants it gets pulled down the other way, and 2 might be more beneficial.

However, I do not recommend using speculative decoding with MoE of small experts. The benefits are minimal at best, might even be slower, but worse, your preprocessing will be much slower. MTP speculative decoding is meant for dense model, or MoE of medium to large experts; the larger the better.

froilo

about 11 hours ago

they changed their suggestion to 2 on that page and that you should benchmark yourself

who unsloth , llamacpp or froggeric?

i have found out to get best tok rate (and coding results)
with just --spec-type draft-mtp --spec-draft-n-max 2 and no other mtp related params!

szwedek

about 11 hours ago

they changed their suggestion to 2 on that page and that you should benchmark yourself

who unsloth , llamacpp or froggeric?

i have found out to get best tok rate (and coding results)
with just --spec-type draft-mtp --spec-draft-n-max 2 and no other mtp related params!

Unsloth - i shared a link to the benchmark

ManyOtherFunctions

about 9 hours ago

•

edited about 9 hours ago

Hmm so far I found MTP or speculative decoding made qwen significantly dumber than before in agentic AI tasks. I use it for heavy stuff all the time, so... as in a my brain is tired and I dont wanna think right now, do this for me type of deal. Regular Qwen 3.6 27B does OK. Not great, but OK. MTP started hallucinating left and right! The old non MTP Qwen was correctly navigating forum posts with browser MCP tools, MTP qwen on the other hand hallucinated a nonexistent forum thread, and kept going in circles trying to navigate.

MTP ain't it.

(The task I had set out for it was go through this forum thread, 83 pages in total, in chronological order started from first page, and write down all the source code and snippets people posted. as you go through the pages, update the existing notes based on what's newly discovered and what no longer works. Obviously the big cloud models refused to do it, so no way to know how they fare against that, but Qwen3.6 27B with no MTP was chugging along. VERY slow, hitting 262k context in minutes regularly and having to compact regularly. But it was making decent progress at it. Caught some things I missed. But MTP was absolute trash. hallucinations left and right, tool calls interrupted constantly, and forgetting what it was doing. )

froilo

about 5 hours ago

Hmm so far I found MTP or speculative decoding made qwen significantly dumber than before in agentic AI tasks. I use it for heavy stuff all the time, so... as in a my brain is tired and I dont wanna think right now, do
...
MTP ain't it.

i though i just sucked at llama server settings , but ive got similar experience, when i added mtp i saw mangled python syntax or a complete dud of a code... sometimes

Interpause

about 5 hours ago

•

edited about 5 hours ago

Try increasing --spec-draft-p-min, the default is 0.75 which means if the model is 75% happy with the MTP's answer it will accept it. ofc, increase it means there will be less draft acceptance, but they will tend to be more correct.

Also from the other thread, does any one know why its bad for the model to output conversational text before a tool call such that its an instruction in the template itself? Personally, I like it as a summary of what the agent's about to do before it does it versus having to interpret the thinking block.

froilo

about 5 hours ago

Try increasing --spec-draft-p-min, the default is 0.75 which means if the model is 75% happy with the MTP's answer it will accept it. ofc, increase it means there will be less draft acceptance, but they will tend to be more correct.

will try to play with it

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment