MTP acceptance rate is 0
#4
by AMUN-RA1 - opened
I served the model with --speculative-config.method mtp --speculative-config.num_speculative_tokens 1 and benchmarked it with:
vllm bench serve --model intel/GLM-4.7-Flash-int4-AutoRound --num-prompts 200 --dataset-name random --random-input-len 8192 --random-output-len 1024 --port 8001 --trust-remote-code --served-model-name glm-4.7-flash-int4-autoround
The MTP acceptance rate is effectively 0:
---------------Speculative Decoding---------------
Acceptance rate (%): 0.01
Acceptance length: 1.00
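To read these two numbers together, here is a minimal sketch that assumes the benchmark defines acceptance rate as accepted draft tokens divided by proposed draft tokens, and acceptance length as 1 plus accepted tokens per drafting step. With num_speculative_tokens 1, every step proposes exactly one token, so the length is just 1 + rate/100. A rate of 0.01% therefore rounds to a length of 1.00, meaning the MTP head contributes essentially nothing. The function name and the token counts below are hypothetical, chosen only to reproduce the reported values.

```python
def acceptance_metrics(num_drafts, num_draft_tokens, num_accepted_tokens):
    """Compute speculative-decoding summary metrics.

    ASSUMPTION: this mirrors how the benchmark reports them:
      acceptance rate (%) = accepted / proposed draft tokens * 100
      acceptance length   = 1 + accepted / number of drafting steps
    """
    rate_pct = 100.0 * num_accepted_tokens / num_draft_tokens
    length = 1.0 + num_accepted_tokens / num_drafts
    return rate_pct, length


# With num_speculative_tokens=1, each drafting step proposes one token,
# so num_draft_tokens == num_drafts and length == 1 + rate_pct / 100.
# Hypothetical counts: 10 accepted out of 100,000 drafted tokens.
rate, length = acceptance_metrics(100_000, 100_000, 10)
print(f"Acceptance rate (%): {rate:.2f}")  # prints 0.01
print(f"Acceptance length: {length:.2f}")  # prints 1.00
```

For comparison, a healthy single-token MTP head that is accepted half the time would report 50% and a length of 1.50.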
Sorry, at that time we did not yet support copying or quantizing the MTP layers. You can use our latest release, or manually copy the MTP layer from the original model.
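If you go the manual route, the first step is working out which tensors make up the MTP layer and where they live. The sketch below only plans the copy using the checkpoint's model.safetensors.index.json "weight_map". It assumes (not confirmed here for this model) that the config has num_hidden_layers and num_nextn_predict_layers, and that the MTP tensors are named model.layers.{num_hidden_layers + i}.*, which is the convention in DeepSeek-V3/GLM-4.5-style checkpoints. The helper names and the shard file name model-mtp.safetensors are hypothetical.

```python
import json


def mtp_tensor_names(config, weight_map):
    """Return the tensor names belonging to the MTP (next-token prediction) layers.

    ASSUMPTION: MTP layers are stored as model.layers.{i}.* with
    i in [num_hidden_layers, num_hidden_layers + num_nextn_predict_layers).
    """
    start = config["num_hidden_layers"]
    count = config.get("num_nextn_predict_layers", 0)
    # The trailing "." keeps layer 2 from also matching layer 20, 21, ...
    prefixes = tuple(f"model.layers.{start + i}." for i in range(count))
    if not prefixes:
        return []
    return sorted(name for name in weight_map if name.startswith(prefixes))


def plan_mtp_copy(orig_config, orig_index, quant_index, mtp_shard="model-mtp.safetensors"):
    """Plan copying MTP tensors from the original checkpoint into the quantized one.

    Returns (source shard -> tensor names to read, updated quantized weight_map).
    mtp_shard is a HYPOTHETICAL name for the new shard that will hold the copies.
    """
    orig_map = orig_index["weight_map"]
    names = mtp_tensor_names(orig_config, orig_map)
    by_shard = {}
    for name in names:
        by_shard.setdefault(orig_map[name], []).append(name)
    new_map = dict(quant_index["weight_map"])
    for name in names:
        new_map[name] = mtp_shard
    return by_shard, new_map


if __name__ == "__main__":
    # Toy data with hypothetical tensor and shard names.
    config = {"num_hidden_layers": 2, "num_nextn_predict_layers": 1}
    orig_index = {"weight_map": {
        "model.layers.0.mlp.weight": "model-00001.safetensors",
        "model.layers.2.eh_proj.weight": "model-00002.safetensors",
        "model.layers.2.enorm.weight": "model-00002.safetensors",
    }}
    quant_index = {"weight_map": {"model.layers.0.mlp.qweight": "model-00001.safetensors"}}
    by_shard, new_map = plan_mtp_copy(config, orig_index, quant_index)
    print(json.dumps(by_shard, indent=2))
```

The tensors themselves could then be read from each listed source shard and written to the new shard, for example with safetensors' load_file and save_file, and the updated weight_map written back to the quantized index. Copied this way the MTP layer stays unquantized, so you may also need to check that the model's quantization_config does not expect those layers to be quantized.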
Oh, I get it. Thanks for your help!