MTP 0 accept rate

by AMUN-RA1 - opened 4 days ago

Using --speculative-config.method mtp --speculative-config.num_speculative_tokens 1 to serve, and using

vllm bench serve   --model intel/GLM-4.7-Flash-int4-AutoRound --num-prompts 200   --dataset-name random   --random-input-len 8192   --random-output-len 1024   --port 8001 --trust-remote-code --served-model-name glm-4.7-flash-int4-autoround

to bench, the MTP acceptance rate is effectively 0:

---------------Speculative Decoding---------------
Acceptance rate (%):                     0.01      
Acceptance length:                       1.00
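
For reference, a minimal sketch of how these two numbers relate. It assumes the acceptance rate is accepted draft tokens divided by proposed draft tokens, and the acceptance length is one plus the accepted draft tokens per drafting step. The function name and the counter values are hypothetical and chosen only to reproduce the output above. With `num_speculative_tokens 1`, an acceptance length of 1.00 means that almost no drafted token was ever accepted.

```python
def acceptance_metrics(num_drafts, num_draft_tokens, num_accepted_tokens):
    """Return (acceptance rate in %, mean acceptance length).

    Assumed definitions (not taken from this thread):
      rate   = accepted draft tokens / proposed draft tokens
      length = 1 + accepted draft tokens / number of drafting steps
    """
    rate = 100.0 * num_accepted_tokens / num_draft_tokens if num_draft_tokens else 0.0
    length = 1.0 + num_accepted_tokens / num_drafts if num_drafts else 1.0
    return rate, length


# Hypothetical counters: one draft token per step, 1 accepted out of 10000.
rate, length = acceptance_metrics(
    num_drafts=10_000, num_draft_tokens=10_000, num_accepted_tokens=1
)
print(f"Acceptance rate (%): {rate:.2f}")
print(f"Acceptance length:   {length:.2f}")
```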

wenhuach (Intel org) - 4 days ago, edited 4 days ago

Sorry, we did not support copying the MTP layer or MTP quantization at that time. You could use our latest release, or manually copy the MTP layer from the original model.
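
A rough sketch of the first step of a manual copy: working out which tensors are missing from the quantized checkpoint. It assumes both checkpoints ship a safetensors index whose `weight_map` maps tensor names to shard files, and that the MTP layer is stored as an extra decoder layer at index `num_hidden_layers`, with names starting with `model.layers.<N>.`. That naming is an assumption and should be checked against the original model's index. The function names and the toy maps below are hypothetical.

```python
def find_mtp_tensors(weight_map, num_hidden_layers):
    """Select tensors that belong to the (assumed) MTP layer.

    weight_map: dict of tensor name -> shard file name, as found under the
    "weight_map" key of a safetensors index file.
    """
    # The trailing dot keeps layer 4 from matching layer 47, for example.
    prefix = f"model.layers.{num_hidden_layers}."
    return {name: shard for name, shard in weight_map.items()
            if name.startswith(prefix)}


def missing_mtp_tensors(original_map, quantized_map, num_hidden_layers):
    """List MTP tensor names present in the original but not the quantized map."""
    wanted = find_mtp_tensors(original_map, num_hidden_layers)
    return sorted(name for name in wanted if name not in quantized_map)


# Toy example with made-up tensor names and a 2-layer model.
original = {
    "model.layers.0.mlp.weight": "a.safetensors",
    "model.layers.1.mlp.weight": "a.safetensors",
    "model.layers.2.eh_proj.weight": "b.safetensors",   # assumed MTP layer
    "model.layers.2.mlp.weight": "b.safetensors",       # assumed MTP layer
}
quantized = {
    "model.layers.0.mlp.weight": "q.safetensors",
    "model.layers.1.mlp.weight": "q.safetensors",
}
print(missing_mtp_tensors(original, quantized, num_hidden_layers=2))
```

The tensors listed this way would then need to be copied from the original shards into the quantized checkpoint and added to its index. That step is not shown here.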

AMUN-RA1 - 4 days ago

Oh, I get it. Thanks for your help.
