0.8B with mtp is slower than without mtp

#15

by WEISHU - opened 13 days ago

Enabling MTP degrades latency under both single-threaded and high-concurrency loads. Server logs from the two runs:
1. without mtp
(APIServer pid=1422509) INFO 04-09 11:27:10 [loggers.py:259] Engine 000: Avg prompt throughput: 967.2 tokens/s, Avg generation throughput: 342.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 44.5%, MM cache hit rate: 16.7%

2. with mtp
(APIServer pid=1409788) INFO 04-09 11:17:53 [loggers.py:259] Engine 000: Avg prompt throughput: 1357.9 tokens/s, Avg generation throughput: 128.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 22.3%, MM cache hit rate: 6.5%
(APIServer pid=1409788) INFO 04-09 11:17:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.97, Accepted throughput: 63.00 tokens/s, Drafted throughput: 65.20 tokens/s, Accepted: 630 tokens, Drafted: 652 tokens, Per-position acceptance rate: 0.966, Avg Draft acceptance rate: 96.6%
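To put a number on the gap, here is a minimal sketch that pulls the generation throughput out of the two log lines above and compares them. The helper name `generation_throughput` and the shortened log strings are mine, for illustration only; each run is represented by a single log line, so this is a rough comparison and not a benchmark.

```python
import re

def generation_throughput(log_line: str) -> float:
    """Extract the 'Avg generation throughput' value (tokens/s) from a log line."""
    m = re.search(r"Avg generation throughput: ([\d.]+) tokens/s", log_line)
    if m is None:
        raise ValueError("no generation throughput field found")
    return float(m.group(1))

# Shortened copies of the two log lines quoted in this post.
without_mtp = "Avg prompt throughput: 967.2 tokens/s, Avg generation throughput: 342.2 tokens/s"
with_mtp = "Avg prompt throughput: 1357.9 tokens/s, Avg generation throughput: 128.3 tokens/s"

baseline = generation_throughput(without_mtp)
mtp = generation_throughput(with_mtp)
ratio = mtp / baseline

# Draft acceptance rate recomputed from the SpecDecoding counters.
accepted, drafted = 630, 652
acceptance = accepted / drafted

print(f"MTP generation throughput is {ratio:.1%} of the baseline")
print(f"Draft acceptance rate: {acceptance:.1%}")
```

So in these samples generation throughput with MTP is a bit over a third of the baseline, even though nearly every drafted token is accepted.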
