Quants With MTP
First of all, thanks a lot for the quants!
First of all, thanks a lot for the quants! I've been able to load and run both IQ3_XXS and IQ4_XS on my hardware with the full 131k context window, and I get decent decode speeds (15-16 tps — not fast, but workable). More importantly, with some -ot tuning and --tensor-split, I get very decent prompt-processing speed: around 500 tps on a 16K prompt (the opencode system prompt). The performance is much better than I expected for my hardware, the sizes are ideal, and the quality also seems good.
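For reference, a sketch of the kind of invocation described above. The model filename, GPU split ratio, and -ot regex are placeholders/assumptions, not the exact values used; the flags themselves (-c, -ngl, --tensor-split, -ot / --override-tensor) are standard llama.cpp server options:

```shell
# Sketch only: model path, split ratio, and -ot pattern are illustrative.
# -c 131072       -> 131k context window
# -ngl 99         -> offload all layers to GPU
# --tensor-split  -> divide layers across two GPUs (ratio depends on VRAM)
# -ot "...=CPU"   -> keep MoE expert tensors in system RAM to fit large quants
llama-server \
  -m ./model-IQ4_XS.gguf \
  -c 131072 \
  -ngl 99 \
  --tensor-split 1,1 \
  -ot "\.ffn_.*_exps\.=CPU"
```

Which tensors to override and how to split depends entirely on the VRAM available per GPU, so the pattern above is just a common starting point for MoE models.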
I'm not sure whether MTP would gain me more performance on my hardware and config, but I wanted to try the https://github.com/stepfun-ai/llama.cpp/tree/step3p5-mtp branch, and I realized these quants don't include the MTP heads. Would you consider uploading quants with the MTP heads as well, since MTP support may be merged into mainline llama.cpp soon?
If MTP gets merged into mainline llama.cpp, I'll requant it to support that.