Quants With MTP
First of all, thanks a lot for the quants!
First of all, thanks a lot for the quants! I've been able to load and run both IQ3_XXS and IQ4_XS on my hardware with the full 131k context window, and I get decent decode speeds (15-16 tps — not fast, but workable). More importantly, with some -ot tuning and --tensor-split, I get very decent prompt-processing speed: around 500 tps on a 16K prompt (the opencode system prompt). The performance is much better than I expected for my hardware, the sizes are ideal, and the quality also seems good.
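For reference, a sketch of the kind of invocation described above. The model filename, GPU split ratio, and -ot regex are placeholders/assumptions, not the exact values used; the flags themselves (-c, -ngl, --tensor-split, -ot / --override-tensor) are standard llama.cpp server options:

```shell
# Sketch only: model path, split ratio, and -ot pattern are illustrative.
# -c 131072       -> 131k context window
# -ngl 99         -> offload all layers to GPU
# --tensor-split  -> divide layers across two GPUs (ratio depends on VRAM)
# -ot "...=CPU"   -> keep MoE expert tensors in system RAM to fit large quants
llama-server \
  -m ./model-IQ4_XS.gguf \
  -c 131072 \
  -ngl 99 \
  --tensor-split 1,1 \
  -ot "\.ffn_.*_exps\.=CPU"
```

Which tensors to override and how to split depends entirely on the VRAM available per GPU, so the pattern above is just a common starting point for MoE models.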
I'm not sure whether MTP would gain me more performance on my hardware and config, but I wanted to try the https://github.com/stepfun-ai/llama.cpp/tree/step3p5-mtp branch, and I realized these quants don't include the MTP heads. Would you consider uploading quants with the MTP heads as well, since MTP support may be merged into mainline llama.cpp soon?
If MTP gets merged into mainline llama.cpp, I'll requant it to support that.