MTP
#1
by festr2 - opened
Hello,
Is MTP (multi-token prediction) still possible?
Hey @festr2, we'd need to run our pruning procedure on the MTP block as well to keep a model with a uniform num_experts. Pruning could also affect the speedup from MTP in this case. We'll look into keeping a pruned MTP layer!
That would be great if possible. Enabling MTP in SGLang gives me a 1.5x to 2x speedup with the original FP8 model.
On 4x RTX 6000 PRO with FP8: 58 tokens/sec without MTP, 90-105 tokens/sec with MTP.
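As a quick sanity check on those numbers, the MTP speedup is just the ratio of throughput with MTP to throughput without it. This minimal sketch uses only the figures posted in this thread; the variable and function names are illustrative.

```python
# Throughput figures reported in this thread (tokens/sec, 4x RTX 6000 PRO, FP8).
baseline = 58.0            # without MTP
with_mtp = (90.0, 105.0)   # range observed with MTP enabled


def speedup(new: float, old: float) -> float:
    """Return the throughput ratio new/old."""
    return new / old


low, high = (speedup(t, baseline) for t in with_mtp)
print(f"MTP speedup: {low:.2f}x to {high:.2f}x")  # MTP speedup: 1.55x to 1.81x
```

That works out to roughly 1.55x to 1.81x, consistent with the 1.5x to 2x range mentioned.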
We need an NVFP4 version for us 2x RTX 6000 PRO users!