mtp

by festr2 - opened Jan 12

Discussion

festr2

Jan 12

Hello,

Is MTP still possible?

lazarevich

Cerebras org Jan 12

Hey @festr2, we'd need to run our pruning procedure on the MTP block too, to keep a model with a uniform num_experts. Pruning could also affect the speedup from MTP in this case. We'll look into keeping a pruned MTP layer!

CHNtentes

Jan 13

Would be best if possible. Enabling MTP in sglang gives me a 1.5x to 2x speedup for the original FP8 model.

festr2

Jan 13

On 4x RTX 6000 PRO with FP8: 58 tokens/sec without MTP, 90-105 tokens/sec with MTP.

mtcl

Jan 23

Need an NVFP4 for 2x 6000 Pro users!
