Love the quant! Nicely done.
Did some kernel work in vLLM to use this on AMD and oh yeah baby, check it out:
https://huggingface.co/Qwen/Qwen3.5-122B-A10B/discussions/18
Nice work guys, the model screams with good outputs when enabled properly.
Thank you for the kind words!
FYI, I also dismissed the idea of upstreaming some mxfp4-related changes to vLLM, exactly for the reasons you've mentioned in your post.
Also, typical mxfp4 acceptance rates are:
Hard data replacing my memory, sampled from 430 log entries across general use (chat, coding, web crawling, etc.):
This tells me two things about your quant, since MTP is unquantized: first, not quantizing MTP was a good choice; second, your quant loses almost nothing in accuracy, otherwise the falloff would be much sharper.
I wanted to use 5 because it's still a net gain despite the falloff, but it overwhelms the 9950X's PCIe controller with RCCL all_gather traffic during concurrent inference.
BTW, if you want vLLM to go fast on AMD with TP4: all_gather is the hottest point. By shifting from ProcessGroupNCCL to a pynccl preallocated buffer without stream isolation, I saw 35% decode gains on this model.
The same change on oss 120 showed no single-user gains, but it shows huge throughput under concurrency. A test with a 2k-word input document and 2k-word output reflecting document edits now holds 37 tps decode across 64 concurrent requests, up from ~20 at 64. It basically doubles the concurrency that can run; without the all_gather change, 32 concurrent was 31 tps...
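To make the idea concrete, here's a minimal pure-Python sketch of the preallocated-buffer pattern. The actual change lives in vLLM's pynccl wrapper around NCCL/RCCL; the `MockComm` and class names here are hypothetical stand-ins, the point is just "allocate the gather output once and reuse it every call" instead of letting each all_gather allocate a fresh output tensor:

```python
class MockComm:
    """Hypothetical stand-in for a NCCL/RCCL communicator."""
    def __init__(self, world_size):
        self.world_size = world_size

    def all_gather_into(self, out, rank_shards):
        # Write every rank's shard into the single preallocated output buffer.
        for rank, shard in enumerate(rank_shards):
            n = len(shard)
            out[rank * n:(rank + 1) * n] = shard

class PreallocAllGather:
    """Allocate the gather buffer once, reuse it for every call.
    Avoiding the per-call output allocation (and the stream
    synchronization that comes with it) is what the 35% came from."""
    def __init__(self, comm, shard_len):
        self.comm = comm
        self.buf = [0.0] * (shard_len * comm.world_size)  # one-time alloc

    def __call__(self, rank_shards):
        self.comm.all_gather_into(self.buf, rank_shards)
        return self.buf  # same buffer object every call, never reallocated

comm = MockComm(world_size=4)
ag = PreallocAllGather(comm, shard_len=2)
shards = [[float(r)] * 2 for r in range(4)]
out1 = ag(shards)
out2 = ag(shards)
assert out1 is out2  # buffer identity preserved across calls
print(out1)  # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

In the real kernel path the buffer would be a GPU tensor sized for the largest gather, and the collective writes into it directly on the current stream, which is why dropping stream isolation matters for the win.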
Thanks for sharing this! Reminded me that I need to actually compare the quality loss between the quantized and base models :)
