Poor results compared to regular NVFP4 without MTP
I managed to run your quant on an RTX 5090. It performed poorly: about 100-120 tps, vs. 750-800 tps with this quant, Kbenkhaled/Qwen3.5-27B-NVFP4, with vision.
Weird! If you want to send me a 5090 I'm happy to try to reproduce this; otherwise I'm not sure what you'd like from me here.
I don't have a 5090; I rent one on vast.ai. You can try it yourself. My goal was to inform you, not to assign blame.
Okay, assuming you don't mean single-stream throughput (it's not physically possible for a single 5090 to hit 750 tps single-stream with a 27B model): I'm working on DGX Spark and get 20 tps single-stream; batched at 32k I'm pushing 450 tps with some per-stream drawdown, and could probably push that to 600 tps with higher concurrency. On a 5090, benchmarks on Gemma 27B show roughly 5x the single-stream speed of the GB10, so I'd expect single-stream to be around 100 tps. I have no idea how concurrency scaling differs between the cards, but if the 5090 can handle even 8 concurrent streams, that's already 800 tps batched. If you can't hit that, you may not be allocating enough KV cache.
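The arithmetic above can be sketched as a back-of-the-envelope estimate. The per-stream efficiency factor is a hypothetical knob standing in for the "per-stream drawdown" mentioned above, not a measured number:

```python
def batched_tps(single_stream_tps: float, concurrency: int,
                per_stream_efficiency: float = 1.0) -> float:
    """Rough aggregate throughput: each concurrent stream contributes its
    single-stream rate, discounted by a per-stream drawdown factor."""
    return single_stream_tps * concurrency * per_stream_efficiency

# ~100 tps single-stream on a 5090, 8 concurrent streams, no drawdown:
print(batched_tps(100, 8))        # 800.0
# DGX Spark: 20 tps single-stream, 32 streams, ~70% per-stream efficiency:
print(batched_tps(20, 32, 0.7))   # ~448
```

This ignores prefill cost and scheduler overhead, so treat it as an upper-bound sanity check, not a prediction.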
Can you please provide the exact vLLM build, with a link, and a vLLM command?
Here are my commands:

export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
  --model Kbenkhaled/Qwen3.5-27B-NVFP4 \
  --host 0.0.0.0 \
  --port 8080 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-model-len 18432 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
  --model osoleve/Qwen3.5-27B-NVFP4-MTP \
  --host 0.0.0.0 \
  --port 8080 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --max-model-len 18432 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --quantization modelopt \
  --language-model-only \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
The first one gives me 17-18 parallel threads on the RTX 5090 (I can't afford to rent an RTX 6000 Pro Blackwell) and 720-750 tps typically, with 800+ tps at peak.
The second one (your quant) gives me 120-150 tps, with maybe 200 tps at peak.
The only thing I want is to figure out how to run your MTP-enabled quant and achieve 1000+ tps.
According to your report, this quant should give me 1200-1500 tps given the MTP.
I would appreciate it if you provided the exact setup to run your quant.
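For reference, with one speculative token per step (as in the --speculative-config above), the expected tokens per decode step is roughly 1 plus the draft acceptance rate, so reaching 1200-1500 tps from a ~750 tps baseline would require an acceptance rate of about 60-100%. A minimal sketch of that expectation, where the acceptance rate is a hypothetical input rather than a measured one:

```python
def mtp_tokens_per_step(num_speculative_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens emitted per decode step with chained speculative
    acceptance: 1 (the verified token) + p + p^2 + ... up to the draft depth."""
    return sum(acceptance_rate ** k for k in range(num_speculative_tokens + 1))

print(mtp_tokens_per_step(1, 0.6))  # 1.6x -> ~1200 tps over a 750 tps baseline
print(mtp_tokens_per_step(1, 1.0))  # 2.0x -> ~1500 tps
```

This is the idealized ceiling; verification overhead means real speedups land somewhat below it.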
In my case vLLM crashes with --gpu-memory-utilization 0.85, so I'm forced to use 0.80...
As you can see, I can run 18 parallel threads thanks to --kv-cache-dtype fp8.
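On why fp8 KV cache helps with thread count: per-token KV size scales linearly with element width, so dropping from fp16 to fp8 halves the cache per sequence and roughly doubles how many sequences fit in the same memory budget. A sketch with hypothetical model dimensions (the layers/heads/head_dim values below are placeholders, not the real config of this model):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: K and V tensors (factor 2)
    across every layer and every KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical dims; dtype_bytes is 2 for fp16, 1 for fp8.
fp16 = kv_cache_bytes_per_token(48, 8, 128, 2)
fp8 = kv_cache_bytes_per_token(48, 8, 128, 1)
print(fp16 // fp8)  # 2 -> fp8 fits ~2x the sequences in the same budget
```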
I'm not promoting another quant; I just managed to run it. If you can help me run your quant at a decent speed, I'd appreciate it.