What is the best engine to run this model?

#40
by Shimon324 - opened

Hi everyone,
I am looking to deploy the Qwen 3.5 27B model, and my primary optimization goal is maximizing generation throughput (tokens per second, TPS) while keeping latency reasonable.
Hardware Context:
2x NVIDIA RTX A5000 (Ampere architecture, 24GB VRAM each -> 48GB total).
High-end local workstation.
The Constraints:
Since the unquantized 27B model (~54GB in BF16/FP16) exceeds my 48GB VRAM limit, I must use a quantized version with Tensor Parallelism (TP=2). I am aiming for a purely text-based use case (the vision pipeline will be disabled).
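For reference, here is the back-of-envelope weight-memory arithmetic behind that constraint (decimal GB, weights only; KV cache, activations, and CUDA context add several GB on top):

```python
# Rough VRAM check for a 27B-parameter model at different weight precisions.
# Weight memory only; runtime overhead comes on top of these numbers.

PARAMS = 27e9
GB = 1e9

bf16_gb = PARAMS * 2 / GB      # 2 bytes per parameter
fp8_gb  = PARAMS * 1 / GB      # 1 byte per parameter
int4_gb = PARAMS * 0.5 / GB    # ~0.5 bytes per parameter (AWQ/GPTQ 4-bit,
                               # ignoring group-scale overhead)

print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, INT4: {int4_gb:.1f} GB")
```

So BF16 (~54 GB) exceeds the 48 GB budget, while FP8 (~27 GB) and 4-bit (~13.5 GB) leave headroom for the KV cache, which is what actually drives batch throughput.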
My Questions:
Engine Performance: Between SGLang and vLLM, which engine currently has better-optimized kernels and lower overhead for the Qwen 3.5 architecture on an Ampere multi-GPU setup? Are there any benchmark comparisons of RadixAttention (SGLang) vs. PagedAttention (vLLM) for this specific model family?
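For concreteness, the two setups I am comparing would look roughly like this (the model id and quantization choice below are placeholders, not confirmed checkpoint names; adjust to whatever quantized repo actually exists):

```shell
# vLLM: tensor parallelism across both A5000s (placeholder model id)
vllm serve Qwen/Qwen3.5-27B-AWQ \
  --tensor-parallel-size 2 \
  --max-model-len 8192

# SGLang: equivalent setup (same placeholder model id)
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B-AWQ \
  --tp 2 \
  --context-length 8192
```

I would benchmark both with the same request mix before committing to either.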
TensorRT-LLM: Should I be looking into TensorRT-LLM instead for bare-metal maximum speed, or is the compilation overhead not worth the TPS gain compared to SGLang for a 27B model?
Quantization Synergy: Which quantization format (FP8, AWQ, EXL2, or GPTQ) pairs best with your recommended engine to avoid compute bottlenecks and maximize tokens per second on Ampere?
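Part of why the quantization format matters for TPS is the KV-cache budget it leaves behind: smaller weights mean more cached tokens, which means larger batches. A sketch of that trade-off, where the layer/head/dim numbers are placeholders rather than Qwen 3.5's actual config (substitute the values from the model's config.json):

```python
# Illustrative KV-cache budget after 4-bit quantized weights are loaded.
# layers / kv_heads / head_dim are HYPOTHETICAL, not Qwen 3.5's real config.

layers, kv_heads, head_dim = 64, 8, 128                    # placeholder GQA config
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # K+V, FP16 (2 bytes each)

total_vram    = 48e9
weights_4bit  = 27e9 * 0.5   # ~13.5 GB of 4-bit weights
overhead      = 6e9          # activations, CUDA graphs, fragmentation (rough guess)
cache_budget  = total_vram - weights_4bit - overhead

print(f"{kv_bytes_per_token / 1e6:.2f} MB/token "
      f"-> ~{cache_budget / kv_bytes_per_token:,.0f} cacheable tokens")
```

With FP8 weights instead (~27 GB), the cache budget roughly halves, so the "fastest" format on paper is not automatically the highest-throughput one under concurrent load.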
Any insights, recent benchmark experiences, or specific CLI flag recommendations (like multi-token prediction) would be highly appreciated.
Thanks
