Can't get vLLM running on 1xRTX 4090

#1
by slyfox1186 - opened

I can't get --cpu-offload-gb to work. Has anyone gotten this working on a 24GB VRAM Nvidia card?

QuantTrio org

For vLLM, you can refer to https://github.com/guqiong96/Lvllm

or

ktransformers via sglang

tclf90 changed discussion title from Can't get vLLM running on RTX 4090 to Can't get vLLM running on 1xRTX 4090

Hi everyone 👋

I’m currently trying to run Qwen3.5-35B-A3B-AWQ locally using vLLM, but I’m running into repeated issues related to KV cache memory and engine initialization.

My setup:
• GPU: NVIDIA RTX 3090 (24GB)
• CUDA: 13.1
• Driver: 590.48.01
• vLLM (latest stable)
• Model: Qwen3.5-35B-A3B-AWQ (downloaded locally)

Typical issues I’m facing:
• Negative or extremely small KV cache memory
• Engine failing during CUDA graph capture
• Assertion errors during warmup
• Instability when increasing max context length

I’ve experimented with:
• --gpu-memory-utilization between 0.70 and 0.96
• --max-model-len from 1024 up to 4096
• --enforce-eager
• Limiting concurrency

But I still haven’t found a stable configuration.
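For what it's worth, the "negative or extremely small KV cache" symptom usually means the weight footprint plus runtime overhead already exceeds the budget that --gpu-memory-utilization allows, leaving nothing for the KV cache. Here is a rough back-of-the-envelope sketch of that arithmetic; all model numbers below are illustrative assumptions (substitute the real values from the model's config.json), not the actual Qwen3.5-35B-A3B-AWQ configuration:

```python
# Rough estimate of why vLLM can report a negative/tiny KV cache on 24 GB.
# All model-shape numbers are ASSUMPTIONS for illustration only.

GIB = 1024**3

vram_gib = 24.0
gpu_memory_utilization = 0.90       # same meaning as the vLLM flag
weights_gib = 19.0                  # assumed AWQ 4-bit weight footprint
overhead_gib = 1.5                  # assumed activations / CUDA graph overhead

budget_gib = vram_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - overhead_gib  # what's left for KV

# Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 48, 4, 128   # assumed GQA shape
bytes_per_elem = 2                        # fp16; halves with fp8 KV cache
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem

max_tokens = int(kv_cache_gib * GIB / per_token_bytes)
print(f"KV budget: {kv_cache_gib:.1f} GiB -> roughly {max_tokens} tokens")
```

If kv_cache_gib comes out near zero or negative under your real numbers, no combination of --max-model-len will stabilize the engine; you have to shrink the weights-plus-overhead side (fp8 KV cache, dropping the vision tower, offload) instead.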

My main questions:
1. Has anyone successfully run Qwen3.5-35B-A3B-AWQ on a single 24GB GPU (like a 3090)?
2. If so, could you share:
• Your full vLLM command
• Max context length used
• Whether you needed swap space
• Any special flags
3. Is this model realistically expected to run reliably on a single 24GB GPU, or is multi-GPU / 48GB+ VRAM effectively required?

Any guidance or known-good configurations would be greatly appreciated 🙏

Thanks in advance!

QuantTrio org
edited Mar 3

You'd probably need to disable the vision part to run this model efficiently on one 24GB card (--language-model-only), and also use an fp8 KV cache (--kv-cache-dtype fp8_e4m3).
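Putting those flags together, a starting point might look like the sketch below. This is only a guess at a working configuration, not a tested command: the context length and utilization value are placeholders to tune downward if initialization still fails.

```shell
# Sketch only: combines the flags suggested above; tune values for your card.
vllm serve QuantTrio/Qwen3.5-35B-A3B-AWQ \
    --language-model-only \
    --kv-cache-dtype fp8_e4m3 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enforce-eager
```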

Is this model realistically expected to run reliably on a single 24GB GPU, or is multi-GPU / 48GB+ VRAM effectively required?

2x20GB has been tested and is enough.
