Can't get vLLM running on 1xRTX 4090
I can't get --cpu-offload-gb to work. Has anyone gotten this working on a 24GB VRAM NVIDIA card?
Hi everyone 👋
I’m currently trying to run Qwen3.5-35B-A3B-AWQ locally using vLLM, but I’m running into repeated issues related to KV cache memory and engine initialization.
My setup:
• GPU: NVIDIA RTX 3090 (24GB)
• CUDA: 13.1
• Driver: 590.48.01
• vLLM (latest stable)
• Model: Qwen3.5-35B-A3B-AWQ (downloaded locally)
Typical issues I’m facing:
• Negative or extremely small available KV cache memory reported at startup
• Engine failing during CUDA graph capture
• Assertion errors during warmup
• Instability when increasing max context length
I’ve experimented with:
• --gpu-memory-utilization between 0.70 and 0.96
• --max-model-len from 1024 up to 4096
• --enforce-eager
• Limiting concurrency (--max-num-seqs)
But I still haven’t found a stable configuration.
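For concreteness, this is the kind of command I've been running — a sketch with illustrative values, not a known-good configuration:

```shell
# One attempted launch; model path and exact values are
# illustrative, not a working configuration.
vllm serve ./Qwen3.5-35B-A3B-AWQ \
  --gpu-memory-utilization 0.90 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --enforce-eager
```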
⸻
My main questions:
1. Has anyone successfully run Qwen3.5-35B-A3B-AWQ on a single 24GB GPU (like a 3090)?
2. If so, could you share:
• Your full vLLM command
• Max context length used
• Whether you needed swap space
• Any special flags
3. Is this model realistically expected to run reliably on a single 24GB GPU, or is multi-GPU / 48GB+ VRAM effectively required?
Any guidance or known-good configurations would be greatly appreciated 🙏
Thanks in advance!
You'd probably need to disable the vision part in order to run this model efficiently on one 24GB card (--language-model-only). Also try the fp8 KV cache (--kv-cache-dtype fp8_e4m3).
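A back-of-the-envelope for why the fp8 cache helps: per-token KV memory is 2 (K and V) × layers × KV heads × head dim × bytes per element, so dropping from fp16 to fp8 halves it. The dimensions below are made-up placeholders for illustration, not this model's actual config (check its config.json):

```shell
# Per-token KV cache footprint; dimensions are illustrative
# assumptions, not the real model config.
layers=48
kv_heads=4      # GQA models keep few KV heads
head_dim=128
fp16_per_token=$((2 * layers * kv_heads * head_dim * 2))  # 2-byte elements
fp8_per_token=$((2 * layers * kv_heads * head_dim * 1))   # 1-byte elements
echo "fp16: $((fp16_per_token * 4096 / 1024 / 1024)) MiB at 4096 tokens"
echo "fp8:  $((fp8_per_token * 4096 / 1024 / 1024)) MiB at 4096 tokens"
```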
> Is this model realistically expected to run reliably on a single 24GB GPU, or is multi-GPU / 48GB+ VRAM effectively required?
2x20GB has been tested and is enough.
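For example, something along these lines with two cards, using tensor parallelism to split the weights — a sketch, with the model path and context length as assumptions rather than a tested config:

```shell
# Hypothetical two-GPU launch; path and values are assumptions.
vllm serve ./Qwen3.5-35B-A3B-AWQ \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 4096
```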