Cannot run on RTX PRO 6000 Blackwell + WSL2: Mamba state cache OOM
#10
by noMugop - opened
Trying to run Qwen3.6-27B-FP8 with vLLM 0.20.0 / 0.17.1 and SGLang 0.5.10 on:
- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM, sm_120)
- OS: WSL2 Ubuntu 22.04 on Windows 11 host
- NVIDIA driver: 596.36 (also tested 581.80)
Result: the model weights load successfully (28.5 GB), but allocating the Mamba state cache fails with `torch.OutOfMemoryError`:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.48 GiB.
GPU 0 has a total capacity of 95.59 GiB of which 50.40 GiB is free.
This process has 16 GiB memory in use (non-PyTorch CUDA overhead).
```
8+ hours of testing point to a WSL2 GPU passthrough issue specific to Blackwell + hybrid Mamba models. The hidden 16 GiB overhead consumes VRAM that PyTorch's allocator cannot see, leaving insufficient contiguous space for the Mamba state cache.
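To quantify the hidden overhead yourself, you can subtract PyTorch's own accounting (`torch.cuda.memory_reserved`) from the driver-level usage reported by `torch.cuda.mem_get_info`. A minimal sketch; the GiB figures below are illustrative stand-ins modeled on the OOM message, not measured values:

```python
GIB = 2**30

def hidden_overhead_gib(total_b: int, free_b: int, reserved_b: int) -> float:
    """Driver-visible usage (total - free) minus PyTorch's reserved pool.

    On a real system, obtain the inputs with:
        free_b, total_b = torch.cuda.mem_get_info(0)
        reserved_b = torch.cuda.memory_reserved(0)
    Whatever is left over is memory the driver holds outside PyTorch,
    i.e. the "non-PyTorch CUDA overhead" from the OOM message.
    """
    return (total_b - free_b - reserved_b) / GIB

# Illustrative numbers (roughly matching the error output above):
total = int(95.59 * GIB)
free = int(50.40 * GIB)
reserved = int(28.5 * GIB)  # approximately the loaded model weights

print(f"hidden overhead: {hidden_overhead_gib(total, free, reserved):.2f} GiB")
```

On native Linux the same calculation typically yields a small residual (CUDA context, driver buffers), so a large gap here is a quick way to confirm the WSL2-specific overhead.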
The same issue also affects:
- Qwen3.6-35B-A3B-FP8 (MoE version) → fails with a 4.99 GiB allocation
- Both the 27B and 35B-A3B BF16 versions (likely fail similarly)
Filed bugs
- vLLM: https://github.com/vllm-project/vllm/issues/41619 (main report with full debugging)
Questions for the community
- Has anyone successfully run Qwen3.6 family on Blackwell + WSL2?
- If yes, what was your config?
- If it only works on native Linux for you, please confirm that too.
- Are there plans to support llama.cpp / Ollama / MLC for hybrid Mamba models?
Workarounds tested (none ideal)
- ❌ All vLLM/SGLang flag combinations
- ❌ NVIDIA driver downgrade (596.36 → 581.80)
- ❌ vLLM downgrade (0.20.0 → 0.17.1)
- ❌ Tight Mamba memory ratios in SGLang
- ⚠️ Switching to a non-Mamba Qwen (Qwen3-32B-AWQ) → works, but loses Qwen3.6 features
- ⚠️ Dual-booting native Linux → works, but no Windows
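For anyone who wants to retry the flag sweeps, these are the kinds of invocations involved. The model path and values are illustrative, not a known-working config; `--gpu-memory-utilization` (vLLM) and `--mem-fraction-static` (SGLang) are the standard knobs for shrinking the allocator's share of VRAM:

```shell
# vLLM: cap the allocator's VRAM share and shrink the context/state budget.
# (Values are examples; none of the combinations I tried avoided the OOM.)
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 8192 \
  --enforce-eager

# SGLang: same idea via its static memory fraction.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-27B-FP8 \
  --mem-fraction-static 0.70
```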
Currently waiting for any of:
- vLLM patch to allocate Mamba state in chunks
- WSL2/NVIDIA fix for hidden 16 GiB overhead on Blackwell
- llama.cpp adding Qwen3.6 support
Curious whether the Qwen team or community has any insights.
Thanks for the great model release. Hardware compatibility is the only blocker; the Qwen3.6 architecture is otherwise excellent.