vLLM deployment OOM
8× H100, gets OOM
Command:
vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
Of course it OOMs — 640 GB of VRAM can't hold an 800 GB model.
Wait for the FP8 version.
I have the same trouble:
Issue: Cannot start inference server for Qwen3.5-397B-A17B on 8xH100 (80GB) with Python 3.12
Command used:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-397B-A17B \
--host 0.0.0.0 \
--port 8000 \
--tp-size 8 \
--ep-size 8 \
--context-length 8192 \
--model-impl sglang \
--mem-fraction-static 0.60
Error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 4 has a total capacity of 79.19 GiB of which 104.25 MiB is free. Including non-PyTorch memory, this process has 79.08 GiB memory in use. Of the allocated memory 77.28 GiB is allocated by PyTorch, and 16.43 MiB is reserved by PyTorch but unallocated.
What I've tried:
· Reducing context length (--context-length 8192)
· Lowering memory fraction (--mem-fraction-static 0.60)
Problem: The server still crashes with OOM on startup. It seems like the model is trying to use almost all available memory (~79 GiB per GPU), leaving no room for overhead or allocations.
System:
· 8x NVIDIA H100 80GB
· Python 3.12
· PyTorch with CUDA support
· sglang for inference
Question: How can I successfully run this model on 8xH100? Are there additional settings or optimizations I should try?
It's because the BF16 weights are already ~800 GB, so it is impossible to load them on 8× H100 80 GB (640 GB total), even if you set the context length to zero!
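The arithmetic behind that claim is simple (assuming 397B parameters at 2 bytes each for BF16):

```python
# Back-of-the-envelope weight-memory check for a 397B-parameter model.
# BF16 stores 2 bytes per parameter; activations and KV cache come on top.
params = 397e9
bf16_gb = params * 2 / 1e9              # ~794 GB of weights alone

gpus = 8
vram_per_gpu_gb = 80                     # H100 80GB
total_vram_gb = gpus * vram_per_gpu_gb   # 640 GB

print(f"BF16 weights: ~{bf16_gb:.0f} GB, total VRAM: {total_vram_gb} GB")
print("fits" if bf16_gb < total_vram_gb else "does not fit")
# → BF16 weights: ~794 GB, total VRAM: 640 GB
# → does not fit
```

So no amount of `--mem-fraction-static` or context-length tuning can save the BF16 checkpoint on this hardware; the weights alone exceed total VRAM before a single token is cached.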
Is there a solution for this issue?
I tried to deploy Qwen3.5 397B quantized to INT4 on 4× H100 GPUs and got an OOM error.
And does it run at scale? Can you share your vLLM configuration?
I got the error both while deploying and when testing inference on the model myself.
export HF_HOME=/home/jovyan/shares/hf_home
export MODEL_PATH=/home/jovyan/shares/hf_home/models/Qwen/Qwen3.5-397B-A17B-FP8
export PYTORCH_ALLOC_CONF=expandable_segments:True
## warmup for DeepGEMM JIT Compiling
# conda install -c nvidia cuda-nvcc cuda-toolkit
# python -m sglang.compile_deep_gemm \
# --model-path "$MODEL_PATH" \
# --tp 8
python -m sglang.launch_server \
--model-path "$MODEL_PATH" \
--host 0.0.0.0 \
--port 8000 \
--tp-size 8 \
--context-length 262144 \
--model-impl sglang \
--mem-fraction-static 0.8 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
First, run the commented-out warmup step to trigger DeepGEMM JIT compilation (this precompiles the kernels).
After the warmup completes, you can run the server normally without the warmup step.
The second run will use the already-compiled kernels and start correctly.
From your experience, do you think I’d be able to deploy this version of the model — https://huggingface.co/Qwen/Qwen3.5-397B-A17B-GPTQ-Int4
— on 4× H100 GPUs using vLLM?
My short answer: I can't say for certain, but it should fit, though it'll be tight.
Why it works on paper:
Qwen3.5-397B-A17B is a Mixture-of-Experts model - 397B total parameters, but only ~17B active per forward pass. The full BF16 weights clock in around ~800 GB, which is way too much even for 8× H100s. But the GPTQ-Int4 quantization cuts that roughly in half (down to ~200 GB for the weights alone). Four H100s give you 320 GB of VRAM total, so the weights themselves should fit with room to spare.
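Rough arithmetic behind that estimate (the ~0.5 bytes/parameter figure for Int4 ignores GPTQ scale and zero-point overhead, which adds a few percent on top):

```python
# Sketch: does a GPTQ-Int4 397B checkpoint fit on 4x H100 80GB?
# Assumption: ~0.5 bytes/parameter, excluding quantization metadata.
params = 397e9
int4_gb = params * 0.5 / 1e9      # ~199 GB of weights
vram_4xh100_gb = 4 * 80           # 320 GB total
headroom_gb = vram_4xh100_gb - int4_gb

print(f"Int4 weights ~{int4_gb:.0f} GB, "
      f"~{headroom_gb:.0f} GB left for KV cache, activations, and overhead")
```

That headroom looks comfortable on paper, but it has to cover the KV cache, CUDA graphs, activation buffers, and per-process framework overhead, which is why the context length still matters below.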
The real concerns:
KV cache overhead. The weights aren't the only thing eating VRAM. With a 262K max context length, the KV cache can balloon quickly. At Int4 you'll likely need to keep your context window modest (maybe 8K–16K tokens) to avoid OOM on 4 GPUs.
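A generic way to estimate how fast the KV cache balloons with context length. Note the layer count, KV-head count, and head dimension below are illustrative placeholders, not Qwen3.5-397B-A17B's real architecture — substitute the values from the model's config.json:

```python
# Generic per-sequence KV-cache estimator.
# The architecture numbers are placeholders, NOT Qwen3.5's actual config.
def kv_cache_gb(seq_len, layers=60, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x accounts for storing both K and V tensors per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len / 1e9

for ctx in (8_192, 16_384, 262_144):
    print(f"{ctx:>7} tokens: ~{kv_cache_gb(ctx):.1f} GB per sequence")
```

With these placeholder numbers, one 8K-token sequence costs about 2 GB while a full 262K-token sequence costs over 60 GB, which is why capping `--max-model-len` is usually the first lever to pull on a memory-constrained deployment.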
vLLM support. vLLM does support GPTQ quantization and tensor parallelism (--tensor-parallel-size 4). However, MoE + GPTQ-Int4 at this scale is a relatively niche config. Check for any open issues on vLLM's GitHub - there have historically been edge cases with very large MoE models and certain quant formats.
Throughput will be limited. Even though only ~17B params are active per token, all 397B params still need to live in VRAM. You're essentially paying the memory cost of a 397B model for the inference speed of a ~17B one. On 4× H100, expect decent latency per token but low batch throughput.
What's been confirmed by the community:
The FP8 version has been verified to run on 8× H100 (80 GB each, ~640 GB total) with vLLM using --tensor-parallel-size 8. That's roughly ~400 GB of weights.
The Int4 GPTQ version halves that further, so 4× H100 (320 GB) is a reasonable target — but I couldn't find anyone publicly confirming this exact setup.
Recommendation:
Start conservative with something like (and scale up):
vllm serve Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--quantization moe_wna16 \
--language-model-only # if you are doing text-only
The real solution is to get a couple more GPUs. :)
Thank you very much!