Qwen3.5-122B-A10B-NVFP4 by IG1
Quantization
This model has been quantized using llm-compressor v0.10.1.dev31+geb49917e (just after Qwen3.5 support was merged) and transformers v5.3.0. It is based on the official example with a few modifications (see next section).
Quantization particularities
The sequence length has been increased from 4096 to 8192 and the number of samples from 256 to 1024. The 1024 samples come from 4 different datasets:
- 256 general conversation samples (UltraChat)
- 256 math reasoning samples (GSM8K)
- 256 code samples (CodeAlpaca)
- 256 multilingual samples (Aya)
You can find the quantization script here.
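For orientation, here is a condensed sketch of what that script does. It follows the structure of the official llm-compressor NVFP4 example with the modifications described above; the exact dataset repositories, splits, field names, and the full ignore list are assumptions, so refer to the linked script for the authoritative version.

```python
# Condensed sketch (not the exact script): 4 x 256 calibration samples,
# 8192-token sequence length, NVFP4 weight quantization via llm-compressor.
from datasets import load_dataset, concatenate_datasets
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-122B-A10B"
MAX_SEQUENCE_LENGTH = 8192   # raised from the official example's 4096
SAMPLES_PER_SOURCE = 256     # 4 x 256 = 1024 calibration samples

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# (repo, config, split, text extractor) -- repo IDs and field names are
# illustrative stand-ins for the four sources listed above.
SOURCES = [
    ("HuggingFaceH4/ultrachat_200k", None,   "train_sft", lambda ex: ex["messages"][0]["content"]),
    ("openai/gsm8k",                 "main", "train",     lambda ex: ex["question"] + "\n" + ex["answer"]),
    ("sahil2801/CodeAlpaca-20k",     None,   "train",     lambda ex: ex["instruction"] + "\n" + ex["output"]),
    ("CohereForAI/aya_dataset",      None,   "train",     lambda ex: ex["inputs"] + "\n" + ex["targets"]),
]

parts = []
for repo, config, split, extract in SOURCES:
    ds = load_dataset(repo, config, split=f"{split}[:{SAMPLES_PER_SOURCE}]")
    parts.append(ds.map(lambda ex, fn=extract: {"text": fn(ex)},
                        remove_columns=ds.column_names))
calibration = concatenate_datasets(parts).shuffle(seed=42)

def tokenize(sample):
    return tokenizer(sample["text"], max_length=MAX_SEQUENCE_LENGTH,
                     truncation=True, add_special_tokens=False)

calibration = calibration.map(tokenize, remove_columns=calibration.column_names)

# NVFP4 weights; the real script may ignore additional modules (e.g. MoE gates).
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=calibration,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=len(calibration),
)

model.save_pretrained("Qwen3.5-122B-A10B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3.5-122B-A10B-NVFP4")
```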
While the quantization required transformers v5, the original (transformers v4) tokenizer files have been put back for simple execution on current vLLM versions. The transformers v5 tokenizer files produced by llm-compressor can be found in the transformers_v5 folder.
About FP8 KV cache
In our testing, the Qwen3.5 Mamba hybrid architecture did not play well with FP8 KV cache:
- vLLM dynamic FP8 KV cache (`--kv-cache-dtype fp8_e4m3 --calculate-kv-scales`) appeared to work initially, but quality degraded rapidly into gibberish.
- Static FP8 scales via llm-compressor (`kv_cache_scheme` in the recipe) corrupted the NVFP4 weight quantization during calibration. Because FP8 is injected into the forward pass during scale computation, layers with mismatched head dimensions (256 for attention vs 128 for linear attention) produced corrupted activations that propagated through the network, poisoning the weight quantization scales. The resulting model output gibberish even when FP8 KV cache was disabled at inference: the weights themselves were permanently damaged. Note that static FP8 KV scales stored in a checkpoint are passive metadata and still require explicit activation via `--kv-cache-dtype fp8_e4m3` at vLLM startup to be used; however, the corruption occurred during quantization, not at inference time.
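For reference, the static-scale attempt amounted to adding a `kv_cache_scheme` entry to the quantization recipe, roughly as sketched below. The field values follow the generic llm-compressor FP8 KV cache example rather than our exact recipe and are shown only to make the failure mode concrete; do not combine this with NVFP4 on this architecture.

```python
# Roughly what the failed attempt looked like: NVFP4 weights plus static FP8
# KV cache scales computed during the same calibration pass. On this hybrid
# attention / linear-attention architecture it corrupted the weight scales.
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
    kv_cache_scheme={        # field values taken from the generic FP8 KV example
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
)
```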
Qwen3.5 Profiles
Alongside support for dynamic thinking and non-thinking modes, the Qwen team offers 4 sampling parameter profiles:
- Thinking General
- Thinking Coding
- Instruct General
- Instruct Reasoning (we prefer to call it Instruct Creative internally)
Manually configuring these parameters for every AI client can be difficult. To solve this, we built a lightweight reverse proxy that exposes the 4 profiles as virtual model names. It handles request transformation on the fly using a single inference server as backend. View the project on our GitHub.
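The sketch below is not that proxy (see the GitHub project for the real implementation); it only illustrates the idea of mapping virtual model names to sampling-parameter profiles and forwarding the rewritten request to the single vLLM backend. The virtual model names and sampling values are placeholders, not the official Qwen profile parameters.

```python
# Minimal illustration of the virtual-model-name idea (not the actual proxy):
# each virtual model injects a sampling profile, then the request is forwarded
# to the real backend model. Streaming responses are not handled here.
from fastapi import FastAPI, Request
import httpx

BACKEND = "http://127.0.0.1:8000/v1/chat/completions"
REAL_MODEL = "Qwen3.5-122B-A10B"

# Placeholder profiles; substitute the official Qwen recommendations.
PROFILES = {
    "qwen3.5-thinking-general":  {"temperature": 0.6, "top_p": 0.95},
    "qwen3.5-thinking-coding":   {"temperature": 0.6, "top_p": 0.95},
    "qwen3.5-instruct-general":  {"temperature": 0.7, "top_p": 0.8},
    "qwen3.5-instruct-creative": {"temperature": 1.0, "top_p": 0.9},
}

app = FastAPI()

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    body = await request.json()
    profile = PROFILES.get(body.get("model"), {})
    # Profile values act as defaults; anything the client sets explicitly wins.
    body = {**profile, **body, "model": REAL_MODEL}
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(BACKEND, json=body)
    return resp.json()
```

Run something like this with uvicorn in front of the vLLM container and point clients at the virtual model names instead of the real one.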
Inference
We run this model with vLLM. Here is a sample execution command for an RTX 6000 Pro Blackwell:
docker run --rm --name 'Qwen3.5-122B-A10B-NVFP4' \
--runtime=nvidia --gpus 'all' --ipc=host \
-e 'HF_TOKEN' \
-e 'VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1' \
-v '/srv/cache:/root/.cache' \
-p '127.0.0.1:8000:8000' \
'vllm/vllm-openai:v0.18.0-cu130' \
'ig1/Qwen3.5-122B-A10B-NVFP4' \
--served-model-name 'Qwen3.5-122B-A10B' \
--reasoning-parser 'qwen3' \
--enable-auto-tool-choice \
--tool-call-parser 'qwen3_coder' \
--max-model-len 'auto' \
--gpu-memory-utilization '0.95' \
--max-cudagraph-capture-size 256 \
--max-num-seqs 256
A few notes about some of the parameters:
- Adapt the `/srv/cache:/root/.cache` mount point to your liking. It contains files you want to keep between multiple runs (Dynamo bytecode and AOT artifacts from torch.compile, but most importantly the Hugging Face folder for the model).
- `VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1` allows for more precise CUDA graph VRAM estimation. It should become the default once vLLM reaches v0.19.0, at which point you can simply remove it.
- `--max-cudagraph-capture-size 256 --max-num-seqs 256` was necessary to reduce the CUDA graphs and avoid a CUDA OOM (see below).
- If you deploy the model across several GPUs using tensor parallelism, be sure to check the official recipe, as other flags are needed.
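Once the container is up, a quick smoke test against the OpenAI-compatible endpoint looks like this (assuming the `openai` Python client; the model name matches `--served-model-name`):

```python
# Quick smoke test against the container started above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="Qwen3.5-122B-A10B",  # matches --served-model-name
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```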
CUDA graphs
The default vLLM CUDA graph capture size was too big for a single RTX 6000 Pro Blackwell, so we reduced it to 256. Since `max-num-seqs` cannot be greater than `max-cudagraph-capture-size`, we reduced it as well.
| Graph/Seqs value | Available KV cache memory | KV cache size (tokens) |
|---|---|---|
| 512 | CUDA Out Of Memory | n/a |
| 256 | 12.16 GiB | 132,048 |
| 128 | 12.38 GiB | 134,144 |
| 64 | 12.60 GiB | 136,240 |