[Bug] Model outputs only "!" because quantization_config.ignore is missing fused projection names (in_proj_ba / in_proj_qkvz) for linear attention layers
## Environment
- vLLM: 0.16.0rc2.dev236+g3b30e6150 (avarok/dgx-vllm-nvfp4-kernel:v23)
- Hardware: NVIDIA DGX Spark / GB10 Grace Blackwell, SM 12.1
## Symptom
The model loads and runs at normal speed, but outputs only `!` (token ID 0) for every token of every response.
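A toy sketch of why the output is always token 0 (illustrative only, not vLLM code; assumes greedy decoding): if the projection weights are never loaded, every token gets the same score, and argmax breaks the tie toward the lowest token ID, which decodes to `!`.

```python
# Hypothetical sketch: zero-initialized projections yield constant logits,
# and greedy argmax resolves the tie to the lowest token ID (0 -> "!").
logits = [0.0] * 8  # every token scores the same when the weights are all zeros
chosen = max(range(len(logits)), key=logits.__getitem__)  # argmax; ties -> first index
print(chosen)  # 0
```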
## Root Cause
This model has a hybrid architecture: 36 of its 48 decoder layers use linear (GDN delta-net) attention with plain BF16 weights in the checkpoint. Those weights are correctly listed in `quantization_config.ignore` in config.json under their checkpoint names (`in_proj_a`, `in_proj_b`, `in_proj_qkv`, `in_proj_z`).

The problem: vLLM fuses these into two stacked parameters (`in_proj_ba`, `in_proj_qkvz`) before the ignore check runs. The fused names aren't in the ignore list, so they get NVFP4-quantized. The weight loader then looks for `in_proj_ba.weight` in `params_dict`, can't find it (quantized params use `weight_packed`), silently skips loading, and all 36 linear attention layers stay zero-initialized, producing garbage output: `!` on every token.
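A minimal sketch of the name mismatch (layer index and prefixes are illustrative): the ignore list carries the checkpoint names, so a membership check on the fused runtime names fails for both of them.

```python
# Illustrative only: checkpoint-style ignore entries vs. the fused runtime names.
ignore = [
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.linear_attn.in_proj_b",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.linear_attn.in_proj_z",
]
fused = [
    "model.language_model.layers.0.linear_attn.in_proj_ba",
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
]
missing = [name for name in fused if name not in ignore]
print(len(missing))  # 2 -- both fused names miss the ignore check and get quantized
```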
Visible at startup as 72 warnings (2 per linear attention layer):

```
WARNING [qwen3_5.py:500] Parameter layers.0.linear_attn.in_proj_ba.weight not found in params_dict, skip loading
WARNING [qwen3_5.py:500] Parameter layers.0.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading
```
## Fix
Add the fused names to `quantization_config.ignore` in config.json:

```python
import json

with open('config.json') as f:
    cfg = json.load(f)

layer_types = cfg['text_config']['layer_types']
ignore = cfg['quantization_config']['ignore']
existing = set(ignore)

for i, lt in enumerate(layer_types):
    if lt == 'linear_attention':
        for name in [f"model.language_model.layers.{i}.linear_attn.in_proj_ba",
                     f"model.language_model.layers.{i}.linear_attn.in_proj_qkvz"]:
            if name not in existing:
                ignore.append(name)

with open('config.json', 'w') as f:
    json.dump(cfg, f, indent=2)
```
After applying: zero warnings at startup, model generates correct text.
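As a sanity check, the patch logic is idempotent, so re-running it won't duplicate entries. A self-contained sketch with a toy config (field names mirror the script above; the layer list and values are made up for illustration):

```python
# Toy config mirroring only the fields the patch script touches (values made up).
cfg = {
    "text_config": {"layer_types": ["linear_attention", "full_attention", "linear_attention"]},
    "quantization_config": {"ignore": []},
}

def add_fused_ignores(cfg):
    """Same logic as the fix script, applied to an in-memory dict."""
    ignore = cfg["quantization_config"]["ignore"]
    existing = set(ignore)
    for i, lt in enumerate(cfg["text_config"]["layer_types"]):
        if lt == "linear_attention":
            for name in (f"model.language_model.layers.{i}.linear_attn.in_proj_ba",
                         f"model.language_model.layers.{i}.linear_attn.in_proj_qkvz"):
                if name not in existing:
                    ignore.append(name)

add_fused_ignores(cfg)
n_first = len(cfg["quantization_config"]["ignore"])
add_fused_ignores(cfg)  # second run adds nothing: names are already present
assert len(cfg["quantization_config"]["ignore"]) == n_first
print(n_first)  # 4 -- two linear-attention layers x two fused names each
```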
Please use the vLLM nightly build and make sure you've pulled the latest Docker images.
This one? `vllm/vllm-openai:nightly`
Here is the issue I'm running into: `vllm/vllm-openai:cu130-nightly` "works" even though it's buggy, and it gets around the `!!!!` output issue from the model (with MTP enabled or disabled). The problem I found doesn't seem to match the main branch of the vLLM git repo. I've applied all of the GB10 patches and I'm still getting `!!!`; I'm at a loss for how to fix this. It sort of works with `vllm/vllm-openai:cu130-nightly`, but I'm only getting 10-11 TPS even with MTP enabled.
@scottglareyourunning are you running this from a docker image? If so, try adding this to your container config:
shm_size: "8gb"

(or, as a `docker run` argument) `--shm-size "8gb"`