[Bug] Model outputs only "!" β€” quantization_config.ignore missing fused projection names (in_proj_ba / in_proj_qkvz) for linear attention layers

#4
by scottgl - opened

Environment

  • vLLM: 0.16.0rc2.dev236+g3b30e6150 (avarok/dgx-vllm-nvfp4-kernel:v23)
  • Hardware: NVIDIA DGX Spark / GB10 Grace Blackwell, SM 12.1

Symptom

Model loads and runs at normal speed but outputs only ! (token ID 0) for every token, every response.

Root Cause

This model has a hybrid architecture: 36 of 48 decoder layers use linear (GDN delta-net) attention with plain BF16
weights in the checkpoint. Those weights are correctly listed in quantization_config.ignore in config.json under their
checkpoint names (in_proj_a, in_proj_b, in_proj_qkv, in_proj_z).

The problem: vLLM fuses these into two stacked parameters (in_proj_ba, in_proj_qkvz) before the ignore check runs. The fused
names aren't in the ignore list, so they get NVFP4-quantized. The weight loader then looks for in_proj_ba.weight in params_dict,
can't find it (quantized params use weight_packed), silently skips loading, and all 36 linear attention layers stay
zero-initialized → garbage output → ! on every token.
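The mismatch can be sketched as a simple set-membership check. This is an illustration only, not vLLM's actual code path; the layer-0 names are taken from the warnings in this report:

```python
# The ignore list holds the per-projection checkpoint names...
ignore = {
    "model.language_model.layers.0.linear_attn.in_proj_a",
    "model.language_model.layers.0.linear_attn.in_proj_b",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.linear_attn.in_proj_z",
}

# ...but vLLM stacks a/b into in_proj_ba and qkv/z into in_proj_qkvz
# before the ignore check runs, so the check sees the fused names.
fused = [
    "model.language_model.layers.0.linear_attn.in_proj_ba",
    "model.language_model.layers.0.linear_attn.in_proj_qkvz",
]

for name in fused:
    # Neither fused name matches, so both fall through to quantization.
    print(name, "ignored" if name in ignore else "quantized")
```

Both fused names print "quantized": they never match the ignore list, so the layers get NVFP4-quantized and their BF16 checkpoint weights are skipped at load time.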

Visible at startup as 72 warnings (2 per linear attention layer):

WARNING [qwen3_5.py:500] Parameter layers.0.linear_attn.in_proj_ba.weight not found in params_dict, skip loading
WARNING [qwen3_5.py:500] Parameter layers.0.linear_attn.in_proj_qkvz.weight not found in params_dict, skip loading

Fix

Add the fused names to quantization_config.ignore in config.json:

import json

with open('config.json') as f:
    cfg = json.load(f)

layer_types = cfg['text_config']['layer_types']
ignore = cfg['quantization_config']['ignore']
existing = set(ignore)

for i, lt in enumerate(layer_types):
    if lt == 'linear_attention':
        for name in [f"model.language_model.layers.{i}.linear_attn.in_proj_ba",
                     f"model.language_model.layers.{i}.linear_attn.in_proj_qkvz"]:
            if name not in existing:
                ignore.append(name)

with open('config.json', 'w') as f:
    json.dump(cfg, f, indent=2)

After applying: zero warnings at startup, model generates correct text.
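To confirm the fix took, you can check the config for any fused names still missing from the ignore list. A small helper sketch (the function name is my own; the config shape matches the script above), shown here on a hypothetical two-layer config rather than a real file:

```python
def missing_fused_ignores(cfg: dict) -> list[str]:
    """Return fused linear-attention projection names absent from the ignore list."""
    ignore = set(cfg["quantization_config"]["ignore"])
    missing = []
    for i, lt in enumerate(cfg["text_config"]["layer_types"]):
        if lt == "linear_attention":
            for proj in ("in_proj_ba", "in_proj_qkvz"):
                name = f"model.language_model.layers.{i}.linear_attn.{proj}"
                if name not in ignore:
                    missing.append(name)
    return missing

# Hypothetical minimal config: layer 0 is linear attention, but only
# in_proj_ba was added to the ignore list, so in_proj_qkvz is flagged.
cfg = {
    "text_config": {"layer_types": ["linear_attention", "full_attention"]},
    "quantization_config": {"ignore": [
        "model.language_model.layers.0.linear_attn.in_proj_ba",
    ]},
}
print(missing_fused_ignores(cfg))
# ['model.language_model.layers.0.linear_attn.in_proj_qkvz']
```

On a patched config.json (load it with json.load and pass the dict in), this should return an empty list.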

Owner

Please use vllm nightly and make sure you pulled the docker images.

This one? vllm/vllm-openai:nightly

Here is the issue I'm running into: vllm/vllm-openai:cu130-nightly "works" despite being buggy, and it gets around the '!!!!' output issue from the model (MTP enabled or disabled). The problem I found doesn't seem to match the main branch of the vLLM git repo. I've applied all of the GB10 patches and I'm still getting '!!!', so I'm at a loss for how to fix this. It sort of works with vllm/vllm-openai:cu130-nightly, but I'm only getting 10-11 TPS even with MTP enabled.

@scottgl are you running this from a docker image? If so, try adding this to your container config:
shm_size: "8gb"
(or args)
--shm-size "8gb"
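
In docker-compose form (the image tag is the nightly one mentioned in this thread; adjust to your setup), that would look roughly like:

```yaml
# Sketch of a compose service with the suggested shared-memory size.
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    shm_size: "8gb"
```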
