modeling_nemotron_h.py: Multiple bugs in HybridMambaAttentionDynamicCache break generation with CUDA fast path

#13
by trohrbaugh - opened

Model: nvidia/Nemotron-Cascade-2-30B-A3B

Environment:

  • transformers (latest)
  • mamba-ssm 2.3.1
  • causal-conv1d (latest)
  • CUDA-capable GPU (RTX PRO 6000 Blackwell)

Summary:

The remote modeling_nemotron_h.py has three bugs in its cache handling that make the model produce garbage output or crash entirely when the Mamba CUDA
fast path is available. Without mamba-ssm installed, the model falls back to a naive implementation that silently produces incoherent output due to
missing state.


Bug 1: prepare_inputs_for_generation passes cache as wrong kwarg name

prepare_inputs_for_generation passes the cache object as "past_key_values" in model_inputs:

model_inputs.update(
    {
        ...
        "past_key_values": past_key_values,  # ← wrong name
        ...
    }
)

But NemotronHForCausalLM.forward() expects it as cache_params:

def forward(self, input_ids=None, ..., cache_params=None, ...):
    nemotron_h_outputs = self.backbone(
        input_ids,
        cache_params=cache_params,  # ← receives None
        ...
    )

Result: The cache silently ends up in **kwargs and is never used. Every generation step runs without Mamba recurrent state, producing garbage token soup
(e.g., "and and and and overt in from ge discipl...").

Fix:

In prepare_inputs_for_generation:

"cache_params": past_key_values, # was: "past_key_values": past_key_values


Bug 2: HybridMambaAttentionDynamicCache.__init__ doesn't set self.conv_kernel_size

The init uses conv_kernel_size as a local variable but never stores it as an instance attribute:

def __init__(self, config, batch_size, dtype=torch.float16, device=None):
    ...
    conv_kernel_size = config.conv_kernel  # local variable only
    ...
    # self.conv_kernel_size is never set

But cuda_kernels_forward accesses it on the cache object (line 461):

(cache_params.conv_kernel_size - hidden_states_B_C_transposed.shape[-1], 0)

Result: AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute 'conv_kernel_size'

Fix:
conv_kernel_size = config.conv_kernel
self.conv_kernel_size = conv_kernel_size # add this line


Bug 3: update_conv_state / update_ssm_state treat list as tensor

Both methods access .device on self.conv_states and self.ssm_states, which are Python lists, not tensors:

def update_conv_state(self, layer_idx, new_conv_state, cache_init=False):
    if cache_init:
        self.conv_states[layer_idx] = new_conv_state.to(self.conv_states.device)  # ← list has no .device
    else:
        ...
        self.conv_states[layer_idx][:, :, -1] = new_conv_state[:, 0, :].to(self.conv_states.device)  # ← same

Result: AttributeError: 'list' object has no attribute 'device'

Fix:

Replace all occurrences:

self.conv_states.device → self.conv_states[layer_idx].device
self.ssm_states.device → self.ssm_states[layer_idx].device
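The failure mode is easy to reproduce without torch: a Python list has no .device attribute, only its elements do. A minimal stand-in sketch (the _Tensor class is a hypothetical placeholder for a torch tensor):

```python
# Illustration of Bug 3 with a stand-in for torch tensors (_Tensor is a
# hypothetical placeholder): conv_states is a plain Python list holding one
# tensor per layer, so .device must be read from an element, not the list.
class _Tensor:
    def __init__(self, device):
        self.device = device

conv_states = [_Tensor("cuda:0"), _Tensor("cuda:0")]  # one entry per layer
layer_idx = 1

assert not hasattr(conv_states, "device")         # the list has no .device
assert conv_states[layer_idx].device == "cuda:0"  # the per-layer tensor does
```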


Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Cascade-2-30B-A3B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2+2?"}],
    add_generation_prompt=True, tokenize=False,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=1.0, top_p=0.95)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

  • Without mamba-ssm: Produces incoherent output (Bug 1 — no cache state)
  • With mamba-ssm: Crashes with AttributeError (Bugs 2 and 3)

Complete fix (sed one-liner for the cached remote file)

FILE="/path/to/.hf_home/modules/transformers_modules/nvidia/Nemotron-Cascade-2-30B-A3B/modeling_nemotron_h.py"

Bug 1: Fix kwarg name

sed -i 's/"past_key_values": past_key_values,/"cache_params": past_key_values,/' "$FILE"

Bug 2: Add missing self.conv_kernel_size

sed -i 's/        conv_kernel_size = config.conv_kernel/        conv_kernel_size = config.conv_kernel\n        self.conv_kernel_size = conv_kernel_size/' "$FILE"

Bug 3: Fix list.device → list[layer_idx].device

sed -i 's/self.conv_states.device/self.conv_states[layer_idx].device/g' "$FILE"
sed -i 's/self.ssm_states.device/self.ssm_states[layer_idx].device/g' "$FILE"
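For environments without sed, the same substitutions can be applied with a short Python script. This is a sketch: the target path is whatever FILE points at above, and the 8-space indentation in the Bug 2 pattern is an assumption about the cached file's formatting.

```python
# Apply the three fixes to the cached remote file with plain string replaces.
# Patterns mirror the sed commands; the Bug 2 indentation is assumed.
FIXES = [
    # Bug 1: pass the cache under the name forward() actually declares
    ('"past_key_values": past_key_values,',
     '"cache_params": past_key_values,'),
    # Bug 2: store conv_kernel_size on the cache instance
    ("        conv_kernel_size = config.conv_kernel",
     "        conv_kernel_size = config.conv_kernel\n"
     "        self.conv_kernel_size = conv_kernel_size"),
    # Bug 3: index the per-layer tensor before reading .device
    ("self.conv_states.device", "self.conv_states[layer_idx].device"),
    ("self.ssm_states.device", "self.ssm_states[layer_idx].device"),
]

def patch_source(text: str) -> str:
    for old, new in FIXES:
        text = text.replace(old, new)
    return text

# Usage (path placeholder, same FILE as the sed version):
# from pathlib import Path
# path = Path(FILE)
# path.write_text(patch_source(path.read_text()))
```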

Hi @trohrbaugh
Thanks for reporting this issue.
Could you please try transformers v5.3.0+ and not pass trust_remote_code=True? The modeling code in transformers should already support the hybrid cache.

Meanwhile, we will update modeling_nemotron_h.py to match https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/modeling_nemotron_h.py
Thank you.

The latest transformers still produces garbled responses. Sorry I can't be more specific than that, but I found two other problems, with fixes... though maybe you know of a better way...

File: modeling_nemotron_h.py (in the NemotronHCache.__init__ method)
What was wrong: The cache allocated conv_states tensors with intermediate_size (4096 = mamba_num_heads × mamba_head_dim), but the Mamba mixer's conv1d layer operates on the full projected dimension (6144 = intermediate_size + 2 × n_groups × ssm_state_size), which includes the B and C state channels concatenated with the hidden states. When causal_conv1d_update compared conv_state.shape[1] (4096) against weight.shape[0] (6144), it failed with "weight must have shape (dim, width)".
The fix: Added a conv_dim variable matching the model's own formula from NemotronHMamba2Mixer.__init__, and used it for the cache allocation:
torch.zeros(batch_size, conv_dim, conv_kernel_size, ...)
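The arithmetic can be checked directly. The intermediate size (4096) and the conv1d weight dimension (6144) come from the error above; mamba_num_heads = 128, mamba_head_dim = 32, n_groups = 8, and ssm_state_size = 128 are assumed example values chosen to satisfy both numbers, not confirmed config values.

```python
# Worked arithmetic behind the conv_dim fix. 4096 and 6144 come from the
# report; the head/group/state sizes below are ASSUMED example values.
mamba_num_heads, mamba_head_dim = 128, 32
intermediate_size = mamba_num_heads * mamba_head_dim          # 4096
n_groups, ssm_state_size = 8, 128
conv_dim = intermediate_size + 2 * n_groups * ssm_state_size  # 6144

assert intermediate_size == 4096  # what the buggy cache allocated
assert conv_dim == 6144           # what the conv1d weights expect
```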

and

File: modeling_nemotron_h.py (in cuda_kernels_forward)
What was wrong: The single-token decode path was guarded by cache_position[0] > 0, which is true whenever the KV cache has been initialized — but modern transformers can still pass multi-token sequences after cache init (e.g., during the second forward call of generation). The causal_conv1d_update CUDA kernel expects a 2D (batch, dim) input for single-step updates, but received a 3D (batch, seq_len, dim) tensor, causing the shape check to fail.
The fix: Added and hidden_states.shape[1] == 1 to the condition so the single-step CUDA path is only used when there's actually a single token:
if cache_params is not None and cache_position is not None and cache_position[0] > 0 and hidden_states.shape[1] == 1:
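A hedged sketch of the corrected guard logic (the function name is illustrative, not the real module API): the single-step causal_conv1d_update path only accepts a 2D (batch, dim) input, i.e. exactly one new token.

```python
# Sketch of the corrected dispatch: only take the single-step CUDA update
# path when the cache is initialized AND exactly one token is being decoded.
def use_single_step_conv_update(cache_params, cache_position, seq_len):
    return bool(
        cache_params is not None
        and cache_position is not None
        and cache_position[0] > 0
        and seq_len == 1  # the added guard
    )

assert use_single_step_conv_update(object(), [5], seq_len=1) is True
# A multi-token forward after cache init must fall through to the prefill path:
assert use_single_step_conv_update(object(), [5], seq_len=4) is False
```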
