modeling_nemotron_h.py: Multiple bugs in HybridMambaAttentionDynamicCache break generation with CUDA fast path

#13
by trohrbaugh - opened

Model: nvidia/Nemotron-Cascade-2-30B-A3B

Environment:

  • transformers (latest)
  • mamba-ssm 2.3.1
  • causal-conv1d (latest)
  • CUDA-capable GPU (RTX PRO 6000 Blackwell)

Summary:

The remote modeling_nemotron_h.py has three bugs in its cache handling that make the model produce garbage output or crash entirely when the Mamba CUDA
fast path is available. Without mamba-ssm installed, the model falls back to a naive implementation that silently produces incoherent output due to
missing state.


Bug 1: prepare_inputs_for_generation passes cache as wrong kwarg name

prepare_inputs_for_generation passes the cache object as "past_key_values" in model_inputs:

model_inputs.update(
    {
        ...
        "past_key_values": past_key_values,  # ← wrong name
        ...
    }
)

But NemotronHForCausalLM.forward() expects it as cache_params:

def forward(self, input_ids=None, ..., cache_params=None, ...):
    nemotron_h_outputs = self.backbone(
        input_ids,
        cache_params=cache_params,  # ← receives None
        ...
    )

Result: The cache silently ends up in **kwargs and is never used. Every generation step runs without Mamba recurrent state, producing garbage token soup
(e.g., "and and and and overt in from ge discipl...").

Fix:

In prepare_inputs_for_generation:

"cache_params": past_key_values, # was: "past_key_values": past_key_values


Bug 2: HybridMambaAttentionDynamicCache.__init__ doesn't set self.conv_kernel_size

The init uses conv_kernel_size as a local variable but never stores it as an instance attribute:

def __init__(self, config, batch_size, dtype=torch.float16, device=None):
    ...
    conv_kernel_size = config.conv_kernel  # local variable only
    ...
    # self.conv_kernel_size is never set

But cuda_kernels_forward accesses it on the cache object (line 461):

(cache_params.conv_kernel_size - hidden_states_B_C_transposed.shape[-1], 0)

Result: AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute 'conv_kernel_size'

Fix:
conv_kernel_size = config.conv_kernel
self.conv_kernel_size = conv_kernel_size # add this line


Bug 3: update_conv_state / update_ssm_state treat list as tensor

Both methods access .device on self.conv_states and self.ssm_states, which are Python lists, not tensors:

def update_conv_state(self, layer_idx, new_conv_state, cache_init=False):
    if cache_init:
        self.conv_states[layer_idx] = new_conv_state.to(self.conv_states.device)  # ← list has no .device
    else:
        ...
        self.conv_states[layer_idx][:, :, -1] = new_conv_state[:, 0, :].to(self.conv_states.device)  # ← same

Result: AttributeError: 'list' object has no attribute 'device'

Fix:

Replace all occurrences:

self.conv_states.device → self.conv_states[layer_idx].device
self.ssm_states.device → self.ssm_states[layer_idx].device
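The failure mode is easy to reproduce without torch: a Python list has no .device attribute, only its elements do. A minimal stand-in sketch (the _Tensor class is a hypothetical placeholder for a torch tensor):

```python
# Illustration of Bug 3 with a stand-in for torch tensors (_Tensor is a
# hypothetical placeholder): conv_states is a plain Python list holding one
# tensor per layer, so .device must be read from an element, not the list.
class _Tensor:
    def __init__(self, device):
        self.device = device

conv_states = [_Tensor("cuda:0"), _Tensor("cuda:0")]  # one entry per layer
layer_idx = 1

assert not hasattr(conv_states, "device")         # the list has no .device
assert conv_states[layer_idx].device == "cuda:0"  # the per-layer tensor does
```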


Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Nemotron-Cascade-2-30B-A3B"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is 2+2?"}],
    add_generation_prompt=True, tokenize=False,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=1.0, top_p=0.95)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

  • Without mamba-ssm: Produces incoherent output (Bug 1 — no cache state)
  • With mamba-ssm: Crashes with AttributeError (Bugs 2 and 3)

Complete fix (sed one-liner for the cached remote file)

FILE="/path/to/.hf_home/modules/transformers_modules/nvidia/Nemotron-Cascade-2-30B-A3B/modeling_nemotron_h.py"

Bug 1: Fix kwarg name

sed -i 's/"past_key_values": past_key_values,/"cache_params": past_key_values,/' "$FILE"

Bug 2: Add missing self.conv_kernel_size

sed -i 's/        conv_kernel_size = config.conv_kernel/        conv_kernel_size = config.conv_kernel\n        self.conv_kernel_size = conv_kernel_size/' "$FILE"

Bug 3: Fix list.device → list[layer_idx].device

sed -i 's/self.conv_states.device/self.conv_states[layer_idx].device/g' "$FILE"
sed -i 's/self.ssm_states.device/self.ssm_states[layer_idx].device/g' "$FILE"
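For environments without sed, the same substitutions can be applied with a short Python script. This is a sketch: the target path is whatever FILE points at above, and the 8-space indentation in the Bug 2 pattern is an assumption about the cached file's formatting.

```python
# Apply the three fixes to the cached remote file with plain string replaces.
# Patterns mirror the sed commands; the Bug 2 indentation is assumed.
FIXES = [
    # Bug 1: pass the cache under the name forward() actually declares
    ('"past_key_values": past_key_values,',
     '"cache_params": past_key_values,'),
    # Bug 2: store conv_kernel_size on the cache instance
    ("        conv_kernel_size = config.conv_kernel",
     "        conv_kernel_size = config.conv_kernel\n"
     "        self.conv_kernel_size = conv_kernel_size"),
    # Bug 3: index the per-layer tensor before reading .device
    ("self.conv_states.device", "self.conv_states[layer_idx].device"),
    ("self.ssm_states.device", "self.ssm_states[layer_idx].device"),
]

def patch_source(text: str) -> str:
    for old, new in FIXES:
        text = text.replace(old, new)
    return text

# Usage (path placeholder, same FILE as the sed version):
# from pathlib import Path
# path = Path(FILE)
# path.write_text(patch_source(path.read_text()))
```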

Hi @trohrbaugh
Thanks for reporting this issue.
Could you please try transformers v5.3.0+ and not pass trust_remote_code=True? The modeling code in transformers should already support the hybrid cache.

Meanwhile, we will update modeling_nemotron_h.py to match https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/modeling_nemotron_h.py
Thank you.

The latest transformers still produces garbled responses. Sorry I can't be more specific than that, but I found two other problems, with fixes... though maybe you know of a better way...

File: modeling_nemotron_h.py (in the NemotronHCache.__init__ method)
What was wrong: The cache allocated conv_states tensors with intermediate_size (4096 = mamba_num_heads × mamba_head_dim), but the Mamba mixer's conv1d layer operates on the full projected dimension (6144 = intermediate_size + 2 × n_groups × ssm_state_size), which includes the B and C state channels concatenated with the hidden states. When causal_conv1d_update compared conv_state.shape[1] (4096) against weight.shape[0] (6144), it failed with "weight must have shape (dim, width)".
The fix: Added a conv_dim variable matching the model's own formula from NemotronHMamba2Mixer.__init__, and used it for the cache allocation:
torch.zeros(batch_size, conv_dim, conv_kernel_size, ...)
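The arithmetic can be checked directly. The intermediate size (4096) and the conv1d weight dimension (6144) come from the error above; mamba_num_heads = 128, mamba_head_dim = 32, n_groups = 8, and ssm_state_size = 128 are assumed example values chosen to satisfy both numbers, not confirmed config values.

```python
# Worked arithmetic behind the conv_dim fix. 4096 and 6144 come from the
# report; the head/group/state sizes below are ASSUMED example values.
mamba_num_heads, mamba_head_dim = 128, 32
intermediate_size = mamba_num_heads * mamba_head_dim          # 4096
n_groups, ssm_state_size = 8, 128
conv_dim = intermediate_size + 2 * n_groups * ssm_state_size  # 6144

assert intermediate_size == 4096  # what the buggy cache allocated
assert conv_dim == 6144           # what the conv1d weights expect
```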

and

File: modeling_nemotron_h.py (in cuda_kernels_forward)
What was wrong: The single-token decode path was guarded by cache_position[0] > 0, which is true whenever the KV cache has been initialized — but modern transformers can still pass multi-token sequences after cache init (e.g., during the second forward call of generation). The causal_conv1d_update CUDA kernel expects a 2D (batch, dim) input for single-step updates, but received a 3D (batch, seq_len, dim) tensor, causing the shape check to fail.
The fix: Added and hidden_states.shape[1] == 1 to the condition so the single-step CUDA path is only used when there's actually a single token:
if cache_params is not None and cache_position is not None and cache_position[0] > 0 and hidden_states.shape[1] == 1:
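A hedged sketch of the corrected guard logic (the function name is illustrative, not the real module API): the single-step causal_conv1d_update path only accepts a 2D (batch, dim) input, i.e. exactly one new token.

```python
# Sketch of the corrected dispatch: only take the single-step CUDA update
# path when the cache is initialized AND exactly one token is being decoded.
def use_single_step_conv_update(cache_params, cache_position, seq_len):
    return bool(
        cache_params is not None
        and cache_position is not None
        and cache_position[0] > 0
        and seq_len == 1  # the added guard
    )

assert use_single_step_conv_update(object(), [5], seq_len=1) is True
# A multi-token forward after cache init must fall through to the prefill path:
assert use_single_step_conv_update(object(), [5], seq_len=4) is False
```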
