# Nemotron-Cascade-2-30B-A3B with Block Repeat (Layer 5)

Applies the RYS (Repeat Your Steps) layer-duplication method to NVIDIA's hybrid Mamba-2 + MoE + GQA architecture, achieving a +6.7pp improvement on BBH-style reasoning benchmarks with no training and no weight changes.
## Key Finding: Hybrid Architectures Allow Multiple Repeat Strategies

Unlike pure MoE models such as GPT-OSS-20B, where only attention repeat works, Nemotron-Cascade-2's hybrid architecture (Mamba-2 + MoE + GQA attention) shows that all three mixer types benefit from repetition at specific positions:
| Configuration | Score | Delta | Layer Type |
|---|---|---|---|
| attn-L5 | 11/15 | +6.7pp | GQA Attention (1st attention layer) |
| attn-L19,L26 | 11/15 | +6.7pp | GQA Attention pair |
| moe-L43,L45 | 11/15 | +6.7pp | MoE (late layers) |
| mamba-L44,L46 | 11/15 | +6.7pp | Mamba-2 (late layers) |
| baseline | 10/15 | 0.0pp | - |
| attn-L12 | 10/15 | 0.0pp | GQA Attention |
| attn-L33 | 6/15 | -26.7pp | GQA Attention (middle) |
| attn-ALL | 3/15 | -46.7pp | All 6 GQA layers |
## Architecture

Nemotron-Cascade-2 has 52 layers built from three types of mixer blocks:
```
Layer 0:  Mamba-2      Layer 26: Attention*
Layer 1:  MoE          Layer 27: MoE
Layer 2:  Mamba-2      Layer 28: Mamba-2
Layer 3:  MoE          ...
Layer 4:  Mamba-2      Layer 33: Attention
Layer 5:  Attention*   <-- THIS LAYER IS REPEATED
Layer 6:  MoE          Layer 34: MoE
Layer 7:  Mamba-2      ...
...                    Layer 42: Attention
Layer 12: Attention    Layer 43: MoE
Layer 13: MoE          Layer 44: Mamba-2
...                    ...
Layer 19: Attention    Layer 50: Mamba-2
Layer 20: MoE          Layer 51: MoE
```
- Mamba-2: 23 layers (SSM for efficient sequence modeling)
- MoE: 23 layers (128 routed experts + 1 shared, top-6 routing)
- GQA Attention: 6 layers at positions 5, 12, 19, 26, 33, 42
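The layout above follows a simple rule: attention sits at six fixed positions, and the remaining 46 layers alternate Mamba-2 and MoE starting with Mamba-2 at layer 0. A minimal sketch reconstructing the pattern (hypothetical helper, not from the model repo):

```python
# Sketch: rebuild the 52-layer mixer pattern described above.
# Assumption: non-attention layers strictly alternate Mamba-2 / MoE.
ATTN_POSITIONS = {5, 12, 19, 26, 33, 42}

def mixer_pattern(num_layers=52):
    pattern = []
    non_attn_idx = 0
    for layer in range(num_layers):
        if layer in ATTN_POSITIONS:
            pattern.append("attention")
        else:
            pattern.append("mamba2" if non_attn_idx % 2 == 0 else "moe")
            non_attn_idx += 1
    return pattern

pattern = mixer_pattern()
print(pattern.count("mamba2"), pattern.count("moe"), pattern.count("attention"))
# 23 23 6 -- matches the layer counts listed above
```

This rule reproduces every labeled layer in the diagram, including MoE at layers 43/45 and Mamba-2 at layers 44/46 used by the alternative configurations below.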
## Cross-Model Comparison
| Model | Architecture | Best Config | Improvement |
|---|---|---|---|
| GPT-OSS-20B | Uniform (Attn+MoE) | Attn L19-20 only | +13.3pp |
| Nemotron-Cascade-2 | Hybrid (Mamba+MoE+GQA) | Any mixer at right position | +6.7pp |
**Key insight:** In uniform architectures, only the dense (attention) component benefits from repetition. In hybrid architectures, all mixer types can benefit, because each mixer type serves a distinct computational role.
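Functionally, block repeat replaces a layer's map `f(x)` with `f(f(x))` using the same frozen weights. A toy illustration in plain Python (not the MLX implementation):

```python
# Toy illustration of block repeat: the same "layer" is applied twice
# per forward pass, with no new weights introduced.
def make_repeat(block):
    def repeated(x):
        return block(block(x))
    return repeated

# A stand-in layer: doubles and shifts its input.
layer = lambda x: 2 * x + 1
repeated_layer = make_repeat(layer)

print(layer(3))           # 7
print(repeated_layer(3))  # layer(layer(3)) = layer(7) = 15
```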
## Usage

### With the loader script (recommended)
```python
from load_model import load_repeat_model
from mlx_lm import generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load_repeat_model()
sampler = make_sampler(temp=0.0)
response = generate(model, tokenizer, prompt="Your question here", max_tokens=512, sampler=sampler)
print(response)
```
### Manual patching
```python
import mlx.nn as nn
from mlx_lm import load

class BlockRepeatWrapper(nn.Module):
    """Wraps a decoder block so its forward pass runs twice."""

    def __init__(self, block):
        super().__init__()
        self.block = block
        # Preserve attributes the model plumbing may inspect.
        for attr in ["block_type"]:
            if hasattr(block, attr):
                setattr(self, attr, getattr(block, attr))

    def __call__(self, x, *args, **kwargs):
        h = self.block(x, *args, **kwargs)
        if isinstance(h, tuple):
            h = h[0]
        h2 = self.block(h, *args, **kwargs)
        if isinstance(h2, tuple):
            h2 = h2[0]
        return h2

model, tokenizer = load("mlx-community/Nemotron-Cascade-2-30B-A3B-4bit")
model.backbone.layers[5] = BlockRepeatWrapper(model.backbone.layers[5])
```
### Alternative configurations (all +6.7pp)

```python
# MoE repeat (late layers)
for idx in [43, 45]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])

# Mamba-2 repeat (late layers)
for idx in [44, 46]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])

# Attention pair repeat
for idx in [19, 26]:
    model.backbone.layers[idx] = BlockRepeatWrapper(model.backbone.layers[idx])
```
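To sanity-check that a wrapper really invokes its block twice and unwraps tuple outputs, you can exercise the same call logic with a counting stub block. This pure-Python stand-in (no `mlx.nn.Module` base, so it runs without MLX) mirrors `BlockRepeatWrapper.__call__`:

```python
# Pure-Python stand-in for BlockRepeatWrapper, used only to verify the
# double-invocation and tuple-unwrapping behavior.
class RepeatStub:
    def __init__(self, block):
        self.block = block

    def __call__(self, x, *args, **kwargs):
        h = self.block(x, *args, **kwargs)
        if isinstance(h, tuple):
            h = h[0]
        h2 = self.block(h, *args, **kwargs)
        if isinstance(h2, tuple):
            h2 = h2[0]
        return h2

calls = []

def block(x):
    calls.append(x)
    return (x + 1, "aux")  # tuple output, as some mixer blocks return

out = RepeatStub(block)(0)
print(out, len(calls))  # 2 2 -> block ran twice; auxiliary output discarded
```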
## Model Details
- Base model: nvidia/Nemotron-Cascade-2-30B-A3B (via mlx-community 4bit)
- Total params: ~30B (3B active per token)
- Modification: Layer 5 (first GQA Attention) repeated once
- Quantization: 4-bit
- License: NVIDIA Open Model License
## Method

Based on llm-circuit-finder (RYS method), extended here to hybrid Mamba + MoE + attention architectures, where it turns out that all mixer types can benefit from repetition.
## Citation

```bibtex
@misc{nemotron-cascade-2-repeat,
  title={Layer Repeat for Hybrid MoE Models: RYS Method on Nemotron-Cascade-2},
  author={shi3z},
  year={2026},
  note={Based on RYS method. Extends attention-repeat findings from GPT-OSS-20B to hybrid architectures.}
}
```