Nemotron-Cascade-2-30B-A3B – Q4_K_M GGUF

GGUF quantization of nvidia/Nemotron-Cascade-2-30B-A3B.

  • Architecture: Hybrid Attention + Mamba (SSM) + MoE (GGUF architecture: nemotron_h_moe) – 30B total parameters, 3B active
  • Quantization: Q4_K_M (k-quant, mixed precision ~4.5 bpw)

Quantization commands

# Download the original weights (convert_hf_to_gguf.py expects a local model directory)
huggingface-cli download nvidia/Nemotron-Cascade-2-30B-A3B --local-dir Nemotron-Cascade-2-30B-A3B

# Convert the HF model to GGUF (bf16)
python llama.cpp/convert_hf_to_gguf.py \
  Nemotron-Cascade-2-30B-A3B \
  --outfile Nemotron-Cascade-2-30B-A3B-bf16.gguf \
  --outtype bf16

# Quantize to Q4_K_M
llama-quantize Nemotron-Cascade-2-30B-A3B-bf16.gguf \
  Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf Q4_K_M
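
As an optional sanity check, llama.cpp's perplexity tool can be run against a small local text file. The file name below (wiki.test.raw) is only a placeholder for whatever evaluation text you have on hand.

# Optional: rough quality check of the quantized model on a local text file
# (wiki.test.raw is a placeholder; any representative plain-text file works)
llama-perplexity -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -f wiki.test.raw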

Usage

Load in LM Studio, llama.cpp, or any GGUF-compatible runtime.
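
For example, with the llama.cpp command-line tools (the prompt, context size, and GPU layer count below are illustrative; adjust them to your hardware):

# Quick chat test with llama.cpp
llama-cli -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf \
  -p "Write a haiku about GPUs." \
  -n 256 \
  -c 4096 \
  -ngl 99   # offload all layers to GPU if available; lower or omit for CPU-only

# Or serve an OpenAI-compatible HTTP endpoint
llama-server -m Nemotron-Cascade-2-30B-A3B-Q4_K_M.gguf -c 4096 --port 8080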
