Nemotron-3-Nano-4B-W4A16

INT4 (W4A16) quantization of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 — NVIDIA's hybrid Mamba2 + Transformer 4B variant, slimmed from Nemotron-Nano-9B-v2 via Nemotron Elastic compression.

Targets the 8–12 GB consumer GPU audience. Native 256K context.

Footprint

Source params: 4B
Disk size: ~5.4 GB
Inference VRAM (greedy decode, no KV pressure): ~6 GB
Native context window: 256K tokens
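As a rough sanity check on the figures above, the weight footprint of a W4A16 checkpoint can be estimated from the parameter count. A minimal sketch; the quantized-fraction values are illustrative assumptions, not measurements of this checkpoint:

```python
def w4a16_weight_bytes(n_params: float, frac_int4: float) -> float:
    """Rough weight footprint: frac_int4 of params stored in 4-bit (0.5 bytes),
    the remainder kept in 16-bit (2 bytes)."""
    return n_params * frac_int4 * 0.5 + n_params * (1 - frac_int4) * 2

# If ~90% of weights were quantized, a 4B model would need ~2.6 GB of weights;
# the ~5.4 GB observed on disk suggests a larger share of layers kept in BF16
# (plus quantization scales and serialization overhead).
print(f"{w4a16_weight_bytes(4e9, 0.9) / 1e9:.1f} GB")  # 2.6 GB
print(f"{w4a16_weight_bytes(4e9, 0.5) / 1e9:.1f} GB")  # 5.0 GB
```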

Quick start

vllm serve drawais/Nemotron-3-Nano-4B-W4A16 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --gpu-memory-utilization 0.9

--mamba_ssm_cache_dtype float32 is required for stable long-context inference: the Mamba state cache must stay in fp32 to avoid accumulating rounding drift in the recurrence.
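Once the server is up, vLLM exposes its OpenAI-compatible completions endpoint. A minimal stdlib-only client sketch; the localhost URL and port are vLLM's defaults, so adjust them to your deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # vLLM's default bind address

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    # temperature 0.0 gives greedy decoding.
    return {
        "model": "drawais/Nemotron-3-Nano-4B-W4A16",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def complete(prompt: str) -> str:
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```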

Smoke check

Sample greedy generations:

The capital of France is Paris.

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    ...

Translate to Spanish: 'I am very happy today.' → Estoy muy feliz hoy.

Perplexity (PPL) on a held-out text chunk: ~6.
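Perplexity here is the usual exponentiated mean negative log-likelihood over the held-out tokens. A minimal sketch of the computation from per-token log-probabilities; the sample values are illustrative, not from this evaluation:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning ~e^-1.8 probability to each token lands at PPL ~6.
print(perplexity([-1.8] * 100))
```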

Bench

Score on drawais/needle-1M-bench-mvp at 50K context:

split                               recall
paper-anchored (real arXiv facts)   5 / 5 = 100%
synthetic (coded strings)           3 / 5 = 60%
overall                             8 / 10 = 80%

Greedy decode, max_new_tokens=256, single sample per prompt.
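The per-split and overall numbers above reduce to simple counting. A sketch of scoring such needle-retrieval results; the record format is an assumption for illustration, not the benchmark's actual schema:

```python
from collections import defaultdict

def recall_by_split(results: list[dict]) -> dict[str, float]:
    """results: records like {"split": <name>, "correct": <bool>}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["split"]] += 1
        hits[r["split"]] += bool(r["correct"])
    # Overall recall pools every split before dividing.
    totals["overall"] = sum(v for k, v in totals.items() if k != "overall")
    hits["overall"] = sum(v for k, v in hits.items() if k != "overall")
    return {k: hits[k] / totals[k] for k in totals}

scores = recall_by_split(
    [{"split": "paper-anchored", "correct": True}] * 5
    + [{"split": "synthetic", "correct": True}] * 3
    + [{"split": "synthetic", "correct": False}] * 2
)
# paper-anchored 1.0, synthetic 0.6, overall 0.8
```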

Attribution

Licensed by NVIDIA Corporation under the NVIDIA Nemotron Model License.

This is a Derivative Work of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16. Source model copyright © NVIDIA Corporation.

License

NVIDIA Nemotron Open Model License. Permissive, commercially usable, derivatives allowed. Full text in LICENSE; required attribution in NOTICE.
