Nemotron-3-Nano-4B-W4A16

INT4 (W4A16) quantization of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 — NVIDIA's hybrid Mamba2 + Transformer 4B variant, slimmed from Nemotron-Nano-9B-v2 via Nemotron Elastic compression.

Targets the 8–12 GB consumer GPU audience. Native 256K context.

Footprint

Source params: 4B
Disk size: ~5.4 GB
Inference VRAM (greedy decode, no KV pressure): ~6 GB
Native context window: 256K tokens
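As a rough sanity check on the figures above, the weight footprint of a W4A16 checkpoint can be estimated from the parameter count. A minimal sketch; the quantized-fraction values are illustrative assumptions, not measurements of this checkpoint:

```python
def w4a16_weight_bytes(n_params: float, frac_int4: float) -> float:
    """Rough weight footprint: frac_int4 of params stored in 4-bit (0.5 bytes),
    the remainder kept in 16-bit (2 bytes)."""
    return n_params * frac_int4 * 0.5 + n_params * (1 - frac_int4) * 2

# If ~90% of weights were quantized, a 4B model would need ~2.6 GB of weights;
# the ~5.4 GB observed on disk suggests a larger share of layers kept in BF16
# (plus quantization scales and serialization overhead).
print(f"{w4a16_weight_bytes(4e9, 0.9) / 1e9:.1f} GB")  # 2.6 GB
print(f"{w4a16_weight_bytes(4e9, 0.5) / 1e9:.1f} GB")  # 5.0 GB
```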

Quick start

vllm serve drawais/Nemotron-3-Nano-4B-W4A16 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --gpu-memory-utilization 0.9

--mamba_ssm_cache_dtype float32 is required for stable long-context inference: the Mamba state cache must stay in fp32 to avoid accumulating rounding drift in the recurrence.
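Once the server is up, vLLM exposes its OpenAI-compatible completions endpoint. A minimal stdlib-only client sketch; the localhost URL and port are vLLM's defaults, so adjust them to your deployment:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # vLLM's default bind address

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    # temperature 0.0 gives greedy decoding.
    return {
        "model": "drawais/Nemotron-3-Nano-4B-W4A16",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def complete(prompt: str) -> str:
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```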

Smoke check

Sample greedy generations:

The capital of France is Paris.

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    ...

Translate to Spanish: 'I am very happy today.' → Estoy muy feliz hoy.

Perplexity (PPL) on a held-out text chunk: ~6.
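Perplexity here is the usual exponentiated mean negative log-likelihood over the held-out tokens. A minimal sketch of the computation from per-token log-probabilities; the sample values are illustrative, not from this evaluation:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning ~e^-1.8 probability to each token lands at PPL ~6.
print(perplexity([-1.8] * 100))
```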

Bench

Score on drawais/needle-1M-bench-mvp at 50K context:

split                               recall
paper-anchored (real arXiv facts)   5 / 5 = 100%
synthetic (coded strings)           3 / 5 = 60%
overall                             8 / 10 = 80%

Greedy decode, max_new_tokens=256, single sample per prompt.
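The per-split and overall numbers above reduce to simple counting. A sketch of scoring such needle-retrieval results; the record format is an assumption for illustration, not the benchmark's actual schema:

```python
from collections import defaultdict

def recall_by_split(results: list[dict]) -> dict[str, float]:
    """results: records like {"split": <name>, "correct": <bool>}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["split"]] += 1
        hits[r["split"]] += bool(r["correct"])
    # Overall recall pools every split before dividing.
    totals["overall"] = sum(v for k, v in totals.items() if k != "overall")
    hits["overall"] = sum(v for k, v in hits.items() if k != "overall")
    return {k: hits[k] / totals[k] for k in totals}

scores = recall_by_split(
    [{"split": "paper-anchored", "correct": True}] * 5
    + [{"split": "synthetic", "correct": True}] * 3
    + [{"split": "synthetic", "correct": False}] * 2
)
# paper-anchored 1.0, synthetic 0.6, overall 0.8
```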

Attribution

Licensed by NVIDIA Corporation under the NVIDIA Nemotron Model License.

This is a Derivative Work of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16. Source model copyright © NVIDIA Corporation.

License

NVIDIA Nemotron Open Model License. Permissive, commercially usable, derivatives allowed. Full text in LICENSE; required attribution in NOTICE.
