# Nemotron-3-Nano-4B-W4A16
INT4 (W4A16) quantization of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 — NVIDIA's hybrid Mamba2 + Transformer 4B variant, slimmed from Nemotron-Nano-9B-v2 via Nemotron Elastic compression.
Targets consumer GPUs with 8–12 GB of VRAM. Native 256K context.
## Footprint

| Metric | Value |
|---|---|
| Source params | 4B |
| Disk size | ~5.4 GB |
| Inference VRAM (greedy decode, no KV pressure) | ~6 GB |
| Native context window | 256K tokens |
## Quick start

```shell
vllm serve drawais/Nemotron-3-Nano-4B-W4A16 \
  --trust-remote-code \
  --mamba_ssm_cache_dtype float32 \
  --gpu-memory-utilization 0.9
```

`--mamba_ssm_cache_dtype float32` is required for stable long-context inference: the Mamba state cache must stay in fp32 to avoid accumulating recurrent rounding drift.
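Once the server is up, it exposes vLLM's OpenAI-compatible completions endpoint. A minimal stdlib-only client sketch (the host/port are vLLM's defaults; the prompt is illustrative):

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"  # default vllm serve address
MODEL = "drawais/Nemotron-3-Nano-4B-W4A16"

def build_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completion payload for the vLLM server."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decode, matching the smoke checks below
    }

def complete(prompt: str) -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Example (requires a running server):
#   complete("The capital of France is")
```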
## Smoke check

Sample greedy generations:

- "The capital of France is Paris."
- ```python
  def quicksort(arr):
      if len(arr) <= 1:
          return arr
      pivot = arr[len(arr) // 2]
      left = [x for x in arr if x < pivot]
      ...
  ```
- "Translate to Spanish: 'I am very happy today.'" → "Estoy muy feliz hoy."
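For reference, the truncated quicksort sample is the standard list-comprehension three-way partition. A completed sketch (this is the conventional ending, not the model's verbatim output):

```python
def quicksort(arr):
    # Base case: a list of 0 or 1 elements is already sorted.
    if len(arr) <= 1:
        return arr
    # Partition around a middle pivot into <, ==, > buckets.
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)
```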
PPL on a held-out chunk: ~6.
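The PPL figure is the exponential of the mean per-token negative log-likelihood. A minimal sketch of that computation from per-token log-probs (the input values below are made up for illustration):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood over tokens)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A uniform per-token logprob of -log(6) yields a PPL of exactly 6.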
## Bench

Recall on `drawais/needle-1M-bench-mvp` at 50K context:
| split | recall |
|---|---|
| paper-anchored (real arxiv facts) | 5 / 5 = 100% |
| synthetic (coded strings) | 3 / 5 = 60% |
| overall | 8 / 10 = 80.0% |
Greedy decode, `max_new_tokens=256`, single sample per prompt.
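A natural way to score a needle-in-a-haystack split is verbatim containment of the needle in the model's output; a sketch of that scoring (the pair layout is an assumption, not the benchmark's actual schema):

```python
def score_split(examples):
    """examples: list of (needle, model_output) pairs.
    A hit means the needle string appears verbatim in the output."""
    hits = sum(needle in output for needle, output in examples)
    return hits, len(examples)

# Hypothetical pairs; the table above reports 5/5 anchored, 3/5 synthetic.
```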
## Attribution

Licensed by NVIDIA Corporation under the NVIDIA Open Model License.
This is a Derivative Work of nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16. Source model copyright © NVIDIA Corporation.
## License

NVIDIA Open Model License. Permissive, commercially usable, derivatives allowed. Full text in `LICENSE`; required attribution in `NOTICE`.