Quantized NVIDIA-Nemotron-Nano-9B-v2
This repository contains the Q4 and Q5 quantized variants of NVIDIA-Nemotron-Nano-9B-v2, an LLM trained from scratch by NVIDIA for both reasoning and non-reasoning tasks. It can generate reasoning traces before providing a final answer, improving accuracy for complex prompts, though this can be disabled for faster responses. The model uses a hybrid architecture of Mamba-2, MLP, and four Attention layers, and was trained with Megatron-LM and NeMo-RL. It supports English, German, Spanish, French, Italian, and Japanese.
Model Overview
- Original Model: NVIDIA-Nemotron-Nano-9B-v2
- Base Model: NVIDIA-Nemotron-Nano-12B-v2-Base
- Variants: Instruct-tuned model
- Architecture: Mamba2-Transformer Hybrid
- Quantized Versions:
- Q4_K_M (4-bit quantization)
- Q5_K_M (5-bit quantization)
- Modalities: Text
- Developer: nvidia
- License: nvidia-open-model-license
- Languages: English, German, Spanish, French, Italian, and Japanese.
Quantization Details
Q4_K_M Version
- ~64% size reduction
- Lower memory footprint (~6.08 GB)
- Well-suited for deployment on edge devices or low-resource GPUs
- Minor performance degradation in highly complex reasoning scenarios
Q5_K_M Version
- ~63% size reduction
- Lower memory footprint (~6.58 GB)
- Better performance retention, recommended when quality is a priority
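The footprint figures above can be sanity-checked with simple arithmetic: file size ≈ parameter count × average bits per weight ÷ 8. A minimal sketch, assuming the commonly cited llama.cpp averages of roughly 4.85 bits/weight for Q4_K_M and 5.69 bits/weight for Q5_K_M (actual GGUF sizes also include metadata and vary per model):

```python
# Back-of-envelope GGUF size estimate. The bits-per-weight values are
# assumed averages for llama.cpp K-quants, not exact figures for this model.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q5_K_M": 5.69,  # assumed average bits/weight
    "Q4_K_M": 4.85,  # assumed average bits/weight
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Approximate file size in GB for n_params weights at a given quant."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

if __name__ == "__main__":
    n = 9e9  # ~9B parameters
    for quant in ("F16", "Q5_K_M", "Q4_K_M"):
        print(f"{quant}: ~{estimated_size_gb(n, quant):.2f} GB")
```

This is an estimate only; the published file sizes (~6.08 GB and ~6.58 GB) are the authoritative numbers.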
Key Features
- High inference throughput: up to ~6× higher throughput on long-context reasoning tasks (e.g., 8k input + 16k output) compared to similarly sized 9B models.
- Unified Reasoning & Completion: Designed for both reasoning and non-reasoning tasks, generating reasoning traces before final responses.
- Hybrid Architecture: Combines Mamba-2 and MLP layers with just four Attention layers for efficient processing.
- Extended Context Length: Supports up to 128K tokens, enabling long-context reasoning.
- Reasoning Budget Control: Allows specification of token limits during inference to manage reasoning depth.
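To illustrate reasoning budget control from the client side, here is a minimal sketch that truncates a streamed reasoning trace once a token budget is exhausted. The `<think>`/`</think>` tag names and the token-stream shape are assumptions for illustration, not the model's exact output format:

```python
# Illustrative client-side reasoning-budget cap. Tag names are assumed;
# consult the model's chat template for the actual reasoning delimiters.
def cap_reasoning(stream, budget, open_tag="<think>", close_tag="</think>"):
    """Pass tokens through, but cut the reasoning trace after `budget`
    tokens by emitting the close tag early and dropping the overflow."""
    out, in_think, used, truncated = [], False, 0, False
    for tok in stream:
        if tok == open_tag and not in_think:
            in_think, used = True, 0
            out.append(tok)
        elif tok == close_tag and in_think:
            in_think = False
            if not truncated:          # close tag already emitted if truncated
                out.append(tok)
            truncated = False
        elif in_think:
            if used < budget:
                out.append(tok)
                used += 1
                if used == budget:     # budget spent: close the trace now
                    out.append(close_tag)
                    truncated = True
            # else: drop reasoning tokens beyond the budget
        else:
            out.append(tok)            # answer tokens pass through untouched
    return out
```

A real deployment would instead set the budget at inference time (or disable reasoning entirely), but the control flow is the same idea: bound the trace, keep the final answer.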
Model Data
- The model is pretrained and instruction-tuned on a large and diverse mixture of text and code datasets, designed for efficiency, reasoning, and general-purpose language understanding, including:
- Massive web-scale text corpora covering diverse domains such as news, literature, technical writing, and dialogue.
- High-quality instruction-tuning data curated for reasoning, step-by-step problem solving, and chat-style interactions.
- Code and programming datasets spanning multiple languages to enhance logical and structured reasoning capabilities.
- Synthetic reasoning traces distilled from larger Nemotron and Minitron models to improve long-context performance.
- English-centric pretraining with selective multilingual coverage for generalization and robustness.
- ~20 trillion training tokens processed using NVIDIA’s FP8 training pipeline for optimal efficiency and scalability.
Recommended Use Cases
- Instruction following and chatbots: builds responsive conversational agents capable of multi-turn dialogue and contextual understanding.
- Complex reasoning and problem solving: excels at step-by-step logical reasoning, explanations, and analytical tasks.
- Code generation and understanding: supports multiple programming languages, aiding in code synthesis, debugging, and explanation.
- Educational and tutoring assistants: provides structured, step-by-step reasoning for math, science, and technical subjects.
- Research and experimentation: ideal for exploring hybrid Mamba-Transformer architectures and efficient inference design.
- Edge and production deployment: optimized for high throughput and reduced memory usage on single or multi-GPU setups.
Usage Example
Using llama.cpp for inference:
./llama-cli -hf SandLogicTechnologies/NVIDIA-Nemotron-Nano-9B-v2 -p "Provide me a script to print stars in a triangle shape in Python"
Acknowledgments
These quantized models are based on the original work by the NVIDIA development team.
Special thanks to:
The NVIDIA team for developing and releasing the NVIDIA-Nemotron-Nano-9B-v2 model.
Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.
Contact
For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.