Quantized NVIDIA-Nemotron-Nano-9B-v2
This repository contains the Q4 and Q5 quantized variants of NVIDIA-Nemotron-Nano-9B-v2, an LLM trained from scratch by NVIDIA for both reasoning and non-reasoning tasks. It can generate reasoning traces before providing a final answer, improving accuracy for complex prompts, though this can be disabled for faster responses. The model uses a hybrid architecture of Mamba-2, MLP, and four Attention layers, and was trained with Megatron-LM and NeMo-RL. It supports English, German, Spanish, French, Italian, and Japanese.
Model Overview
- Original Model: NVIDIA-Nemotron-Nano-9B-v2
- Base Model: NVIDIA-Nemotron-Nano-12B-v2-Base
- Variants: Instruct-tuned model
- Architecture: Mamba2-Transformer Hybrid
- Quantized Versions:
- Q4_K_M (4-bit quantization)
- Q5_K_M (5-bit quantization)
- Modalities: Text
- Developer: nvidia
- License: nvidia-open-model-license
- Languages: English, German, Spanish, French, Italian, and Japanese.
Quantization Details
Q4_K_M Version
- ~64% size reduction
- Lower memory footprint (~6.08 GB)
- Well-suited for deployment on edge devices or low-resource GPUs
- Minor performance degradation in highly complex reasoning scenarios
Q5_K_M Version
- ~63% size reduction
- Lower memory footprint (~6.58 GB)
- Better performance retention, recommended when quality is a priority
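The footprint figures above can be sanity-checked with simple arithmetic: file size ≈ parameter count × average bits per weight ÷ 8. A minimal sketch, assuming the commonly cited llama.cpp averages of roughly 4.85 bits/weight for Q4_K_M and 5.69 bits/weight for Q5_K_M (actual GGUF sizes also include metadata and vary per model):

```python
# Back-of-envelope GGUF size estimate. The bits-per-weight values are
# assumed averages for llama.cpp K-quants, not exact figures for this model.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q5_K_M": 5.69,  # assumed average bits/weight
    "Q4_K_M": 4.85,  # assumed average bits/weight
}

def estimated_size_gb(n_params: float, quant: str) -> float:
    """Approximate file size in GB for n_params weights at a given quant."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

if __name__ == "__main__":
    n = 9e9  # ~9B parameters
    for quant in ("F16", "Q5_K_M", "Q4_K_M"):
        print(f"{quant}: ~{estimated_size_gb(n, quant):.2f} GB")
```

This is an estimate only; the published file sizes (~6.08 GB and ~6.58 GB) are the authoritative numbers.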
Key Features
- High inference throughput: up to ~6× higher throughput on long-context reasoning tasks (e.g., 8k input + 16k output) compared to similarly sized 9B models.
- Unified Reasoning & Completion: Designed for both reasoning and non-reasoning tasks, generating reasoning traces before final responses.
- Hybrid Architecture: Combines Mamba-2 and MLP layers with just four Attention layers for efficient processing.
- Extended Context Length: Supports up to 128K tokens, enabling long-context reasoning.
- Reasoning Budget Control: Allows specification of token limits during inference to manage reasoning depth.
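To illustrate reasoning budget control from the client side, here is a minimal sketch that truncates a streamed reasoning trace once a token budget is exhausted. The `<think>`/`</think>` tag names and the token-stream shape are assumptions for illustration, not the model's exact output format:

```python
# Illustrative client-side reasoning-budget cap. Tag names are assumed;
# consult the model's chat template for the actual reasoning delimiters.
def cap_reasoning(stream, budget, open_tag="<think>", close_tag="</think>"):
    """Pass tokens through, but cut the reasoning trace after `budget`
    tokens by emitting the close tag early and dropping the overflow."""
    out, in_think, used, truncated = [], False, 0, False
    for tok in stream:
        if tok == open_tag and not in_think:
            in_think, used = True, 0
            out.append(tok)
        elif tok == close_tag and in_think:
            in_think = False
            if not truncated:          # close tag already emitted if truncated
                out.append(tok)
            truncated = False
        elif in_think:
            if used < budget:
                out.append(tok)
                used += 1
                if used == budget:     # budget spent: close the trace now
                    out.append(close_tag)
                    truncated = True
            # else: drop reasoning tokens beyond the budget
        else:
            out.append(tok)            # answer tokens pass through untouched
    return out
```

A real deployment would instead set the budget at inference time (or disable reasoning entirely), but the control flow is the same idea: bound the trace, keep the final answer.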
Model Data
- The model is pretrained and instruction-tuned on a large and diverse mixture of text and code datasets, designed for efficiency, reasoning, and general-purpose language understanding, including:
- Massive web-scale text corpora covering diverse domains such as news, literature, technical writing, and dialogue.
- High-quality instruction-tuning data curated for reasoning, step-by-step problem solving, and chat-style interactions.
- Code and programming datasets spanning multiple languages to enhance logical and structured reasoning capabilities.
- Synthetic reasoning traces distilled from larger Nemotron and Minitron models to improve long-context performance.
- English-centric pretraining with selective multilingual coverage for generalization and robustness.
- ~20 trillion training tokens processed using NVIDIA’s FP8 training pipeline for optimal efficiency and scalability.
Recommended Use Cases
- Instruction following and chatbots: builds responsive conversational agents capable of multi-turn dialogue and contextual understanding.
- Complex reasoning and problem solving: excels at step-by-step logical reasoning, explanations, and analytical tasks.
- Code generation and understanding: supports multiple programming languages, aiding in code synthesis, debugging, and explanation.
- Educational and tutoring assistants: provides structured, step-by-step reasoning for math, science, and technical subjects.
- Research and experimentation: ideal for exploring hybrid Mamba-Transformer architectures and efficient inference design.
- Edge and production deployment: optimized for high throughput and reduced memory usage on single or multi-GPU setups.
Usage Example
Using llama.cpp for inference:
./llama-cli -hf SandLogicTechnologies/NVIDIA-Nemotron-Nano-9B-v2 -p "Provide me a script to print stars in a triangle shape in Python"
Acknowledgments
These quantized models are based on the original work by the NVIDIA development team.
Special thanks to:
The NVIDIA team for developing and releasing the NVIDIA-Nemotron-Nano-9B-v2 model.
Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.
Contact
For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.