Quantized NVIDIA-Nemotron-Nano-9B-v2

This repository contains the Q4 and Q5 quantized variants of NVIDIA-Nemotron-Nano-9B-v2, an LLM trained from scratch by NVIDIA for both reasoning and non-reasoning tasks. It can generate a reasoning trace before providing a final answer, improving accuracy on complex prompts, though this can be disabled for faster responses. The model uses a hybrid architecture of Mamba-2, MLP, and four Attention layers, and was trained with Megatron-LM and NeMo-RL. It supports English, German, Spanish, French, Italian, and Japanese.

Model Overview

  • Original Model: NVIDIA-Nemotron-Nano-9B-v2
  • Base Model: NVIDIA-Nemotron-Nano-12B-v2-Base
  • Variants: Instruct-tuned text model
  • Architecture: Mamba2-Transformer Hybrid
  • Quantized Versions:
    • Q4_K_M (4-bit quantization)
    • Q5_K_M (5-bit quantization)
  • Modalities: Text
  • Developer: nvidia
  • License: nvidia-open-model-license
  • Languages: English, German, Spanish, French, Italian, and Japanese.

Quantization Details

Q4_K_M Version

  • Approximately 64% size reduction
  • Lower memory footprint (~6.08 GB)
  • Well-suited for deployment on edge devices or low-resource GPUs
  • Minor performance degradation in highly complex reasoning scenarios

Q5_K_M Version

  • Approximately 63% size reduction
  • Lower memory footprint (~6.58 GB)
  • Better performance retention, recommended when quality is a priority
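The memory figures above can be sanity-checked with a back-of-the-envelope calculation from parameter count and average bits per weight. This is a rough sketch: the bits-per-weight values (~4.85 for Q4_K_M, ~5.69 for Q5_K_M) are approximate llama.cpp averages, not exact figures for this model, and the estimate excludes GGUF metadata and runtime KV-cache overhead, so actual footprints will differ somewhat.

```python
# Rough GGUF file-size estimate: params * bits-per-weight / 8 bits-per-byte.
# The bpw values below are approximate llama.cpp averages for these quant
# types, not measured values for this specific model.

def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated weight-storage size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

q4 = gguf_size_gb(9e9, 4.85)  # Q4_K_M, ~4.85 bpw on average
q5 = gguf_size_gb(9e9, 5.69)  # Q5_K_M, ~5.69 bpw on average
print(f"Q4_K_M ~{q4:.2f} GB, Q5_K_M ~{q5:.2f} GB")
```

The Q5_K_M variant trades roughly one extra gigabyte for better quality retention, which matches the recommendation above.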

Key Features

  • High inference throughput: up to ~6× faster reasoning performance on long-context tasks (e.g., 8k input + 16k output) compared to similar 9B models.
  • Unified Reasoning & Completion: Designed for both reasoning and non-reasoning tasks, generating reasoning traces before final responses.
  • Hybrid Architecture: Combines Mamba-2 and MLP layers with just four Attention layers for efficient processing.
  • Extended Context Length: Supports up to 128K tokens, enabling long-context reasoning.
  • Reasoning Budget Control: Allows specification of token limits during inference to manage reasoning depth.
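The reasoning behavior described above is controlled through the system prompt. A minimal sketch, assuming the "/think" and "/no_think" system-prompt controls documented by NVIDIA for Nemotron-Nano-v2; this only constructs the chat messages, which you would then pass to any chat-completion backend (for example, llama-cpp-python's create_chat_completion):

```python
# Sketch: toggling reasoning traces via the system prompt.
# "/think" and "/no_think" are the control strings documented by NVIDIA
# for Nemotron-Nano-v2; verify against the upstream model card before use.

def build_messages(user_prompt: str, reasoning: bool = True) -> list:
    """Build a chat message list that enables or disables reasoning."""
    system = "/think" if reasoning else "/no_think"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Disable reasoning for a quick factual answer:
msgs = build_messages("Why is the sky blue?", reasoning=False)
print(msgs[0]["content"])
```

Disabling reasoning skips the trace generation step, which reduces latency at some cost in accuracy on complex prompts.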

Model Data

  • The model is pretrained and instruction-tuned on a large, diverse mixture of text and code datasets designed for efficiency, reasoning, and general-purpose language understanding, including:
    • Massive web-scale text corpora covering diverse domains such as news, literature, technical writing, and dialogue.
    • High-quality instruction-tuning data curated for reasoning, step-by-step problem solving, and chat-style interactions.
    • Code and programming datasets spanning multiple languages to enhance logical and structured reasoning capabilities.
    • Synthetic reasoning traces distilled from larger Nemotron and Minitron models to improve long-context performance.
    • English-centric pretraining with selective multilingual coverage for generalization and robustness.
    • ~20 trillion training tokens processed using NVIDIA’s FP8 training pipeline for optimal efficiency and scalability.

Recommended Use Cases

  • Instruction following and chatbots: powers responsive conversational agents capable of multi-turn dialogue and contextual understanding.
  • Complex reasoning and problem solving: excels at step-by-step logical reasoning, explanations, and analytical tasks.
  • Code generation and understanding: supports multiple programming languages, aiding in code synthesis, debugging, and explanation.
  • Educational and tutoring assistants: provides structured, step-by-step reasoning for math, science, and technical subjects.
  • Research and experimentation: ideal for exploring hybrid Mamba-Transformer architectures and efficient inference design.
  • Edge and production deployment: optimized for high throughput and reduced memory usage on single or multi-GPU setups.

Usage Example

Using llama.cpp for inference:

./llama-cli -hf SandLogicTechnologies/NVIDIA-Nemotron-Nano-9B-v2 -p "Provide a Python script that prints stars in a triangle shape"

Acknowledgments

These quantized models are based on the original work by the NVIDIA development team.

Special thanks to:

  • The NVIDIA team for developing and releasing the NVIDIA-Nemotron-Nano-9B-v2 model.

  • Georgi Gerganov and the entire llama.cpp open-source community for enabling efficient model quantization and inference via the GGUF format.


Contact

For any inquiries or support, please contact us at support@sandlogic.com or visit our Website.
