NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
Paper: arXiv:2508.05835
ONNX model for nemo-nano-codec-22khz-1.78kbps-12.5fps
The codec uses a two-stage architecture: an encoder that compresses audio into discrete tokens via vector quantization, and a decoder that reconstructs audio from these tokens.
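The figures in the model name imply the codec's frame geometry. A quick sanity check of that arithmetic (sample rate and frame rate are taken from the model name, not read out of the ONNX graphs):

```python
# Frame-size arithmetic implied by the model name:
# 22050 Hz sample rate, 12.5 codec frames per second.
sample_rate = 22050      # Hz
frame_rate = 12.5        # frames per second

samples_per_frame = int(sample_rate / frame_rate)
print(samples_per_frame)  # 1764 samples of audio per discrete-token frame
```

This matches the 1764-sample padding granularity the encoder requires.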
This model must not be used for:
```bash
pip install onnxruntime numpy
```
Input audio must be pre-padded to complete frames (multiples of 1764 samples); the helper function below handles this.
```python
import numpy as np
import onnxruntime as ort

# Load both models
encoder = ort.InferenceSession('encoder.onnx')
decoder = ort.InferenceSession('decoder.onnx')

# Helper function for padding
def pad_audio_to_frames(audio, samples_per_frame=1764):
    """Pad audio to complete frames (required by the encoder)."""
    current_len = audio.shape[-1]
    num_frames = (current_len + samples_per_frame - 1) // samples_per_frame
    padded_len = num_frames * samples_per_frame
    padding_needed = padded_len - current_len
    if padding_needed > 0:
        audio = np.pad(audio, ((0, 0), (0, padding_needed)), mode='constant')
    return audio, current_len

# Load your audio (must be 22050 Hz, mono)
original_audio = np.random.randn(1, 88200).astype(np.float32)  # 4-second example

# 1. Pad audio to complete frames
padded_audio, original_len = pad_audio_to_frames(original_audio)
audio_len = np.array([padded_audio.shape[-1]], dtype=np.int64)

# 2. Encode: audio -> tokens
tokens, tokens_len = encoder.run(None, {
    'audio': padded_audio,
    'audio_len': audio_len
})

# 3. Decode: tokens -> audio
reconstructed_audio, reconstructed_len = decoder.run(None, {
    'tokens': tokens,
    'tokens_len': tokens_len
})

# 4. Trim to original length
reconstructed_audio = reconstructed_audio[:, :original_len]
```
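To listen to the result, the decoded float32 waveform can be written to a 16-bit WAV file with Python's standard `wave` module. This is a minimal sketch: `reconstructed_audio` here is a dummy array standing in for the decoder output above, so the snippet runs on its own.

```python
import wave
import numpy as np

# Dummy stand-in for the decoder output (shape: [1, num_samples], float32)
reconstructed_audio = np.random.randn(1, 88200).astype(np.float32) * 0.1

# Clip to [-1, 1] and convert to 16-bit PCM
pcm = np.clip(reconstructed_audio[0], -1.0, 1.0)
pcm = (pcm * 32767).astype(np.int16)

with wave.open('reconstructed.wav', 'wb') as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 16-bit samples
    f.setframerate(22050)    # codec sample rate
    f.writeframes(pcm.tobytes())
```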
This derivative work is distributed under a dual license:
Original Model (NVIDIA):
Modifications: