NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference
Paper: arXiv:2508.05835
ONNX model for nemo-nano-codec-22khz-1.78kbps-12.5fps
The codec uses a two-stage architecture: an encoder that compresses audio into discrete tokens via vector quantization, and a decoder that reconstructs audio from these tokens.
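The figures in the model name imply the codec's frame geometry. A quick sanity check of that arithmetic (sample rate and frame rate are taken from the model name, not read out of the ONNX graphs):

```python
# Frame-size arithmetic implied by the model name:
# 22050 Hz sample rate, 12.5 codec frames per second.
sample_rate = 22050      # Hz
frame_rate = 12.5        # frames per second

samples_per_frame = int(sample_rate / frame_rate)
print(samples_per_frame)  # 1764 samples of audio per discrete-token frame
```

This matches the 1764-sample padding granularity the encoder requires.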
This model must not be used for:
```bash
pip install onnxruntime numpy
```
Input audio must be pre-padded to complete frames (multiples of 1764 samples); the helper function below handles this.
```python
import numpy as np
import onnxruntime as ort

# Load both models
encoder = ort.InferenceSession('encoder.onnx')
decoder = ort.InferenceSession('decoder.onnx')

# Helper function for padding
def pad_audio_to_frames(audio, samples_per_frame=1764):
    """Pad audio to complete frames (required by the encoder)."""
    current_len = audio.shape[-1]
    num_frames = (current_len + samples_per_frame - 1) // samples_per_frame
    padded_len = num_frames * samples_per_frame
    padding_needed = padded_len - current_len
    if padding_needed > 0:
        audio = np.pad(audio, ((0, 0), (0, padding_needed)), mode='constant')
    return audio, current_len

# Load your audio (must be 22050 Hz, mono)
original_audio = np.random.randn(1, 88200).astype(np.float32)  # 4-second example

# 1. Pad audio to complete frames
padded_audio, original_len = pad_audio_to_frames(original_audio)
audio_len = np.array([padded_audio.shape[-1]], dtype=np.int64)

# 2. Encode: audio -> tokens
tokens, tokens_len = encoder.run(None, {
    'audio': padded_audio,
    'audio_len': audio_len
})

# 3. Decode: tokens -> audio
reconstructed_audio, reconstructed_len = decoder.run(None, {
    'tokens': tokens,
    'tokens_len': tokens_len
})

# 4. Trim to original length
reconstructed_audio = reconstructed_audio[:, :original_len]
```
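To listen to the result, the decoded float32 waveform can be written to a 16-bit WAV file with Python's standard `wave` module. This is a minimal sketch: `reconstructed_audio` here is a dummy array standing in for the decoder output above, so the snippet runs on its own.

```python
import wave
import numpy as np

# Dummy stand-in for the decoder output (shape: [1, num_samples], float32)
reconstructed_audio = np.random.randn(1, 88200).astype(np.float32) * 0.1

# Clip to [-1, 1] and convert to 16-bit PCM
pcm = np.clip(reconstructed_audio[0], -1.0, 1.0)
pcm = (pcm * 32767).astype(np.int16)

with wave.open('reconstructed.wav', 'wb') as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 16-bit samples
    f.setframerate(22050)    # codec sample rate
    f.writeframes(pcm.tobytes())
```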
This derivative work is distributed under a dual license:
Original Model (NVIDIA):
Modifications: