Nemotron Speech 600M - ONNX (Streaming)

ONNX export of nvidia/nemotron-speech-streaming-en-0.6b, a 600M-parameter streaming ASR model using FastConformer encoder + RNNT decoder.

Exported directly from the original NeMo checkpoint (not from a third-party conversion) to ensure numerical fidelity.

Apple Silicon users: use the native CoreML version, danielbodart/nemotron-speech-600m-coreml, instead. It runs 93% of encoder ops on the Apple Neural Engine at ~15.5x realtime, whereas ONNX Runtime on macOS executes on CPU only.

Available Precisions

| Variant | Encoder Weights | Decoder Weights | Total | Target Hardware | Execution Provider | Notes |
|---|---|---|---|---|---|---|
| fp32/ | 2.3 GB | 34 MB | 2.4 GB | Any | CPU, CUDA | Original precision |
| fp16/ | 1.2 GB | 17 MB | 1.2 GB | NVIDIA GPU | CPU, CUDA | Recommended for quality-first GPU inference |
| int8-dynamic/ | 799 MB | 34 MB | 876 MB | Intel CPU (VNNI/AMX) | CPU only | Dynamic quantization, MatMul weights only |
| int8-static/ | 799 MB | 34 MB | 876 MB | NVIDIA GPU, Intel CPU | CPU, CUDA | QDQ format, ~45% less VRAM than FP16 on GPU |

Choosing a variant

  • NVIDIA GPU (memory constrained): int8-static/ loads on the CUDA EP and uses ~1.3 GB VRAM vs ~2.4 GB for FP16
  • NVIDIA GPU (quality first): fp16/ is marginally better on edge cases with repetitive content
  • Intel CPU: int8-dynamic/ or int8-static/ both use VNNI/AMX integer instructions with similar performance
  • Apple Silicon: use the CoreML version instead (93% ANE, ~15.5x realtime); ONNX Runtime on macOS runs CPU-only.

INT8 Static quantization details

The static variant uses QDQ (QuantizeLinear/DequantizeLinear) nodes with calibrated activation scales. Only MatMul ops are quantized (as in the dynamic variant), but activation scales are pre-computed using warm-cache streaming calibration: audio chunks are fed sequentially through the FP32 model with encoder cache state carried forward, so activation ranges reflect real streaming inference conditions. This avoids the blank-token collapse that breaks naive static quantization of streaming models with persistent cache state.
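A minimal sketch of such a calibration reader, assuming NumPy feeds and typical NeMo export input names (`audio_signal` and `length` are assumptions; the cache names and shapes come from the Encoder Streaming table below). It implements the `get_next()` interface that `onnxruntime.quantization.CalibrationDataReader` expects:

```python
import numpy as np

class WarmCacheReader:
    """Calibration data reader for static quantization of the streaming
    encoder. Chunks are fed sequentially and encoder cache state is carried
    forward between them, so calibrated activation ranges reflect warm-cache
    streaming conditions rather than cold-start ones."""

    def __init__(self, fp32_session, mel_chunks):
        # fp32_session: ONNX Runtime session for the FP32 encoder (assumed API)
        # mel_chunks: iterable of [1, 128, 65] float32 chunks (56 + 9 cache frames)
        self.session = fp32_session
        self.chunks = iter(mel_chunks)
        # Zero-initialized caches, shapes from the Encoder Streaming table
        self.cache_ch = np.zeros((1, 24, 70, 1024), dtype=np.float32)
        self.cache_t = np.zeros((1, 24, 1024, 8), dtype=np.float32)
        self.cache_len = np.zeros((1,), dtype=np.int64)

    def get_next(self):
        chunk = next(self.chunks, None)
        if chunk is None:
            return None  # calibration finished
        feed = {
            "audio_signal": chunk,
            "length": np.array([chunk.shape[2]], dtype=np.int64),
            "cache_last_channel": self.cache_ch,
            "cache_last_time": self.cache_t,
            "cache_last_channel_len": self.cache_len,
        }
        # Advance cache state through the FP32 model so the next chunk is
        # calibrated against realistic (warm) cache activations; the output
        # ordering (caches last) is an assumption about the exported graph.
        outs = self.session.run(None, feed)
        self.cache_ch, self.cache_t, self.cache_len = outs[-3], outs[-2], outs[-1]
        return feed
```

An instance of this class would be passed as the `calibration_data_reader` argument of `onnxruntime.quantization.quantize_static`.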

Model Architecture

  • Encoder: Cache-aware streaming FastConformer (24 layers, 1024 dim)
  • Decoder: RNNT with LSTM prediction network (2 layers, 640 hidden)
  • Vocabulary: 1025 SentencePiece tokens (1024 + blank)
  • Audio input: 128-channel mel spectrogram, 16kHz, 560ms chunks (56 mel frames + 9 cache frames)

Runtime Configuration

All parameters needed to run the model are documented in config.json. Key values:

Audio Preprocessing

| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Sample format | S16_LE (16-bit signed little-endian) |
| Pre-emphasis | 0.97 |
| FFT size | 512 |
| Hop length | 160 samples (10ms) |
| Window length | 400 samples (25ms) |
| Window type | Hann |
| Mel bands | 128 |
| Mel norm | Slaney |
| Mel layout | Band-major [n_mels, n_frames] (not frame-major) |

Mel filterbank weights are provided in shared/filterbank.bin (shape [1, 128, 257]).
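For illustration, the preprocessing parameters above can be turned into a NumPy sketch along these lines. This is not the exact NeMo preprocessor: the symmetric Hann window, power spectrum, and log floor are assumptions, so validate against the FP32 reference. The filterbank could be loaded with `np.fromfile("shared/filterbank.bin", dtype=np.float32).reshape(1, 128, 257)` (dtype assumed):

```python
import numpy as np

def log_mel_chunk(samples, filterbank, preemph=0.97, n_fft=512, hop=160, win=400):
    """Compute a band-major [n_mels, n_frames] log-mel spectrogram from
    16 kHz S16_LE samples, using the parameters in the table above.
    filterbank: [1, 128, 257] mel weights from shared/filterbank.bin."""
    x = samples.astype(np.float32) / 32768.0          # S16_LE -> float in [-1, 1)
    x = np.append(x[0], x[1:] - preemph * x[:-1])     # pre-emphasis filter
    window = np.hanning(win)                          # Hann (symmetric; assumption)
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win] * window
        spec = np.fft.rfft(frame, n=n_fft)            # 512-point FFT -> 257 bins
        frames.append(np.abs(spec) ** 2)              # power spectrum (assumption)
    power = np.stack(frames, axis=1)                  # [257, n_frames]
    mel = filterbank[0] @ power                       # [128, n_frames], band-major
    return np.log(mel + 1e-9)                         # log floor is an assumption
```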

Encoder Streaming

| Parameter | Value |
|---|---|
| Chunk size | 56 mel frames (560ms audio) |
| Pre-encode cache | 9 mel frames (prepended from previous chunk) |
| Total input frames | 65 per chunk (56 + 9) |
| Layers | 24 |
| Dimension | 1024 |
| cache_last_channel shape | [1, 24, 70, 1024] (init zeros) |
| cache_last_time shape | [1, 24, 1024, 8] (init zeros) |
| cache_last_channel_len | [1] int64 (init zero) |

Feed cache outputs back as next chunk's cache inputs.
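A minimal streaming loop over a band-major mel spectrogram, using the cache shapes above. The input names (`audio_signal`, `length`) and the output ordering are assumptions from typical NeMo exports; check `session.get_inputs()` and `session.get_outputs()` against the actual graph:

```python
import numpy as np

CHUNK, CACHE_FRAMES, N_MELS = 56, 9, 128

def stream_encoder(session, mel):
    """Run the streaming encoder over a band-major [128, n_frames] mel
    spectrogram, feeding cache outputs back as the next chunk's cache inputs."""
    cache_ch = np.zeros((1, 24, 70, 1024), dtype=np.float32)
    cache_t = np.zeros((1, 24, 1024, 8), dtype=np.float32)
    cache_len = np.zeros((1,), dtype=np.int64)
    # 9-frame pre-encode cache; zero-padded before the first chunk (assumption)
    prev_tail = np.zeros((N_MELS, CACHE_FRAMES), dtype=np.float32)
    outputs = []
    for start in range(0, mel.shape[1] - CHUNK + 1, CHUNK):
        chunk = mel[:, start:start + CHUNK]
        signal = np.concatenate([prev_tail, chunk], axis=1)[None]  # [1, 128, 65]
        prev_tail = chunk[:, -CACHE_FRAMES:]
        enc, enc_len, cache_ch, cache_t, cache_len = session.run(None, {
            "audio_signal": signal.astype(np.float32),
            "length": np.array([signal.shape[2]], dtype=np.int64),
            "cache_last_channel": cache_ch,
            "cache_last_time": cache_t,
            "cache_last_channel_len": cache_len,
        })
        outputs.append(enc)
    return np.concatenate(outputs, axis=-1) if outputs else None
```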

RNNT Decoder

| Parameter | Value |
|---|---|
| Blank token ID | 1024 |
| Vocab size | 1025 (1024 tokens + blank) |
| Max symbols per frame | 10 |
| Prediction layers | 2 (LSTM) |
| Prediction hidden | 640 |
| input_states_1/2 shape | [2, 1, 640] float32 (init zeros) |

For each encoder output frame, feed the single frame [1, 1024, 1] to the decoder and take the argmax of the logits over the 1025-entry vocabulary. If the result is not blank, emit the token and decode again on the same frame (up to 10 symbols); if it is blank, advance to the next frame. Feed the decoder state outputs back as inputs for the next symbol or frame.
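That greedy loop can be sketched as follows, assuming the decoder graph takes the encoder frame, the last emitted token, and the two LSTM states, and returns joint logits plus updated states. The feed names `encoder_outputs` and `targets` are assumptions (some exports also require a target-length input); only `input_states_1/2` come from the table above:

```python
import numpy as np

BLANK_ID, MAX_SYMBOLS = 1024, 10

def greedy_rnnt_decode(decoder, encoder_out):
    """Greedy RNNT decoding over encoder output [1, 1024, T]."""
    s1 = np.zeros((2, 1, 640), dtype=np.float32)   # LSTM states, init zeros
    s2 = np.zeros((2, 1, 640), dtype=np.float32)
    last_token = np.array([[BLANK_ID]], dtype=np.int32)  # blank as start token
    tokens = []
    for t in range(encoder_out.shape[2]):
        frame = encoder_out[:, :, t:t + 1]          # single frame [1, 1024, 1]
        for _ in range(MAX_SYMBOLS):                # at most 10 symbols per frame
            logits, new_s1, new_s2 = decoder.run(None, {
                "encoder_outputs": frame,
                "targets": last_token,
                "input_states_1": s1,
                "input_states_2": s2,
            })
            token = int(np.argmax(logits))          # argmax over 1025 vocab
            if token == BLANK_ID:
                break                               # blank: advance to next frame
            tokens.append(token)                    # non-blank: emit and loop
            last_token = np.array([[token]], dtype=np.int32)
            s1, s2 = new_s1, new_s2                 # carry decoder state forward
    return tokens
```

The emitted token IDs map to SentencePiece pieces via shared/tokens.txt.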

Files

config.json                 # Machine-readable runtime parameters (all values above)

shared/
├── filterbank.bin          # Mel filterbank weights [1, 128, 257] Slaney norm
├── filterbank.meta         # Shape metadata
├── tokens.txt              # SentencePiece vocabulary (1024 tokens)
└── preprocessor.config     # Feature extraction parameters

fp32/                       # FP32 precision
├── encoder_model.onnx      # Streaming encoder graph
├── encoder_model.onnx.data # Encoder weights
├── decoder_model.onnx      # RNNT decoder+joint graph
└── decoder_model.onnx.data # Decoder weights

fp16/                       # FP16 precision (same structure)
├── encoder_model.onnx
├── encoder_model.onnx.data
├── decoder_model.onnx
└── decoder_model.onnx.data

int8-dynamic/               # INT8 dynamic quantization
├── encoder_model.onnx      # MatMulInteger ops, CPU execution provider only
├── encoder_model.onnx.data
├── decoder_model.onnx      # Decoder stays FP32
└── decoder_model.onnx.data

int8-static/                # INT8 static quantization (QDQ)
├── encoder_model.onnx      # QDQ nodes, works on CPU and CUDA execution providers
├── encoder_model.onnx.data
├── decoder_model.onnx      # Decoder stays FP32
└── decoder_model.onnx.data

Usage

Download a specific precision:

# INT8 Static (recommended for GPU with memory constraints)
hf download danielbodart/nemotron-speech-600m-onnx int8-static/ shared/ config.json --local-dir ./model

# FP16 (recommended for GPU, quality first)
hf download danielbodart/nemotron-speech-600m-onnx fp16/ shared/ config.json --local-dir ./model

# INT8 Dynamic (recommended for Intel CPU)
hf download danielbodart/nemotron-speech-600m-onnx int8-dynamic/ shared/ config.json --local-dir ./model

# FP32 (reference)
hf download danielbodart/nemotron-speech-600m-onnx fp32/ shared/ config.json --local-dir ./model

Load with ONNX Runtime:

import onnxruntime as ort

# GPU inference: INT8 Static (less VRAM) or FP16 (slightly better quality)
encoder = ort.InferenceSession("model/int8-static/encoder_model.onnx",
                                providers=["CUDAExecutionProvider"])
decoder = ort.InferenceSession("model/int8-static/decoder_model.onnx",
                                providers=["CUDAExecutionProvider"])

# CPU inference (INT8 Dynamic recommended)
encoder = ort.InferenceSession("model/int8-dynamic/encoder_model.onnx",
                                providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession("model/int8-dynamic/decoder_model.onnx",
                                providers=["CPUExecutionProvider"])

Export Reproducibility

All export, conversion, quantization, and validation scripts are in the companion GitHub repo: danielbodart/nemotron-speech-600m-onnx

  • nemo_export_onnx.py - Export from the original NeMo checkpoint
  • onnx_fp16_convert.py - FP32 → FP16 conversion with I/O Cast wrappers
  • onnx_fp16_validate.py - Numerical validation
  • onnx_int8_quantize.py - INT8 quantization (dynamic and static modes)
  • onnx_int8_calibration.py - Warm-cache streaming calibration data reader
  • onnx_int8_validate.py - INT8 streaming validation
  • onnx_package_for_hf.py - Consolidate and package for Hub upload

License

The original model is licensed under CC-BY-4.0 by NVIDIA.
