# Nemotron Speech 600M ONNX (Streaming)
ONNX export of `nvidia/nemotron-speech-streaming-en-0.6b`, a 600M-parameter streaming ASR model using a FastConformer encoder and an RNNT decoder.

Exported directly from the original NeMo checkpoint (not from a third-party conversion) to ensure numerical fidelity.

**Apple Silicon users:** use the native CoreML version instead: `danielbodart/nemotron-speech-600m-coreml`. It runs 93% of encoder ops on the Apple Neural Engine at ~15.5x realtime, whereas ONNX Runtime on macOS runs CPU-only.
## Available Precisions

| Variant | Encoder Weights | Decoder Weights | Total | Target Hardware | Execution Provider | Notes |
|---|---|---|---|---|---|---|
| `fp32/` | 2.3 GB | 34 MB | 2.4 GB | Any | CPU, CUDA | Original precision |
| `fp16/` | 1.2 GB | 17 MB | 1.2 GB | NVIDIA GPU | CPU, CUDA | Recommended for quality-first GPU inference |
| `int8-dynamic/` | 799 MB | 34 MB | 876 MB | Intel CPU (VNNI/AMX) | CPU only | Dynamic quantization, MatMul weights only |
| `int8-static/` | 799 MB | 34 MB | 876 MB | NVIDIA GPU, Intel CPU | CPU, CUDA | QDQ format, ~45% less VRAM than FP16 on GPU |
### Choosing a variant

- **NVIDIA GPU (memory constrained):** `int8-static/` loads on the CUDA EP and uses ~1.3 GB VRAM vs ~2.4 GB for FP16
- **NVIDIA GPU (quality first):** `fp16/` is marginally better on edge cases with repetitive content
- **Intel CPU:** `int8-dynamic/` or `int8-static/`; both use VNNI/AMX integer instructions with similar performance
- **Apple Silicon:** use the CoreML version instead (93% ANE, ~15.5x realtime); ONNX Runtime on macOS runs CPU-only
## INT8 static quantization details

The static variant uses QDQ (QuantizeLinear/DequantizeLinear) nodes with calibrated activation scales. Only MatMul ops are quantized (same as dynamic), but activation scales are pre-computed using warm-cache streaming calibration: audio chunks are fed sequentially through the FP32 model, carrying encoder cache state forward, so activation ranges reflect real streaming inference conditions. This avoids the blank-token collapse that breaks naive static quantization of streaming models with persistent cache state.
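As a rough illustration, a warm-cache calibration reader can be shaped like ONNX Runtime's `CalibrationDataReader` (`quantize_static` drives calibration by calling `get_next()` until it returns `None`). The input names and the `run_fp32` wrapper below are assumptions for illustration; the actual implementation lives in `onnx_int8_calibration.py` in the companion repo.

```python
import numpy as np

class WarmCacheCalibrationReader:
    """Yields calibration feeds while carrying encoder cache state forward.

    `run_fp32(feeds) -> (encoded, cache_channel, cache_time, cache_len)` is an
    assumed wrapper around the FP32 encoder session; input names are assumed too.
    """

    def __init__(self, mel_chunks, run_fp32):
        self._chunks = iter(mel_chunks)   # each chunk: [1, 128, 65] incl. pre-encode frames
        self._run_fp32 = run_fp32
        # Cache shapes documented in the Encoder Streaming table, initialized to zeros
        self._cc = np.zeros((1, 24, 70, 1024), dtype=np.float32)
        self._ct = np.zeros((1, 24, 1024, 8), dtype=np.float32)
        self._cl = np.zeros((1,), dtype=np.int64)

    def get_next(self):
        chunk = next(self._chunks, None)
        if chunk is None:
            return None                   # calibration finished
        feeds = {"audio_signal": chunk,
                 "cache_last_channel": self._cc,
                 "cache_last_time": self._ct,
                 "cache_last_channel_len": self._cl}
        # Run the FP32 model on the same feed so the NEXT feed sees a warm cache,
        # matching real streaming conditions instead of all-zero cache state.
        _, self._cc, self._ct, self._cl = self._run_fp32(feeds)
        return feeds
```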
## Model Architecture

- Encoder: cache-aware streaming FastConformer (24 layers, 1024 dim)
- Decoder: RNNT with LSTM prediction network (2 layers, 640 hidden)
- Vocabulary: 1025 SentencePiece tokens (1024 + blank)
- Audio input: 128-channel mel spectrogram, 16 kHz, 560 ms chunks (56 mel frames + 9 pre-encode cache frames)
## Runtime Configuration

All parameters needed to run the model are documented in `config.json`. Key values:
### Audio Preprocessing
| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Sample format | S16_LE (16-bit signed little-endian) |
| Pre-emphasis | 0.97 |
| FFT size | 512 |
| Hop length | 160 samples (10ms) |
| Window length | 400 samples (25ms) |
| Window type | Hann |
| Mel bands | 128 |
| Mel norm | Slaney |
| Mel layout | Band-major [n_mels, n_frames] (not frame-major) |
Mel filterbank weights are provided in `shared/filterbank.bin` (shape `[1, 128, 257]`).
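The parameters above can be combined into a minimal NumPy sketch of the feature extraction. NeMo's exact normalization, dithering, and padding behavior may differ; treat this as a reference for the table, not a drop-in replacement.

```python
import numpy as np

def mel_spectrogram(audio, filterbank, preemph=0.97, n_fft=512, hop=160, win=400):
    """Band-major [n_mels, n_frames] log-mel spectrogram per the table above.

    `audio` is float32 at 16 kHz; S16_LE input should first be scaled to float
    by dividing the int16 samples by 32768.0. `filterbank` is the [1, 128, 257]
    array from shared/filterbank.bin (Slaney norm).
    """
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1]
    audio = np.append(audio[0], audio[1:] - preemph * audio[:-1])
    # 400-sample (25 ms) Hann windows every 160 samples (10 ms)
    window = np.hanning(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    # Zero-padded 512-point FFT -> 257 bins, power spectrum
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2          # [n_frames, 257]
    mel = filterbank[0] @ spec.T                               # [128, n_frames]
    return np.log(mel + 1e-10)   # log offset is an assumption, not NeMo's exact value
```

Loading the filterbank is then `np.fromfile("shared/filterbank.bin", dtype=np.float32).reshape(1, 128, 257)` (dtype assumed; check `shared/filterbank.meta`).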
### Encoder Streaming

| Parameter | Value |
|---|---|
| Chunk size | 56 mel frames (560 ms audio) |
| Pre-encode cache | 9 mel frames (prepended from previous chunk) |
| Total input frames | 65 per chunk (56 + 9) |
| Layers | 24 |
| Dimension | 1024 |
| `cache_last_channel` shape | `[1, 24, 70, 1024]` (init zeros) |
| `cache_last_time` shape | `[1, 24, 1024, 8]` (init zeros) |
| `cache_last_channel_len` | `[1]` int64 (init zero) |
Feed cache outputs back as next chunk's cache inputs.
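The chunking and cache-carry loop can be sketched as follows. The `audio_signal` input name, the zero-initialized first pre-encode cache, and the `run_encoder` wrapper are assumptions; check the exported graph or `config.json` for the real I/O names.

```python
import numpy as np

def stream_encoder(mel_chunks, run_encoder, n_layers=24, dim=1024):
    """Feed 56-frame mel chunks through the streaming encoder, carrying cache.

    `run_encoder(feeds) -> (encoded, cache_channel, cache_time, cache_len)` is
    an assumed wrapper around the ONNX encoder session.
    """
    cache_channel = np.zeros((1, n_layers, 70, dim), dtype=np.float32)
    cache_time = np.zeros((1, n_layers, dim, 8), dtype=np.float32)
    cache_len = np.zeros((1,), dtype=np.int64)
    pre_cache = np.zeros((128, 9), dtype=np.float32)  # first chunk: assumed zero padding
    for chunk in mel_chunks:                          # each chunk: [128, 56]
        frames = np.concatenate([pre_cache, chunk], axis=1)   # [128, 65]
        pre_cache = chunk[:, -9:]                     # last 9 frames prepend next chunk
        feeds = {
            "audio_signal": frames[None, :, :],       # [1, 128, 65]
            "cache_last_channel": cache_channel,
            "cache_last_time": cache_time,
            "cache_last_channel_len": cache_len,
        }
        # Cache outputs become the next chunk's cache inputs
        encoded, cache_channel, cache_time, cache_len = run_encoder(feeds)
        yield encoded
```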
### RNNT Decoder

| Parameter | Value |
|---|---|
| Blank token ID | 1024 |
| Vocab size | 1025 (1024 tokens + blank) |
| Max symbols per frame | 10 |
| Prediction layers | 2 (LSTM) |
| Prediction hidden | 640 |
| `input_states_1/2` shape | `[2, 1, 640]` float32 (init zeros) |
For each encoder output frame: feed the single frame `[1, 1024, 1]` to the decoder and take the argmax of the logits over the 1025-entry vocabulary. If the result is not blank, emit the token and repeat on the same frame (up to 10 symbols); if it is blank, move on to the next frame. Feed the decoder states back as inputs for the next symbol/frame.
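The greedy loop above can be sketched like this. The `run_decoder` signature and the use of blank as the initial prediction-network input are assumptions about the exported decoder+joint graph.

```python
import numpy as np

def greedy_rnnt_decode(encoded, run_decoder, blank_id=1024, max_symbols=10):
    """Greedy RNNT decoding over encoder output `encoded` of shape [1, 1024, T].

    `run_decoder(frame, last_token, s1, s2) -> (logits, s1, s2)` is an assumed
    wrapper around the decoder session; logits has 1025 entries.
    """
    # Two LSTM state tensors [2, 1, 640], initialized to zeros
    s1 = np.zeros((2, 1, 640), dtype=np.float32)
    s2 = np.zeros((2, 1, 640), dtype=np.float32)
    tokens, last = [], blank_id       # blank as the start token is an assumption
    for t in range(encoded.shape[2]):
        frame = encoded[:, :, t:t + 1]                 # single frame [1, 1024, 1]
        for _ in range(max_symbols):                   # at most 10 symbols per frame
            logits, new_s1, new_s2 = run_decoder(frame, last, s1, s2)
            tok = int(np.argmax(logits))
            if tok == blank_id:
                break                                  # blank: advance to next frame
            tokens.append(tok)
            last, s1, s2 = tok, new_s1, new_s2         # states advance on emission
    return tokens
```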
Files
config.json # Machine-readable runtime parameters (all values above)
shared/
βββ filterbank.bin # Mel filterbank weights [1, 128, 257] Slaney norm
βββ filterbank.meta # Shape metadata
βββ tokens.txt # SentencePiece vocabulary (1024 tokens)
βββ preprocessor.config # Feature extraction parameters
fp32/ # FP32 precision
βββ encoder_model.onnx # Streaming encoder graph
βββ encoder_model.onnx.data # Encoder weights
βββ decoder_model.onnx # RNNT decoder+joint graph
βββ decoder_model.onnx.data # Decoder weights
fp16/ # FP16 precision (same structure)
βββ encoder_model.onnx
βββ encoder_model.onnx.data
βββ decoder_model.onnx
βββ decoder_model.onnx.data
int8-dynamic/ # INT8 dynamic quantization
βββ encoder_model.onnx # MatMulInteger ops, CPU execution provider only
βββ encoder_model.onnx.data
βββ decoder_model.onnx # Decoder stays FP32
βββ decoder_model.onnx.data
int8-static/ # INT8 static quantization (QDQ)
βββ encoder_model.onnx # QDQ nodes, works on CPU and CUDA execution providers
βββ encoder_model.onnx.data
βββ decoder_model.onnx # Decoder stays FP32
βββ decoder_model.onnx.data
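Decoded token IDs can be turned back into text with `shared/tokens.txt`. A minimal sketch, assuming one piece per line and SentencePiece's `▁` word-boundary marker (verify both against the actual file):

```python
def load_tokens(path):
    """Load the SentencePiece vocabulary, one piece per line (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def detokenize(ids, tokens):
    """Join pieces and restore spaces from the '▁' word-boundary marker."""
    return "".join(tokens[i] for i in ids).replace("▁", " ").strip()
```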
## Usage

Download a specific precision:

```shell
# INT8 Static (recommended for GPU with memory constraints)
hf download danielbodart/nemotron-speech-600m-onnx int8-static/ shared/ config.json --local-dir ./model

# FP16 (recommended for GPU, quality first)
hf download danielbodart/nemotron-speech-600m-onnx fp16/ shared/ config.json --local-dir ./model

# INT8 Dynamic (recommended for Intel CPU)
hf download danielbodart/nemotron-speech-600m-onnx int8-dynamic/ shared/ config.json --local-dir ./model

# FP32 (reference)
hf download danielbodart/nemotron-speech-600m-onnx fp32/ shared/ config.json --local-dir ./model
```
Load with ONNX Runtime:

```python
import onnxruntime as ort

# GPU inference: INT8 Static (less VRAM) or FP16 (slightly better quality)
encoder = ort.InferenceSession("model/int8-static/encoder_model.onnx",
                               providers=["CUDAExecutionProvider"])
decoder = ort.InferenceSession("model/int8-static/decoder_model.onnx",
                               providers=["CUDAExecutionProvider"])

# CPU inference (INT8 Dynamic recommended)
encoder = ort.InferenceSession("model/int8-dynamic/encoder_model.onnx",
                               providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession("model/int8-dynamic/decoder_model.onnx",
                               providers=["CPUExecutionProvider"])
```
## Export Reproducibility

All export, conversion, quantization, and validation scripts are in the companion GitHub repo: `danielbodart/nemotron-speech-600m-onnx`

- `nemo_export_onnx.py`: export from the original NeMo checkpoint
- `onnx_fp16_convert.py`: FP32 → FP16 conversion with I/O Cast wrappers
- `onnx_fp16_validate.py`: numerical validation
- `onnx_int8_quantize.py`: INT8 quantization (dynamic and static modes)
- `onnx_int8_calibration.py`: warm-cache streaming calibration data reader
- `onnx_int8_validate.py`: INT8 streaming validation
- `onnx_package_for_hf.py`: consolidate and package for Hub upload
## Related

- `danielbodart/nemotron-speech-600m-coreml`: native CoreML version for Apple Silicon (93% ANE, ~15.5x realtime)
## License
The original model is licensed under CC-BY-4.0 by NVIDIA.