Nemotron Speech 600M - ONNX (Streaming)

ONNX export of nvidia/nemotron-speech-streaming-en-0.6b, a 600M-parameter streaming ASR model using FastConformer encoder + RNNT decoder.

Exported directly from the original NeMo checkpoint (not from a third-party conversion) to ensure numerical fidelity.

Apple Silicon users: use the native CoreML version, danielbodart/nemotron-speech-600m-coreml, instead. It runs 93% of encoder ops on the Apple Neural Engine at ~15.5x realtime, whereas ONNX Runtime on macOS executes on CPU only.

Available Precisions

| Variant | Encoder Weights | Decoder Weights | Total | Target Hardware | Execution Provider | Notes |
|---|---|---|---|---|---|---|
| fp32/ | 2.3 GB | 34 MB | 2.4 GB | Any | CPU, CUDA | Original precision |
| fp16/ | 1.2 GB | 17 MB | 1.2 GB | NVIDIA GPU | CPU, CUDA | Recommended for quality-first GPU inference |
| int8-dynamic/ | 799 MB | 34 MB | 876 MB | Intel CPU (VNNI/AMX) | CPU only | Dynamic quantization, MatMul weights only |
| int8-static/ | 799 MB | 34 MB | 876 MB | NVIDIA GPU, Intel CPU | CPU, CUDA | QDQ format, ~45% less VRAM than FP16 on GPU |

Choosing a variant

  • NVIDIA GPU (memory constrained): int8-static/ loads on the CUDA EP and uses ~1.3 GB VRAM vs ~2.4 GB for FP16
  • NVIDIA GPU (quality first): fp16/ is marginally better on edge cases with repetitive content
  • Intel CPU: int8-dynamic/ or int8-static/ both use VNNI/AMX integer instructions with similar performance
  • Apple Silicon: use the CoreML version instead (93% ANE, ~15.5x realtime); ONNX Runtime on macOS runs CPU-only.

INT8 Static quantization details

The static variant uses QDQ (QuantizeLinear/DequantizeLinear) nodes with calibrated activation scales. Only MatMul ops are quantized (as in the dynamic variant), but activation scales are pre-computed using warm-cache streaming calibration: audio chunks are fed sequentially through the FP32 model with encoder cache state carried forward, so activation ranges reflect real streaming inference conditions. This avoids the blank-token collapse that breaks naive static quantization of streaming models with persistent cache state.
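A minimal sketch of such a calibration reader, assuming NumPy feeds and typical NeMo export input names (`audio_signal` and `length` are assumptions; the cache names and shapes come from the Encoder Streaming table below). It implements the `get_next()` interface that `onnxruntime.quantization.CalibrationDataReader` expects:

```python
import numpy as np

class WarmCacheReader:
    """Calibration data reader for static quantization of the streaming
    encoder. Chunks are fed sequentially and encoder cache state is carried
    forward between them, so calibrated activation ranges reflect warm-cache
    streaming conditions rather than cold-start ones."""

    def __init__(self, fp32_session, mel_chunks):
        # fp32_session: ONNX Runtime session for the FP32 encoder (assumed API)
        # mel_chunks: iterable of [1, 128, 65] float32 chunks (56 + 9 cache frames)
        self.session = fp32_session
        self.chunks = iter(mel_chunks)
        # Zero-initialized caches, shapes from the Encoder Streaming table
        self.cache_ch = np.zeros((1, 24, 70, 1024), dtype=np.float32)
        self.cache_t = np.zeros((1, 24, 1024, 8), dtype=np.float32)
        self.cache_len = np.zeros((1,), dtype=np.int64)

    def get_next(self):
        chunk = next(self.chunks, None)
        if chunk is None:
            return None  # calibration finished
        feed = {
            "audio_signal": chunk,
            "length": np.array([chunk.shape[2]], dtype=np.int64),
            "cache_last_channel": self.cache_ch,
            "cache_last_time": self.cache_t,
            "cache_last_channel_len": self.cache_len,
        }
        # Advance cache state through the FP32 model so the next chunk is
        # calibrated against realistic (warm) cache activations; the output
        # ordering (caches last) is an assumption about the exported graph.
        outs = self.session.run(None, feed)
        self.cache_ch, self.cache_t, self.cache_len = outs[-3], outs[-2], outs[-1]
        return feed
```

An instance of this class would be passed as the `calibration_data_reader` argument of `onnxruntime.quantization.quantize_static`.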

Model Architecture

  • Encoder: Cache-aware streaming FastConformer (24 layers, 1024 dim)
  • Decoder: RNNT with LSTM prediction network (2 layers, 640 hidden)
  • Vocabulary: 1025 SentencePiece tokens (1024 + blank)
  • Audio input: 128-channel mel spectrogram, 16kHz, 560ms chunks (56 mel frames + 9 cache frames)

Runtime Configuration

All parameters needed to run the model are documented in config.json. Key values:

Audio Preprocessing

| Parameter | Value |
|---|---|
| Sample rate | 16000 Hz |
| Sample format | S16_LE (16-bit signed little-endian) |
| Pre-emphasis | 0.97 |
| FFT size | 512 |
| Hop length | 160 samples (10ms) |
| Window length | 400 samples (25ms) |
| Window type | Hann |
| Mel bands | 128 |
| Mel norm | Slaney |
| Mel layout | Band-major [n_mels, n_frames] (not frame-major) |

Mel filterbank weights are provided in shared/filterbank.bin (shape [1, 128, 257]).
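For illustration, the preprocessing parameters above can be turned into a NumPy sketch along these lines. This is not the exact NeMo preprocessor: the symmetric Hann window, power spectrum, and log floor are assumptions, so validate against the FP32 reference. The filterbank could be loaded with `np.fromfile("shared/filterbank.bin", dtype=np.float32).reshape(1, 128, 257)` (dtype assumed):

```python
import numpy as np

def log_mel_chunk(samples, filterbank, preemph=0.97, n_fft=512, hop=160, win=400):
    """Compute a band-major [n_mels, n_frames] log-mel spectrogram from
    16 kHz S16_LE samples, using the parameters in the table above.
    filterbank: [1, 128, 257] mel weights from shared/filterbank.bin."""
    x = samples.astype(np.float32) / 32768.0          # S16_LE -> float in [-1, 1)
    x = np.append(x[0], x[1:] - preemph * x[:-1])     # pre-emphasis filter
    window = np.hanning(win)                          # Hann (symmetric; assumption)
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win] * window
        spec = np.fft.rfft(frame, n=n_fft)            # 512-point FFT -> 257 bins
        frames.append(np.abs(spec) ** 2)              # power spectrum (assumption)
    power = np.stack(frames, axis=1)                  # [257, n_frames]
    mel = filterbank[0] @ power                       # [128, n_frames], band-major
    return np.log(mel + 1e-9)                         # log floor is an assumption
```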

Encoder Streaming

| Parameter | Value |
|---|---|
| Chunk size | 56 mel frames (560ms audio) |
| Pre-encode cache | 9 mel frames (prepended from previous chunk) |
| Total input frames | 65 per chunk (56 + 9) |
| Layers | 24 |
| Dimension | 1024 |
| cache_last_channel shape | [1, 24, 70, 1024] (init zeros) |
| cache_last_time shape | [1, 24, 1024, 8] (init zeros) |
| cache_last_channel_len | [1] int64 (init zero) |

Feed cache outputs back as next chunk's cache inputs.
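A minimal streaming loop over a band-major mel spectrogram, using the cache shapes above. The input names (`audio_signal`, `length`) and the output ordering are assumptions from typical NeMo exports; check `session.get_inputs()` and `session.get_outputs()` against the actual graph:

```python
import numpy as np

CHUNK, CACHE_FRAMES, N_MELS = 56, 9, 128

def stream_encoder(session, mel):
    """Run the streaming encoder over a band-major [128, n_frames] mel
    spectrogram, feeding cache outputs back as the next chunk's cache inputs."""
    cache_ch = np.zeros((1, 24, 70, 1024), dtype=np.float32)
    cache_t = np.zeros((1, 24, 1024, 8), dtype=np.float32)
    cache_len = np.zeros((1,), dtype=np.int64)
    # 9-frame pre-encode cache; zero-padded before the first chunk (assumption)
    prev_tail = np.zeros((N_MELS, CACHE_FRAMES), dtype=np.float32)
    outputs = []
    for start in range(0, mel.shape[1] - CHUNK + 1, CHUNK):
        chunk = mel[:, start:start + CHUNK]
        signal = np.concatenate([prev_tail, chunk], axis=1)[None]  # [1, 128, 65]
        prev_tail = chunk[:, -CACHE_FRAMES:]
        enc, enc_len, cache_ch, cache_t, cache_len = session.run(None, {
            "audio_signal": signal.astype(np.float32),
            "length": np.array([signal.shape[2]], dtype=np.int64),
            "cache_last_channel": cache_ch,
            "cache_last_time": cache_t,
            "cache_last_channel_len": cache_len,
        })
        outputs.append(enc)
    return np.concatenate(outputs, axis=-1) if outputs else None
```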

RNNT Decoder

| Parameter | Value |
|---|---|
| Blank token ID | 1024 |
| Vocab size | 1025 (1024 tokens + blank) |
| Max symbols per frame | 10 |
| Prediction layers | 2 (LSTM) |
| Prediction hidden | 640 |
| input_states_1/2 shape | [2, 1, 640] float32 (init zeros) |

For each encoder output frame, feed the single frame [1, 1024, 1] to the decoder and take the argmax of the logits over the 1025-entry vocabulary. If the result is not blank, emit the token and decode again on the same frame (up to 10 symbols); if it is blank, advance to the next frame. Feed the decoder state outputs back as inputs for the next symbol or frame.
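That greedy loop can be sketched as follows, assuming the decoder graph takes the encoder frame, the last emitted token, and the two LSTM states, and returns joint logits plus updated states. The feed names `encoder_outputs` and `targets` are assumptions (some exports also require a target-length input); only `input_states_1/2` come from the table above:

```python
import numpy as np

BLANK_ID, MAX_SYMBOLS = 1024, 10

def greedy_rnnt_decode(decoder, encoder_out):
    """Greedy RNNT decoding over encoder output [1, 1024, T]."""
    s1 = np.zeros((2, 1, 640), dtype=np.float32)   # LSTM states, init zeros
    s2 = np.zeros((2, 1, 640), dtype=np.float32)
    last_token = np.array([[BLANK_ID]], dtype=np.int32)  # blank as start token
    tokens = []
    for t in range(encoder_out.shape[2]):
        frame = encoder_out[:, :, t:t + 1]          # single frame [1, 1024, 1]
        for _ in range(MAX_SYMBOLS):                # at most 10 symbols per frame
            logits, new_s1, new_s2 = decoder.run(None, {
                "encoder_outputs": frame,
                "targets": last_token,
                "input_states_1": s1,
                "input_states_2": s2,
            })
            token = int(np.argmax(logits))          # argmax over 1025 vocab
            if token == BLANK_ID:
                break                               # blank: advance to next frame
            tokens.append(token)                    # non-blank: emit and loop
            last_token = np.array([[token]], dtype=np.int32)
            s1, s2 = new_s1, new_s2                 # carry decoder state forward
    return tokens
```

The emitted token IDs map to SentencePiece pieces via shared/tokens.txt.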

Files

config.json                 # Machine-readable runtime parameters (all values above)

shared/
├── filterbank.bin          # Mel filterbank weights [1, 128, 257] Slaney norm
├── filterbank.meta         # Shape metadata
├── tokens.txt              # SentencePiece vocabulary (1024 tokens)
└── preprocessor.config     # Feature extraction parameters

fp32/                       # FP32 precision
├── encoder_model.onnx      # Streaming encoder graph
├── encoder_model.onnx.data # Encoder weights
├── decoder_model.onnx      # RNNT decoder+joint graph
└── decoder_model.onnx.data # Decoder weights

fp16/                       # FP16 precision (same structure)
├── encoder_model.onnx
├── encoder_model.onnx.data
├── decoder_model.onnx
└── decoder_model.onnx.data

int8-dynamic/               # INT8 dynamic quantization
├── encoder_model.onnx      # MatMulInteger ops, CPU execution provider only
├── encoder_model.onnx.data
├── decoder_model.onnx      # Decoder stays FP32
└── decoder_model.onnx.data

int8-static/                # INT8 static quantization (QDQ)
├── encoder_model.onnx      # QDQ nodes, works on CPU and CUDA execution providers
├── encoder_model.onnx.data
├── decoder_model.onnx      # Decoder stays FP32
└── decoder_model.onnx.data

Usage

Download a specific precision:

# INT8 Static (recommended for GPU with memory constraints)
hf download danielbodart/nemotron-speech-600m-onnx int8-static/ shared/ config.json --local-dir ./model

# FP16 (recommended for GPU, quality first)
hf download danielbodart/nemotron-speech-600m-onnx fp16/ shared/ config.json --local-dir ./model

# INT8 Dynamic (recommended for Intel CPU)
hf download danielbodart/nemotron-speech-600m-onnx int8-dynamic/ shared/ config.json --local-dir ./model

# FP32 (reference)
hf download danielbodart/nemotron-speech-600m-onnx fp32/ shared/ config.json --local-dir ./model

Load with ONNX Runtime:

import onnxruntime as ort

# GPU inference: INT8 Static (less VRAM) or FP16 (slightly better quality)
encoder = ort.InferenceSession("model/int8-static/encoder_model.onnx",
                                providers=["CUDAExecutionProvider"])
decoder = ort.InferenceSession("model/int8-static/decoder_model.onnx",
                                providers=["CUDAExecutionProvider"])

# CPU inference (INT8 Dynamic recommended)
encoder = ort.InferenceSession("model/int8-dynamic/encoder_model.onnx",
                                providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession("model/int8-dynamic/decoder_model.onnx",
                                providers=["CPUExecutionProvider"])

Export Reproducibility

All export, conversion, quantization, and validation scripts are in the companion GitHub repo: danielbodart/nemotron-speech-600m-onnx

  • nemo_export_onnx.py - Export from the original NeMo checkpoint
  • onnx_fp16_convert.py - FP32 → FP16 conversion with I/O Cast wrappers
  • onnx_fp16_validate.py - Numerical validation
  • onnx_int8_quantize.py - INT8 quantization (dynamic and static modes)
  • onnx_int8_calibration.py - Warm-cache streaming calibration data reader
  • onnx_int8_validate.py - INT8 streaming validation
  • onnx_package_for_hf.py - Consolidate and package for Hub upload

License

The original model is licensed under CC-BY-4.0 by NVIDIA.
