# Parakeet TDT 0.6B v3 - ONNX (Quantized)
ONNX exports of NVIDIA Parakeet TDT 0.6B ASR encoder in multiple precision formats.
## Model Variants
| Variant | Encoder Size | Compression | Quality (cosine vs FP32) | Notes |
|---|---|---|---|---|
| fp32 | 2,363 MB | 1x (baseline) | 1.000 | Original FP32 weights |
| fp16 | 1,182 MB | 2x | 0.9997 | FP16 weights, optimized graph |
| int8 | 622 MB | 3.8x | 0.808 | ORT dynamic INT8 quantization |
| fp8 | 621 MB | 3.8x | n/a (custom format, see note) | FP8 E4M3FN weights |
| int4 | 330 MB | 7.2x | n/a (custom format, see note) | INT4 packed weights + scale metadata |
*Note: the INT8 cosine score was measured on random-noise input; quality on real audio is expected to be higher. The FP8 and INT4 models store weights in custom formats and require either a `DequantizeLinear`-based graph or a custom loader.*
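The quality column compares each variant's encoder output against the FP32 baseline via cosine similarity. A minimal NumPy sketch of that comparison, using random arrays as stand-ins for the two encoders' outputs (no model is loaded here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Flatten both tensors and compute a single cosine score."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 1024, 100))            # stand-in for FP32 encoder output
quant = ref + 0.01 * rng.standard_normal(ref.shape)  # stand-in for a quantized variant's output
score = cosine_similarity(ref, quant)                # close to 1.0 for a faithful variant
```

In practice `ref` and `quant` would be `outputs[0]` from two `InferenceSession` runs on the same input.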
## Files
- `config.yaml` - Model configuration
- `decoder_joint.onnx` - Decoder/Joint network (shared across all variants)
- `tokenizer/` - SentencePiece tokenizer files
- `fp32/`, `fp16/`, `int8/`, `fp8/`, `int4/` - Encoder variants
## Usage
```python
import onnxruntime as ort
import numpy as np

# FP16 (recommended for GPU) or INT8 (recommended for CPU)
sess = ort.InferenceSession("fp16/encoder.onnx", providers=["CUDAExecutionProvider"])

audio_features = np.random.randn(1, 128, 100).astype(np.float32)  # [batch, mel_bins, frames]
length = np.array([100], dtype=np.int64)

outputs = sess.run(None, {"audio_signal": audio_features, "length": length})
encoded = outputs[0]  # [batch, d_model, time]
```
## Quantization Details
- **FP16**: Full graph + weight conversion via `onnxruntime.transformers.optimizer`
- **INT8**: Dynamic weight-only quantization via `onnxruntime.quantization.quantize_dynamic`
- **FP8**: Manual weight conversion to E4M3FN format using `ml_dtypes`
- **INT4**: Symmetric per-tensor INT4 quantization, packed 2 values per byte, with JSON scale metadata
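The INT4 scheme described above (symmetric per-tensor quantization, two values packed per byte, scale carried as JSON metadata) can be sketched in NumPy. The exact nibble order and metadata keys in the released files may differ, so treat this as an illustration, not the shipped loader:

```python
import json
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor INT4: values clipped to [-8, 7], two nibbles per byte."""
    scale = float(np.abs(w).max() / 7.0) or 1.0          # avoid a zero scale for all-zero tensors
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    flat = q.ravel()
    if flat.size % 2:                                    # pad to an even count before packing
        flat = np.append(flat, 0)
    lo = (flat[0::2] & 0x0F).astype(np.uint8)            # even indices -> low nibble
    hi = (flat[1::2] & 0x0F).astype(np.uint8)            # odd indices  -> high nibble
    packed = (lo | (hi << 4)).astype(np.uint8)
    meta = json.dumps({"scale": scale, "shape": list(w.shape)})
    return packed, meta

def dequantize_int4(packed: np.ndarray, meta: str) -> np.ndarray:
    info = json.loads(meta)
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    q = np.empty(lo.size + hi.size, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi                            # interleave nibbles back in order
    q = np.where(q > 7, q - 16, q)                       # sign-extend 4-bit two's complement
    n = int(np.prod(info["shape"]))                      # drop the padding nibble, if any
    return (q[:n].astype(np.float32) * info["scale"]).reshape(info["shape"])
```

The round-trip error is bounded by half the scale, i.e. `max|w| / 14` per element.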
## Source
Exported from NVIDIA NeMo Parakeet TDT 0.6B checkpoint on DGX Spark (GB10 Grace Blackwell).
Base model: nvidia/parakeet-tdt-0.6b-v2