Parakeet TDT 0.6B v3 - ONNX (Quantized)

ONNX exports of the NVIDIA Parakeet TDT 0.6B v3 ASR model, with the encoder provided in multiple precision formats.

Model Variants

| Variant | Encoder Size | Compression | Quality (cosine vs FP32) | Notes |
|---------|--------------|-------------|--------------------------|-------|
| fp32 | 2,363 MB | 1x (baseline) | 1.000 | Original FP32 weights |
| fp16 | 1,182 MB | 2x | 0.9997 | FP16 weights, optimized graph |
| int8 | 622 MB | 3.8x | 0.808 | ORT dynamic INT8 quantization |
| fp8 | 621 MB | 3.8x | — | Custom FP8 E4M3FN weights |
| int4 | 330 MB | 7.2x | — | Custom INT4 packed weights + scale metadata |

The INT8 cosine similarity was measured on random-noise input; quality on real audio is expected to be higher. The FP8 and INT4 models store weights in custom formats and require DequantizeLinear nodes or a custom loader at inference time.
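The cosine numbers above compare encoder outputs flattened to vectors. A minimal helper of that shape (pure NumPy; feed it the FP32 and quantized encoder outputs from the Usage section below):

```python
import numpy as np

def flat_cosine(a, b):
    """Cosine similarity between two tensors, flattened to 1-D."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical outputs score 1.0; quantization error pulls the score below 1.0.
x = np.random.randn(1, 1024, 100)
assert abs(flat_cosine(x, x) - 1.0) < 1e-12
```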

Files

  • config.yaml - Model configuration
  • decoder_joint.onnx - Decoder/Joint network (shared across all variants)
  • tokenizer/ - SentencePiece tokenizer files
  • fp32/, fp16/, int8/, fp8/, int4/ - Encoder variants

Usage

import onnxruntime as ort
import numpy as np

# FP16 (recommended for GPU) or INT8 (recommended for CPU)
sess = ort.InferenceSession(
    "fp16/encoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU
)

# Dummy input: log-mel-style features [batch, n_mels=128, time] plus frame counts
audio_features = np.random.randn(1, 128, 100).astype(np.float32)
length = np.array([100], dtype=np.int64)

outputs = sess.run(None, {"audio_signal": audio_features, "length": length})
encoded = outputs[0]  # [batch, d_model, time]
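The length input exists so that variable-length utterances can be batched: pad each feature array to a common time dimension and pass the true per-item frame counts. A minimal padding helper (pure NumPy; the function name is illustrative):

```python
import numpy as np

def pad_batch(features_list):
    """Stack [n_mels, time_i] feature arrays into a zero-padded batch.

    Returns (batch [B, n_mels, T_max], lengths [B]) matching the
    encoder's audio_signal / length inputs.
    """
    lengths = np.array([f.shape[1] for f in features_list], dtype=np.int64)
    t_max = int(lengths.max())
    n_mels = features_list[0].shape[0]
    batch = np.zeros((len(features_list), n_mels, t_max), dtype=np.float32)
    for i, f in enumerate(features_list):
        batch[i, :, : f.shape[1]] = f  # copy real frames, leave padding at zero
    return batch, lengths
```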

Quantization Details

  • FP16: Full graph + weight conversion via onnxruntime.transformers.optimizer
  • INT8: Dynamic weight-only quantization via onnxruntime.quantization.quantize_dynamic
  • FP8: Manual weight conversion to E4M3FN format using ml_dtypes
  • INT4: Symmetric per-tensor INT4 quantization, packed 2 values per byte, with JSON scale metadata
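The INT4 scheme above (symmetric per-tensor scale, two 4-bit values packed per byte) can be sketched as follows. This is an illustration, not the repository's loader; the nibble layout (even index in the low nibble) is an assumption:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor INT4: map weights to [-7, 7], pack 2 values/byte."""
    scale = float(np.abs(w).max()) / 7.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for an all-zero tensor
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    flat = q.ravel()
    if flat.size % 2:
        flat = np.append(flat, np.int8(0))  # pad to an even count
    nibbles = (flat & 0x0F).astype(np.uint8)  # two's-complement low nibble
    packed = nibbles[0::2] | (nibbles[1::2] << 4)  # assumed layout: even=low
    return packed, scale

def dequantize_int4(packed, scale, shape):
    """Unpack nibbles, sign-extend the 4-bit values, and rescale to float32."""
    q = np.empty(packed.size * 2, dtype=np.int16)
    q[0::2] = packed & 0x0F
    q[1::2] = packed >> 4
    q[q > 7] -= 16  # sign-extend: 8..15 -> -8..-1
    n = int(np.prod(shape))
    return (q[:n].astype(np.float32) * scale).reshape(shape)
```

Round-tripping a weight tensor through these two functions bounds the per-element error by half a quantization step (scale / 2), which is what the scale metadata exists to preserve.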

Source

Exported from NVIDIA NeMo Parakeet TDT 0.6B checkpoint on DGX Spark (GB10 Grace Blackwell).
