Parakeet TDT 0.6B v3 - LiteRT (INT8)

NVIDIA's multilingual FastConformer ASR. 25 languages, INT8 encoder + FP32 decoder-joint.

Part of the soniqo.audio speech toolkit, an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

NVIDIA's multilingual FastConformer ASR model exported to LiteRT for Android. Covers 25 languages. Split into a FastConformer encoder and a streaming LSTM decoder-joint for token-level inference.

Model

Component                        Parameters  Format  Size (INT8)
Encoder (FastConformer)          ~600 M      TFLite  567.3 MB
Decoder + Joint (LSTM + linear)  ~15 M       TFLite  17.7 MB

Files

File                           Size      Description
parakeet-encoder.tflite        567.3 MB  FastConformer encoder, INT8 dynamic weights
parakeet-decoder-joint.tflite  17.7 MB   Fused LSTM decoder + joint, INT8
vocab.json                     192 KB    8,192-token SentencePiece vocab
config.json                    1 KB      Encoder / decoder / joint specs

Pipeline

audio [1, N] ──► mel fbank (128 bins, 16 kHz) ──► encoder ──► encoded [1, 1024, T']
                                                                  │
                                                                  ▼
targets (blank-initialized) ──► decoder-joint ──► logits [1, 1, 1, 1030]
                                                         │
                                                         ▼
                                                    TDT decode

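TDT (Token-and-Duration Transducer) emits both a token and a duration at every decoding step. The duration, one of {0, 1, 2, 3, 4}, is the number of encoder frames to advance before the next step. Blank id = 1024, vocab size = 1024, total logits = 1030 (1024 tokens + 1 blank + 5 durations).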
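
As a minimal sketch of how the 1030-element joint output might be read, the Python below splits one logit vector into a token choice and a duration choice. It assumes the vector is laid out as the 1025 token-plus-blank scores followed by the 5 duration scores; that ordering is the usual TDT convention but is not stated on this card, so check it against config.json. `split_tdt_logits` is an illustrative name, not part of the bundle.

```python
BLANK_ID = 1024              # from the card: blank id = 1024
NUM_TOKEN_LOGITS = 1025      # 1024 tokens + 1 blank
DURATIONS = (0, 1, 2, 3, 4)  # frames to advance, from the card


def split_tdt_logits(logits):
    """Return (token_id, duration) for one flat 1030-element logit vector.

    Assumption: token/blank scores come first, duration scores last.
    """
    expected = NUM_TOKEN_LOGITS + len(DURATIONS)
    if len(logits) != expected:
        raise ValueError(f"expected {expected} logits, got {len(logits)}")
    token_logits = logits[:NUM_TOKEN_LOGITS]
    duration_logits = logits[NUM_TOKEN_LOGITS:]
    # Greedy choice: independent argmax over each head.
    token_id = max(range(len(token_logits)), key=token_logits.__getitem__)
    duration_idx = max(range(len(duration_logits)), key=duration_logits.__getitem__)
    return token_id, DURATIONS[duration_idx]
```

A returned `token_id` equal to `BLANK_ID` means nothing is emitted at this step; the duration still says how far to move along the encoder output.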

Encoder signature

Inputs:
  audio_signal  [1, 128, T]   float32   Mel features (log, normalized)
  length        [1]           int64     Valid T (NeMo convention)

Outputs:
  encoded       [1, 1024, T'] float32   Encoded features
  encoded_length [1]          int64     Valid T'
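
The encoder output is channel-major ([1, 1024, T']), while the decoder-joint takes one frame at a time as [1, 1, 1024], so the caller has to gather a single time step across all channels. The sketch below shows that gather on plain nested lists; `frame_at` is a hypothetical helper name, and a real integration would more likely do the same thing with a strided copy in native code.

```python
def frame_at(encoded, t):
    """Gather time step t from encoded[1][C][T'] into a [1][1][C] frame."""
    channels = encoded[0]  # drop the batch dimension (batch size is 1)
    if not 0 <= t < len(channels[0]):
        raise IndexError(f"frame index {t} out of range")
    return [[[channel[t] for channel in channels]]]


def valid_frames(encoded, encoded_length):
    """Yield only the first encoded_length frames, ignoring any padding."""
    for t in range(encoded_length):
        yield frame_at(encoded, t)
```

With a toy tensor of 3 channels and 2 frames, `frame_at([[[1, 2], [3, 4], [5, 6]]], 1)` gives `[[[2, 4, 6]]]`.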

Decoder-joint signature

Inputs:
  encoder_out   [1, 1, 1024]  float32   Current encoder frame
  target        [1, 1]        int64     Last emitted token (blank to start)
  h             [2, 1, 640]   float32   LSTM hidden state
  c             [2, 1, 640]   float32   LSTM cell state

Outputs:
  logits        [1, 1, 1, 1030] float32 Joint output
  h_out         [2, 1, 640]   float32   Next hidden state
  c_out         [2, 1, 640]   float32   Next cell state
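
Putting the two signatures together, a greedy TDT decode loop might look like the sketch below. `step` is a hypothetical callable that wraps one decoder-joint invocation: it takes the current frame, the last emitted token, and the (h, c) state, and returns the flat 1030 logits plus the next state. Three choices here are assumptions rather than anything this card specifies: the LSTM state is only advanced when a non-blank token is emitted, a blank with duration 0 is forced to advance one frame, and at most 10 tokens are emitted per frame as a safety cap. The initial state would be zero-filled h and c of shape [2, 1, 640].

```python
BLANK_ID = 1024
NUM_TOKEN_LOGITS = 1025      # 1024 tokens + 1 blank
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10   # assumption: guard against endless duration-0 loops


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def greedy_tdt_decode(frames, step, init_state):
    """Greedy TDT decoding over a list of encoder frames.

    step(frame, last_token, state) -> (logits, new_state) is a hypothetical
    wrapper around the decoder-joint model.
    """
    tokens = []
    last_token = BLANK_ID    # the card starts decoding from blank
    state = init_state
    t = 0
    emitted_here = 0
    while t < len(frames):
        logits, new_state = step(frames[t], last_token, state)
        token = argmax(logits[:NUM_TOKEN_LOGITS])
        duration = DURATIONS[argmax(logits[NUM_TOKEN_LOGITS:])]
        if token != BLANK_ID:
            tokens.append(token)
            last_token = token
            state = new_state  # keep the new state only after a real token
            emitted_here += 1
        # Without this, a blank (or a capped frame) with duration 0 never advances.
        if duration == 0 and (token == BLANK_ID or emitted_here >= MAX_SYMBOLS_PER_FRAME):
            duration = 1
        if duration > 0:
            t += duration
            emitted_here = 0
    return tokens
```

The returned ids index into vocab.json; mapping them back to text is left to the SentencePiece detokenization step.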

Audio preprocessing

The model expects the exact NeMo mel pipeline: 128 mel bins, 16 kHz, n_fft=512, hop_length=160, win_length=400, pre_emphasis=0.97, log mel with per-utterance normalization. Implement this on the caller side in native code to match the NeMo reference exactly.
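
Two of the steps named above are simple enough to sketch directly: pre-emphasis on the raw samples and normalization of the log-mel features. The Python below is a reference-style sketch only, not the NeMo implementation. It assumes that the normalization is applied per mel bin over the frames of one utterance, with an unbiased standard deviation and a small epsilon of 1e-5; verify both against the NeMo preprocessor before relying on it. The STFT and mel filterbank stages are not shown.

```python
import math

PRE_EMPHASIS = 0.97  # from the card


def pre_emphasize(samples, coeff=PRE_EMPHASIS):
    """y[0] = x[0]; y[n] = x[n] - coeff * x[n - 1] for n >= 1."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def normalize_per_bin(log_mel, eps=1e-5):
    """Zero-mean, unit-variance normalization of each mel bin over time.

    log_mel is a list of rows, one row per mel bin, each row a list of
    per-frame values. eps and the unbiased estimator are assumptions.
    """
    normalized = []
    for row in log_mel:
        n = len(row)
        mean = sum(row) / n
        variance = sum((v - mean) ** 2 for v in row) / max(n - 1, 1)
        std = math.sqrt(variance) + eps
        normalized.append([(v - mean) / std for v in row])
    return normalized
```

With hop_length=160 at 16 kHz, each mel frame covers a 10 ms step, so one second of audio yields roughly 100 frames before the encoder's own subsampling.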

Source

Upstream: nvidia/parakeet-tdt-0.6b-v3 (CC BY 4.0). 25-language multilingual ASR.

Links

Ecosystem

  • soniqo.audio: use-case explorer (transcription, voice cloning, live ASR, voice agents).
  • speech-core: C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
  • speech-swift: Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
  • speech-android: Android SDK consuming on-device LiteRT bundles.

License

This bundle inherits the upstream model license (CC BY 4.0). See the upstream nvidia/parakeet-tdt-0.6b-v3 repository for the full terms.
