Parakeet TDT 0.6B v3 — LiteRT (INT8)

NVIDIA's multilingual FastConformer ASR. 25 languages, INT8 encoder + FP32 decoder-joint.

Part of the soniqo.audio speech toolkit — an open, runtime-portable stack for speech AI. This bundle is the LiteRT export, designed to plug into the abstract interfaces in speech-core (C++ voice-agent orchestration library). Browse all LiteRT bundles in the soniqo LiteRT collection.

Use cases on soniqo.audio

This bundle is NVIDIA's multilingual FastConformer ASR model, exported to LiteRT for on-device use on Android. It covers 25 languages and is split into two graphs: a FastConformer encoder, run once per utterance, and a fused LSTM decoder-joint, run once per decoding step.

Model

| Component | Parameters | Format | Size (INT8) |
|---|---|---|---|
| Encoder (FastConformer) | ~600 M | TFLite | 567.3 MB |
| Decoder + Joint (LSTM + linear) | ~15 M | TFLite | 17.7 MB |

Files

| File | Size | Description |
|---|---|---|
| `parakeet-encoder.tflite` | 567.3 MB | FastConformer encoder, INT8 weights (dynamic-range quantization) |
| `parakeet-decoder-joint.tflite` | 17.7 MB | Fused LSTM decoder + joint, INT8 |
| `vocab.json` | 192 KB | 8,192-token SentencePiece vocabulary |
| `config.json` | 1 KB | Encoder / decoder / joint specs |
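Turning decoded token ids back into text needs only `vocab.json`. The sketch below is a minimal illustration, not the toolkit's own detokenizer. It assumes `vocab.json` is a JSON object mapping id strings to SentencePiece pieces; the actual layout is not documented here, so adjust `load_id_to_piece` if the file is stored the other way round. Both function names are hypothetical.

```python
import json

WORD_BOUNDARY = "\u2581"  # SentencePiece marks the start of a word with this character


def load_id_to_piece(path):
    """Load vocab.json as {id: piece}.

    Assumption: the file is a JSON object of the form {"0": "<piece>", ...}.
    """
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {int(token_id): piece for token_id, piece in raw.items()}


def pieces_to_text(token_ids, id_to_piece):
    """Join pieces and turn SentencePiece word-boundary markers into spaces."""
    text = "".join(id_to_piece[i] for i in token_ids)
    return text.replace(WORD_BOUNDARY, " ").strip()
```

Blank ids should be filtered out before calling `pieces_to_text`, since the blank symbol has no entry in a SentencePiece vocabulary.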

Pipeline

audio [1, N] ──► mel fbank (128 bins, 16 kHz) ──► encoder ──► encoded [1, 1024, T']
                                                                  │
                                                                  ▼
targets (blank-initialized) ──► decoder-joint ──► logits [1, 1, 1, 1030]
                                                         │
                                                         ▼
                                                    TDT decode

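TDT (Token-and-Duration Transducer) predicts two things at every decoding step: a token, and a duration, which is the number of encoder frames to advance, drawn from {0, 1, 2, 3, 4}. The joint output therefore has 1030 logits: 1024 vocabulary tokens (ids 0 to 1023), one blank (id 1024), and 5 duration classes.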
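As a sketch of how one joint output vector can be interpreted, the function below splits the logits into a token part and a duration part and takes the argmax of each. It assumes the layout is `[tokens..., blank, durations...]`, which matches the counts above but is not stated explicitly in this card. `split_tdt_logits` is a hypothetical helper name.

```python
NUM_DURATIONS = 5  # durations {0, 1, 2, 3, 4} from the model card


def split_tdt_logits(logits, num_durations=NUM_DURATIONS):
    """Return (token_id, duration, blank_id) for one flat joint output vector.

    Assumption: the last `num_durations` entries are duration scores and
    blank is the last entry of the remaining token scores.
    """
    token_logits = logits[:-num_durations]     # vocabulary tokens followed by blank
    duration_logits = logits[-num_durations:]  # one score per duration 0..4
    blank_id = len(token_logits) - 1
    token_id = max(range(len(token_logits)), key=token_logits.__getitem__)
    duration = max(range(num_durations), key=duration_logits.__getitem__)
    return token_id, duration, blank_id
```

Deriving `blank_id` from the vector length rather than hard-coding it keeps the helper valid if the vocabulary size in `config.json` differs from the numbers quoted here.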

Encoder signature

Inputs:
  audio_signal  [1, 128, T]   float32   Mel features (log, normalized)
  length        [1]           int64     Valid T (NeMo convention)

Outputs:
  encoded       [1, 1024, T'] float32   Encoded features
  encoded_length [1]          int64     Valid T'
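The encoder output is channel-major (`[1, 1024, T']`), while the decoder-joint consumes one frame at a time as `[1, 1, 1024]`. A minimal sketch of that reshaping on plain nested lists follows; `encoder_frames` is a hypothetical helper, and a real integration would more likely do this with a strided view in native code.

```python
def encoder_frames(encoded, encoded_length):
    """Yield each valid frame of a [1, C, T'] nested list as a [1, 1, C] nested list.

    Frames at index >= encoded_length are padding and are skipped.
    """
    channels = encoded[0]  # C rows, each T' values long
    for t in range(encoded_length):
        yield [[[row[t] for row in channels]]]
```

For example, `list(encoder_frames(encoded, n))` gives `n` frames that can be passed one by one as `encoder_out`.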

Decoder-joint signature

Inputs:
  encoder_out   [1, 1, 1024]  float32   Current encoder frame
  target        [1, 1]        int64     Last emitted token (blank to start)
  h             [2, 1, 640]   float32   LSTM hidden state
  c             [2, 1, 640]   float32   LSTM cell state

Outputs:
  logits        [1, 1, 1, 1030] float32 Joint output
  h_out         [2, 1, 640]   float32   Next hidden state
  c_out         [2, 1, 640]   float32   Next cell state
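The signature above is enough to sketch a greedy TDT decoding loop. The code below is an illustration under stated assumptions, not the reference decoder: `step` is a hypothetical callable that wraps one decoder-joint invocation and returns a flat logits list plus the new states; the LSTM state is committed only when a non-blank token is emitted, as is usual for transducers; and a blank with duration 0 is forced to advance one frame so the loop always terminates.

```python
def tdt_greedy_decode(frames, step, blank_id=1024, num_durations=5,
                      state_shape=(2, 1, 640), max_symbols_per_frame=10):
    """Greedy TDT decoding. Returns a list of (token_id, frame_index) pairs.

    frames: sequence of per-frame encoder vectors, each passed as encoder_out.
    step:   hypothetical callable (frame, target, h, c) -> (logits, h_out, c_out).
    """
    def zeros(shape):
        if len(shape) == 1:
            return [0.0] * shape[0]
        return [zeros(shape[1:]) for _ in range(shape[0])]

    h, c = zeros(state_shape), zeros(state_shape)  # zero-initialized LSTM state
    last_token = blank_id                          # decoding starts from blank
    tokens = []
    t = 0
    emitted_here = 0                               # tokens emitted on the current frame
    while t < len(frames):
        logits, h_out, c_out = step(frames[t], last_token, h, c)
        token_logits = logits[:-num_durations]
        duration_logits = logits[-num_durations:]
        token = max(range(len(token_logits)), key=token_logits.__getitem__)
        duration = max(range(num_durations), key=duration_logits.__getitem__)
        if token != blank_id:
            tokens.append((token, t))
            last_token, h, c = token, h_out, c_out  # commit state on emission only
            emitted_here += 1
        # Guard against stalling: blank with duration 0, or too many tokens on one frame.
        if duration == 0 and (token == blank_id or emitted_here >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            emitted_here = 0
        t += duration
    return tokens
```

The frame index stored with each token gives a coarse timestamp once the encoder's frame rate is known.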

Audio preprocessing

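The encoder takes mel features, not raw audio, so feature extraction is the caller's responsibility. It expects the NeMo mel pipeline: 16 kHz audio, pre_emphasis=0.97, n_fft=512, win_length=400 (25 ms), hop_length=160 (10 ms), 128 mel bins, then log and per-utterance normalization. Implement this in native code on the caller side and check it against the NeMo reference, since features that deviate from it can degrade accuracy.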
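The first and last stages of that pipeline are simple enough to sketch directly; the STFT and mel filterbank in between are left to an FFT library. This is a simplified illustration, not a bit-exact port of NeMo. It assumes a center-padded STFT for the frame count, and it assumes "per-utterance normalization" means normalizing each mel bin over time with an unbiased standard deviation and a small epsilon. All function names are hypothetical.

```python
import math

PREEMPH = 0.97  # pre_emphasis from the model card
HOP = 160       # hop_length from the model card


def pre_emphasis(samples, coeff=PREEMPH):
    """High-pass the waveform: y[0] = x[0], y[t] = x[t] - coeff * x[t-1]."""
    if not samples:
        return []
    return [samples[0]] + [samples[t] - coeff * samples[t - 1]
                           for t in range(1, len(samples))]


def num_frames(num_samples, hop=HOP):
    """Mel frame count T for a center-padded STFT (assumption, see lead-in)."""
    return 1 + num_samples // hop


def normalize_per_bin(log_mel, eps=1e-5):
    """Normalize each mel bin to zero mean and unit std over the utterance.

    log_mel is a list of mel bins, each a list of T log-mel values.
    Assumption: unbiased std (divide by T - 1) plus eps.
    """
    out = []
    for row in log_mel:
        t = len(row)
        mean = sum(row) / t
        var = sum((v - mean) ** 2 for v in row) / max(t - 1, 1)
        std = math.sqrt(var) + eps
        out.append([(v - mean) / std for v in row])
    return out
```

Under these assumptions, one second of 16 kHz audio gives `num_frames(16000) == 101` mel frames, which is the `T` passed as `length` to the encoder.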

Source

Upstream: nvidia/parakeet-tdt-0.6b-v3 (CC BY 4.0). 25-language multilingual ASR.

Ecosystem

soniqo.audio — use-case explorer (transcription, voice cloning, live ASR, voice agents).
speech-core — C++ orchestration library for voice agents. Abstract STTInterface / TTSInterface / VADInterface / EnhancerInterface; LiteRT implementations plug straight into the interfaces.
speech-swift — Apple Silicon MLX companion runtime (model-specific MLX bundles linked above where applicable).
speech-android — Android SDK consuming on-device LiteRT bundles.

Other LiteRT models in this collection

ASR / Transcription

VAD / Diarization

TTS / Voice Cloning

VoxCPM2 — LiteRT (INT8)

License

This bundle inherits the upstream model license, CC BY 4.0. See the base model repository, nvidia/parakeet-tdt-0.6b-v3, for the full terms.
