---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---
# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port
LiteRT (TFLite) port of
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / Mac / embedded) without a Python
or NeMo runtime dependency.
For **model capabilities, languages, training data, license, and benchmarks**,
see the upstream model card. This card only documents what's specific to the
LiteRT port.
## What's in this bundle
| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | - | All metadata the runtime needs |
Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.
## Encoder I/O contract
```
inputs:
audio_signal : float32 [1, 128, 1500] # log-mel features (NeMo preproc)
length : int32 [1] # actual mel frames used (≤ 1500)
outputs:
encoded : float32 [1, 1024, 188] # 188 = ceil(1500 / 8)
encoded_lengths : int32 [1]
```
Pad shorter inputs with zeros at the **tail** (the encoder was trained with
audio anchored at position 0; left-padding causes hallucinations) and pass
the true length.
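
The tail-padding rule can be sketched as follows. This is a minimal example assuming NumPy and log-mel features already computed as a `[128, T]` array; the helper name `pad_mel_for_encoder` is hypothetical and not part of this bundle.

```python
import numpy as np

N_MELS = 128
T_MEL = 1500  # fixed encoder bucket from the I/O contract


def pad_mel_for_encoder(mel: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Right-pad [128, T] log-mel features to [1, 128, 1500]; also return the true length."""
    if mel.ndim != 2 or mel.shape[0] != N_MELS:
        raise ValueError(f"expected [128, T] features, got {mel.shape}")
    t = mel.shape[1]
    if t > T_MEL:
        raise ValueError(f"{t} mel frames exceed the {T_MEL}-frame bucket; chunk the audio first")
    audio_signal = np.zeros((1, N_MELS, T_MEL), dtype=np.float32)
    audio_signal[0, :, :t] = mel  # audio stays anchored at position 0, zeros fill the tail
    length = np.array([t], dtype=np.int32)  # int32, matching the encoder input
    return audio_signal, length
```

The two returned arrays map to the `audio_signal` and `length` inputs; how they are handed to the interpreter depends on the runtime binding you use.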
The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
encoder in a sliding-window streaming loop; see "Streaming usage" below.
**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
NPU accelerator) reject int64 tensors entirely. With int64 length, every
internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
fails outright on Android with the GPU backend. This bundle is exported with
int32 length end-to-end (input → internal mask arange/comparisons → output
`encoded_lengths`). int32 covers more than 2 billion mel frames (roughly
6,000 hours of audio at 100 frames per second), so there is no practical
range loss.
## Why a single bucket and not multi-signature
An earlier revision shipped a multi-signature encoder with 4 buckets
(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
the LiteRT `CompiledModel.create()` API prepares **every** signature's
subgraph at load time, each one going through the full delegate-partition
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
start was ~28 s.
A single-bucket file is one subgraph: ~7 s init, then ready. If you need
multiple bucket sizes for latency reasons, ship them as separate `.tflite`
files (TFLite has no cross-file weight sharing) and load on demand.
## Decoder + joint contract
```
decoder_step:
inputs: token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]
joint_step:
inputs: enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
outputs: logits float32 [1,1,1,8198]
# logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
# logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```
`decoder_step.token` is `int64` because it's an embedding lookup; that op
runs on CPU regardless of delegate, so int64 there is harmless.
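
A minimal sketch of handling these tensors, assuming NumPy and that the runtime returns the joint output as a `[1,1,1,8198]` array. The helper names `split_joint_logits` and `zero_lstm_state` are hypothetical, and starting the LSTM state from zeros is a common convention rather than something this card specifies.

```python
import numpy as np

VOCAB_SIZE = 8192                   # BPE tokens; blank is index 8192
NUM_TOKEN_LOGITS = VOCAB_SIZE + 1   # 8193 token logits
DURATIONS = (0, 1, 2, 3, 4)         # 5 duration logits follow the token logits


def split_joint_logits(logits: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split the joint output into (token_logits[8193], duration_logits[5])."""
    flat = np.asarray(logits).reshape(-1)
    expected = NUM_TOKEN_LOGITS + len(DURATIONS)
    if flat.shape[0] != expected:
        raise ValueError(f"expected {expected} logits, got {flat.shape[0]}")
    return flat[:NUM_TOKEN_LOGITS], flat[NUM_TOKEN_LOGITS:]


def zero_lstm_state() -> tuple[np.ndarray, np.ndarray]:
    """Zero-filled (h, c) with the decoder_step shapes [2, 1, 640] (assumed initial state)."""
    h = np.zeros((2, 1, 640), dtype=np.float32)
    c = np.zeros((2, 1, 640), dtype=np.float32)
    return h, c
```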
Greedy TDT decoding (per encoder frame):
1. Run joint with current `enc_frame` and last predicted `pred_frame`.
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
re-prime decoder with the emitted token (h, c update).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until `enc_lengths` is exhausted.
Cap at ~10 non-blank emissions per encoder frame to guard against the
pathological `dur=0` decode loop.
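
The steps above can be sketched as a plain Python loop. This is an illustration under stated assumptions, not the reference decoder: `joint` and `decoder` are hypothetical callables wrapping `joint_step.tflite` and `decoder_step.tflite`, `joint(t, pred)` is assumed to return a flat sequence of 8198 logits for encoder frame `t`, and the caller supplies the initial `pred` and `state` (this card does not specify the start token).

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_ID = 8192
NUM_TOKEN_LOGITS = 8193
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10  # guard against an endless dur=0 loop


def _argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def greedy_tdt_decode(
    num_frames: int,
    joint: Callable[[int, Any], Sequence[float]],
    decoder: Callable[[int, Any], Tuple[Any, Any]],
    init_pred: Any,
    init_state: Any,
) -> List[int]:
    """Greedy TDT decoding over `num_frames` valid encoder frames."""
    tokens: List[int] = []
    pred, state = init_pred, init_state
    t = 0
    emitted_here = 0  # non-blank emissions at the current frame
    while t < num_frames:
        logits = joint(t, pred)
        token = _argmax(logits[:NUM_TOKEN_LOGITS])
        dur = DURATIONS[_argmax(logits[NUM_TOKEN_LOGITS:])]
        if token != BLANK_ID:
            tokens.append(token)
            pred, state = decoder(token, state)  # re-prime with the emitted token
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1  # cap reached: force the frame pointer forward
        else:
            dur = max(dur, 1)  # blank never stalls; decoder is not advanced
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

Pass `encoded_lengths[0]` as `num_frames` so that padded encoder frames are never decoded.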
## Audio preprocessing
LiteRT itself does not produce mel features; your runtime must compute
them. Match NeMo's preprocessor exactly:
```
sample_rate : 16000 Hz (resample if needed)
n_fft : 512
hop_length : 160 → 100 mel frames / second
win_length : 400
n_mels : 128
preemph : 0.97
log : log(mel + 1e-5), per-feature normalized
mel_scale : slaney
```
Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
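
The frame-rate arithmetic can be checked with a few lines of Python, which is also handy for turning encoder frame indices into timestamps. The helper names are hypothetical, and `approx_encoder_frames` assumes ceiling division by 8 (consistent with 1500 mel frames giving 188 encoder frames); the authoritative count is the `encoded_lengths` output.

```python
SAMPLE_RATE = 16_000  # Hz
HOP_LENGTH = 160      # samples per mel frame
SUBSAMPLING = 8       # encoder time reduction

MEL_FPS = SAMPLE_RATE / HOP_LENGTH                      # mel frames per second
ENCODER_FPS = SAMPLE_RATE / (HOP_LENGTH * SUBSAMPLING)  # encoder frames per second
ENCODER_FRAME_MS = 1000 * HOP_LENGTH * SUBSAMPLING / SAMPLE_RATE


def encoder_frame_to_seconds(frame_index: int) -> float:
    """Start time, in seconds, of an encoder frame within its window."""
    return frame_index * SUBSAMPLING * HOP_LENGTH / SAMPLE_RATE


def approx_encoder_frames(mel_frames: int) -> int:
    """Assumed ceil(mel_frames / 8); prefer `encoded_lengths` at runtime."""
    return -(-mel_frames // SUBSAMPLING)
```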
## Streaming usage
This bundle supports chunked streaming inference using a left+chunk+right
context window that fits inside 15 s. A reference Python implementation is
in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
for Android UX:
| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | encoder bilateral context |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |
We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
code-switching) with this config, ~22 % on clean offline ≤15 s English.
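
The window arithmetic behind the recommended config can be sketched like this. It is not the reference `transcribe_litert_streaming.py`; the `Window` type and `plan_windows` function are hypothetical names, and the sketch only plans which mel frames go into each encoder call and which part of each window is committed.

```python
from dataclasses import dataclass
from typing import List

MEL_FPS = 100   # mel frames per second (hop_length 160 at 16 kHz)
BUCKET = 1500   # fixed encoder input length in mel frames


@dataclass(frozen=True)
class Window:
    start: int         # first mel frame fed to the encoder
    end: int           # one past the last mel frame fed
    commit_start: int  # first mel frame whose tokens are kept
    commit_end: int    # one past the last mel frame whose tokens are kept


def plan_windows(total_mel: int, chunk_s: float = 5.0,
                 left_s: float = 5.0, right_s: float = 2.0) -> List[Window]:
    """Plan left+chunk+right windows over `total_mel` mel frames."""
    chunk, left, right = (int(round(s * MEL_FPS)) for s in (chunk_s, left_s, right_s))
    if chunk <= 0:
        raise ValueError("chunk_s must be positive")
    if left + chunk + right > BUCKET:
        raise ValueError("left + chunk + right must fit inside the 1500-frame bucket")
    windows = []
    for c0 in range(0, total_mel, chunk):
        c1 = min(c0 + chunk, total_mel)
        # Context is clipped at the start and end of the recording.
        windows.append(Window(max(0, c0 - left), min(total_mel, c1 + right), c0, c1))
    return windows
```

Each window's frames `[start, end)` are copied to position 0 of the encoder input and tail-padded; only tokens falling inside `[commit_start, commit_end)` would be emitted for that step.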
## Quantization
- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
set.
## Conversion provenance
Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via
[`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
Several NeMo internals required export-time monkey-patches:
- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask`: to
remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask`: to
build masks in `bool` instead of `uint8` (litert-torch has no uint8
lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and
`MaskedConvSequential.{forward,_create_mask}`: to keep the entire length
pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
GPU/NPU delegates can compile the graph without falling back to CPU.
## Limitations
1. **Audio at position 0.** The encoder expects audio anchored at the start
of its input window. Padding before the audio causes hallucinations.
2. **15 s max per call.** Use the streaming chunker for longer clips.
3. **No VAD or diarization.** Pair with an external VAD or a diarizer
(e.g. Sortformer) for speaker-attributed transcripts.
4. **Multilingual but no language token.** Code-switching works, but the
model doesn't emit a language ID. Run a separate classifier if you need it.
## License
Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
## Citation
```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
title = {Parakeet-TDT-0.6B-v3},
author = {NVIDIA},
year = {2025},
url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```