---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---

# Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port

LiteRT (TFLite) port of
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / Mac / embedded) without a Python
or NeMo runtime dependency.

For **model capabilities, languages, training data, license, and benchmarks**,
see the upstream model card. This card only documents what's specific to the
LiteRT port.

## What's in this bundle

| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | — | All metadata the runtime needs |

Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.

## Encoder I/O contract

```
inputs:
  audio_signal : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length       : int32   [1]               # actual mel frames used (≤ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]  # 188 = ceil(1500 / 8)
  encoded_lengths : int32   [1]
```

Pad shorter inputs with zeros at the **tail** (the encoder was trained with
audio anchored at position 0; left-padding causes hallucinations) and pass
the true length.
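
The tail-padding rule can be sketched as below. This is a minimal illustration, not part of the bundle: `pad_mel_tail` is a hypothetical helper name, and the features are assumed to arrive as a plain `[128][T]` list of lists.

```python
from typing import List, Tuple

T_MEL = 1500   # fixed encoder window, from the I/O contract above
N_MELS = 128


def pad_mel_tail(
    mel: List[List[float]], t_mel: int = T_MEL
) -> Tuple[List[List[float]], int]:
    """Right-pad a [n_mels][T] feature matrix with zeros up to t_mel frames.

    Returns the padded matrix and the true frame count to pass as `length`.
    """
    if len(mel) != N_MELS:
        raise ValueError(f"expected {N_MELS} mel bins, got {len(mel)}")
    true_len = len(mel[0])
    if any(len(row) != true_len for row in mel):
        raise ValueError("all mel rows must have the same number of frames")
    if true_len > t_mel:
        raise ValueError(f"{true_len} frames exceed the {t_mel}-frame window")
    pad = [0.0] * (t_mel - true_len)
    # Zeros go at the tail only, so the audio stays anchored at position 0.
    padded = [list(row) + pad for row in mel]
    return padded, true_len
```

The returned `true_len` is what goes into the `length` input; the padded matrix maps to `audio_signal` after adding the batch dimension.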

The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
encoder in a sliding-window streaming loop — see "Streaming usage" below.

**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
NPU accelerator) reject int64 tensors entirely. With int64 length, every
internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
fails outright on Android with the GPU backend. This bundle is exported with
int32 length end-to-end (input → internal mask arange/comparisons → output
`encoded_lengths`). int32 covers > 2 billion mel frames (about 5,965 hours
of audio at 100 mel frames per second), so no practical range loss.

## Why a single bucket and not multi-signature

An earlier revision shipped a multi-signature encoder with 4 buckets
(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
the LiteRT `CompiledModel.create()` API prepares **every** signature's
subgraph at load time — each one going through the full delegate-partition
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
start was ~28 s.

A single-bucket file is one subgraph: ~7 s init, then ready. If you need
multiple bucket sizes for latency reasons, ship them as separate `.tflite`
files (TFLite has no cross-file weight sharing) and load on demand.

## Decoder + joint contract

```
decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
           # logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
           # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```

`decoder_step.token` is `int64` because it's an embedding lookup; that op
runs on CPU regardless of delegate, so int64 there is harmless.

Greedy TDT decoding (per encoder frame):

1. Run joint with current `enc_frame` and last predicted `pred_frame`.
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
   re-prime decoder with the emitted token (h, c update).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until all `encoded_lengths` encoder frames are consumed.

Cap at ~10 non-blank emissions per encoder frame to guard against the
pathological `dur=0` decode loop.
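
The loop above can be sketched in plain Python. This is an illustration under stated assumptions, not the reference implementation: `joint_fn` and `decoder_fn` are hypothetical callables wrapping `joint_step.tflite` and `decoder_step.tflite`, the decoder is assumed to be primed with the blank id as its start symbol, and forcing a one-frame advance when the emission cap is hit is one possible way to apply the guard.

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_ID = 8192                 # from the joint contract above
NUM_TOKEN_LOGITS = 8193         # 8192 BPE pieces + 1 blank
DURATIONS = (0, 1, 2, 3, 4)
MAX_EMISSIONS_PER_FRAME = 10    # guard against an endless dur=0 loop


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def greedy_tdt_decode(
    enc_frames: Sequence[Any],
    enc_len: int,
    joint_fn: Callable[[Any, Any], Sequence[float]],
    decoder_fn: Callable[[int, Any], Tuple[Any, Any]],
    init_state: Any,
) -> List[int]:
    """Greedy TDT decoding over `enc_len` encoder frames.

    joint_fn(enc_frame, pred_frame) -> flat sequence of 8198 logits
    decoder_fn(token, state)        -> (pred_frame, new_state)
    """
    tokens: List[int] = []
    # Assumption: prime the prediction network with the blank id.
    pred_frame, state = decoder_fn(BLANK_ID, init_state)
    t = 0
    emitted_here = 0
    while t < enc_len:
        logits = joint_fn(enc_frames[t], pred_frame)
        token = argmax(logits[:NUM_TOKEN_LOGITS])
        dur = DURATIONS[argmax(logits[NUM_TOKEN_LOGITS:])]
        if token != BLANK_ID:
            tokens.append(token)
            pred_frame, state = decoder_fn(token, state)  # h, c update
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_EMISSIONS_PER_FRAME:
                dur = 1  # assumption: force progress once the cap is hit
        else:
            dur = max(dur, 1)  # blank never stalls; decoder is not advanced
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

Passing the two networks in as callables keeps the loop independent of the runtime binding (Python interpreter, Kotlin `CompiledModel`, and so on), so the same control flow can be ported as-is.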

## Audio preprocessing

LiteRT itself does not produce mel features — your runtime must compute
them. Match NeMo's preprocessor exactly:

```
sample_rate    : 16000 Hz (resample if needed)
n_fft          : 512
hop_length     : 160      → 100 mel frames / second
win_length     : 400
n_mels         : 128
preemph        : 0.97
log            : log(mel + 1e-5), per-feature normalized
mel_scale      : slaney
```
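
Two of the steps listed above, pre-emphasis and per-feature normalization, are easy to get subtly wrong, so here is a minimal pure-Python sketch of both. The function names are hypothetical. The details of the normalization (unbiased variance, a small epsilon added to the standard deviation) are assumptions about NeMo's behaviour and should be checked against the upstream preprocessor before relying on them.

```python
import math
from typing import List


def preemphasize(samples: List[float], coeff: float = 0.97) -> List[float]:
    """y[0] = x[0]; y[n] = x[n] - coeff * x[n-1] for n >= 1."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def normalize_per_feature(
    log_mel: List[List[float]], eps: float = 1e-5
) -> List[List[float]]:
    """Normalize each mel bin to zero mean and unit variance over time.

    Only the valid (unpadded) frames should be passed in.
    """
    out: List[List[float]] = []
    for row in log_mel:
        n = len(row)
        if n == 0:
            out.append([])
            continue
        mean = sum(row) / n
        # Assumption: unbiased (n - 1) variance estimate.
        var = sum((v - mean) ** 2 for v in row) / max(n - 1, 1)
        std = math.sqrt(var) + eps
        out.append([(v - mean) / std for v in row])
    return out
```

Normalization statistics should come from the real frames only; computing them over the zero-padded tail would shift the mean and variance.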

Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
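
The frame arithmetic can be checked with a few lines of Python. The length rule below assumes the subsampler is three stride-2 convolutions with kernel 3 and padding 1; that assumption reproduces the 188-frame output for a 1500-frame input, but it is not taken from the bundle's manifest. Both function names are hypothetical.

```python
MEL_FPS = 100      # 16000 Hz / hop_length 160
SUBSAMPLING = 8    # encoder subsampling factor


def enc_frames_for(mel_frames: int) -> int:
    """Encoder frames produced for a given number of mel frames.

    Assumption: three stride-2 convolutions, kernel 3, padding 1.
    """
    n = mel_frames
    for _ in range(3):
        n = (n - 1) // 2 + 1
    return n


def enc_frame_start_seconds(index: int) -> float:
    """Start time of encoder frame `index`, at 80 ms per frame."""
    return index * SUBSAMPLING / MEL_FPS
```

For example, `enc_frames_for(1500)` gives 188 and `enc_frame_start_seconds(25)` gives 2.0, which is handy when turning token emission indices into timestamps.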

## Streaming usage

This bundle supports chunked streaming inference using a left+chunk+right
context window that fits inside 15 s. A reference Python implementation is
in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
for Android UX:

| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | past context for the encoder's bidirectional attention |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |

With this config we measured ~27 % WER on multilingual long-form audio
(EN/ES/IT code-switching), compared with ~22 % for offline decoding of clean
English clips of ≤ 15 s.
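
The window arithmetic from the table can be sketched as a small planner that works in mel frames. This is an illustration only and is not taken from `transcribe_litert_streaming.py`; `Window` and `plan_windows` are hypothetical names, and the way each window is split into a committed region plus context is an assumption.

```python
from typing import List, NamedTuple

MEL_FPS = 100   # mel frames per second
T_MEL = 1500    # encoder window, in mel frames


class Window(NamedTuple):
    start: int         # first mel frame fed to the encoder
    end: int           # one past the last mel frame fed to the encoder
    commit_start: int  # first mel frame whose tokens are kept
    commit_end: int    # one past the last mel frame whose tokens are kept


def plan_windows(
    total_frames: int,
    chunk_s: float = 5.0,
    left_s: float = 5.0,
    right_s: float = 2.0,
) -> List[Window]:
    """Split `total_frames` mel frames into left+chunk+right windows."""
    chunk = int(round(chunk_s * MEL_FPS))
    left = int(round(left_s * MEL_FPS))
    right = int(round(right_s * MEL_FPS))
    if chunk <= 0:
        raise ValueError("chunk_s must be positive")
    if left + chunk + right > T_MEL:
        raise ValueError("window does not fit in the 1500-frame encoder input")
    windows: List[Window] = []
    pos = 0
    while pos < total_frames:
        commit_end = min(pos + chunk, total_frames)
        start = max(pos - left, 0)                    # clamp at audio start
        end = min(commit_end + right, total_frames)   # clamp at audio end
        windows.append(Window(start, end, pos, commit_end))
        pos = commit_end
    return windows
```

Each window's frames `[start, end)` are copied to position 0 of the encoder input and zero-padded at the tail; only tokens whose timestamps fall inside `[commit_start, commit_end)` would be kept. With the defaults no window exceeds 1200 frames, which matches the table.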

## Quantization

- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
  set.

## Conversion provenance

Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:

1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via
   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
   FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

Several NeMo internals required export-time monkey-patches:

- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask` — to
  remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask` — to
  build masks in `bool` instead of `uint8` (litert-torch has no uint8
  lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and
  `MaskedConvSequential.{forward,_create_mask}` — to keep the entire length
  pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
  GPU/NPU delegates can compile the graph without falling back to CPU.

## Limitations

1. **Audio at position 0.** The encoder expects audio anchored at the start
   of its input window. Padding before the audio causes hallucinations.
2. **15 s max per call.** Use the streaming chunker for longer clips.
3. **No VAD or diarization.** Pair with an external VAD or a diarizer
   (e.g. Sortformer) for speaker-attributed transcripts.
4. **Multilingual but no language token.** Code-switching works, but the
   model doesn't emit a language ID. Run a separate classifier if you need it.

## License

Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).

## Citation

```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```