---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---

# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port

LiteRT (TFLite) port of
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / Mac / embedded) without a Python
or NeMo runtime dependency.

For **model capabilities, languages, training data, license, and benchmarks**,
see the upstream model card. This card only documents what's specific to the
LiteRT port.

## What's in this bundle

| File | Size | Purpose |
|---|---|---|
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | - | All metadata the runtime needs |

Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.

## Encoder I/O contract

```
inputs:
  audio_signal : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
  length       : int32   [1]               # actual mel frames used (≤ 1500)
outputs:
  encoded         : float32 [1, 1024, 188]  # 188 = ceil(1500 / 8)
  encoded_lengths : int32   [1]
```

Pad shorter inputs with zeros at the **tail** (the encoder was trained with
audio anchored at position 0; left-padding causes hallucinations) and pass
the true length.
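
The tail-padding rule can be illustrated with a minimal sketch. It uses plain Python lists so it runs without dependencies; a real runtime would write into a preallocated tensor buffer instead. The helper name `pad_mel_tail` is hypothetical and not part of this bundle.

```python
from typing import List, Tuple

N_MELS = 128   # mel bins expected by the encoder
T_MEL = 1500   # fixed time dimension of the exported encoder


def pad_mel_tail(mel: List[List[float]]) -> Tuple[List[List[float]], int]:
    """Right-pad a [128][T] log-mel matrix with zeros to [128][1500].

    Returns the padded matrix and the true frame count to pass as `length`.
    """
    if len(mel) != N_MELS:
        raise ValueError(f"expected {N_MELS} mel bins, got {len(mel)}")
    true_len = len(mel[0])
    if any(len(row) != true_len for row in mel):
        raise ValueError("all mel rows must have the same number of frames")
    if true_len > T_MEL:
        raise ValueError(f"{true_len} frames exceed the {T_MEL}-frame window")
    # Audio stays anchored at index 0; zeros are appended only at the tail.
    padded = [list(row) + [0.0] * (T_MEL - true_len) for row in mel]
    return padded, true_len
```

The returned `true_len` is the value for the `length` input; the padded matrix still needs a leading batch dimension of 1 before it is fed to the encoder.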

The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
encoder in a sliding-window streaming loop (see "Streaming usage" below).

**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
NPU accelerator) reject int64 tensors entirely. With int64 length, every
internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
fails outright on Android with the GPU backend. This bundle is exported with
int32 length end-to-end (input → internal mask arange/comparisons → output
`encoded_lengths`). int32 covers more than 2 billion mel frames (nearly
6,000 hours of audio at 100 frames per second), so there is no practical
range loss.

## Why a single bucket and not multi-signature

An earlier revision shipped a multi-signature encoder with 4 buckets
(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
the LiteRT `CompiledModel.create()` API prepares **every** signature's
subgraph at load time, each one going through the full delegate-partition
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
start was ~28 s.

A single-bucket file is one subgraph: ~7 s init, then ready. If you need
multiple bucket sizes for latency reasons, ship them as separate `.tflite`
files (TFLite has no cross-file weight sharing) and load on demand.
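
A minimal sketch of the load-on-demand idea, assuming one file per bucket. Only `encoder_T1500.tflite` ships in this bundle; the other file names, the `BucketedEncoders` class, and the `load` callable (standing in for whatever creates a compiled model in your runtime) are hypothetical.

```python
from typing import Any, Callable, Dict, Sequence


class BucketedEncoders:
    """Lazily loads one encoder file per bucket size."""

    def __init__(self, buckets: Sequence[int], load: Callable[[str], Any]):
        self._buckets = sorted(buckets)
        self._load = load                 # hypothetical model loader
        self._cache: Dict[int, Any] = {}

    def bucket_for(self, mel_frames: int) -> int:
        """Return the smallest bucket that fits `mel_frames`."""
        for size in self._buckets:
            if mel_frames <= size:
                return size
        raise ValueError(f"{mel_frames} mel frames exceed the largest bucket")

    def get(self, mel_frames: int) -> Any:
        """Return the encoder for `mel_frames`, loading it on first use."""
        size = self.bucket_for(mel_frames)
        if size not in self._cache:
            # File naming follows the bundle's encoder_T1500.tflite pattern.
            self._cache[size] = self._load(f"encoder_T{size}.tflite")
        return self._cache[size]
```

With this layout the delegate-partition cost is paid once per bucket, the first time that bucket is needed, rather than for every bucket at cold start.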

## Decoder + joint contract

```
decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
           # logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
           # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```

`decoder_step.token` is `int64` because it's an embedding lookup; that op
runs on CPU regardless of delegate, so int64 there is harmless.

Greedy TDT decoding (per encoder frame):

1. Run joint with current `enc_frame` and last predicted `pred_frame`.
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
   re-prime decoder with the emitted token (h, c update).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until `encoded_lengths` is exhausted.

Cap at ~10 non-blank emissions per encoder frame to guard against the
pathological `dur=0` decode loop.
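
The steps above can be sketched in Python as follows. This is a minimal sketch, not the reference implementation: `joint_fn` and `decoder_fn` are hypothetical callables wrapping `joint_step.tflite` and `decoder_step.tflite`, and priming the decoder with the blank ID as a start symbol is an assumption.

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_ID = 8192
NUM_TOKEN_LOGITS = 8193           # 8192 BPE tokens + 1 blank
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10        # guard against an endless dur=0 loop


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def greedy_tdt_decode(
    num_frames: int,
    joint_fn: Callable[[int, Any], Sequence[float]],
    decoder_fn: Callable[[int, Any], Tuple[Any, Any]],
    initial_state: Any,
) -> List[int]:
    """Greedy TDT decoding over `num_frames` valid encoder frames.

    joint_fn(t, pred)        -> flat sequence of 8198 logits for frame t
    decoder_fn(token, state) -> (pred, new_state)
    """
    tokens: List[int] = []
    # Assumption: the decoder is primed with the blank ID as start symbol.
    pred, state = decoder_fn(BLANK_ID, initial_state)
    t = 0
    emitted_here = 0
    while t < num_frames:
        logits = joint_fn(t, pred)
        token = argmax(logits[:NUM_TOKEN_LOGITS])
        dur = DURATIONS[argmax(logits[NUM_TOKEN_LOGITS:])]
        if token != BLANK_ID:
            tokens.append(token)
            pred, state = decoder_fn(token, state)   # update h, c
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1                              # force progress
        else:
            dur = max(dur, 1)                        # blank always advances
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

The returned token IDs would then be passed to the SentencePiece tokenizer for detokenization.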

## Audio preprocessing

LiteRT itself does not produce mel features; your runtime must compute
them. Match NeMo's preprocessor exactly:

```
sample_rate    : 16000 Hz (resample if needed)
n_fft          : 512
hop_length     : 160      → 100 mel frames / second
win_length     : 400
n_mels         : 128
preemph        : 0.97
log            : log(mel + 1e-5), per-feature normalized
mel_scale      : slaney
```

Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
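
The frame-rate arithmetic follows directly from the settings above and can be checked with a short sketch. The helper `encoder_frame_to_seconds` is a hypothetical name, useful for turning an encoder frame index into an approximate timestamp.

```python
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples between consecutive mel frames
SUBSAMPLING = 8        # mel frames per encoder frame
T_MEL = 1500           # mel frames in the fixed encoder window

MEL_FPS = SAMPLE_RATE / HOP_LENGTH       # mel frames per second
ENC_FPS = MEL_FPS / SUBSAMPLING          # encoder frames per second
MAX_WINDOW_SECONDS = T_MEL / MEL_FPS     # audio covered by one encoder call


def encoder_frame_to_seconds(index: int, window_offset_seconds: float = 0.0) -> float:
    """Start time of encoder frame `index`, offset by the window start."""
    samples = index * SUBSAMPLING * HOP_LENGTH
    return window_offset_seconds + samples / SAMPLE_RATE
```

Here `MEL_FPS` evaluates to 100.0, `ENC_FPS` to 12.5, and `MAX_WINDOW_SECONDS` to 15.0, matching the figures quoted in this card.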

## Streaming usage

This bundle supports chunked streaming inference using a left+chunk+right
context window that fits inside 15 s. A reference Python implementation is
in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
for Android UX:

| Knob | Value | Reason |
|---|---|---|
| `chunk_seconds` | 5 | committed per step |
| `left_context_seconds` | 5 | encoder bilateral context |
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |

We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
code-switching) with this config, and ~22 % on clean offline ≤ 15 s English.
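
A minimal sketch of how the left + chunk + right window layout from the table could be planned, in mel-frame units. `Window`, `plan_windows`, and `keep_token` are hypothetical names, and the reference script may organize this differently. The sketch assumes encoder frame `i` of a window starts at mel frame `start_mel + 8 * i`.

```python
from dataclasses import dataclass
from typing import Iterator

MEL_FPS = 100      # mel frames per second
T_MEL = 1500       # encoder window size in mel frames
SUBSAMPLING = 8    # mel frames per encoder frame


@dataclass(frozen=True)
class Window:
    start_mel: int          # first mel frame fed to the encoder
    end_mel: int            # one past the last mel frame fed
    commit_start_mel: int   # tokens starting in [commit_start, commit_end)
    commit_end_mel: int     # are kept; the rest is context only


def plan_windows(total_mel: int, chunk_s: float = 5.0,
                 left_s: float = 5.0, right_s: float = 2.0) -> Iterator[Window]:
    chunk, left, right = (round(s * MEL_FPS) for s in (chunk_s, left_s, right_s))
    if chunk <= 0 or left + chunk + right > T_MEL:
        raise ValueError("window must be non-empty and fit in the encoder bucket")
    for chunk_start in range(0, total_mel, chunk):
        chunk_end = min(chunk_start + chunk, total_mel)
        start = max(0, chunk_start - left)          # left context, clipped at 0
        end = min(total_mel, chunk_end + right)     # right context, clipped at end
        yield Window(start, end, chunk_start, chunk_end)


def keep_token(window: Window, enc_index: int) -> bool:
    """True if a token emitted at encoder frame `enc_index` should be committed."""
    mel_pos = window.start_mel + enc_index * SUBSAMPLING
    return window.commit_start_mel <= mel_pos < window.commit_end_mel
```

Filtering by absolute mel position rather than by encoder frame index avoids double-committing tokens at chunk boundaries, since a 500-frame chunk is not a whole number of encoder frames. Each window's mel slice is still tail-padded to 1500 frames before the encoder call.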

## Quantization

- All `.tflite` weights are FP16. Activations remain FP32.
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
  set.

## Conversion provenance

Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:

1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via
   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
   FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

Several NeMo internals required export-time monkey-patches:

- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask`: to
  remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask`: to
  build masks in `bool` instead of `uint8` (litert-torch has no uint8
  lowering).
- `ConformerEncoder.{forward_internal,_create_masks}` and
  `MaskedConvSequential.{forward,_create_mask}`: to keep the entire length
  pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
  GPU/NPU delegates can compile the graph without falling back to CPU.

## Limitations

1. **Audio at position 0.** The encoder expects audio anchored at the start
   of its input window. Padding before the audio causes hallucinations.
2. **15 s max per call.** Use the streaming chunker for longer clips.
3. **No VAD or diarization.** Pair with an external VAD or a diarizer
   (e.g. Sortformer) for speaker-attributed transcripts.
4. **Multilingual but no language token.** Code-switching works, but the
   model doesn't emit a language ID. Run a separate classifier if you need it.

## License

Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).

## Citation

```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```