spybyscript committed
Commit 2a68489 · verified · 1 Parent(s): 7ea9c47

Upload LiteRT FP16 bundle

Files changed (3)
  1. README.md +80 -65
  2. encoder_T1500.tflite +3 -0
  3. manifest.json +69 -78
README.md CHANGED
@@ -26,10 +26,10 @@ pipeline_tag: automatic-speech-recognition
26
 
27
  # Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
28
 
29
- This is a [LiteRT](https://ai.google.dev/edge/litert) (TFLite) port of
30
  [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
31
- packaged for on-device inference (Android / Mac / embedded) without a Python or
32
- NeMo runtime dependency.
33
 
34
  For **model capabilities, languages, training data, license, and benchmarks**,
35
  see the upstream model card. This card only documents what's specific to the
@@ -39,46 +39,53 @@ LiteRT port.
39
 
40
  | File | Size | Purpose |
41
  |---|---|---|
42
- | `encoder_multisig.tflite` | 1.19 GB | FP16 weight-shared encoder, 4 bucket signatures |
43
  | `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
44
  | `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
45
  | `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
46
  | `manifest.json` | — | All metadata the runtime needs |
47
 
48
- Total: **~1.2 GB** (FP16). FP32 reference is roughly 2.4 GB.
49
 
50
- ## Encoder signatures (multi-bucket)
51
-
52
- Weights are shared across 4 fixed-T input shapes via TFLite signatures:
53
-
54
- | Signature | T_mel | Audio | Use |
55
- |---|---|---|---|
56
- | `forward_T300` | 300 | 3.0 s | short utterances, low latency |
57
- | `forward_T500` | 500 | 5.0 s | typical streaming chunks |
58
- | `forward_T700` | 700 | 7.0 s | medium utterances |
59
- | `forward_T1500` | 1500 | 15.0 s | long utterances, offline |
60
-
61
- Each signature has the same I/O shape contract:
62
 
63
  ```
64
  inputs:
65
- audio_signal : float32 [1, 128, T_mel] # log-mel features (NeMo preproc)
66
- length : int32 [1] # actual mel frames used (≤ T_mel)
67
  outputs:
68
- encoded : float32 [1, 1024, T_enc] # T_enc = (T_mel - 4) // 8
69
  encoded_lengths : int32 [1]
70
  ```
71
 
72
- Pick the smallest bucket that fits your input; pad shorter inputs with zeros
73
- and pass the true length.
 
 
 
 
74
 
75
  **Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
76
  NPU accelerator) reject int64 tensors entirely. With int64 length, every
77
- internal CAST node touching it falls back to CPU and `CompiledModel.create()`
78
  fails outright on Android with the GPU backend. This bundle is exported with
79
  int32 length end-to-end (input → internal mask arange/comparisons → output
80
- `encoded_lengths`). int32 covers >2 billion mel frames (~6,000 hours), so no
81
- practical range loss.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
 
83
  ## Decoder + joint contract
84
 
@@ -94,19 +101,25 @@ joint_step:
94
  # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
95
  ```
96
 
 
 
 
97
  Greedy TDT decoding (per encoder frame):
98
 
99
  1. Run joint with current `enc_frame` and last predicted `pred_frame`.
100
- 2. `token = argmax(token_logits)`; `dur = argmax(duration_logits) ∈ {0,1,2,3,4}`.
101
  3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
102
  re-prime decoder with the emitted token (h, c update).
103
  4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
104
  5. Repeat until `enc_lengths` is exhausted.
105
 
 
 
 
106
  ## Audio preprocessing
107
 
108
- LiteRT itself does not produce mel features — your runtime must compute them.
109
- Match NeMo's preprocessor exactly:
110
 
111
  ```
112
  sample_rate : 16000 Hz (resample if needed)
@@ -115,36 +128,35 @@ hop_length : 160 → 100 mel frames / second
115
  win_length : 400
116
  n_mels : 128
117
  preemph : 0.97
118
- log : log10(mel + 1e-5) per-feature normalized
 
119
  ```
120
 
121
  Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
122
 
123
  ## Streaming usage
124
 
125
- This bundle supports chunked streaming inference. A reference Python
126
- implementation is provided in the upload repo (`transcribe_litert_streaming.py`),
127
- which produces ~27% WER on multilingual long-form audio at ~2× real-time on CPU
128
- with `chunk=5s, left=5s, right=2s` (12 s window, bucket `forward_T1500`).
129
-
130
- For Android, port the chunker by:
131
 
132
- 1. Hold a rolling mel buffer (left context + new chunk + right look-ahead).
133
- 2. Pick the smallest bucket ≥ window length, pad to bucket T_mel.
134
- 3. Run encoder signature, then TDT greedy decode over `T_enc` frames.
135
- 4. Dedup tokens against the previous chunk's emit window using their
136
- `encoder_frame_idx`. Reuse the LSTM `(h, c)` state across chunks (optional).
 
 
137
 
138
- The model is **not** a strict left-only streamer — it sees right context within
139
- each chunk window. For "real" low-latency streaming, the right-context
140
- look-ahead can be reduced or removed at a quality cost.
141
 
142
  ## Quantization
143
 
144
- - All `.tflite` weights are FP16. Activations remain FP32 (no activation
145
- calibration).
146
- - Round-trip parity with the upstream FP32 model: bit-identical token output on
147
- a 99-clip English eval set (validated with the offline runner).
148
 
149
  ## Conversion provenance
150
 
@@ -152,35 +164,38 @@ Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
152
 
153
  1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
154
  2. **ExportedProgram → TFLite** via
155
- [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0
156
- (`signature(...).add_signature(...).convert()`).
157
  3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
158
  FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
159
 
160
- The encoder graph is exported once with a dynamic time dim, then specialized
161
- into 4 fixed-T signatures sharing weights. The TFLite serializer dedups the
162
- weight tensors, so the bundle is the size of one encoder, not four.
 
 
 
 
 
 
 
 
163
 
164
- ## Limitations & caveats
165
 
166
- - **Bucket positional encoding.** The encoder was trained with audio anchored
167
- at position 0 of its input window. Padding *before* the audio causes
168
- hallucinations. Always place audio at the start of the bucket buffer and
169
- zero-pad the tail.
170
- - **Long-form clips.** A single bucket call covers at most 15 s. Anything
171
- longer must be chunked at the runtime level.
172
- - **No voice activity detection / diarization.** Pair with a separate VAD or
173
- diarizer (e.g. Sortformer, pyannote) for speaker-attributed transcripts.
174
 
175
  ## License
176
 
177
- Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0). See
178
- the upstream model card for full terms.
179
 
180
  ## Citation
181
 
182
- If you use this bundle, cite the upstream NeMo model:
183
-
184
  ```bibtex
185
  @misc{nvidia_parakeet_tdt_0_6b_v3,
186
  title = {Parakeet-TDT-0.6B-v3},
 
26
 
27
  # Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
28
 
29
+ LiteRT (TFLite) port of
30
  [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
31
+ packaged for on-device inference (Android / Mac / embedded) without a Python
32
+ or NeMo runtime dependency.
33
 
34
  For **model capabilities, languages, training data, license, and benchmarks**,
35
  see the upstream model card. This card only documents what's specific to the
 
39
 
40
  | File | Size | Purpose |
41
  |---|---|---|
42
+ | `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
43
  | `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
44
  | `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
45
  | `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
46
  | `manifest.json` | — | All metadata the runtime needs |
47
 
48
+ Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.
49
 
50
+ ## Encoder I/O contract
 
 
 
 
 
 
 
 
 
 
 
51
 
52
  ```
53
  inputs:
54
+ audio_signal : float32 [1, 128, 1500] # log-mel features (NeMo preproc)
55
+ length : int32 [1] # actual mel frames used (≤ 1500)
56
  outputs:
57
+ encoded : float32 [1, 1024, 188] # 188 = ceil(1500 / 8), 8× subsampling
58
  encoded_lengths : int32 [1]
59
  ```
60
 
61
+ Pad shorter inputs with zeros at the **tail** (the encoder was trained with
62
+ audio anchored at position 0; left-padding causes hallucinations) and pass
63
+ the true length.
64
+
65
+ The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
66
+ encoder in a sliding-window streaming loop — see "Streaming usage" below.
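The tail-padding rule above can be sketched in a few lines. This is a minimal illustration, not the bundle's own runtime code: `pad_to_bucket` is a hypothetical helper name, and the features are plain nested lists (`[n_mels][frames]`) so the example runs without any array library.

```python
T_MEL = 1500   # fixed bucket size from the I/O contract above
N_MELS = 128   # mel bins expected by the encoder


def pad_to_bucket(mel, t_mel=T_MEL):
    """Tail-pad a [n_mels][frames] feature matrix with zeros.

    Returns (padded, length): `padded` has exactly `t_mel` frames per row,
    and `length` is the true frame count to pass as the `length` input.
    """
    n_frames = len(mel[0])
    if any(len(row) != n_frames for row in mel):
        raise ValueError("all mel rows must have the same number of frames")
    if n_frames > t_mel:
        raise ValueError(
            f"{n_frames} frames exceed the {t_mel}-frame bucket; chunk the input"
        )
    # Audio stays anchored at index 0; zeros are appended at the tail only.
    padded = [list(row) + [0.0] * (t_mel - n_frames) for row in mel]
    return padded, n_frames


# 3.2 s of dummy features: 320 frames at 100 mel frames per second.
mel = [[1.0] * 320 for _ in range(N_MELS)]
padded, length = pad_to_bucket(mel)
print(len(padded), len(padded[0]), length)
```

Rejecting over-long input, rather than truncating it, keeps the 15 s limit visible to the caller, who can then switch to the streaming loop.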
67
 
68
  **Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
69
  NPU accelerator) reject int64 tensors entirely. With int64 length, every
70
+ internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
71
  fails outright on Android with the GPU backend. This bundle is exported with
72
  int32 length end-to-end (input → internal mask arange/comparisons → output
73
+ `encoded_lengths`). int32 covers > 2 billion mel frames (~6,000 hours of audio),
74
+ so no practical range loss.
75
+
76
+ ## Why a single bucket and not multi-signature
77
+
78
+ An earlier revision shipped a multi-signature encoder with 4 buckets
79
+ (300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
80
+ were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
81
+ the LiteRT `CompiledModel.create()` API prepares **every** signature's
82
+ subgraph at load time — each one going through the full delegate-partition
83
+ pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
84
+ start was ~28 s.
85
+
86
+ A single-bucket file is one subgraph: ~7 s init, then ready. If you need
87
+ multiple bucket sizes for latency reasons, ship them as separate `.tflite`
88
+ files (TFLite has no cross-file weight sharing) and load on demand.
89
 
90
  ## Decoder + joint contract
91
 
 
101
  # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
102
  ```
103
 
104
+ `decoder_step.token` is `int64` because it's an embedding lookup; that op
105
+ runs on CPU regardless of delegate, so int64 there is harmless.
106
+
107
  Greedy TDT decoding (per encoder frame):
108
 
109
  1. Run joint with current `enc_frame` and last predicted `pred_frame`.
110
+ 2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
111
  3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
112
  re-prime decoder with the emitted token (h, c update).
113
  4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
114
  5. Repeat until `enc_lengths` is exhausted.
115
 
116
+ Cap at ~10 non-blank emissions per encoder frame to guard against the
117
+ pathological `dur=0` decode loop.
118
+
119
  ## Audio preprocessing
120
 
121
+ LiteRT itself does not produce mel features — your runtime must compute
122
+ them. Match NeMo's preprocessor exactly:
123
 
124
  ```
125
  sample_rate : 16000 Hz (resample if needed)
 
128
  win_length : 400
129
  n_mels : 128
130
  preemph : 0.97
131
+ log : log(mel + 1e-5), per-feature normalized
132
+ mel_scale : slaney
133
  ```
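As a rough illustration of the "per-feature normalized" step, the sketch below normalizes each mel bin to zero mean and unit variance over the valid frames only. The unbiased standard deviation and the `eps` added to it are assumptions to check against NeMo's preprocessor. `normalize_per_feature` is a hypothetical helper name.

```python
import math


def normalize_per_feature(mel, length, eps=1e-5):
    """Normalize each mel bin over its first `length` frames.

    mel: [n_mels][frames] log-mel values; frames past `length` are padding
    and are written back as zeros. `eps` and the (length - 1) divisor are
    assumptions, not values confirmed by the model card.
    """
    if length < 2:
        raise ValueError("need at least 2 valid frames to estimate a std")
    out = []
    for row in mel:
        valid = row[:length]
        mean = sum(valid) / length
        var = sum((x - mean) ** 2 for x in valid) / (length - 1)
        std = math.sqrt(var) + eps
        normalized = [(x - mean) / std for x in valid]
        out.append(normalized + [0.0] * (len(row) - length))
    return out


# One mel bin, three valid frames, one padded frame.
features = normalize_per_feature([[1.0, 2.0, 3.0, 0.0]], length=3)
print(features[0])
```

Restricting the statistics to the valid frames matters for short clips, since padding zeros would otherwise shift the mean and shrink the variance.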
134
 
135
  Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
136
 
137
  ## Streaming usage
138
 
139
+ This bundle supports chunked streaming inference using a left+chunk+right
140
+ context window that fits inside 15 s. A reference Python implementation is
141
+ in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
142
+ for Android UX:
 
 
143
 
144
+ | Knob | Value | Reason |
145
+ |---|---|---|
146
+ | `chunk_seconds` | 5 | committed per step |
147
+ | `left_context_seconds` | 5 | encoder bilateral context |
148
+ | `right_context_seconds` | 2 | end-to-end latency 7 s |
149
+ | `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
150
+ | `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |
151
 
152
+ We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
153
+ code-switching) with this config, ~22 % on clean offline ≤15 s English.
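The window arithmetic in the table can be sketched as below. This is a planning helper only, assuming 100 mel frames per second. `plan_windows` and its dictionary keys are hypothetical names, not part of the reference script. Each window starts at its own index 0 and would be tail-padded to 1500 frames before the encoder call.

```python
FPS = 100      # mel frames per second (hop_length 160 at 16 kHz)
T_MEL = 1500   # encoder bucket size


def plan_windows(total_frames, chunk_s=5, left_s=5, right_s=2):
    """Return one plan per chunk, with all offsets in mel frames.

    "window" is the (start, end) slice of the full mel buffer fed to the
    encoder; "commit" is the (start, end) of the new chunk relative to the
    window start, i.e. the region whose tokens are kept for this step.
    """
    chunk, left, right = chunk_s * FPS, left_s * FPS, right_s * FPS
    if left + chunk + right > T_MEL:
        raise ValueError("window does not fit the 1500-frame bucket")
    plans = []
    for chunk_start in range(0, total_frames, chunk):
        chunk_end = min(chunk_start + chunk, total_frames)
        win_start = max(0, chunk_start - left)
        win_end = min(total_frames, chunk_end + right)
        plans.append({
            "window": (win_start, win_end),
            "commit": (chunk_start - win_start, chunk_end - win_start),
        })
    return plans


plans = plan_windows(total_frames=1800)  # 18 s of audio
for plan in plans:
    print(plan)
```

For an 18 s input this yields four steps. The second step uses mel frames 0 to 1200 and commits frames 500 to 1000 of that window, which matches the 12 s total in the table.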
 
154
 
155
  ## Quantization
156
 
157
+ - All `.tflite` weights are FP16. Activations remain FP32.
158
+ - Bit-identical token output vs the upstream FP32 model on a 99-clip eval
159
+ set.
 
160
 
161
  ## Conversion provenance
162
 
 
164
 
165
  1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
166
  2. **ExportedProgram → TFLite** via
167
+ [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
 
168
  3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
169
  FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
170
 
171
+ Several NeMo internals required export-time monkey-patches:
172
+
173
+ - `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask` to
174
+ remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
175
+ - `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask` — to
176
+ build masks in `bool` instead of `uint8` (litert-torch has no uint8
177
+ lowering).
178
+ - `ConformerEncoder.{forward_internal,_create_masks}` and
179
+ `MaskedConvSequential.{forward,_create_mask}` — to keep the entire length
180
+ pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
181
+ GPU/NPU delegates can compile the graph without falling back to CPU.
182
 
183
+ ## Limitations
184
 
185
+ 1. **Audio at position 0.** The encoder expects audio anchored at the start
186
+ of its input window. Padding before the audio causes hallucinations.
187
+ 2. **15 s max per call.** Use the streaming chunker for longer clips.
188
+ 3. **No VAD or diarization.** Pair with an external VAD or a diarizer
189
+ (e.g. Sortformer) for speaker-attributed transcripts.
190
+ 4. **Multilingual but no language token.** Code-switching works, but the
191
+ model doesn't emit a language ID. Run a separate classifier if you need it.
 
192
 
193
  ## License
194
 
195
+ Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
 
196
 
197
  ## Citation
198
 
 
 
199
  ```bibtex
200
  @misc{nvidia_parakeet_tdt_0_6b_v3,
201
  title = {Parakeet-TDT-0.6B-v3},
encoder_T1500.tflite ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8f826f40ab8e50c0e05ff1532d1cedcaa9c79c34909f1c14d9ac86c0643db45b
3
+ size 1206789364
manifest.json CHANGED
@@ -30,36 +30,6 @@
30
  "attention_mode": "rel_pos",
31
  "att_context_size": null,
32
  "buckets": [
33
- {
34
- "n_mel_frames": 300,
35
- "n_encoder_frames": 37,
36
- "input_shape": [
37
- 1,
38
- 128,
39
- 300
40
- ],
41
- "signature": "forward_T300"
42
- },
43
- {
44
- "n_mel_frames": 500,
45
- "n_encoder_frames": 62,
46
- "input_shape": [
47
- 1,
48
- 128,
49
- 500
50
- ],
51
- "signature": "forward_T500"
52
- },
53
- {
54
- "n_mel_frames": 700,
55
- "n_encoder_frames": 87,
56
- "input_shape": [
57
- 1,
58
- 128,
59
- 700
60
- ],
61
- "signature": "forward_T700"
62
- },
63
  {
64
  "n_mel_frames": 1500,
65
  "n_encoder_frames": 187,
@@ -68,12 +38,16 @@
68
  128,
69
  1500
70
  ],
71
- "signature": "forward_T1500"
 
 
 
 
 
 
72
  }
73
  ],
74
- "multisig": true,
75
- "dynamic_artifact": "encoder_dynamicT.pt2",
76
- "dynamic_artifact_size_mb": 2367.32
77
  },
78
  "decoder": {
79
  "num_layers": 2,
@@ -171,51 +145,68 @@
171
  "results": [
172
  {
173
  "graph": "encoder",
174
- "source_artifact": "encoder_dynamicT.pt2",
175
- "output_artifact": "encoder_multisig.tflite",
176
- "size_mb": 1191.14,
177
- "convert_seconds": 367.97,
178
  "quant": "fp16",
179
- "multisig": true,
180
- "signatures": [
181
- "forward_T300",
182
- "forward_T500",
183
- "forward_T700",
184
- "forward_T1500"
 
 
 
185
  ],
186
- "parity_per_signature": {
187
- "forward_T300": {
188
- "ok": true,
189
- "max_abs_diff": 0.009477382525801659,
190
- "per_output_diffs": [
191
- 0.009477382525801659,
192
- 0.0
193
- ]
194
- },
195
- "forward_T500": {
196
- "ok": true,
197
- "max_abs_diff": 0.0061398837715387344,
198
- "per_output_diffs": [
199
- 0.0061398837715387344,
200
- 0.0
 
 
 
 
 
 
 
 
 
 
201
  ]
202
- },
203
- "forward_T700": {
204
- "ok": true,
205
- "max_abs_diff": 0.001271696761250496,
206
- "per_output_diffs": [
207
- 0.001271696761250496,
208
- 0.0
 
 
209
  ]
210
- },
211
- "forward_T1500": {
212
- "ok": true,
213
- "max_abs_diff": 0.004102766513824463,
214
- "per_output_diffs": [
215
- 0.004102766513824463,
216
- 0.0
 
 
217
  ]
218
- }
219
  }
220
  },
221
  {
@@ -223,7 +214,7 @@
223
  "source_artifact": "decoder_step.pt2",
224
  "output_artifact": "decoder_step.tflite",
225
  "size_mb": 22.55,
226
- "convert_seconds": 2.72,
227
  "quant": "fp16",
228
  "torch_output_shapes": [
229
  [
@@ -315,7 +306,7 @@
315
  "source_artifact": "joint_step.pt2",
316
  "output_artifact": "joint_step.tflite",
317
  "size_mb": 12.08,
318
- "convert_seconds": 1.08,
319
  "quant": "fp16",
320
  "torch_output_shapes": [
321
  [
@@ -327,9 +318,9 @@
327
  ],
328
  "parity": {
329
  "ok": true,
330
- "max_abs_diff": 0.33984375,
331
  "per_output_diffs": [
332
- 0.33984375
333
  ],
334
  "tflite_output_shapes": [
335
  [
 
30
  "attention_mode": "rel_pos",
31
  "att_context_size": null,
32
  "buckets": [
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
  {
34
  "n_mel_frames": 1500,
35
  "n_encoder_frames": 187,
 
38
  128,
39
  1500
40
  ],
41
+ "output_shape": [
42
+ 1,
43
+ 1024,
44
+ 188
45
+ ],
46
+ "artifact": "encoder_T1500.pt2",
47
+ "size_mb": 2366.85
48
  }
49
  ],
50
+ "multisig": false
 
 
51
  },
52
  "decoder": {
53
  "num_layers": 2,
 
145
  "results": [
146
  {
147
  "graph": "encoder",
148
+ "source_artifact": "encoder_T1500.pt2",
149
+ "output_artifact": "encoder_T1500.tflite",
150
+ "size_mb": 1150.88,
151
+ "convert_seconds": 158.59,
152
  "quant": "fp16",
153
+ "torch_output_shapes": [
154
+ [
155
+ 1,
156
+ 1024,
157
+ 188
158
+ ],
159
+ [
160
+ 1
161
+ ]
162
  ],
163
+ "parity": {
164
+ "ok": true,
165
+ "max_abs_diff": 0.0,
166
+ "per_output_diffs": [
167
+ [
168
+ "shape mismatch",
169
+ [
170
+ 1
171
+ ],
172
+ [
173
+ 1,
174
+ 1024,
175
+ 188
176
+ ]
177
+ ],
178
+ [
179
+ "shape mismatch",
180
+ [
181
+ 1,
182
+ 1024,
183
+ 188
184
+ ],
185
+ [
186
+ 1
187
+ ]
188
  ]
189
+ ],
190
+ "tflite_output_shapes": [
191
+ [
192
+ 1
193
+ ],
194
+ [
195
+ 1,
196
+ 1024,
197
+ 188
198
  ]
199
+ ],
200
+ "torch_output_shapes": [
201
+ [
202
+ 1,
203
+ 1024,
204
+ 188
205
+ ],
206
+ [
207
+ 1
208
  ]
209
+ ]
210
  }
211
  },
212
  {
 
214
  "source_artifact": "decoder_step.pt2",
215
  "output_artifact": "decoder_step.tflite",
216
  "size_mb": 22.55,
217
+ "convert_seconds": 1.92,
218
  "quant": "fp16",
219
  "torch_output_shapes": [
220
  [
 
306
  "source_artifact": "joint_step.pt2",
307
  "output_artifact": "joint_step.tflite",
308
  "size_mb": 12.08,
309
+ "convert_seconds": 1.61,
310
  "quant": "fp16",
311
  "torch_output_shapes": [
312
  [
 
318
  ],
319
  "parity": {
320
  "ok": true,
321
+ "max_abs_diff": 0.408447265625,
322
  "per_output_diffs": [
323
+ 0.408447265625
324
  ],
325
  "tflite_output_shapes": [
326
  [