spybyscript committed on
Commit 6e11431 · verified · 1 Parent(s): 93ca9ad

Upload LiteRT FP16 multi-sig bundle

README.md ADDED
@@ -0,0 +1,183 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ - es
+ - it
+ - de
+ - fr
+ - pt
+ library_name: litert
+ base_model: nvidia/parakeet-tdt-0.6b-v3
+ tags:
+ - automatic-speech-recognition
+ - speech
+ - audio
+ - parakeet
+ - tdt
+ - litert
+ - tflite
+ - on-device
+ - mobile
+ - android
+ - streaming
+ pipeline_tag: automatic-speech-recognition
+ ---
+
+ # Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
+
+ This is a [LiteRT](https://ai.google.dev/edge/litert) (TFLite) port of
+ [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
+ packaged for on-device inference (Android / Mac / embedded) without a Python or
+ NeMo runtime dependency.
+
+ For **model capabilities, languages, training data, license, and benchmarks**,
+ see the upstream model card. This card only documents what's specific to the
+ LiteRT port.
+
+ ## What's in this bundle
+
+ | File | Size | Purpose |
+ |---|---|---|
+ | `encoder_multisig.tflite` | 1.19 GB | FP16 weight-shared encoder, 4 bucket signatures |
+ | `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
+ | `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
+ | `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
+ | `manifest.json` | — | All metadata the runtime needs |
+
+ Total: **~1.2 GB** (FP16). FP32 reference is roughly 2.4 GB.
+
+ ## Encoder signatures (multi-bucket)
+
+ Weights are shared across 4 fixed-T input shapes via TFLite signatures:
+
+ | Signature | T_mel | Audio | Use |
+ |---|---|---|---|
+ | `forward_T300` | 300 | 3.0 s | short utterances, low latency |
+ | `forward_T500` | 500 | 5.0 s | typical streaming chunks |
+ | `forward_T700` | 700 | 7.0 s | medium utterances |
+ | `forward_T1500` | 1500 | 15.0 s | long utterances, offline |
+
+ Each signature has the same I/O shape contract:
+
+ ```
+ inputs:
+ audio_signal : float32 [1, 128, T_mel] # log-mel features (NeMo preproc)
+ length : int64 [1] # actual mel frames used (≤ T_mel)
+ outputs:
+ encoded : float32 [1, 1024, T_enc] # T_enc = (T_mel - 4) // 8
+ encoded_lengths : int64 [1]
+ ```
+
+ Pick the smallest bucket that fits your input; pad shorter inputs with zeros
+ and pass the true length.
+
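The bucket rule above can be sketched in a few lines. This is a minimal illustration, not code from the bundle: `pick_bucket`, `pad_to_bucket`, and `encoder_frames` are hypothetical helper names, the mel input is assumed to be a plain list of 128 rows of floats, and the bucket sizes are copied from the signature table.

```python
# Bucket sizes (mel frames) and signature names, copied from the table above.
BUCKETS = {
    300: "forward_T300",
    500: "forward_T500",
    700: "forward_T700",
    1500: "forward_T1500",
}


def pick_bucket(n_mel_frames):
    """Return (signature_name, T_mel) for the smallest bucket that fits."""
    for t_mel in sorted(BUCKETS):
        if n_mel_frames <= t_mel:
            return BUCKETS[t_mel], t_mel
    raise ValueError("input exceeds 1500 mel frames; chunk it first")


def pad_to_bucket(mel, t_mel):
    """Zero-pad the tail of every mel row so each row has t_mel frames.

    `mel` is assumed to be a list of 128 rows (hypothetical layout);
    audio stays at the start of the buffer, padding goes at the end.
    """
    return [row + [0.0] * (t_mel - len(row)) for row in mel]


def encoder_frames(t_mel):
    """Output time dimension for a bucket, per the contract: (T_mel - 4) // 8."""
    return (t_mel - 4) // 8


if __name__ == "__main__":
    name, t_mel = pick_bucket(420)          # 4.2 s of audio at 100 mel frames/s
    padded = pad_to_bucket([[0.5] * 420 for _ in range(128)], t_mel)
    print(name, len(padded[0]), encoder_frames(t_mel))
```

The true frame count (420 in the usage lines) is what would be passed as `length`; only the buffer is padded to the bucket size.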
+ ## Decoder + joint contract
+
+ ```
+ decoder_step:
+ inputs: token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
+ outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]
+
+ joint_step:
+ inputs: enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
+ outputs: logits float32 [1,1,1,8198]
+ # logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
+ # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
+ ```
+
+ Greedy TDT decoding (per encoder frame):
+
+ 1. Run joint with current `enc_frame` and last predicted `pred_frame`.
+ 2. `token = argmax(token_logits)`; `dur = argmax(duration_logits) ∈ {0,1,2,3,4}`.
+ 3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
+ re-prime decoder with the emitted token (h, c update).
+ 4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
+ 5. Repeat until `enc_lengths` is exhausted.
+
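The five steps above can be sketched as a plain Python loop. This is an illustration under stated assumptions, not the reference runner: `decoder_step` and `joint_step` are hypothetical callables wrapping the two `.tflite` graphs, `joint_step` is assumed to return a flat list of 8198 logits, priming the decoder with the blank id is an assumption, and the `MAX_SYMBOLS_PER_FRAME` guard is an addition that the steps above do not mention (it prevents an endless loop when the model keeps predicting duration 0).

```python
BLANK_ID = 8192             # from the contract above
NUM_TOKEN_LOGITS = 8193     # 8192 BPE tokens + 1 blank
DURATIONS = [0, 1, 2, 3, 4]
MAX_SYMBOLS_PER_FRAME = 10  # assumption: safety guard, not part of the card


def argmax(values):
    """Index of the largest element (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_tdt_decode(enc_frames, enc_length, decoder_step, joint_step, init_state):
    """Return [(token_id, encoder_frame_idx), ...] for one encoder output.

    decoder_step(token, state) -> (pred_frame, new_state)   # hypothetical wrapper
    joint_step(enc_frame, pred_frame) -> list of 8198 logits # hypothetical wrapper
    """
    # Assumption: the blank id doubles as the start-of-sequence token.
    pred, state = decoder_step(BLANK_ID, init_state)
    hyp, t, emitted_here = [], 0, 0
    while t < enc_length:
        logits = joint_step(enc_frames[t], pred)
        token = argmax(logits[:NUM_TOKEN_LOGITS])
        dur = DURATIONS[argmax(logits[NUM_TOKEN_LOGITS:])]
        if token != BLANK_ID:
            hyp.append((token, t))
            pred, state = decoder_step(token, state)  # (h, c) update
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1                               # forced advance
        else:
            dur = max(dur, 1)                         # blank always advances
        if dur > 0:
            t += dur
            emitted_here = 0
    return hyp
```

Keeping the encoder frame index next to each token costs little and is what a chunked runtime would need later to dedup tokens across overlapping windows.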
98
+ ## Audio preprocessing
99
+
100
+ LiteRT itself does not produce mel features — your runtime must compute them.
101
+ Match NeMo's preprocessor exactly:
102
+
103
+ ```
104
+ sample_rate : 16000 Hz (resample if needed)
105
+ n_fft : 512
106
+ hop_length : 160 → 100 mel frames / second
107
+ win_length : 400
108
+ n_mels : 128
109
+ preemph : 0.97
110
+ log : log10(mel + 1e-5) per-feature normalized
111
+ ```
112
+
113
+ Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
114
+
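The frame-rate figures follow directly from the preprocessing parameters. A small sketch of the arithmetic, with hypothetical helper names, which can also be used to turn encoder frame indices into timestamps:

```python
SAMPLE_RATE = 16000   # Hz, from the preprocessing table above
HOP_LENGTH = 160      # samples per mel frame
SUBSAMPLING = 8       # encoder subsampling factor

MEL_FPS = SAMPLE_RATE / HOP_LENGTH   # mel frames per second
ENC_FPS = MEL_FPS / SUBSAMPLING      # encoder frames per second


def enc_frame_to_seconds(frame_idx):
    """Approximate start time of an encoder frame, in seconds."""
    return frame_idx / ENC_FPS


def seconds_to_mel_frames(seconds):
    """Number of mel frames covering `seconds` of 16 kHz audio."""
    return int(round(seconds * MEL_FPS))


if __name__ == "__main__":
    print(MEL_FPS, ENC_FPS, enc_frame_to_seconds(1), seconds_to_mel_frames(12))
```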
+ ## Streaming usage
+
+ This bundle supports chunked streaming inference. A reference Python
+ implementation is provided in the upload repo (`transcribe_litert_streaming.py`),
+ which produces ~27% WER on multilingual long-form audio at ~2× real-time on CPU
+ with `chunk=5s, left=5s, right=2s` (12 s window, bucket `forward_T1500`).
+
+ For Android, port the chunker by:
+
+ 1. Hold a rolling mel buffer (left context + new chunk + right look-ahead).
+ 2. Pick the smallest bucket ≥ window length, pad to bucket T_mel.
+ 3. Run encoder signature, then TDT greedy decode over `T_enc` frames.
+ 4. Dedup tokens against the previous chunk's emit window using their
+ `encoder_frame_idx`. Reuse the LSTM `(h, c)` state across chunks (optional).
+
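One way to lay out the windows and the per-chunk emit range is sketched below. It is not taken from `transcribe_litert_streaming.py`, whose actual dedup logic may differ: the helper names are hypothetical, and mapping mel frames to encoder frames by floor division with 8 is an approximation of the subsampler.

```python
MEL_FPS = 100     # mel frames per second (hop 160 at 16 kHz)
SUBSAMPLING = 8   # encoder frames are 8x coarser than mel frames


def chunk_windows(total_mel, chunk_s=5.0, left_s=5.0, right_s=2.0):
    """Yield (win_start, win_end, emit_lo, emit_hi) for each chunk.

    win_start/win_end are absolute mel-frame bounds of the encoder window.
    emit_lo/emit_hi are encoder-frame bounds, relative to the window start,
    of the region whose tokens this chunk is allowed to emit.
    """
    chunk, left, right = (int(s * MEL_FPS) for s in (chunk_s, left_s, right_s))
    for start in range(0, total_mel, chunk):
        win_start = max(0, start - left)
        win_end = min(total_mel, start + chunk + right)
        emit_lo = (start - win_start) // SUBSAMPLING
        emit_hi = (min(start + chunk, total_mel) - win_start) // SUBSAMPLING
        yield win_start, win_end, emit_lo, emit_hi


def keep_in_emit_window(tokens, emit_lo, emit_hi):
    """Drop tokens decoded from left context or right look-ahead frames."""
    return [(tok, idx) for tok, idx in tokens if emit_lo <= idx < emit_hi]


if __name__ == "__main__":
    for window in chunk_windows(2000):   # 20 s of audio
        print(window)
```

With the default settings a full window spans 1200 mel frames, which is why the card pairs this configuration with the `forward_T1500` bucket.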
+ The model is **not** a strict left-only streamer — it sees right context within
+ each chunk window. For "real" low-latency streaming, the right-context
+ look-ahead can be reduced or removed at a quality cost.
+
+ ## Quantization
+
+ - All `.tflite` weights are FP16. Activations remain FP32 (no activation
+ calibration).
+ - Round-trip parity with the upstream FP32 model: bit-identical token output on
+ a 99-clip English eval set (validated with the offline runner).
+
+ ## Conversion provenance
+
+ Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
+
+ 1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
+ 2. **ExportedProgram → TFLite** via
+ [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0
+ (`signature(...).add_signature(...).convert()`).
+ 3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
+ FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
+
+ The encoder graph is exported once with a dynamic time dim, then specialized
+ into 4 fixed-T signatures sharing weights. The TFLite serializer dedups the
+ weight tensors, so the bundle is the size of one encoder, not four.
+
+ ## Limitations & caveats
+
+ - **Bucket positional encoding.** The encoder was trained with audio anchored
+ at position 0 of its input window. Padding *before* the audio causes
+ hallucinations. Always place audio at the start of the bucket buffer and
+ zero-pad the tail.
+ - **Long-form clips.** A single bucket call covers at most 15 s. Anything
+ longer must be chunked at the runtime level.
+ - **No voice activity detection / diarization.** Pair with a separate VAD or
+ diarizer (e.g. Sortformer, pyannote) for speaker-attributed transcripts.
+
+ ## License
+
+ Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0). See
+ the upstream model card for full terms.
+
+ ## Citation
+
+ If you use this bundle, cite the upstream NeMo model:
+
+ ```bibtex
+ @misc{nvidia_parakeet_tdt_0_6b_v3,
+ title = {Parakeet-TDT-0.6B-v3},
+ author = {NVIDIA},
+ year = {2025},
+ url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
+ }
+ ```
decoder_step.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb0bf3559a0b4cbdc3ca05b7e8ff948ee5ef158ce424667b62a85f6c769a9ce1
+ size 23650084
encoder_multisig.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a97075644590cedce95a53083c876f56dce22d2e1e5807bc4ca2d6879f6183c8
+ size 1249026196
joint_step.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e28c22fc426df9900ef4a1bd15760ec757e44f0fd1818e0afb51c4fe79031be
+ size 12664976
manifest.json ADDED
@@ -0,0 +1,352 @@
+ {
+ "model": "nvidia/parakeet-tdt-0.6b-v3",
+ "torch_version": "2.11.0+cu130",
+ "model_class": "EncDecRNNTBPEModel",
+ "vocab_size": 8192,
+ "blank_id": 8192,
+ "durations": [
+ 0,
+ 1,
+ 2,
+ 3,
+ 4
+ ],
+ "num_durations": 5,
+ "joint_output_dim": 8198,
+ "joint_token_logits_slice": [
+ 0,
+ 8193
+ ],
+ "joint_duration_logits_slice": [
+ 8193,
+ 8198
+ ],
+ "encoder": {
+ "d_model": 1024,
+ "subsampling_factor": 8,
+ "n_layers": 24,
+ "n_heads": 8,
+ "feat_in": 128,
+ "buckets": [
+ {
+ "n_mel_frames": 300,
+ "n_encoder_frames": 37,
+ "input_shape": [
+ 1,
+ 128,
+ 300
+ ],
+ "signature": "forward_T300"
+ },
+ {
+ "n_mel_frames": 500,
+ "n_encoder_frames": 62,
+ "input_shape": [
+ 1,
+ 128,
+ 500
+ ],
+ "signature": "forward_T500"
+ },
+ {
+ "n_mel_frames": 700,
+ "n_encoder_frames": 87,
+ "input_shape": [
+ 1,
+ 128,
+ 700
+ ],
+ "signature": "forward_T700"
+ },
+ {
+ "n_mel_frames": 1500,
+ "n_encoder_frames": 187,
+ "input_shape": [
+ 1,
+ 128,
+ 1500
+ ],
+ "signature": "forward_T1500"
+ }
+ ],
+ "multisig": true,
+ "dynamic_artifact": "encoder_dynamicT.pt2",
+ "dynamic_artifact_size_mb": 2367.32
+ },
+ "decoder": {
+ "num_layers": 2,
+ "hidden": 640,
+ "embed_dim": 640
+ },
+ "joint": {
+ "d_enc": 1024,
+ "d_pred": 640,
+ "joint_dim": 640
+ },
+ "preprocessor": {
+ "sample_rate": 16000,
+ "n_fft": 512,
+ "win_length": 400,
+ "hop_length": 160,
+ "n_mels": 128,
+ "preemph": 0.97,
+ "log": true,
+ "frame_rate_hz_post_subsample": 12.5
+ },
+ "artifacts": {
+ "decoder_step": {
+ "filename": "decoder_step.pt2",
+ "size_mb": 45.07,
+ "input_shapes": {
+ "token": [
+ 1,
+ 1
+ ],
+ "h": [
+ 2,
+ 1,
+ 640
+ ],
+ "c": [
+ 2,
+ 1,
+ 640
+ ]
+ },
+ "input_dtypes": {
+ "token": "int64",
+ "h": "float32",
+ "c": "float32"
+ },
+ "output_shapes": {
+ "g": [
+ 1,
+ 1,
+ 640
+ ],
+ "h": [
+ 2,
+ 1,
+ 640
+ ],
+ "c": [
+ 2,
+ 1,
+ 640
+ ]
+ }
+ },
+ "joint_step": {
+ "filename": "joint_step.pt2",
+ "size_mb": 24.14,
+ "input_shapes": {
+ "enc_frame": [
+ 1,
+ 1024,
+ 1
+ ],
+ "pred_frame": [
+ 1,
+ 640,
+ 1
+ ]
+ },
+ "output_shape": [
+ 1,
+ 1,
+ 1,
+ 8198
+ ]
+ }
+ },
+ "tokenizer": {
+ "saved": true,
+ "method": "serialized_model_proto",
+ "vocab_size": 8192
+ },
+ "litert": {
+ "quant": "fp16",
+ "results": [
+ {
+ "graph": "encoder",
+ "source_artifact": "encoder_dynamicT.pt2",
+ "output_artifact": "encoder_multisig.tflite",
+ "size_mb": 1191.16,
+ "convert_seconds": 402.16,
+ "quant": "fp16",
+ "multisig": true,
+ "signatures": [
+ "forward_T300",
+ "forward_T500",
+ "forward_T700",
+ "forward_T1500"
+ ],
+ "parity_per_signature": {
+ "forward_T300": {
+ "ok": true,
+ "max_abs_diff": 0.0033329054713249207,
+ "per_output_diffs": [
+ 0.0033329054713249207,
+ 0.0
+ ]
+ },
+ "forward_T500": {
+ "ok": true,
+ "max_abs_diff": 0.006780040450394154,
+ "per_output_diffs": [
+ 0.006780040450394154,
+ 0.0
+ ]
+ },
+ "forward_T700": {
+ "ok": true,
+ "max_abs_diff": 0.0005690590478479862,
+ "per_output_diffs": [
+ 0.0005690590478479862,
+ 0.0
+ ]
+ },
+ "forward_T1500": {
+ "ok": true,
+ "max_abs_diff": 0.003892328590154648,
+ "per_output_diffs": [
+ 0.003892328590154648,
+ 0.0
+ ]
+ }
+ }
+ },
+ {
+ "graph": "decoder_step",
+ "source_artifact": "decoder_step.pt2",
+ "output_artifact": "decoder_step.tflite",
+ "size_mb": 22.55,
+ "convert_seconds": 3.81,
+ "quant": "fp16",
+ "torch_output_shapes": [
+ [
+ 1,
+ 1,
+ 640
+ ],
+ [
+ 2,
+ 1,
+ 640
+ ],
+ [
+ 2,
+ 1,
+ 640
+ ]
+ ],
+ "parity": {
+ "ok": true,
+ "max_abs_diff": 0.0044100284576416016,
+ "per_output_diffs": [
+ [
+ "shape mismatch",
+ [
+ 2,
+ 1,
+ 640
+ ],
+ [
+ 1,
+ 1,
+ 640
+ ]
+ ],
+ [
+ "shape mismatch",
+ [
+ 1,
+ 1,
+ 640
+ ],
+ [
+ 2,
+ 1,
+ 640
+ ]
+ ],
+ 0.0044100284576416016
+ ],
+ "tflite_output_shapes": [
+ [
+ 2,
+ 1,
+ 640
+ ],
+ [
+ 1,
+ 1,
+ 640
+ ],
+ [
+ 2,
+ 1,
+ 640
+ ]
+ ],
+ "torch_output_shapes": [
+ [
+ 1,
+ 1,
+ 640
+ ],
+ [
+ 2,
+ 1,
+ 640
+ ],
+ [
+ 2,
+ 1,
+ 640
+ ]
+ ]
+ }
+ },
+ {
+ "graph": "joint_step",
+ "source_artifact": "joint_step.pt2",
+ "output_artifact": "joint_step.tflite",
+ "size_mb": 12.08,
+ "convert_seconds": 1.13,
+ "quant": "fp16",
+ "torch_output_shapes": [
+ [
+ 1,
+ 1,
+ 1,
+ 8198
+ ]
+ ],
+ "parity": {
+ "ok": true,
+ "max_abs_diff": 0.275390625,
+ "per_output_diffs": [
+ 0.275390625
+ ],
+ "tflite_output_shapes": [
+ [
+ 1,
+ 1,
+ 1,
+ 8198
+ ]
+ ],
+ "torch_output_shapes": [
+ [
+ 1,
+ 1,
+ 1,
+ 8198
+ ]
+ ]
+ }
+ }
+ ]
+ }
+ }
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eacec2b0a77f336d4a2ca4a25a7047575d3c2b74de47e997f4c205126ed3135e
+ size 360916