spybyscript/parakeet-tdt-litert

@@ -26,10 +26,10 @@ pipeline_tag: automatic-speech-recognition
 # Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
-This is a [LiteRT](https://ai.google.dev/edge/litert) (TFLite) port of
 [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
-packaged for on-device inference (Android / Mac / embedded) without a Python or
-NeMo runtime dependency.
 For **model capabilities, languages, training data, license, and benchmarks**,
 see the upstream model card. This card only documents what's specific to the
@@ -39,46 +39,53 @@ LiteRT port.
 | File | Size | Purpose |
 |---|---|---|
-| `encoder_multisig.tflite` | 1.19 GB | FP16 weight-shared encoder, 4 bucket signatures |
 | `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
 | `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
 | `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
 | `manifest.json` | — | All metadata the runtime needs |
-Total: **~1.2 GB** (FP16). FP32 reference is roughly 2.4 GB.
-## Encoder signatures (multi-bucket)
-Weights are shared across 4 fixed-T input shapes via TFLite signatures:
-| Signature | T_mel | Audio | Use |
-|---|---|---|---|
-| `forward_T300` | 300 | 3.0 s | short utterances, low latency |
-| `forward_T500` | 500 | 5.0 s | typical streaming chunks |
-| `forward_T700` | 700 | 7.0 s | medium utterances |
-| `forward_T1500` | 1500 | 15.0 s | long utterances, offline |
-Each signature has the same I/O shape contract:
 ```
 inputs:
-  audio_signal : float32 [1, 128, T_mel]   # log-mel features (NeMo preproc)
-  length       : int32   [1]                # actual mel frames used (≤ T_mel)
 outputs:
-  encoded         : float32 [1, 1024, T_enc]  # T_enc = (T_mel - 4) // 8
   encoded_lengths : int32   [1]
 ```
-Pick the smallest bucket that fits your input; pad shorter inputs with zeros
-and pass the true length.
 **Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
 NPU accelerator) reject int64 tensors entirely. With int64 length, every
-internal CAST node touching it falls back to CPU and `CompiledModel.create()`
 fails outright on Android with the GPU backend. This bundle is exported with
 int32 length end-to-end (input → internal mask arange/comparisons → output
-`encoded_lengths`). int32 covers >2 billion mel frames (~5 hours), so no
-practical range loss.
 ## Decoder + joint contract
@@ -94,19 +101,25 @@ joint_step:
            # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
 ```
 Greedy TDT decoding (per encoder frame):
 1. Run joint with current `enc_frame` and last predicted `pred_frame`.
-2. `token = argmax(token_logits)`; `dur = argmax(duration_logits) ∈ {0,1,2,3,4}`.
 3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
    re-prime decoder with the emitted token (h, c update).
 4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
 5. Repeat until `enc_lengths` is exhausted.
 ## Audio preprocessing
-LiteRT itself does not produce mel features — your runtime must compute them.
-Match NeMo's preprocessor exactly:
 ```
 sample_rate    : 16000 Hz (resample if needed)
@@ -115,36 +128,35 @@ hop_length     : 160      → 100 mel frames / second
 win_length     : 400
 n_mels         : 128
 preemph        : 0.97
-log            : log10(mel + 1e-5) per-feature normalized
 ```
 Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
 ## Streaming usage
-This bundle supports chunked streaming inference. A reference Python
-implementation is provided in the upload repo (`transcribe_litert_streaming.py`),
-which produces ~27% WER on multilingual long-form audio at ~2× real-time on CPU
-with `chunk=5s, left=5s, right=2s` (12 s window, bucket `forward_T1500`).
-For Android, port the chunker by:
-1. Hold a rolling mel buffer (left context + new chunk + right look-ahead).
-2. Pick the smallest bucket ≥ window length, pad to bucket T_mel.
-3. Run encoder signature, then TDT greedy decode over `T_enc` frames.
-4. Dedup tokens against the previous chunk's emit window using their
-   `encoder_frame_idx`. Reuse the LSTM `(h, c)` state across chunks (optional).
-The model is **not** a strict left-only streamer — it sees right context within
-each chunk window. For "real" low-latency streaming, the right-context
-look-ahead can be reduced or removed at a quality cost.
 ## Quantization
-- All `.tflite` weights are FP16. Activations remain FP32 (no activation
-  calibration).
-- Round-trip parity with the upstream FP32 model: bit-identical token output on
-  a 99-clip English eval set (validated with the offline runner).
 ## Conversion provenance
@@ -152,35 +164,38 @@ Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
 1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
 2. **ExportedProgram → TFLite** via
-   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0
-   (`signature(...).add_signature(...).convert()`).
 3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
    FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
-The encoder graph is exported once with a dynamic time dim, then specialized
-into 4 fixed-T signatures sharing weights. The TFLite serializer dedups the
-weight tensors, so the bundle is the size of one encoder, not four.
-## Limitations & caveats
-- **Bucket positional encoding.** The encoder was trained with audio anchored
-  at position 0 of its input window. Padding *before* the audio causes
-  hallucinations. Always place audio at the start of the bucket buffer and
-  zero-pad the tail.
-- **Long-form clips.** A single bucket call covers at most 15 s. Anything
-  longer must be chunked at the runtime level.
-- **No voice activity detection / diarization.** Pair with a separate VAD or
-  diarizer (e.g. Sortformer, pyannote) for speaker-attributed transcripts.
 ## License
-Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0). See
-the upstream model card for full terms.
 ## Citation
-If you use this bundle, cite the upstream NeMo model:
 ```bibtex
 @misc{nvidia_parakeet_tdt_0_6b_v3,
   title  = {Parakeet-TDT-0.6B-v3},

 # Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
+LiteRT (TFLite) port of
 [`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
+packaged for on-device inference (Android / Mac / embedded) without a Python
+or NeMo runtime dependency.
 For **model capabilities, languages, training data, license, and benchmarks**,
 see the upstream model card. This card only documents what's specific to the
 | File | Size | Purpose |
 |---|---|---|
+| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
 | `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
 | `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
 | `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
 | `manifest.json` | — | All metadata the runtime needs |
+Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.
+## Encoder I/O contract
 ```
 inputs:
+  audio_signal : float32 [1, 128, 1500]   # log-mel features (NeMo preproc)
+  length       : int32   [1]               # actual mel frames used (≤ 1500)
 outputs:
+  encoded         : float32 [1, 1024, 188]  # 188 = ceil(1500 / 8)
   encoded_lengths : int32   [1]
 ```
+Pad shorter inputs with zeros at the **tail** (the encoder was trained with
+audio anchored at position 0; left-padding causes hallucinations) and pass
+the true length.
+The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
+encoder in a sliding-window streaming loop — see "Streaming usage" below.
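The tail-padding rule above can be sketched as follows. This is a minimal illustration, assuming the mel features are already computed as a `[128, T]` NumPy array; the helper name `pad_to_bucket` is hypothetical and not part of the bundle.

```python
import numpy as np

N_MELS = 128   # mel bins, from the encoder I/O contract
T_MEL = 1500   # fixed bucket length in mel frames (15 s)


def pad_to_bucket(mel):
    """Place mel features at position 0 and zero-pad the tail.

    mel: array of shape [128, T] with T <= 1500.
    Returns (audio_signal float32 [1, 128, 1500], length int32 [1]).
    """
    mel = np.asarray(mel, dtype=np.float32)
    if mel.ndim != 2 or mel.shape[0] != N_MELS:
        raise ValueError(f"expected [128, T] features, got {mel.shape}")
    t = mel.shape[1]
    if t > T_MEL:
        raise ValueError(f"{t} mel frames exceed the {T_MEL}-frame bucket")
    audio_signal = np.zeros((1, N_MELS, T_MEL), dtype=np.float32)
    audio_signal[0, :, :t] = mel              # audio anchored at the start
    length = np.array([t], dtype=np.int32)    # true length, int32 not int64
    return audio_signal, length
```

Inputs longer than 1500 frames are rejected rather than truncated, so the caller has to route them through the streaming chunker.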
 **Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
 NPU accelerator) reject int64 tensors entirely. With int64 length, every
+internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
 fails outright on Android with the GPU backend. This bundle is exported with
 int32 length end-to-end (input → internal mask arange/comparisons → output
+`encoded_lengths`). int32 covers > 2 billion mel frames (at 100 frames/s,
+roughly 5,965 hours, or about 248 days, of audio), so no practical range loss.
+## Why a single bucket and not multi-signature
+An earlier revision shipped a multi-signature encoder with 4 buckets
+(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
+were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
+the LiteRT `CompiledModel.create()` API prepares **every** signature's
+subgraph at load time — each one going through the full delegate-partition
+pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
+start was ~28 s.
+A single-bucket file is one subgraph: ~7 s init, then ready. If you need
+multiple bucket sizes for latency reasons, ship them as separate `.tflite`
+files (TFLite has no cross-file weight sharing) and load on demand.
 ## Decoder + joint contract
            # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
 ```
+`decoder_step.token` is `int64` because it's an embedding lookup; that op
+runs on CPU regardless of delegate, so int64 there is harmless.
 Greedy TDT decoding (per encoder frame):
 1. Run joint with current `enc_frame` and last predicted `pred_frame`.
+2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
 3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
    re-prime decoder with the emitted token (h, c update).
 4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
 5. Repeat until `enc_lengths` is exhausted.
+Cap at ~10 non-blank emissions per encoder frame to guard against the
+pathological `dur=0` decode loop.
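The greedy loop and the emission cap above can be sketched in plain Python. This is a hedged illustration, not the reference runner: `joint_fn` and `decoder_fn` are hypothetical wrappers around `joint_step.tflite` and `decoder_step.tflite`, the token logits are assumed to occupy `logits[:8193]` (blank last), and priming the decoder with the blank token is an assumption.

```python
DURATIONS = (0, 1, 2, 3, 4)  # duration values indexed by the duration argmax


def _argmax(values):
    """Index of the largest element of a 1-D sequence."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_tdt_decode(enc_frames, enc_len, joint_fn, decoder_fn, init_state,
                      blank_id=8192, max_symbols_per_frame=10):
    """Greedy TDT decode over enc_frames[0:enc_len].

    joint_fn(enc_frame, pred_frame) -> 1-D logits (tokens, blank, durations).
    decoder_fn(token, state) -> (pred_frame, new_state).
    Both callables are hypothetical wrappers around the .tflite models.
    """
    n_tok = blank_id + 1  # token logits include the blank at index blank_id
    tokens = []
    # Assumption: the prediction network is primed with the blank token.
    pred, state = decoder_fn(blank_id, init_state)
    t, emitted_here = 0, 0
    while t < enc_len:
        logits = joint_fn(enc_frames[t], pred)
        token = _argmax(logits[:n_tok])
        dur = DURATIONS[_argmax(logits[n_tok:n_tok + len(DURATIONS)])]
        if token != blank_id:
            tokens.append(token)
            pred, state = decoder_fn(token, state)  # re-prime with the token
            emitted_here += 1
            if dur == 0 and emitted_here >= max_symbols_per_frame:
                dur = 1  # emission cap reached: force progress
        else:
            dur = max(dur, 1)  # blank always advances at least one frame
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

Forcing `dur = 1` once the cap is hit is one way to break the `dur=0` loop; a runtime could equally drop the surplus tokens, and the source does not say which behaviour the reference implementation uses.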
 ## Audio preprocessing
+LiteRT itself does not produce mel features — your runtime must compute
+them. Match NeMo's preprocessor exactly:
 ```
 sample_rate    : 16000 Hz (resample if needed)
 win_length     : 400
 n_mels         : 128
 preemph        : 0.97
+log            : log(mel + 1e-5), per-feature normalized
+mel_scale      : slaney
 ```
 Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
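The frame-rate figures follow directly from the preprocessing constants, and the same arithmetic converts an encoder frame index into a timestamp. A small sketch, assuming the 160-sample hop listed for the preprocessor; the helper name is illustrative only.

```python
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples between successive mel frames
SUBSAMPLING = 8        # encoder time reduction factor

mel_fps = SAMPLE_RATE / HOP_LENGTH    # mel frames per second
enc_fps = mel_fps / SUBSAMPLING       # encoder frames per second
enc_frame_ms = 1000.0 / enc_fps       # duration of one encoder frame


def enc_frame_to_seconds(frame_idx):
    """Approximate start time (seconds) of an encoder frame in its window."""
    return frame_idx / enc_fps
```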
 ## Streaming usage
+This bundle supports chunked streaming inference using a left+chunk+right
+context window that fits inside 15 s. A reference Python implementation is
+in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
+for Android UX:
+| Knob | Value | Reason |
+|---|---|---|
+| `chunk_seconds` | 5 | committed per step |
+| `left_context_seconds` | 5 | encoder bilateral context |
+| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
+| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
+| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt accuracy |
+We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
+code-switching) with this config, and ~22 % WER on clean English clips of
+≤ 15 s decoded offline.
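The window arithmetic in the table can be turned into a chunk planner. The sketch below is an assumption about how a runtime might slice a long mel sequence, not a copy of `transcribe_litert_streaming.py`; the function name and the returned dictionary layout are hypothetical.

```python
def plan_windows(total_mel, chunk_s=5, left_s=5, right_s=2,
                 mel_fps=100, bucket=1500):
    """Plan left+chunk+right windows over a mel sequence of total_mel frames.

    Each plan holds the mel slice to feed the encoder ("window") and the
    sub-range of that slice whose tokens are committed ("commit"), both in
    mel frames. Window slices start at their own index 0, so the audio stays
    anchored at the start of the bucket and is zero-padded at the tail.
    """
    chunk, left, right = (int(s * mel_fps) for s in (chunk_s, left_s, right_s))
    if left + chunk + right > bucket:
        raise ValueError("context window does not fit in the encoder bucket")
    plans = []
    for c0 in range(0, total_mel, chunk):
        c1 = min(c0 + chunk, total_mel)      # end of the committed chunk
        w0 = max(0, c0 - left)               # left context, clipped at start
        w1 = min(total_mel, c1 + right)      # right look-ahead, clipped at end
        plans.append({"window": (w0, w1), "commit": (c0 - w0, c1 - w0)})
    return plans
```

For a 12 s clip (1200 mel frames) this yields three windows; dividing the commit range by 8 gives an approximate encoder-frame range from which to keep tokens.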
 ## Quantization
+- All `.tflite` weights are FP16. Activations remain FP32.
+- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
+  set.
 ## Conversion provenance
 1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
 2. **ExportedProgram → TFLite** via
+   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
 3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
    FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
+Several NeMo internals required export-time monkey-patches:
+- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask` — to
+  remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
+- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask` — to
+  build masks in `bool` instead of `uint8` (litert-torch has no uint8
+  lowering).
+- `ConformerEncoder.{forward_internal,_create_masks}` and
+  `MaskedConvSequential.{forward,_create_mask}` — to keep the entire length
+  pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
+  GPU/NPU delegates can compile the graph without falling back to CPU.
+## Limitations
+1. **Audio at position 0.** The encoder expects audio anchored at the start
+   of its input window. Padding before the audio causes hallucinations.
+2. **15 s max per call.** Use the streaming chunker for longer clips.
+3. **No VAD or diarization.** Pair with an external VAD or a diarizer
+   (e.g. Sortformer) for speaker-attributed transcripts.
+4. **Multilingual but no language token.** Code-switching works, but the
+   model doesn't emit a language ID. Run a separate classifier if you need it.
 ## License
+Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
 ## Citation
 ```bibtex
 @misc{nvidia_parakeet_tdt_0_6b_v3,
   title  = {Parakeet-TDT-0.6B-v3},

encoder_T1500.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8f826f40ab8e50c0e05ff1532d1cedcaa9c79c34909f1c14d9ac86c0643db45b
+size 1206789364

manifest.json CHANGED

@@ -30,36 +30,6 @@
     "attention_mode": "rel_pos",
     "att_context_size": null,
     "buckets": [
-      {
-        "n_mel_frames": 300,
-        "n_encoder_frames": 37,
-        "input_shape": [
-          1,
-          128,
-          300
-        ],
-        "signature": "forward_T300"
-      },
-      {
-        "n_mel_frames": 500,
-        "n_encoder_frames": 62,
-        "input_shape": [
-          1,
-          128,
-          500
-        ],
-        "signature": "forward_T500"
-      },
-      {
-        "n_mel_frames": 700,
-        "n_encoder_frames": 87,
-        "input_shape": [
-          1,
-          128,
-          700
-        ],
-        "signature": "forward_T700"
-      },
       {
         "n_mel_frames": 1500,
         "n_encoder_frames": 187,
@@ -68,12 +38,16 @@
           128,
           1500
         ],
-        "signature": "forward_T1500"
       }
     ],
-    "multisig": true,
-    "dynamic_artifact": "encoder_dynamicT.pt2",
-    "dynamic_artifact_size_mb": 2367.32
   },
   "decoder": {
     "num_layers": 2,
@@ -171,51 +145,68 @@
     "results": [
       {
         "graph": "encoder",
-        "source_artifact": "encoder_dynamicT.pt2",
-        "output_artifact": "encoder_multisig.tflite",
-        "size_mb": 1191.14,
-        "convert_seconds": 367.97,
         "quant": "fp16",
-        "multisig": true,
-        "signatures": [
-          "forward_T300",
-          "forward_T500",
-          "forward_T700",
-          "forward_T1500"
         ],
-        "parity_per_signature": {
-          "forward_T300": {
-            "ok": true,
-            "max_abs_diff": 0.009477382525801659,
-            "per_output_diffs": [
-              0.009477382525801659,
-              0.0
-            ]
-          },
-          "forward_T500": {
-            "ok": true,
-            "max_abs_diff": 0.0061398837715387344,
-            "per_output_diffs": [
-              0.0061398837715387344,
-              0.0
             ]
-          },
-          "forward_T700": {
-            "ok": true,
-            "max_abs_diff": 0.001271696761250496,
-            "per_output_diffs": [
-              0.001271696761250496,
-              0.0
             ]
-          },
-          "forward_T1500": {
-            "ok": true,
-            "max_abs_diff": 0.004102766513824463,
-            "per_output_diffs": [
-              0.004102766513824463,
-              0.0
             ]
-          }
         }
       },
       {
@@ -223,7 +214,7 @@
         "source_artifact": "decoder_step.pt2",
         "output_artifact": "decoder_step.tflite",
         "size_mb": 22.55,
-        "convert_seconds": 2.72,
         "quant": "fp16",
         "torch_output_shapes": [
           [
@@ -315,7 +306,7 @@
         "source_artifact": "joint_step.pt2",
         "output_artifact": "joint_step.tflite",
         "size_mb": 12.08,
-        "convert_seconds": 1.08,
         "quant": "fp16",
         "torch_output_shapes": [
           [
@@ -327,9 +318,9 @@
         ],
         "parity": {
           "ok": true,
-          "max_abs_diff": 0.33984375,
           "per_output_diffs": [
-            0.33984375
           ],
           "tflite_output_shapes": [
             [

     "attention_mode": "rel_pos",
     "att_context_size": null,
     "buckets": [
       {
         "n_mel_frames": 1500,
         "n_encoder_frames": 187,
           128,
           1500
         ],
+        "output_shape": [
+          1,
+          1024,
+          188
+        ],
+        "artifact": "encoder_T1500.pt2",
+        "size_mb": 2366.85
       }
     ],
+    "multisig": false
   },
   "decoder": {
     "num_layers": 2,
     "results": [
       {
         "graph": "encoder",
+        "source_artifact": "encoder_T1500.pt2",
+        "output_artifact": "encoder_T1500.tflite",
+        "size_mb": 1150.88,
+        "convert_seconds": 158.59,
         "quant": "fp16",
+        "torch_output_shapes": [
+          [
+            1,
+            1024,
+            188
+          ],
+          [
+            1
+          ]
         ],
+        "parity": {
+          "ok": true,
+          "max_abs_diff": 0.0,
+          "per_output_diffs": [
+            [
+              "shape mismatch",
+              [
+                1
+              ],
+              [
+                1,
+                1024,
+                188
+              ]
+            ],
+            [
+              "shape mismatch",
+              [
+                1,
+                1024,
+                188
+              ],
+              [
+                1
+              ]
             ]
+          ],
+          "tflite_output_shapes": [
+            [
+              1
+            ],
+            [
+              1,
+              1024,
+              188
             ]
+          ],
+          "torch_output_shapes": [
+            [
+              1,
+              1024,
+              188
+            ],
+            [
+              1
             ]
+          ]
         }
       },
       {
         "source_artifact": "decoder_step.pt2",
         "output_artifact": "decoder_step.tflite",
         "size_mb": 22.55,
+        "convert_seconds": 1.92,
         "quant": "fp16",
         "torch_output_shapes": [
           [
         "source_artifact": "joint_step.pt2",
         "output_artifact": "joint_step.tflite",
         "size_mb": 12.08,
+        "convert_seconds": 1.61,
         "quant": "fp16",
         "torch_output_shapes": [
           [
         ],
         "parity": {
           "ok": true,
+          "max_abs_diff": 0.408447265625,
           "per_output_diffs": [
+            0.408447265625
           ],
           "tflite_output_shapes": [
             [