Upload LiteRT FP16 multi-sig bundle

Files changed (6)

README.md +183 -0
decoder_step.tflite +3 -0
encoder_multisig.tflite +3 -0
joint_step.tflite +3 -0
manifest.json +352 -0
tokenizer.model +3 -0

README.md ADDED

	@@ -0,0 +1,183 @@

+---
+license: cc-by-4.0
+language:
+- en
+- es
+- it
+- de
+- fr
+- pt
+library_name: litert
+base_model: nvidia/parakeet-tdt-0.6b-v3
+tags:
+- automatic-speech-recognition
+- speech
+- audio
+- parakeet
+- tdt
+- litert
+- tflite
+- on-device
+- mobile
+- android
+- streaming
+pipeline_tag: automatic-speech-recognition
+---
+# Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
+This is a [LiteRT](https://ai.google.dev/edge/litert) (TFLite) port of
+[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
+packaged for on-device inference (Android / Mac / embedded) without a Python or
+NeMo runtime dependency.
+For **model capabilities, languages, training data, license, and benchmarks**,
+see the upstream model card. This card only documents what's specific to the
+LiteRT port.
+## What's in this bundle
+| File | Size | Purpose |
+|---|---|---|
+| `encoder_multisig.tflite` | 1.19 GB | FP16 weight-shared encoder, 4 bucket signatures |
+| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
+| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
+| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
+| `manifest.json` | — | All metadata the runtime needs |
+Total: **~1.2 GB** (FP16). FP32 reference is roughly 2.4 GB.
+## Encoder signatures (multi-bucket)
+Weights are shared across 4 fixed-T input shapes via TFLite signatures:
+| Signature | T_mel | Audio | Use |
+|---|---|---|---|
+| `forward_T300` | 300 | 3.0 s | short utterances, low latency |
+| `forward_T500` | 500 | 5.0 s | typical streaming chunks |
+| `forward_T700` | 700 | 7.0 s | medium utterances |
+| `forward_T1500` | 1500 | 15.0 s | long utterances, offline |
+Each signature has the same I/O shape contract:
+```
+inputs:
+  audio_signal : float32 [1, 128, T_mel]   # log-mel features (NeMo preproc)
+  length       : int64   [1]                # actual mel frames used (≤ T_mel)
+outputs:
+  encoded         : float32 [1, 1024, T_enc]  # T_enc = (T_mel - 4) // 8
+  encoded_lengths : int64   [1]
+```
+Pick the smallest bucket that fits your input; pad shorter inputs with trailing
+zeros and pass the true frame count in `length`.
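The bucket rule above can be sketched in a few lines of Python. This is a minimal illustration, not the bundle's runtime: the helper names `pick_bucket`, `pad_to_bucket`, and `encoder_frames` are hypothetical, and the input is assumed to be a `[128, T]` float32 log-mel array that has already been computed.

```python
import numpy as np

# Bucket sizes in mel frames, taken from the signature table above.
BUCKETS = (300, 500, 700, 1500)


def pick_bucket(n_mel_frames: int) -> int:
    """Return the smallest bucket that fits, or raise if the input is too long."""
    for t_mel in BUCKETS:
        if n_mel_frames <= t_mel:
            return t_mel
    raise ValueError(
        f"{n_mel_frames} mel frames exceed the largest bucket ({BUCKETS[-1]})"
    )


def pad_to_bucket(mel: np.ndarray):
    """Zero-pad mel features [n_mels, T] at the tail to the chosen bucket.

    Returns (audio_signal [1, n_mels, T_mel], length [1], signature name).
    """
    n_mels, t = mel.shape
    t_mel = pick_bucket(t)
    padded = np.zeros((1, n_mels, t_mel), dtype=np.float32)
    padded[0, :, :t] = mel  # audio stays anchored at position 0
    length = np.array([t], dtype=np.int64)
    return padded, length, f"forward_T{t_mel}"


def encoder_frames(t_mel: int) -> int:
    """Encoder output frames for a bucket, per T_enc = (T_mel - 4) // 8."""
    return (t_mel - 4) // 8
```

For a 4.2 s clip (420 mel frames), `pad_to_bucket` would select `forward_T500` and report `length = [420]`. The `encoder_frames` values for the four buckets match the `n_encoder_frames` entries in `manifest.json`.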
+## Decoder + joint contract
+```
+decoder_step:
+  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
+  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]
+joint_step:
+  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
+  outputs: logits float32 [1,1,1,8198]
+           # logits[..., 0:8193] → token logits (8192 BPE + 1 blank)
+           # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
+```
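As an illustration of the slicing described in the comments above, the following sketch splits one joint output into a token id and a duration. The function name `split_joint_logits` is hypothetical; the constants come from the contract and from `manifest.json`.

```python
import numpy as np

VOCAB_SIZE = 8192          # BPE tokens; the blank id sits right after them
BLANK_ID = 8192
DURATIONS = (0, 1, 2, 3, 4)


def split_joint_logits(logits: np.ndarray):
    """Split joint output [1, 1, 1, 8198] into (token_id, duration)."""
    flat = logits.reshape(-1)
    expected = VOCAB_SIZE + 1 + len(DURATIONS)
    if flat.shape[0] != expected:
        raise ValueError(f"expected {expected} logits, got {flat.shape[0]}")
    token_logits = flat[: VOCAB_SIZE + 1]        # slice [0, 8193)
    duration_logits = flat[VOCAB_SIZE + 1:]      # slice [8193, 8198)
    token = int(np.argmax(token_logits))
    dur = DURATIONS[int(np.argmax(duration_logits))]
    return token, dur
```

Reading the slice bounds from `joint_token_logits_slice` and `joint_duration_logits_slice` in the manifest, rather than hard-coding them, would keep a runtime robust to a future vocabulary change.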
+Greedy TDT decoding (per encoder frame):
+1. Run joint with current `enc_frame` and last predicted `pred_frame`.
+2. `token = argmax(token_logits)`; `dur = argmax(duration_logits) ∈ {0,1,2,3,4}`.
+3. If `token != blank_id (8192)`: emit the token, advance `dur` encoder frames,
+   and run `decoder_step` on the emitted token to update `pred_frame` and `(h, c)`.
+4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
+5. Repeat until all `encoded_lengths` frames are consumed.
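The greedy loop can be sketched as follows. This is a sketch under stated assumptions, not the reference runner: `decoder_step` and `joint_step` are hypothetical callables wrapping the two `.tflite` graphs; priming the decoder with the blank id and zero state, transposing `g` into `pred_frame`, and the `MAX_SYMBOLS_PER_FRAME` guard against repeated `dur = 0` emissions are all assumptions that the card does not specify. The manifest's parity block lists the TFLite decoder output shapes in a different order than the contract above, so a real wrapper should probably map outputs by name rather than by position.

```python
import numpy as np

BLANK_ID = 8192
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10  # assumption: cap on consecutive dur=0 emissions


def greedy_tdt(encoded, encoded_length, decoder_step, joint_step):
    """Greedy TDT decode.

    encoded: float32 [1, 1024, T_enc]; returns a list of (token, frame_idx).
    decoder_step(token, h, c) -> (g, h, c); joint_step(enc, pred) -> logits.
    """
    h = np.zeros((2, 1, 640), dtype=np.float32)
    c = np.zeros((2, 1, 640), dtype=np.float32)
    # Assumption: the blank id doubles as the start-of-sequence token.
    g, h, c = decoder_step(np.array([[BLANK_ID]], dtype=np.int64), h, c)

    out, t, emitted_here = [], 0, 0
    while t < encoded_length:
        enc_frame = encoded[:, :, t:t + 1]        # [1, 1024, 1]
        pred_frame = np.transpose(g, (0, 2, 1))   # [1, 1, 640] -> [1, 640, 1]
        logits = joint_step(enc_frame, pred_frame).reshape(-1)
        token = int(np.argmax(logits[:BLANK_ID + 1]))
        dur = DURATIONS[int(np.argmax(logits[BLANK_ID + 1:]))]

        if token != BLANK_ID:
            out.append((token, t))
            # Re-prime the prediction network with the emitted token.
            g, h, c = decoder_step(np.array([[token]], dtype=np.int64), h, c)
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1                            # force progress
        else:
            dur = max(dur, 1)                      # blank always advances

        if dur > 0:
            t += dur
            emitted_here = 0
    return out
```

Keeping the frame index next to each token is deliberate: the streaming section relies on that index to deduplicate tokens across overlapping windows.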
+## Audio preprocessing
+LiteRT itself does not produce mel features — your runtime must compute them.
+Match NeMo's preprocessor exactly:
+```
+sample_rate    : 16000 Hz (resample if needed)
+n_fft          : 512
+hop_length     : 160      → 100 mel frames / second
+win_length     : 400
+n_mels         : 128
+preemph        : 0.97
+log            : log10(mel + 1e-5), then per-feature normalization
+```
+Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
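The frame-rate arithmetic follows directly from the parameters above. The helper names below are hypothetical, and the mel frame count is approximate because the exact number depends on how the STFT pads the signal edges.

```python
SAMPLE_RATE = 16000   # Hz
HOP_LENGTH = 160      # samples between mel frames
SUBSAMPLING = 8       # mel frames per encoder frame


def mel_frames_per_second() -> float:
    """16000 / 160 = 100 mel frames per second."""
    return SAMPLE_RATE / HOP_LENGTH


def seconds_to_mel_frames(seconds: float) -> int:
    """Approximate mel frame count for a clip (ignores STFT edge padding)."""
    return int(round(seconds * SAMPLE_RATE / HOP_LENGTH))


def encoder_frame_to_seconds(frame_idx: int) -> float:
    """Start time of an encoder frame: idx * 8 * 160 / 16000 = idx * 0.08 s."""
    return frame_idx * SUBSAMPLING * HOP_LENGTH / SAMPLE_RATE
```

`encoder_frame_to_seconds` is the conversion a runtime would use to turn the frame index of an emitted token into a rough timestamp.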
+## Streaming usage
+This bundle supports chunked streaming inference. A reference Python
+implementation is provided in the upload repo (`transcribe_litert_streaming.py`),
+which produces ~27% WER on multilingual long-form audio at ~2× real-time on CPU
+with `chunk=5s, left=5s, right=2s` (a 12 s window, i.e. 1200 mel frames, which
+needs bucket `forward_T1500`).
+To port the chunker to Android:
+1. Hold a rolling mel buffer (left context + new chunk + right look-ahead).
+2. Pick the smallest bucket ≥ window length and pad to the bucket's T_mel.
+3. Run the encoder signature, then TDT greedy decode over `T_enc` frames.
+4. Deduplicate tokens against the previous chunk's emit window using their
+   `encoder_frame_idx`. Optionally, reuse the LSTM `(h, c)` state across chunks.
+The model is **not** a strict left-only streamer — it sees right context within
+each chunk window. For "real" low-latency streaming, the right-context
+look-ahead can be reduced or removed at a quality cost.
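One way to plan the windows is sketched below. It is not the logic of `transcribe_litert_streaming.py`, which is not shown here: `Window`, `plan_windows`, and `keep_emitted` are hypothetical names, sizes are in mel frames (100 per second), and deduplication is simplified to keeping only tokens whose window-relative encoder frame falls inside the chunk's own emit range. Boundaries are rounded down to whole encoder frames, so they are approximate unless `chunk` and `left` are multiples of 8.

```python
from dataclasses import dataclass

SUBSAMPLING = 8  # mel frames per encoder frame


@dataclass
class Window:
    mel_start: int   # first mel frame fed to the encoder (includes left context)
    mel_end: int     # one past the last mel frame (includes right look-ahead)
    emit_start: int  # first window-relative encoder frame whose tokens are kept
    emit_end: int    # one past the last kept encoder frame


def plan_windows(total_mel, chunk=500, left=500, right=200):
    """Split total_mel frames into overlapping encoder windows."""
    windows = []
    for s in range(0, total_mel, chunk):
        mel_start = max(0, s - left)
        chunk_end = min(total_mel, s + chunk)
        mel_end = min(total_mel, chunk_end + right)
        windows.append(Window(
            mel_start,
            mel_end,
            (s - mel_start) // SUBSAMPLING,
            (chunk_end - mel_start) // SUBSAMPLING,
        ))
    return windows


def keep_emitted(tokens, window):
    """Keep (token, encoder_frame_idx) pairs inside the window's emit range."""
    return [
        (tok, idx) for tok, idx in tokens
        if window.emit_start <= idx < window.emit_end
    ]
```

With the defaults, no window exceeds 1200 mel frames, so every window fits in `forward_T1500`. Each window still has to be copied to the start of a zeroed bucket buffer before the encoder call.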
+## Quantization
+- All `.tflite` weights are FP16. Activations remain FP32 (no activation
+  calibration).
+- Round-trip parity with the upstream FP32 model: identical token output on
+  a 99-clip English eval set (validated with the offline runner). Raw tensors
+  are not bit-identical; `manifest.json` records a `max_abs_diff` per graph.
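The effect of FP16 weights with FP32 activations can be illustrated with plain NumPy. This does not reproduce `ai_edge_quantizer`; the random matrix merely stands in for a layer's weights, and the shapes are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
w32 = rng.standard_normal((640, 1024)).astype(np.float32)  # stand-in weights
x = rng.standard_normal(1024).astype(np.float32)           # FP32 activations

w16 = w32.astype(np.float16)            # stored weights: half the bytes

y_ref = w32 @ x                         # FP32 reference
y_fp16w = w16.astype(np.float32) @ x    # weights upcast, math stays in FP32

# Same kind of metric as the max_abs_diff fields in manifest.json.
max_abs_diff = float(np.max(np.abs(y_ref - y_fp16w)))
print(f"weight bytes: {w32.nbytes} -> {w16.nbytes}, max_abs_diff={max_abs_diff:.5f}")
```

Small per-output differences like this are expected; what matters for transcription is whether the argmax over the logits changes, which is what the token-level parity check above measures.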
+## Conversion provenance
+Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
+1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
+2. **ExportedProgram → TFLite** via
+   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0
+   (`signature(...).add_signature(...).convert()`).
+3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
+   FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
+The encoder graph is exported once with a dynamic time dim, then specialized
+into 4 fixed-T signatures sharing weights. The TFLite serializer dedups the
+weight tensors, so the bundle is the size of one encoder, not four.
+## Limitations & caveats
+- **Bucket positional encoding.** The encoder was trained with audio anchored
+  at position 0 of its input window. Padding *before* the audio causes
+  hallucinations. Always place audio at the start of the bucket buffer and
+  zero-pad the tail.
+- **Long-form clips.** A single bucket call covers at most 15 s. Anything
+  longer must be chunked at the runtime level.
+- **No voice activity detection / diarization.** Pair with a separate VAD or
+  diarizer (e.g. Sortformer, pyannote) for speaker-attributed transcripts.
+## License
+Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0). See
+the upstream model card for full terms.
+## Citation
+If you use this bundle, cite the upstream NeMo model:
+```bibtex
+@misc{nvidia_parakeet_tdt_0_6b_v3,
+  title  = {Parakeet-TDT-0.6B-v3},
+  author = {NVIDIA},
+  year   = {2025},
+  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
+}
+```

decoder_step.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eb0bf3559a0b4cbdc3ca05b7e8ff948ee5ef158ce424667b62a85f6c769a9ce1
+size 23650084

encoder_multisig.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a97075644590cedce95a53083c876f56dce22d2e1e5807bc4ca2d6879f6183c8
+size 1249026196

joint_step.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0e28c22fc426df9900ef4a1bd15760ec757e44f0fd1818e0afb51c4fe79031be
+size 12664976

manifest.json ADDED

	@@ -0,0 +1,352 @@

+{
+  "model": "nvidia/parakeet-tdt-0.6b-v3",
+  "torch_version": "2.11.0+cu130",
+  "model_class": "EncDecRNNTBPEModel",
+  "vocab_size": 8192,
+  "blank_id": 8192,
+  "durations": [
+    0,
+    1,
+    2,
+    3,
+    4
+  ],
+  "num_durations": 5,
+  "joint_output_dim": 8198,
+  "joint_token_logits_slice": [
+    0,
+    8193
+  ],
+  "joint_duration_logits_slice": [
+    8193,
+    8198
+  ],
+  "encoder": {
+    "d_model": 1024,
+    "subsampling_factor": 8,
+    "n_layers": 24,
+    "n_heads": 8,
+    "feat_in": 128,
+    "buckets": [
+      {
+        "n_mel_frames": 300,
+        "n_encoder_frames": 37,
+        "input_shape": [
+          1,
+          128,
+          300
+        ],
+        "signature": "forward_T300"
+      },
+      {
+        "n_mel_frames": 500,
+        "n_encoder_frames": 62,
+        "input_shape": [
+          1,
+          128,
+          500
+        ],
+        "signature": "forward_T500"
+      },
+      {
+        "n_mel_frames": 700,
+        "n_encoder_frames": 87,
+        "input_shape": [
+          1,
+          128,
+          700
+        ],
+        "signature": "forward_T700"
+      },
+      {
+        "n_mel_frames": 1500,
+        "n_encoder_frames": 187,
+        "input_shape": [
+          1,
+          128,
+          1500
+        ],
+        "signature": "forward_T1500"
+      }
+    ],
+    "multisig": true,
+    "dynamic_artifact": "encoder_dynamicT.pt2",
+    "dynamic_artifact_size_mb": 2367.32
+  },
+  "decoder": {
+    "num_layers": 2,
+    "hidden": 640,
+    "embed_dim": 640
+  },
+  "joint": {
+    "d_enc": 1024,
+    "d_pred": 640,
+    "joint_dim": 640
+  },
+  "preprocessor": {
+    "sample_rate": 16000,
+    "n_fft": 512,
+    "win_length": 400,
+    "hop_length": 160,
+    "n_mels": 128,
+    "preemph": 0.97,
+    "log": true,
+    "frame_rate_hz_post_subsample": 12.5
+  },
+  "artifacts": {
+    "decoder_step": {
+      "filename": "decoder_step.pt2",
+      "size_mb": 45.07,
+      "input_shapes": {
+        "token": [
+          1,
+          1
+        ],
+        "h": [
+          2,
+          1,
+          640
+        ],
+        "c": [
+          2,
+          1,
+          640
+        ]
+      },
+      "input_dtypes": {
+        "token": "int64",
+        "h": "float32",
+        "c": "float32"
+      },
+      "output_shapes": {
+        "g": [
+          1,
+          1,
+          640
+        ],
+        "h": [
+          2,
+          1,
+          640
+        ],
+        "c": [
+          2,
+          1,
+          640
+        ]
+      }
+    },
+    "joint_step": {
+      "filename": "joint_step.pt2",
+      "size_mb": 24.14,
+      "input_shapes": {
+        "enc_frame": [
+          1,
+          1024,
+          1
+        ],
+        "pred_frame": [
+          1,
+          640,
+          1
+        ]
+      },
+      "output_shape": [
+        1,
+        1,
+        1,
+        8198
+      ]
+    }
+  },
+  "tokenizer": {
+    "saved": true,
+    "method": "serialized_model_proto",
+    "vocab_size": 8192
+  },
+  "litert": {
+    "quant": "fp16",
+    "results": [
+      {
+        "graph": "encoder",
+        "source_artifact": "encoder_dynamicT.pt2",
+        "output_artifact": "encoder_multisig.tflite",
+        "size_mb": 1191.16,
+        "convert_seconds": 402.16,
+        "quant": "fp16",
+        "multisig": true,
+        "signatures": [
+          "forward_T300",
+          "forward_T500",
+          "forward_T700",
+          "forward_T1500"
+        ],
+        "parity_per_signature": {
+          "forward_T300": {
+            "ok": true,
+            "max_abs_diff": 0.0033329054713249207,
+            "per_output_diffs": [
+              0.0033329054713249207,
+              0.0
+            ]
+          },
+          "forward_T500": {
+            "ok": true,
+            "max_abs_diff": 0.006780040450394154,
+            "per_output_diffs": [
+              0.006780040450394154,
+              0.0
+            ]
+          },
+          "forward_T700": {
+            "ok": true,
+            "max_abs_diff": 0.0005690590478479862,
+            "per_output_diffs": [
+              0.0005690590478479862,
+              0.0
+            ]
+          },
+          "forward_T1500": {
+            "ok": true,
+            "max_abs_diff": 0.003892328590154648,
+            "per_output_diffs": [
+              0.003892328590154648,
+              0.0
+            ]
+          }
+        }
+      },
+      {
+        "graph": "decoder_step",
+        "source_artifact": "decoder_step.pt2",
+        "output_artifact": "decoder_step.tflite",
+        "size_mb": 22.55,
+        "convert_seconds": 3.81,
+        "quant": "fp16",
+        "torch_output_shapes": [
+          [
+            1,
+            1,
+            640
+          ],
+          [
+            2,
+            1,
+            640
+          ],
+          [
+            2,
+            1,
+            640
+          ]
+        ],
+        "parity": {
+          "ok": true,
+          "max_abs_diff": 0.0044100284576416016,
+          "per_output_diffs": [
+            [
+              "shape mismatch",
+              [
+                2,
+                1,
+                640
+              ],
+              [
+                1,
+                1,
+                640
+              ]
+            ],
+            [
+              "shape mismatch",
+              [
+                1,
+                1,
+                640
+              ],
+              [
+                2,
+                1,
+                640
+              ]
+            ],
+            0.0044100284576416016
+          ],
+          "tflite_output_shapes": [
+            [
+              2,
+              1,
+              640
+            ],
+            [
+              1,
+              1,
+              640
+            ],
+            [
+              2,
+              1,
+              640
+            ]
+          ],
+          "torch_output_shapes": [
+            [
+              1,
+              1,
+              640
+            ],
+            [
+              2,
+              1,
+              640
+            ],
+            [
+              2,
+              1,
+              640
+            ]
+          ]
+        }
+      },
+      {
+        "graph": "joint_step",
+        "source_artifact": "joint_step.pt2",
+        "output_artifact": "joint_step.tflite",
+        "size_mb": 12.08,
+        "convert_seconds": 1.13,
+        "quant": "fp16",
+        "torch_output_shapes": [
+          [
+            1,
+            1,
+            1,
+            8198
+          ]
+        ],
+        "parity": {
+          "ok": true,
+          "max_abs_diff": 0.275390625,
+          "per_output_diffs": [
+            0.275390625
+          ],
+          "tflite_output_shapes": [
+            [
+              1,
+              1,
+              1,
+              8198
+            ]
+          ],
+          "torch_output_shapes": [
+            [
+              1,
+              1,
+              1,
+              8198
+            ]
+          ]
+        }
+      }
+    ]
+  }
+}

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eacec2b0a77f336d4a2ca4a25a7047575d3c2b74de47e997f4c205126ed3135e
+size 360916