Upload LiteRT FP16 multi-sig bundle
- README.md +183 -0
- decoder_step.tflite +3 -0
- encoder_multisig.tflite +3 -0
- joint_step.tflite +3 -0
- manifest.json +352 -0
- tokenizer.model +3 -0
README.md
ADDED
@@ -0,0 +1,183 @@
---
license: cc-by-4.0
language:
- en
- es
- it
- de
- fr
- pt
library_name: litert
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- speech
- audio
- parakeet
- tdt
- litert
- tflite
- on-device
- mobile
- android
- streaming
pipeline_tag: automatic-speech-recognition
---

# Parakeet-TDT-0.6B-v3: LiteRT (TFLite) port

This is a [LiteRT](https://ai.google.dev/edge/litert) (TFLite) port of
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
packaged for on-device inference (Android / Mac / embedded) without a Python or
NeMo runtime dependency.

For **model capabilities, languages, training data, license, and benchmarks**,
see the upstream model card. This card only documents what's specific to the
LiteRT port.

## What's in this bundle

| File | Size | Purpose |
|---|---|---|
| `encoder_multisig.tflite` | 1191 MiB | FP16 weight-shared encoder, 4 bucket signatures |
| `decoder_step.tflite` | 23 MiB | Single-step LSTM prediction network |
| `joint_step.tflite` | 12 MiB | TDT joint network (token + duration logits) |
| `tokenizer.model` | 352 KiB | SentencePiece BPE tokenizer (vocab=8192) |
| `manifest.json` | - | All metadata the runtime needs |

Total: **~1.2 GiB** (FP16). The FP32 reference is roughly 2.4 GiB.

## Encoder signatures (multi-bucket)

Weights are shared across 4 fixed-T input shapes via TFLite signatures:

| Signature | T_mel | Audio | Use |
|---|---|---|---|
| `forward_T300` | 300 | 3.0 s | short utterances, low latency |
| `forward_T500` | 500 | 5.0 s | typical streaming chunks |
| `forward_T700` | 700 | 7.0 s | medium utterances |
| `forward_T1500` | 1500 | 15.0 s | long utterances, offline |

Each signature has the same I/O shape contract:

```
inputs:
  audio_signal    : float32 [1, 128, T_mel]   # log-mel features (NeMo preproc)
  length          : int64   [1]               # actual mel frames used (≤ T_mel)
outputs:
  encoded         : float32 [1, 1024, T_enc]  # T_enc = (T_mel - 4) // 8
  encoded_lengths : int64   [1]
```

Pick the smallest bucket that fits your input; pad shorter inputs with zeros
and pass the true length.
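
The bucket rule above can be sketched in a few lines of plain Python. This is only an illustration: the bucket sizes and the `T_enc` formula come from the tables above, while `pick_bucket`, `encoder_frames`, and `pad_to_bucket` are hypothetical helper names that are not part of the bundle.

```python
from typing import List, Tuple

# Bucket sizes copied from the signature table above.
BUCKETS = (300, 500, 700, 1500)  # T_mel per signature
N_MELS = 128


def pick_bucket(n_mel_frames: int) -> int:
    """Return the smallest bucket T_mel that fits n_mel_frames."""
    for t_mel in BUCKETS:
        if n_mel_frames <= t_mel:
            return t_mel
    raise ValueError("input exceeds the largest bucket; chunk the audio first")


def encoder_frames(t_mel: int) -> int:
    """T_enc = (T_mel - 4) // 8, as stated in the I/O contract."""
    return (t_mel - 4) // 8


def pad_to_bucket(mel: List[List[float]]) -> Tuple[str, List[List[float]], int]:
    """Zero-pad a [128][T] mel matrix at the tail.

    Returns (signature name, padded matrix, true length to pass as `length`).
    """
    if len(mel) != N_MELS:
        raise ValueError("expected 128 mel rows")
    true_len = len(mel[0])
    t_mel = pick_bucket(true_len)
    padded = [row + [0.0] * (t_mel - true_len) for row in mel]
    return f"forward_T{t_mel}", padded, true_len


if __name__ == "__main__":
    mel = [[0.1] * 420 for _ in range(N_MELS)]  # 4.2 s of dummy features
    signature, padded, length = pad_to_bucket(mel)
    # prints: forward_T500 500 420 62
    print(signature, len(padded[0]), length, encoder_frames(len(padded[0])))
```

The padding goes at the tail so that the audio stays anchored at position 0 of the bucket buffer.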

## Decoder + joint contract

```
decoder_step:
  inputs:  token int64 [1,1], h float32 [2,1,640], c float32 [2,1,640]
  outputs: g float32 [1,1,640], h float32 [2,1,640], c float32 [2,1,640]

joint_step:
  inputs:  enc_frame float32 [1,1024,1], pred_frame float32 [1,640,1]
  outputs: logits float32 [1,1,1,8198]
  # logits[..., 0:8193]    → token logits (8192 BPE + 1 blank)
  # logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
```

Greedy TDT decoding (per encoder frame):

1. Run `joint_step` with the current `enc_frame` and the latest `pred_frame`.
2. `token = argmax(token_logits)`; `dur = argmax(duration_logits) ∈ {0,1,2,3,4}`.
3. If `token != blank_id` (8192): emit the token, advance `dur` encoder frames,
   and re-prime the decoder with the emitted token (this updates `h`, `c`, and
   the next `pred_frame`).
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
5. Repeat until `encoded_lengths` frames have been consumed.
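
The decoding steps above can be sketched as a small loop. This is a minimal sketch, not the reference implementation: `joint` and `decoder` are hypothetical callables wrapping `joint_step` and `decoder_step`, priming the decoder with the blank id is an assumption, and so is the `MAX_SYMBOLS_PER_FRAME` cap, which only guards against an endless run of zero-duration emissions.

```python
from typing import Callable, List, Sequence, Tuple

BLANK_ID = 8192             # from the contract above
N_TOKEN_LOGITS = 8193       # 8192 BPE tokens + 1 blank
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10  # assumption: safety cap, not stated in this card


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_tdt(
    n_frames: int,
    joint: Callable[[int, object], Sequence[float]],
    decoder: Callable[[int, object], Tuple[object, object]],
    init_state: object,
) -> List[Tuple[int, int]]:
    """Return (token, encoder_frame_idx) pairs.

    joint(t, pred) yields the 8198 logits for encoder frame t.
    decoder(token, state) yields (pred, new_state).
    """
    # Assumption: the decoder is primed with the blank id before the first frame.
    pred, state = decoder(BLANK_ID, init_state)
    out: List[Tuple[int, int]] = []
    t = 0
    emitted_here = 0
    while t < n_frames:
        logits = joint(t, pred)
        token = argmax(logits[:N_TOKEN_LOGITS])
        dur = DURATIONS[argmax(logits[N_TOKEN_LOGITS:])]
        if token != BLANK_ID:
            out.append((token, t))
            pred, state = decoder(token, state)  # refresh pred and (h, c)
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1  # force progress
        else:
            dur = max(dur, 1)  # blank never stalls on the same frame
        if dur > 0:
            t += dur
            emitted_here = 0
    return out
```

Keeping the encoder frame index next to each token is what later allows chunk-level deduplication in the streaming setup.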

## Audio preprocessing

LiteRT itself does not produce mel features; your runtime must compute them.
Match NeMo's preprocessor exactly:

```
sample_rate : 16000 Hz (resample if needed)
n_fft       : 512
hop_length  : 160        → 100 mel frames / second
win_length  : 400
n_mels      : 128
preemph     : 0.97
log         : log10(mel + 1e-5) per-feature normalized
```

Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
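
The frame-rate figures follow directly from the parameters above. As a quick arithmetic check (the helper names `enc_frame_to_seconds` and `bucket_seconds` are made up for this illustration):

```python
SAMPLE_RATE = 16_000  # Hz
HOP_LENGTH = 160      # samples between mel frames
SUBSAMPLING = 8       # encoder subsampling factor

mel_fps = SAMPLE_RATE / HOP_LENGTH  # 100.0 mel frames per second
enc_fps = mel_fps / SUBSAMPLING     # 12.5 encoder frames per second
enc_frame_ms = 1000.0 / enc_fps     # 80.0 ms per encoder frame


def enc_frame_to_seconds(frame_idx: int) -> float:
    """Approximate start time of an encoder frame within its window."""
    return frame_idx / enc_fps


def bucket_seconds(t_mel: int) -> float:
    """Audio duration covered by a bucket of t_mel mel frames."""
    return t_mel / mel_fps


if __name__ == "__main__":
    print(mel_fps, enc_fps, enc_frame_ms)  # prints: 100.0 12.5 80.0
    print(bucket_seconds(1500))            # prints: 15.0
```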

## Streaming usage

This bundle supports chunked streaming inference. A reference Python
implementation (`transcribe_litert_streaming.py`) is provided in the upload
repo; it reaches ~27% WER on multilingual long-form audio at ~2× real-time on
CPU with `chunk=5s, left=5s, right=2s` (12 s window, bucket `forward_T1500`).

To port the chunker to Android:

1. Maintain a rolling mel buffer (left context + new chunk + right look-ahead).
2. Pick the smallest bucket ≥ the window length and zero-pad the tail up to the
   bucket's T_mel.
3. Run the encoder signature, then greedy TDT decoding over the `T_enc` frames.
4. Deduplicate tokens against the previous chunk's emit window using their
   `encoder_frame_idx`. Optionally reuse the LSTM `(h, c)` state across chunks.

The model is **not** a strict left-only streamer: it sees right context within
each chunk window. For "real" low-latency streaming, the right-context
look-ahead can be reduced or removed at a quality cost.
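
One way to lay out the chunk windows described above is sketched here. It is not taken from `transcribe_litert_streaming.py`: the `Window` class, `plan_windows`, and `keep_emitted` are hypothetical names, and mapping a mel frame to an encoder frame with `// 8` is an approximation of the real subsampler alignment.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

MEL_FPS = 100                    # hop_length=160 at 16 kHz
SUBSAMPLING = 8
BUCKETS = (300, 500, 700, 1500)  # T_mel per encoder signature


@dataclass
class Window:
    mel_lo: int   # first mel frame fed to the encoder
    mel_hi: int   # one past the last mel frame fed to the encoder
    bucket: int   # T_mel of the signature to run
    emit_lo: int  # first window-relative encoder frame whose tokens are kept
    emit_hi: int  # one past the last kept encoder frame


def pick_bucket(n_mel: int) -> int:
    for b in BUCKETS:
        if n_mel <= b:
            return b
    raise ValueError("window exceeds the largest bucket")


def plan_windows(total_mel: int, chunk_s: float = 5.0,
                 left_s: float = 5.0, right_s: float = 2.0) -> Iterator[Window]:
    """Yield one window per chunk: left context + chunk + right look-ahead."""
    chunk, left, right = (int(s * MEL_FPS) for s in (chunk_s, left_s, right_s))
    for start in range(0, total_mel, chunk):
        lo = max(0, start - left)
        hi = min(total_mel, start + chunk + right)
        # Approximation: encoder frame index ~= mel frame index // 8.
        emit_lo = (start - lo) // SUBSAMPLING
        emit_hi = (min(start + chunk, total_mel) - lo) // SUBSAMPLING
        yield Window(lo, hi, pick_bucket(hi - lo), emit_lo, emit_hi)


def keep_emitted(tokens: List[Tuple[int, int]], w: Window) -> List[Tuple[int, int]]:
    """Keep (token, encoder_frame_idx) pairs that fall inside the emit range."""
    return [(tok, idx) for tok, idx in tokens if w.emit_lo <= idx < w.emit_hi]
```

With the default `chunk=5s, left=5s, right=2s`, a full window spans 1200 mel frames and therefore lands in the `forward_T1500` bucket, matching the configuration quoted above.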

## Quantization

- All `.tflite` weights are FP16. Activations remain FP32 (no activation
  calibration).
- Round-trip parity with the upstream FP32 model: bit-identical token output on
  a 99-clip English eval set (validated with the offline runner).
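
To get a feel for how small the error from an FP16 weight cast is, the round trip can be reproduced with the standard library alone. This only illustrates IEEE 754 half-precision rounding; it is not how `ai_edge_quantizer` is implemented, and the sample weights are made up.

```python
import struct


def fp16_round_trip(x: float) -> float:
    """Cast a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    weights = [0.1, -0.3337, 1.0, 1e-3]  # hypothetical weight values
    for w in weights:
        r = fp16_round_trip(w)
        # Relative error of a normal FP16 value is at most 2**-11.
        print(w, r, abs(w - r) / abs(w))
```

Differences of this size in the weights are consistent with small logit drift leaving the greedy argmax, and therefore the emitted tokens, unchanged.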

## Conversion provenance

Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:

1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
2. **ExportedProgram → TFLite** via
   [`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0
   (`signature(...).add_signature(...).convert()`).
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
   FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.

The encoder graph is exported once with a dynamic time dim, then specialized
into 4 fixed-T signatures sharing weights. The TFLite serializer dedups the
weight tensors, so the bundle is the size of one encoder, not four.

## Limitations & caveats

- **Bucket positional encoding.** The encoder was trained with audio anchored
  at position 0 of its input window. Padding *before* the audio causes
  hallucinations. Always place audio at the start of the bucket buffer and
  zero-pad the tail.
- **Long-form clips.** A single bucket call covers at most 15 s. Anything
  longer must be chunked at the runtime level.
- **No voice activity detection / diarization.** Pair with a separate VAD or
  diarizer (e.g. Sortformer, pyannote) for speaker-attributed transcripts.

## License

Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0). See
the upstream model card for full terms.

## Citation

If you use this bundle, cite the upstream NeMo model:

```bibtex
@misc{nvidia_parakeet_tdt_0_6b_v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3},
}
```
decoder_step.tflite
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eb0bf3559a0b4cbdc3ca05b7e8ff948ee5ef158ce424667b62a85f6c769a9ce1
size 23650084
encoder_multisig.tflite
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a97075644590cedce95a53083c876f56dce22d2e1e5807bc4ca2d6879f6183c8
size 1249026196
joint_step.tflite
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0e28c22fc426df9900ef4a1bd15760ec757e44f0fd1818e0afb51c4fe79031be
size 12664976
manifest.json
ADDED
@@ -0,0 +1,352 @@
{
  "model": "nvidia/parakeet-tdt-0.6b-v3",
  "torch_version": "2.11.0+cu130",
  "model_class": "EncDecRNNTBPEModel",
  "vocab_size": 8192,
  "blank_id": 8192,
  "durations": [
    0,
    1,
    2,
    3,
    4
  ],
  "num_durations": 5,
  "joint_output_dim": 8198,
  "joint_token_logits_slice": [
    0,
    8193
  ],
  "joint_duration_logits_slice": [
    8193,
    8198
  ],
  "encoder": {
    "d_model": 1024,
    "subsampling_factor": 8,
    "n_layers": 24,
    "n_heads": 8,
    "feat_in": 128,
    "buckets": [
      {
        "n_mel_frames": 300,
        "n_encoder_frames": 37,
        "input_shape": [
          1,
          128,
          300
        ],
        "signature": "forward_T300"
      },
      {
        "n_mel_frames": 500,
        "n_encoder_frames": 62,
        "input_shape": [
          1,
          128,
          500
        ],
        "signature": "forward_T500"
      },
      {
        "n_mel_frames": 700,
        "n_encoder_frames": 87,
        "input_shape": [
          1,
          128,
          700
        ],
        "signature": "forward_T700"
      },
      {
        "n_mel_frames": 1500,
        "n_encoder_frames": 187,
        "input_shape": [
          1,
          128,
          1500
        ],
        "signature": "forward_T1500"
      }
    ],
    "multisig": true,
    "dynamic_artifact": "encoder_dynamicT.pt2",
    "dynamic_artifact_size_mb": 2367.32
  },
  "decoder": {
    "num_layers": 2,
    "hidden": 640,
    "embed_dim": 640
  },
  "joint": {
    "d_enc": 1024,
    "d_pred": 640,
    "joint_dim": 640
  },
  "preprocessor": {
    "sample_rate": 16000,
    "n_fft": 512,
    "win_length": 400,
    "hop_length": 160,
    "n_mels": 128,
    "preemph": 0.97,
    "log": true,
    "frame_rate_hz_post_subsample": 12.5
  },
  "artifacts": {
    "decoder_step": {
      "filename": "decoder_step.pt2",
      "size_mb": 45.07,
      "input_shapes": {
        "token": [
          1,
          1
        ],
        "h": [
          2,
          1,
          640
        ],
        "c": [
          2,
          1,
          640
        ]
      },
      "input_dtypes": {
        "token": "int64",
        "h": "float32",
        "c": "float32"
      },
      "output_shapes": {
        "g": [
          1,
          1,
          640
        ],
        "h": [
          2,
          1,
          640
        ],
        "c": [
          2,
          1,
          640
        ]
      }
    },
    "joint_step": {
      "filename": "joint_step.pt2",
      "size_mb": 24.14,
      "input_shapes": {
        "enc_frame": [
          1,
          1024,
          1
        ],
        "pred_frame": [
          1,
          640,
          1
        ]
      },
      "output_shape": [
        1,
        1,
        1,
        8198
      ]
    }
  },
  "tokenizer": {
    "saved": true,
    "method": "serialized_model_proto",
    "vocab_size": 8192
  },
  "litert": {
    "quant": "fp16",
    "results": [
      {
        "graph": "encoder",
        "source_artifact": "encoder_dynamicT.pt2",
        "output_artifact": "encoder_multisig.tflite",
        "size_mb": 1191.16,
        "convert_seconds": 402.16,
        "quant": "fp16",
        "multisig": true,
        "signatures": [
          "forward_T300",
          "forward_T500",
          "forward_T700",
          "forward_T1500"
        ],
        "parity_per_signature": {
          "forward_T300": {
            "ok": true,
            "max_abs_diff": 0.0033329054713249207,
            "per_output_diffs": [
              0.0033329054713249207,
              0.0
            ]
          },
          "forward_T500": {
            "ok": true,
            "max_abs_diff": 0.006780040450394154,
            "per_output_diffs": [
              0.006780040450394154,
              0.0
            ]
          },
          "forward_T700": {
            "ok": true,
            "max_abs_diff": 0.0005690590478479862,
            "per_output_diffs": [
              0.0005690590478479862,
              0.0
            ]
          },
          "forward_T1500": {
            "ok": true,
            "max_abs_diff": 0.003892328590154648,
            "per_output_diffs": [
              0.003892328590154648,
              0.0
            ]
          }
        }
      },
      {
        "graph": "decoder_step",
        "source_artifact": "decoder_step.pt2",
        "output_artifact": "decoder_step.tflite",
        "size_mb": 22.55,
        "convert_seconds": 3.81,
        "quant": "fp16",
        "torch_output_shapes": [
          [
            1,
            1,
            640
          ],
          [
            2,
            1,
            640
          ],
          [
            2,
            1,
            640
          ]
        ],
        "parity": {
          "ok": true,
          "max_abs_diff": 0.0044100284576416016,
          "per_output_diffs": [
            [
              "shape mismatch",
              [
                2,
                1,
                640
              ],
              [
                1,
                1,
                640
              ]
            ],
            [
              "shape mismatch",
              [
                1,
                1,
                640
              ],
              [
                2,
                1,
                640
              ]
            ],
            0.0044100284576416016
          ],
          "tflite_output_shapes": [
            [
              2,
              1,
              640
            ],
            [
              1,
              1,
              640
            ],
            [
              2,
              1,
              640
            ]
          ],
          "torch_output_shapes": [
            [
              1,
              1,
              640
            ],
            [
              2,
              1,
              640
            ],
            [
              2,
              1,
              640
            ]
          ]
        }
      },
      {
        "graph": "joint_step",
        "source_artifact": "joint_step.pt2",
        "output_artifact": "joint_step.tflite",
        "size_mb": 12.08,
        "convert_seconds": 1.13,
        "quant": "fp16",
        "torch_output_shapes": [
          [
            1,
            1,
            1,
            8198
          ]
        ],
        "parity": {
          "ok": true,
          "max_abs_diff": 0.275390625,
          "per_output_diffs": [
            0.275390625
          ],
          "tflite_output_shapes": [
            [
              1,
              1,
              1,
              8198
            ]
          ],
          "torch_output_shapes": [
            [
              1,
              1,
              1,
              8198
            ]
          ]
        }
      }
    ]
  }
}
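
A runtime that loads `manifest.json` may want to sanity-check it before wiring up the decoder. The sketch below works on a hand-trimmed excerpt of the manifest above (the `input_shape` entries are left out), and `check_manifest` is a hypothetical helper, not something shipped with the bundle.

```python
import json

# Hand-trimmed excerpt of the manifest above; values are copied verbatim.
MANIFEST_EXCERPT = """
{
  "vocab_size": 8192,
  "blank_id": 8192,
  "num_durations": 5,
  "joint_output_dim": 8198,
  "joint_token_logits_slice": [0, 8193],
  "joint_duration_logits_slice": [8193, 8198],
  "encoder": {
    "subsampling_factor": 8,
    "buckets": [
      {"n_mel_frames": 300, "n_encoder_frames": 37, "signature": "forward_T300"},
      {"n_mel_frames": 500, "n_encoder_frames": 62, "signature": "forward_T500"},
      {"n_mel_frames": 700, "n_encoder_frames": 87, "signature": "forward_T700"},
      {"n_mel_frames": 1500, "n_encoder_frames": 187, "signature": "forward_T1500"}
    ]
  }
}
"""


def check_manifest(m: dict) -> list:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    tok_lo, tok_hi = m["joint_token_logits_slice"]
    dur_lo, dur_hi = m["joint_duration_logits_slice"]
    if tok_hi - tok_lo != m["vocab_size"] + 1:
        problems.append("token slice should cover vocab_size + 1 (blank) logits")
    if not tok_lo <= m["blank_id"] < tok_hi:
        problems.append("blank_id falls outside the token slice")
    if dur_lo != tok_hi or dur_hi - dur_lo != m["num_durations"]:
        problems.append("duration slice does not follow the token slice")
    if dur_hi != m["joint_output_dim"]:
        problems.append("slices do not add up to joint_output_dim")
    factor = m["encoder"]["subsampling_factor"]
    for b in m["encoder"]["buckets"]:
        if b["n_encoder_frames"] != (b["n_mel_frames"] - 4) // factor:
            problems.append("unexpected n_encoder_frames for " + b["signature"])
    return problems


if __name__ == "__main__":
    print(check_manifest(json.loads(MANIFEST_EXCERPT)))  # prints: []
```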
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eacec2b0a77f336d4a2ca4a25a7047575d3c2b74de47e997f4c205126ed3135e
size 360916
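
The three-line entries shown for the `.tflite` files and `tokenizer.model` are Git LFS pointers: they record the SHA-256 and byte size of the real file. After downloading an artifact, it can be checked against its pointer roughly as follows. This is a sketch; `parse_lfs_pointer` and `verify_file` are hypothetical helper names.

```python
import hashlib
from typing import Dict, Tuple

LFS_VERSION = "https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text: str) -> Tuple[str, int]:
    """Return (sha256 hex digest, size in bytes) from a Git LFS pointer."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if fields.get("version") != LFS_VERSION:
        raise ValueError("not a Git LFS v1 pointer")
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError("unsupported oid algorithm: " + algo)
    return digest, int(fields["size"])


def verify_file(path: str, digest: str, size: int, chunk: int = 1 << 20) -> bool:
    """Stream the file and compare its size and SHA-256 with the pointer."""
    h = hashlib.sha256()
    n = 0
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
            n += len(block)
    return n == size and h.hexdigest() == digest


if __name__ == "__main__":
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:eacec2b0a77f336d4a2ca4a25a7047575d3c2b74de47e997f4c205126ed3135e\n"
        "size 360916\n"
    )
    print(parse_lfs_pointer(pointer))
```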