Automatic Speech Recognition
LiteRT
speech
audio
parakeet
tdt
on-device
mobile
android
streaming
Upload LiteRT FP16 bundle
Browse files- README.md +80 -65
- encoder_T1500.tflite +3 -0
- manifest.json +69 -78
README.md
CHANGED
|
@@ -26,10 +26,10 @@ pipeline_tag: automatic-speech-recognition
|
|
| 26 |
|
| 27 |
# Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
|
| 28 |
|
| 29 |
-
|
| 30 |
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
|
| 31 |
-
packaged for on-device inference (Android / Mac / embedded) without a Python
|
| 32 |
-
NeMo runtime dependency.
|
| 33 |
|
| 34 |
For **model capabilities, languages, training data, license, and benchmarks**,
|
| 35 |
see the upstream model card. This card only documents what's specific to the
|
|
@@ -39,46 +39,53 @@ LiteRT port.
|
|
| 39 |
|
| 40 |
| File | Size | Purpose |
|
| 41 |
|---|---|---|
|
| 42 |
-
| `
|
| 43 |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
|
| 44 |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
|
| 45 |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
|
| 46 |
| `manifest.json` | — | All metadata the runtime needs |
|
| 47 |
|
| 48 |
-
Total: **~1.
|
| 49 |
|
| 50 |
-
## Encoder
|
| 51 |
-
|
| 52 |
-
Weights are shared across 4 fixed-T input shapes via TFLite signatures:
|
| 53 |
-
|
| 54 |
-
| Signature | T_mel | Audio | Use |
|
| 55 |
-
|---|---|---|---|
|
| 56 |
-
| `forward_T300` | 300 | 3.0 s | short utterances, low latency |
|
| 57 |
-
| `forward_T500` | 500 | 5.0 s | typical streaming chunks |
|
| 58 |
-
| `forward_T700` | 700 | 7.0 s | medium utterances |
|
| 59 |
-
| `forward_T1500` | 1500 | 15.0 s | long utterances, offline |
|
| 60 |
-
|
| 61 |
-
Each signature has the same I/O shape contract:
|
| 62 |
|
| 63 |
```
|
| 64 |
inputs:
|
| 65 |
-
audio_signal : float32 [1, 128,
|
| 66 |
-
length : int32 [1]
|
| 67 |
outputs:
|
| 68 |
-
encoded : float32 [1, 1024,
|
| 69 |
encoded_lengths : int32 [1]
|
| 70 |
```
|
| 71 |
|
| 72 |
-
|
| 73 |
-
|
| 74 |
|
| 75 |
**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
|
| 76 |
NPU accelerator) reject int64 tensors entirely. With int64 length, every
|
| 77 |
-
internal CAST node touching it falls back to CPU and `CompiledModel.create()`
|
| 78 |
fails outright on Android with the GPU backend. This bundle is exported with
|
| 79 |
int32 length end-to-end (input → internal mask arange/comparisons → output
|
| 80 |
-
`encoded_lengths`). int32 covers >2 billion mel frames (~5 hours
|
| 81 |
-
practical range loss.
|
| 82 |
|
| 83 |
## Decoder + joint contract
|
| 84 |
|
|
@@ -94,19 +101,25 @@ joint_step:
|
|
| 94 |
# logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
|
| 95 |
```
|
| 96 |
|
| 97 |
Greedy TDT decoding (per encoder frame):
|
| 98 |
|
| 99 |
1. Run joint with current `enc_frame` and last predicted `pred_frame`.
|
| 100 |
-
2. `token = argmax(token_logits)`; `dur = argmax(duration_logits) ∈ {0,1,2,3,4}`.
|
| 101 |
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
|
| 102 |
re-prime decoder with the emitted token (h, c update).
|
| 103 |
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
|
| 104 |
5. Repeat until `enc_lengths` is exhausted.
|
| 105 |
|
| 106 |
## Audio preprocessing
|
| 107 |
|
| 108 |
-
LiteRT itself does not produce mel features — your runtime must compute
|
| 109 |
-
Match NeMo's preprocessor exactly:
|
| 110 |
|
| 111 |
```
|
| 112 |
sample_rate : 16000 Hz (resample if needed)
|
|
@@ -115,36 +128,35 @@ hop_length : 160 → 100 mel frames / second
|
|
| 115 |
win_length : 400
|
| 116 |
n_mels : 128
|
| 117 |
preemph : 0.97
|
| 118 |
-
log :
|
|
|
|
| 119 |
```
|
| 120 |
|
| 121 |
Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
|
| 122 |
|
| 123 |
## Streaming usage
|
| 124 |
|
| 125 |
-
This bundle supports chunked streaming inference
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
-
|
| 130 |
-
For Android, port the chunker by:
|
| 131 |
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
look-ahead can be reduced or removed at a quality cost.
|
| 141 |
|
| 142 |
## Quantization
|
| 143 |
|
| 144 |
-
- All `.tflite` weights are FP16. Activations remain FP32
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
a 99-clip English eval set (validated with the offline runner).
|
| 148 |
|
| 149 |
## Conversion provenance
|
| 150 |
|
|
@@ -152,35 +164,38 @@ Built from upstream `nvidia/parakeet-tdt-0.6b-v3.nemo` via:
|
|
| 152 |
|
| 153 |
1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
|
| 154 |
2. **ExportedProgram → TFLite** via
|
| 155 |
-
[`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0
|
| 156 |
-
(`signature(...).add_signature(...).convert()`).
|
| 157 |
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
|
| 158 |
FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
|
| 159 |
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
|
| 164 |
-
## Limitations
|
| 165 |
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
diarizer (e.g. Sortformer, pyannote) for speaker-attributed transcripts.
|
| 174 |
|
| 175 |
## License
|
| 176 |
|
| 177 |
-
Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
|
| 178 |
-
the upstream model card for full terms.
|
| 179 |
|
| 180 |
## Citation
|
| 181 |
|
| 182 |
-
If you use this bundle, cite the upstream NeMo model:
|
| 183 |
-
|
| 184 |
```bibtex
|
| 185 |
@misc{nvidia_parakeet_tdt_0_6b_v3,
|
| 186 |
title = {Parakeet-TDT-0.6B-v3},
|
|
|
|
| 26 |
|
| 27 |
# Parakeet-TDT-0.6B-v3 — LiteRT (TFLite) port
|
| 28 |
|
| 29 |
+
LiteRT (TFLite) port of
|
| 30 |
[`nvidia/parakeet-tdt-0.6b-v3`](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3),
|
| 31 |
+
packaged for on-device inference (Android / Mac / embedded) without a Python
|
| 32 |
+
or NeMo runtime dependency.
|
| 33 |
|
| 34 |
For **model capabilities, languages, training data, license, and benchmarks**,
|
| 35 |
see the upstream model card. This card only documents what's specific to the
|
|
|
|
| 39 |
|
| 40 |
| File | Size | Purpose |
|
| 41 |
|---|---|---|
|
| 42 |
+
| `encoder_T1500.tflite` | 1.15 GB | FP16 encoder, fixed `T_mel = 1500` (15 s window) |
|
| 43 |
| `decoder_step.tflite` | 23 MB | Single-step LSTM prediction network |
|
| 44 |
| `joint_step.tflite` | 12 MB | TDT joint network (token + duration logits) |
|
| 45 |
| `tokenizer.model` | 353 KB | SentencePiece BPE tokenizer (vocab=8192) |
|
| 46 |
| `manifest.json` | — | All metadata the runtime needs |
|
| 47 |
|
| 48 |
+
Total: **~1.18 GB** (FP16). FP32 reference is ~2.37 GB.
|
| 49 |
|
| 50 |
+
## Encoder I/O contract
|
| 51 |
|
| 52 |
```
|
| 53 |
inputs:
|
| 54 |
+
audio_signal : float32 [1, 128, 1500] # log-mel features (NeMo preproc)
|
| 55 |
+
length : int32 [1] # actual mel frames used (≤ 1500)
|
| 56 |
outputs:
|
| 57 |
+
encoded : float32 [1, 1024, 188] # 188 = ceil(1500 / 8)
|
| 58 |
encoded_lengths : int32 [1]
|
| 59 |
```
|
| 60 |
|
| 61 |
+
Pad shorter inputs with zeros at the **tail** (the encoder was trained with
|
| 62 |
+
audio anchored at position 0; left-padding causes hallucinations) and pass
|
| 63 |
+
the true length.
|
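The tail-padding rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the bundle: the helper name `pad_mel_tail` is hypothetical, and it assumes the caller already has log-mel features of shape `[128, T]` with `T ≤ 1500`.

```python
import numpy as np

N_MELS = 128
T_MEL = 1500  # fixed encoder window from the I/O contract


def pad_mel_tail(mel: np.ndarray, t_mel: int = T_MEL):
    """Zero-pad log-mel features [128, T] at the tail to [1, 128, t_mel].

    Returns (audio_signal, length) with the dtypes the encoder contract lists.
    Hypothetical helper, not part of the bundle.
    """
    if mel.ndim != 2 or mel.shape[0] != N_MELS:
        raise ValueError(f"expected [128, T] features, got {mel.shape}")
    n_frames = mel.shape[1]
    if n_frames > t_mel:
        raise ValueError(f"{n_frames} mel frames do not fit in a {t_mel}-frame window")
    audio_signal = np.zeros((1, N_MELS, t_mel), dtype=np.float32)
    audio_signal[0, :, :n_frames] = mel           # audio stays anchored at position 0
    length = np.array([n_frames], dtype=np.int32)  # true (unpadded) length
    return audio_signal, length
```

The two returned arrays correspond to the `audio_signal` and `length` inputs; how they are fed to the runtime depends on the LiteRT binding in use.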
| 64 |
+
|
| 65 |
+
The 1500-mel bucket covers ≤ 15 s of audio. For long-form input, run the
|
| 66 |
+
encoder in a sliding-window streaming loop — see "Streaming usage" below.
|
| 67 |
|
| 68 |
**Why int32, not int64.** LiteRT's GPU/NPU delegates (LiteRT-CL / OpenCL,
|
| 69 |
NPU accelerator) reject int64 tensors entirely. With int64 length, every
|
| 70 |
+
internal CAST node touching it falls back to CPU, and `CompiledModel.create()`
|
| 71 |
fails outright on Android with the GPU backend. This bundle is exported with
|
| 72 |
int32 length end-to-end (input → internal mask arange/comparisons → output
|
| 73 |
+
`encoded_lengths`). int32 covers > 2 billion mel frames (~6,000 hours of audio at 100 mel frames/s),
|
| 74 |
+
so no practical range loss.
|
| 75 |
+
|
| 76 |
+
## Why a single bucket and not multi-signature
|
| 77 |
+
|
| 78 |
+
An earlier revision shipped a multi-signature encoder with 4 buckets
|
| 79 |
+
(300/500/700/1500) sharing weights inside one `.tflite`. The disk savings
|
| 80 |
+
were real (~1.2 GB instead of 4.8 GB for 4 separate files), but on Android
|
| 81 |
+
the LiteRT `CompiledModel.create()` API prepares **every** signature's
|
| 82 |
+
subgraph at load time — each one going through the full delegate-partition
|
| 83 |
+
pass. With 4 signatures × ~7 s of XNNPACK / GPU partition prep, app cold
|
| 84 |
+
start was ~28 s.
|
| 85 |
+
|
| 86 |
+
A single-bucket file is one subgraph: ~7 s init, then ready. If you need
|
| 87 |
+
multiple bucket sizes for latency reasons, ship them as separate `.tflite`
|
| 88 |
+
files (TFLite has no cross-file weight sharing) and load on demand.
|
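One way to arrange the "separate files, load on demand" idea is a small lazy cache keyed by bucket size. The sketch below is an assumption-laden illustration: `BucketCache` and the `loader` callable are hypothetical, and any bucket file other than `encoder_T1500.tflite` would have to be exported separately since this bundle ships only the 1500 bucket.

```python
from typing import Callable, Dict, Tuple


class BucketCache:
    """Lazily load one encoder per bucket size (hypothetical helper).

    `paths` maps T_mel -> file path; `loader` is whatever function your
    runtime uses to turn a path into a ready model (an assumption here).
    """

    def __init__(self, paths: Dict[int, str], loader: Callable[[str], object]):
        self._paths = dict(sorted(paths.items()))  # smallest bucket first
        self._loader = loader
        self._loaded: Dict[int, object] = {}

    def pick_bucket(self, n_mel: int) -> int:
        """Return the smallest bucket that fits n_mel frames."""
        for t_mel in self._paths:
            if n_mel <= t_mel:
                return t_mel
        raise ValueError(f"{n_mel} mel frames exceed the largest bucket")

    def get(self, n_mel: int) -> Tuple[int, object]:
        """Return (bucket size, model), paying the init cost only on first use."""
        t_mel = self.pick_bucket(n_mel)
        if t_mel not in self._loaded:
            self._loaded[t_mel] = self._loader(self._paths[t_mel])
        return t_mel, self._loaded[t_mel]
```

With this shape, cold start pays for only the bucket that is actually needed first, at the cost of extra disk space per additional file.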
| 89 |
|
| 90 |
## Decoder + joint contract
|
| 91 |
|
|
|
|
| 101 |
# logits[..., 8193:8198] → duration logits over [0,1,2,3,4]
|
| 102 |
```
|
| 103 |
|
| 104 |
+
`decoder_step.token` is `int64` because it's an embedding lookup; that op
|
| 105 |
+
runs on CPU regardless of delegate, so int64 there is harmless.
|
| 106 |
+
|
| 107 |
Greedy TDT decoding (per encoder frame):
|
| 108 |
|
| 109 |
1. Run joint with current `enc_frame` and last predicted `pred_frame`.
|
| 110 |
+
2. `token = argmax(token_logits)`; `dur = durations[argmax(duration_logits)] ∈ {0,1,2,3,4}`.
|
| 111 |
3. If `token != blank_id (8192)`: emit token, advance `dur` encoder frames,
|
| 112 |
re-prime decoder with the emitted token (h, c update).
|
| 113 |
4. Else: advance `max(dur, 1)` encoder frames; do not advance the decoder.
|
| 114 |
5. Repeat until `enc_lengths` is exhausted.
|
| 115 |
|
| 116 |
+
Cap at ~10 non-blank emissions per encoder frame to guard against the
|
| 117 |
+
pathological `dur=0` decode loop.
|
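The greedy loop and the emission cap described above can be sketched as follows. This is a hedged illustration rather than the bundle's reference decoder: `joint_fn` and `decoder_fn` are hypothetical wrappers around `joint_step.tflite` and `decoder_step.tflite`, `encoded` is assumed to be re-laid-out as `[T_enc, 1024]` (for example `encoded[0].T`), and priming the decoder with the blank id is an assumption.

```python
import numpy as np

BLANK_ID = 8192
DURATIONS = (0, 1, 2, 3, 4)
MAX_SYMBOLS_PER_FRAME = 10  # guard against the dur=0 loop


def greedy_tdt_decode(encoded, enc_len, joint_fn, decoder_fn, init_state):
    """Greedy TDT decoding over encoder frames.

    joint_fn(enc_frame, pred_frame) -> logits with 8198 entries (hypothetical wrapper)
    decoder_fn(token, state) -> (pred_frame, new_state)  (hypothetical wrapper)
    """
    tokens = []
    # Assumption: the prediction network is primed with the blank id.
    pred_frame, state = decoder_fn(BLANK_ID, init_state)
    t = 0
    emitted_here = 0  # non-blank emissions at the current frame
    while t < enc_len:
        logits = np.asarray(joint_fn(encoded[t], pred_frame)).reshape(-1)
        token = int(np.argmax(logits[: BLANK_ID + 1]))
        dur_idx = int(np.argmax(logits[BLANK_ID + 1 : BLANK_ID + 1 + len(DURATIONS)]))
        dur = DURATIONS[dur_idx]
        if token != BLANK_ID:
            tokens.append(token)
            pred_frame, state = decoder_fn(token, state)  # re-prime (h, c update)
            emitted_here += 1
            if dur == 0 and emitted_here >= MAX_SYMBOLS_PER_FRAME:
                dur = 1  # cap reached: force progress
        else:
            dur = max(dur, 1)  # blank always advances; decoder state untouched
        if dur > 0:
            t += dur
            emitted_here = 0
    return tokens
```

The returned ids can then be passed to the SentencePiece tokenizer for detokenization.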
| 118 |
+
|
| 119 |
## Audio preprocessing
|
| 120 |
|
| 121 |
+
LiteRT itself does not produce mel features — your runtime must compute
|
| 122 |
+
them. Match NeMo's preprocessor exactly:
|
| 123 |
|
| 124 |
```
|
| 125 |
sample_rate : 16000 Hz (resample if needed)
|
|
|
|
| 128 |
win_length : 400
|
| 129 |
n_mels : 128
|
| 130 |
preemph : 0.97
|
| 131 |
+
log : log(mel + 1e-5), per-feature normalized
|
| 132 |
+
mel_scale : slaney
|
| 133 |
```
|
| 134 |
|
| 135 |
Encoder frame rate after the 8× subsampler: **12.5 fps** (1 enc frame = 80 ms).
|
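Two of the listed steps, pre-emphasis and the log plus per-feature normalization, are easy to get subtly wrong, so here is a minimal NumPy sketch of just those two. It does not cover the STFT or mel filterbank. The function names are hypothetical, and the unbiased standard deviation and the epsilon added to it are assumptions about NeMo's normalization that should be checked against the upstream preprocessor.

```python
import numpy as np


def preemphasize(samples, coeff: float = 0.97) -> np.ndarray:
    """y[0] = x[0]; y[n] = x[n] - coeff * x[n-1]."""
    x = np.asarray(samples, dtype=np.float32)
    if x.size == 0:
        return x.copy()
    return np.concatenate([x[:1], x[1:] - coeff * x[:-1]])


def log_and_normalize(mel_power, eps: float = 1e-5) -> np.ndarray:
    """log(mel + 1e-5), then zero-mean / unit-variance per mel bin.

    mel_power: [n_mels, T] non-negative mel energies for the valid frames only.
    Assumption: unbiased std (ddof=1) with eps added to the denominator.
    """
    log_mel = np.log(np.asarray(mel_power, dtype=np.float32) + eps)
    mean = log_mel.mean(axis=1, keepdims=True)
    std = log_mel.std(axis=1, ddof=1, keepdims=True)
    return (log_mel - mean) / (std + eps)
```

Normalization statistics should be computed before tail padding, so the padded zeros do not shift the mean and variance.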
| 136 |
|
| 137 |
## Streaming usage
|
| 138 |
|
| 139 |
+
This bundle supports chunked streaming inference using a left+chunk+right
|
| 140 |
+
context window that fits inside 15 s. A reference Python implementation is
|
| 141 |
+
in the upstream repo (`transcribe_litert_streaming.py`). Recommended config
|
| 142 |
+
for Android UX:
|
| 143 |
|
| 144 |
+
| Knob | Value | Reason |
|
| 145 |
+
|---|---|---|
|
| 146 |
+
| `chunk_seconds` | 5 | committed per step |
|
| 147 |
+
| `left_context_seconds` | 5 | encoder bilateral context |
|
| 148 |
+
| `right_context_seconds` | 2 | end-to-end latency ≈ 7 s |
|
| 149 |
+
| `window total` | 12 s | (5 + 5 + 2) × 100 = 1200 mel ≤ 1500 |
|
| 150 |
+
| `carry_state` | false | offline-trained model; carrying LSTM state across chunks tends to hurt |
|
| 151 |
|
| 152 |
+
We measured ~27 % WER on multilingual long-form audio (EN/ES/IT
|
| 153 |
+
code-switching) with this config, ~22 % on clean offline ≤15 s English.
|
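The window arithmetic in the table can be sketched as a small generator over mel-frame indices. This is an illustration under the recommended knob values, not the reference `transcribe_litert_streaming.py`; the `Window` type and `iter_windows` name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator

MEL_FPS = 100  # hop_length 160 at 16 kHz


@dataclass(frozen=True)
class Window:
    start: int         # first mel frame fed to the encoder
    end: int           # one past the last mel frame fed to the encoder
    commit_start: int  # first mel frame whose tokens are kept
    commit_end: int    # one past the last mel frame whose tokens are kept


def iter_windows(n_mel: int, chunk_s: float = 5.0, left_s: float = 5.0,
                 right_s: float = 2.0, t_mel: int = 1500) -> Iterator[Window]:
    """Yield left+chunk+right windows that tile n_mel frames chunk by chunk."""
    chunk, left, right = (int(round(s * MEL_FPS)) for s in (chunk_s, left_s, right_s))
    if chunk <= 0 or left + chunk + right > t_mel:
        raise ValueError("window does not fit the encoder bucket")
    for c0 in range(0, n_mel, chunk):
        c1 = min(c0 + chunk, n_mel)
        yield Window(max(0, c0 - left), min(n_mel, c1 + right), c0, c1)
```

For each window, slice `mel[:, w.start:w.end]`, zero-pad at the tail, run the encoder, and keep only tokens decoded from encoder frames that fall inside the committed span. Mapping a mel index to an encoder frame by dividing the window-relative offset by 8 is an assumption based on the 8× subsampler.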
|
|
|
| 154 |
|
| 155 |
## Quantization
|
| 156 |
|
| 157 |
+
- All `.tflite` weights are FP16. Activations remain FP32.
|
| 158 |
+
- Bit-identical token output vs the upstream FP32 model on a 99-clip eval
|
| 159 |
+
set.
|
|
|
|
| 160 |
|
| 161 |
## Conversion provenance
|
| 162 |
|
|
|
|
| 164 |
|
| 165 |
1. **NeMo → torch.export ExportedProgram** (per encoder/decoder/joint module).
|
| 166 |
2. **ExportedProgram → TFLite** via
|
| 167 |
+
[`litert-torch`](https://github.com/google-ai-edge/LiteRT) 0.8.0.
|
|
|
|
| 168 |
3. **FP32 → FP16** via `ai_edge_quantizer` `FLOAT_CASTING` algorithm on
|
| 169 |
FC / Conv / DepthwiseConv / ConvTranspose / EmbeddingLookup ops.
|
| 170 |
|
| 171 |
+
Several NeMo internals required export-time monkey-patches:
|
| 172 |
+
|
| 173 |
+
- `MaskedConvSequential.{forward,_create_mask}` and `apply_channel_mask` — to
|
| 174 |
+
remove `.expand(...)` patterns rejected by the TFLite broadcast checker.
|
| 175 |
+
- `RelPositionMultiHeadAttentionLongformer._get_invalid_locations_mask` — to
|
| 176 |
+
build masks in `bool` instead of `uint8` (litert-torch has no uint8
|
| 177 |
+
lowering).
|
| 178 |
+
- `ConformerEncoder.{forward_internal,_create_masks}` and
|
| 179 |
+
`MaskedConvSequential.{forward,_create_mask}` — to keep the entire length
|
| 180 |
+
pipeline in `int32` instead of NeMo's default `int64`, so LiteRT's
|
| 181 |
+
GPU/NPU delegates can compile the graph without falling back to CPU.
|
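The effect of the mask-related patches can be illustrated outside NeMo. The NumPy sketch below is not the actual monkey-patch (which operates on PyTorch modules); it only shows the target pattern: positions and lengths held in `int32`, and a `bool` mask produced by broadcasting a comparison rather than by `.expand(...)` or a `uint8` tensor. The function name is hypothetical.

```python
import numpy as np


def length_mask_int32(length, t_max: int) -> np.ndarray:
    """Return a [batch, t_max] bool mask that is True for valid frames.

    Both the position index and the lengths stay in int32, mirroring the
    int32 length pipeline described above.
    """
    length = np.asarray(length, dtype=np.int32)
    positions = np.arange(t_max, dtype=np.int32)
    return positions[None, :] < length[:, None]  # broadcast compare, bool result
```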
| 182 |
|
| 183 |
+
## Limitations
|
| 184 |
|
| 185 |
+
1. **Audio at position 0.** The encoder expects audio anchored at the start
|
| 186 |
+
of its input window. Padding before the audio causes hallucinations.
|
| 187 |
+
2. **15 s max per call.** Use the streaming chunker for longer clips.
|
| 188 |
+
3. **No VAD or diarization.** Pair with an external VAD or a diarizer
|
| 189 |
+
(e.g. Sortformer) for speaker-attributed transcripts.
|
| 190 |
+
4. **Multilingual but no language token.** Code-switching works, but the
|
| 191 |
+
model doesn't emit a language ID. Run a separate classifier if you need it.
|
|
|
|
| 192 |
|
| 193 |
## License
|
| 194 |
|
| 195 |
+
Inherits the upstream `nvidia/parakeet-tdt-0.6b-v3` license (CC-BY-4.0).
|
|
|
|
| 196 |
|
| 197 |
## Citation
|
| 198 |
|
| 199 |
```bibtex
|
| 200 |
@misc{nvidia_parakeet_tdt_0_6b_v3,
|
| 201 |
title = {Parakeet-TDT-0.6B-v3},
|
encoder_T1500.tflite
ADDED
|
@@ -0,0 +1,3 @@
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:8f826f40ab8e50c0e05ff1532d1cedcaa9c79c34909f1c14d9ac86c0643db45b
|
| 3 |
+
size 1206789364
|
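The three lines above are a Git LFS pointer, not the model itself. After downloading the real file, its size and SHA-256 can be checked against the pointer. The following is a minimal standard-library sketch; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str):
    """Return (sha256 hex digest, size in bytes) from Git LFS pointer text."""
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if line:
            key, _, value = line.partition(" ")
            fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return digest, int(fields["size"])


def matches_pointer(path, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """True if the file at `path` has the size and SHA-256 named in the pointer."""
    digest, size = parse_lfs_pointer(pointer_text)
    path = Path(path)
    if path.stat().st_size != size:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            hasher.update(block)  # stream in chunks; the encoder is over 1 GB
    return hasher.hexdigest() == digest
```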
manifest.json
CHANGED
|
@@ -30,36 +30,6 @@
|
|
| 30 |
"attention_mode": "rel_pos",
|
| 31 |
"att_context_size": null,
|
| 32 |
"buckets": [
|
| 33 |
-
{
|
| 34 |
-
"n_mel_frames": 300,
|
| 35 |
-
"n_encoder_frames": 37,
|
| 36 |
-
"input_shape": [
|
| 37 |
-
1,
|
| 38 |
-
128,
|
| 39 |
-
300
|
| 40 |
-
],
|
| 41 |
-
"signature": "forward_T300"
|
| 42 |
-
},
|
| 43 |
-
{
|
| 44 |
-
"n_mel_frames": 500,
|
| 45 |
-
"n_encoder_frames": 62,
|
| 46 |
-
"input_shape": [
|
| 47 |
-
1,
|
| 48 |
-
128,
|
| 49 |
-
500
|
| 50 |
-
],
|
| 51 |
-
"signature": "forward_T500"
|
| 52 |
-
},
|
| 53 |
-
{
|
| 54 |
-
"n_mel_frames": 700,
|
| 55 |
-
"n_encoder_frames": 87,
|
| 56 |
-
"input_shape": [
|
| 57 |
-
1,
|
| 58 |
-
128,
|
| 59 |
-
700
|
| 60 |
-
],
|
| 61 |
-
"signature": "forward_T700"
|
| 62 |
-
},
|
| 63 |
{
|
| 64 |
"n_mel_frames": 1500,
|
| 65 |
"n_encoder_frames": 187,
|
|
@@ -68,12 +38,16 @@
|
|
| 68 |
128,
|
| 69 |
1500
|
| 70 |
],
|
| 71 |
-
"
|
| 72 |
}
|
| 73 |
],
|
| 74 |
-
"multisig":
|
| 75 |
-
"dynamic_artifact": "encoder_dynamicT.pt2",
|
| 76 |
-
"dynamic_artifact_size_mb": 2367.32
|
| 77 |
},
|
| 78 |
"decoder": {
|
| 79 |
"num_layers": 2,
|
|
@@ -171,51 +145,68 @@
|
|
| 171 |
"results": [
|
| 172 |
{
|
| 173 |
"graph": "encoder",
|
| 174 |
-
"source_artifact": "
|
| 175 |
-
"output_artifact": "
|
| 176 |
-
"size_mb":
|
| 177 |
-
"convert_seconds":
|
| 178 |
"quant": "fp16",
|
| 179 |
-
"
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
],
|
| 186 |
-
"
|
| 187 |
-
"
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
]
|
| 202 |
-
|
| 203 |
-
"
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
]
|
| 210 |
-
|
| 211 |
-
"
|
| 212 |
-
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
|
| 217 |
]
|
| 218 |
-
|
| 219 |
}
|
| 220 |
},
|
| 221 |
{
|
|
@@ -223,7 +214,7 @@
|
|
| 223 |
"source_artifact": "decoder_step.pt2",
|
| 224 |
"output_artifact": "decoder_step.tflite",
|
| 225 |
"size_mb": 22.55,
|
| 226 |
-
"convert_seconds":
|
| 227 |
"quant": "fp16",
|
| 228 |
"torch_output_shapes": [
|
| 229 |
[
|
|
@@ -315,7 +306,7 @@
|
|
| 315 |
"source_artifact": "joint_step.pt2",
|
| 316 |
"output_artifact": "joint_step.tflite",
|
| 317 |
"size_mb": 12.08,
|
| 318 |
-
"convert_seconds": 1.
|
| 319 |
"quant": "fp16",
|
| 320 |
"torch_output_shapes": [
|
| 321 |
[
|
|
@@ -327,9 +318,9 @@
|
|
| 327 |
],
|
| 328 |
"parity": {
|
| 329 |
"ok": true,
|
| 330 |
-
"max_abs_diff": 0.
|
| 331 |
"per_output_diffs": [
|
| 332 |
-
0.
|
| 333 |
],
|
| 334 |
"tflite_output_shapes": [
|
| 335 |
[
|
|
|
|
| 30 |
"attention_mode": "rel_pos",
|
| 31 |
"att_context_size": null,
|
| 32 |
"buckets": [
|
| 33 |
{
|
| 34 |
"n_mel_frames": 1500,
|
| 35 |
"n_encoder_frames": 187,
|
|
|
|
| 38 |
128,
|
| 39 |
1500
|
| 40 |
],
|
| 41 |
+
"output_shape": [
|
| 42 |
+
1,
|
| 43 |
+
1024,
|
| 44 |
+
188
|
| 45 |
+
],
|
| 46 |
+
"artifact": "encoder_T1500.pt2",
|
| 47 |
+
"size_mb": 2366.85
|
| 48 |
}
|
| 49 |
],
|
| 50 |
+
"multisig": false
|
| 51 |
},
|
| 52 |
"decoder": {
|
| 53 |
"num_layers": 2,
|
|
|
|
| 145 |
"results": [
|
| 146 |
{
|
| 147 |
"graph": "encoder",
|
| 148 |
+
"source_artifact": "encoder_T1500.pt2",
|
| 149 |
+
"output_artifact": "encoder_T1500.tflite",
|
| 150 |
+
"size_mb": 1150.88,
|
| 151 |
+
"convert_seconds": 158.59,
|
| 152 |
"quant": "fp16",
|
| 153 |
+
"torch_output_shapes": [
|
| 154 |
+
[
|
| 155 |
+
1,
|
| 156 |
+
1024,
|
| 157 |
+
188
|
| 158 |
+
],
|
| 159 |
+
[
|
| 160 |
+
1
|
| 161 |
+
]
|
| 162 |
],
|
| 163 |
+
"parity": {
|
| 164 |
+
"ok": true,
|
| 165 |
+
"max_abs_diff": 0.0,
|
| 166 |
+
"per_output_diffs": [
|
| 167 |
+
[
|
| 168 |
+
"shape mismatch",
|
| 169 |
+
[
|
| 170 |
+
1
|
| 171 |
+
],
|
| 172 |
+
[
|
| 173 |
+
1,
|
| 174 |
+
1024,
|
| 175 |
+
188
|
| 176 |
+
]
|
| 177 |
+
],
|
| 178 |
+
[
|
| 179 |
+
"shape mismatch",
|
| 180 |
+
[
|
| 181 |
+
1,
|
| 182 |
+
1024,
|
| 183 |
+
188
|
| 184 |
+
],
|
| 185 |
+
[
|
| 186 |
+
1
|
| 187 |
+
]
|
| 188 |
]
|
| 189 |
+
],
|
| 190 |
+
"tflite_output_shapes": [
|
| 191 |
+
[
|
| 192 |
+
1
|
| 193 |
+
],
|
| 194 |
+
[
|
| 195 |
+
1,
|
| 196 |
+
1024,
|
| 197 |
+
188
|
| 198 |
]
|
| 199 |
+
],
|
| 200 |
+
"torch_output_shapes": [
|
| 201 |
+
[
|
| 202 |
+
1,
|
| 203 |
+
1024,
|
| 204 |
+
188
|
| 205 |
+
],
|
| 206 |
+
[
|
| 207 |
+
1
|
| 208 |
]
|
| 209 |
+
]
|
| 210 |
}
|
| 211 |
},
|
| 212 |
{
|
|
|
|
| 214 |
"source_artifact": "decoder_step.pt2",
|
| 215 |
"output_artifact": "decoder_step.tflite",
|
| 216 |
"size_mb": 22.55,
|
| 217 |
+
"convert_seconds": 1.92,
|
| 218 |
"quant": "fp16",
|
| 219 |
"torch_output_shapes": [
|
| 220 |
[
|
|
|
|
| 306 |
"source_artifact": "joint_step.pt2",
|
| 307 |
"output_artifact": "joint_step.tflite",
|
| 308 |
"size_mb": 12.08,
|
| 309 |
+
"convert_seconds": 1.61,
|
| 310 |
"quant": "fp16",
|
| 311 |
"torch_output_shapes": [
|
| 312 |
[
|
|
|
|
| 318 |
],
|
| 319 |
"parity": {
|
| 320 |
"ok": true,
|
| 321 |
+
"max_abs_diff": 0.408447265625,
|
| 322 |
"per_output_diffs": [
|
| 323 |
+
0.408447265625
|
| 324 |
],
|
| 325 |
"tflite_output_shapes": [
|
| 326 |
[
|