talatapp committed on
Commit
dace63e
·
verified ·
1 Parent(s): f1283ca

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +43 -0
README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - multilingual
+ tags:
+ - automatic-speech-recognition
+ - onnx
+ - parakeet
+ - tdt
+ - multilingual
+ base_model: nvidia/parakeet-tdt-0.6b-v3
+ ---
+
+ # parakeet-tdt-0.6b-v3 ONNX (split decoder/joint, multilingual)
+
+ Multilingual variant: a re-export of [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) as four separate ONNX files, int8 dynamic-quantized for CPU / DirectML inference. Supports 25 European languages.
+
+ Same architecture and tensor schema as the English [v2 split bundle](https://huggingface.co/talatapp/parakeet-tdt-0.6b-v2-onnx-split); only the vocabulary and the joint network's output dimension differ.
+
+ | File | Inputs | Outputs |
+ |---|---|---|
+ | `preprocessor.int8.onnx` | `audio_signal [1, S] f32`, `audio_length [1] i32` | `mel [1, 128, F] f32`, `mel_length [1] i64` |
+ | `encoder.int8.onnx` | `mel [1, 128, F] f32`, `mel_length [1] i32` | `encoder [1, 1024, T] f32`, `encoder_length [1] i64` |
+ | `decoder.int8.onnx` | `targets [1, U] i32`, `target_length [1] i32`, `h_in [2, 1, 640] f32`, `c_in [2, 1, 640] f32` | `decoder [1, 640, 2] f32`, `h_out`, `c_out` |
+ | `joint_decision.int8.onnx` | `encoder [1, 1024, T] f32`, `decoder [1, 640, U] f32` | `token_id [1, T, U] i32`, `token_prob [1, T, U] f32`, `duration [1, T, U] i32` |
+
+ `joint_decision` fuses the joint network with the decision head
+ (argmax over token logits + argmax over duration logits + gather for
+ token probability).
+
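The decision head described above can be sketched in plain Python for a single `(t, u)` position. This is only an illustration, not the exported graph: it assumes the joint network emits one logit vector with the token logits first and the duration logits after them, and that the reported probability is the softmax probability of the chosen token. The function name `decide` and the `num_tokens` parameter are hypothetical.

```python
import math
from typing import List, Tuple


def decide(logits: List[float], num_tokens: int) -> Tuple[int, float, int]:
    """Return (token_id, token_prob, duration_index) for one (t, u) position.

    Assumption: `logits` holds `num_tokens` token logits followed by the
    duration logits; the real layout inside the ONNX graph may differ.
    """
    token_logits = logits[:num_tokens]
    duration_logits = logits[num_tokens:]

    # Argmax over token logits.
    token_id = max(range(len(token_logits)), key=token_logits.__getitem__)

    # Gather the softmax probability of the chosen token (stable form).
    peak = max(token_logits)
    exps = [math.exp(x - peak) for x in token_logits]
    token_prob = exps[token_id] / sum(exps)

    # Argmax over duration logits (an index into the duration set).
    duration_index = max(range(len(duration_logits)), key=duration_logits.__getitem__)
    return token_id, token_prob, duration_index
```

Doing this inside the graph means the caller receives three small tensors per step instead of a full logit vector.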
+ ## Why split?
+
+ NeMo's own `asr_model.export()` and [istupakov/parakeet-tdt-0.6b-v3-onnx](https://huggingface.co/istupakov/parakeet-tdt-0.6b-v3-onnx) fuse the decoder and joint network into a single ONNX file. That's fine for inference engines that call the full TDT decoder loop in one go, but it doesn't fit pipelines that drive the loop themselves and need the sub-graphs callable independently (e.g. the [talat](https://github.com/lumikey/talat) Rust inference layer, which mirrors FluidAudio's macOS CoreML 4-file decomposition).
+
+ The PyTorch wrappers used to extract the four sub-graphs are adapted from [FluidInference/mobius](https://github.com/FluidInference/mobius) (Apache 2.0).
+
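To make "driving the loop yourself" concrete, here is a minimal sketch of a greedy TDT loop that takes the decoder and joint steps as plain callables. In a real pipeline these would wrap `decoder.int8.onnx` and `joint_decision.int8.onnx`; here `decoder_step`, `joint_step`, `blank_id`, and `max_symbols_per_frame` are hypothetical names, and the zero-duration handling follows common TDT greedy-decoding practice rather than anything confirmed for this bundle.

```python
from typing import Any, Callable, List, Tuple


def greedy_tdt_decode(
    num_frames: int,
    blank_id: int,
    initial_state: Any,
    decoder_step: Callable[[int, Any], Tuple[Any, Any]],
    joint_step: Callable[[int, Any], Tuple[int, int]],
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy TDT decoding over `num_frames` encoder frames.

    decoder_step(last_token, state) -> (decoder_output, next_state)
    joint_step(frame_index, decoder_output) -> (token_id, duration)
    """
    tokens: List[int] = []
    t = 0
    last_token = blank_id  # assumption: blank doubles as the start token
    state = initial_state
    emitted_here = 0  # symbols emitted without advancing the frame

    while t < num_frames:
        # A real implementation would cache this when nothing changed.
        dec_out, next_state = decoder_step(last_token, state)
        token_id, duration = joint_step(t, dec_out)

        if token_id != blank_id:
            # Commit the token and the decoder state that produced it.
            tokens.append(token_id)
            last_token, state = token_id, next_state
            emitted_here += 1

        # Guarantee progress: a blank, or too many symbols on one frame,
        # must advance time by at least one frame.
        if duration == 0 and (token_id == blank_id or emitted_here >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            emitted_here = 0
        t += duration

    return tokens
```

Because the loop only depends on two callables, the same driver can be exercised with stub functions in tests and with ONNX sessions in production.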
+ ## Quantization
+
+ Per-channel int8 weight-only quantization via `onnxruntime.quantization.quantize_dynamic`. Activations remain fp32 at runtime, which keeps the int8 path stable across the CPU EP and DirectML without needing a calibration dataset.
+
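A quantization step along these lines might look like the sketch below. The fp32 input file names, the helper names, and the choice of `QuantType.QInt8` are assumptions; only the use of `quantize_dynamic` with per-channel weights comes from the description above.

```python
from pathlib import Path


def int8_path(fp32_path: str) -> str:
    """Map e.g. 'encoder.onnx' to 'encoder.int8.onnx' (hypothetical naming helper)."""
    p = Path(fp32_path)
    return str(p.with_name(f"{p.stem}.int8{p.suffix}"))


def quantize_to_int8(fp32_path: str) -> str:
    """Write a dynamically quantized copy of `fp32_path` and return its path."""
    # Imported lazily so int8_path() works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = int8_path(fp32_path)
    quantize_dynamic(
        model_input=fp32_path,
        model_output=out_path,
        per_channel=True,             # per-channel weight scales
        weight_type=QuantType.QInt8,  # assumption: signed int8 weights
    )
    return out_path
```

Calling `quantize_to_int8` once per exported sub-graph would yield the four `*.int8.onnx` files; no calibration data is passed because dynamic quantization does not use any.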
+ ## License
+
+ Inherits NVIDIA Parakeet TDT v3's license (CC-BY-4.0).