---
license: cc-by-4.0
language:
- en
metrics:
- wer
base_model:
- nvidia/parakeet-tdt_ctc-110m
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech
- audio
- Transducer
- TDT
- FastConformer
- Conformer
- pytorch
- NeMo
- hf-asr-leaderboard
---
# Parakeet-TDT-CTC 110M - CoreML

CoreML export of [nvidia/parakeet-tdt_ctc-110m](https://huggingface.co/nvidia/parakeet-tdt_ctc-110m) for on-device speech recognition on Apple Silicon via [FluidAudio](https://github.com/FluidInference/FluidAudio).

## CoreML Components

| File | Size | Description |
|------|------|-------------|
| `Preprocessor.mlmodelc` | 207 MB | Fused mel-spectrogram + FastConformer encoder |
| `Decoder.mlmodelc` | 7.5 MB | 1-layer LSTM prediction network |
| `JointDecision.mlmodelc` | 2.7 MB | Single-step joint network (token + duration) |
| `parakeet_vocab.json` | 18 KB | 1024-token BPE vocabulary |
| `config.json` | 2.5 KB | Model metadata and I/O contracts |

- **Input:** 16 kHz mono audio, fixed 15-second window (240,000 samples).
- **Output:** Token IDs, probabilities, and TDT duration predictions per encoder frame.
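
Because the input window is fixed, a clip has to be brought to exactly 240,000 samples before it reaches the preprocessor. The sketch below shows one way to do that in plain Python. The function name `fit_to_window`, the zero-padding, and the truncation of longer clips are assumptions for illustration; this card does not describe how FluidAudio pads or chunks audio.

```python
SAMPLE_RATE_HZ = 16_000  # from the model card
WINDOW_SECONDS = 15      # from the model card
WINDOW_SAMPLES = SAMPLE_RATE_HZ * WINDOW_SECONDS  # 240,000 samples


def fit_to_window(samples, window_samples=WINDOW_SAMPLES):
    """Return (window, valid_length) for one fixed-size model input.

    Short clips are zero-padded at the end; long clips are truncated.
    Both choices are illustrative assumptions, not FluidAudio behaviour.
    """
    valid_length = min(len(samples), window_samples)
    window = list(samples[:valid_length])
    window.extend([0.0] * (window_samples - valid_length))
    return window, valid_length


# Example: a 2-second clip of silence becomes one full-size window.
clip = [0.0] * (2 * SAMPLE_RATE_HZ)
window, valid_length = fit_to_window(clip)
print(len(window), valid_length)  # 240000 32000
```

Keeping `valid_length` alongside the padded window lets a caller ignore encoder frames that correspond only to padding.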

## Performance

Benchmarked with FluidAudio CLI on Apple M2 (release build):

| Metric | Result |
|--------|--------|
| WER, LibriSpeech test-clean | **3.0%** |
| RTFx (overall) | **102x** real-time |
| Peak memory | 0.3 GB |
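
RTFx is the inverse real-time factor: seconds of audio divided by seconds spent processing it. The sketch below works through what 102x means in practice. The function names are hypothetical, and it assumes "overall" means total audio duration over total processing time for the benchmark run.

```python
def compute_rtfx(audio_seconds, wall_clock_seconds):
    """Inverse real-time factor: how many times faster than real time."""
    return audio_seconds / wall_clock_seconds


def estimated_processing_seconds(audio_seconds, rtfx):
    """Rough processing time for a clip at a given RTFx."""
    return audio_seconds / rtfx


one_hour = 3600.0
print(round(estimated_processing_seconds(one_hour, 102.0), 1))  # 35.3
```

At the reported figure, one hour of audio would take roughly 35 seconds on comparable hardware. Actual throughput depends on the device and on clip lengths.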

NVIDIA's reference WER for the original model (greedy decoding, GPU):

| Benchmark | WER |
|-----------|-----|
| LibriSpeech test-clean | 2.4% |
| LibriSpeech test-other | 5.2% |
| AMI | 15.88% |
| Earnings-22 | 12.42% |
| GigaSpeech | 10.52% |
| TEDLIUM-v3 | 4.16% |

## Usage with FluidAudio

```bash
# Transcribe
fluidaudiocli transcribe audio.wav --model-version tdt-ctc-110m

# Benchmark
fluidaudiocli asr-benchmark --subset test-clean --model-version tdt-ctc-110m
```

Models auto-download from this repo on first use. To pre-fetch:

```bash
fluidaudiocli download --model-version tdt-ctc-110m
```

## Conversion

Exported from NeMo using [mobius/models/stt/parakeet-tdt-ctc-110m/coreml/convert-tdt-coreml.py](https://github.com/FluidInference/mobius):

- Preprocessor fuses mel-spectrogram extraction and the FastConformer encoder into a single CoreML model.
- JointDecision is the single-step variant (encoder_step + decoder_step inputs) used by FluidAudio's TDT decoder.
- All models are exported as MLProgram (iOS 17+ / macOS 14+) in float32 precision.
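
To show how a single-step joint model fits into decoding, here is a minimal sketch of a greedy TDT loop in Python. It follows the general scheme from the TDT paper: each joint call yields a token and a duration, and the duration tells the decoder how many encoder frames to skip. The callables `joint_decision` and `run_decoder`, the blank-primed start, and the `max_symbols_per_frame` guard are assumptions for illustration; this is not FluidAudio's Swift implementation.

```python
def tdt_greedy_decode(encoder_frames, joint_decision, run_decoder,
                      blank_id, initial_state, max_symbols_per_frame=10):
    """Greedy TDT decoding over a sequence of encoder frames.

    joint_decision(encoder_step, decoder_step) -> (token_id, duration)
    run_decoder(token_id, state) -> (decoder_step, new_state)
    Both callables are hypothetical stand-ins for the CoreML models.
    """
    tokens = []
    # Prime the prediction network with blank as the start symbol (assumption).
    decoder_step, state = run_decoder(blank_id, initial_state)
    t = 0
    symbols_at_t = 0
    while t < len(encoder_frames):
        token, duration = joint_decision(encoder_frames[t], decoder_step)
        if token != blank_id:
            tokens.append(token)
            decoder_step, state = run_decoder(token, state)
            symbols_at_t += 1
        # Avoid stalling: a blank with duration 0, or too many symbols on
        # one frame, forces the loop to advance by one frame.
        if duration == 0 and (token == blank_id
                              or symbols_at_t >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            t += duration
            symbols_at_t = 0
    return tokens


# Toy stand-ins so the loop can be run without any model files.
BLANK = 0

def toy_decoder(token_id, state):
    # The "decoder output" is just the last token; state counts calls.
    return token_id, state + 1

def toy_joint(encoder_step, decoder_step):
    script = {(0, BLANK): (5, 0), (0, 5): (BLANK, 2), (2, 5): (7, 3)}
    return script.get((encoder_step, decoder_step), (BLANK, 0))

print(tdt_greedy_decode(list(range(6)), toy_joint, toy_decoder, BLANK, 0))  # [5, 7]
```

In the toy run, frame 0 emits token 5 with duration 0, then a blank skips two frames, frame 2 emits token 7 and skips three frames, and the final blank is forced forward by one frame.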

## References

- [Fast Conformer with Linearly Scalable Attention](https://arxiv.org/abs/2305.05084)
- [Efficient Sequence Transduction by Jointly Predicting Tokens and Durations](https://arxiv.org/abs/2304.06795)
- [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)