Parakeet-TDT 1.1B CoreML
Pre-compiled CoreML model bundles for NVIDIA Parakeet-TDT 1.1B speech recognition, optimized for Apple Silicon (M1-M5) with full Apple Neural Engine (ANE) acceleration.
Model Details
- Architecture: Transducer-Duration-Transducer (TDT) with RNN-T greedy search
- Parameters: 1.1 billion
- Format: Pre-compiled
.mlmodelcbundles (ready to load, no compilation needed) - Compute: Runs on ANE + GPU + CPU via CoreML
ComputeUnits::All - Input: Raw 16kHz mono audio (15-second fixed window)
- Vocabulary: 1025 BPE tokens + 5 TDT duration classes
Files
| File | Size | Description |
|---|---|---|
encoder.mlmodelc/ |
1.9 GB | Fused preprocessor + encoder (mel spectrogram + conformer) |
decoder.mlmodelc/ |
14 MB | LSTM prediction network (2 layers, 640 hidden) |
joiner.mlmodelc/ |
3.5 MB | Joint network producing 1030 logits (1025 vocab + 5 durations) |
tokens.txt |
10 KB | BPE vocabulary (<token> <id> per line) |
Usage with Swictation
These models are automatically downloaded during npm install swictation on macOS Apple Silicon.
npm install -g swictation
# Models download to ~/.local/share/swictation/models/parakeet-tdt-1.1b-coreml/
Manual download:
pip install huggingface_hub[cli]
hf download jenerallee78/parakeet-tdt-1.1b-coreml --local-dir ~/.local/share/swictation/models/parakeet-tdt-1.1b-coreml
Usage with coreml-native (Rust)
use coreml_native::{Model, ComputeUnits, BorrowedTensor};
let encoder = Model::load("encoder.mlmodelc", ComputeUnits::All)?;
let decoder = Model::load("decoder.mlmodelc", ComputeUnits::All)?;
let joiner = Model::load("joiner.mlmodelc", ComputeUnits::All)?;
// Encoder: raw audio in, features out
let audio = BorrowedTensor::from_f32(&samples, &[1, 240000])?;
let length = BorrowedTensor::from_i32(&[num_samples as i32], &[1])?;
let pred = encoder.predict(&[("audio_signal", &audio), ("audio_length", &length)])?;
let (features, shape) = pred.get_f32("obj_3")?;
See coreml-native for the Rust bindings.
Model I/O Specification
Encoder
- Inputs:
audio_signal[1, 240000] FLOAT32,audio_length[1] INT32 - Outputs:
obj_3[1, 1024, 188] FLOAT16,obj[1] INT32 (valid frame count)
Decoder
- Inputs:
targets[1, 1] INT32,target_length[1] INT32,h[2, 1, 640] FLOAT32,c[2, 1, 640] FLOAT32 - Outputs:
var_47[1, 1, 640] FLOAT16,var_34[2, 1, 640] FLOAT16,var_35[2, 1, 640] FLOAT16
Joiner
- Inputs:
encoder_output[1, 1, 1024] FLOAT32,decoder_output[1, 1, 640] FLOAT32 - Outputs:
var_31[1, 1, 1, 1030] FLOAT16
Requirements
- macOS 14.0 (Sonoma) or newer
- Apple Silicon (M1, M2, M3, M4, M5)
Origin
Converted from the ONNX export of nvidia/parakeet-tdt-1.1b using coremltools. The encoder includes a fused mel-spectrogram preprocessor so raw audio can be passed directly.
License
Apache-2.0 (following the original NVIDIA model license)