Parakeet-TDT 1.1B CoreML

Pre-compiled CoreML model bundles for NVIDIA Parakeet-TDT 1.1B speech recognition, optimized for Apple Silicon (M1-M5) with full Apple Neural Engine (ANE) acceleration.

Model Details

Architecture: Token-and-Duration Transducer (TDT), an RNN-T variant, with greedy search
Parameters: 1.1 billion
Format: Pre-compiled .mlmodelc bundles (ready to load, no compilation needed)
Compute: ANE + GPU + CPU via CoreML ComputeUnits::All (Core ML chooses the device for each part of the model)
Input: Raw 16 kHz mono audio (15-second fixed window, 240000 samples)
Vocabulary: 1025 BPE tokens + 5 TDT duration classes
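
Because the input window is fixed at 15 seconds, shorter clips need padding and longer ones need truncating or chunking. A minimal sketch of that step, assuming zero padding is acceptable and that the caller tracks the valid sample count separately; `fit_to_window` is a hypothetical helper, not part of any library:

```rust
/// Sample rate and window length taken from the model details above.
const SAMPLE_RATE: usize = 16_000;
const WINDOW_SECONDS: usize = 15;
const WINDOW_SAMPLES: usize = SAMPLE_RATE * WINDOW_SECONDS; // 240_000

/// Hypothetical helper: fit a clip into the fixed window.
/// Returns the zero-padded (or truncated) buffer and the number of
/// valid, non-padding samples it contains.
fn fit_to_window(samples: &[f32]) -> (Vec<f32>, usize) {
    let valid = samples.len().min(WINDOW_SAMPLES);
    let mut buf = vec![0.0f32; WINDOW_SAMPLES];
    buf[..valid].copy_from_slice(&samples[..valid]);
    (buf, valid)
}
```

The returned buffer always has 240000 entries, and the second value is what would be passed as the audio length. Audio longer than one window is simply cut off here; splitting it into several windows is left to the caller.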

Files

File	Size	Description
`encoder.mlmodelc/`	1.9 GB	Fused preprocessor + encoder (mel spectrogram + conformer)
`decoder.mlmodelc/`	14 MB	LSTM prediction network (2 layers, 640 hidden)
`joiner.mlmodelc/`	3.5 MB	Joint network producing 1030 logits (1025 vocab + 5 durations)
`tokens.txt`	10 KB	BPE vocabulary (`<token> <id>` per line)
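
The `<token> <id>` layout of `tokens.txt` can be read into a lookup table with a few lines of Rust. This is a sketch under two assumptions: each line is split at its last space, and word boundaries use the SentencePiece marker `▁` (U+2581). `parse_tokens` and `detokenize` are hypothetical names:

```rust
/// Parse the contents of `tokens.txt` into a table indexed by token ID.
/// Assumes every non-empty line is `<token> <id>`, split at the last space.
fn parse_tokens(text: &str) -> Result<Vec<String>, String> {
    let mut pairs: Vec<(usize, String)> = Vec::new();
    for (n, line) in text.lines().enumerate() {
        let line = line.trim_end();
        if line.is_empty() {
            continue;
        }
        let (token, id) = line
            .rsplit_once(' ')
            .ok_or_else(|| format!("line {}: expected `<token> <id>`", n + 1))?;
        let id: usize = id
            .parse()
            .map_err(|e| format!("line {}: bad id: {}", n + 1, e))?;
        pairs.push((id, token.to_string()));
    }
    // Size the table by the largest ID so IDs can be used as indices.
    let size = pairs.iter().map(|(id, _)| id + 1).max().unwrap_or(0);
    let mut table = vec![String::new(); size];
    for (id, token) in pairs {
        table[id] = token;
    }
    Ok(table)
}

/// Join token pieces into text. Assumes the SentencePiece convention
/// where U+2581 marks the start of a word; unknown IDs are skipped.
fn detokenize(ids: &[usize], table: &[String]) -> String {
    let joined: String = ids
        .iter()
        .filter_map(|&i| table.get(i))
        .map(|s| s.as_str())
        .collect();
    joined.replace('\u{2581}', " ").trim().to_string()
}
```

In practice the input string would come from `std::fs::read_to_string` on the downloaded `tokens.txt`.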

Usage with Swictation

These models are downloaded automatically when you run `npm install swictation` on macOS with Apple Silicon.

npm install -g swictation
# Models download to ~/.local/share/swictation/models/parakeet-tdt-1.1b-coreml/

Manual download:

pip install "huggingface_hub[cli]"
hf download jenerallee78/parakeet-tdt-1.1b-coreml --local-dir ~/.local/share/swictation/models/parakeet-tdt-1.1b-coreml

Usage with coreml-native (Rust)

use coreml_native::{Model, ComputeUnits, BorrowedTensor};

let encoder = Model::load("encoder.mlmodelc", ComputeUnits::All)?;
let decoder = Model::load("decoder.mlmodelc", ComputeUnits::All)?;
let joiner = Model::load("joiner.mlmodelc", ComputeUnits::All)?;

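// Encoder: raw audio in, features out.
// `samples` must hold exactly 240000 f32 values (15 s at 16 kHz);
// `num_samples` is the number of valid samples in that buffer.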
let audio = BorrowedTensor::from_f32(&samples, &[1, 240000])?;
let length = BorrowedTensor::from_i32(&[num_samples as i32], &[1])?;
let pred = encoder.predict(&[("audio_signal", &audio), ("audio_length", &length)])?;
let (features, shape) = pred.get_f32("obj_3")?;

See coreml-native for the Rust bindings.

Model I/O Specification

Encoder

Inputs: audio_signal [1, 240000] FLOAT32, audio_length [1] INT32
Outputs: obj_3 [1, 1024, 188] FLOAT16, obj [1] INT32 (valid frame count)
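
The encoder output is laid out as [batch, feature, time], while the joiner consumes one 1024-dimensional frame at a time. The sketch below shows one way to pull a single time step out of the flat buffer. It assumes the buffer is contiguous and row-major and has already been converted to `f32`; `encoder_frame` is a hypothetical helper:

```rust
/// Dimensions from the encoder output spec above.
const ENC_DIM: usize = 1024;
const MAX_FRAMES: usize = 188;

/// Copy time step `t` out of a row-major [1, ENC_DIM, MAX_FRAMES] buffer.
/// Returns None if the buffer has the wrong size or `t` is out of range.
fn encoder_frame(features: &[f32], t: usize) -> Option<Vec<f32>> {
    if features.len() != ENC_DIM * MAX_FRAMES || t >= MAX_FRAMES {
        return None;
    }
    // Element (channel c, time t) sits at offset c * MAX_FRAMES + t.
    Some((0..ENC_DIM).map(|c| features[c * MAX_FRAMES + t]).collect())
}
```

Only frames below the valid frame count reported by the encoder carry real audio; later frames correspond to padding.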

Decoder

Inputs: targets [1, 1] INT32, target_length [1] INT32, h [2, 1, 640] FLOAT32, c [2, 1, 640] FLOAT32
Outputs: var_47 [1, 1, 640] FLOAT16, var_34 [2, 1, 640] FLOAT16, var_35 [2, 1, 640] FLOAT16

Joiner

Inputs: encoder_output [1, 1, 1024] FLOAT32, decoder_output [1, 1, 640] FLOAT32
Outputs: var_31 [1, 1, 1, 1030] FLOAT16
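
The three models are tied together by a greedy TDT loop: at each step the joiner's 1030 logits are split into a token choice and a duration choice, and the duration decides how many encoder frames to skip. The following is a sketch of that loop, not the implementation used by Swictation. It makes several assumptions: the blank ID and the duration table are supplied by the caller (NeMo TDT models commonly use the last token ID as blank and durations `[0, 1, 2, 3, 4]`), and the hypothetical `joint` closure stands in for running the decoder and joiner on frame `t` with the last emitted token.

```rust
const VOCAB: usize = 1025;      // token logits, per the joiner spec above
const NUM_DURATIONS: usize = 5; // duration logits, per the joiner spec above

/// Index of the largest value (first one wins on ties).
fn argmax(xs: &[f32]) -> usize {
    let mut best = 0;
    for (i, &x) in xs.iter().enumerate() {
        if x > xs[best] {
            best = i;
        }
    }
    best
}

/// Split one joiner output into (token ID, duration class index).
fn pick(logits: &[f32]) -> Option<(usize, usize)> {
    if logits.len() != VOCAB + NUM_DURATIONS {
        return None;
    }
    let (tokens, durations) = logits.split_at(VOCAB);
    Some((argmax(tokens), argmax(durations)))
}

/// Greedy TDT decoding sketch. `joint(t, last)` is a hypothetical callback
/// returning the 1030 logits for frame `t` given the last emitted token.
fn greedy_tdt<F>(
    num_frames: usize,
    blank: usize,                      // assumption: supplied by the caller
    durations: [usize; NUM_DURATIONS], // assumption: e.g. [0, 1, 2, 3, 4]
    max_symbols: usize,                // cap on tokens emitted per frame
    mut joint: F,
) -> Vec<usize>
where
    F: FnMut(usize, Option<usize>) -> Vec<f32>,
{
    let mut out = Vec::new();
    let mut last = None;
    let mut t = 0;
    let mut emitted_here = 0;
    while t < num_frames {
        let Some((token, d)) = pick(&joint(t, last)) else { break };
        let mut skip = durations[d];
        if token != blank {
            out.push(token);
            last = Some(token);
            emitted_here += 1;
        }
        // Force progress on a zero-duration blank or when the per-frame cap is hit.
        if skip == 0 && (token == blank || emitted_here >= max_symbols) {
            skip = 1;
        }
        if skip > 0 {
            emitted_here = 0;
        }
        t += skip;
    }
    out
}
```

In a real pipeline the closure would also carry the LSTM state (`h`, `c`) and only advance it after a non-blank token, and `num_frames` would be the valid frame count returned by the encoder.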

Requirements

macOS 14.0 (Sonoma) or newer
Apple Silicon (M1, M2, M3, M4, M5)

Origin

Converted from the ONNX export of nvidia/parakeet-tdt-1.1b using coremltools. The encoder includes a fused mel-spectrogram preprocessor so raw audio can be passed directly.

License

Apache-2.0 (following the original NVIDIA model license)
