Parakeet-TDT 1.1B CoreML

Pre-compiled CoreML model bundles for NVIDIA Parakeet-TDT 1.1B speech recognition, optimized for Apple Silicon (M1-M5) with full Apple Neural Engine (ANE) acceleration.

Model Details

  • Architecture: Transducer-Duration-Transducer (TDT) with RNN-T greedy search
  • Parameters: 1.1 billion
  • Format: Pre-compiled .mlmodelc bundles (ready to load, no compilation needed)
  • Compute: Runs on ANE + GPU + CPU via CoreML ComputeUnits::All
  • Input: Raw 16kHz mono audio (15-second fixed window)
  • Vocabulary: 1025 BPE tokens + 5 TDT duration classes

Files

File Size Description
encoder.mlmodelc/ 1.9 GB Fused preprocessor + encoder (mel spectrogram + conformer)
decoder.mlmodelc/ 14 MB LSTM prediction network (2 layers, 640 hidden)
joiner.mlmodelc/ 3.5 MB Joint network producing 1030 logits (1025 vocab + 5 durations)
tokens.txt 10 KB BPE vocabulary (<token> <id> per line)

Usage with Swictation

These models are automatically downloaded during npm install swictation on macOS Apple Silicon.

npm install -g swictation
# Models download to ~/.local/share/swictation/models/parakeet-tdt-1.1b-coreml/

Manual download:

pip install huggingface_hub[cli]
hf download jenerallee78/parakeet-tdt-1.1b-coreml --local-dir ~/.local/share/swictation/models/parakeet-tdt-1.1b-coreml

Usage with coreml-native (Rust)

use coreml_native::{Model, ComputeUnits, BorrowedTensor};

let encoder = Model::load("encoder.mlmodelc", ComputeUnits::All)?;
let decoder = Model::load("decoder.mlmodelc", ComputeUnits::All)?;
let joiner = Model::load("joiner.mlmodelc", ComputeUnits::All)?;

// Encoder: raw audio in, features out
let audio = BorrowedTensor::from_f32(&samples, &[1, 240000])?;
let length = BorrowedTensor::from_i32(&[num_samples as i32], &[1])?;
let pred = encoder.predict(&[("audio_signal", &audio), ("audio_length", &length)])?;
let (features, shape) = pred.get_f32("obj_3")?;

See coreml-native for the Rust bindings.

Model I/O Specification

Encoder

  • Inputs: audio_signal [1, 240000] FLOAT32, audio_length [1] INT32
  • Outputs: obj_3 [1, 1024, 188] FLOAT16, obj [1] INT32 (valid frame count)

Decoder

  • Inputs: targets [1, 1] INT32, target_length [1] INT32, h [2, 1, 640] FLOAT32, c [2, 1, 640] FLOAT32
  • Outputs: var_47 [1, 1, 640] FLOAT16, var_34 [2, 1, 640] FLOAT16, var_35 [2, 1, 640] FLOAT16

Joiner

  • Inputs: encoder_output [1, 1, 1024] FLOAT32, decoder_output [1, 1, 640] FLOAT32
  • Outputs: var_31 [1, 1, 1, 1030] FLOAT16

Requirements

  • macOS 14.0 (Sonoma) or newer
  • Apple Silicon (M1, M2, M3, M4, M5)

Origin

Converted from the ONNX export of nvidia/parakeet-tdt-1.1b using coremltools. The encoder includes a fused mel-spectrogram preprocessor so raw audio can be passed directly.

License

Apache-2.0 (following the original NVIDIA model license)

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support