Parakeet TDT 0.6B v3 -- Core ML (4-bit encoder palettization)
Core ML conversion of nvidia/parakeet-tdt-0.6b-v3
tuned for Apple silicon (macOS 15+ / iOS 18+).
- Encoder: 4-bit k-means palettized, per_grouped_channel granularity (group_size=16). conv ops kept at fp16. 424 MB.
- Decoder: fp16 LSTM prediction network. 23 MB.
- Joint: fp16 joint network (8198-way [vocab | durations] head). 10 MB.
- Tokenizer: HuggingFace tokenizers format (SentencePiece BPE, 8192 pieces + <blank> at id 8192 + 5 durations).
Total: ~457 MB. The encoder stays on the requested accelerator at runtime:
1342 ANE ops / 32 CPU ops on cpu_and_ne, and 1374 GPU ops / 0 CPU ops on
cpu_and_gpu.
Headline numbers on M5 Max
Benchmark: 17.5-min audio (MLK's "I Have a Dream"), 3-run medians.
| Runtime | Target | Inference | RTFx | WER vs fp16 ref |
|---|---|---|---|---|
| Python (coremltools) | CPU | 7.59 s | 138× | 3.10% |
| Python | ANE | 4.27 s | 246× | 2.67% |
| Python | GPU | 2.53 s | 414× | 3.10% |
| Swift (parakeet-coreml-swift) | CPU | 6.43 s | 163× | 3.10% |
| Swift | ANE | 2.60 s | 402× | 2.67% |
| Swift | GPU | 0.92 s | 1145× | 3.10% |
The Swift runtime is 17 / 64 / 176 % faster than Python on CPU / ANE /
GPU against the exact same .mlpackage artifacts. See OPTIMIZATIONS.md
for how: a pipeline, a worker pool, and a zero-allocation decode loop.
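RTFx in the table is audio duration divided by wall-clock inference time. A minimal sketch of that arithmetic, assuming the clip is exactly 17.5 minutes (1050 s). The function names are illustrative, and small differences from the table come from the rounded duration and timings.

```python
def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / inference_seconds


def speedup_percent(baseline_seconds: float, faster_seconds: float) -> float:
    """How much faster the second runtime is, relative to the baseline."""
    return (baseline_seconds / faster_seconds - 1.0) * 100.0


AUDIO_SECONDS = 17.5 * 60  # assumption: the benchmark clip is exactly 17.5 min

python_cpu = rtfx(AUDIO_SECONDS, 7.59)       # Python / CPU row
swift_vs_python_gpu = speedup_percent(2.53, 0.92)  # GPU rows, Python vs Swift

print(f"Python CPU RTFx: about {python_cpu:.0f}x")
print(f"Swift GPU vs Python GPU: about {swift_vs_python_gpu:.0f}% faster")
```

The same two helpers reproduce the other rows when given their inference times.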
Quick start
Swift (auto-downloads from here on first run)
import Foundation
import ParakeetTDT
// Path is a placeholder; point it at your own audio file.
let audioURL = URL(fileURLWithPath: "/path/to/audio.wav")
let transcriber = try await ParakeetTranscriber.fromHuggingFace(
    computeUnits: .gpu  // or .ane (default), .cpu, .all
)
let result = try transcriber.transcribe(audioURL: audioURL)
print(result.text)  // "At this time I have the honor..."
print(result.rtfx, "x realtime")
Library + CLI live at
mweinbach/parakeet-coreml-swift.
The first call downloads the artifacts from this repo to
~/Library/Caches/com.parakeet-tdt/ and compiles each .mlpackage to a
.mlmodelc. Subsequent launches reuse the compiled models and cold-start in ~0.2 s.
CLI (no setup needed)
brew install swift # if you don't have it
git clone https://github.com/mweinbach/parakeet-coreml-swift
cd parakeet-coreml-swift
swift run -c release parakeet transcribe /path/to/audio.wav --compute-units gpu
Python (coremltools)
import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download
# Download the full .mlpackage bundle.
root = snapshot_download("mweinbach1/parakeet-tdt-0.6b-v3-coreml")
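encoder = ct.models.MLModel(f"{root}/encoder.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)
decoder = ct.models.MLModel(f"{root}/decoder.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)
joint = ct.models.MLModel(f"{root}/joint.mlpackage",
                          compute_units=ct.ComputeUnit.CPU_AND_NE)
# input_features: float32 [1, T, 128] log-mel (ParakeetFeatureExtractor)
# attention_mask: int32 [1, T]
# T = 3000 for the shipped encoder (see I/O shapes). The arrays below are
# placeholders so the snippet runs; replace them with real features.
input_features = np.zeros((1, 3000, 128), dtype=np.float32)
attention_mask = np.ones((1, 3000), dtype=np.int32)
enc_out = encoder.predict({
    "input_features": input_features,
    "attention_mask": attention_mask,
})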
encoder_hidden = enc_out["encoder_hidden"] # float32 [1, T//8, 640]
encoder_mask = enc_out["encoder_mask"] # int32 [1, T//8]
# Greedy TDT decode loop: feed encoder_hidden[:, t, :] into joint along with
# the LSTM decoder's current state, advance t by the argmax duration, etc.
# See https://github.com/mweinbach/parakeet-coreml-swift/blob/main/Sources/ParakeetTDT/GreedyTDTDecoder.swift
# for a clean reference implementation (Swift), or `greedy_tdt_decode` in
# the conversion repo for a Python version.
Full Python reference pipeline (feature extraction, tokenizer, greedy decode loop) lives in mweinbach/parakeet-coreml.
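For orientation, the greedy TDT loop described in the comments above can be sketched as follows. This is not the code from either repo: decoder_step and joint_step are hypothetical wrappers around decoder.predict and joint.predict, and the blank start symbol and the per-frame emission cap are assumptions modelled on common greedy TDT decoding.

```python
from typing import Any, Callable, List, Sequence, Tuple

BLANK_ID = 8192              # <blank> id from the tokenizer description
DURATIONS = (0, 1, 2, 3, 4)  # frame skips predicted by the duration head


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_tdt_decode(
    encoder_frames: Sequence[Any],
    decoder_step: Callable[[int, Any], Tuple[Any, Any]],
    joint_step: Callable[[Any, Any], Tuple[Sequence[float], Sequence[float]]],
    max_symbols_per_frame: int = 10,
) -> List[int]:
    tokens: List[int] = []
    # Assumption: the prediction network is primed with blank as start symbol.
    dec_out, state = decoder_step(BLANK_ID, None)
    t = 0
    emitted_here = 0
    while t < len(encoder_frames):
        token_logits, duration_logits = joint_step(encoder_frames[t], dec_out)
        token = argmax(token_logits)
        skip = DURATIONS[argmax(duration_logits)]
        if token == BLANK_ID:
            # Blank with duration 0 would never advance, so force one frame.
            skip = max(skip, 1)
        else:
            tokens.append(token)
            # Only non-blank tokens update the LSTM state.
            dec_out, state = decoder_step(token, state)
            emitted_here += 1
            if skip == 0 and emitted_here >= max_symbols_per_frame:
                skip = 1  # safety cap on emissions at a single frame
        if skip > 0:
            emitted_here = 0
        t += skip
    return tokens
```

With the Core ML models, encoder_frames would be the valid rows of encoder_hidden[0], and the two callables would pack and unpack the input and output dictionaries listed under I/O shapes.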
Why this configuration
We tried a lot of alternatives before landing on "encoder-only 4-bit per-grouped-channel palettization". The full optimization log is in OPTIMIZATIONS.md; short version:
- Palettization beat linear quant at the same bit-width (1.16% WER @ int8 palettized vs 2.31% @ int8 linear). Apple tunes the ANE kernels for LUT representations.
- per_grouped_channel is the ANE / GPU-friendly knob for aggressive low-bit compression. An earlier attempt at enable_per_channel_scale: true + 4 bits forced 492 encoder ops back to CPU and dropped RTFx by 5×. per_grouped_channel keeps everything on the accelerator.
- Encoder-only compression is correct. Decoder + joint together are 33 MB of fp16. Compressing them saves ~15 MB at the cost of real WER; not worth it.
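The chosen configuration maps onto the coremltools weight palettization API roughly as follows. This is a sketch rather than the conversion repo's actual script: palettize_encoder and its path arguments are hypothetical, and skipping conv ops through op_type_configs is an assumption about how "conv ops kept at fp16" was achieved.

```python
# Option values taken from the encoder description in this card.
PALETTIZER_KWARGS = {
    "mode": "kmeans",
    "nbits": 4,
    "granularity": "per_grouped_channel",
    "group_size": 16,
}


def palettize_encoder(src_path: str, dst_path: str):
    """Palettize an fp16 encoder .mlpackage and save the result (hypothetical helper)."""
    # Imported lazily so the constants above can be used without coremltools.
    import coremltools as ct
    import coremltools.optimize.coreml as cto

    op_config = cto.OpPalettizerConfig(**PALETTIZER_KWARGS)
    config = cto.OptimizationConfig(
        global_config=op_config,
        # Assumption: a None entry leaves that op type uncompressed (fp16).
        op_type_configs={"conv": None},
    )
    model = ct.models.MLModel(src_path)
    compressed = cto.palettize_weights(model, config)
    compressed.save(dst_path)
    return compressed
```

Only the encoder would go through this step; the decoder and joint stay as plain fp16 packages.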
Compute plan
Core ML places the encoder on the requested accelerator. With cpu_and_gpu all 1374 ops run on the GPU; with cpu_and_ne 1342 of 1374 ops run on the ANE and the remaining 32 run on CPU:
| Target | Encoder ANE | Encoder GPU | Encoder CPU |
|---|---|---|---|
| cpu_and_ne | 1342 | 0 | 32 |
| cpu_and_gpu | 0 | 1374 | 0 |
| cpu_only | 0 | 0 | 1374 |
Decoder + joint always run on CPU -- they're small and invoked once per decode step, so dispatch overhead outweighs any accelerator throughput win.
I/O shapes
Encoder:
| Input | Shape | Dtype | Notes |
|---|---|---|---|
| input_features | [1, 3000, 128] | float32 | log-mel, per-feature normalised |
| attention_mask | [1, 3000] | int32 | 1 = valid, 0 = pad |
| Output | Shape | Dtype |
|---|---|---|
| encoder_hidden | [1, 375, 640] | float32 |
| encoder_mask | [1, 375] | int32 |
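Because the encoder input is fixed at 3000 frames, shorter clips need padding plus a matching mask. A dependency-free sketch of that step; pad_features is a hypothetical helper, and zero as the padding value is an assumption.

```python
from typing import List, Tuple

MAX_FRAMES = 3000  # fixed encoder time dimension from the table above
N_MELS = 128


def pad_features(
    frames: List[List[float]],
) -> Tuple[List[List[List[float]]], List[List[int]]]:
    """Pad [T, 128] log-mel frames to [1, 3000, 128] and build the [1, 3000] mask."""
    if len(frames) > MAX_FRAMES:
        raise ValueError(f"{len(frames)} frames exceed the {MAX_FRAMES}-frame window")
    if any(len(frame) != N_MELS for frame in frames):
        raise ValueError(f"every frame must have {N_MELS} mel bins")
    pad = MAX_FRAMES - len(frames)
    padded = [list(frame) for frame in frames] + [[0.0] * N_MELS for _ in range(pad)]
    mask = [1] * len(frames) + [0] * pad  # 1 = valid, 0 = pad
    return [padded], [mask]
```

Before calling encoder.predict, convert the two results with np.asarray(..., dtype=np.float32) and np.asarray(..., dtype=np.int32). Audio longer than one window has to be split first; how the runtimes do that is not covered here.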
Decoder (one step):
| Input | Shape | Dtype |
|---|---|---|
| input_ids | [1, 1] | int32 |
| hidden | [2, 1, 640] | float32 |
| cell | [2, 1, 640] | float32 |
| Output | Shape | Dtype |
|---|---|---|
| decoder_hidden | [1, 1, 640] | float32 |
| next_hidden | [2, 1, 640] | float32 |
| next_cell | [2, 1, 640] | float32 |
Joint:
| Input | Shape | Dtype |
|---|---|---|
| encoder_frame | [1, 640] | float32 |
| decoder_state | [1, 640] | float32 |
| Output | Shape | Dtype |
|---|---|---|
| token_logits | [1, 8193] | float32 (vocab + blank at id 8192) |
| duration_logits | [1, 5] | float32 ({0, 1, 2, 3, 4}) |
Feature extraction
The encoder expects log-mel features produced by
transformers.models.parakeet.feature_extraction_parakeet.ParakeetFeatureExtractor:
- 16 kHz sample rate, 128 mel bins
- n_fft=512, win_length=400, hop_length=160
- Preemphasis 0.97, Slaney mel norm
- log(mel + 2^-24) for numeric stability
- Per-mel-bin normalization over time (ignoring padding)
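Three of the steps listed above (preemphasis, the guarded log, and per-bin normalization) are simple enough to sketch without dependencies. The STFT and Slaney mel filterbank are omitted. The function names are illustrative, and the unbiased variance, the eps term, and zeroing padded frames are assumptions, not confirmed details of ParakeetFeatureExtractor.

```python
import math
from typing import List

LOG_GUARD = 2.0 ** -24  # added before the log, per the list above


def preemphasis(samples: List[float], coeff: float = 0.97) -> List[float]:
    """y[0] = x[0]; y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def log_mel(mel: List[List[float]]) -> List[List[float]]:
    """Element-wise log(mel + 2^-24) over [T, n_mels] mel energies."""
    return [[math.log(v + LOG_GUARD) for v in frame] for frame in mel]


def normalize_per_bin(
    features: List[List[float]], valid_frames: int, eps: float = 1e-5
) -> List[List[float]]:
    """Normalize each mel bin over time using only the first valid_frames frames."""
    out = [list(frame) for frame in features]
    if not features or valid_frames <= 0:
        return out
    for b in range(len(features[0])):
        column = [features[t][b] for t in range(valid_frames)]
        mean = sum(column) / valid_frames
        # Assumption: unbiased (N-1) variance plus a small eps on the std.
        var = sum((v - mean) ** 2 for v in column) / max(valid_frames - 1, 1)
        std = math.sqrt(var) + eps
        for t in range(len(features)):
            # Assumption: padded frames are zeroed after normalization.
            out[t][b] = (features[t][b] - mean) / std if t < valid_frames else 0.0
    return out
```

For real inputs, prefer the extractor named above or the Swift MelFeatureExtractor; this sketch only makes the listed steps concrete.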
The Swift package ships a MelFeatureExtractor
(Accelerate / vDSP) that reproduces the same numerics.
Deployment target
Built for macOS 15 / iOS 18 minimum. Uses per_grouped_channel
palettization, which requires those OS versions.
License
- Core ML conversion: CC-BY-4.0, inheriting from nvidia/parakeet-tdt-0.6b-v3.
- Tokenizer: same, from NVIDIA's release.
Credits
- Base model: NVIDIA's Parakeet-TDT-0.6B-v3 (FastConformer encoder + LSTM prediction network + TDT joint head).
- Conversion + optimization work: mweinbach/parakeet-coreml.
- Swift runtime: mweinbach/parakeet-coreml-swift.
Citation
@misc{parakeet-tdt-0.6b-v3-coreml,
title = {Parakeet TDT 0.6B v3 Core ML (4-bit encoder palettization)},
author = {Weinbach, Max},
year = {2026},
url = {https://huggingface.co/mweinbach1/parakeet-tdt-0.6b-v3-coreml}
}
Upstream:
@software{parakeet-tdt-0.6b-v3,
title = {Parakeet-TDT-0.6B-v3},
author = {NVIDIA Corporation},
year = {2025},
url = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}