Parakeet TDT 0.6B v3 -- Core ML (4-bit encoder palettization)

Core ML conversion of nvidia/parakeet-tdt-0.6b-v3 tuned for Apple silicon (macOS 14+ / iOS 17+).

Encoder: 4-bit k-means palettized, per_grouped_channel granularity (group_size=16). conv ops kept at fp16. 424 MB.
Decoder: fp16 LSTM prediction network. 23 MB.
Joint: fp16 joint network (8198-way [vocab | durations] head). 10 MB.
Tokenizer: HuggingFace tokenizers format (SentencePiece BPE, 8192 pieces + <blank> at id 8192 + 5 durations).

Total: ~457 MB. Stays fully on the Neural Engine or GPU at runtime (1342 ANE ops / 32 CPU ops on cpu_and_ne; 1374 GPU ops / 0 CPU ops on cpu_and_gpu -- no accelerator fallback).

Headline numbers on M5 Max

Benchmark: 17.5-min audio (MLK's "I Have a Dream"), 3-run medians.

Runtime	Target	Inference	RTFx	WER vs fp16 ref
Python (`coremltools`)	CPU	7.59 s	138×	3.10%
Python	ANE	4.27 s	246×	2.67%
Python	GPU	2.53 s	414×	3.10%
Swift (parakeet-coreml-swift)	CPU	6.43 s	163×	3.10%
Swift	ANE	2.60 s	402×	2.67%
Swift	GPU	0.92 s	1145×	3.10%

The Swift runtime is 17 / 64 / 176 % faster than Python on CPU / ANE / GPU against the exact same .mlpackage artifacts. See OPTIMIZATIONS.md for how -- pipeline + worker pool + zero-alloc decode loop.

Quick start

Swift (auto-downloads from here on first run)

import ParakeetTDT

let transcriber = try await ParakeetTranscriber.fromHuggingFace(
    computeUnits: .gpu   // or .ane (default), .cpu, .all
)
let result = try transcriber.transcribe(audioURL: audioURL)
print(result.text)            // "At this time I have the honor..."
print(result.rtfx, "x realtime")

Library + CLI live at mweinbach/parakeet-coreml-swift. First call downloads the artifacts from this repo to ~/Library/Caches/com.parakeet-tdt/ and compiles each .mlpackage to a .mlmodelc. Subsequent calls are ~0.2 s cold start.

CLI (no setup needed)

brew install swift    # if you don't have it
git clone https://github.com/mweinbach/parakeet-coreml-swift
cd parakeet-coreml-swift
swift run -c release parakeet transcribe /path/to/audio.wav --compute-units gpu

Python (`coremltools`)

import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download

# Download the full .mlpackage bundle.
root = snapshot_download("mweinbach1/parakeet-tdt-0.6b-v3-coreml")

encoder = ct.models.MLModel(f"{root}/encoder.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)
decoder = ct.models.MLModel(f"{root}/decoder.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)
joint   = ct.models.MLModel(f"{root}/joint.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)

# input_features: float32 [1, T, 128] log-mel (ParakeetFeatureExtractor)
# attention_mask: int32  [1, T]
enc_out = encoder.predict({
    "input_features": input_features,
    "attention_mask": attention_mask,
})
encoder_hidden = enc_out["encoder_hidden"]   # float32 [1, T//8, 640]
encoder_mask   = enc_out["encoder_mask"]     # int32   [1, T//8]

# Greedy TDT decode loop: feed encoder_hidden[:, t, :] into joint along with
# the LSTM decoder's current state, advance t by the argmax duration, etc.
# See https://github.com/mweinbach/parakeet-coreml-swift/blob/main/Sources/ParakeetTDT/GreedyTDTDecoder.swift
# for a clean reference implementation (Swift), or `greedy_tdt_decode` in
# the conversion repo for a Python version.

Full Python reference pipeline (feature extraction, tokenizer, greedy decode loop) lives in mweinbach/parakeet-coreml.

Why this configuration

We tried a lot of alternatives before landing on "encoder-only 4-bit per-grouped-channel palettization". The full optimization log is in OPTIMIZATIONS.md; short version:

Palettization beat linear quant at the same bit-width (1.16% WER @ int8 palettized vs 2.31% @ int8 linear). Apple tunes the ANE kernels for LUT representations.
per_grouped_channel is the ANE / GPU-friendly knob for aggressive low-bit compression. An earlier attempt at enable_per_channel_scale: true + 4 bits forced 492 encoder ops back to CPU and dropped RTFx by 5×. per_grouped_channel keeps everything on the accelerator.
Encoder-only compression is correct. Decoder + joint together are 33 MB of fp16. Compressing them saves ~15 MB at the cost of real WER; not worth it.

Compute plan

Core ML places the encoder entirely on the requested accelerator. No CPU fallback on either GPU or ANE:

Target	Encoder ANE	Encoder GPU	Encoder CPU
`cpu_and_ne`	1342	0	32
`cpu_and_gpu`	0	1374	0
`cpu_only`	0	0	1374

Decoder + joint always run on CPU -- they're small and per-call, so dispatch overhead outweighs any accelerator throughput win.

I/O shapes

Encoder:

Input	Shape	Dtype	Notes
`input_features`	`[1, 3000, 128]`	`float32`	log-mel, per-feature normalised
`attention_mask`	`[1, 3000]`	`int32`	1 = valid, 0 = pad

Output	Shape	Dtype
`encoder_hidden`	`[1, 375, 640]`	`float32`
`encoder_mask`	`[1, 375]`	`int32`

Decoder (one step):

Input	Shape	Dtype
`input_ids`	`[1, 1]`	`int32`
`hidden`	`[2, 1, 640]`	`float32`
`cell`	`[2, 1, 640]`	`float32`

Output	Shape	Dtype
`decoder_hidden`	`[1, 1, 640]`	`float32`
`next_hidden`	`[2, 1, 640]`	`float32`
`next_cell`	`[2, 1, 640]`	`float32`

Joint:

Input	Shape	Dtype
`encoder_frame`	`[1, 640]`	`float32`
`decoder_state`	`[1, 640]`	`float32`

Output	Shape	Dtype
`token_logits`	`[1, 8193]`	`float32` (vocab + blank at id 8192)
`duration_logits`	`[1, 5]`	`float32` (`{0, 1, 2, 3, 4}`)

Feature extraction

The encoder expects log-mel features produced by transformers.models.parakeet.feature_extraction_parakeet.ParakeetFeatureExtractor:

16 kHz sample rate, 128 mel bins
n_fft=512, win_length=400, hop_length=160
Preemphasis 0.97, Slaney mel norm
log(mel + 2^-24) for numeric stability
Per-mel-bin normalization over time (ignoring padding)

The Swift package ships a MelFeatureExtractor (Accelerate / vDSP) that reproduces the same numerics.

Deployment target

Built for macOS15 / iOS18 minimum. Uses per_grouped_channel palettization, which requires those OS versions.

License

Core ML conversion: CC-BY-4.0, inheriting from nvidia/parakeet-tdt-0.6b-v3.
Tokenizer: same, from NVIDIA's release.

Credits

Base model: NVIDIA's Parakeet-TDT-0.6B-v3 (FastConformer encoder + LSTM prediction network + TDT joint head).
Conversion + optimization work: mweinbach/parakeet-coreml.
Swift runtime: mweinbach/parakeet-coreml-swift.

Citation

@misc{parakeet-tdt-0.6b-v3-coreml,
  title  = {Parakeet TDT 0.6B v3 Core ML (4-bit encoder palettization)},
  author = {Weinbach, Max},
  year   = {2026},
  url    = {https://huggingface.co/mweinbach1/parakeet-tdt-0.6b-v3-coreml}
}

Upstream:

@software{parakeet-tdt-0.6b-v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA Corporation},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}

Downloads last month: 128

Model tree for mweinbach1/parakeet-tdt-0.6b-v3-coreml

Base model

nvidia/parakeet-tdt-0.6b-v3

Quantized

(37)

this model