Parakeet TDT 0.6B v3 -- Core ML (4-bit encoder palettization)

Core ML conversion of nvidia/parakeet-tdt-0.6b-v3, tuned for Apple silicon (macOS 15+ / iOS 18+).

  • Encoder: 4-bit k-means palettized, per_grouped_channel granularity (group_size=16). conv ops kept at fp16. 424 MB.
  • Decoder: fp16 LSTM prediction network. 23 MB.
  • Joint: fp16 joint network (8198-way [vocab | durations] head). 10 MB.
  • Tokenizer: HuggingFace tokenizers format (SentencePiece BPE, 8192 pieces + <blank> at id 8192 + 5 durations).

Total: ~457 MB. At runtime the encoder stays on the requested accelerator: on cpu_and_gpu all 1374 encoder ops run on the GPU (0 on the CPU); on cpu_and_ne 1342 ops run on the Neural Engine and 32 on the CPU.

Headline numbers on M5 Max

Benchmark: 17.5-min audio (MLK's "I Have a Dream"), 3-run medians.

| Runtime | Target | Inference | RTFx | WER vs fp16 ref |
|---|---|---|---|---|
| Python (coremltools) | CPU | 7.59 s | 138× | 3.10% |
| Python | ANE | 4.27 s | 246× | 2.67% |
| Python | GPU | 2.53 s | 414× | 3.10% |
| Swift (parakeet-coreml-swift) | CPU | 6.43 s | 163× | 3.10% |
| Swift | ANE | 2.60 s | 402× | 2.67% |
| Swift | GPU | 0.92 s | 1145× | 3.10% |

The Swift runtime is 17% / 64% / 176% faster than Python on CPU / ANE / GPU when running the exact same .mlpackage artifacts. OPTIMIZATIONS.md explains how: pipelining, a worker pool, and a zero-allocation decode loop.
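For readers who want to reproduce the derived columns, here is a small sketch of the arithmetic, assuming RTFx means audio duration divided by inference time and "faster" means the ratio of inference times minus one. The function names are illustrative, and results computed from the rounded table values can differ from the published figures by a point or so.

```python
AUDIO_SECONDS = 17.5 * 60  # the 17.5-minute benchmark clip


def rtfx(audio_seconds: float, inference_seconds: float) -> float:
    """Real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / inference_seconds


def percent_faster(baseline_seconds: float, new_seconds: float) -> float:
    """How much faster `new` is than `baseline`, as a percentage."""
    return (baseline_seconds / new_seconds - 1.0) * 100.0


# Inference times in seconds, copied from the benchmark table.
python_times = {"CPU": 7.59, "ANE": 4.27, "GPU": 2.53}
swift_times = {"CPU": 6.43, "ANE": 2.60, "GPU": 0.92}

for target, py_time in python_times.items():
    sw_time = swift_times[target]
    print(
        f"{target}: Python {rtfx(AUDIO_SECONDS, py_time):.0f}x, "
        f"Swift {rtfx(AUDIO_SECONDS, sw_time):.0f}x, "
        f"Swift is {percent_faster(py_time, sw_time):.0f}% faster"
    )
```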

Quick start

Swift (auto-downloads from here on first run)

import ParakeetTDT

let transcriber = try await ParakeetTranscriber.fromHuggingFace(
    computeUnits: .gpu   // or .ane (default), .cpu, .all
)
let result = try transcriber.transcribe(audioURL: audioURL)
print(result.text)            // "At this time I have the honor..."
print(result.rtfx, "x realtime")

Library + CLI live at mweinbach/parakeet-coreml-swift. The first call downloads the artifacts from this repo to ~/Library/Caches/com.parakeet-tdt/ and compiles each .mlpackage to a .mlmodelc. After that, a cold start takes ~0.2 s.

CLI (no setup needed)

brew install swift    # if you don't have it
git clone https://github.com/mweinbach/parakeet-coreml-swift
cd parakeet-coreml-swift
swift run -c release parakeet transcribe /path/to/audio.wav --compute-units gpu

Python (coremltools)

import coremltools as ct
import numpy as np
from huggingface_hub import snapshot_download

# Download the full .mlpackage bundle.
root = snapshot_download("mweinbach1/parakeet-tdt-0.6b-v3-coreml")

encoder = ct.models.MLModel(f"{root}/encoder.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)
decoder = ct.models.MLModel(f"{root}/decoder.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)
joint   = ct.models.MLModel(f"{root}/joint.mlpackage",
                            compute_units=ct.ComputeUnit.CPU_AND_NE)

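# input_features: float32 [1, T, 128] log-mel (ParakeetFeatureExtractor)
# attention_mask: int32  [1, T]
# Placeholder inputs so the snippet runs end to end (T = 3000, see "I/O shapes");
# replace them with real features from ParakeetFeatureExtractor.
input_features = np.zeros((1, 3000, 128), dtype=np.float32)
attention_mask = np.ones((1, 3000), dtype=np.int32)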
enc_out = encoder.predict({
    "input_features": input_features,
    "attention_mask": attention_mask,
})
encoder_hidden = enc_out["encoder_hidden"]   # float32 [1, T//8, 640]
encoder_mask   = enc_out["encoder_mask"]     # int32   [1, T//8]

# Greedy TDT decode loop: feed encoder_hidden[:, t, :] into joint along with
# the LSTM decoder's current state, advance t by the argmax duration, etc.
# See https://github.com/mweinbach/parakeet-coreml-swift/blob/main/Sources/ParakeetTDT/GreedyTDTDecoder.swift
# for a clean reference implementation (Swift), or `greedy_tdt_decode` in
# the conversion repo for a Python version.

Full Python reference pipeline (feature extraction, tokenizer, greedy decode loop) lives in mweinbach/parakeet-coreml.
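As a rough illustration of the greedy TDT loop described in the comments above, here is a minimal NumPy sketch. It is not the `greedy_tdt_decode` from the conversion repo: `decoder_step` and `joint_step` are hypothetical callables you would wrap around `decoder.predict` and `joint.predict`, the blank id is assumed to double as the start token, and `max_symbols_per_frame` is an assumed safety cap.

```python
import numpy as np

BLANK_ID = 8192            # <blank> sits right after the 8192 vocabulary pieces
DURATIONS = (0, 1, 2, 3, 4)


def greedy_tdt_decode(encoder_hidden, valid_frames, decoder_step, joint_step,
                      max_symbols_per_frame=10):
    """Greedy TDT decode over one utterance.

    encoder_hidden: float32 [T', 640] encoder frames.
    valid_frames:   number of non-padded encoder frames.
    decoder_step(token_id, hidden, cell) -> (decoder_out, hidden, cell)
    joint_step(encoder_frame, decoder_out) -> (token_logits, duration_logits)
    """
    hidden = np.zeros((2, 1, 640), dtype=np.float32)
    cell = np.zeros((2, 1, 640), dtype=np.float32)
    # Assumption: prime the prediction network with the blank id.
    decoder_out, hidden, cell = decoder_step(BLANK_ID, hidden, cell)

    tokens = []
    t = 0
    symbols_here = 0  # non-blank tokens emitted without advancing t
    while t < valid_frames:
        token_logits, duration_logits = joint_step(encoder_hidden[t], decoder_out)
        token = int(np.argmax(token_logits))
        skip = DURATIONS[int(np.argmax(duration_logits))]

        if token != BLANK_ID:
            tokens.append(token)
            decoder_out, hidden, cell = decoder_step(token, hidden, cell)
            symbols_here += 1

        # Never stall: a blank with duration 0, or too many tokens on one
        # frame, forces the loop to move forward by one frame.
        if skip == 0 and (token == BLANK_ID or symbols_here >= max_symbols_per_frame):
            skip = 1
        if skip > 0:
            symbols_here = 0
        t += skip
    return tokens
```

With the Core ML models, `valid_frames` would plausibly come from summing `encoder_mask`, and the returned ids would go through the tokenizer to produce text.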

Why this configuration

We tried a lot of alternatives before landing on "encoder-only 4-bit per-grouped-channel palettization". The full optimization log is in OPTIMIZATIONS.md; short version:

  • Palettization beat linear quantization at the same bit-width (1.16% WER with 8-bit palettization vs 2.31% with int8 linear quantization). Apple tunes the ANE kernels for LUT representations.
  • per_grouped_channel is the ANE / GPU-friendly knob for aggressive low-bit compression. An earlier attempt at enable_per_channel_scale: true + 4 bits forced 492 encoder ops back to CPU and dropped RTFx by 5×. per_grouped_channel keeps everything on the accelerator.
  • Encoder-only compression is the right scope. Decoder + joint together are only 33 MB of fp16; compressing them would save ~15 MB at a real WER cost, which is not worth it.
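To make "4-bit per-grouped-channel palettization" more concrete, here is a toy NumPy sketch of the idea: each group of 16 channels gets its own 16-entry lookup table fitted with k-means, and every weight is stored as a 4-bit index into that table. This is an illustration only, not how coremltools implements it; the helper names are hypothetical, and grouping along the first (output-channel) axis is an assumption.

```python
import numpy as np


def kmeans_1d(values, n_clusters, n_iters=20):
    """Plain Lloyd's k-means on a 1-D array; returns (centroids, assignments)."""
    # Start from evenly spaced quantiles so every centroid begins inside the data range.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            members = values[idx == k]
            if members.size:
                centroids[k] = members.mean()
    idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, idx


def palettize_grouped(weight, nbits=4, group_size=16):
    """weight: [channels, features]. Fits one LUT per group of `group_size` channels."""
    n_clusters = 2 ** nbits
    luts = []
    indices = np.empty(weight.shape, dtype=np.uint8)
    for start in range(0, weight.shape[0], group_size):
        block = weight[start:start + group_size]
        lut, idx = kmeans_1d(block.ravel(), n_clusters)
        luts.append(lut)
        indices[start:start + group_size] = idx.reshape(block.shape)
    return np.stack(luts), indices


def depalettize(luts, indices, group_size=16):
    """Rebuild an approximate weight matrix from the LUTs and indices."""
    out = np.empty(indices.shape, dtype=luts.dtype)
    for g, lut in enumerate(luts):
        rows = slice(g * group_size, (g + 1) * group_size)
        out[rows] = lut[indices[rows]]
    return out


rng = np.random.default_rng(0)
weight = rng.standard_normal((32, 64)).astype(np.float32)
luts, indices = palettize_grouped(weight)
approx = depalettize(luts, indices)
print("LUTs:", luts.shape, "relative MSE:", np.mean((weight - approx) ** 2) / np.var(weight))
```

Smaller groups mean more lookup tables (slightly more storage) but a closer fit to each group's weight distribution, which is the trade-off that `group_size` controls.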

Compute plan

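Core ML places the encoder almost entirely on the requested accelerator. On cpu_and_gpu no ops fall back to the CPU; on cpu_and_ne only 32 of 1374 ops do. Encoder op counts per device:

| Target | Encoder ANE | Encoder GPU | Encoder CPU |
|---|---|---|---|
| cpu_and_ne | 1342 | 0 | 32 |
| cpu_and_gpu | 0 | 1374 | 0 |
| cpu_only | 0 | 0 | 1374 |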

Decoder + joint always run on CPU -- they're small and per-call, so dispatch overhead outweighs any accelerator throughput win.

I/O shapes

Encoder:

| Input | Shape | Dtype | Notes |
|---|---|---|---|
| input_features | [1, 3000, 128] | float32 | log-mel, per-feature normalised |
| attention_mask | [1, 3000] | int32 | 1 = valid, 0 = pad |

| Output | Shape | Dtype |
|---|---|---|
| encoder_hidden | [1, 375, 640] | float32 |
| encoder_mask | [1, 375] | int32 |
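If the encoder only accepts the 3000-frame shape listed here (about 30 s of audio at a 160-sample hop and 16 kHz), shorter clips need padding and a matching mask. A minimal sketch, assuming zero is an acceptable pad value; `pad_to_encoder_window` is a hypothetical helper, not part of either repo:

```python
import numpy as np

MAX_FRAMES = 3000  # encoder input length from the table above
N_MELS = 128


def pad_to_encoder_window(features):
    """Pad log-mel features [T, 128] (T <= 3000) to the encoder's input shape.

    Returns (input_features [1, 3000, 128] float32, attention_mask [1, 3000] int32).
    """
    n_frames = features.shape[0]
    if n_frames > MAX_FRAMES:
        raise ValueError(f"{n_frames} frames exceeds the {MAX_FRAMES}-frame window")
    input_features = np.zeros((1, MAX_FRAMES, N_MELS), dtype=np.float32)
    input_features[0, :n_frames] = features
    attention_mask = np.zeros((1, MAX_FRAMES), dtype=np.int32)
    attention_mask[0, :n_frames] = 1  # 1 = valid, 0 = pad
    return input_features, attention_mask


# Example: a 12-second clip is roughly 1200 frames.
clip = np.ones((1200, N_MELS), dtype=np.float32)
input_features, attention_mask = pad_to_encoder_window(clip)
print(input_features.shape, attention_mask.shape, int(attention_mask.sum()))
```

Audio longer than one window would have to be split into chunks before this step; how the reference pipelines chunk and stitch is not covered here.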

Decoder (one step):

| Input | Shape | Dtype |
|---|---|---|
| input_ids | [1, 1] | int32 |
| hidden | [2, 1, 640] | float32 |
| cell | [2, 1, 640] | float32 |

| Output | Shape | Dtype |
|---|---|---|
| decoder_hidden | [1, 1, 640] | float32 |
| next_hidden | [2, 1, 640] | float32 |
| next_cell | [2, 1, 640] | float32 |

Joint:

| Input | Shape | Dtype |
|---|---|---|
| encoder_frame | [1, 640] | float32 |
| decoder_state | [1, 640] | float32 |

| Output | Shape | Dtype | Notes |
|---|---|---|---|
| token_logits | [1, 8193] | float32 | vocab + blank at id 8192 |
| duration_logits | [1, 5] | float32 | durations {0, 1, 2, 3, 4} |

Feature extraction

The encoder expects log-mel features produced by transformers.models.parakeet.feature_extraction_parakeet.ParakeetFeatureExtractor:

  • 16 kHz sample rate, 128 mel bins
  • n_fft=512, win_length=400, hop_length=160
  • Preemphasis 0.97, Slaney mel norm
  • log(mel + 2^-24) for numeric stability
  • Per-mel-bin normalization over time (ignoring padding)

The Swift package ships a MelFeatureExtractor (Accelerate / vDSP) that reproduces the same numerics.
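The last two steps in the list (the log guard and per-mel-bin normalization) can be sketched in NumPy as follows. This is an illustration, not the `ParakeetFeatureExtractor` code: the epsilon added to the standard deviation, the N-1 denominator, and zeroing the padded frames afterwards are all assumptions.

```python
import numpy as np

LOG_GUARD = 2.0 ** -24  # keeps log() finite for silent bins


def log_and_normalize(mel, n_valid, eps=1e-5):
    """Apply log(mel + 2^-24), then normalise each mel bin over the valid frames.

    mel:     [T, n_mels] non-negative mel energies, possibly padded at the end.
    n_valid: number of real (non-padded) frames, at least 2.
    """
    log_mel = np.log(mel + LOG_GUARD)
    valid = log_mel[:n_valid]
    mean = valid.mean(axis=0)                 # per-bin mean over time
    std = valid.std(axis=0, ddof=1) + eps     # assumption: N-1 denominator plus epsilon
    normalized = (log_mel - mean) / std
    normalized[n_valid:] = 0.0                # assumption: padded frames are zeroed
    return normalized.astype(np.float32)


rng = np.random.default_rng(0)
mel = rng.random((100, 128))
mel[80:] = 0.0                                # last 20 frames are padding
features = log_and_normalize(mel, n_valid=80)
print(features.shape, features[:80].mean(), features[:80].std(ddof=1))
```

Computing the statistics over only the valid frames is what "ignoring padding" refers to: including padded frames would pull every bin's mean toward the log guard value.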

Deployment target

Built for macOS 15 / iOS 18 minimum. The encoder uses per_grouped_channel palettization, which requires those OS versions.

License

Credits

Citation

@misc{parakeet-tdt-0.6b-v3-coreml,
  title  = {Parakeet TDT 0.6B v3 Core ML (4-bit encoder palettization)},
  author = {Weinbach, Max},
  year   = {2026},
  url    = {https://huggingface.co/mweinbach1/parakeet-tdt-0.6b-v3-coreml}
}

Upstream:

@software{parakeet-tdt-0.6b-v3,
  title  = {Parakeet-TDT-0.6B-v3},
  author = {NVIDIA Corporation},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3}
}