# Parakeet CTC 0.6B Vietnamese → CoreML

CoreML conversion of nvidia/parakeet-ctc-0.6b-Vietnamese for native Apple Silicon inference (Neural Engine + GPU + CPU).

Used by Degam, a macOS Vietnamese voice input app that pastes polished Vietnamese into any application.
## Files

| File | Size | Purpose |
|---|---|---|
| `FullPipeline.mlpackage.zip` | 1.10 GB | Single CoreML model (audio → CTC logits). Unzip to use. |
| `vocab.json` | ~30 KB | SentencePiece vocabulary (1024 tokens, JSON array) |
| `metadata.json` | <1 KB | Model config (sample rate, max audio, vocab size, etc.) |
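Since `vocab.json` is a plain JSON array of token strings, loading it in Swift needs only Foundation. A minimal sketch (the `loadVocab` name and the path are illustrative, not part of the release):

```swift
import Foundation

// Decode vocab.json (a JSON array of 1024 token strings) into [String].
func loadVocab(at path: String) throws -> [String] {
    let data = try Data(contentsOf: URL(fileURLWithPath: path))
    return try JSONDecoder().decode([String].self, from: data)
}

// Usage: let vocab = try loadVocab(at: "vocab.json")
// The CTC blank id is vocab.count (1024), one past the last real token.
```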
## Quick Start (Swift)

```swift
import CoreML

// Assumes `samples: [Float]` holds 16 kHz mono audio and
// `vocab: [String]` holds the 1024 tokens from vocab.json.

// 1. Compile and load
let url = URL(fileURLWithPath: "FullPipeline.mlpackage")
let compiledURL = try await MLModel.compileModel(at: url)
let config = MLModelConfiguration()
config.computeUnits = .all // CPU + GPU + Neural Engine
let model = try await MLModel.load(contentsOf: compiledURL, configuration: config)

// 2. Pad audio to 240,000 samples (15 s @ 16 kHz)
let FIXED = 240_000
var padded = [Float](repeating: 0, count: FIXED)
let copyLen = min(samples.count, FIXED)
samples.withUnsafeBufferPointer { src in
    padded.withUnsafeMutableBufferPointer { dst in
        dst.baseAddress!.update(from: src.baseAddress!, count: copyLen)
    }
}

// 3. Build inputs
let audio = try MLMultiArray(shape: [1, FIXED as NSNumber], dataType: .float32)
padded.withUnsafeBufferPointer { src in
    memcpy(audio.dataPointer, src.baseAddress!, FIXED * MemoryLayout<Float>.size)
}
let length = try MLMultiArray(shape: [1], dataType: .int32)
length.dataPointer.assumingMemoryBound(to: Int32.self).pointee = Int32(copyLen)

// 4. Run
let input = try MLDictionaryFeatureProvider(dictionary: [
    "audio_signal": MLFeatureValue(multiArray: audio),
    "audio_length": MLFeatureValue(multiArray: length),
])
let out = try await model.prediction(from: input)
let logits = out.featureValue(for: "ctc_logits")!.multiArrayValue!
let encLen = Int(out.featureValue(for: "encoder_length")!.multiArrayValue!
    .dataPointer.assumingMemoryBound(to: Int32.self).pointee)

// 5. CTC greedy decode using stride-aware indexing
// (CoreML pads rows for SIMD alignment; use logits.strides, not naive t*V)
let strides = logits.strides.map { $0.intValue }
let strideT = strides[1], strideV = strides[2]
let ptr = logits.dataPointer.assumingMemoryBound(to: Float32.self)
let V = logits.shape[2].intValue
let T = encLen
var result = "", prev = -1
let blankId = vocab.count
for t in 0..<T {
    let base = t * strideT
    var best = 0, bestVal = ptr[base]
    for v in 1..<V {
        let val = ptr[base + v * strideV]
        if val > bestVal { bestVal = val; best = v }
    }
    if best != blankId && best != prev && best < vocab.count {
        result += vocab[best]
    }
    prev = best
}
// SentencePiece marks word boundaries with "▁"
let text = result.replacingOccurrences(of: "▁", with: " ")
    .trimmingCharacters(in: .whitespacesAndNewlines)
```
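Because the model accepts a fixed 15 s window, longer recordings must be split. A minimal sketch of sequential chunking, assuming a `transcribe` closure that wraps steps 2–5 above (naive boundaries can cut through words; production code would overlap chunks and merge the overlap):

```swift
// Split audio into fixed 15 s windows and transcribe each in turn.
// `transcribe` stands in for steps 2-5 of the Quick Start.
func transcribeLong(_ samples: [Float],
                    transcribe: ([Float]) -> String) -> String {
    let window = 240_000 // 15 s @ 16 kHz
    var parts: [String] = []
    var start = 0
    while start < samples.count {
        let end = min(start + window, samples.count)
        parts.append(transcribe(Array(samples[start..<end])))
        start = end
    }
    return parts.joined(separator: " ")
}
```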
## Model I/O

### FullPipeline.mlpackage

```
Inputs:
  audio_signal   : [1, 240000]     float32  (16 kHz mono, pad/truncate to 15 s)
  audio_length   : [1]             int32    (actual samples before padding)

Outputs:
  ctc_logits     : [1, 188, 1025]  float32  (raw logits; apply argmax per frame)
  encoder_length : [1]             int32    (valid frames; the rest is padding)
```

**Important:** the CoreML output `ctc_logits` is stride-padded for SIMD alignment. For shape `[1, 188, 1025]` the actual strides are typically `[195520, 1040, 1]`, so naive `t * V` indexing reads garbage from frame 1 onward. Always use `logits.strides` or `MLShapedArray<Float32>(logits)`.
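To make the stride warning concrete, here is the offset arithmetic for the typical strides above; the two schemes agree only at frame 0:

```swift
// Flat-buffer offset of (t, v) per the reported strides vs. a naive
// row-major assumption for shape [1, 188, 1025]. The frame stride is
// 1040 (1025 padded up for alignment), and 195520 = 188 * 1040.
let strideT = 1040, strideV = 1, V = 1025

func stridedOffset(t: Int, v: Int) -> Int { t * strideT + v * strideV }
func naiveOffset(t: Int, v: Int) -> Int { t * V + v }

// Frame 0 agrees by coincidence; frame 1 onward diverges:
stridedOffset(t: 0, v: 7)  // 7
naiveOffset(t: 0, v: 7)    // 7
stridedOffset(t: 1, v: 0)  // 1040
naiveOffset(t: 1, v: 0)    // 1025, already 15 floats inside the padding
```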
## Performance (Apple Silicon)

Tested on the VIVOS Vietnamese test set (760 utterances, 45 min of audio):

| Method | WER | Avg Latency | RTF | Throughput |
|---|---|---|---|---|
| NeMo PyTorch (batch=8, CPU) | 18.5% | 390 ms | 0.110x | 9x realtime |
| CoreML (this) | 19.4% | 34 ms | 0.0095x | 105x realtime |

- 11.5x faster than the PyTorch baseline
- Essentially identical transcription quality (the 0.9% WER gap comes from punctuation)
- 105x real-time throughput: processes 45 min of audio in about 25 seconds
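The throughput figures follow directly from the real-time factor; a quick sanity check of the arithmetic:

```swift
// RTF = processing time / audio duration; throughput is its inverse.
let totalAudioSec = 45.0 * 60.0          // 45 min of VIVOS test audio
let rtf = 0.0095                         // CoreML real-time factor from the table
let processingSec = totalAudioSec * rtf  // about 25.7 s
let throughput = 1.0 / rtf               // about 105x realtime
```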
## Conversion

Converted from nvidia/parakeet-ctc-0.6b-Vietnamese (`.nemo`) using a Mobius-style export pipeline, adapted for the Vietnamese pure-CTC model (`EncDecCTCModelBPE`).

Key conversion details:

- Vietnamese is pure CTC (`EncDecCTCModelBPE`), not hybrid TDT+CTC
- The decoder is accessed via `model.decoder`, not `model.ctc_decoder`
- A `CTCDecoderWrapper` bypasses `log_softmax` to emit raw logits (CoreML-friendly)
- Preprocessor, encoder, and CTC decoder are fused into a single `FullPipeline`
- Fixed 15-second audio window (240,000 samples @ 16 kHz)
- Float32 precision (matches the Mobius default)

See https://github.com/ancs21/degam for the conversion script.
## License

NVIDIA Open Model License: commercial use is allowed with attribution.

Required notice in your distribution:

> Licensed by NVIDIA Corporation under the NVIDIA Open Model License.
## Citation

```bibtex
@misc{parakeet-ctc-vi-coreml,
  title  = {Parakeet CTC 0.6B Vietnamese (CoreML)},
  author = {ancs21},
  year   = {2026},
  url    = {https://huggingface.co/ancs21/parakeet-ctc-0.6b-vi-coreml},
  note   = {CoreML conversion of nvidia/parakeet-ctc-0.6b-Vietnamese for Apple Silicon}
}

@misc{parakeet-ctc-vi-original,
  title  = {Parakeet-CTC 0.6B Vietnamese},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-ctc-0.6b-Vietnamese}
}
```