Parakeet CTC 0.6B Vietnamese — CoreML

CoreML conversion of nvidia/parakeet-ctc-0.6b-Vietnamese for native Apple Silicon inference (Neural Engine + GPU + CPU).

Used by Degam — a macOS Vietnamese voice-input app that pastes polished Vietnamese into any application.

Files

File                        Size     Purpose
FullPipeline.mlpackage.zip  1.10 GB  Single CoreML model (audio → CTC logits). Unzip to use.
vocab.json                  ~30 KB   SentencePiece vocabulary (1024 tokens, JSON array)
metadata.json               <1 KB    Model config (sample rate, max audio length, vocab size, etc.)
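
Since vocab.json is a plain JSON array of token strings, it loads directly with JSONDecoder. A minimal sketch — the loadVocab helper and the inline sample are illustrative, not part of the repo:

```swift
import Foundation

// vocab.json is a JSON array of 1024 SentencePiece token strings.
func loadVocab(from url: URL) throws -> [String] {
    try JSONDecoder().decode([String].self, from: Data(contentsOf: url))
}

// Inline sample for illustration (the real file has 1024 entries):
let sample = #"["▁xin", "▁chào", "s"]"#.data(using: .utf8)!
let vocab = try JSONDecoder().decode([String].self, from: sample)
// Note: the CTC blank token is not stored in vocab.json;
// its id is vocab.count (1024 for the real file).
```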

Quick Start (Swift)

import CoreML

// 1. Compile and load
let url = URL(fileURLWithPath: "FullPipeline.mlpackage")
let compiledURL = try await MLModel.compileModel(at: url)
let config = MLModelConfiguration()
config.computeUnits = .all  // CPU + GPU + Neural Engine
let model = try await MLModel.load(contentsOf: compiledURL, configuration: config)

// 2. Pad audio to 240,000 samples (15 s @ 16 kHz)
//    `samples` is your input audio: [Float], 16 kHz mono
let FIXED = 240_000
var padded = [Float](repeating: 0, count: FIXED)
let copyLen = min(samples.count, FIXED)
samples.withUnsafeBufferPointer { src in
    padded.withUnsafeMutableBufferPointer { dst in
        dst.baseAddress!.update(from: src.baseAddress!, count: copyLen)
    }
}

// 3. Build inputs
let audio = try MLMultiArray(shape: [1, FIXED as NSNumber], dataType: .float32)
padded.withUnsafeBufferPointer { src in
    memcpy(audio.dataPointer, src.baseAddress!, FIXED * MemoryLayout<Float>.size)
}
let length = try MLMultiArray(shape: [1], dataType: .int32)
length.dataPointer.assumingMemoryBound(to: Int32.self).pointee = Int32(copyLen)

// 4. Run
let input = try MLDictionaryFeatureProvider(dictionary: [
    "audio_signal": MLFeatureValue(multiArray: audio),
    "audio_length": MLFeatureValue(multiArray: length),
])
let out = try await model.prediction(from: input)
let logits = out.featureValue(for: "ctc_logits")!.multiArrayValue!
let encLen = Int(out.featureValue(for: "encoder_length")!.multiArrayValue!
    .dataPointer.assumingMemoryBound(to: Int32.self).pointee)

// 5. CTC greedy decode using stride-aware indexing
// (CoreML pads rows for SIMD alignment — use logits.strides, not naive t*V)
let strides = logits.strides.map { $0.intValue }
let strideT = strides[1], strideV = strides[2]
let ptr = logits.dataPointer.assumingMemoryBound(to: Float32.self)

let V = logits.shape[2].intValue
let T = encLen
var result = "", prev = -1, blankId = vocab.count
for t in 0..<T {
    let base = t * strideT
    var best = 0, bestVal = ptr[base]
    for v in 1..<V {
        let val = ptr[base + v * strideV]
        if val > bestVal { bestVal = val; best = v }
    }
    if best != blankId && best != prev && best < vocab.count {
        result += vocab[best]
    }
    prev = best
}
let text = result.replacingOccurrences(of: "▁", with: " ").trimmingCharacters(in: .whitespacesAndNewlines)

Model I/O

FullPipeline.mlpackage

Inputs:
  audio_signal   : [1, 240000]      float32  (16kHz mono, pad/truncate to 15s)
  audio_length   : [1]              int32    (actual samples before padding)

Outputs:
  ctc_logits     : [1, 188, 1025]   float32  (raw logits — apply argmax per frame)
  encoder_length : [1]              int32    (valid frames; rest is padding)

Important: The CoreML output ctc_logits is stride-padded for SIMD alignment. For shape [1, 188, 1025] the actual strides are typically [195520, 1040, 1], so naive t * V indexing reads garbage from frame 1 onward. Always index via logits.strides or copy out with MLShapedArray<Float32>(logits).
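
MLShapedArray<Float32>(logits).scalars copies the logits into a contiguous row-major [T × V] buffer, after which plain t * V indexing is safe. A hedged sketch of a greedy decoder over such a buffer — the ctcGreedyDecode helper is illustrative, not shipped with the model:

```swift
import Foundation

// Greedy CTC decode over a contiguous [T x V] logits buffer,
// e.g. the `scalars` of MLShapedArray<Float32>(logits).
func ctcGreedyDecode(_ logits: [Float], frames: Int, vocab: [String]) -> String {
    let V = vocab.count + 1          // +1 for the trailing CTC blank token
    let blankId = vocab.count
    var pieces = "", prev = -1
    for t in 0..<frames {
        let row = t * V
        var best = 0
        for v in 1..<V where logits[row + v] > logits[row + best] { best = v }
        // Emit on non-blank, skipping CTC repeats of the previous frame.
        if best != blankId && best != prev { pieces += vocab[best] }
        prev = best
    }
    return pieces
        .replacingOccurrences(of: "▁", with: " ")
        .trimmingCharacters(in: .whitespaces)
}
```

With the real model this would be called as ctcGreedyDecode(MLShapedArray<Float32>(logits).scalars, frames: encLen, vocab: vocab); the copied scalars cover the full padded [1, 188, 1025] tensor, so frames must not exceed 188.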

Performance (Apple Silicon)

Tested on VIVOS Vietnamese test set (760 utterances, 45 min audio):

Method                       WER    Avg Latency  RTF      Throughput
NeMo PyTorch (batch=8, CPU)  18.5%  390 ms       0.110x   9x realtime
CoreML (this)                19.4%  34 ms        0.0095x  105x realtime
  • 11.5x faster than the PyTorch baseline
  • Essentially identical transcription quality (0.9% WER gap from punctuation)
  • 105x real-time throughput — processes 45 min of audio in 25 seconds
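
The throughput figures above are internally consistent; a quick sanity check of the real-time-factor arithmetic, using the numbers from the table:

```swift
// RTF = processing time / audio duration; throughput = 1 / RTF.
let audioSeconds = 45.0 * 60.0   // 45 min test set
let rtf = 0.0095                 // CoreML real-time factor from the table
let processingSeconds = audioSeconds * rtf
let throughput = 1.0 / rtf
// → processing ≈ 25.7 s, throughput ≈ 105x, matching the table.
```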

Conversion

Converted from nvidia/parakeet-ctc-0.6b-Vietnamese (.nemo) using a Mobius-style export pipeline, adapted for the Vietnamese pure-CTC model (EncDecCTCModelBPE).

Key conversion details:

  • Vietnamese is pure CTC (EncDecCTCModelBPE), not hybrid TDT+CTC
  • Decoder accessed via model.decoder, not model.ctc_decoder
  • CTCDecoderWrapper bypasses log_softmax for raw logits (CoreML-friendly)
  • Fused into single FullPipeline (preprocessor + encoder + CTC decoder)
  • Fixed 15-second audio window (240,000 samples @ 16kHz)
  • Float32 precision (matches Mobius default)

See https://github.com/ancs21/degam for the conversion script.

License

NVIDIA Open Model License — commercial use allowed with attribution.

Required notice in your distribution:

Licensed by NVIDIA Corporation under the NVIDIA Open Model License.

Citation

@misc{parakeet-ctc-vi-coreml,
  title  = {Parakeet CTC 0.6B Vietnamese (CoreML)},
  author = {ancs21},
  year   = {2026},
  url    = {https://huggingface.co/ancs21/parakeet-ctc-0.6b-vi-coreml},
  note   = {CoreML conversion of nvidia/parakeet-ctc-0.6b-Vietnamese for Apple Silicon}
}

@misc{parakeet-ctc-vi-original,
  title  = {Parakeet-CTC 0.6B Vietnamese},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/parakeet-ctc-0.6b-Vietnamese}
}