Omnilingual ASR CTC-300M (CoreML INT8)

CoreML (.mlpackage) export of Meta's Omnilingual ASR CTC-300M model with 8-bit weight palettization (k-means). Target deployment: iOS 17+ / macOS 14+ with Apple Neural Engine via the CPU+NE compute unit.

Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time (no language hint required).

Model

| Property | Value |
|---|---|
| Parameters | 326 M |
| Format | CoreML `.mlpackage` (MLProgram) |
| Precision | FP16 compute, INT8 palettized weights (k-means) |
| Min deployment target | iOS 17 / macOS 14 |
| Compute units | CPU + Neural Engine |
| Input | `audio`: float32 `[1, N_samples]` (raw 16 kHz waveform, z-score normalized) |
| Output | `logits`: float32 `[1, T, 10288]`, where `T = N_samples / 320` |
| Max duration | 5 s (configurable at export time) |
| Languages | 1,600+ |
| Vocabulary | 10,288 SentencePiece tokens |

The input length is fixed at export time. For longer inputs, chunk the waveform or re-export with a larger --max-duration. A 10-second variant is provided in a separate repository.
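A minimal chunking sketch in plain Python (the helper name `chunk_and_normalize` is hypothetical, and normalizing each chunk independently is an assumption; the model card only specifies that the input waveform is z-score normalized):

```python
import math
from statistics import fmean

def chunk_and_normalize(waveform, sample_rate=16_000, max_seconds=5.0):
    """Split a mono waveform (a sequence of floats) into fixed-length,
    z-score-normalized chunks, zero-padding the final chunk so every
    chunk matches the model's fixed input length."""
    chunk_len = int(max_seconds * sample_rate)  # 80_000 samples for 5 s @ 16 kHz
    chunks = []
    for start in range(0, len(waveform), chunk_len):
        chunk = list(waveform[start:start + chunk_len])
        mean = fmean(chunk)
        std = math.sqrt(fmean([(x - mean) ** 2 for x in chunk])) or 1e-7
        chunk = [(x - mean) / std for x in chunk]   # zero mean, unit variance
        chunk += [0.0] * (chunk_len - len(chunk))   # pad the last chunk
        chunks.append(chunk)
    return chunks
```

Each 80,000-sample chunk maps to 250 output frames (80,000 / 320). Naive chunking can split words at chunk boundaries; overlapping windows with merged transcripts mitigate this at some extra compute cost.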

Files

| File | Size | Description |
|---|---|---|
| `omnilingual-ctc-300m-int8.mlpackage/` | ~312 MB | MLProgram with INT8-palettized weights |
| `tokenizer.model` | 1.2 MB | SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0) |
| `config.json` | <1 KB | Architecture + deployment metadata |

Inference

```swift
import CoreML

// Prefer the Neural Engine; unsupported ops fall back to CPU.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Note: .mlpackage must be compiled (at build time in Xcode, or via
// MLModel.compileModel(at:)) before loading on device.
let url = URL(fileURLWithPath: ".../omnilingual-ctc-300m-int8.mlpackage")
let model = try MLModel(contentsOf: url, configuration: config)

let audio = try MLMultiArray(shape: [1, 80_000], dataType: .float32)  // 5 s @ 16 kHz
// Fill `audio` with the zero-mean, unit-variance waveform...
let input = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: input)
let logits = out.featureValue(for: "logits")!.multiArrayValue!  // [1, 250, 10288]
// Argmax over the last axis, collapse consecutive duplicates, drop blanks, detokenize.
```

Full Swift inference, CTC decoding, and multi-language routing are implemented in speech-swift under Sources/OmnilingualASR/.
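The decode step described above is plain greedy CTC decoding, sketched here in Python (the blank index is an assumption; check `config.json` for the actual value):

```python
def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC decode: argmax per frame, collapse consecutive repeats,
    drop blanks. `logits` is a [T][V] nested list of per-frame scores;
    returns the token-id sequence to feed the SentencePiece detokenizer."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    tokens, prev = [], None
    for t in best:
        if t != prev and t != blank_id:
            tokens.append(t)
        prev = t
    return tokens
```

Greedy decoding is the cheapest option; a beam search with a language model can trade latency for lower WER, but the CTC head makes the greedy path entirely local and streaming-friendly.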

Architecture

```
Raw audio [1, samples]
  → Wav2Vec2FeatureExtractor (7-layer 1D conv, 320× total stride)
  → Linear 512 → 1024
  → Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  → 24 × StandardTransformerEncoderLayer (pre-norm, dim 1024, heads 16, ffn 4096)
  → LayerNorm
  → Linear 1024 → 10288  (CTC head)
  → logits
```

Export pipeline: torch.jit.trace with a fixed-length sample input (fairseq2 BatchLayout is constructed inside a wrapper so the tracer only sees plain tensors), followed by coremltools.convert at FP16 compute precision and OpPalettizerConfig(mode="kmeans", nbits=8) weight palettization.
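A sketch of that pipeline, assuming `model` is the loaded fairseq2 CTC model and `TraceableCTC` is a hypothetical wrapper (the actual export script may differ in details; this is not runnable without the model weights):

```python
import torch
import coremltools as ct
import coremltools.optimize.coreml as cto

# Wrapper: fairseq2's BatchLayout is constructed inside forward(),
# so torch.jit.trace only ever sees a plain tensor.
class TraceableCTC(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, audio):      # audio: [1, 80000]
        return self.model(audio)   # logits: [1, 250, 10288]

traced = torch.jit.trace(TraceableCTC(model).eval(), torch.zeros(1, 80_000))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="audio", shape=(1, 80_000))],
    outputs=[ct.TensorType(name="logits")],
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.iOS17,
)

# 8-bit k-means weight palettization, applied post-conversion.
opt_config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=8)
)
mlmodel = cto.palettize_weights(mlmodel, opt_config)
mlmodel.save("omnilingual-ctc-300m-int8.mlpackage")
```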

Performance

FLEURS test set, CTC-300M fp32 on CPU (Apple M-series), 30 utterances/language:

| Language | WER | Total audio | RTF (CPU fp32 reference) |
|---|---|---|---|
| English (en_us) | 20.0% | 289 s | 0.056 |
| French (fr_fr) | 23.2% | 334 s | 0.059 |
| German (de_de) | 16.5% | 361 s | 0.058 |
| Arabic (ar_eg) | 19.5% | 331 s | 0.051 |
| Hindi (hi_in) | 22.5% | 364 s | 0.050 |

Expect ANE inference to reach RTF < 0.03 after palettization (wav2vec2 attention and ffn map cleanly onto the Neural Engine; only the 1D conv frontend falls back to CPU/GPU). INT8 palettization typically adds < 1% absolute WER on wav2vec2-class models.
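RTF here is the standard real-time factor, processing time divided by audio duration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# From the table above: decoding 289 s of English audio at RTF 0.056
# takes roughly 0.056 * 289 ≈ 16 s of CPU time.
```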

License

Apache 2.0 (inherited from upstream).

