Omnilingual ASR CTC-300M (CoreML INT8)

CoreML (.mlpackage) export of Meta's Omnilingual ASR CTC-300M model with 8-bit weight palettization (k-means). Target deployment: iOS 17+ / macOS 14+ with Apple Neural Engine via the CPU+NE compute unit.

Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time (no language hint required).

Model

| Property | Value |
|---|---|
| Parameters | 326 M |
| Format | CoreML `.mlpackage` (MLProgram) |
| Precision | FP16 compute, INT8 palettized weights (k-means) |
| Min deployment target | iOS 17 / macOS 14 |
| Compute units | CPU + Neural Engine |
| Input | `audio`: float32 `[1, N_samples]` (raw 16 kHz waveform, z-score normalized) |
| Output | `logits`: float32 `[1, T, 10288]`, where `T = N_samples / 320` |
| Max duration | 5 s (configurable at export time) |
| Languages | 1,600+ |
| Vocabulary | 10,288 SentencePiece tokens |

The input length is fixed at export time. For longer inputs, chunk the waveform or re-export with a larger --max-duration. A 10-second variant is provided in a separate repository.
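A minimal chunking sketch in plain Python (the helper name `chunk_and_normalize` is hypothetical, and normalizing each chunk independently is an assumption; the model card only specifies that the input waveform is z-score normalized):

```python
import math
from statistics import fmean

def chunk_and_normalize(waveform, sample_rate=16_000, max_seconds=5.0):
    """Split a mono waveform (a sequence of floats) into fixed-length,
    z-score-normalized chunks, zero-padding the final chunk so every
    chunk matches the model's fixed input length."""
    chunk_len = int(max_seconds * sample_rate)  # 80_000 samples for 5 s @ 16 kHz
    chunks = []
    for start in range(0, len(waveform), chunk_len):
        chunk = list(waveform[start:start + chunk_len])
        mean = fmean(chunk)
        std = math.sqrt(fmean([(x - mean) ** 2 for x in chunk])) or 1e-7
        chunk = [(x - mean) / std for x in chunk]   # zero mean, unit variance
        chunk += [0.0] * (chunk_len - len(chunk))   # pad the last chunk
        chunks.append(chunk)
    return chunks
```

Each 80,000-sample chunk maps to 250 output frames (80,000 / 320). Naive chunking can split words at chunk boundaries; overlapping windows with merged transcripts mitigate this at some extra compute cost.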

Files

| File | Size | Description |
|---|---|---|
| `omnilingual-ctc-300m-int8.mlpackage/` | ~312 MB | MLProgram with INT8-palettized weights |
| `tokenizer.model` | 1.2 MB | SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0) |
| `config.json` | <1 KB | Architecture + deployment metadata |

Inference

```swift
import CoreML

// Prefer the Neural Engine; unsupported ops fall back to CPU.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

// Note: .mlpackage must be compiled (at build time in Xcode, or via
// MLModel.compileModel(at:)) before loading on device.
let url = URL(fileURLWithPath: ".../omnilingual-ctc-300m-int8.mlpackage")
let model = try MLModel(contentsOf: url, configuration: config)

let audio = try MLMultiArray(shape: [1, 80_000], dataType: .float32)  // 5 s @ 16 kHz
// Fill `audio` with the zero-mean, unit-variance waveform...
let input = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: input)
let logits = out.featureValue(for: "logits")!.multiArrayValue!  // [1, 250, 10288]
// Argmax over the last axis, collapse consecutive duplicates, drop blanks, detokenize.
```

Full Swift inference, CTC decoding, and multi-language routing are implemented in speech-swift under Sources/OmnilingualASR/.
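The decode step described above is plain greedy CTC decoding, sketched here in Python (the blank index is an assumption; check `config.json` for the actual value):

```python
def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC decode: argmax per frame, collapse consecutive repeats,
    drop blanks. `logits` is a [T][V] nested list of per-frame scores;
    returns the token-id sequence to feed the SentencePiece detokenizer."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    tokens, prev = [], None
    for t in best:
        if t != prev and t != blank_id:
            tokens.append(t)
        prev = t
    return tokens
```

Greedy decoding is the cheapest option; a beam search with a language model can trade latency for lower WER, but the CTC head makes the greedy path entirely local and streaming-friendly.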

Architecture

```
Raw audio [1, samples]
  → Wav2Vec2FeatureExtractor (7-layer 1D conv, 320× total stride)
  → Linear 512 → 1024
  → Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  → 24 × StandardTransformerEncoderLayer (pre-norm, dim 1024, heads 16, ffn 4096)
  → LayerNorm
  → Linear 1024 → 10288  (CTC head)
  → logits
```

Export pipeline: torch.jit.trace with a fixed-length sample input (fairseq2 BatchLayout is constructed inside a wrapper so the tracer only sees plain tensors), followed by coremltools.convert at FP16 compute precision and OpPalettizerConfig(mode="kmeans", nbits=8) weight palettization.
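A sketch of that pipeline, assuming `model` is the loaded fairseq2 CTC model and `TraceableCTC` is a hypothetical wrapper (the actual export script may differ in details; this is not runnable without the model weights):

```python
import torch
import coremltools as ct
import coremltools.optimize.coreml as cto

# Wrapper: fairseq2's BatchLayout is constructed inside forward(),
# so torch.jit.trace only ever sees a plain tensor.
class TraceableCTC(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, audio):      # audio: [1, 80000]
        return self.model(audio)   # logits: [1, 250, 10288]

traced = torch.jit.trace(TraceableCTC(model).eval(), torch.zeros(1, 80_000))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="audio", shape=(1, 80_000))],
    outputs=[ct.TensorType(name="logits")],
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.iOS17,
)

# 8-bit k-means weight palettization, applied post-conversion.
opt_config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=8)
)
mlmodel = cto.palettize_weights(mlmodel, opt_config)
mlmodel.save("omnilingual-ctc-300m-int8.mlpackage")
```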

Performance

FLEURS test set, CTC-300M fp32 on CPU (Apple M-series), 30 utterances/language:

| Language | WER | Total audio | RTF (CPU fp32 reference) |
|---|---|---|---|
| English (en_us) | 20.0% | 289 s | 0.056 |
| French (fr_fr) | 23.2% | 334 s | 0.059 |
| German (de_de) | 16.5% | 361 s | 0.058 |
| Arabic (ar_eg) | 19.5% | 331 s | 0.051 |
| Hindi (hi_in) | 22.5% | 364 s | 0.050 |

Expect ANE inference to reach RTF < 0.03 after palettization (wav2vec2 attention and ffn map cleanly onto the Neural Engine; only the 1D conv frontend falls back to CPU/GPU). INT8 palettization typically adds < 1% absolute WER on wav2vec2-class models.
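RTF here is the standard real-time factor, processing time divided by audio duration:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# From the table above: decoding 289 s of English audio at RTF 0.056
# takes roughly 0.056 * 289 ≈ 16 s of CPU time.
```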

License

Apache 2.0 (inherited from upstream).

