# Omnilingual ASR – CTC 300M (CoreML INT8)
CoreML (.mlpackage) export of Meta's Omnilingual ASR CTC-300M model with
8-bit weight palettization (k-means). Target deployment: iOS 17+ / macOS 14+
with Apple Neural Engine via the CPU+NE compute unit.
Omnilingual ASR is a wav2vec 2.0-style encoder-only model with a linear CTC head, trained by Meta for speech recognition across 1,600+ languages. The CTC variant is language-agnostic at inference time (no language hint required).
## Model

| Property | Value |
|---|---|
| Parameters | 326 M |
| Format | CoreML `.mlpackage` (MLProgram) |
| Precision | FP16 compute, INT8 palettized weights (k-means) |
| Min deployment target | iOS 17 / macOS 14 |
| Compute units | CPU + Neural Engine |
| Input | `audio` float32 `[1, N_samples]` (raw 16 kHz waveform, z-score normalized) |
| Output | `logits` float32 `[1, T, 10288]` where T = N_samples / 320 |
| Max duration | 5 s (configurable at export time) |
| Languages | 1,600+ |
| Vocabulary | 10,288 SentencePiece tokens |
The input length is fixed at export time. For longer inputs, chunk the
waveform or re-export with a larger `--max-duration`. A 10-second variant is
provided in a separate repository.
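Preprocessing for the fixed-length input amounts to z-score normalization plus fixed-size windowing with zero padding. A minimal Python sketch of that preprocessing (the 80,000-sample window comes from the table above; whether to normalize per chunk or over the whole utterance is a design choice, shown here per utterance):

```python
import math

WINDOW = 80_000  # 5 s at 16 kHz, fixed at export time

def normalize(wave):
    """Z-score normalize a waveform (zero mean, unit variance)."""
    n = len(wave)
    mean = sum(wave) / n
    var = sum((x - mean) ** 2 for x in wave) / n
    std = math.sqrt(var) or 1.0  # guard against silent (constant) input
    return [(x - mean) / std for x in wave]

def chunk(wave, window=WINDOW):
    """Split into fixed-size windows, zero-padding the final one."""
    chunks = []
    for start in range(0, len(wave), window):
        piece = wave[start:start + window]
        piece = piece + [0.0] * (window - len(piece))
        chunks.append(piece)
    return chunks
```

Each chunk is then fed to the model independently; naive chunking can split words at the 5 s boundaries, which is one reason to prefer the 10-second export for long-form audio.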
## Files

| File | Size | Description |
|---|---|---|
| `omnilingual-ctc-300m-int8.mlpackage/` | ~312 MB | MLProgram with INT8-palettized weights |
| `tokenizer.model` | 1.2 MB | SentencePiece tokenizer (unk=3, pad=1, eos=2, bos=0) |
| `config.json` | <1 KB | Architecture + deployment metadata |
## Inference
```swift
import CoreML

// A .mlpackage must be compiled to .mlmodelc before it can be loaded on-device.
let compiled = try MLModel.compileModel(at: URL(fileURLWithPath: ".../omnilingual-ctc-300m-int8.mlpackage"))
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: compiled, configuration: config)

let audio = try MLMultiArray(shape: [1, 80000], dataType: .float32) // 5 s @ 16 kHz
// fill audio from a zero-mean, unit-variance waveform ...
let input = try MLDictionaryFeatureProvider(dictionary: ["audio": audio])
let out = try model.prediction(from: input)
let logits = out.featureValue(for: "logits")!.multiArrayValue! // [1, 250, 10288]
// argmax over the last axis, collapse consecutive duplicates, drop blanks, detokenize.
```
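The decode step in the final comment is greedy CTC decoding. A Python sketch of the algorithm, assuming blank is token id 0 (the actual blank id should be read from `config.json`, not assumed):

```python
def ctc_greedy_decode(logits, blank_id=0):
    """Greedy CTC: argmax each frame, collapse repeats, drop blanks.

    logits: per-frame score lists, shape [T][vocab].
    Returns decoded token ids, ready for SentencePiece detokenization.
    """
    # argmax over the vocabulary axis for each frame
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for tok in best:
        # emit only on a change of symbol, and never emit blank
        if tok != prev and tok != blank_id:
            out.append(tok)
        prev = tok
    return out
```

The resulting ids are passed to the SentencePiece tokenizer's decode to produce text.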
Full Swift inference, CTC decoding, and multi-language routing are implemented in
speech-swift under `Sources/OmnilingualASR/`.
## Architecture

```
Raw audio [1, samples]
  ↓ Wav2Vec2FeatureExtractor (7-layer 1D conv, stride 320×)
  ↓ Linear 512 → 1024
  ↓ Wav2Vec2PositionEncoder (weight-normalized conv, kernel 128, groups 16)
  ↓ 24 × StandardTransformerEncoderLayer (pre-norm, dim 1024, heads 16, ffn 4096)
  ↓ LayerNorm
  ↓ Linear 1024 → 10288 (CTC head)
  ↓ logits
```
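The 320× total stride is the product of the per-layer conv strides. A quick check, assuming the standard wav2vec 2.0 frontend strides (5, 2, 2, 2, 2, 2, 2) — an assumption about the upstream architecture, not read from this export:

```python
from math import prod

# standard wav2vec 2.0 feature-extractor strides (assumed)
strides = (5, 2, 2, 2, 2, 2, 2)

total_stride = prod(strides)     # samples consumed per output frame
frames = 80_000 // total_stride  # nominal frame count for the 5 s input
```

Conv edge effects can shift the exact frame count by a frame or two relative to this nominal `N_samples / 320`.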
Export pipeline: `torch.jit.trace` with a fixed-length sample input (the fairseq2
`BatchLayout` is constructed inside a wrapper so the tracer only sees plain
tensors), followed by `coremltools.convert` at FP16 compute precision and
`OpPalettizerConfig(mode="kmeans", nbits=8)` weight palettization.
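The export steps above can be sketched as follows; `wrapper` stands in for the hypothetical traced module that hides fairseq2's `BatchLayout`, and this is a sketch of the described pipeline, not the exact export script:

```python
import torch
import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig, OptimizationConfig, palettize_weights,
)

# wrapper(audio) hides fairseq2's BatchLayout so the tracer sees plain tensors
example = torch.zeros(1, 80_000)
traced = torch.jit.trace(wrapper, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="audio", shape=(1, 80_000))],
    compute_precision=ct.precision.FLOAT16,
    minimum_deployment_target=ct.target.iOS17,
)

# 8-bit k-means weight palettization applied to the converted model
opt = OptimizationConfig(global_config=OpPalettizerConfig(mode="kmeans", nbits=8))
mlmodel = palettize_weights(mlmodel, opt)
mlmodel.save("omnilingual-ctc-300m-int8.mlpackage")
```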
## Performance
FLEURS test set, CTC-300M fp32 on CPU (Apple M-series), 30 utterances/language:
| Language | WER | Audio | RTF (CPU fp32 reference) |
|---|---|---|---|
| English (en_us) | 20.0% | 289 s | 0.056 |
| French (fr_fr) | 23.2% | 334 s | 0.059 |
| German (de_de) | 16.5% | 361 s | 0.058 |
| Arabic (ar_eg) | 19.5% | 331 s | 0.051 |
| Hindi (hi_in) | 22.5% | 364 s | 0.050 |
Expect ANE inference to reach RTF < 0.03 after palettization (wav2vec2 attention and ffn map cleanly onto the Neural Engine; only the 1D conv frontend falls back to CPU/GPU). INT8 palettization typically adds < 1% absolute WER on wav2vec2-class models.
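RTF (real-time factor) is processing time divided by audio duration, so values below 1.0 are faster than real time. Worked through for the English row above:

```python
# RTF = processing_time / audio_duration
audio_s = 289               # English test audio, seconds
rtf = 0.056                 # measured CPU fp32 real-time factor
processing_s = rtf * audio_s  # seconds needed to transcribe all 289 s
```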
## Source
- Upstream model: facebook/omniASR-CTC-300M
- Paper: Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
- Meta blog: Omnilingual ASR announcement
## Links

- speech-swift – Apple SDK
- soniqo.audio – website
- blog
## License

Apache 2.0 (inherited from upstream).

- Guide: soniqo.audio/guides/omnilingual
- Docs: soniqo.audio
- GitHub: soniqo/speech-swift